The challenges of soft delete

atlas9.dev

73 points by buchanae 3 hours ago


MaxGabriel - 2 hours ago

This might stem from the domain I work in (banking), but I have the opposite take. Soft delete pros to me:

* It's obvious from the schema: If there's a `deleted_at` column, I know how to query the table correctly (vs thinking rows aren't DELETEd, or knowing where to look in another table)

* One way to do things: Analytics queries, admin pages, it all can look at the same set of data, vs having separate handling for historical data.

* DELETEs are likely fairly rare by volume for many use cases

* I haven't found soft-deleted rows to be a big performance issue. Intuitively this should be true, since queries should be O log(N)

* Undoing is really easy, because all the relationships stay in place, vs data already being moved elsewhere (In practice, I haven't found much need for this kind of undo).

In most cases, I've really enjoyed going even further and making rows fully immutable, using a new row to handle updates. This makes it really easy to reference historical data.

If I was doing the logging approach described in the article, I'd use database triggers that keep a copy of every INSERT/UPDATE/DELETEd row in a duplicate table. This way it all stays in the same database—easy to query and replicate elsewhere.

tracker1 - an hour ago

I like having archive/history tables. I often do similar with job queues when persisting to a database, in this way the pending table can stay small and avoid full scans to skip the need for deleted records...

Aside, another idea that I've kicked forward for event driven databases is to just use a database like sqlite and copy/wipe the whole thing as necessary after an event or the work that's related to that database. For example, all validation/chain of custody info for ballot signatures... there's not much point in having it all online or active, or even mixed in with other ballot initiatives and the schema can change with the app as needed for new events. Just copy that file, and you have that archive. Compress the file even and just have it hard archived and backed up if needed.

3rodents - an hour ago

Soft deletes are an example of where engineers unintentionally lead product instead of product leading engineering. Soft delete isn’t language used by users so it should not be used by engineers when making product facing decisions.

“Delete” “archive” “hide” are the type of actions a user typically wants, each with their own semantics specific to the product. A flag on the row, a separate table, deleting a row, these are all implementation options that should be led by the product.

rorylaitila - an hour ago

Databases store facts. Creating a record = new fact. "Deleting" a record = new fact. But destroying rows from tables = disappeared fact. That is not great for most cases. In rare cases the volume of records may be a technical hurdle; in which case, move facts to another database. The times I've wanted to destroy large volume of facts is approximately zero.

talesmm14 - 2 hours ago

I've worked at companies where soft delete was implemented everywhere, even in irrelevant internal systems... I think it's a cultural thing! I still remember a college professor scolding me on an extension project because I hadn't implemented soft delete... in his words, "In the business world, data is never deleted!!"

clickety_clack - an hour ago

We have soft delete, with hard delete running on deletions over 45 days old. Sometimes people delete things by accident and this is the only way to practically recover that.

jamilbk - an hour ago

At Firezone we started with soft-deletes thinking it might be useful for an audit / compliance log and quickly ran into each of the problems described in this article. The real issue for us was migrations - having to maintain structure of deleted data alongside live data just didn't make sense, and undermined the point of an immutable audit trail.

We've switched to CDC using Postgres which emits into another (non-replicated) write-optimized table. The replication connection maintains a 'subject' variable to provide audit context for each INSERT/UPDATE/DELETE. So far, CDC has worked very well for us in this manner (Elixir / Postgrex).

I do think soft-deletes have their place in this world, maybe for user-facing "restore deleted" features. I don't think compliance or audit trails are the right place for them however.

maxchehab - 2 hours ago

How do you handle schema drift?

The data archive serialized the schema of the deleted object representative the schema in that point in time.

But fast-forward some schema changes, now your system has to migrate the archived objects to the current schema?

IgorPartola - 14 minutes ago

I have a love/hate relationship with soft deleted. There are cases where it’s not really a delete but rather a historical fact. For example, let’s say I have a table which stores an employee’s current hourly rate. They are hired at say $15/hour, then go to $17 six months later, then to $20/hour three months later. All of these three things are true and I want to be able to query which rate the employee had on a specific date even after their rate had changed. When I have a starts_on and an ends_on dates and the latter is nullable, with some data consistency logic I can create a linear history of compensation and can query historical and current data the same exact way. I also get

But this is such a huge PITA because you constantly have to mind if any given object has this setup or not and what if related objects have different start/end dates? And something like a scheduled raise for next year to $22/hour can get funny if I then try to insert that just for July it will be $24/hour (this would take my single record for next year and split it into two and then you gotta figure out which gets the original ID and which is the new row.

Another alternative to this is a pattern where you store the current state and separately you store mutations. So you have a compensation table and a compensation_mutations table which says how to evolve a specific row in a compensation table and when. The mutations for anything in the future can be deleted but the past ones cannot which lets you reconstruct who did what, when, and why. But this also has drawbacks. One of them is that you can’t query historical data the same way as current data. You also have to somehow apply these mutations (cron job? DB trigger?)

And of course there are database extensions that allow soft deletes but I have never tried them for vague portability reasons (as if anyone ever moved off Postgres).

iterateoften - 38 minutes ago

I used to be pretty adamant about implementing soft delete for core business objects.

However after 15 years I prefer to just back up regularly, have point in time restores and then just delete normally.

The amount of times I have “undeleted” something are few and far between.

theLiminator - 2 hours ago

Privacy regulations make soft delete unviable in many of the cases where it's useful.

LorenPechtel - an hour ago

The % of records that are deleted is a huge factor.

You keep 99%, soft delete 1%, use some sort of deleted flag. While I have not tried it whalesalad's suggestion of a view sounds excellent. You delete 99%, keep 1%, move it!

cj - 2 hours ago

We deal with soft delete in a Mongo app with hundreds of millions of records by simply moving the objects to a separate collection (table) separate from the “not deleted” data.

This works well especially in cases where you don’t want to waste CPU/memory scanning soft deleted records every time you do a lookup.

And avoids situations where app/backend logic forgets to apply the “deleted: false” filter.

ntonozzi - an hour ago

I've given up on soft delete -- the nail in the coffin for me was my customers' legal requirements that data is fully deleted, not archived. It never worked that well anyways. I never had a successful restore from a large set of soft-deleted rows.

nemothekid - 2 hours ago

The trigger architecture is actually quite interesting, especially because cleanup is relatively cheap. As far as compliance goes, it's also simply to declare that "after 45 days, deletions are permanent" as a catch all, and then you get to keep restores. For example, I think (IANAL), the CCPA gives you a 45 day buffer for right to erasure requests.

Now instead of chasing down different systems and backups, you can simply set ensure your archival process runs regularly and you should be good.

whalesalad - 2 hours ago

A good solution here (can be) to utilize a view. The underlying table has soft-delete field and the view will hide rows that have been soft deleted. Then the application doesn't need to worry about this concern all over the place.

pjs_ - an hour ago

Tried implementing this crap once. Never again

nerdponx - 2 hours ago

One thing that often gets forgotten in the discussions about whether to soft delete and how to do it is: what about analysis of your data? Even if you don't have a data science team, or even a dedicated business analyst, there's a good chance that somebody at some point will want to analyze something in the data. And there's a good chance that the analysis will either be explicitly "intertemporal" in that it looks at and compares data from various points in time, or implicitly in that the data spans a long time range and you need to know the states of various entities "as of" a particular time in history. If you didn't keep snapshots and you don't have soft edits/deletes you're kinda SoL. Don't forget the data people down the line... which might include you, trying to make a product decision or diagnose a slippery production bug.

cyberax - an hour ago

Soft deletes + GC for the win!

We have an offline-first infrastructure that replicates the state to possibly offline clients. Hard deletes were causing a lot of fun issues with conflicts, where a client could "resurrect" a deleted object. Or deletion might succeed locally but fail later because somebody added a dependent object. There are ways around that, of course, but why bother?

Soft deletes can be handled just like any regular update. Then we just periodically run a garbage collector to hard-delete objects after some time.