Postgres transactions are a distributed systems superpower

102 points by KraftyOne 6 hours ago

The part about durable workflows is technically correct, but it's focusing on different things than what I've ever run into in practice. Any mildly complex system will have side effects outside your DB, then you want idempotency. If you have no side effects, you probably don't need a durable workflow in the first place? Maybe there's a more concrete example.

I have rolled my own little durable workflows in Postgres before, in fact before I even knew durable workflows were a thing with solutions like Temporal. That's fine for many cases where you aren't doing enough steps for it to be tedious, and/or you want permanent records. Would do it again, but not for atomicity reasons.

Other comments have already discussed the issue with the outbox UDF, your external system has to poll and retry either way. Maybe I'm misunderstanding this?

KraftyOne - 2 minutes ago

You still need idempotency for side effects outside your database, that's true (and fundamental to durability). But now you get exactly-once semantics for operations on your database, which can be quite valuable if your workflow performs many such updates or they're particularly complicated or stateful.

mrkeen - 3 hours ago

I walked away from a job interview a few years ago on this point.

One of the technical questions was "if you have a db and a message queue, how do you get your update to alter both or neither (i.e. transactionally)"?

I thought about it for a couple of minutes, then came back with something like "I can't, and you can't either." Then I proposed the usual spiel about using a replicated-state-machine/write-ahead-log/event-sourcing (whatever it might be called at the time) and leaning into eventual consistency as the only practical solution.

He asked if I'd heard about the outbox pattern, so I let him describe it. Sure enough it sounded like this article. The secret to transacting across the database D and the message queue Q:

  (D,Q)

is to split D into two parts (the State and the Outbox), transact across those instead

  (S,O)    Q

and then just pretend that you have a transaction across D and Q.

andix - 2 hours ago

With an inbox/outbox pattern it's possible. The incoming message might be processed more than once, and an outgoing message might be sent more than once. That's the limitation, and the system needs to be able to handle it.
If you can't de-duplicate messages it's not possible, that's true.
vlovich123 - 2 hours ago

> Sure enough it sounded like this article
FWIW The article literally talks about the challenges with getting this to actually work and recommends removing it and just using the DB for everything.
- mrkeen - 2 hours ago
  But that's what the outbox pattern is. You take the problem of transacting between more than one system, and by "just using the db", you declare the problem solved, leaving the communication with other systems as an exercise for the reader.
  From the end of the article:
  The enqueue_workflow UDF creates this row in the same transaction as the user database update, guaranteeing atomicity
  - koverstreet - an hour ago
    
    The point is that the "outbox pattern" is not an atomic transaction. You fundamentally don't have those in the distributed world, via the CAP theorem, and if you want anything close to the guarantees a local transactional database gives you in a distributed system you have to design your schema for it.
    Distributed coherency is not something you can abstract away, the abstractions all leak.
  - sarchertech - 2 hours ago
    
    Your right! But the outbox pattern is good enough for a lot of purposes. The outbox pattern works if the only reason the write to the 2nd system can fail is because of transient issues. It will keep trying until the system is back up.
    If the 2nd system write can fail for non-transient reasons, the outbox pattern doesn’t work and you need either 2 phase commit or a distributed saga.
    I wrote about this here a few years ago.
    https://linuxblog.io/the-two-generals-problem/
pylua - 2 hours ago

Just post to the database then asynch send to message queue. Messages should still be idempotent by the consumer but at least this follows rest and is transactional.
It’s simple and easy to follow. At scale use multi tenancy.
belinder - 3 hours ago

Why not just put the message queue in the same db
- CodesInChaos - 3 hours ago
  
  That's what I generally choose. You don't need to worry about distributed system semantics, if you choose to not make the system distributed.
  However the way Postgres keeps around obsolete rows (deleted or modified) until they're vacuumed can cause problems for high throughput queues. So for those systems the complexity might be worth it. But I bet 90% of the time the choice to use a separate queue is premature optimization. And hopefully OrioleDB (undo based storage engine for postgres) will avoid most of these pitfalls reducing the need for separate queues even further.
- mrkeen - 3 hours ago
  
  Step 1: identify that you and at least one other node are separated by distance, and some lossy communication channel, and therefore form a distributed system.
  Step 2: propose a source of truth that everyone can listen to. Hearing the same facts in the same order should put everyone in the same state (eventual consistency)
  Step 3 (you are here): try to do better than EC, by merging the external queue into one of the nodes, making it the master.
  Step 4: Now there's no distance between the nodes, so no need to solve the distributed systems problem and you can retire the queue.
  - convolvatron - 2 hours ago
    
    eventual consistency as generally used doesn't guarantee that events are presented in the same order. I use 'monotonic consistency' for that, but idk how common that is.
    
    AlotOfReading - 33 minutes ago
    
    Order independence/monotonicity is strong EC rather than regular EC.
    
    mrkeen - 2 hours ago
    
    Yes, same-ordering gives you EC, not the other round.
- KraftyOne - 3 hours ago
  
  That's what the post is about! Once you're doing that, you really do have transactions between the state and the queue.
jayd16 - 3 hours ago

It's a bit of trick that the outbox to queue part of it likely needs to support "at least once but duplicates possible" into the queue.
- mrkeen - 3 hours ago
  
  "Send multiple times from D to Q and deduplicate with a UUID" (idempotency) is well short of "insert into both D and Q or neither" (atomicity)
  - jayd16 - 2 hours ago
    
    What are you saying here? I'm pointing out that you need to be ok with the lack of exactly once transaction between O and Q. Maybe you're agreeing and simply saying that's a fine?
- KraftyOne - 3 hours ago
  
  Every item will be written to the queue exactly once (as the update is transactional). Queue processing may need at-least-once semantics, yes, depending on what exactly you're doing.
  - jayd16 - 2 hours ago
    
    The queue write is not in the transaction. The proposed trick is that that is ok because an outbox is able to be transacted on. It kicks the can some what...
game_the0ry - 2 hours ago

I envy you DB + distributed systems specialists. Reminds me I still have a lot to learn.

cloudie78 - 5 hours ago

Congratulations, you discovered a mutex.

Is it really a distributed system or just a bunch of services with a central database?

tomjakubowski - 4 hours ago

I don't think it's true that distributed and decentralized mean the same thing. A hub and spoke rail system is centralized, but it's still a distributed system, if it has multiple trains running concurrently.* A distributed system has to coordinate somehow, and a single central DB is one way of doing it.
*: edit, maybe a better example here is a rail system with a single central dispatcher is centralized but may still be distributed
- munk-a - 4 hours ago
  
  In fact - if you're building a very large distributed system the goal is usually to shrink that centralized component to the smallest and most robust surface you can. If the system is well designed it is amazing just how much consistency power you can get from a tiny component of centralization.
  There are always tradeoffs of course, but building a truly decentralized system requires some really difficult compromises to correctness. The two general's problem is a great piece of reading on this topic - distribution always requires compromises in general, but to fully remove an authority on truth gets quite tricky.
  - brentjanderson - 4 hours ago
    
    I think Ducklake[1] is a terrific example of this. They said "look, let's build a lake house over S3, but for the bit that needs strong consistency (the manifest of which S3 blobs are in play), let's use Postgres". Postgres as a metadata catalog or control plane is brilliant for this, since you get strong consistency and the scaling story around a metadata catalog is far different than the volume of data you need to store. Use S3 for volume, Postgres for consistent metadata.
    A similar pattern has spilled out of projects like Warpstream[2], which I suspect is using Postgres behind the scenes of their control plane.
    [1]: https://ducklake.select
    [2]: https://www.warpstream.com/
    
    munk-a - 4 hours ago
    
    I have built and maintain a system that uses a very similar system - we register artifacts with UUIDs into S3 in a specifically write-once, never edit, never remove approach and then store those UUIDs in a postgres system. We simply juggle around the connection of other model objects to UUIDs as needed allowing us to achieve safe guarantees without burdening the centralized system with the massive volume (these artifacts are often 50MB+ PDFs). I will mention that I am quite fond of this approach but it's good to be aware that introducing levels of abstraction like this do necessarily widen some fail points on the storage side - if your service uses multiple persistence stores each additional store exposes yet another point where inconsistency could be introduced and/or a message could be lost. Still, fragmenting your data over multiple stores that are particularly well suited for their specialized usages can be huge for performance and cost.
  - mrkeen - 2 hours ago
    
    > The two general's problem is a great piece of reading on this topic
    It is!
    And the solution is to add an extra general on the left side. Let's call him Outus Boxus. The two generals on the left side can communicate in perfect lockstep. Then if you need the general on the right to find out about something, you can send a few workers to tell him or something...
    More seriously though, you can have a DS for two reasons: tech or political.
    Tech means scaling or reliability. So clients can be serviced by any of the nodes.
    Political means different actors don't have a central authority. You can't stick two banks into one db.
    This technique doesn't seem to address either aspect.
- KraftyOne - 4 hours ago
  
  Exactly! It's a distributed system, with many processes performing work in parallel, with a central database as a coordination point, used as little as possible. A mutex wouldn't get quite the same performance :)
  - nyrikki - 3 hours ago
    
    A more modern term is your system is a single architectural quantum’
    Neal Ford calls this a distributed monolith because a change to a database schema can break every single service at once, but there are very valid uses of this method.
    There are decades of books on the foot guns as we used this even back in the client-server days.
    One suggestion I have is to research where the first version of SoA failed, especially as these systems tend to erode into Enterprise Service Busses.
    Products like Apache airflow tend to have value not because of the persistence layer, but because they force workflows into DAGs, which is an enforceable structural constraint, while SQL, being declarative, can sometimes force you into trying to enforce governance through observing behavior.
    The former is not subject to Rice’s theorem, while the latter is.
    If you actively control for these it will greatly increase the lifetime of this system before (or if) you reach the point you have to replace the system.
gopalv - an hour ago

> Is it really a distributed system or just a bunch of services with a central database?
I've asked myself this question every single time I've had to use Zookeeper.
Apache Kafka being the poster child of the problem, with HBase in a close second.
- 5 hours ago

[deleted]

jdw64 - 4 hours ago

So my understanding is that they're aligning the workflow progression unit and the database commit unit on a one-to-one basis. In other words, each step in the workflow becomes a database commit unit. That's why the outbox pattern gets simplified. But in exchange, the database itself becomes tightly coupled to the workflow, which will make it architecturally difficult to separate later on. Although, to be fair, I almost never actually need to separate the database anyway.

In most services, I often swap out the message broker or the workflow engine, but the database almost always stays the same.

I'm not sure if I've understood this correctly.

KraftyOne - 4 hours ago

Yes, the core design is building a workflow system on a database--essentially, replacing the central orchestrator most workflow systems use with a Postgres database. This previous blog post goes into more detail: https://www.dbos.dev/blog/postgres-is-all-you-need-for-durab... (HN discussion: https://news.ycombinator.com/item?id=48313530)

munk-a - 4 hours ago

We've leveraged the atomicity of transactions with a fail-safe approach for external service interactions for client email sending. This could certainly be done with a formal queue though it'd operate very similarly and achieve the same guarantees as we have today (and was built when we were too small to justify such an infra spend). Internally we have jobs that execute complex logic to transform data from a pending state to a computed state which lean on the DB's atomicity to guarantee that data is successfully transitions and those tasks are all incredibly resilient - but when a secondary persistence store is involved transactional guarantees need to be compromised in some manner. In our email sending example we have the opinion that it is more important to guarantee a client receives all notifications compared to a notification being guaranteed to be sent precisely once so our mechanism in sending is to confirm email sending was successful and then close a transaction that removes that message from the pending list.

There will always be a window for potential loss due to solar flares/whatever but the key in designing a system like this is to make sure you're aware of how the system can fail, accept that outcome and then work to, as much as possible, shrink the distance in cycles/logic between each persistence committal. Logic should be front-loaded to do as much prep work as possible before any irreversible actions happen and then those irreversible actions should be ordered to your preference and dispatched as quickly and cheaply as possible in a safe manner.

Crowberry - 3 hours ago

We’ve got an in-house pubsub solution that lives in the main applications database, so pretty much exactly as described in the article. And the atomicity it allows is indeed really nice!

- 3 hours ago

[deleted]

zyngaro - 2 hours ago

The article is ridden with misconception. Have you guys ever heard of the CAP theorem ? Disturbed system suck let's implement a non distributed one. The title is also misleading: Postgres transactions are not distributed.

aynyc - 3 hours ago

OK. I've read it a few times and still don't understand. Where is the distributed part? You store data in a single transaction into postgres. What/who is notifying the message queue?

KraftyOne - 3 hours ago

You build a distributed system on top of this! For example, you may have many distributed workers durably executing workflows from the Postgres-backed task queue. The Postgres transactions allow you to atomically perform operations spanning both your task queue and your business data.
Here's another blog post about how a Postgres-backed task queue can run at scale: https://www.dbos.dev/blog/making-postgres-queues-scale
hoppp - 3 hours ago

I've been writing distributed workers for ages with stored functions that have a SELECT FOR UPDATE query.
When workers query the db for jobs the rows get locked by the select and there are no race conditions or duplicate assigned jobs

hoppp - 3 hours ago

Just start writing stored functions already.

evilturnip - 4 hours ago

Can you use postgres as a state store for a distributed application?

It seems this article is trending toward that view: If you can maintain transactional consistency along with application workflow state, then would this generalize to maintaining distributed application state in general?

The follow-up would be: Would this be preferable to Valkey/Redis?

munk-a - 4 hours ago

Yes you can - usually I think it's advisable to wrap postgres in a shim application to provide a consistently defined surface you can control but postgres can absolutely serve as the authority node on data correctness.
As to which technical solution would be optimal there are a bunch of factors to consider and I think preferences around features could lead you to a variety of options. Postgres is excellent as long as you're minimizing the amount of data piping directly through it or operating at a reasonable scale.
mrkeen - 2 hours ago

> then would this generalize to maintaining distributed application state in general?
Yes, in the sense of 'too good to be true'
KraftyOne - 4 hours ago

[dead]

- 3 hours ago

[deleted]

bsaul - 5 hours ago

i don't understand the last point of UDF. Either you need the state to be updated atomically across different systems or you don't. But writing a row in a system in order to update the second one at any random time in the future isn't really much different from enqueuing a job in queue.

mrkeen - 3 hours ago

Your intuition sounds right to me.
This sounds a lot like reinventing a message queue. Someone trying this in the future might learn painful lessons about ordering, commits, partitioning, dead-letter-queues, replayability, don't-call-me-I'll-call-you, and anything else a Kafka-like comes with out of the box.
KraftyOne - 4 hours ago

The key is that the UDF's enqueue is transactional with the database update. Let's say the database update is inserting a new order. This provides the guarantee that if a new order is inserted, a job to process the order is also enqueued. It's impossible for a new order to be inserted without its processing job also being enqueued. Then the durable workflow/queue system is responsible for making sure the processing job, once enqueued, actually executes.
- LgWoodenBadger - 2 hours ago
  
  And if that job never runs? Or if that job runs and then fails to commit that it ran in postgres?
  - halfcat - an hour ago
    
    The job will run the next time a worker runs (in both cases).
    And doesn’t that mean the job potentially runs twice? Yes.
    In DBOS there are two kinds of “things that run”: workflows, and steps (workflows are made of steps).
    Workflows must be deterministic (so it’s fine if it runs twice). Steps don’t have to be deterministic but have at-least-once execution (so it’s best if these are idempotent).

lima - 3 hours ago

[dead]