Listen to Database Changes Through the Postgres WAL

peterullrich.com

166 points by pjullrich 6 days ago


JoelJacobson - 11 hours ago

> The problem with Postgres' NOTIFY is that all notifications go through a single queue!

> Even if you have 20 database connections making 20 transactions in parallel, all of them need to wait for their turn to lock the notification queue, add their notification, and unlock the queue again. This creates a bottleneck especially in high-throughput databases.

We're currently working hard on optimizing LISTEN/NOTIFY: https://www.postgresql.org/message-id/flat/6899c044-4a82-49b...

If you have experience with an actual workload where you are currently hitting performance/scalability problems, I would be interested in hearing from you, to better understand the workload. In some workloads, you might only listen to a single channel. For such single-channel workloads, the current implementation seems hard to tweak further, given the semantics and in-commit-order guarantees. However, for multi-channel workloads, we could do a lot better, which is what the linked patch is about.

The main problem with the current implementation for multi-channel workloads is that we currently signal and wake all listening backends (a backend is the PostgreSQL process your client is connected to), even if they are not interested in the specific channels being notified in the current commit. This means that if you have 100 connections open, each of which has done a LISTEN on a different channel, then when someone does a NOTIFY on one of those channels, instead of just signaling the one backend that listens on that channel, all 100 backends are signaled. For multi-channel workloads, this can mean an enormous extra cost from the context switching caused by the signaling.

I would greatly appreciate it if you could reply to this comment and share workloads where you've had problems with LISTEN/NOTIFY: approximately how many listening backends you had, how many channels, and the mix of volume across those channels. Anything that could help us run realistic simulations of such workloads would improve the benchmark tests we're working on. Thank you.
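To make the wakeup cost concrete, here's a toy counting model (plain Python, not PostgreSQL code; function names are mine) using the numbers from the scenario above: 100 backends, each listening on its own channel.

```python
# Toy model of backend wakeups per NOTIFY; not the actual async.c code.

def wakeups_broadcast(num_listeners: int, notify_commits: int) -> int:
    """Current behavior: every NOTIFY commit signals every listening
    backend, whether or not it listens on the notified channel."""
    return num_listeners * notify_commits

def wakeups_targeted(interested_listeners: int, notify_commits: int) -> int:
    """Behavior the patch aims for: only backends actually listening
    on the notified channel are signaled."""
    return interested_listeners * notify_commits

# 100 backends, each LISTENing on its own channel, 1,000 NOTIFY commits:
print(wakeups_broadcast(100, 1_000))  # 100000 signals (and context switches)
print(wakeups_targeted(1, 1_000))     # 1000 signals
```

The gap between the two numbers grows linearly with the number of listening backends, which is why this mostly hurts multi-channel, many-connection workloads.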

matharmin - 7 hours ago

We're relying heavily on logical replication for PowerSync, and I've found it to be a great tool, but it is also very low-level and under-documented. This article gives a great overview - I wish I'd had it when we started on our implementation.

Some examples of difficulties we've run into:

1. LSNs for transactions (commits) are strictly increasing, but not for individual operations across transactions. You may not pick this up during basic testing, but it starts showing up when you have concurrent transactions.

2. You cannot resume logical replication in the middle of a transaction (you have to restart the transaction), which becomes relevant when you have large transactions.

3. In most cases, replication slots cannot be preserved when upgrading Postgres major versions.

4. When you have multiple Postgres clusters in an HA setup, you _can_ use logical replication, but it becomes more tricky (better in recent Postgres versions, but you're still responsible for making sure the slots are synced).

5. Replication slots can break in many different ways, and there's no good way to know all the possible failure modes until you've run into them. Especially fun when your server ran out of disk space at some point. It's a little better with Postgres 17+ exposing wal_status and invalidation_reason on pg_replication_slots.

6. You need to make sure to acknowledge keepalive messages and not only data messages, otherwise the WAL can keep growing indefinitely when you have no incoming changes (depending on the hosting provider).

7. Common drivers often either don't implement the replication protocol at all, or abstract away low-level details that you actually need. Here it's great that the article actually explains the low-level protocol details.
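Point 1 bites in practice partly because pg_lsn values are often handled as strings, and lexicographic comparison gets them wrong (e.g. "9/0" sorts after "16/0"). A minimal sketch (plain Python; the helper name is mine) of comparing LSNs numerically:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '16/B374D848' into a 64-bit integer.
    The hex part before the slash is the high 32 bits, the part after
    the slash is the low 32 bits."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Lexicographic comparison of the raw strings is wrong...
assert "9/0" > "16/0"
# ...but numeric comparison orders commit LSNs correctly.
assert lsn_to_int("9/0") < lsn_to_int("16/0")
assert lsn_to_int("16/B374D848") == (0x16 << 32) | 0xB374D848
```

Tracking your resume position as an integer like this (per commit, not per operation) sidesteps the ordering surprise described above.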

tengbretson - an hour ago

The longer I work, the fewer legitimate use cases for LISTEN/NOTIFY I seem to find. At this point, I only really see value in using it to let a polling loop sleep longer, with notify events acting as interrupts. Even that use case comes with its own set of problems that may make it unsuitable.
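The pattern described above can be sketched with a threading.Event standing in for the NOTIFY callback (plain Python; the timings and names are illustrative, not a real LISTEN client):

```python
import threading
import time

wake = threading.Event()   # in a real system, set by a LISTEN/NOTIFY callback
stop = threading.Event()
polls = []

def poll_loop(sleep_s: float = 30.0) -> None:
    """Poll, then sleep for a long interval - but wake early on notify."""
    while not stop.is_set():
        polls.append(time.monotonic())  # do the actual work (e.g. drain a table)
        wake.wait(timeout=sleep_s)      # long sleep, interruptible by NOTIFY
        wake.clear()

t = threading.Thread(target=poll_loop)
start = time.monotonic()
t.start()
time.sleep(0.2)            # loop has polled once and is now sleeping
wake.set()                 # a NOTIFY arrives: triggers an immediate second poll
time.sleep(0.2)
stop.set()
wake.set()                 # wake the loop one last time so it can exit
t.join()
elapsed = time.monotonic() - start
print(len(polls), elapsed < 5.0)  # two polls, without a 30s wait between them
```

The notify is purely an optimization here: if it is lost, the loop still makes progress on the next timeout, which is what makes the pattern tolerable despite NOTIFY's at-most-once delivery.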

isidor3 - 10 hours ago

I was hoping from the title that this would be about creating an audio representation so you could listen to the database work as if it were an old PC that had distinctive sounds for different tasks

JoelJacobson - 12 hours ago

> If you call pg_notify or NOTIFY inside a trigger, it will get called 100,000 times and send out 100,000 notifications if you change 100,000 rows in a single transaction which from a performance perspective is ... not ideal.

This is only true if those notifications are different; if they are identical, e.g. when the notification just alerts listeners that some table has new data (for cache invalidation), they are sent out as one notification only. See the source code comment in async.c:

     *   Duplicate notifications from the same transaction are sent out as one
     *   notification only. This is done to save work when for example a trigger
     *   on a 2 million row table fires a notification for each row that has been
     *   changed. If the application needs to receive every single notification
     *   that has been sent, it can easily add some unique string into the extra
     *   payload parameter.
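That dedup behavior (and the unique-payload escape hatch) can be modeled in a few lines (plain Python; a toy model of the semantics, not the actual async.c implementation):

```python
def dedupe_notifications(notifies):
    """Duplicate (channel, payload) pairs queued within one transaction
    are collapsed into a single notification, preserving first-seen order."""
    seen, out = set(), []
    for chan, payload in notifies:
        if (chan, payload) not in seen:
            seen.add((chan, payload))
            out.append((chan, payload))
    return out

# A per-row trigger sending an identical payload yields one notification...
same = dedupe_notifications([("tbl_changed", "")] * 100_000)
print(len(same))    # 1
# ...while a unique payload per row (e.g. the row id) sends every one.
unique = dedupe_notifications([("tbl_changed", str(i)) for i in range(100_000)])
print(len(unique))  # 100000
```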

refset - 9 hours ago

Recently released Clojure implementation of the same pattern: https://github.com/eerohele/muutos

JoelJacobson - 10 hours ago

> It works, but suddenly your query times explode! Instead of doing 1 million transactions per second* you can now do only 3 (*These numbers were exaggerated for dramatic effect)

In general, a single-queue design doesn’t make throughput collapse when you add more parallelism; it just gives you a fixed ceiling. With a well-designed queue, throughput goes up with concurrency, then flattens when the serialized section (the queue) saturates, maybe sagging a bit from context switching.

If instead you see performance severely degrade as you add workers, that typically means there’s an additional problem beyond “we have one queue” — things like broadcast wakeups (“every event wakes every listener”), global scans on each event, or other O(N) work per operation. That’s a very different, and more serious, scalability bug than simply relying on a single queue.
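The two shapes can be captured in a toy capacity model (plain Python; all constants are invented for illustration):

```python
def throughput_single_queue(workers: int, work_s=1e-4, queue_s=1e-5) -> float:
    """A well-behaved single queue: throughput rises with concurrency,
    then flattens at the ceiling 1/queue_s once the serialized
    queue section saturates."""
    return min(workers / (work_s + queue_s), 1.0 / queue_s)

def throughput_broadcast_wakeups(workers: int, cores=8,
                                 work_s=1e-4, wake_s=1e-6) -> float:
    """The pathological shape: each commit also pays O(workers) wakeup
    cost, so CPU per commit grows with concurrency and throughput
    *falls* as you add workers."""
    cpu_per_commit = work_s + workers * wake_s
    return cores / cpu_per_commit

assert throughput_single_queue(5) < throughput_single_queue(10)    # still scaling
assert throughput_single_queue(100) == throughput_single_queue(1000)  # flat ceiling
assert throughput_broadcast_wakeups(1000) < throughput_broadcast_wakeups(10)  # degrades
```

The first curve flattens; only the second one bends downward, which is the signature of per-operation O(N) work rather than a mere serialized section.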

decasia - 7 hours ago

I misinterpreted the title and was hoping that this was going to be a post about realtime algorithmic music generation from the Postgres WAL, something like the Hatnote “listen to Wikipedia edits” project.

http://listen.hatnote.com/

davidelettieri - 12 hours ago

It's worth mentioning debezium https://debezium.io/

It lets you publish all changes from the db to Kafka.

jumski - 7 hours ago

Worth mentioning that the WAL also powers Supabase Realtime [0] and WalEx [1], which both do a very similar thing - tailing the WAL changes in order to emit them as WebSocket events.

I never worked with WalEx but have experience with Supabase Realtime, and it is a joy to work with and fits great into the Supa ecosystem. So many layers disappear when you rely more on Postgres!

[0] https://github.com/supabase/realtime [1] https://github.com/cpursley/walex

greener_grass - an hour ago

This is how Debezium works.

It is probably best to use that unless there is a strong reason against it.

agentultra - 3 hours ago

I wrote a Haskell library for this as well: https://hackage.haskell.org/package/postgresql-replicant

jwsteigerwalt - 5 hours ago

I have had great success in the past building business-event alert systems with NOTIFY on Heroku data-warehouse replications of Salesforce data. The potential for performance issues was always a nagging worry in the back of my head, and this article is a great example of why.

antirez - 10 hours ago

Related: Redis has keyspace notifications, which do something similar; the feature is not very well known, but when it is needed, it is really needed. We are thinking of extending this mechanism in new ways. Similarly, I have seen setups where the notifications arriving from Postgres/MySQL are used to materialize (and keep updated) a cached view in Redis. To me, it is interesting how certain teams rely on these kinds of mechanisms provided by database systems, while other teams like to do things on the client side to keep full control, even in the face of having to reimplement some logic.
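The cached-view pattern mentioned above can be sketched in a few lines (plain Python dicts standing in for the database table and the Redis cache; all names are mine):

```python
# Toy sketch: keep a materialized cached view in sync by applying change
# notifications, instead of recomputing the whole view on every read.

source = {"a": 1, "b": 2}                       # stands in for the source table
cache = {k: v * 10 for k, v in source.items()}  # the "view": value * 10

def on_change(key: str) -> None:
    """Handler invoked for each change notification from the database."""
    if key in source:
        cache[key] = source[key] * 10  # re-materialize just the changed entry
    else:
        cache.pop(key, None)           # row was deleted: drop it from the view

source["a"] = 5
on_change("a")
del source["b"]
on_change("b")
print(cache)  # {'a': 50}
```

The appeal is that each notification touches only the affected entry; the risk, as elsewhere in this thread, is that a missed notification leaves the view stale until something reconciles it.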

edistra - 8 hours ago

Thanks for the deep dive on Postgres WAL.

I recently had the opportunity to play with the PostgreSQL WAL while implementing an OpenSearch CDC pipeline, and it was really exciting to see what you can achieve with it.

Be cautious with idle replication slots though; I got bitten by inactive slots filling up the production database's storage.

PostgreSQL 18 introduces idle_replication_slot_timeout to mitigate this.
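As a hedged config sketch of that mitigation (the parameter name is from the comment above; the timeout value here is an arbitrary example):

```
# postgresql.conf (PostgreSQL 18+)
# Invalidate replication slots that have been idle longer than this,
# so a dead or forgotten consumer cannot retain WAL indefinitely.
idle_replication_slot_timeout = '1d'
```

On Postgres 17+, the wal_status and invalidation_reason columns of pg_replication_slots (mentioned elsewhere in this thread) show whether a slot has been invalidated and why.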

a_t48 - 11 hours ago

I had to set one of these up somewhat under duress to replace a poorly written scheduled query to export data to BigQuery. It’s nice to know what it’s actually doing under the hood, thanks for the info.

hbarka - 13 hours ago

I remember, decades ago, working on something similar in SQL Server called transaction log shipping. It worked on the same principle - shipping the write-ahead log to another server - and was an ETL method for incremental change capture into a data warehouse.

dm03514 - 4 hours ago

Great article! This is also timely for me: I spent all of last week deep in the Postgres documentation learning about replication (wish I'd had this article).

I'm building kafka connect (i.e. Debezium) compatible replication:

https://github.com/turbolytics/librarian/pull/41

-----

I really like the Mongo change stream API. I think it elegantly hides a lot of the complexity of replication. As a side effect of building postgres -> kafka replication, I'm also trying to build a nicer replication primitive for Postgres that hides some of the underlying replication protocol complexity!

cpursley - 8 hours ago

Thanks for the WalEx mention! (https://github.com/cpursley/walex) Also, there's https://github.com/sequinstream/sequin if you want something more robust: it already works with connectors, smokes Debezium in terms of performance, and is simpler to operate.

stackskipton - 2 hours ago

The Ops person in me is crying.

Please stop doing this via the database. Scaling issues can develop, and we in Ops catch flak for your bad design choices. Use Kafka/RabbitMQ/Redis, PLEASE.

morshu9001 - 13 hours ago

This is cool, but also, why is pg_notify designed with such a contended queue?