Listen to Database Changes Through the Postgres WAL
peterullrich.com | 166 points by pjullrich 6 days ago
> The problem with Postgres' NOTIFY is that all notifications go through a single queue!
> Even if you have 20 database connections making 20 transactions in parallel, all of them need to wait for their turn to lock the notification queue, add their notification, and unlock the queue again. This creates a bottleneck especially in high-throughput databases.
We're currently working hard on optimizing LISTEN/NOTIFY: https://www.postgresql.org/message-id/flat/6899c044-4a82-49b...
If you have any experience with actual workloads where you are currently hitting performance/scalability problems, I would be interested in hearing from you, to better understand the actual workload. In some workloads, you might only listen to a single channel. For such single-channel workloads, the current implementation seems hard to tweak further, given the semantics and in-commit-order guarantees. However, for multi-channel workloads, we could do a lot better, which is what the linked patch is about.

The main problem with the current implementation for multi-channel workloads is that we currently signal and wake all listening backends (a backend is the PostgreSQL process your client is connected to), even if they are not interested in the specific channels being notified in the current commit. This means that if you have 100 connections open, in which each connected client has made a LISTEN on a different channel, then when someone does a NOTIFY on one of those channels, instead of just signaling the backend that listens on that channel, all 100 backends will be signaled. For multi-channel workloads, this can mean an enormous extra cost from the context switching caused by all that signaling.
I would greatly appreciate it if you could reply to this comment and share the workloads where you've had problems with LISTEN/NOTIFY, so we can better understand approximately how many listening backends you had, how many channels you had, and the mix of volume across those channels. Anything that helps us run more realistic simulations of such workloads will improve the benchmark tests we're working on. Thank you.
Here is the Commitfest entry if you want to help with reviewing/development/testing of the patch: https://commitfest.postgresql.org/patch/6078/
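For readers newer to LISTEN/NOTIFY, a minimal sketch of the multi-channel shape described above (channel and payload names are made up):

-- Each of many client connections subscribes to its own channel:
LISTEN jobs_tenant_42;

-- Some other connection notifies exactly one of those channels:
NOTIFY jobs_tenant_42, 'new job';
-- or, from SQL / inside a trigger:
SELECT pg_notify('jobs_tenant_42', 'new job');

-- With the current implementation, every listening backend is woken up by this
-- commit, even though only the one listening on jobs_tenant_42 has a matching
-- subscription; the linked patch aims to signal only that backend.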
I have listen/notify on most changes in my database. I'm not sure I've experienced any performance issues, though I can't say I've been putting things through their paces. IMHO LISTEN/NOTIFY's simplicity outweighed the perf gains from the WAL approach.

I'm only sharing this in case it's helpful:
def up do
  whitelist = Enum.join(@user_columns ++ ["tick"], "', '")

  execute """
  CREATE OR REPLACE FUNCTION notify_phrasing() RETURNS trigger AS $$
  DECLARE
    notif jsonb;
    col_name text;
    col_value text;
    uuids jsonb := '{}'::jsonb;
    user_columns text[] := ARRAY['#{whitelist}'];
  BEGIN
    -- First, add all UUID columns
    FOR col_name IN
      SELECT column_name
      FROM information_schema.columns
      WHERE table_name = TG_TABLE_NAME AND data_type = 'uuid'
    LOOP
      EXECUTE format('SELECT ($1).%I::text', col_name)
        INTO col_value
        USING CASE WHEN TG_OP = 'DELETE' THEN OLD ELSE NEW END;

      IF col_value IS NOT NULL THEN
        uuids := uuids || jsonb_build_object(col_name, col_value);
      END IF;
    END LOOP;

    -- Then, add user columns if they exist in the table
    FOREACH col_name IN ARRAY user_columns
    LOOP
      IF EXISTS (
        SELECT 1
        FROM information_schema.columns
        WHERE table_name = TG_TABLE_NAME AND column_name = col_name
      ) THEN
        EXECUTE format('SELECT ($1).%I::text', col_name)
          INTO col_value
          USING CASE WHEN TG_OP = 'DELETE' THEN OLD ELSE NEW END;

        IF col_value IS NOT NULL THEN
          uuids := uuids || jsonb_build_object(col_name, col_value);
        END IF;
      END IF;
    END LOOP;

    notif := jsonb_build_object(
      'table', TG_TABLE_NAME,
      'event', TG_OP,
      'uuids', uuids
    );

    PERFORM pg_notify('phrasing', notif::text);
    RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;
  """

  # Create trigger for each table
  Enum.each(@tables, fn table ->
    execute """
    CREATE TRIGGER notify_phrasing__#{table}
    AFTER INSERT OR UPDATE OR DELETE ON #{table}
    FOR EACH ROW EXECUTE FUNCTION notify_phrasing();
    """
  end)
end
I do react to most (70%) of my database changes in some way, shape, or form, and post them to a PubSub topic with the uuids. All of my dispatching can be done off of uuids.

We use it like this:
CREATE TRIGGER notify_events_trg
AFTER INSERT ON xxx.events
FOR EACH ROW EXECUTE PROCEDURE public.notify_events();

CREATE FUNCTION public.notify_events() RETURNS trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  PERFORM pg_notify('events', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$;
And then we have a bunch of triggers like this on many tables:

CREATE TRIGGER create_category_event_trg
AFTER INSERT OR DELETE OR UPDATE ON public.categories
FOR EACH ROW EXECUTE PROCEDURE public.create_category_event();
CREATE FUNCTION public.create_category_event() RETURNS trigger
  LANGUAGE plpgsql SECURITY DEFINER
AS $$
DECLARE
  category RECORD;
  payload JSONB;
BEGIN
  category := COALESCE(NEW, OLD);
  payload := jsonb_build_object('id', category.id);

  IF NEW IS NULL OR NEW.deleted_at IS NOT NULL THEN
    payload := jsonb_set(payload, '{deleted}', 'true');
  END IF;

  INSERT INTO xxx.events (channel, inserted_at, payload)
  VALUES ('category', NOW() AT TIME ZONE 'utc', payload);

  RETURN NULL;
END;
$$;
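Not part of the snippets above, but for context, a sketch of what the consuming side of this pattern can look like; because xxx.events is a durable table, anything missed while disconnected can be replayed (the watermark literal is a placeholder):

-- Wake up on new events:
LISTEN events;

-- On (re)connect, or whenever notified, fetch anything not yet processed from
-- the durable events table instead of trusting notification payloads alone:
SELECT channel, inserted_at, payload
FROM xxx.events
WHERE inserted_at > '2024-01-01 00:00:00'  -- placeholder for the consumer's own watermark
ORDER BY inserted_at;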
We found no notable performance issues. We have a single LISTEN in another application. We did some stress testing and found that it performs way better than we would ever need.

Thanks for the report. For that use case (a single application using a single connection with a LISTEN), it's expected that it should perform well, since there is only a single backend which will be context-switched to when each NOTIFY signals it.
Just out of curiosity, could you try to frame in what context this would or would not work? If you have multiple backends with multiple connections for instance? And then if we start with such a "simple" solution and we later need to scale with distributed backends, how should we do this?
In the linked "Optimize LISTEN/NOTIFY" pgsql-hackers thread, I've shared a lot of benchmark results for different workloads, which also include results for how PostgreSQL currently works (this is "master" in the benchmark results), which can help you better understand what to expect for different workloads.
The work-around solution we used at Trustly (a company I co-founded), is a component named `allas` that a colleague of mine at that time, Marko Tikkaja, created to solve our problems, that massively reduced the load on our servers. Marko has open sourced and published this work here: https://github.com/johto/allas
Basically, `allas` opens up a single connection to PostgreSQL, on which it LISTENs on all the channels it needs to listen on. Clients then connect to `allas` over the PostgreSQL protocol, so it's basically faking a PostgreSQL server, and when a client does LISTEN on a channel via allas, allas will then LISTEN on that channel on the real PostgreSQL server over the single connection it maintains. Thanks to `allas` being implemented in Go, using Go's efficient goroutines for concurrency, it scales efficiently to lots and lots of client connections. I'm not a Go expert myself, but I've understood Go is quite well suited for this type of application.
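To make that concrete, the client side is just ordinary Postgres protocol traffic (host, port, and channel name below are made up):

-- Connect any normal Postgres client to allas instead of the database, e.g.:
--   psql "host=allas.internal port=6433 dbname=app"
-- then subscribe as usual; allas multiplexes this onto its single upstream
-- connection to the real PostgreSQL server:
LISTEN invoice_paid;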
This component is still being used at Trustly, and is battle-tested and production grade.
That said, it would of course be much better to avoid the need for a separate component, and fix the scalability issues in core PostgreSQL, so that's what I'm currently working on.
For folks like myself, who don't know much about DB internals but know a bit about Kubernetes, this sounds very akin to the K8s API watch cache.
https://danielmangum.com/posts/k8s-asa-watching-and-caching/
The post seems to say that NOTIFY is generally not a good idea, but then the comments here say that NOTIFY can actually work, depending on some particular things (which are not easy to know for newcomers to Postgres). That makes it a bit complicated to know what the way to go is for a new database.
In my case I have an IoT setting, where my devices can change their "DesiredState", and I want to listen on this to push some message to MQTT... but then there might also be other cases where I want to listen to some messages elsewhere (eg do something when there is an alert on a device, or listen to some unrelated object, eg users, etc)
I'm not clear right now what would be the best setting to do this, the tradeoffs, etc
Imagine I have somewhere in the range of 100k to 10M devices, and sometimes they are updated in bulk, changing their DesiredState 10k at a time. Would NOTIFY work in that case? Should I use the WAL/Debezium/etc.?

Can you try to "dumb down" in which cases we can use NOTIFY/LISTEN and in which cases it's best not to? You're saying something about single-channel/multi-channel/etc., but as a newcomer I'm not clear on what all of these are.
We're relying on logical replication heavily for PowerSync, and I've found it is a great tool, but it is also very low-level and under-documented. This article gives a great overview - I wish I had this when we started with our implementation.
Some examples of difficulties we've run into:

1. LSNs for transactions (commits) are strictly increasing, but not for individual operations across transactions. You may not pick this up during basic testing, but it starts showing up when you have concurrent transactions.

2. You cannot resume logical replication in the middle of a transaction (you have to restart the transaction), which becomes relevant when you have large transactions.

3. In most cases, replication slots cannot be preserved when upgrading Postgres major versions.

4. When you have multiple Postgres clusters in an HA setup, you _can_ use logical replication, but it becomes more tricky (better in recent Postgres versions, but you're still responsible for making sure the slots are synced).

5. Replication slots can break in many different ways, and there's no good way to know all the possible failure modes until you've run into them. Especially fun when your server ran out of disk space at some point. It's a little better with Postgres 17+ exposing wal_status and invalidation_reason on pg_replication_slots (a minimal monitoring query follows after this list).

6. You need to make sure to acknowledge keepalive messages and not only data messages, otherwise the WAL can keep growing indefinitely when you don't have incoming changes (depending on the hosting provider).

7. Common drivers often either don't implement the replication protocol at all, or attempt to abstract away low-level details that you actually need. Here it's great that the article actually explains the low-level protocol details.
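Regarding point 5, a minimal monitoring query (invalidation_reason needs Postgres 17+, as noted above; wal_status exists on earlier versions too):

-- Check replication slot health; 'lost' or a non-null invalidation_reason
-- means the slot can no longer be used and the consumer must resync.
SELECT slot_name,
       active,
       wal_status,           -- reserved | extended | unreserved | lost
       invalidation_reason   -- Postgres 17+
FROM pg_replication_slots;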
Yeah I was debating heavily between WAL and L/N. Tried to get WAL set up, struggled; tried to learn more about WAL, failed; tried to persevere, shot myself in the foot.
At the end of the day the simplicity of L/N made it well worth the performance degradation. Still making thousands-to-millions of writes per second, so when the original article said they were 'exaggerating' I think they may have gone too far.
I've been hoping WAL gets some more documentation love in the years/decades L/N will serve me should I ever need to upgrade, so please share more! :D
The longer I work the fewer legitimate use cases for LISTEN/NOTIFY I seem to find. At this point, I only really see the value in using it to allow a polling loop to have a longer sleep and to use notify events as interrupts. Even that use case comes with its own set of problems that may make it unsuitable.
I was hoping from the title that this would be about creating an audio representation so you could listen to the database work as if it were an old PC that had distinctive sounds for different tasks
We were kids learning C++ at school and we giggled each time we printed the \a character (system bell). At home I piped /dev/mem to /dev/dsp to get the harshest white noise.
> If you call pg_notify or NOTIFY inside a trigger, it will get called 100,000 times and send out 100,000 notifications if you change 100,000 rows in a single transaction which from a performance perspective is ... not ideal.
This is only true if those notifications are different; if they are identical, such as when the notification just alerts listeners that some table has new data (for cache invalidation), they are sent out as one notification only. See the source code comment in async.c:
* Duplicate notifications from the same transaction are sent out as one
* notification only. This is done to save work when for example a trigger
* on a 2 million row table fires a notification for each row that has been
* changed. If the application needs to receive every single notification
* that has been sent, it can easily add some unique string into the extra
* payload parameter.

Recently released Clojure implementation of the same pattern: https://github.com/eerohele/muutos
I saw the post on the clojure subreddit! I’m stoked to try it out.
Ever since I saw Martin Kleppmann's "Turning the database inside out" talk I've wanted an easy way to hook into a transaction log. Apache Samza/Kafka is very cool but I'm not going to set it up for personal projects. It'd be VERY cool to make materialized views straight from the log!!
If you do end up trying it out and hit any roadblocks, please don't hesitate to file an issue. I'd be very interested in hearing how it goes!
> It’d be VERY cool to make materialized views straight from the log!!
This is what Materialize does. You point it at some PG (or MySQL, or... probably lots more by now) sources, and then you can define arbitrary views on top of those sources using SQL (with the full power of relational logic: fully precise joins, etc.) The views are maintained incrementally, so they update almost instantly when the underlying data updates; you don't have to manually refresh them.
Disclaimer: I worked on Materialize for five years, and it is a commercial, proprietary project. But it is source-available (https://github.com/materializeinc/materialize) under a BSL-style license, and it has fairly generous terms for free usage.
> It works, but suddenly your query times explode! Instead of doing 1 million transactions per second* you can now do only 3 (*These numbers were exaggerated for dramatic effect)
In general, a single-queue design doesn’t make throughput collapse when you add more parallelism; it just gives you a fixed ceiling. With a well-designed queue, throughput goes up with concurrency, then flattens when the serialized section (the queue) saturates, maybe sagging a bit from context switching.
If instead you see performance severely degrade as you add workers, that typically means there’s an additional problem beyond “we have one queue” — things like broadcast wakeups (“every event wakes every listener”), global scans on each event, or other O(N) work per operation. That’s a very different, and more serious, scalability bug than simply relying on a single queue.
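A rough way to see this (not from the article; a sketch using the Universal Scalability Law, where $\sigma$ is the serialized fraction of each operation and $\kappa$ a per-operation coherence/crosstalk cost):

$$ X(N) \;\approx\; \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)} $$

With $\kappa = 0$ (a plain, well-implemented single queue), throughput rises with the worker count $N$ and flattens near $1/\sigma$; it only bends back down, as reported for LISTEN/NOTIFY with many listeners, when $\kappa > 0$, i.e. when each operation does work proportional to $N$, such as waking every listening backend.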
I misinterpreted the title and was hoping that this was going to be a post about realtime algorithmic music generation from the Postgres WAL, something like the Hatnote “listen to Wikipedia edits” project.
It's worth mentioning Debezium: https://debezium.io/
It lets you publish all changes from the db to Kafka.
Perhaps the situation has gotten better since I looked a few years ago, but my experience is the Debezium project doesn’t really guarantee exactly-once delivery. Meaning that if row A is replaced by row B, you might see (A, -1), (A, -1), (B, +1), if for example Debezium was restarted at precisely the wrong time. Then if you’re using this stream to try to keep track of what’s in the database, you will think you have negatively many copies of A.
It sounds silly, but caused enormous headaches and problems for the project I was working on (Materialize), one of whose main use cases is creating incrementally maintained live materialized views on top of replicated Postgres (or MySql) data.
Debezium published this doc on exactly-once delivery with their most recent 3.3.0 version. They don't support it natively, but say it can be achieved via Kafka Connect:
https://debezium.io/documentation/reference/stable/configura...
You could probably achieve something similar with the NATS JetStream sink as well, which has similar capabilities, though I think it doesn't have quite the same guarantees.
I switched to using Debezium a few months ago, after a Golang alternative to Debezium + Kafka Connect, ConduitIO, went defunct. I should have been using Debezium all along, as it is clearly the most mature, most stable option in the space, with the best prospects for long-term survival. Highly recommended, even if it is JVM-based (though they're currently doing some interesting stuff with Quarkus and GraalVM that might lead to a JVM-free binary at some point).
Debezium generally produces each change event exactly once if there are no unclean connector shut-downs. If that's not the case, I'd consider this a bug which ought to be fixed.
(Disclaimer: I used to lead the Debezium project)
The problem is that unclean connector shutdowns are a thing that can happen in real life.
They can happen, yes, although this should be a rather rare event (the most common reason would be misconfiguration, such as a K8s pod with too low memory limits). That said, work towards exactly-once has been done [1], utilizing the support for EOS in Kafka Connect (KIP-618).
In particular for Postgres, consumers can detect and reject duplicates really easily though, by tracking a watermark for the {Commit LSN / Event LSN} tuple, which is monotonically increasing. So a consumer just needs to compare the value of that tuple from the incoming event to the highest value it has received before. If the incoming value is lower, the event must be a duplicate. We added support for exposing this via the `source.sequence` field a while back upon request by the Materialize team, btw.
[1] https://debezium.io/documentation/reference/stable/configura....
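A hypothetical sketch of that consumer-side dedup in plain SQL; the table, consumer name, and LSN literals are made up, the point is just to keep the highest (commit LSN, event LSN) pair seen so far:

CREATE TABLE IF NOT EXISTS cdc_watermark (
  consumer   text PRIMARY KEY,
  commit_lsn pg_lsn NOT NULL,
  event_lsn  pg_lsn NOT NULL
);

-- Advance the watermark only if the incoming event is newer; if no row is
-- updated, the event is at or below the watermark and can be discarded as a
-- duplicate.
UPDATE cdc_watermark
SET commit_lsn = '16/B374D848', event_lsn = '16/B374D890'
WHERE consumer = 'my-sink'
  AND (commit_lsn, event_lsn) < ('16/B374D848'::pg_lsn, '16/B374D890'::pg_lsn);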
> They can happen, yes, although this should be a rather rare event
For our use case, it didn't matter if it was rare or not: the fact that it could happen at all meant we needed to be robust to it, which basically meant storing the entire database in memory.
> We added support for exposing this via the `source.sequence` field a while back upon request by the Materialize team btw.
Yes, I helped work on this! I'm not sure whether Materialize is still using it (it's been years since I've thought about MZ/Debezium integration) but it was helpful, thanks.
Does it handle the things that the post mentions about the ever-growing WAL, and the fact that some listeners can go offline and need to get back old messages (eg if Kafka crashes?)
Robustness is a key design goal of Debezium. It supports heartbeating to address WAL growth issues (I wrote about that issue at [1]). If Kafka crashes (or Debezium itself), it will resume consuming the replication slot from where it left off (applying at-least-once semantics, i.e. there can be duplicates in case of an unclean shutdown).
Naturally, if the consumer is down, WAL retained for that replication slot continues to grow until it comes back up again, hence monitoring is key (or the slot gets invalidated at a certain threshold, it will restart with a new initial snapshot).
Disclaimer: I used to lead the Debezium project
[1] https://www.morling.dev/blog/mastering-postgres-replication-...
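For reference, a standard query (not Debezium-specific) to watch how much WAL each slot is retaining:

-- How far behind each slot is, i.e. how much WAL Postgres must keep around
-- for it; alert when this keeps growing.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;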
Worth mentioning that the WAL is what enables, and is used by, Supabase Realtime [0] and WalEx [1], which both do a very similar thing: polling the WAL changes in order to emit them as WebSocket events.
I never worked with WalEx but have experience with Supabase Realtime and it is a joy to work with and fits great into the Supa ecosystem. So many layers are disappearing when you rely more on Postgres!
[0] https://github.com/supabase/realtime [1] https://github.com/cpursley/walex
This is how Debezium works.
It is probably best to use that unless there is a strong reason against.
Concur on this. I always thought that Debezium only supported MySQL binlog since that is the world I operate in. I did not know that it was also an option for Postgres CDC via the WAL. Now that I know, I would recommend defaulting to it.
Even with Debezium I have run into myriad issues. It blows my mind that someone would want to roll their own logic to set up replicas. IMHO this post should be advertised more as a fun intellectual exercise.
I wrote a Haskell library for this as well: https://hackage.haskell.org/package/postgresql-replicant
I have had great success in the past building business event alert systems on top of Salesforce data, using NOTIFY on Heroku data warehouse replications of Salesforce. The potential for performance issues was always a nagging worry in the back of my head, and this article is a great example of why.
Related: Redis has keyspace notifications, which do something similar. They are not very well known, but when they are needed, they are really needed. We are thinking of extending this mechanism in new ways. Similarly, I have seen setups where the notifications arriving from Postgres/MySQL are used to materialize (and keep updated) a cached view in Redis. To me, it is interesting how certain teams rely on these kinds of mechanisms provided by database systems, while other teams like to do things on the client side to have full control, even in the face of having to reimplement some logic.
Thanks for the deep dive on Postgres WAL.
I recently had the opportunity to play with the PostgreSQL WAL in the scope of implementing an OpenSearch CDC pipeline, and it was really exciting to see what is possible to achieve with it.
Be cautious with idle replication slots though; I got bitten by inactive slots filling up the production database's storage.
PostgreSQL 18 introduces idle_replication_slot_timeout to mitigate this.
For those not on 18 yet, there's also max_slot_wal_keep_size, which invalidates slots based on the amount of WAL they retain rather than based on a period of inactivity. It's available since PG 13.
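A small sketch of those guardrails (values are illustrative; max_slot_wal_keep_size is available since PG 13, and idle_replication_slot_timeout, per the comment above, arrives in PG 18):

-- Cap how much WAL any slot may retain; slots that fall further behind are
-- invalidated instead of filling the disk.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
-- On PostgreSQL 18+, optionally also invalidate slots that stay idle too long:
-- ALTER SYSTEM SET idle_replication_slot_timeout = '1d';
SELECT pg_reload_conf();

-- And drop slots you know are abandoned:
SELECT pg_drop_replication_slot('old_unused_slot');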
I had to set one of these up somewhat under duress to replace a poorly written scheduled query to export data to BigQuery. It’s nice to know what it’s actually doing under the hood, thanks for the info.
I remember decades ago working with something similar in SQL Server called transaction log shipping. It worked on the same write-ahead principle, shipping the log to another server, and was an ETL method for incremental change capture into a data warehouse.
Great Article! This is also timely for me, I spent all last week deep in the postgres documentation learning about replication (wish I had this article).
I'm building Kafka Connect (i.e. Debezium) compatible replication:
https://github.com/turbolytics/librarian/pull/41
-----
I really like the Mongo change stream API. I think it elegantly hides a lot of the complexity of replication. As a side effect of building Postgres -> Kafka replication, I'm also trying to build a nicer replication primitive for Postgres, one that hides some of the underlying replication protocol complexity!
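As an aside, for anyone who wants to poke at logical decoding without implementing the streaming protocol, the SQL-level functions are a handy low-tech primitive (a sketch; the slot name is arbitrary and the test_decoding output plugin ships with Postgres):

-- Create a logical replication slot using the built-in test_decoding plugin:
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Each call decodes and consumes the changes accumulated since the last call:
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Drop the slot when done, otherwise it retains WAL indefinitely:
SELECT pg_drop_replication_slot('demo_slot');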
Thanks for the WalEx mention! (https://github.com/cpursley/walex) Also, there's https://github.com/sequinstream/sequin if you want something more robust: it already works with connectors, smokes Debezium in terms of performance, and is simpler to operate.
The Ops person in me is crying.
Please stop doing this via database. Scaling issues can develop and we in Ops catch flak for your bad design choices. Use Kafka/RabbitMQ/Redis PLEASE.
Those can listen to database changes and aren't separate services to set up, keep online and pay for? Neat (and news to me)!
Except that if you disconnect for any reason, you will miss changes. If you want to break services out onto different servers, clients will still be crossing the streams to listen. Database schema changes must be coordinated properly. Basically, you are taking the one-database-to-rule-them-all model, which comes with massive drawbacks, and inserting additional capabilities into it.
Sure, if you are tiny, will forever remain tiny, then whatever, ignore this Ops person.
Please check out the linked libs, this is not true (and dig deeper into Erlang/Elixir). The schema change issue can be tricky, however.
This is cool, but also, why is pg_notify designed with such a contentious queue?
It's a common misconception that the single queue is a poor design choice. The user reports of seeing notifications/second severely degrade with lots of backends cannot be explained by the single-queue design alone. An efficient implementation of a single queue should flatten out as parallelism increases, not degrade and go towards zero.