Listen to Database Changes Through the Postgres WAL
peterullrich.com | 166 points by pjullrich 6 days ago
> The problem with Postgres' NOTIFY is that all notifications go through a single queue!
> Even if you have 20 database connections making 20 transactions in parallel, all of them need to wait for their turn to lock the notification queue, add their notification, and unlock the queue again. This creates a bottleneck especially in high-throughput databases.
We're currently working hard on optimizing LISTEN/NOTIFY: https://www.postgresql.org/message-id/flat/6899c044-4a82-49b...
If you have any experience with actual workloads where you are currently hitting performance/scalability problems, I would be interested in hearing from you, to better understand the actual workload. In some workloads, you might only listen to a single channel. For such single-channel workloads, the current implementation seems hard to tweak further, given the semantics and in-commit-order guarantees. However, for multi-channel workloads, we could do a lot better, which is what the linked patch is about.

The main problem with the current implementation for multi-channel workloads is that we currently signal and wake all listening backends (a backend is the PostgreSQL process your client is connected to), even if they are not interested in the specific channels being notified in the current commit. This means that if you have 100 connections open, in which each connected client has made a LISTEN on a different channel, then when someone does a NOTIFY on one of those channels, instead of just signaling the backend that listens on that channel, all 100 backends will be signaled. For multi-channel workloads, this can mean an enormous extra cost from the context switching caused by all that signaling.
I would greatly appreciate it if you could reply to this comment and share the workloads where you've had problems with LISTEN/NOTIFY, so we can better understand approximately how many listening backends you had, how many channels you had, and the mix of volume across those channels. Anything that helps us run more realistic simulations of such workloads will improve the benchmark tests we're working on. Thank you.
Here is the Commitfest entry if you want to help with reviewing/development/testing of the patch: https://commitfest.postgresql.org/patch/6078/
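For readers newer to LISTEN/NOTIFY, a minimal sketch of the multi-channel shape described above (channel and payload names are made up):

-- Each of many client connections subscribes to its own channel:
LISTEN jobs_tenant_42;

-- Some other connection notifies exactly one of those channels:
NOTIFY jobs_tenant_42, 'new job';
-- or, from SQL / inside a trigger:
SELECT pg_notify('jobs_tenant_42', 'new job');

-- With the current implementation, every listening backend is woken up by this
-- commit, even though only the one listening on jobs_tenant_42 has a matching
-- subscription; the linked patch aims to signal only that backend.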
I have listen/notify on most changes in my database. I'm not sure I've experienced any performance issues, though I can't say I've been putting things through their paces. IMHO LISTEN/NOTIFY's simplicity outweighed the perf gains from the WAL approach.

I'm only sharing this in case it's helpful:
def up do
  whitelist = Enum.join(@user_columns ++ ["tick"], "', '")

  execute """
  CREATE OR REPLACE FUNCTION notify_phrasing() RETURNS trigger AS $$
  DECLARE
    notif jsonb;
    col_name text;
    col_value text;
    uuids jsonb := '{}'::jsonb;
    user_columns text[] := ARRAY['#{whitelist}'];
  BEGIN
    -- First, add all UUID columns
    FOR col_name IN
      SELECT column_name
      FROM information_schema.columns
      WHERE table_name = TG_TABLE_NAME AND data_type = 'uuid'
    LOOP
      EXECUTE format('SELECT ($1).%I::text', col_name)
        INTO col_value
        USING CASE WHEN TG_OP = 'DELETE' THEN OLD ELSE NEW END;

      IF col_value IS NOT NULL THEN
        uuids := uuids || jsonb_build_object(col_name, col_value);
      END IF;
    END LOOP;

    -- Then, add user columns if they exist in the table
    FOREACH col_name IN ARRAY user_columns
    LOOP
      IF EXISTS (
        SELECT 1
        FROM information_schema.columns
        WHERE table_name = TG_TABLE_NAME AND column_name = col_name
      ) THEN
        EXECUTE format('SELECT ($1).%I::text', col_name)
          INTO col_value
          USING CASE WHEN TG_OP = 'DELETE' THEN OLD ELSE NEW END;

        IF col_value IS NOT NULL THEN
          uuids := uuids || jsonb_build_object(col_name, col_value);
        END IF;
      END IF;
    END LOOP;

    notif := jsonb_build_object(
      'table', TG_TABLE_NAME,
      'event', TG_OP,
      'uuids', uuids
    );

    PERFORM pg_notify('phrasing', notif::text);
    RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;
  """

  # Create trigger for each table
  Enum.each(@tables, fn table ->
    execute """
    CREATE TRIGGER notify_phrasing__#{table}
    AFTER INSERT OR UPDATE OR DELETE ON #{table}
    FOR EACH ROW EXECUTE FUNCTION notify_phrasing();
    """
  end)
end
I do react to most (70%) of my database changes in some way, shape, or form, and post them to a PubSub topic with the uuids. All of my dispatching can be done off of uuids.

We use it like this:
CREATE TRIGGER notify_events_trg
AFTER INSERT ON xxx.events
FOR EACH ROW EXECUTE PROCEDURE public.notify_events();

CREATE FUNCTION public.notify_events() RETURNS trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  PERFORM pg_notify('events', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$;
And then we have a bunch of triggers like this on many tables:

CREATE TRIGGER create_category_event_trg
AFTER INSERT OR DELETE OR UPDATE ON public.categories
FOR EACH ROW EXECUTE PROCEDURE public.create_category_event();
CREATE FUNCTION public.create_category_event() RETURNS trigger
  LANGUAGE plpgsql SECURITY DEFINER
AS $$
DECLARE
  category RECORD;
  payload JSONB;
BEGIN
  category := COALESCE(NEW, OLD);
  payload := jsonb_build_object('id', category.id);

  IF NEW IS NULL OR NEW.deleted_at IS NOT NULL THEN
    payload := jsonb_set(payload, '{deleted}', 'true');
  END IF;

  INSERT INTO xxx.events (channel, inserted_at, payload)
  VALUES ('category', NOW() AT TIME ZONE 'utc', payload);

  RETURN NULL;
END;
$$;
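Not part of the snippets above, but for context, a sketch of what the consuming side of this pattern can look like; because xxx.events is a durable table, anything missed while disconnected can be replayed (the watermark literal is a placeholder):

-- Wake up on new events:
LISTEN events;

-- On (re)connect, or whenever notified, fetch anything not yet processed from
-- the durable events table instead of trusting notification payloads alone:
SELECT channel, inserted_at, payload
FROM xxx.events
WHERE inserted_at > '2024-01-01 00:00:00'  -- placeholder for the consumer's own watermark
ORDER BY inserted_at;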
We found no notable performance issues. We have a single LISTEN in another application. We did some stress testing and found that it performs way better than we would ever need.

Thanks for the report. For that use case (a single application using a single connection with a LISTEN), it's expected that it should perform well, since there is only a single backend which will be context-switched to when each NOTIFY signals it.
Just out of curiosity, could you try to frame in what context this would or would not work? If you have multiple backends with multiple connections for instance? And then if we start with such a "simple" solution and we later need to scale with distributed backends, how should we do this?
In the linked "Optimize LISTEN/NOTIFY" pgsql-hackers thread, I've shared a lot of benchmark results for different workloads, which also include results for how PostgreSQL currently works (this is "master" in the benchmark results), which can help you better understand what to expect for different workloads.
The work-around solution we used at Trustly (a company I co-founded), is a component named `allas` that a colleague of mine at that time, Marko Tikkaja, created to solve our problems, that massively reduced the load on our servers. Marko has open sourced and published this work here: https://github.com/johto/allas
Basically, `allas` opens up a single connection to PostgreSQL, on which it LISTENs on all the channels it needs to listen on. Clients then connect to `allas` over the PostgreSQL protocol, so it's basically faking a PostgreSQL server, and when a client does LISTEN on a channel via allas, allas will then LISTEN on that channel on the real PostgreSQL server over the single connection it maintains. Thanks to `allas` being implemented in Go, using Go's efficient goroutines for concurrency, it scales efficiently to lots and lots of client connections. I'm not a Go expert myself, but I've understood Go is quite well suited for this type of application.
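To make that concrete, the client side is just ordinary Postgres protocol traffic (host, port, and channel name below are made up):

-- Connect any normal Postgres client to allas instead of the database, e.g.:
--   psql "host=allas.internal port=6433 dbname=app"
-- then subscribe as usual; allas multiplexes this onto its single upstream
-- connection to the real PostgreSQL server:
LISTEN invoice_paid;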
This component is still being used at Trustly, and is battle-tested and production grade.
That said, it would of course be much better to avoid the need for a separate component, and fix the scalability issues in core PostgreSQL, so that's what I'm currently working on.
For folks like myself, who don't know much about DB internals but know a bit about Kubernetes, this sounds very akin to the K8s API watch cache.
https://danielmangum.com/posts/k8s-asa-watching-and-caching/
The post seems to say that NOTIFY is generally not a good idea, but then the comments here say that NOTIFY can actually work, depending on some particular things (which are not easy to know for newcomers to Postgres). That makes it a bit complicated to know what the way to go is for a new database.
In my case I have an IoT setting, where my devices can change their "DesiredState", and I want to listen on this to push some message to MQTT... but then there might also be other cases where I want to listen to some messages elsewhere (eg do something when there is an alert on a device, or listen to some unrelated object, eg users, etc)
I'm not clear right now what would be the best setting to do this, the tradeoffs, etc
Imagine I have somewhere in the range of 100k to 10M devices, and sometimes they are updated in bulk, changing their DesiredState 10k at a time. Would NOTIFY work in that case? Should I use the WAL/Debezium/etc.?

Can you try to "dumb down" in which cases we can use NOTIFY/LISTEN and in which cases it's best not to? You're saying something about single-channel/multi-channel/etc., but as a newcomer I'm not clear on what all of these are.
We're relying on logical replication heavily for PowerSync, and I've found it is a great tool, but it is also very low-level and under-documented. This article gives a great overview - I wish I had this when we started with our implementation.
Some examples of difficulties we've run into:

1. LSNs for transactions (commits) are strictly increasing, but not for individual operations across transactions. You may not pick this up during basic testing, but it starts showing up when you have concurrent transactions.

2. You cannot resume logical replication in the middle of a transaction (you have to restart the transaction), which becomes relevant when you have large transactions.

3. In most cases, replication slots cannot be preserved when upgrading Postgres major versions.

4. When you have multiple Postgres clusters in an HA setup, you _can_ use logical replication, but it becomes more tricky (better in recent Postgres versions, but you're still responsible for making sure the slots are synced).

5. Replication slots can break in many different ways, and there's no good way to know all the possible failure modes until you've run into them. Especially fun when your server ran out of disk space at some point. It's a little better with Postgres 17+ exposing wal_status and invalidation_reason on pg_replication_slots (a minimal monitoring query follows after this list).

6. You need to make sure to acknowledge keepalive messages and not only data messages, otherwise the WAL can keep growing indefinitely when you don't have incoming changes (depending on the hosting provider).

7. Common drivers often either don't implement the replication protocol at all, or attempt to abstract away low-level details that you actually need. Here it's great that the article actually explains the low-level protocol details.
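Regarding point 5, a minimal monitoring query (invalidation_reason needs Postgres 17+, as noted above; wal_status exists on earlier versions too):

-- Check replication slot health; 'lost' or a non-null invalidation_reason
-- means the slot can no longer be used and the consumer must resync.
SELECT slot_name,
       active,
       wal_status,           -- reserved | extended | unreserved | lost
       invalidation_reason   -- Postgres 17+
FROM pg_replication_slots;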
Yeah I was debating heavily between WAL and L/N. Tried to get WAL set up, struggled; tried to learn more about WAL, failed; tried to persevere, shot myself in the foot.
At the end of the day the simplicity of L/N made it well worth the performance degradation. Still making thousands-to-millions of writes per second, so when the original article said they were 'exaggerating' I think they may have gone too far.
I've been hoping WAL gets some more documentation love in the years/decades L/N will serve me should I ever need to upgrade, so please share more! :D
The longer I work the fewer legitimate use cases for LISTEN/NOTIFY I seem to find. At this point, I only really see the value in using it to allow a polling loop to have a longer sleep and to use notify events as interrupts. Even that use case comes with its own set of problems that may make it unsuitable.
I was hoping from the title that this would be about creating an audio representation so you could listen to the database work as if it were an old PC that had distinctive sounds for different tasks
We were kids learning C++ at school and we giggled each time we printed the \a character (system bell). At home I piped /dev/mem to /dev/dsp to get the harshest white noise.
> If you call pg_notify or NOTIFY inside a trigger, it will get called 100,000 times and send out 100,000 notifications if you change 100,000 rows in a single transaction which from a performance perspective is ... not ideal.
This is only true if those notifications are different; if they are identical, such as when the notification just alerts listeners that some table has new data (for cache invalidation), they are sent out as one notification only. See the source code comment in async.c:
* Duplicate notifications from the same transaction are sent out as one
* notification only. This is done to save work when for example a trigger
* on a 2 million row table fires a notification for each row that has been
* changed. If the application needs to receive every single notification
* that has been sent, it can easily add some unique string into the extra
* payload parameter.

Recently released Clojure implementation of the same pattern: https://github.com/eerohele/muutos
I saw the post on the clojure subreddit! I’m stoked to try it out.
Ever since I saw Martin Kleppmann's "Turning the database inside out" talk I've wanted an easy way to hook into a transaction log. Apache Samza/Kafka is very cool but I'm not going to set it up for personal projects. It'd be VERY cool to make materialized views straight from the log!!
If you do end up trying it out and hit any roadblocks, please don't hesitate to file an issue. I'd be very interested in hearing how it goes!
> It’d be VERY cool to make materialized views straight from the log!!
This is what Materialize does. You point it at some PG (or MySQL, or... probably lots more by now) sources, and then you can define arbitrary views on top of those sources using SQL (with the full power of relational logic: fully precise joins, etc.) The views are maintained incrementally, so they update almost instantly when the underlying data updates; you don't have to manually refresh them.
Disclaimer: I worked on Materialize for five years, and it is a commercial, proprietary project. But it is source-available (https://github.com/materializeinc/materialize) under a BSL-style license, and it has fairly generous terms for free usage.
> It works, but suddenly your query times explode! Instead of doing 1 million transactions per second* you can now do only 3 (*These numbers were exaggerated for dramatic effect)
In general, a single-queue design doesn’t make throughput collapse when you add more parallelism; it just gives you a fixed ceiling. With a well-designed queue, throughput goes up with concurrency, then flattens when the serialized section (the queue) saturates, maybe sagging a bit from context switching.
If instead you see performance severely degrade as you add workers, that typically means there’s an additional problem beyond “we have one queue” — things like broadcast wakeups (“every event wakes every listener”), global scans on each event, or other O(N) work per operation. That’s a very different, and more serious, scalability bug than simply relying on a single queue.
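A rough way to see this (not from the article; a sketch using the Universal Scalability Law, where $\sigma$ is the serialized fraction of each operation and $\kappa$ a per-operation coherence/crosstalk cost):

$$ X(N) \;\approx\; \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)} $$

With $\kappa = 0$ (a plain, well-implemented single queue), throughput rises with the worker count $N$ and flattens near $1/\sigma$; it only bends back down, as reported for LISTEN/NOTIFY with many listeners, when $\kappa > 0$, i.e. when each operation does work proportional to $N$, such as waking every listening backend.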
I misinterpreted the title and was hoping that this was going to be a post about realtime algorithmic music generation from the Postgres WAL, something like the Hatnote “listen to Wikipedia edits” project.
It's worth mentioning Debezium: https://debezium.io/
It lets you publish all changes from the db to Kafka.
Perhaps the situation has gotten better since I looked a few years ago, but my experience is the Debezium project doesn’t really guarantee exactly-once delivery. Meaning that if row A is replaced by row B, you might see (A, -1), (A, -1), (B, +1), if for example Debezium was restarted at precisely the wrong time. Then if you’re using this stream to try to keep track of what’s in the database, you will think you have negatively many copies of A.
It sounds silly, but caused enormous headaches and problems for the project I was working on (Materialize), one of whose main use cases is creating incrementally maintained live materialized views on top of replicated Postgres (or MySql) data.
Debezium published this doc on exactly-once delivery with their most recent 3.3.0 version. They don't support it natively, but say it can be achieved via Kafka Connect:
https://debezium.io/documentation/reference/stable/configura...
You could probably achieve something similar with the NATS JetStream sink as well, which has similar capabilities, though I think it doesn't have quite the same guarantees.
I switched to using Debezium a few months ago, after a Golang alternative to Debezium + Kafka Connect, ConduitIO, went defunct. I should have been using Debezium all along, as it is clearly the most mature, most stable option in the space, with the best prospects for long-term survival. Highly recommended, even if it is JVM-based (though they're currently doing some interesting stuff with Quarkus and GraalVM that might lead to a JVM-free binary at some point).
Debezium generally produces each change event exactly once if there are no unclean connector shut-downs. If that's not the case, I'd consider this a bug which ought to be fixed.
(Disclaimer: I used to lead the Debezium project)
The problem is that unclean connector shutdowns are a thing that can happen in real life.
They can happen, yes, although this should be a rather rare event (the most common reason would be misconfiguration, such as a K8s pod with too low memory limits). That said, work towards exactly-once has been done [1], utilizing the support for EOS in Kafka Connect (KIP-618).
In particular for Postgres, consumers can detect and reject duplicates really easily though, by tracking a watermark for the {Commit LSN / Event LSN} tuple, which is monotonically increasing. So a consumer just needs to compare the value of that tuple from the incoming event to the highest value it has received before. If the incoming value is lower, the event must be a duplicate. We added support for exposing this via the `source.sequence` field a while back upon request by the Materialize team, btw.
[1] https://debezium.io/documentation/reference/stable/configura....
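A hypothetical sketch of that consumer-side dedup in plain SQL; the table, consumer name, and LSN literals are made up, the point is just to keep the highest (commit LSN, event LSN) pair seen so far:

CREATE TABLE IF NOT EXISTS cdc_watermark (
  consumer   text PRIMARY KEY,
  commit_lsn pg_lsn NOT NULL,
  event_lsn  pg_lsn NOT NULL
);

-- Advance the watermark only if the incoming event is newer; if no row is
-- updated, the event is at or below the watermark and can be discarded as a
-- duplicate.
UPDATE cdc_watermark
SET commit_lsn = '16/B374D848', event_lsn = '16/B374D890'
WHERE consumer = 'my-sink'
  AND (commit_lsn, event_lsn) < ('16/B374D848'::pg_lsn, '16/B374D890'::pg_lsn);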
> They can happen, yes, although this should be a rather rare event
For our use case, it didn't matter if it was rare or not: the fact that it could happen at all meant we needed to be robust to it, which basically meant storing the entire database in memory.
> We added support for exposing this via the `source.sequence` field a while back upon request by the Materialize team btw.
Yes, I helped work on this! I'm not sure whether Materialize is still using it (it's been years since I've thought about MZ/Debezium integration) but it was helpful, thanks.
Does it handle the things that the post mentions about the ever-growing WAL, and the fact that some listeners can go offline and need to get back old messages (eg if Kafka crashes?)
Robustness is a key design goal of Debezium. It supports heartbeating to address WAL growth issues (I wrote about that issue at [1]). If Kafka crashes (or Debezium itself), it will resume consuming the replication slot from where it left off (applying at-least-once semantics, i.e. there can be duplicates in case of an unclean shutdown).
Naturally, if the consumer is down, WAL retained for that replication slot continues to grow until it comes back up again, hence monitoring is key (or the slot gets invalidated at a certain threshold, it will restart with a new initial snapshot).
Disclaimer: I used to lead the Debezium project
[1] https://www.morling.dev/blog/mastering-postgres-replication-...
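For reference, a standard query (not Debezium-specific) to watch how much WAL each slot is retaining:

-- How far behind each slot is, i.e. how much WAL Postgres must keep around
-- for it; alert when this keeps growing.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;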
Worth mentioning that the WAL is what enables, and is used by, Supabase Realtime [0] and WalEx [1], which both do a very similar thing: polling the WAL changes in order to emit them as WebSocket events.
I never worked with WalEx but have experience with Supabase Realtime and it is a joy to work with and fits great into the Supa ecosystem. So many layers are disappearing when you rely more on Postgres!
[0] https://github.com/supabase/realtime [1] https://github.com/cpursley/walex
This is how Debezium works.
It is probably best to use that unless there is a strong reason against.
Concur on this. I always thought that Debezium only supported MySQL binlog since that is the world I operate in. I did not know that it was also an option for Postgres CDC via the WAL. Now that I know, I would recommend defaulting to it.
Even with Debezium I have run into myriad issues. It blows my mind that someone would want to roll their own logic to set up replicas. IMHO this post should be advertised more as a fun intellectual exercise.
I wrote a Haskell library for this as well: https://hackage.haskell.org/package/postgresql-replicant
I have had great success in the past building business event alert systems on top of Salesforce data, using NOTIFY on Heroku data warehouse replications of Salesforce. The potential for performance issues was always a nagging worry in the back of my head, and this article is a great example of why.
Related: Redis has keyspace notifications, which do something similar. They are not very well known, but when they are needed, they are really needed. We are thinking of extending this mechanism in new ways. Similarly, I have seen setups where the notifications arriving from Postgres/MySQL are used to materialize (and keep updated) a cached view in Redis. To me, it is interesting how certain teams rely on these kinds of mechanisms provided by database systems, while other teams like to do things on the client side to have full control, even in the face of having to reimplement some logic.
Thanks for the deep dive on Postgres WAL.
I recently had the opportunity to play with the PostgreSQL WAL in the scope of implementing an OpenSearch CDC pipeline, and it was really exciting to see what is possible to achieve with it.
Be cautious with idle replication slots though; I got bitten by inactive slots filling up the production database's storage.
PostgreSQL 18 introduces idle_replication_slot_timeout to mitigate this.
For those not on 18 yet, there's also max_slot_wal_keep_size, which invalidates slots based on the amount of WAL they retain rather than based on a period of inactivity. It's available since PG 13.
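A small sketch of those guardrails (values are illustrative; max_slot_wal_keep_size is available since PG 13, and idle_replication_slot_timeout, per the comment above, arrives in PG 18):

-- Cap how much WAL any slot may retain; slots that fall further behind are
-- invalidated instead of filling the disk.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
-- On PostgreSQL 18+, optionally also invalidate slots that stay idle too long:
-- ALTER SYSTEM SET idle_replication_slot_timeout = '1d';
SELECT pg_reload_conf();

-- And drop slots you know are abandoned:
SELECT pg_drop_replication_slot('old_unused_slot');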
I had to set one of these up somewhat under duress to replace a poorly written scheduled query to export data to BigQuery. It’s nice to know what it’s actually doing under the hood, thanks for the info.
I remember decades ago working with something similar in SQL Server called transaction log shipping. It worked on the same write-ahead principle, shipping the log to another server, and was an ETL method for incremental change capture into a data warehouse.
Great Article! This is also timely for me, I spent all last week deep in the postgres documentation learning about replication (wish I had this article).
I'm building Kafka Connect (i.e. Debezium) compatible replication:
https://github.com/turbolytics/librarian/pull/41
-----
I really like the Mongo change stream API. I think it elegantly hides a lot of the complexity of replication. As a side effect of building Postgres -> Kafka replication, I'm also trying to build a nicer replication primitive for Postgres, one that hides some of the underlying replication protocol complexity!
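As an aside, for anyone who wants to poke at logical decoding without implementing the streaming protocol, the SQL-level functions are a handy low-tech primitive (a sketch; the slot name is arbitrary and the test_decoding output plugin ships with Postgres):

-- Create a logical replication slot using the built-in test_decoding plugin:
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Each call decodes and consumes the changes accumulated since the last call:
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

-- Drop the slot when done, otherwise it retains WAL indefinitely:
SELECT pg_drop_replication_slot('demo_slot');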
Thanks for the WalEx mention! (https://github.com/cpursley/walex) Also, there's https://github.com/sequinstream/sequin if you want something more robust: it already works with connectors, smokes Debezium in terms of performance, and is simpler to operate.
The Ops person in me is crying.
Please stop doing this via database. Scaling issues can develop and we in Ops catch flak for your bad design choices. Use Kafka/RabbitMQ/Redis PLEASE.
Those can listen to database changes and aren't separate services to set up, keep online and pay for? Neat (and news to me)!
Except that if you disconnect for any reason, you will miss changes. If you want to break services out onto different servers, clients will still be crossing the streams to listen. Database schema changes must be coordinated properly. Basically, you are taking the one-database-to-rule-them-all model, which comes with massive drawbacks, and inserting additional capabilities into it.
Sure, if you are tiny, will forever remain tiny, then whatever, ignore this Ops person.
Please check out the linked libs, this is not true (and dig deeper into Erlang/Elixir). The schema change issue can be tricky, however.
This is cool, but also, why is pg_notify designed with such a contentious queue?
It's a common misconception that the single queue is a poor design choice. The user reports of seeing notifications/second severely degrade with lots of backends cannot be explained by the single-queue design alone. An efficient implementation of a single queue should flatten out as parallelism increases, not degrade and go towards zero.