We made Postgres writes faster, but it broke replication
paradedb.com
241 points by philippemnoel 2 days ago
Love this style of no-fluff technical deep dive. HN needs more content like this.
Agree with others saying HN needs more content like this!
After reading, I don’t get how locks held in memory affect WAL shipping. The WAL reader reads in a single thread and updates in-memory data structures, periodically dumping them to disk. Perhaps you want to read one big instruction from the WAL and apply it to many buffers using multiple threads?
>Adapting algorithms to work atomically at the block level is table stakes for physical replication
Why? To me, the only thing you have to do atomically is the WAL write. WAL readers can read and write however they want, as long as they can detect partial writes and replay the WAL.
>If a VACUUM is running on the primary at the same time that a query hits a read replica, it's possible for Postgres to abort the read.
The situation you're referring to is:
1. Record inserted.
2. Long query started on the standby.
3. Record removed.
4. VACUUM started on the primary.
5. VACUUM replicated.
6. VACUUM on the standby cannot remove the record because it is being read by the long query.
7. Postgres cancels the query to let the VACUUM proceed.
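For reference, stock Postgres exposes knobs for exactly this conflict. A hedged sketch (these are standard standby-side GUCs, not anything ParadeDB-specific; values are illustrative):

```sql
-- On the standby: hot_standby_feedback tells the primary not to vacuum
-- away rows that standby queries still need, at the cost of table bloat
-- on the primary while long standby queries run.
ALTER SYSTEM SET hot_standby_feedback = on;

-- Alternatively, let WAL replay wait longer before canceling conflicting
-- standby queries, at the cost of increased replication lag.
ALTER SYSTEM SET max_standby_streaming_delay = '5min';

-- Both settings take effect on a configuration reload.
SELECT pg_reload_conf();
```

Neither option removes the trade-off; they just let you choose between primary-side bloat and standby-side lag or query cancellation.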
I guess your implementation generates a lot of dead tuples during compaction. You're clearly fighting PG here. Could a custom storage engine be a better option?
Thanks for the questions!
> After reading I don’t get how locks held in memory affect WAL shipping. WAL reader reads it in a single thread, updates in-memory data structures periodically dumping them on disk. Perhaps you want to read one big instruction from WAL and apply it to many buffers using multiple threads?
We currently use unmodified/generic WAL entries and don't implement our own replay. That means we don't control the order of locks acquired and released during replay: the default is to acquire exactly one lock to update a buffer.

But as far as I know, even with a custom WAL entry implementation, the maximum in one entry would still be ~8k, which might not be sufficient for a multi-block atomic operation. So the data structure needs to support block-at-a-time atomic updates.
> I guess your implementation generates a lot of dead tuples during compaction. You're clearly fighting PG here. Could a custom storage engine be a better option?
`pg_search`'s LSM tree is effectively a custom storage engine, but it is an index (Index Access Method and Custom Scan) rather than a table. See more on it here: https://www.paradedb.com/blog/block_storage_part_one

LSM compaction does not generate any dead tuples on its own; what is dead is controlled by what is "dead" in the heap/table due to deletes/updates. Instead, the LSM cycles blocks into and out of a custom free space map (which we implemented to reduce WAL traffic).
> To be an effective alternative to Elasticsearch we needed to support high ingest workloads in real time.
Why not just use OpenSearch or Elasticsearch? The tool is already in the inventory; why use a screwdriver when a chisel is needed and available?
This is another one of those “when you have a hammer, everything looks like your thumb” stories.
Because you don’t need to sync and you have ACID with joins.
Is there a whole business to be had with those advantages alone? I’m curious as to who the target market is.
At my last big co, we had a team of 10 whose entire job was to sync data from Postgres into Elastic. It would take weeks and fall over regularly due to traffic.
If we could have a DB that could do search and be a store of record, it would be amazing.
They're different access patterns, though. Are there no concerns about performance and potentially blocking behavior? Decoupling OLTP and analytics is frequently done with good reason: (1) to allow the systems to scale independently, and (2) to help prevent issues with one component from impacting the other (i.e., to contain the blast radius). I wouldn't want a failure of my search engine to also take down my transaction system.
You don't need to. Customers usually deploy us on standalone replica(s) in their Postgres cluster. If a query were to take one down, it would only take down the replica(s) dedicated to ParadeDB, leaving the primary and all other read replicas dedicated to OLTP safe.
Are you saying that the cluster isn't homogeneous? It sounds like you're describing an architecture that involves a cluster that has two entirely different pieces of software on it, and whose roles aren't interchangeable.
Bear with me, this will be a bit of a longer answer. Today, there are two topologies under which people deploy ParadeDB.
- <some managed Postgres service> + ParadeDB. Frequently, customers already use a managed Postgres (e.g. AWS RDS) and want ParadeDB. In that world, they maintain their managed Postgres service and deploy a Kubernetes cluster running ParadeDB on the side, with one primary instance and some number of replicas. The AWS RDS primary sends data to the ParadeDB primary via logical replication. You can see a diagram here: https://docs.paradedb.com/deploy/byoc
In this topology, the OLTP and search/OLAP workloads are fully isolated from each other. You have two clusters, but you don't need a third-party ETL service since they're both "just Postgres".
- <self-hosted Postgres> + ParadeDB. Some customers, typically larger ones, prefer to self-host Postgres and want to install our Postgres extension directly. The extension is installed in their primary Postgres, and the CREATE INDEX commands must be issued on the primary; however, they may route reads only to a subset of the read replicas in their cluster.
In this topology, all writes could be directed to the primary, all OLTP read queries could be routed to a pool of read replicas, and all search/OLAP queries could be directed to another subset of replicas.
Both are completely reasonable approaches and depend on the workload. Hope this helps :)
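The logical-replication hookup in the first topology can be sketched with standard Postgres commands. A hedged sketch only: the publication/subscription names and connection string are placeholders, not ParadeDB defaults, and the publisher must have `wal_level = logical` (on RDS, via the `rds.logical_replication` parameter):

```sql
-- On the managed Postgres primary (e.g. AWS RDS), publish the tables
-- that should flow to ParadeDB:
CREATE PUBLICATION search_pub FOR ALL TABLES;

-- On the ParadeDB primary, subscribe to that publication; changes are
-- then streamed continuously without a third-party ETL service:
CREATE SUBSCRIPTION search_sub
    CONNECTION 'host=rds-primary.example.com dbname=app user=replicator'
    PUBLICATION search_pub;
```

From there, ParadeDB's own streaming replicas fan the indexed data out to however many read replicas the search workload needs.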
Which of these two is the higher order bit?
* ParadeDB speaks postgres protocol
* These setups don't have a complex ETL pipeline
If you have an ETL pipeline specialized for PG logical replication (as opposed to generic JVM-based Debezium/Kafka setups), you get some fraction of the same benefits. I'm curious about Conduit and its Postgres plugin.
That leaves: ParadeDB uses vanilla Postgres + a Rust extension. This is a technology detail. I was looking for an articulation of the customer benefit of this technologically appealing architecture.