A distributed queue in a single JSON file on object storage

turbopuffer.com

100 points by Sirupsen 3 days ago


staticassertion - 2 hours ago

Yeah, I mean, I think we're all basically doing this now, right? I wouldn't choose this design, but I think something similar to DeltaLake can be simplified down for tons of use cases. Manifest with CAS + buffered objects to S3, maybe compaction if you intend to do lots of reads. It's not hard to put it together.

You can achieve stupidly fast read/write operations if you do this right, with a system that is shockingly simple to reason about.
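For anyone who hasn't built one, here is a minimal sketch of the "manifest with CAS + buffered objects" pattern. It runs against an in-memory stand-in rather than S3. `InMemoryStore`, `put_if_version`, `commit_batch`, and the `manifest.json` layout are all hypothetical names for illustration, not any real SDK's API or the article's implementation.

```python
import json
import uuid


class InMemoryStore:
    """Stand-in for an object store with conditional writes (hypothetical API)."""

    def __init__(self):
        self._objects = {}  # key -> (bytes, version)

    def get(self, key):
        # Returns (data, version); a missing key reads as (None, 0).
        return self._objects.get(key, (None, 0))

    def put(self, key, data):
        # Unconditional write, used for immutable data objects.
        _, current = self.get(key)
        self._objects[key] = (data, current + 1)

    def put_if_version(self, key, data, expected_version):
        # Compare-and-swap: succeed only if nobody wrote since we read.
        _, current = self.get(key)
        if current != expected_version:
            return False
        self._objects[key] = (data, current + 1)
        return True


def commit_batch(store, records, manifest_key="manifest.json", max_retries=10):
    # 1. Write the buffered records as a new, uniquely named data object.
    data_key = f"data/{uuid.uuid4().hex}.json"
    store.put(data_key, json.dumps(records).encode())
    # 2. Publish it by CAS-ing its key into the manifest; re-read and retry on conflict.
    for _ in range(max_retries):
        raw, version = store.get(manifest_key)
        manifest = json.loads(raw) if raw else {"objects": []}
        manifest["objects"].append(data_key)
        if store.put_if_version(manifest_key, json.dumps(manifest).encode(), version):
            return data_key
    raise RuntimeError("manifest CAS kept failing")


def read_all(store, manifest_key="manifest.json"):
    # Readers only see data objects that made it into the manifest.
    raw, _ = store.get(manifest_key)
    if raw is None:
        return []
    out = []
    for key in json.loads(raw)["objects"]:
        data, _ = store.get(key)
        out.extend(json.loads(data))
    return out
```

A data object that never makes it into the manifest is simply invisible to readers, which is what keeps the reasoning simple. Compaction would rewrite several data objects into one and swap the manifest the same way.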

> Step 4: queue.json with an HA brokered group commit

> The broker is stateless, so it's easy and inexpensive to move. And if we end up with more than one broker at a time? That's fine: CAS ensures correctness even with two brokers.

TBH this is the part that I think is tricky. The hard bit is resolving this in a way that doesn't end up with tons of clients wasting time talking to a broker that buffers their writes, pushes them, and then always fails. I solved this at one point with token fencing, then decided it wasn't worth it, and now I just use a single instance to manage all writes. I'd again point to DeltaLake for the "good" design here, which is to have multiple manifests and only serialize compaction, which also unlocks parallel writers.
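For readers unfamiliar with token fencing, a rough sketch of the general idea follows. It is a generic illustration, not the setup described above. `FencedManifest`, `acquire`, and `commit` are made-up names, and in a real system the epoch would live in the manifest object and be checked as part of the conditional write.

```python
class StaleBrokerError(Exception):
    """Raised when a broker commits with a token older than the newest one issued."""


class FencedManifest:
    def __init__(self):
        self.epoch = 0      # highest fencing token handed out so far
        self.entries = []   # committed items

    def acquire(self):
        # A new broker taking over gets a strictly larger token.
        self.epoch += 1
        return self.epoch

    def commit(self, token, batch):
        # Reject writes from any broker that has since been superseded.
        if token < self.epoch:
            raise StaleBrokerError(f"token {token} is behind epoch {self.epoch}")
        self.entries.extend(batch)


manifest = FencedManifest()
old_broker = manifest.acquire()   # first broker
new_broker = manifest.acquire()   # second broker starts; the first is now stale
manifest.commit(new_broker, ["job-1"])
try:
    manifest.commit(old_broker, ["job-2"])
    old_broker_rejected = False
except StaleBrokerError:
    old_broker_rejected = True
```

The point is that the stale broker learns it is stale on its first failed commit and can stop accepting client writes, instead of buffering and failing forever.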

The other hard part is data deletion. For the queue it looks deadly simple since it's one file, but if you want to ramp up your scale and get multiple writers or manage indexes (also in S3) then deletion becomes something you have to slip into compaction. Again, I had it at one point and backed it out because it was painful.

But I have 40k writes per second working just fine for my setup, so I'm not worrying. I'd suggest others basically punt as hard as possible on this. If you need more writes, start up a separate index with its own partition for its own separate set of data, or do naive sharding.
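A sketch of what "naive sharding" could look like: hash a partition key to pick one of N independent queue objects, each with its own single writer. The key layout (`queues/shard-NN/queue.json`) and the function names are assumptions for illustration only.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def queue_object_for(key: str, num_shards: int = 4) -> str:
    # Each shard is a fully separate queue file with its own writer.
    return f"queues/shard-{shard_for(key, num_shards):02d}/queue.json"
```

Changing `num_shards` later remaps most keys, which is part of why this counts as the naive version.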

talentedtumor - 4 minutes ago

Does this suffer from the ABA problem, or does object storage solve that for you by e.g. refusing to accept writes where content has changed between the read and write?
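To make the question concrete, here is a toy model of a conditional write. It assumes the store hands back a version token that changes on every successful write; whether a given store's token behaves like that is something to check in its docs. `VersionedObject` and `write_if_match` are hypothetical names.

```python
class VersionedObject:
    """One object whose version token changes on every successful write."""

    def __init__(self, content):
        self.content = content
        self.version = 1

    def read(self):
        return self.content, self.version

    def write_if_match(self, new_content, expected_version):
        # Conditional write: rejected unless the caller saw the latest version.
        if expected_version != self.version:
            return False
        self.content = new_content
        self.version += 1
        return True


obj = VersionedObject("A")
_, seen = obj.read()                      # slow writer reads "A" at version 1
obj.write_if_match("B", obj.version)      # someone else writes A -> B
obj.write_if_match("A", obj.version)      # ... and then B -> A again
stale_ok = obj.write_if_match("C", seen)  # slow writer's write still carries version 1
```

Under that assumption the slow writer is rejected even though the content is back to "A", because the comparison is on the version token and not on the bytes.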

pjc50 - 5 hours ago

Several things going on here:

- concurrency is very hard

- .. but object storage "solves" most of that for you, handing you a set of semantics which work reliably

- single file throughput sucks hilariously badly

- .. because 1Gb is ridiculously large for an atomic unit

- (this whole thing resembles a project I did a decade ago for transactional consistency on TFAT on Flash, except that one somehow managed faster commit times despite running on a 400MHz MIPS CPU. Edit: maybe I should try to remember how that worked and write it up for HN)

- therefore, all of the actual work is shifted to the broker. The broker is just periodically committing its state in case it crashes

- it's not clear whether the broker ACKs requests before they're in durable storage? Is it possible to lose requests in flight anyway?

- there's a great design for a message queue system between multiple nodes that aims for at least once delivery, and has existed for decades, while maintaining high throughput: SMTP. Actually, there's a whole bunch of message queue systems?
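On the ACK question in the list above, here is a sketch of one way a group-commit broker could avoid losing in-flight requests: buffer submissions, write the whole batch once, and only resolve each caller's future after that write returns. This is not a claim about how the article's broker behaves. `GroupCommitBroker` and `durable_write` are hypothetical.

```python
from concurrent.futures import Future


class GroupCommitBroker:
    def __init__(self, durable_write):
        # durable_write(list_of_items) must raise if the batch was not persisted.
        self._durable_write = durable_write
        self._pending = []  # list of (item, Future)

    def submit(self, item):
        # The caller gets a future that stays unresolved until the batch is durable.
        fut = Future()
        self._pending.append((item, fut))
        return fut

    def flush(self):
        batch, self._pending = self._pending, []
        if not batch:
            return
        try:
            # One write for the whole batch (the "group commit").
            self._durable_write([item for item, _ in batch])
        except Exception as exc:
            # Nothing was ACKed, so callers see the failure and can retry.
            for _, fut in batch:
                fut.set_exception(exc)
            return
        for _, fut in batch:
            fut.set_result(True)
```

If the broker crashes between `submit` and `flush`, the client never got an ACK and should resend, which gives at-least-once behaviour rather than silent loss.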

salil999 - 2 hours ago

Reminds me of WarpStream: https://www.warpstream.com

Similar idea but you have the power of S3 scale (if you really need it). For context, I do not work at WS. My company switched to it recently and we've seen great improvements over traditional Kafka.

loevborg - 2 hours ago

Love this writeup. There's so much interesting stuff you can build on top of Object Storage + compare-and-swap. You learn a lot about distributed systems this way.

I'd love to see a full sample implementation based on S3 + ECS, just to study how it works.

Normal_gaussian - 5 hours ago

The original graph appears to simply show the blocking issue of their previous synchronisation mechanism; 10 min to process an item down to 6 min. Any central system would seem to resolve this for them.

In any organisation it's good to make choices for simplicity rather than small optimisations - you're optimising maintenance, incident resolution, and development.

Typically I have a small pg server for these things. It'll work out slightly more expensive than this setup for one action, yet it will cope with so much more - extending to all kinds of other queues and config management - with simple management, off the shelf diagnostics etc.

While the object store is neat, there is a confluence of factors which make it great and simple for this workload, that may not extend to others. 200ms latency is a lot for other workloads, 5GB/s doesn't leave a lot of headroom, etc. And I don't want to be asked to diagnose transient issues with this.

So I'm torn. It's simple to deploy and configure from a fresh deployment PoV. Yet it wouldn't be accepted into any deployment I have worked on.

soletta - 6 hours ago

The usual path an engineer takes is to take a complex and slow system and reengineer it into something simple, fast, and wrong. But as far as I can tell from the description in the blog, it actually works at scale! This feels like a free lunch and I’m wondering what the tradeoff is.

dewey - 5 hours ago

Depending on who hosts your object storage this seems like it could get much more expensive than using a queue table in your database? But I'm also aware that this is a blog post of an object storage company.

jamescun - 6 hours ago

This post touches on a realisation I made a while ago, just how far you can get with the guarantees and trade-offs of object storage.

What actually _needs_ to be in the database? I've never gone as far as building a job queue on top of object storage, but have been involved in building surprisingly consistent and reliable systems with object storage.

motoboi - 2 hours ago

By typography alone I can now tell turbopuffer is written in Zig.

isoprophlex - 5 hours ago

Is this reinventing a few redis features with an object storage for persistence?

jstrong - 4 hours ago

that's A choice.