Netflix Revamps Tudum's CQRS Architecture with Raw Hollow In-Memory Object Store

infoq.com

90 points by NomDePlum 5 days ago


hnthrow20938572 - 2 days ago

>However, due to the caching refresh cycle, CMS updates were taking many seconds to show up on the website. The issue made it problematic for content editors to preview their modifications and got progressively worse as the amount of content grew, resulting in delays lasting tens of seconds.

This seems superficial. Why not have a local-only CMS site to preview changes for the fast feedback loop, and then you only have to dump the text to prod?

>got progressively worse as the amount of content grew, resulting in delays lasting tens of seconds.

This is like the only legit concern to justify redoing this, but even then, it was still only taking seconds to a minute.

yodon - 2 days ago

The most incredible thing is that someone thought it was a good idea to write an engineering blog post about a team that so screwed up the spec for a simple static website that they needed 20 engineers and exotic tech to build it.

It can be ok to admit you did something stupid, but don't brag about it like it's an accomplishment.

stopthe - 2 days ago

Having dealt with similar architectures, I have a hypothesis on how this library (Hollow) emerged and evolved.

In the beginning there was a need for a low-latency Java in-process database (or near cache). Soon intolerable GC pauses pushed them off the Java heap. Next they saw memory consumption balloon: with the object graph gone, each object has to hold copies of referenced objects, all those Strings, Dates, etc. Then the authors came up with ordinals, which are... I mean, why not call them pointers? (https://hollow.how/advanced-topics/#in-memory-data-layout)

That is wild speculation ofc. And I don't mean to belittle the authors' effort: the really complex part is making the whole thing perform outside of simple use cases.
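
For illustration, here is a minimal Java sketch of the ordinal idea described above: records live in flat, deduplicated arrays and refer to each other by integer index instead of object reference. This is not Hollow's actual API or layout (see the linked docs for that); the class, fields, and growth strategy are made up.

    // Hypothetical sketch, not Hollow's real API: records referenced by ordinal
    // into flat, deduplicated arrays instead of by object pointers on the heap.
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class FlatMovieStore {
        // Deduplicated string pool: each unique value stored once, addressed by ordinal.
        private final List<String> stringPool = new ArrayList<>();
        private final Map<String, Integer> stringOrdinals = new HashMap<>();

        // A "record" is just a slot in parallel primitive arrays, so the GC sees a
        // few large arrays instead of millions of small objects.
        private int[] titleOrdinal = new int[0];
        private int[] releaseYear = new int[0];
        private int size = 0;

        int intern(String s) {
            return stringOrdinals.computeIfAbsent(s, k -> {
                stringPool.add(k);
                return stringPool.size() - 1;
            });
        }

        int addMovie(String title, int year) {
            if (size == titleOrdinal.length) {
                int cap = Math.max(16, size * 2);
                titleOrdinal = Arrays.copyOf(titleOrdinal, cap);
                releaseYear = Arrays.copyOf(releaseYear, cap);
            }
            titleOrdinal[size] = intern(title);
            releaseYear[size] = year;
            return size++; // the returned ordinal acts like a pointer to the record
        }

        String title(int movie) { return stringPool.get(titleOrdinal[movie]); }

        public static void main(String[] args) {
            FlatMovieStore store = new FlatMovieStore();
            int m = store.addMovie("Wednesday", 2022);
            System.out.println(store.title(m)); // prints "Wednesday"
        }
    }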

thinkindie - 2 days ago

Am I naive in thinking this infra is overblown for a read-only content website?

As much as this website could see a lot of traffic, I have the feeling they are overcomplicating their infra for little gain. Or at least, I wouldn't expect it to end up being the subject of an article.

ram_rar - a day ago

I’m a bit underwhelmed by the quality of articles coming out of Netflix. 100 million records per entity is nothing for Redis, even without the RAW Hollow-style compression techniques (bit-packing, dedup, dictionary encoding are pretty standard stuff) [1].

Framing this as a hard scaling problem (Tudum seems mostly static, please correct me if that's not the case) feels like pure resume-driven engineering. Makes me wonder: what stage was this system at that they felt the need to build this?

[1] https://hollow.how/raw-hollow-sigmod.pdf

thecupisblue - 2 days ago

Holy shit, the amount of overcomplication to serve simple HTML and CSS. Someone really has to justify their job security to be pulling shit like this, or they really gotta be bored.

If anyone can _legitimately_ justify this, please do, I'd love to hear it.

And don't go "booohooo at scale", because I work at scale and I'm 100% not sure what problem this is solving that can't just be solved with a simpler solution.

Also this isn't "Netflix scale", Tudum is way less popular.

mkl95 - 2 days ago

Isn't Tudum mostly a static site? It must be a great project to try out cool stuff on, with a near zero chance of that cool stuff making it to the main product and having a significant impact on customers. Most of the traffic probably comes from bots.

anonzzzies - 2 days ago

Handy to have [0] [1].

[0] https://github.com/Netflix/hollow

[1] https://hollow.how/raw-hollow-sigmod.pdf

spyrefused - 2 days ago

This brings back old memories from when they released the first version of Hollow (2016, I think) and I started writing a port in .NET because I thought it would be useful for some projects I was working on. I don't remember why I stopped; I guess I realized it was too much work, plus working at a low level with C# was quite a pain...

porridgeraisin - 2 days ago

Mom the astronauts are back at it again

yunohn - 2 days ago

I don’t normally comment on technical complexity at scale, but having used Tudum, it is truly mind-boggling why they need this level of complexity and architecture for what is essentially a WP blog equivalent.

immibis - 2 days ago

https://netflixtechblog.com/netflix-tudum-architecture-from-...

Concerning that their total uncompressed data size including full history is only 520MB and they built this complex distributed system rather than, say, rsyncing an sqlite database. It sounds like only Netflix staff write to the database.
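
A rough sketch of that alternative, assuming the whole dataset fits in a single SQLite file that each web node pulls via rsync. The sqlite-jdbc driver, the file path, and the table/column names here are assumptions, not anything from the article.

    // Rough sketch: each web node queries a locally rsynced, read-only SQLite file.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class LocalSnapshotReader {
        public static void main(String[] args) throws Exception {
            // elsewhere, a cron job runs: rsync origin:/data/tudum.db /data/tudum.db
            try (Connection db = DriverManager.getConnection("jdbc:sqlite:/data/tudum.db");
                 PreparedStatement q = db.prepareStatement(
                         "SELECT body FROM articles WHERE slug = ?")) {
                q.setString(1, "some-article-slug");
                try (ResultSet rs = q.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("body"));
                    }
                }
            }
        }
    }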

pyrolistical - 2 days ago

CQRS wasn’t the issue, and an in-memory object store is overkill.

The 60-second cache expiry was the issue. This is why the only sane cache policy is to cache forever. Use content hashes as cache keys.
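
A minimal sketch of that policy, assuming rendered pages are keyed by a SHA-256 of their content so entries never need a TTL; the class and method names are illustrative, not from the article.

    // Minimal sketch of cache-forever with content-hash keys: a rendered page is
    // stored under the SHA-256 of its content, so an entry is immutable and never
    // expires; publishing a new revision simply produces a new key.
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HexFormat;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ContentHashCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        static String contentKey(String content) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest); // Java 17+
        }

        // Publish a rendered page; the returned key is what gets embedded in URLs,
        // so downstream caches/CDNs can keep it with an effectively infinite TTL.
        String put(String renderedHtml) throws Exception {
            String key = contentKey(renderedHtml);
            cache.putIfAbsent(key, renderedHtml);
            return key;
        }

        String get(String key) {
            return cache.get(key);
        }

        public static void main(String[] args) throws Exception {
            ContentHashCache pages = new ContentHashCache();
            String key = pages.put("<html>hello</html>");
            System.out.println(key + " -> " + pages.get(key));
        }
    }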

conceptme - 2 days ago

If it fits in memory, why choose CQRS with Kafka in the first place?

supportengineer - 2 days ago

“Too dumb”? How did that get past everyone?

revskill - 2 days ago

Money can fix skill issue and buy happiness.

solid_fuel - a day ago

I find this really confusing, and feel like I'm missing some things. Overall the impression I get from this post is of an application optimized in very weird places, while being very unoptimized in others.

A few thoughts:

1) RAW Hollow sounds like it has very similar properties to Mnesia (the distributed database that comes with the Erlang Runtime System), although Mnesia is more focused on fast transactions while RAW Hollow seems more focused on read performance.

2) It seems some of this architecture was influenced by the presence of a 3rd-party CMS. I wonder how much impact this had on the overall design, and I would like to know more about the constraints it imposed.

    > The write part of the platform was built around the 3rd-party CMS product, and had a dedicated ingestion service for handling content update events, delivered via a webhook. 
    > ...
    >  For instance, when the team publishes an update, the following steps must occur:
    > 
    >  1) Call the REST endpoint on the 3rd party CMS to save the data.
    >  2) Wait for the CMS to notify the Tudum Ingestion layer via a webhook.

What? You call the CMS and then the CMS calls you back? Why? What is the actual function of this 3rd-party CMS? I had the impression it may be some kind of editor tool, but then why would Tudum be making calls out to it?

3) What is the actual daily active user count, and how much customization is possible for the users? Is it just basic theming and interests, or something more? When I look through the Tudum site, it seems like it is just connected to the user's Netflix account. I'm assuming the personalization is fairly simple, like theming, favorited shows, etc.

> Attracting over 20 million members each month, Tudum is designed to enrich the viewing experience by offering additional context and insights into the content available on Netflix.

It's unclear to me if this is users signing up for Tudum, the number of unique monthly visitors, the number of page views, or something else. I'm assuming that it is monthly active users, and that those users generally already have Netflix accounts.

4) An event-driven architecture feels odd for this sort of access and use pattern. I don't understand what prevents using a single driving database, like Postgres, in a more traditional pattern. By my count, the data from the CMS is duplicated in the Hollow datastore and, implicitly, in the generated pages. Of course, when you duplicate data you create synchronization problems and latency; that is the nature of computing. I have always preferred instead to stick with just a few active copies of relevant data when practical.

> Storing three years’ of unhydrated data requires only a 130MB memory footprint — 25% of its uncompressed size in an Iceberg table!

Compressed, uncompressed, this is a comically small amount of data. High-end Ryzen processors almost have this much L3 CACHE!!

As near as I can tell, writes only flow one way in this system, so I don't even know if RAW Hollow needs strong read-after-write consistency. It seems like writes flow from the CMS, into RAW Hollow, and then onto the Page Builder nodes. So how does this provide anything that a Postgres read replica wouldn't?
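
For comparison, a sketch of the primary/replica pattern this question implies: the CMS webhook handler writes once to a Postgres primary, page builders read from a replica, and streaming replication takes the place of the Kafka-to-Hollow propagation. Hostnames, credentials, and schema are invented for illustration, and the standard PostgreSQL JDBC driver is assumed.

    // Sketch of the traditional primary/replica pattern; assumes a unique
    // constraint on articles.slug and the PostgreSQL JDBC driver on the classpath.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PrimaryReplicaSketch {
        public static void main(String[] args) throws Exception {
            // Ingestion path: persist the CMS webhook payload once, on the primary.
            try (Connection primary = DriverManager.getConnection(
                    "jdbc:postgresql://primary.internal:5432/tudum", "app", "secret");
                 PreparedStatement upsert = primary.prepareStatement(
                     "INSERT INTO articles (slug, body) VALUES (?, ?) " +
                     "ON CONFLICT (slug) DO UPDATE SET body = EXCLUDED.body")) {
                upsert.setString(1, "some-article-slug");
                upsert.setString(2, "<p>content from the CMS</p>");
                upsert.executeUpdate();
            }

            // Page-builder path: read from a replica; replication lag is the only
            // eventual consistency left in the system.
            try (Connection replica = DriverManager.getConnection(
                    "jdbc:postgresql://replica.internal:5432/tudum", "app", "secret");
                 PreparedStatement select = replica.prepareStatement(
                     "SELECT body FROM articles WHERE slug = ?")) {
                select.setString(1, "some-article-slug");
                try (ResultSet rs = select.executeQuery()) {
                    if (rs.next()) System.out.println(rs.getString("body"));
                }
            }
        }
    }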

5) Finally, the most confusing part: are they pre-building every page for every user? That seems ridiculous, but it is difficult to square some of the requirements without such a thing. If you can render a page in 400 milliseconds then congratulations, you are in the realm of most good SSR applications, and there is no need to pre-build anything. Why not just render pages on demand and save a ton of computation?

Overall this is a perplexing post. I don't understand why a lot of these decisions were made, and the solution seems very over-complicated for the problem as described.