Nobody ever got fired for using a struct

feldera.com

123 points by gz09 4 days ago


amluto - 10 hours ago

There are many systems that take a native data structure in your favorite language and, using some sort of reflection, make an on-disk structure that resembles it. Python pickles and Java’s serialization system are infamous examples, and rkyv is a less alarming one.

I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well.

Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this.

(OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)

duc_minh - 10 hours ago

> Sometimes the best optimization is not a clever algorithm. Sometimes it is just changing the shape of the data.

This is basically Rob Pike's Rule 5: If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. (https://users.ece.utexas.edu/~adnan/pike.html)

SoftTalker - 12 hours ago

> But SQL schemas often look like this. Columns are nullable by default, and wide tables are common.

Hard disagree. That database table was a waving red flag. I don't know enough (or any) Rust, so I don't really understand the rest of the article, but I have never in my life worked with a database table that had 700 columns. Or even 100.

lsuresh - 3 hours ago

Feldera co-founder here. Great discussions here.

Some folks pointed out that no one should design a SQL schema like this, and I agree. We deal with large enterprise customers and don't control the schemas that come our way. Trust me, we often ask customers if they have any leeway with changing their SQL, and their hands are often tied. We're a query engine that has to ingest data from existing data sources (warehouse, lakehouse, Kafka, etc.), so we have to be able to work with existing schemas.

What follows from that is a big part of the value we add: take your hideous SQL schema and queries, warts and all, run them on Feldera, and you'll get fully incremental execution at low latency and low cost.

700 isn't even the worst number that's come our way. A hyperscale prospect asked about supporting 4000 column schemas. I don't know what's in that table either. :)

astrostl - 11 hours ago

I have mixed feelings about it, but I'm going to fire somebody tomorrow for using a struct just to prove a point to the author.

jamesblonde - 7 hours ago

Here is an article I wrote this week with a section on Feldera and how it uses its incremental compute engine to compute "rolling aggregates" (the most important real-time feature for detecting changes in user behavior/pricing/anomalies).

https://www.hopsworks.ai/post/rolling-aggregations-for-real-...

logdahl - 8 hours ago

Strictly speaking, isn't there still a way to express at least one illegal string in ArchivedString? I'm not sure how to hint to the Rust compiler which values are illegal, but if the inline length (at most 15 characters) is aliased to the pointer string length (assume little-endian), wouldn't {ptr: null, len: 16} and {inline_data: {0...}, len: 16} both technically be illegal values?

I'm not saying this is better than your solution, just curious :^)

saghm - 10 hours ago

I feel like I'm missing something, but the article started by talking about SQL tables, and then in-memory representations, and then on-disk representation, but...isn't storing it on a disk already what a SQL database is doing? It sounds like data is being read from a disk into memory in one format and then written back to a disk (maybe a different one?) in another format, and the second format was not as efficient as the first. I'm not sure I understand why a third format was even introduced in the first place.

jim33442 - 9 hours ago

I did read the rest, but I'm stuck on the first part where their SQL table has almost a thousand cols. Why so many?

arcrwlock - 11 hours ago

Why not use a struct of arrays?

https://en.wikipedia.org/wiki/Data-oriented_design
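
For anyone unfamiliar with the term, here is a minimal sketch of the difference between an array of structs and a struct of arrays. The `Row` and `Columns` types and their fields are made up for illustration; they are not Feldera's or rkyv's types, and a real engine would handle many more columns and types.

```rust
// Array-of-structs: every row carries every field, including the empty ones.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    id: u64,
    price: Option<f64>,
    qty: Option<u32>,
}

// Struct-of-arrays: one vector per column; row i lives at index i in each vector.
#[derive(Default)]
struct Columns {
    id: Vec<u64>,
    price: Vec<Option<f64>>,
    qty: Vec<Option<u32>>,
}

impl Columns {
    // Transpose a slice of rows into per-column vectors.
    fn from_rows(rows: &[Row]) -> Self {
        let mut cols = Columns::default();
        for r in rows {
            cols.id.push(r.id);
            cols.price.push(r.price);
            cols.qty.push(r.qty);
        }
        cols
    }

    // Summing one column only walks that column's memory.
    fn total_qty(&self) -> u64 {
        self.qty.iter().flatten().map(|&q| u64::from(q)).sum()
    }
}

// Hypothetical sample data for the demo.
fn sample_rows() -> Vec<Row> {
    vec![
        Row { id: 1, price: Some(9.5), qty: Some(3) },
        Row { id: 2, price: None, qty: None },
        Row { id: 3, price: Some(1.25), qty: Some(4) },
    ]
}

fn main() {
    let cols = Columns::from_rows(&sample_rows());
    println!(
        "rows = {}, prices stored = {}, total qty = {}",
        cols.id.len(),
        cols.price.len(),
        cols.total_qty()
    );
}
```

Note that this sketch still stores an `Option` per cell, so it changes the access pattern but does not by itself make sparse rows smaller.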

kleiba - 7 hours ago

> This struct we saw earlier had 700+ of optional fields. In Rust you would never design a struct like this. You would pick a different layout long before reaching 700 Options. But SQL schemas often look like this.

Really? I've never had to do any serious db work in my career, but this is a surprise to me.

SigmundA - 11 hours ago

Looks like they just recreated a tuple layout in Rust, null bitmap and everything. Next up would be storing them in pages and mmapping the pages.

https://www.postgresql.org/docs/current/storage-page-layout....
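
A rough sketch of the general idea of a row with a null bitmap, for readers who have not seen one. Everything here is an assumption for illustration: `SparseRow` is a made-up type, all columns are `i64`, and a set bit means NULL. This is not the actual PostgreSQL or Feldera layout, and real systems pick their own bit convention.

```rust
// A sparse row: a null bitmap plus only the values that are present.
struct SparseRow {
    null_bits: Vec<u8>, // bit i set => column i is NULL (convention chosen for this sketch)
    values: Vec<i64>,   // non-null values, in column order
}

impl SparseRow {
    // Build the compact form from a fully materialized row of Options.
    fn from_options(cols: &[Option<i64>]) -> Self {
        let mut null_bits = vec![0u8; (cols.len() + 7) / 8];
        let mut values = Vec::new();
        for (i, c) in cols.iter().enumerate() {
            match c {
                Some(v) => values.push(*v),
                None => null_bits[i / 8] |= 1u8 << (i % 8),
            }
        }
        SparseRow { null_bits, values }
    }

    // `col` must be smaller than the number of columns the row was built with.
    fn is_null(&self, col: usize) -> bool {
        (self.null_bits[col / 8] & (1u8 << (col % 8))) != 0
    }

    fn get(&self, col: usize) -> Option<i64> {
        if self.is_null(col) {
            return None;
        }
        // The value's index is the number of non-null columns before `col`.
        let present_before = (0..col).filter(|&i| !self.is_null(i)).count();
        Some(self.values[present_before])
    }
}

fn main() {
    let row = SparseRow::from_options(&[Some(10), None, None, Some(40)]);
    println!("col 0 = {:?}, col 1 = {:?}, col 3 = {:?}", row.get(0), row.get(1), row.get(3));
    println!("stored {} values and {} bitmap byte(s)", row.values.len(), row.null_bits.len());
}
```

The linear scan in `get` keeps the sketch short; a real implementation would more likely count bits a word at a time or keep offsets.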

everyone - 10 hours ago

Just cus structs and classes work differently, and classes are much more common. I tend to make everything a class, unless there is a really good reason to make it a struct.

Ciantic - 7 hours ago

If I understand correctly, the problem was in rkyv, and the solution is to use rkyv with glue code. I hope they can integrate some sort of official derive macro like `rkyv::Sparse` for this if it can't be done automatically in rkyv.

dyauspitr - 11 hours ago

No one has written a struct in 10 years.

porise - 8 hours ago

Why is Rust allowed to reorder fields? If I know that fields are going to be generally accessed together, this prevents me from ordering them so they fit in cache lines.
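
For context, reordering only applies to the default representation; `#[repr(C)]` keeps fields in declaration order. A small sketch, with made-up struct names, assuming a typical target where `u32` has 4-byte alignment (the size of the default-layout struct is left unspecified by the language, so it is only printed here):

```rust
use std::mem::size_of;

// Default (Rust) representation: the compiler may reorder fields to cut padding.
#[allow(dead_code)]
struct Reorderable {
    a: u8,
    b: u32,
    c: u8,
}

// C representation: fields are laid out in declaration order, with C-style padding.
#[repr(C)]
#[derive(Default)]
struct DeclOrder {
    a: u8,
    b: u32,
    c: u8,
}

// Byte offsets of each field from the start of a DeclOrder value.
fn decl_order_offsets() -> (usize, usize, usize) {
    let v = DeclOrder::default();
    let base = &v as *const DeclOrder as usize;
    (
        &v.a as *const u8 as usize - base,
        &v.b as *const u32 as usize - base,
        &v.c as *const u8 as usize - base,
    )
}

fn main() {
    println!("default repr: {} bytes", size_of::<Reorderable>());
    println!(
        "repr(C): {} bytes, field offsets {:?}",
        size_of::<DeclOrder>(),
        decl_order_offsets()
    );
}
```

So if a specific field order matters for cache-line placement, opting in with `#[repr(C)]` (and checking alignment) is the usual way to get it.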