Have your cake and decompress it too

spiraldb.com

12 points by emschwartz 3 days ago


gopalv - an hour ago

Mid 2015, I spent months optimizing Apache ORC's compression models over TPC-H.

It was easier to beat Parquet's defaults - ORC+zlib seemed to top out around the same as the default in this paper (~178GB for the 1TB dataset, from the Hadoop conference slides).

We got a lot of good results, but the hard lesson we learned was that scan rate matters more than size. A 16KB read and a 48KB read took about the same time; the CPU was busy in other parts of the SQL engine, so IO wasn't the bottleneck we thought it was.

And scan rate is not always "how fast can you decode" - a lot of it was encouraging data skipping (see the Capacitor paper from the same era).

For example, when organized correctly, the entire l_shipdate column took ~90 bytes for millions of rows.
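A hedged sketch of why that works (this is not ORC's actual encoder, just the idea): once rows are organized so equal dates sit next to each other, run-length encoding stores one (value, count) pair per run instead of one value per row, so the column's size tracks the number of distinct runs, not the number of rows.

```python
def run_length_encode(values):
    """Collapse consecutive equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A million rows covering only three ship dates, clustered together:
column = (["1995-01-01"] * 400_000
          + ["1995-01-02"] * 350_000
          + ["1995-01-03"] * 250_000)
print(run_length_encode(column))
# [['1995-01-01', 400000], ['1995-01-02', 350000], ['1995-01-03', 250000]]
```

The same million rows shuffled would produce hundreds of thousands of runs, which is why the row organization, not the codec, dominates the result.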

Similarly, the notes column was never read at all, so dictionaries etc. were useless.

Then I learned the ins & outs of another SQL engine, which kicked the ass of every other format I'd ever worked with, without too much magical tech.

Most of what I can repeat is that SQL engines don't care what order rows in a file are in & neither should the format writer - also that DBAs don't know which filters are the most useful to organize around & are often wrong.

Re-ordering at the row level beats any other trick in lossless columnar compression, because if you can skip a row entirely (with, say, FSST for LIKE or CONTAINS over index values[1] instead of raw bytes), that is a nearly infinite improvement in scan rate and IO.

[1] - https://github.com/amplab/succinct-cpp

pella - an hour ago

Thanks!

Looks similar to OpenZL ( https://openzl.org/ ): "OpenZL takes a description of your data and builds from it a specialized compressor optimized for your specific format."