Text classification with Python 3.14's ZSTD module

maxhalford.github.io

190 points by alexmolas 3 days ago


ks2048 - 9 hours ago

This looks like a nice rundown of how to do this with Python's zstd module.

But, I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are very closely related, but practical compressors and practical classifiers have different goals and different practical constraints).

Back in 2023, I wrote two blog posts [0,1] that refuted the results in the 2023 paper referenced here (bad implementation and bad data).

[0] https://kenschutte.com/gzip-knn-paper/

[1] https://kenschutte.com/gzip-knn-paper2/

staplung - 3 hours ago

This has been possible with the zlib module since 1997 [EDIT: zlib is from '97. The zdict param wasn't added until 2012]. You even get byte counts similar to the example's, and on my machine it's about 10x faster to use zlib.

  import zlib

  input_text = b"I ordered three tacos with extra guacamole"

  tacos = b"taco burrito tortilla salsa guacamole cilantro lime " * 50
  taco_comp = zlib.compressobj(zdict=tacos)
  print(len(taco_comp.compress(input_text) + taco_comp.flush()))
  # prints 41

  padel = b"racket court serve volley smash lob match game set " * 50
  padel_comp = zlib.compressobj(zdict=padel)
  print(len(padel_comp.compress(input_text) + padel_comp.flush()))
  # prints 54

pornel - 6 hours ago

The application of compressors for text statistics is fun, but it's a software equivalent of discovering that speakers and microphones are in principle the same device.

(KL divergence of letter frequencies is the same thing as ratio of lengths of their Huffman-compressed bitstreams, but you don't need to do all this bit-twiddling for real just to count the letters)
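
(To make that concrete: a rough sketch of the "just count the letters" version. The cross_entropy_bits helper and its crude smoothing are purely illustrative, and the corpora borrow the taco/padel strings from the zlib example above.)

  import math
  from collections import Counter

  def cross_entropy_bits(text, reference):
      # Estimate how many bits 'text' would cost under a letter-frequency
      # model fitted to 'reference' -- no compressor involved.
      counts = Counter(reference)
      total = sum(counts.values())
      return sum(-math.log2(counts.get(ch, 0.5) / (total + 1)) for ch in text)

  doc = "I ordered three tacos with extra guacamole"
  tacos = "taco burrito tortilla salsa guacamole cilantro lime " * 50
  padel = "racket court serve volley smash lob match game set " * 50
  print(cross_entropy_bits(doc, tacos), cross_entropy_bits(doc, padel))
  # the lower number marks the closer corpus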

The article views compression entirely through Python's limitations.

> gzip and LZW don’t support incremental compression

This may be true of Python's APIs, but it is not true of these algorithms in general.

They absolutely support incremental compression even in APIs of popular lower-level libraries.

Snapshotting/rewinding of the state isn't exposed usually (custom gzip dictionary is close enough in practice, but a dedicated API would reuse its internal caches). Algorithmically it is possible, and quite frequently used by the compressors themselves: Zopfli tries lots of what-if scenarios in a loop. Good LZW compression requires rewinding to a smaller symbol size and restarting compression from there after you notice the dictionary stopped being helpful. The bitstream has a dedicated code for this, so this isn't just possible, but baked into the design.
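
(Python's zlib does expose one small piece of this: compression objects have a copy() method, so you can snapshot the compressor state and try several continuations from the same prefix. A quick sketch, not from the article:)

  import zlib

  base = zlib.compressobj()
  base.compress(b"shared prefix that both branches continue from ")

  # Fork the compressor state instead of re-compressing the shared prefix.
  branch_a = base.copy()
  branch_b = base.copy()
  len_a = len(branch_a.compress(b"taco guacamole") + branch_a.flush())
  len_b = len(branch_b.compress(b"racket volley") + branch_b.flush())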

m-hodges - an hour ago

Great overview. In 2023 I wrote about classifying political emails with Zstd.¹

¹ https://matthodges.com/posts/2023-10-01-BIDEN-binary-inferen...

chaps - 9 hours ago

Ooh, totally. Many years ago I was doing some analysis of parking ticket data using gnuplot and had it output a chart PNG per street. Not great, but it worked well enough to get to the next step of that project: sorting the directory by file size. The most dynamic streets were the largest files by far.

Another way I've used compression is to identify cops who cover their body cameras while recording -- the filesize-to-length ratio reflects that not much activity is going on.

matheus-rr - 3 hours ago

The stdlib inclusion angle is what makes this interesting to me beyond the compression-as-classifier debate.

Before 3.14, doing this required pip installing pyzstd or python-zstandard, which meant an extra dependency and a C extension build step. Having zstd in the stdlib means you can do dictionary-based compression tricks in any Python 3.14 environment without worrying about whether the deployment target has build tools or whether the package version matches across your team.

That matters more than it sounds for reproducibility. The gzip/zlib approach mentioned elsewhere in the thread has been in the stdlib forever, but zstd dictionaries are meaningfully better at learning domain-specific patterns from small training corpora. The difference between 41 and 54 bytes in the taco example is small, but on real classification tasks with thousands of categories the gap compounds.
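
For reference, here is the same trick as the zlib example upthread, sketched with the new stdlib module. The names follow PEP 784 / pyzstd (ZstdDict, ZstdCompressor, is_raw), so treat the details as approximate rather than as the article's exact code.

  from compression import zstd

  input_text = b"I ordered three tacos with extra guacamole"
  tacos = b"taco burrito tortilla salsa guacamole cilantro lime " * 50

  # Raw-content dictionary built straight from the category corpus,
  # no train_dict() step.
  taco_dict = zstd.ZstdDict(tacos, is_raw=True)
  comp = zstd.ZstdCompressor(zstd_dict=taco_dict)
  print(len(comp.compress(input_text) + comp.flush()))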

Python 3.14 quietly shipped a lot of practical stuff like this that gets overshadowed by the free-threading and JIT headlines. The ZSTD module, t-strings for template injection safety, and deferred annotation evaluation are all things that change how you write everyday code, not just how the runtime performs.

woadwarrior01 - 2 hours ago

There's also Normalized Google Distance (a distance metric using the number of search results as a proxy), which can be used for text classification.

https://en.wikipedia.org/wiki/Normalized_Google_distance
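
The formula itself is tiny; a sketch with made-up hit counts standing in for real search results:

  import math

  def ngd(fx, fy, fxy, n):
      # fx, fy: pages containing each term alone; fxy: pages containing both;
      # n: total number of indexed pages. All counts below are stand-ins.
      lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
      return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

  print(ngd(fx=50_000, fy=40_000, fxy=20_000, n=10_000_000))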

throwaway81523 - 9 hours ago

Python's zlib does support incremental compression with the zdict parameter. gzip has something similar, but you have to do some hacky thing to get at it since the regular Python API doesn't expose the entry point. I did manage to use it from Python a while back, but my memory is hazy about how I got to it. The entry point may have been exposed in the module's code but left undocumented in the Python manual.

stephantul - 3 hours ago

The speed comparison is weird.

The author sets the solver to saga, doesn’t standardize the features, and uses a very high max_iter.

Logistic Regression takes longer to converge when features are not standardized.

Also, the zstd classifier's time complexity scales linearly with the number of classes; logistic regression's doesn't. The dataset has 20 classes (it's in the name), so why only use 4?
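
For reference, a rough sketch of that kind of baseline with standard scikit-learn pieces (all 20 classes, scaled features, the default solver) -- not the article's exact setup:

  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  train = fetch_20newsgroups(subset="train")
  test = fetch_20newsgroups(subset="test")

  baseline = make_pipeline(
      TfidfVectorizer(),
      StandardScaler(with_mean=False),  # keep sparsity, scale variances
      LogisticRegression(max_iter=1000),
  )
  baseline.fit(train.data, train.target)
  print(baseline.score(test.data, test.target))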

It’s a cool exploration of zstd. But please give the baseline some love. Not everything has to be better than something to be interesting.

physicsguy - 2 hours ago

In my PhD more than a decade ago, I ended up using PNG image file sizes to classify different output states from simulations of a system under different conditions. Because of the compression, homogeneous states led to much smaller file sizes than heterogeneous states. It was super, super reliable.
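
The trick is easy to reproduce; a throwaway sketch with numpy and Pillow (nothing from the original simulations):

  import io

  import numpy as np
  from PIL import Image

  def png_size(arr):
      # Encode a 2D uint8 array as PNG and return the byte count.
      buf = io.BytesIO()
      Image.fromarray(arr).save(buf, format="PNG")
      return buf.tell()

  rng = np.random.default_rng(0)
  homogeneous = np.full((256, 256), 128, dtype=np.uint8)
  heterogeneous = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
  print(png_size(homogeneous), png_size(heterogeneous))  # tiny vs. large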

wodenokoto - 2 hours ago

Why did Python include ZSTD? Are people passing around files compressed with this algorithm? It's the first I've ever heard of it.

cluckindan - an hour ago

So that’s why Facebook developed a “compression” library that has snaked itself into various environments.

Scene_Cast2 - 10 hours ago

Sweet! I love clever information theory things like that.

It goes the other way too. Given that LLMs are just lossless compression machines, I do sometimes wonder how much better they are at compressing plain text compared to zstd or similar. Should be easy to calculate...

EDIT: lossless when they're used as the probability estimator and paired with something like an arithmetic coder.
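
A back-of-the-envelope version of that calculation, with a toy probability function standing in for the model and the stdlib zstd module's one-shot compress() as the comparison point:

  import math
  from compression import zstd

  def model_bits(tokens, prob_of_next_token):
      # An arithmetic coder driven by the model's predictions spends about
      # -log2(p) bits per token, so sum those.
      return sum(-math.log2(prob_of_next_token(tokens[:i], tok))
                 for i, tok in enumerate(tokens))

  def toy_prob(context, token):
      # Hypothetical stand-in for a real LLM: uniform over 256 byte values.
      return 1 / 256

  text = b"some plain text to compare"
  print(model_bits(list(text), toy_prob))  # model-side bit count
  print(8 * len(zstd.compress(text)))      # zstd bit count

A real LLM's per-token probabilities would make the first number far smaller on natural text; the toy uniform model just shows where they plug in.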

rob_c - 7 hours ago

So you just discovered PCA in some other form?

OutOfHere - 9 hours ago

Can `copy.deepcopy` not help?
