Fabrice Bellard's TS Zip (2024)

bellard.org

135 points by everlier 11 hours ago


omoikane - 8 hours ago

Current leader of the Large Text Compression Benchmark is NNCP (compression using neural networks), also by Fabrice Bellard:

https://bellard.org/nncp/

Also, nncp-2024-06-05.tar.gz is just 1180969 bytes, unlike ts_zip-2024-03-02.tar.gz (159228453 bytes, which is bigger than uncompressed enwik8).

egl2020 - 9 hours ago

When Jeff Dean gets stuck, he asks Bellard for help...

meisel - 9 hours ago

Looks like it beats everything in the large text compression benchmark for enwik8, but loses to several programs for enwik9. I wonder why that is.

oxag3n - 9 hours ago

Compression and intelligence reminded me of the https://www.hutter1.net/prize

I first encountered it more than 10 years ago, and it felt novel that compression is related to intelligence and even AGI.

wewewedxfgdf - 10 hours ago

>> The ts_zip utility can compress (and hopefully decompress) text files

Hopefully :-)

gmuslera - 8 hours ago

Reminded me of the pi filesystem (https://github.com/philipl/pifs): with enough digits of pi precalculated, you might be able to build a decent compression program. The trick is how many digits you'd reasonably need for that, and whether storing them ends up smaller or bigger than that trained LLM.
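
The catch (and the joke in pifs) is that the "address" of your data in pi is, on average, about as long as the data itself, so nothing is saved no matter how many digits you precompute. A rough way to see it, using Machin's formula to get some digits (toy code, not from pifs; the patterns searched for are arbitrary):

    # pi to n decimal places via Machin's formula with scaled integers.
    def arccot(x, unity):
        total = term = unity // x
        n, sign = 3, -1
        while term:
            term //= x * x
            total += sign * (term // n)
            n, sign = n + 2, -sign
        return total

    def pi_digits(n):
        unity = 10 ** (n + 10)                 # 10 guard digits
        pi = 4 * (4 * arccot(5, unity) - arccot(239, unity))
        return str(pi // 10 ** 10)             # "31415926535..."

    digits = pi_digits(20_000)
    for data in ["007", "123", "999"]:         # tiny "files" to store in pi
        print(data, "lives at offset", digits.find(data))

You need roughly 10^d digits of pi before a given d-digit string even shows up, so the offset you would have to store is about as many digits long as the data it points to.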

bob1029 - 5 hours ago

PPMd is the most exotic compressor I've actually used in production. The first time I saw it in action I thought it was lossy or something was broken. I had never seen structured text compress that well.

rurban - 9 hours ago

So he did finally beat his own leading program from 2019, nncp.

MisterTea - 10 hours ago

This is something I have been curious about: how does an LLM actually achieve compression?

I would like to know what deviations are in the output, as this almost feels like a game of telephone where each re-compression results in a loss of data that is then incorrectly reconstructed. Sort of like misremembering a story: as you retell it over time, the details change slightly.
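
For what it's worth, this kind of scheme is lossless by construction: the LLM is only a predictor feeding an arithmetic coder, and the decoder replays the exact same model on the symbols it has already recovered, so there is no telephone-game drift as long as the model is bit-for-bit deterministic. A minimal sketch of the idea, with a toy deterministic model standing in for the LLM and exact Fractions standing in for a real range coder (all names made up):

    from fractions import Fraction

    ALPHABET = "abc"

    def model(context):
        # Toy stand-in for the LLM: deterministic next-symbol probabilities
        # given the previous symbols (add-one counts over the context).
        counts = {s: 1 for s in ALPHABET}
        for s in context:
            counts[s] += 1
        total = sum(counts.values())
        return {s: Fraction(c, total) for s, c in counts.items()}

    def encode(text):
        # Narrow the interval [low, high) by each symbol's probability slice.
        low, high = Fraction(0), Fraction(1)
        for i, sym in enumerate(text):
            probs = model(text[:i])
            width = high - low
            cum = Fraction(0)
            for s in ALPHABET:
                if s == sym:
                    high = low + width * (cum + probs[s])
                    low = low + width * cum
                    break
                cum += probs[s]
        return (low + high) / 2  # any number inside the final interval

    def decode(code, length):
        # Replay the same model on the already-decoded prefix and pick the
        # slice containing the coded number (length is sent separately here).
        out = ""
        low, high = Fraction(0), Fraction(1)
        for _ in range(length):
            probs = model(out)
            width = high - low
            cum = Fraction(0)
            for s in ALPHABET:
                nxt = cum + probs[s]
                if low + width * cum <= code < low + width * nxt:
                    out += s
                    low, high = low + width * cum, low + width * nxt
                    break
                cum = nxt
        return out

    msg = "abcabcaabbcc"
    assert decode(encode(msg), len(msg)) == msg  # round-trips exactly

The realistic failure mode is the model producing slightly different probabilities on different hardware or library versions, not the coder misremembering anything, which is presumably why the page hedges with "hopefully decompress".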

shawnz - 9 hours ago

Another fun application of combining LLMs with arithmetic coding is steganography. Here's a project I worked on a while back which effectively uses the opposite technique of what's being done here, to construct a steganographic transformation: https://github.com/shawnz/textcoder
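
Roughly, the bits to hide are treated as if they were an arithmetic-coded stream and "decompressed" by the model into plausible-looking text; re-encoding that text recovers the bits. A much cruder toy than what textcoder does, with no LLM and no arithmetic coder (every name below is made up for the sketch), just to show the shape of it:

    # Hide each secret bit as a choice between two plausible continuations,
    # then recover the bits by checking which choice was made at each step.
    NEXT = {
        "the":   ["cat", "dog"],
        "cat":   ["sat", "slept"],
        "dog":   ["sat", "slept"],
        "sat":   ["quietly", "there"],
        "slept": ["quietly", "there"],
    }  # toy table, only supports a few bits

    def hide(bits, start="the"):
        words, cur = [start], start
        for b in bits:                     # each bit picks continuation 0 or 1
            cur = NEXT[cur][b]
            words.append(cur)
        return " ".join(words)

    def reveal(text):
        words = text.split()
        return [NEXT[prev].index(cur) for prev, cur in zip(words, words[1:])]

    secret = [1, 0, 1]
    cover = hide(secret)                   # e.g. "the dog sat there"
    assert reveal(cover) == secret

With a real language model plus arithmetic coding you get the same effect, but with many candidate tokens weighted by probability, so the cover text reads far more naturally.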

SnowProblem - 9 hours ago

I love this because it gets to the heart of information theory. Shannon's foundational insight was that information is surprise. A random sequence is incompressible by definition. But what counts as surprise depends on context, and for text, we know a large amount of it is predictable slop. I suspect there's a lot of room to push further with this style of compression. For example, maybe you could store an upfront summary that makes prediction more accurate. Or perhaps you could encode larger sequences, or use some kind of hierarchical encoding. But this is great.
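
That intuition has a concrete number attached: a symbol the model assigns probability p costs about -log2(p) bits under arithmetic coding, so a model that is less surprised by the text yields a smaller file. A back-of-the-envelope sketch with toy models (nothing from ts_zip):

    import math

    text = "to be or not to be"

    def ideal_bits(s, prob):
        # Ideal code length: sum of -log2(p) over the symbols, where p is the
        # model's probability for each symbol given the preceding context.
        return sum(-math.log2(prob(ch, s[:i])) for i, ch in enumerate(s))

    def uniform(ch, ctx):
        return 1 / 27  # 26 letters + space, no prediction at all

    def freq(ch, ctx):
        # Cheats by using letter frequencies of the whole text; a real
        # compressor would have to transmit or adaptively learn these.
        return text.count(ch) / len(text)

    print(ideal_bits(text, uniform) / len(text))  # about 4.75 bits/char
    print(ideal_bits(text, freq) / len(text))     # about 2.6 bits/char

A strong neural predictor is far less surprised by English text than either of these toy models, and that gap is exactly what this style of compression cashes in.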

dmitrygr - 10 hours ago

"compressed size" does not seem to include the size of the model and the code to run it. According to the rules of Large Text Compression Benchmark, total size of those must be counted, otherwise a 0-byte "compressed" file with a decompressor containing the plaintext would win.

benatkin - 10 hours ago

I propose the name tokables for the compressed data produced by this. A play on tokens and how wild it is.

jokoon - 8 hours ago

so barely 2 or 3 times better than xz

not really worth it

publicdebates - 10 hours ago

Bellard finally working with his true colleague.