[HN Gopher] Fabrice Bellard's TS Zip (2024)

	[HN Gopher] Fabrice Bellard's TS Zip (2024)
	___________________________________________________________________

	Fabrice Bellard's TS Zip (2024)

	Author : everlier
	Score : 154 points
	Date : 2026-01-12 20:26 UTC (12 hours ago)

	web link (www.bellard.org)
	w3m dump (www.bellard.org)

	\| publicdebates wrote:
	\| Bellard finally working with his true colleague.

	\| dmitrygr wrote:
	\| "compressed size" does not seem to include the size of the model
	\| and the code to run it. According to the rules of Large Text
	\| Compression Benchmark, total size of those must be counted,
	\| otherwise a 0-byte "compressed" file with a decompressor
	\| containing the plaintext would win.

	\| underdeserver wrote:
	\| Technically correct, but a better benchmark would be a known
	\| compressor with an unknown set of inputs (that come from a
	\| real-world population, e.g. coherent English text).

	\| eru wrote:
	\| Yes, definitely. Alas, it's just harder to run these kinds of
	\| challenges completely fairly and self-administered, than the
	\| ones where you have a fixed texts as the challenge and add
	\| the binary size of the decompressor.

	\| paufernandez wrote:
	\| Yeah, but the xz algorithm is also not counted in the bytes...
	\| Here the "program" is the LLM, much like your brain remembers
	\| things by coding them compressed and then reconstructs them. It
	\| is a different type of compression: compression by
	\| "understanding", which requires the whole corpus of possible
	\| inputs in some representation. The comparison is not fair to
	\| classical algorithms yet that's how you can compress a lot more
	\| (given a particular language): by having a model of it.

	\| wrs wrote:
	\| "Compressors are ranked by the compressed size of enwik9
	\| (10^9 bytes) plus the size of a zip archive containing the
	\| decompresser." [0]
	\|
	\| [0] https://www.mattmahoney.net/dc/text.html

	\| FartyMcFarter wrote:
	\| True for competitions, but if your compression algorithm is
	\| general purpose then this matters less (within reason - no one
	\| wants to lug around a 1TB compression program).

	\| MisterTea wrote:
	\| This is something I have been curious about in terms of how an
	\| LLM's achieves compression.
	\|
	\| I would like to know what deviations are in the output as this
	\| almost feels like a game of telephone where each re-compression
	\| results in a loss of data which is then incorrectly
	\| reconstructed. Sort of like misremembering a story and as you
	\| tell it over time the details change slightly.

	\| Scaevolus wrote:
	\| When LLMs predict the next token, they actually produce a
	\| distribution of the probability of each of the possible next
	\| tokens, and the sampler chooses one of them, and not
	\| necessarily the most likely one!
	\|
	\| If instead you run LLM prediction and then encode the
	\| _probability_ of the next token of the input text you want to
	\| encode (from the cumulative distribution, a number in [0, 1])
	\| using arithmetic coding, you can run the same operation in
	\| reverse to achieve lossless compression.
	\|
	\| The tricky part is ensuring that your LLM executes absolutely
	\| deterministically, because you need to make sure that the
	\| encoder and decoder have the same probability distribution map
	\| at each step.

	\| AnotherGoodName wrote:
	\| Yes. The secret is in understanding arithmetic coding.
	\| https://en.wikipedia.org/wiki/Arithmetic_coding
	\|
	\| Arithmetic coding takes a prediction of the next bit and writes
	\| out exactly as many bits as needed to correct that prediction.
	\| The amazing part is that you can write out fractional bits. Eg.
	\| You predict the next bit is '1' with 75% probability? If it is
	\| 1 you only need to write out 1/2 of a bit (correcting that 25%
	\| portion). If it's 0 you need to write out 2.4bits. It may seem
	\| strange to work with 1/2 a bit but it works! (essentially the
	\| other half of the bit represents other future correction
	\| required). You might have heard of huffman coding which can't
	\| deal with fractional bits, arithmetic coding is a
	\| generalization of huffman coding that can deal with this.
	\|
	\| Arithmetic coding is mathematically perfect at what it does.
	\| You will not waste a single bit using this algorithm to encode
	\| data given a prediction of that data.
	\|
	\| So the entirety of modern compression techniques don't deal
	\| with the encoding/decoding side at all. What they deal with is
	\| modelling the data so far and making the most accurate
	\| prediction possible on the next bit of data (next byte also
	\| works, but working 1 bit at a time is easier to comprehend when
	\| learning arithmetic coding).
	\|
	\| Incidentally the encoders and decoders essentially work exactly
	\| the same. Given the data read or data decoded so far predict
	\| the next bit. This part is exactly the same either way. The
	\| decoder would read the compressed file for the correction and
	\| the encoder would read the input file and write out the
	\| correction. The important part is "predict the next bit". This
	\| is what separates all the different compressors.
	\|
	\| This is also where those of us experienced in this area try to
	\| correct people on the wrong track. A compression algorithm is
	\| never about the encoding side but instead 100% always about the
	\| prediction of the data. Can you build a model that can
	\| accurately predict what the next data to come is? That's what
	\| you need to do to make a better file compressor. The entropy
	\| encoding part is a completely solved problem already, don't
	\| bother re-solving that.

	\| ChadNauseam wrote:
	\| That's true for lossless compression. Is there a
	\| generalization for lossy compression?

	\| AnotherGoodName wrote:
	\| Lossy and lossless are essentially fundamentally the same
	\| actually. All state of the art lossy compressors use
	\| arithmetic coding as an example and they still do
	\| prediction. Eg. your favourite video codecs predict not
	\| only the next bit in the 2D frame, but also the next bit
	\| when modelling past frames (becomes a 3D cube of data at
	\| that point) and they also do things like motion prediction
	\| of individual objects in frame to help make a better
	\| prediction. They all use arithmetic encoders to encode the
	\| data.
	\|
	\| Where the lossy part comes in is the point at which humans
	\| notice/don't notice data being thrown away. Got a bit that
	\| was waaay out of line in the prediction and going to cost
	\| you 10bits to correct? Perhaps to humans it doesn't matter?
	\| Can we throw it away? This throwing away of data is often
	\| done before the prediction+compression stage (eg. maybe
	\| quantizing the color space to 8bits from 32bits?) but from
	\| there on it's the same thing.

	\| eru wrote:
	\| To simplify: lossy compression = lossless compression + a
	\| perception model that can tell you what aspect of the
	\| data you can safely toss away without anyone noticing.

	\| ChadNauseam wrote:
	\| Thanks! That's really enlightening. Maybe this can help
	\| me settle a question I've had. If I have 4k video and I
	\| want a smaller file size (but suppose I still have a 4k
	\| tv), I always felt like I should get better quality at
	\| that file size by compressing it further than by reducing
	\| the resolution. Rather than deciding for myself what data
	\| to throw away, why not let the compression algorithm do
	\| it?

	\| AnotherGoodName wrote:
	\| Lowering the resolution to match the typical TV
	\| resolutions is sensible but beyond that trusting the
	\| codec is always the better way. The codecs will
	\| adaptively compress portions of the video differently.
	\| The top left area that's in shadow for the next
	\| 30seconds? Maybe that areas effective resolution can be
	\| lowered? Etc. They can do things like this!
	\|
	\| If you change the resolution or color space of the entire
	\| file you do that without consideration to where the extra
	\| details might have been needed.
	\|
	\| So resolution should match typical output resolutions
	\| exactly and from there it's all on the codec.

	\| billforsternz wrote:
	\| I don't understand. On average, for every 4 input bits we
	\| will get it right 3 times writing 0.5 bits each time and get
	\| it wrong once writing 2.4 bits once. So we write a total of 3
	\| * 0.5 + 2.4 bits = 3.9 bits. The compressed output is 3.9/4 =
	\| 97.5% as big as the input. Not very compelling. What am I
	\| misunderstanding?

	\| AnotherGoodName wrote:
	\| I back of the enveloped it wrong is what :(.
	\|
	\| It's -log2(0.75) for getting a 75% chance right and
	\| -log2(0.25) for getting it wrong. I should have stated .4
	\| bits and 2bits respectively not 0.5 and 2.4. Sorry! Good
	\| catch.
	\|
	\| It's 3.2 vs 4bits. Now that may not seem huge but the
	\| probabilities tend to be at the more extreme ends if the
	\| predictor is any good. Once you start going towards the 99%
	\| range you get extreme efficiency.

	\| billforsternz wrote:
	\| Thanks for the clarification.

	\| themafia wrote:
	\| https://www.youtube.com/watch?v=QEzhxP-pdos

	\| wewewedxfgdf wrote:
	\| >> The ts_zip utility can compress (and hopefully decompress)
	\| text files
	\|
	\| Hopefully :-)

	\| hamandcheese wrote:
	\| Reading data is overrated. I highly recommend S4:
	\|
	\| http://www.supersimplestorageservice.com/

	\| benatkin wrote:
	\| I propose the name _tokables_ for the compressed data produced by
	\| this. A play on tokens and how wild it is.

	\| fancyswimtime wrote:
	\| please pass the tokables to the left hand side

	\| shawnz wrote:
	\| Another fun application of combining LLMs with arithmetic coding
	\| is steganography. Here's a project I worked on a while back which
	\| effectively uses the opposite technique of what's being done
	\| here, to construct a steganographic transformation:
	\| https://github.com/shawnz/textcoder

	\| akoboldfrying wrote:
	\| Cool! It creates very plausible encodings.
	\|
	\| > The Llama tokenizer used in this project sometimes permits
	\| multiple possible tokenizations for a given string.
	\|
	\| Not having tokens be a prefix code is thoroughly unfortunate.
	\| Do the Llama team consider it a bug? I don't see how to rectify
	\| the situation without a full retrain, sadly.

	\| shawnz wrote:
	\| I can't imagine they consider it a bug, it is a common and
	\| beneficial property of essentially every LLM today. You want
	\| to be able to represent common words with single tokens for
	\| efficiency, but at the same time you still need to be able to
	\| represent prefixes of those words in the cases where they
	\| occur separately

	\| akoboldfrying wrote:
	\| I find this surprising, but I suppose it must be more
	\| efficient overall.
	\|
	\| Presumably parsing text into tokens is done in some
	\| deterministic way. If it is done by greedily taking the
	\| longest-matching prefix that is a token, then when
	\| generating text it should be possible to "enrich" tokens
	\| that are prefixes of other tokens with additional
	\| constraints to force a unique parse: E.g., if "e" is a
	\| token but "en" is too, then after generating "e" you must
	\| never generate a token that begins with "n". A text
	\| generated this way can be deterministically parsed by the
	\| greedy parser.
	\|
	\| Alternatively, it would suffice to restrict to a subset of
	\| tokens that _are_ a prefix code. This would be simpler, but
	\| with lower coding efficiency.

	\| shawnz wrote:
	\| Regarding the first part: that's an interesting idea,
	\| although I worry it would bias the outputs in an
	\| unrealistic way. Then again, maybe it would only impact
	\| scenarios that would have otherwise been unparsable
	\| anyway?
	\|
	\| Regarding the second part: you'd effectively just be
	\| limiting yourself to single character tokens in that case
	\| which would drastically impact the LLM's output quality

	\| akoboldfrying wrote:
	\| The first approach would only affect outputs that would
	\| have been otherwise unparseable.
	\|
	\| The second approach works with _any_ subset of tokens
	\| that form a prefix code -- you effectively set the
	\| probability of all tokens outside this subset to zero
	\| (and rescale the remaining probabilities if necessary).
	\| In practice you would want to choose a large subset,
	\| which means you almost certainly want to avoid choosing
	\| any single-character tokens, since they can 't coexist
	\| with tokens beginning with that character. (Choosing a
	\| largest-possible such subset sounds like an interesting
	\| subproblem to me.)

	\| bonzini wrote:
	\| I think it's plausible that different languages would prefer
	\| different tokenizations. For example in Spanish the plural of
	\| carro is carros, in Italian it's carro. Maybe the LLM would
	\| prefer carr+o in Italian and a single token in Spanish.

	\| meisel wrote:
	\| Looks like it beats everything in the large text compression
	\| benchmark for enwik8, but loses to several programs for enwik9. I
	\| wonder why that is.

	\| AnotherGoodName wrote:
	\| It's actually not the best at enwik8 or 9.
	\|
	\| The results at https://www.mattmahoney.net/dc/text.html
	\| explicitly add the size of the compressor itself to the result.
	\| Note the "enwik9+prog" column. That's what it's ranked on.
	\|
	\| The reason to do this is that it's trivial to create a
	\| compressor that 'compresses' a file to 0 bytes. Just have an
	\| executable with a dictionary of enwik9 that writes that out
	\| given any input. So we always measure what is effectively the
	\| Kolmogorov complexity. The data+program as a whole that
	\| produces the result we want.
	\|
	\| So those results add in the compressor size. The programs there
	\| generally have no dictionary built in or in the case of LLM
	\| based compressors, no pre-trained data. They effectively build
	\| the model as they process data. Not compressing much at all at
	\| the start and slowly compressing better and better as they go.
	\| This is why these programs do better and better with larger
	\| data sets. They start with 0 knowledge. After a GB or so they
	\| have very good knowledge of the corpus of human language.
	\|
	\| This program here however is pre-trained and shipped with a
	\| model. It's 150MB in size! This means it has 150MB of extra
	\| starting knowledge over those models in that list. The top
	\| models in that list are the better compressors, they'll quickly
	\| out learn and overtake this compressor but they just don't have
	\| that headstart.
	\|
	\| Of course measuring fairly this should be listed with that
	\| 150MB program size added to the results when doing a
	\| comparison.

	\| srcreigh wrote:
	\| As an aside, I wonder how to account for the information
	\| content embedded in the hardware itself.
	\|
	\| A Turing Machine compressor program would likely have more
	\| bytes than the amd64 binary. So how to evaluate
	\| KolmogorovComplexity(amd64)?
	\|
	\| The laws of physics somehow need to be accounted for too,
	\| probably.

	\| d_burfoot wrote:
	\| Kolmogorov Complexity is only defined up to a constant,
	\| which represents Turing machine translation length.

	\| notpushkin wrote:
	\| I guess we need to guesstimate the length of a shortest
	\| Turing machine implementation of amd64 then?

	\| Dylan16807 wrote:
	\| The complexity of a simple turing machine is itty bitty,
	\| and you can bootstrap that into an x86 emulator in a matter
	\| of kilobytes, so when we're messing with 100MB files it's
	\| not a big factor.

	\| rao-v wrote:
	\| If every version of your OS ships with a default local model,
	\| it maybe interesting to see compression conditioned on the
	\| standard LLM weights

	\| rurban wrote:
	\| So did beat his own leading program from 2019, nncp, finally.

	\| egl2020 wrote:
	\| When Jeff Dean gets stuck, he asks Bellard for help...

	\| ok_dad wrote:
	\| Jeff Dean uses Fabrice Bellard as a rubber duck for debugging,
	\| and vice versa. Amazingly, neither says a thing for several
	\| minutes, staring into each other's eyes, and then they just
	\| start typing the solution to the problem on a single keyboard
	\| together.

	\| jl6 wrote:
	\| Shannon approaches the Bellard limit.

	\| SnowProblem wrote:
	\| I love this because it gets to the heart of information theory.
	\| Shannon's foundational insight was that information is surprise.
	\| A random sequence is incompressible by definition. But what
	\| counts as surprise depends on context, and for text, we know a
	\| large amount of it is predictable slop. I suspect there's a lot
	\| of room to go along this style of compression. For example, maybe
	\| you could store an upfront summary that makes prediction more
	\| accurate. Or perhaps you could encode larger sequences or some
	\| kind of hierarchical encoding. But this is great.

	\| bambax wrote:
	\| Yes! information is surprise, and that's why a measure of
	\| intelligence is the ability to predict.

	\| oxag3n wrote:
	\| Compression and intelligence reminded me of the
	\| https://www.hutter1.net/prize
	\|
	\| I've encountered it >10 years ago and it felt novel that
	\| compression is related to intelligence and even AGI.

	\| eru wrote:
	\| Yes.
	\|
	\| When you train your neural network to minimise cross-entropy
	\| that's literally the same as making it better as a building
	\| block in an arithmetic coding data compressor. See
	\| https://en.wikipedia.org/wiki/Arithmetic_coding
	\|
	\| See also https://learnandburn.ai/p/an-elegant-equivalence-
	\| between-llm...

	\| senderista wrote:
	\| Indeed, KL-divergence can be seen as the difference between
	\| the average number of bits required to arithmetically encode
	\| a sample from a given distribution, using symbol
	\| probabilities from both the original distribution and an
	\| approximating distribution.
	\|
	\| https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver.
	\| ..

	\| omoikane wrote:
	\| Current leader of the Large Text Compression Benchmark is NNCP
	\| (compression using neural networks), also by Fabrice Bellard:
	\|
	\| https://bellard.org/nncp/
	\|
	\| Also, nncp-2024-06-05.tar.gz is just 1180969 bytes, unlike
	\| ts_zip-2024-03-02.tar.gz (159228453 bytes, which is bigger than
	\| uncompressed enwiki8).

	\| smusamashah wrote:
	\| Doesn't this fit the Hutter Prize conditions that is mentioned
	\| in other comment here
	\| https://news.ycombinator.com/item?id=46595109

	\| WithinReason wrote:
	\| It's too slow for that. The Hutter prize is CPU only so
	\| neural network solutions (which are the most interesting IMO)
	\| are effectively excluded. You need to generate 11 574
	\| characters per second on the CPU only for decompression, and
	\| the compression time also counts and has to be below 24 hours
	\| in total.

	\| wiz21c wrote:
	\| while impressive, it's a very specific case of compression:
	\| english text. It may be in use in many places but there are
	\| many more things to compress.
	\|
	\| It'd be nice to have a comparison here:
	\| https://morotti.github.io/lzbench-web/?dataset=silesia/sao&m...

	\| gmuslera wrote:
	\| Reminded me of pi filesystem (https://github.com/philipl/pifs),
	\| with enough digits of pi precalculated you might be able to do a
	\| decent compression program. The trick is in the amount of
	\| reasonable digits for that, if it's smaller or bigger than that
	\| trained LLM.

	\| GuB-42 wrote:
	\| I suspect that the length of the offset of your input data in
	\| pi is equal to the length of the input data itself, plus or
	\| minus a few bytes at most, regardless of the size of the input
	\| data.
	\|
	\| That is: no compression, but it won't make things worse either.
	\|
	\| Unless the input data is the digits of pi, obviously, or the
	\| result of some computation involving pi.

	\| MrLeap wrote:
	\| You could express the offset with scientific notation,
	\| tetration, and other big math number things. You probably
	\| don't need the whole offset number all at once!

	\| GuB-42 wrote:
	\| Actually, you do.
	\|
	\| You can use all the math stuff like scientific notation,
	\| tetration, etc... but it won't help you make things
	\| smaller.
	\|
	\| Math notation is a form of compression. 10^9 is 1000000000,
	\| compressed. But the offset into pi is effectively a random
	\| number, and you can't compress random numbers no matter
	\| what technique you use, including math notation.
	\|
	\| This can be formalized and mathematically proven. The only
	\| thing wrong here is that pi is not a random number, but
	\| unless you are dealing with circles, it looks a lot like
	\| it, so while unproven, I think it is a reasonable shortcut.

	\| sltkr wrote:
	\| I'm going to be the nerd that points out that it has not been
	\| mathematically proven that pi contains every substring, so the
	\| pifs might not work even in theory (besides being utterly
	\| impractical, of course).
	\|
	\| On a more serious note, as far as I understand these
	\| compression competitions require that static data is included
	\| in the size computation. So if you compress 1000 MB into 500
	\| MB, but to decompress you need a 1 MB binary and a 100 MB
	\| initial dictionary, your score would be 500 + 100 + 1 = 601 MB,
	\| not 500 MB.
	\|
	\| The relevance to this discussion is that the LLM weights would
	\| have to be included as static data, since the only way to
	\| regenerate them is from the initial training data, which is
	\| much larger than the resulting model. By comparison, pi based
	\| compression is the other way around: since pi is a natural
	\| constant, if your decompressor requires (say) a trillion digits
	\| of pi, you could write a relatively small program (a few kb) to
	\| generate them. It would be terribly slow, but it wouldn't
	\| affect your compression ratio much.

	\| charcircuit wrote:
	\| This only does 1 byte, so you only have to prove it contains
	\| the bits for 0-255.

	\| eru wrote:
	\| > I'm going to be the nerd that points out that it has not
	\| been mathematically proven that pi contains every substring,
	\| so the pifs might not work even in theory (besides being
	\| utterly impractical, of course).
	\|
	\| Well, either your program 'works', or you will have
	\| discovered a major new insight about Pi.
	\|
	\| > On a more serious note, as far as I understand these
	\| compression competitions require that static data is included
	\| in the size computation. So if you compress 1000 MB into 500
	\| MB, but to decompress you need a 1 MB binary and a 100 MB
	\| initial dictionary, your score would be 500 + 100 + 1 = 601
	\| MB, not 500 MB.
	\|
	\| And that's the only way to do this fairly, if you are running
	\| a competition where you only have a single static corpus to
	\| compress.
	\|
	\| It would be more interesting and would make the results more
	\| useful, if the texts to be compressed would be drawn from a
	\| wide probability distribution, and then we scored people on
	\| eg the average length. Then you wouldn't necessarily need to
	\| include the size of the compressor and decompressor in the
	\| score.
	\|
	\| Of course, it would be utterly impractical to sample
	\| Gigabytes of new text each time you need to run the
	\| benchmark: humans are expensive writers. The only way this
	\| could work would be either to sample via an LLM, but that's
	\| somewhat circular and wouldn't measure what you actually want
	\| to measure in the benchmark, or you could try to keep the
	\| benchmark text secret, but that has its own problems.

	\| netsharc wrote:
	\| You mentioning the concept of pi containing every substring
	\| makes me think of Borges' Library of Babel.
	\|
	\| Ha, next: a compression algorithm that requires the user to
	\| first build an infinite library...

	\| _ache_ wrote:
	\| Yep, the Library of Babel is a related topic. Wikipedia
	\| links it as "See also" on the page of "Normal numbers".
	\|
	\| > a compression algorithm that requires the user to first
	\| build an infinite library...
	\|
	\| Kind of already exists, pifs. More like a joke, but the
	\| concept is already a joke so...
	\|
	\| https://github.com/philipl/pifs

	\| dataflow wrote:
	\| > I'm going to be the nerd that points out that it has not
	\| been mathematically proven that pi contains every substring
	\|
	\| Fascinating. Do you know if this has been proven about _any_
	\| interesting number (that wasn 't explicitly constructed to
	\| make this true)?

	\| kadoban wrote:
	\| https://en.wikipedia.org/wiki/Normal_number has some
	\| examples and a bunch of info. Most of them are pretty
	\| artificial, but the concatenation of the primes one is...
	\| at least interesting, not obvious (to me) from doing that
	\| that it'd be normal.

	\| _ache_ wrote:
	\| Hmm, you are referring to rich numbers but pointing to
	\| normal numbers, so I will be the nerd who points out that
	\| every normal number is rich, but some rich numbers aren't
	\| normal.
	\|
	\| https://en.wikipedia.org/wiki/Disjunctive_sequence

	\| jokoon wrote:
	\| so barely 2 or 3 times better than xz
	\|
	\| not really worth it

	\| bob1029 wrote:
	\| PPMd is the most exotic compressor I've actually used in
	\| production. The first time I saw it in action I thought it was
	\| lossy or something was broken. I had never seen structured text
	\| compress that well.

	___________________________________________________________________
	(page generated 2026-01-13 09:00 UTC)