Compression fever

* * * * *

Compression fever

It's one thing to trick someone on April Fools' Day [1], but to trick
yourself really takes talent.

I have that talent.

Sigh.

I started this last week and didn't suspect a thing.

I came across The Hutter Prize [2], which will pay up to 50,000€ to compress
a 100,000,000 byte file to less than 16,481,655 bytes (which includes the
decompression program). The rules [3] are straightforward, and to me, it
certainly seems easier than The Netflix Prize [4].

At the very least, it couldn't hurt to play around a bit.

So last Thursday I downloaded the file to be compressed and played around
with it. The file itself is nothing more than Wikipedia entries wrapped in
XML (eXtensible Markup Language). The XML only makes up 2.4% of the file—the
rest is text (13.5% is nothing but the space character; 8% is the letter
“e”).

Friday I started coding. Nothing difficult and it was pretty straightforward
(although while the algorithm for Huffman encoding [5] is pretty simple,
writing it was surprisingly difficult; it took about five attempts before I
was happy with the code). I did a few benchmarks against gzip and bzip2 and
was surprised with the results:

Table: My Sooperseekrit compression algorithm vs. gzip and bzip2
Program Size of archive
------------------------------
orginal file 100,000,000
gzip 36,445,248
bzip2 29,008,736
my attempt 19,774,963

While not less than the required 16,481,655 bytes, it is in the ballpark, and
I was quite surprised to beat the snot out of bzip2.

Not bad for a few hours of work.

So today, I'm hacking around with my program when I notice something odd—some
of the Huffman encodings are the same.

That's not a good sign—each encoding should be unique. It takes a while (it
takes almost 7½ minutes to compress 100,000,000 bytes with my program) but I
find the problem—a bug in the encoding routine. Building the Huffman encoding
table was fine—it was reading the encoded values that was buggy (and dropping
bits).

How buggy?

Table: My fixed Sooperseekrit compression algorithm vs. gzip and bzip2
Program Size of archive
------------------------------
orginal file 100,000,000
gzip 36,445,248
bzip2 29,008,736
my attempt 35,947,593

Um … yeah.

It's not the first time my hopes for vast fortunes were dashed because of a
bug fix [6].

Sigh.

[1] http://en.wikipedia.org/wiki/April_Fools'_Day
[2] http://prize.hutter1.net/
[3] http://prize.hutter1.net/hrules.htm
[4] http://www.netflixprize.com/index
[5] http://en.wikipedia.org/wiki/Huffman_encoding
[6] gopher://gopher.conman.org/0Phlog:2005/06/15.1

Email author at [email protected]