Development History Of a File Compression Utility

Development History Of a File Compression Utility
by Richard Greenlaw
251 Colony Ct.
Gahanna, Ohio 43230

May 18, 1981
Revised August 29, 1981

Introduction:

The file compression system consists of two programs, SQ and
USQ, meaning squeeze and unsqueeze. They are written in the
C language for the BDS C compiler. The executable files are
SQ.COM and USQ.COM, which are self-sufficient to run under
the CP/M operating system and consist of 8080 machine code.

SQ.COM is compiled from files SQ.C, SQDIO.C, TR1.C, TR2.C,
IO.C, SQ.H, DIO.H and SQCOM.H. USQ.COM is compiled from
files USQ.C, USQDIO.C, UTR.C, USQ.H and (again) DIO.H and
SQCOM.H. Both COM files also include standard library
functions and the BDS-C run-time package.

SQDIO.C and USQDIO.H provide i/o redirection and pipe
simulation and are just the BDS dio package renamed to
produce distinct CRL files corresponding to the different
addresses of external variables with which they are
compiled.

The SQ program builds a squeezed file by applying two
transformations:

First, byte values which are repeated consecutively three or
more times are reduced to the value, the token DLE
(delimiter), and a count. The penalty is that occurrances of
DLE are encoded as DLE, zero.

Second, the Huffman algorithm encodes each resulting byte
value or endfile as a bit string having length inversely
proportional to its frequency of occurrance. This is a
complex process requiring reading the input file twice.

The squeezed file contains various information to allow the
USQ program to decode it and recreate the original file
exactly.

The environment:

The programs should be nearly portable. The CP/M system
actually used is single user 2MHz 8080 without interrupts.

The BDS C compiler supports a subset of C. It does not
support register variables, long integers or floats. That
leads to complexity in collecting and processing the
frequencies of occurance of the various byte values being
encoded.

Outline of SQ

The interesting work begins in function squeeze in file
SQ.C. In the first pass, init_huff in file TR2.C reads the
input through the first encoder, getcnr, in file TR1.C,
collects the frequency distribution and builds the decoding
and encoding structures. Then wrt_head in file TR2.C writes
control information and the decoding structure to the output
file.

In the second pass, encoded bytes consisting of bits and
pieces of bit strings are generated by function gethuff in
file TR2.C and are simply written to the output by squeeze.

Development History of SQ

There have been seven operational pre-release versions of
SQ. The motive for change in each case was primarily
increased execution speed, although the conveniences of
operating on lists of files, automatic name generation for
squeezed files, and output drive specifiers were also added
in the later versions.

Early versions called the following chain of functions to
get a byte of encoded data: gethuff, getbit, get_cnr,
getc_crc and getc. It wrote through putce and putc. That's a
lot of function calling. In addittion, gethuff and getbit
were passed pointers to functions to identify get_cnr.

Actually, those versions used a dummy function for get_cnr,
the repeated value encoder, although the actual code was
present. This was to simplify debugging and because USQ did
not yet have the inverse of that translation.

The benchmark for comparisons was not consistent because
files were lost at two points. In effect, the current SQ.COM
squeezed itself! It typically acheived 6-7% compression on
a machine code file of 8K to 10K bytes. Of course machine
code is not a practical case, but it is a rugged workout.
Text files are compressed 33% to 46% depending on the
richness and distribution of the alphabet.

V0, for which listings have not survived, took 5:10 (five
minutes, 10 seconds) to squeeze itself! This was improved to
4:23 by the optimizer option of the compiler, which simply
generates in-line code rather than subroutine calls for all
local and external variable accesses. It was further
improved to 4:18 (and restored to its original length) by
the -e compiler option which specifies the origin of the
external variable area to allow direct addressing. (The BDS
linker resolves only function names - externals are actually
like FORTRAN COMMON and are normally accessed relative to a
pointer kept in RAM!).

Subsequent improvements came mostly from recoding the key
routine. Copies of gethuff and its partner getbit are
attached for versions V1 through V6 and the complete
listings (20 pages) for V6 are included.

In V0 through V2, gethuff forms an output byte by calling
getbit eight times and packing the bits. This is the obvious
method because the Huffman translation produces variable
length bit strings, not a byte for a byte.

V1 introduced the variable codebyte to the getbit function.
It was rotated each time a bit was removed, so that
subsequent calls had to shift it only one bit position. This
involved considerable change. Timing is uncertain now.

V2 continued to improve the getbit function by customizing
the three basic cases and providing seperate returns from
each to avoid unnecessary work.

The changes of V1 and V2 reduced run time to 1:41, a
whopping 61% reduction!

V3 incooperated getbit into gethuff. This wasn't difficult
because getbit was called only once by gethuff. It ran in
1:27 (on a slightly smaller file), another 14% reduction.

V4 removed the pointers to functions mentioned earlier and
substituted direct calls. However, at this point the real
translation for repeated values was enabled. The net result
was a slight loss of ground to 1:30, but more productive work.

V0 through V4 worked from Huffman code bit strings of
indefinite length accessed through an array of pointers.
Each string was byte alligned (unlike the final encoded
data).

V5 was a complete redesign of the storage and retreival of
the array of code strings. I had finally succeeded in
proving* that the maximum length code string would fit in the
same space as the sum of all frequency counts, so scaling in
init_huff was made more rigorous to fit them into unsigned
integers (16 bits).

* The proof was proven wrong in practice at least for the first
implementation of the algorithm. Sq 1.5 (8/29/81) tries harder
to generate codes no longer than it can handle (16 bits)
and if it fails at this it fudges the counts and tries
again.

This redesign paved the way for a relatively simple method
of processing the code strings several bits at a time,
rather than singly in an eight pass loop to form an output
byte.

At this this point the fancy file name processing, etc.,
were added, increasing the size of SQ.COM from 7680 bytes to
10,112 bytes, an increase of 32% in the work performed by
the "benchmark". V5 ran in 1:40, which scales to 1:16, a
reduction of 16%. In a second variant, changing the variable
cbitsrem to a char from an integer saved another 5%.

V6 restructures the gethuff of V5, replacing the while loop
with a custom (goto) loop with the exit condition tested
only in a special case. The two basic cases also do only the
work necessary to their cases. Also, squeeze in SQ.C calls
putc directly and does its own check for write failure,
saving one layer of function calls. It ran in 1:29 (scales
to 1:08), a reduction of 6% from the second variant of V5.

The overall performance improvement ratio, scaled for the
one major change in the benchmark workload (but not taking
credit for the enabling of the repeated character encoding)
was about 4.5 : 1, or a reduction of 78%. The true
improvement was probably a factor of 5.