This file belongs to the CEP package | Ten plik nale/zy do pakietu CEP
This package is public domain | Pakiet stanowi dobro powszechne
For more info see `0CEP_LIC.ENG' | Wi/ecej informacji w ,,0CEP_LIC.POL''
===========================================================================
`CEPCOP_E.INF' -- ENGLISH DOCUMENTATION

The amount of disk space occupied by bitmap graphics is a well-recognized
problem. For example, a 300dpi A4 picture contains ca. 8 700 000 pixels;
assuming that each CMYK pixel occupies four bytes, one needs ca. 35MB
of disk space to store the picture.

Now, imagine a poor TeX-er who is not allowed to use binary graphic data
(because of the otherwise magnificent DVIPS). The poor TeX-er usually
converts the binary data to hexadecimal EPSes, thus doubling the required
space. Next, after compiling a document with TeX+DVIPS, all the graphic
data is copied into the resulting PS file, so the required space is doubled
again -- altogether 140MB per A4 page. The nightmare begins...
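
The figures above can be checked with a short calculation. The following
Python sketch is only an illustration; the function name a4_storage and its
parameters are made up for this example and are not part of the package.

```python
# Back-of-the-envelope check of the storage figures quoted above.
A4_WIDTH_MM, A4_HEIGHT_MM = 210, 297
MM_PER_INCH = 25.4


def a4_storage(dpi=300, bytes_per_pixel=4):
    """Return (pixels, raw, hex_eps, total) sizes in bytes for one A4 page."""
    width_px = round(A4_WIDTH_MM / MM_PER_INCH * dpi)
    height_px = round(A4_HEIGHT_MM / MM_PER_INCH * dpi)
    pixels = width_px * height_px
    raw = pixels * bytes_per_pixel   # binary CMYK data
    hex_eps = 2 * raw                # two hexadecimal digits per byte
    total = 2 * hex_eps              # the EPS plus its copy inside the PS file
    return pixels, raw, hex_eps, total


pixels, raw, hex_eps, total = a4_storage()
print(pixels, raw, hex_eps, total)
```

With the default arguments this gives about 8.7 million pixels, about 35MB
of raw data, and about 140MB for the EPS and PS files together.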

This problem is not new; it was recognised by Adobe a relatively long
time ago. The Level 2 specification introduced objects called filters,
which enable data compression. In particular, instead of hexadecimal data one
can use ASCII85 encoding (akin to the Unix utilities uuencode and uudecode),
run-length compression, LZW compression, DCT (used in JPEG-compressed files),
and others. Why not make use of these tools? The question is not as silly as
it may look at first glance, as relatively few applications generate
well-compressed PostScript graphics.

We decided to fill this gap. We developed a small package that compresses
``normal'' (non-compressed) graphic data. The problem is more complex,
however, than one might expect. In particular, a universal, always
efficient compression technique does not exist. Hence the package has
several ``buttons'' which control various aspects of compression.
* * *

Our package consists of four AWK programs, CEP.AWK-UNCEP.AWK and
COP.AWK-UNCOP.AWK. CEP.AWK and COP.AWK generate (on-the-fly) PostScript
programs which, processed by Ghostscript, perform the appropriate data
compression. UNCEP and UNCOP accomplish (using a similar technique) the
reverse process, i.e., decompression.

CEP is designed for the compression of ordinary bitmap EPS files containing
a single, hexadecimally coded image; COP can be used to compress any
PostScript data.

The question arises: why use two packing techniques? The answer is
simple: the efficiency of compression is higher if a compressing program
knows in advance which kind of data are to be expected. In general, bitmaps
are more regular (redundant) than arbitrary PostScript data, hence even
simple algorithms turn out to be more efficient.
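
To illustrate why regular bitmaps suit simple algorithms, here is a minimal
Python sketch of run-length coding in the byte format of the PostScript
RunLengthEncode filter (a length byte 0-127 announces 1-128 literal bytes,
a length byte 129-255 announces a repeated byte, 128 marks end of data).
It is an illustration only: the package itself relies on the Ghostscript
filters, and the names rle_encode and rle_decode are hypothetical.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data in the RunLengthEncode byte format (illustrative sketch)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # Measure the run of identical bytes starting at i (at most 128).
        run = 1
        while i + run < n and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            out += bytes([257 - run, data[i]])   # replicated byte
            i += run
        else:
            # Collect literal bytes until a run starts (at most 128).
            start = i
            i += 1
            while (i < n and i - start < 128
                   and not (i + 1 < n and data[i] == data[i + 1])):
                i += 1
            out.append(i - start - 1)            # literal count minus one
            out += data[start:i]
    out.append(128)                              # end-of-data marker
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Decode the byte format produced by rle_encode."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        length = encoded[i]
        i += 1
        if length == 128:                        # end-of-data marker
            break
        if length < 128:                         # literal bytes follow
            out += encoded[i:i + length + 1]
            i += length + 1
        else:                                    # one byte, repeated
            out += encoded[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# A row typical of a screen dump: long uniform stretches, little detail.
row = b"\x00" * 100 + b"\xff" * 20 + b"\x12\x34\x56"
packed = rle_encode(row)
print(len(row), len(packed))
```

Data without repeated bytes, by contrast, grows slightly under this scheme,
since every block of up to 128 literal bytes costs one extra length byte.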

Tests show that in the best case (screen dumps) squeezing a file down to 10%
of the original size is nothing unusual. Sometimes, however, no compression
method gives a satisfactory result. In such a case, one can always encode
the data with the ASCII85 filter alone, which reduces the size of
a hexadecimal bitmap by approximately 35%.
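
The gain from ASCII85 alone follows from the encoding ratios: hexadecimal
coding uses two characters per byte, ASCII85 five characters per four bytes.
The following Python sketch compares the two with the standard library
(binascii.hexlify and base64.a85encode); the sample data and the function
name encoded_sizes are assumptions made for this example, and line breaks
and end-of-data markers are ignored.

```python
import base64
import binascii


def encoded_sizes(data: bytes):
    """Return (hex_size, ascii85_size) in characters, ignoring line breaks."""
    hex_size = len(binascii.hexlify(data))
    a85_size = len(base64.a85encode(data))
    return hex_size, a85_size


# 4000 bytes without zero bytes, so that the `z' shortcut is never used.
sample = bytes(range(1, 201)) * 20
hex_size, a85_size = encoded_sizes(sample)
saving = 1 - a85_size / hex_size
print(hex_size, a85_size, saving)
```

For this sample the idealized saving is 37.5%, which is in the same range
as the figure quoted above.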

Below we give a brief description of CEP and COP. So far, only the MS DOS
version of the PS-compressors is available. This version uses the GAWK-EMX.EXE
implementation of AWK and the GS386.EXE Ghostscript interpreter.

We tested the package with several Ghostscript and GAWK implementations;
at present we use Ghostscript 5.10 and GAWK 3.0.3.

==========================================================================
C E P AND U N C E P
==========================================================================

The CEP subpackage consists of the MS DOS batch files CEP.BAT and UNCEP.BAT
and the AWK programs CEP.AWK and UNCEP.AWK. First, AWK inspects the source
EPS file, doing its best to locate the hexadecimal bitmap; next, it creates
an appropriate PostScript program; then control is passed to Ghostscript,
which simply executes the submitted program: it encodes the bitmap and
copies the remaining lines verbatim. The original preamble is
slightly modified; nevertheless, all DSC comments are left intact.

If the bitmap cannot be found or AWK suspects that trouble may arise,
the CEP engine gives up.

The resulting file should be verified before the original is removed,
as the CEP heuristics may fail to locate the bitmap properly; moreover,
due to GS bugs, premature removal of the source may also be painful.

CEP never generates binary output -- only hexadecimal and ASCII85 encodings
are supported. This is because CEP-compressed EPS files are
primarily meant to be used in the context of TeX+DVIPS. Nevertheless,
the resulting files can be used in other typesetting systems as so-called
placeable EPSes. The applicability to non-TeX applications, however,
is somewhat limited, as binary TIFF previews may be misinterpreted by (G)AWK.

UNCEP requires that a CEP-compressed file has not been changed. In particular,
it relies on the information in a quasi-DSC comment `%UNCEPInfo:'. This
information can be invalidated by a seemingly innocent modification (e.g., by
adding or removing a comment line). Note that the technique employed by CEP
destroys, by its nature, the information about the line-breaking structure
of the hexadecimal bitmap. Therefore, UNCEP cannot restore the original file
byte for byte. The line-breaking structure poses no problem for a PostScript
interpreter. There exist programs, however, that read their own bitmap EPS
files and, for unknown reasons, depend on such (sub)lexical information;
Aldus PhotoStyler is a notable example.

CEP USAGE: cep.bat input_file output_file [options]
  the program recognizes the following options:
    8      -- use ASCII85 coding (default)
    h or H -- use HEX (hexadecimal) coding
    r or R -- use RLE (RunLength) compression (default)
    l or L -- use LZW compression
    f or F -- use Flate compression (non-standard!)
    n or N -- don't compress
  NOTE: names of input_file and output_file must differ.

UNCEP USAGE: uncep.bat input_file output_file
  NOTE: names of input_file and output_file must differ;
        decompression and decoding method is taken from input file.

==========================================================================
C O P AND U N C O P
==========================================================================

The COP subpackage consists of the MS DOS batch files COP.BAT and UNCOP.BAT,
and the AWK programs COP.AWK and UNCOP.AWK. COP reads the supplied data and
encodes it appropriately. No analysis of the PostScript data is performed, as
the entire file is encoded without changing even a bit. The only aspect that
is taken into account is the DSC comment `%%BoundingBox:'; if it is found, COP
inserts this comment in the preamble, otherwise the resulting file does not
contain the bounding box information.
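
The search for the bounding box can be pictured as follows. This Python
sketch is only an assumption about the general idea; it does not reproduce
the logic of COP.AWK, and the function name find_bounding_box is
hypothetical.

```python
def find_bounding_box(ps_data: bytes):
    """Return the value of the first %%BoundingBox: comment, or None."""
    key = b"%%BoundingBox:"
    # splitlines() accepts LF, CR and CR+LF line ends alike.
    for line in ps_data.splitlines():
        if line.startswith(key):
            return line[len(key):].strip().decode("latin-1")
    return None


sample = (b"%!PS-Adobe-2.0 EPSF-1.2\r\n"
          b"%%BoundingBox: 0 0 540 150\r\n"
          b"%%EndComments\r\n")
print(find_bounding_box(sample))
```

When the function returns None, the preamble of the compressed file would
simply carry no bounding box comment, as described above.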

COP-generated files are readable by any PostScript Level 2 interpreter.

UNCOP scans the header and deduces from it the method of decompression, hence
no options are needed. UNCOP, unlike UNCEP, restores the original file
exactly. It is still recommended, however, that the user verify that the
resulting file is properly interpreted by GS. Due to GS bugs, premature
removal of the source file after compression or decompression may turn out
to be painful.

Since COP can be used to compress any data for arbitrary applications, binary
encoding is also allowed. The resulting files can be used in typesetting
systems that accept so-called placeable EPSes. Unfortunately, a binary TIFF
preview makes the compressed file unreadable for a PostScript interpreter.

COP USAGE: cop.bat input_file output_file [options]
  the program recognizes the following options:
    8      -- use ASCII85 coding (default)
    b or B -- use binary coding
    h or H -- use HEX (hexadecimal) coding
    r or R -- use RLE (RunLength) compression (default)
    l or L -- use LZW compression
    f or F -- use Flate compression (non-standard!)
    n or N -- don't compress
  NOTE: names of input_file and output_file must differ;
        observe that binary encoding is, in fact, no encoding at all.

UNCOP USAGE: uncop.bat input_file output_file
  NOTE: names of input_file and output_file must differ;
        decompression and decoding method is taken from input file.

=============================================================================
A HEAP OF REMARKS CONCERNING C E P AND C O P
=============================================================================

The applied solution addresses several problems:

* It is not at all obvious how to determine syntactically
where a hexadecimal bitmap begins in an EPS file; semantic analysis
(by redefining the PostScript primitives image, imagemask and colorimage)
is possible, but it also has its limitations; in the end, we decided
to recognize a bitmap syntactically, which raised the problem of
recognizing such artefacts as `add' or `def', which look like
fragments of a bitmap but, in fact, are not.

* Also, it is not obvious which compression method should be applied to
given data; usually, ASCII85 encoding is advisable; for pure bitmaps
(CEP) RLE compression is satisfactory, although the LZW and Flate filters
usually produce much better results (the latter seems to be the best);
nevertheless, both LZW and Flate encodings have limited usability:
(a) LZW encoding is not implemented in GS ver. 4.x due to USA
patent law; Aladdin implemented an LZW-compatible filter instead,
which produces non-compressed data (in fact, enlarged by some 10%)
readable by any LZWDecode filter. You can use an old GS version,
or compile a GS version containing the real LZW filter at your
own risk, but...
(b) Flate encoding (the same that is used in GZIP) is not available
(yet?) on PostScript phototypesetters -- in the Ghostscript
documentation one can find a moderately encouraging passage:
``Ghostscript also supports the as yet undocumented
FlateEncode and FlateDecode filters from PDF 1.2
and (presumably) PostScript Level 3''
As a rule of thumb we would suggest not using any compression, only
ASCII85 encoding, for detailed colour photo images. This is simply
a weakness of all lossless techniques -- the algorithms employed by ARJ,
ZIP, LHARC, and others would also yield poor results. A reasonable
alternative for data of this kind would be DCT (JPEG) compression.

* As mentioned above, ASCII85 encoding can usually be recommended;
it brought, however, some troubles of its own. First, due to GS bugs, we
decided to add the (dummy) NullEncode filter, which seems to cure the problem.
But there is one more problem: ASCII85 encoded bitmaps may contain
lines looking like DSC comments, i.e., they may begin with a double percent
sign, %%, or with a percent-exclamation pair, %! (why didn't Adobe
exclude the percent sign from ASCII85?). Some programs may try to
interpret such maybe-DSC lines. For example, DVIPS simply removes
such lines unless option -K0 is used; on the other hand,
leaving DSC-like lines intact may confuse document managers.

* It would be convenient to have some more filters implemented, in
particular DCT and CCITTFax; both of them, however, require
additional input parameters, which makes using them more complex; moreover,
it is not clear whether one can find the optimal compression parameters
for DCT without a WYSIWYG program; we are considering the possibility of
a one-to-one conversion between JPEG files and EPS files making use of
DCT filters; a similar conversion between GIF files and EPS files
making use of LZW filters can perhaps also be implemented.

* The package takes care of the working disk space -- no large temporary
files are created; roughly, the needed disk space is equal to the size
of the source + the size of the target.

* In order to check whether a given phototypesetter is a genuine
PostScript Level 2 interpreter, a trial-and-error method is necessary,
since many commercial PostScript devices only claim to be Level 2
compatible. The following file may be helpful for verifying
the claims of the producer of a PostScript device:

%!PS-Adobe-2.0 EPSF-1.2
%%Pages: 1
%%BoundingBox: 0 0 540 150
%%EndComments
/Helvetica 8 selectfont
90 rotate
1 2 moveto
(*)
{0 -10 rmoveto gsave show grestore}
255 string
/Filter
resourceforall
showpage
%%EOF

Running this program yields the list of filters for a given device.
An error reported while processing this file shows that
the device is not Level 2 compatible. In such a case, the use of the
CEP package should be abandoned.

* Bugs and traps:

(a) Apparently prepending `flushfile' to `closefile' neutralizes an
error in GS 3.x (tail of output swallowed).

(b) Adding (a dummy) NullEncode filter neutralizes (probably) another
GS bug: the ASCII85Encode filter with a target procedure may produce
superfluous EOD marks, i.e., ~> (if things go really bad you can
obtain thousands of them). Using a target procedure instead of
a file object excludes GS ver. < 3.x, because early Ghostscripts
didn't support all features of PostScript Level 2. Nevertheless,
GS ver. >= 2.6 can be used for compression with hexadecimal encoding
(it has the ``legal'' LZW compression).

(c) The target procedure mentioned in (b), in turn, is needed for the special
treatment of ASCII85 encoded lines looking like DSC comments;
this special treatment consists in breaking such lines after the first
percent character. It is dedicated to the DVIPS driver, which has a dangerous
option `remove comments' (-K1).

(d) an artificial form of quitting `{2 2 .quit}' instead of `{2 .quit}'
is due to an infinite loop of GS 3.5x caused by the latter form.
The GS internal operation `.quit' was chosen to provide error
handling at the level of operating system.

(e) still, there exist bugs in older Ghostscripts that we were not able
to neutralize; e.g., some EPS files are properly compressed
by GS 2.6, but GS 2.6 breaks while displaying them; GS 3.51
behaves similarly with other bitmaps. So far, GS 4.x seems to be
the most resistant to the ``filter trial,'' but it also reveals
some deficiencies.

(f) Summing up, we would strongly recommend using GS 4.x or 5.x
(possibly with LZWEncode compiled in) and GAWK 3.x: GS 4.x is a
nearly complete implementation of Level 2 PostScript;
GAWK 3.x provides regular expressions for record separators,
which makes it possible to handle end-of-lines in exactly
the same manner as PostScript does and, moreover, is more reliable
than earlier versions.
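
The treatment of ASCII85 lines that look like DSC comments, mentioned in the
remarks above, can be sketched in a few lines. The package does this inside
a PostScript target procedure; the Python version below only illustrates
the idea, and the function name protect_dsc_lookalikes is hypothetical.
Since an ASCII85 decoder ignores white space, the extra line breaks do not
change the decoded data.

```python
def protect_dsc_lookalikes(lines):
    """Break lines starting with '%%' or '%!' after their first '%'."""
    out = []
    for line in lines:
        # Repeat the split while the remainder still looks like a DSC comment.
        while line.startswith(("%%", "%!")):
            out.append("%")
            line = line[1:]
        out.append(line)
    return out


encoded = ["%%abc", "de%%f", "%!xyz", "%%%q"]
for line in protect_dsc_lookalikes(encoded):
    print(line)
```

After this step no output line begins with `%%' or `%!', while the
concatenation of all lines is the same as before.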

==========================================================================
H I S T O R Y
==========================================================================

CEP+UNCEP:
0.10 -- 16.03.97 -- first version
0.20 -- 05.04.97 -- some obvious bugs removed
0.30 -- 11.04.97 -- new method of prolog modification (processing complex
prologs is enabled), and merging output file in
PostScript (faster and less disk space needed)

0.35 -- 13.04.97 -- comments added (bilingual version)
0.40 -- 14.04.97 -- significant improvement of performance
0.50 -- 15.04.97 -- strings allocated statically, temporary files not
created, (speed improved and demand for disk space
slashed)
0.60 -- 19.04.97 -- PostScript error handling added, some GS bugs
neutralized
0.65 -- 20.04.97 -- exit code added, frame documentation provided
0.70 -- 21.04.97 -- UNCEP added
0.75 -- 24.04.97 -- problems of end-of-data and end-of-lines fixed;
documentation collected
1.00 -- 02.05.97 -- public domain release (BachoTeX '97)
1.03 -- 07.01.98 -- documentation touched, package (CEP.AWK) more robust

COP+UNCOP:
0.10 -- 06.04.97 -- first version
0.20 -- 12.04.97 -- program structure unified with CEP, "cvx exec" used
in place of "run"
0.25 -- 13.04.97 -- comments added (bilingual version)
0.30 -- 15.04.97 -- strings allocated statically (speed improved)
0.40 -- 19.04.97 -- PostScript error handling added, some GS bugs
neutralized
0.45 -- 20.04.97 -- exit code added, frame documentation provided
0.50 -- 24.04.97 -- problems of end-of-lines fixed; documentation collected
1.00 -- 02.05.97 -- public domain release (BachoTeX '97)
1.03 -- 07.01.98 -- documentation touched, package (CEP.AWK) more robust

==========================================================================
V O C A B U L A R Y
==========================================================================

Ghostscript, GS -- a magnificent interpreter of the PostScript language
by Aladdin Enterprises, available as a free public license product;
its current version (4.03) turns out to be much more reliable
than quite a few commercial interpreters.
AWK -- a utility and a programming language for convenient and efficient
batch data-reformatting; written in 1977 by Alfred V. Aho,
Peter J. Weinberger, and Brian W. Kernighan.
GAWK -- GNU AWK, the Free Software Foundation's implementation of AWK,
written in 1986 by Paul Rubin and Jay Fenlason, with advice
from Richard Stallman.
GNU -- the free software project of the Free Software Foundation (FSF),
a non-profit organization dedicated to the production and distribution
of freely distributable software, founded by Richard M. Stallman.
TeX -- public domain typesetting system by Donald E. Knuth of
Stanford University
DVIPS -- TeX-to-PostScript driver by Tomas Rokicki of Stanford University
DSC -- Document Structuring Conventions -- Adobe's standard
for structuring PostScript documents.
ASCII85 -- PostScript algorithm for coding binary data as 7-bit ASCII
text consisting of only printable characters; encodes every
four bytes as five characters from `!' to `u'; additionally,
`z' is used to code four zero bytes (see PostScript Language
Reference Manual, second edition, pp. 128--130)
RLE -- run length encoding -- a standard method of data compression
(see PostScript Language Reference Manual, second edition,
pp. 133--134)
LZW -- a data compression algorithm by J. Ziv and A. Lempel (1978),
improved by T. Welch (1984); Unisys, at the time Welch's employer,
was granted a US patent in 1985 on Welch's algorithm; a grandfather
clause was established by Unisys to make pre-1995 implementations
of LZW code free of royalty requirements, thereby eliminating such
claims on UNIX compress (information after Nelson H. F. Beebe)
DCT -- discrete cosine transform compression, an elaborate, very
efficient but lossy compression scheme
JPEG -- Joint Photographic Experts Group, an organization responsible
for developing an international standard for compression of
image data; PostScript (Level 2) DCTEncode filter conforms
to the JPEG-proposed standard.
GZIP -- compression tool by the GNU project of the Free Software Foundation,
based on a superior and unpatented compression algorithm, developed in order
to avoid the patented LZW algorithm.
=======================================================================
END OF `CEPCOP_E.INF'