# Djvu and CBZ for easy portable documents

by Seth Kenlon

Recently, I discovered that my great-great-grandfather wrote two books near the turn of the previous century: one about sailing and the other about his career as [New York City's Fire Chief](https://www.fireengineering.com/articles/print/volume-56/issue-27/features/chief-john-kenlon-of-new-york-city.html). The books have a niche audience, but since they are part of my family history, I decided to try to preserve a digital copy of each. But what portable document format is best suited for such an endeavour?

I decided early on that PDF was not an option. The format, while good for printing pre-flight, seems condemned to nonstop feature bloat, and it produces documents that are difficult to introspect and edit. I wanted a smarter format with similar features.

## Comic Book archive

Comic book archives are a simple format most often used, as the name suggests, for comic books. You can see functional examples of comic book archives on sites like <a href="https://comicbookplus.com/" target="_blank">comicbookplus.com</a> and <a href="https://digitalcomicmuseum.com/" target="_blank">digitalcomicmuseum.com</a>.

The greatest feature of a comic book archive is also its greatest weakness: it's so simple it's almost more of a convention than a format. In fact, a comic book archive is just a ZIP, TAR, 7Z, or RAR archive given the extension ``.cbz``, ``.cbt``, ``.cb7``, or ``.cbr``, respectively. It has no standard for storing metadata.

They are, however, very easy to create.

1. Create a directory full of image files, and rename the images so that they have an inherent order:

$ n=0 && for i in *.png ; do mv "$i" "$(printf %04d $n).png" ; n=$((n+1)) ; done

2. Archive the files using your favourite archive tool. In my experience, CBZ is best supported.

$ zip comicbook.zip -r *.png

3. Finally, rename the file with the appropriate extension:

$ mv comicbook.zip comicbook.cbz
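
The three steps above can be rolled into a pair of shell functions. This is only a sketch: ``stage_pages`` and ``make_cbz`` are hypothetical helper names, and it assumes PNG pages whose current names already sort in reading order, plus the ``zip`` command. It copies pages into a scratch directory instead of renaming them in place, so the original scans are left untouched.

```shell
# stage_pages SRC DEST: copy SRC/*.png into DEST as 0000.png, 0001.png, ...
# (hypothetical helper; glob order decides page order)
stage_pages() {
    sp_n=0
    mkdir -p "$2" || return 1
    for sp_f in "$1"/*.png; do
        [ -e "$sp_f" ] || continue          # no PNGs: the glob stays literal
        cp "$sp_f" "$2/$(printf '%04d' "$sp_n").png" || return 1
        sp_n=$((sp_n + 1))
    done
}

# make_cbz SRC OUT.cbz: stage the pages, zip them, and name the result .cbz
make_cbz() {
    mc_tmp=$(mktemp -d) || return 1
    stage_pages "$1" "$mc_tmp/pages" &&
        (cd "$mc_tmp/pages" && zip -q "$mc_tmp/book.zip" *.png) &&
        mv "$mc_tmp/book.zip" "$2"
    mc_status=$?
    rm -rf "$mc_tmp"                        # clean up the scratch directory
    return "$mc_status"
}
```

With the functions loaded, ``make_cbz ./scans comicbook.cbz`` would build the archive from a directory called ``scans``.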

The resulting file will likely open on most of your devices. On Linux, both [Evince](https://wiki.gnome.org/Apps/Evince) and [Okular](https://okular.kde.org) can open CBZ files. On Android, [Document Viewer](https://f-droid.org/en/packages/org.sufficientlysecure.viewer/) and [Bubble](https://f-droid.org/en/packages/com.nkanaev.comics/) open them.

### Uncompressing comic book archives

Getting your data back out of a comic book archive is also easy: just unarchive the CBZ file.

Since your archive tool of choice may not recognize the ``.cbz`` extension as a valid archive, it's safest to rename it back to its native extension:

$ mv comicbook.cbz comicbook.zip
$ unzip comicbook.zip
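
If you handle more than one flavour of comic book archive, the extension pairs listed earlier can be captured in a small lookup. The sketch below uses two hypothetical helpers, ``native_ext`` and ``restore_ext``, and copies the file rather than renaming it so the original archive stays in place.

```shell
# native_ext FILE: print the native archive extension for a comic book archive
# (hypothetical helper; the pairs are the ones listed earlier in this article)
native_ext() {
    case "$1" in
        *.cbz) echo zip ;;
        *.cbt) echo tar ;;
        *.cb7) echo 7z ;;
        *.cbr) echo rar ;;
        *) return 1 ;;                      # not a comic book archive
    esac
}

# restore_ext FILE: copy FILE to a sibling that carries its native extension
restore_ext() {
    re_ext=$(native_ext "$1") || {
        echo "not a comic book archive: $1" >&2
        return 1
    }
    cp "$1" "${1%.*}.$re_ext"
}
```

For example, ``restore_ext comicbook.cbz`` leaves a ``comicbook.zip`` next to the original, ready for ``unzip``.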

## Djvu

A more advanced format, developed over 20 years ago by AT&T, is DjVu (pronounced "déjà vu"). It's a digital document format with advanced compression technology and is viewable in more applications than you probably realise, including [Evince](https://wiki.gnome.org/Apps/Evince), [Okular](https://okular.kde.org), [djvu.js](http://djvu.js.org/) online, the Firefox extension [djvu.js viewer](https://github.com/RussCoder/djvujs), [GNU Emacs](https://elpa.gnu.org/packages/djvu.html), and applications like [Document Viewer](https://f-droid.org/en/packages/org.sufficientlysecure.viewer/) on Android.

An open source, cross-platform viewer [djview](http://djvu.sourceforge.net/djview4.html) is also available from Sourceforge.

You can read more about Djvu, and find sample ``.djvu`` files, at [djvu.org](http://djvu.org).

Djvu has several appealing features, including image compression, outline (bookmark) structure, and support for embedded text. It's easy to introspect and easy to edit using free and open source tools.

### Installing

The open source toolchain is [djvulibre](http://djvu.sourceforge.net), which you can find in your distribution's software repository. For example, on Fedora:

$ sudo dnf install djvulibre

### Creating a Djvu file

A ``.djvu`` is any image that has been encoded as a Djvu file. It's a feature of Djvu that a ``.djvu`` can contain one or more images (stored as "pages").

To manually produce a Djvu, you can use one of two encoders: ``c44`` for high quality images or ``cjb2`` for simple bitonal images. Each encoder accepts a different image format: ``c44`` can process ``.pnm`` or ``.jpeg`` files, while ``cjb2`` can process ``.pbm`` or ``.tiff`` images.

If you need to preprocess an image, you can do that in a terminal with [Image Magick](https://www.imagemagick.org/), using the ``-density`` option to define your desired resolution:

$ convert -density 200 foo.png foo.pnm

Then you can convert to Djvu:

$ c44 -dpi 200 foo.pnm foo.djvu

If your image is simple, like black text on a white page, you can try to convert it using the simpler encoder. If necessary, use Image Magick to first convert it to a compatible intermediate format:

$ convert -density 200 foo.png foo.pbm

And then convert to Djvu:

$ cjb2 -dpi 200 foo.pbm foo.djvu

You have now created a simple single-page ``.djvu`` document.
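
The choice between the two encoders can be scripted by looking at the file extension. The following is a minimal sketch, not part of DjVuLibre: ``pick_encoder`` and ``encode_page`` are hypothetical helpers, the 200 dpi default simply mirrors the examples above, and treating ``.jpg`` and ``.tif`` as aliases is my own assumption.

```shell
# pick_encoder FILE: print the DjVuLibre encoder suited to FILE's format,
# following the pairing described above (hypothetical helper)
pick_encoder() {
    case "$1" in
        *.pnm|*.jpeg|*.jpg) echo c44 ;;     # .jpg is an assumed alias of .jpeg
        *.pbm|*.tiff|*.tif) echo cjb2 ;;    # .tif is an assumed alias of .tiff
        *) return 1 ;;                      # needs converting first
    esac
}

# encode_page FILE [DPI]: encode FILE to a .djvu of the same base name
encode_page() {
    ep_enc=$(pick_encoder "$1") || {
        echo "unsupported input: $1" >&2
        return 1
    }
    "$ep_enc" -dpi "${2:-200}" "$1" "${1%.*}.djvu"
}
```

Running ``encode_page foo.pbm`` would call ``cjb2``, while ``encode_page foo.pnm 300`` would call ``c44`` at 300 dpi.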

### Creating a multi-page Djvu file

While a single page Djvu can be useful given Djvu's sometimes excellent compression, the most common use is as a multi-page format.

Assuming you have a directory of many ``.djvu`` files, you can bundle them together with the ``djvm`` command:

$ djvm -c mybook.djvu pg_1.djvu two.djvu 003.djvu

Unlike a CBZ archive, the names of the bundled images have no effect on their order in the Djvu document. The first argument is the document to create, and the pages that follow are added in the order you list them. If you had the foresight to name them in a natural sorting order (001.djvu, 002.djvu, 003.djvu, 004.djvu, and so on), then you can use a wildcard instead:

$ djvm -c mybook.djvu *.djvu
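
One thing to watch with a wildcard is that ``*.djvu`` also matches the bundled document itself once it exists in the same directory. A minimal sketch of a guard against that follows; ``bundle_pages`` is a hypothetical helper, and the ``DJVM`` variable is an assumption of mine that lets you substitute another command for a dry run.

```shell
# bundle_pages OUT.djvu: bundle every other .djvu file in the current
# directory into OUT.djvu, in glob (sorted) order. Hypothetical helper.
# DJVM can be overridden, for example DJVM=echo, to preview the command.
bundle_pages() {
    bp_out=$1
    set --                                  # reuse "$@" as the page list
    for bp_f in *.djvu; do
        [ -e "$bp_f" ] || continue          # no matches: the glob stays literal
        [ "$bp_f" = "$bp_out" ] && continue # never feed the output to itself
        set -- "$@" "$bp_f"
    done
    [ "$#" -gt 0 ] || { echo "no pages found" >&2; return 1; }
    "${DJVM:-djvm}" -c "$bp_out" "$@"
}
```

Calling ``bundle_pages mybook.djvu`` a second time rebuilds the book from the page files only, even though ``mybook.djvu`` is now sitting in the same directory.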

## Manipulating a Djvu document

Editing Djvu documents with ``djvm`` is easy. For instance, you can insert a page into an existing Djvu document:

$ djvm -i mybook.djvu newpage.djvu 2

In this example, the page ``newpage.djvu`` becomes the new page 2 in the file ``mybook.djvu``.

You can also delete a page. For example, to delete page 4 from ``mybook.djvu``:

$ djvm -d mybook.djvu 4

### Setting an outline

You can also add metadata to a Djvu file, such as an outline (commonly called "bookmarks"). To do this manually, create a plain text file containing the document's outline. A Djvu outline is expressed in a LISP-like structure, with an opening ``bookmarks`` element followed by a bookmark name and page number:

(bookmarks
("Front cover" "#1")
("Chapter 1" "#3")
("Chapter 2" "#18")
("Chapter 3" "#26")
)

The parentheses define levels in your outline. The outline currently has only top-level bookmarks, but any section can have a subsection by delaying its closing parenthesis. For example, to add a subsection to Chapter 1:

(bookmarks
("Front cover" "#1")
("Chapter 1" "#3"
("Section 1" "#6"))
("Chapter 2" "#18")
("Chapter 3" "#26")
)

Once your outline is complete, save the file and then apply it to your Djvu file using the ``djvused`` command:

$ djvused -e 'set-outline outline.txt' -s mybook.djvu

Open your Djvu file to see the outline.

![A Djvu with an outline as viewed in Okular](outline.png)
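
For a long book, typing the parentheses by hand gets tedious. Here is a minimal sketch that generates a flat, top-level-only outline from a plain list. The ``make_outline`` name and the input layout (a page number, a space, then the title, one entry per line) are my own assumptions, and titles must not contain double quotes or backslashes.

```shell
# make_outline: read "PAGE TITLE" lines on stdin and print a flat djvused
# outline on stdout. Hypothetical helper; it does not nest subsections.
make_outline() {
    printf '(bookmarks\n'
    while read -r mo_page mo_title; do
        [ -n "$mo_page" ] || continue       # skip blank lines
        printf '("%s" "#%s")\n' "$mo_title" "$mo_page"
    done
    printf ')\n'
}
```

Given a file ``toc.txt`` containing lines such as ``3 Chapter 1``, running ``make_outline < toc.txt > outline.txt`` produces a file in the same shape as the first example above.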

### Embedding text

If you want to store the text of a document you're creating, you can embed text elements ("hidden text" in ``djvused`` terminology) in your Djvu file so that applications like Okular or djview can select and copy the text to a user's clipboard.

This is a complex operation, because in order to embed text you must first have text. If you have access to a good OCR application, or else you have the time and dedication to transcribe the printed page, then you may well have that data, but then you must map the text to the bitmap image.

Once you have the text and the coordinates for each line (or, if you prefer, for each word), you can write a ``djvused`` script with blocks for each page:

select; remove-ant; remove-txt
# -------------------------
select "p0004.djvu" # page 4
set-txt
(page 0 0 2550 3300
(line 1661 2337 2235 2369 "Fires and Fire-fighters")
(line 1761 2337 2235 2369 "by John Kenlon"))

.
# -------------------------
select "p0005.djvu" # page 5
set-txt
(page 0 0 2550 3300
(line 294 2602 1206 2642 "Some more text here, blah blah blah."))

The integers for each line represent the minimum and maximum locations for the X and Y coordinates of each line (``xmin``, ``ymin``, ``xmax``, ``ymax``). Each line is a rectangle measured in pixels, with an origin at the *bottom left* corner of the page.
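
Many imaging tools measure from the top left corner instead, so the Y values usually need flipping before they go into a ``djvused`` script. The arithmetic is just a subtraction from the page height, sketched here with a hypothetical ``flip_box`` helper:

```shell
# flip_box HEIGHT X0 Y0 X1 Y1: convert a box measured from the top-left
# corner (Y0 = top edge, Y1 = bottom edge) into "xmin ymin xmax ymax"
# measured from the bottom-left corner. Hypothetical helper.
flip_box() {
    echo "$2 $(( $1 - $5 )) $4 $(( $1 - $3 ))"
}
```

On the 3300-pixel-high page used above, ``flip_box 3300 1661 931 2235 963`` prints ``1661 2337 2235 2369``, which matches the first ``line`` entry in the script.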

You can define embedded text elements as words, lines, and hyperlinks, and you can map complex regions with more shapes than just rectangles. You can also embed specially defined metadata, such as BibTeX keys, which are expressed in lowercase (year, booktitle, editor, author, and so on), and DocInfo keys, borrowed from the PDF spec, always starting with an uppercase letter (Title, Author, Subject, Creator, Producer, CreationDate, ModDate, and so on).
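
As a rough illustration, metadata can be prepared in a small text file and applied with the ``set-meta`` command of ``djvused``. The sketch below assumes the one-pair-per-line ``key "value"`` layout that ``print-meta`` emits, so check the ``djvused`` man page before relying on it. The ``write_meta`` helper is hypothetical, and values must not contain double quotes or backslashes.

```shell
# write_meta TITLE AUTHOR: print a djvused metadata file on stdout, using
# two DocInfo keys. Hypothetical helper; the file layout is an assumption.
write_meta() {
    printf 'Title "%s"\n' "$1"
    printf 'Author "%s"\n' "$2"
}

# Usage (commented out so that loading this file changes nothing):
#   write_meta "Fires and Fire-fighters" "John Kenlon" > meta.txt
#   djvused -e 'set-meta meta.txt' -s mybook.djvu
```

Afterwards, ``djvused -e print-meta mybook.djvu`` should list the keys that were stored.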

## Automated Djvu creation with djvudigital

While it's nice to be able to handcraft a finely-detailed Djvu document, if you adopt Djvu as an everyday format, you'll notice that your applications lack some of the conveniences available for the more ubiquitous PDF. For instance, few if any applications offer a convenient **Print to Djvu** or **Export to Djvu** option the way they do for PDF.

However, you can still use Djvu by leveraging PDF as an intermediate format.

### Licensing kerfuffle

Unfortunately, the library required for easy automated Djvu conversion is licensed under the CPL, which has requirements that cannot be satisfied by the GPL code used in the toolchain. For this reason, it can't be distributed as a compiled library, but you're free to compile it yourself.

The process is relatively simple due to an excellent build script provided by the ``djvulibre`` team.

1. First, you must prepare your system with software development tools. On Fedora, the quick and easy way to do this is with a DNF group:

$ sudo dnf group install @c-development

On Ubuntu:

$ sudo apt-get install build-essential

2. Next, download the ``gsdjvu`` source code from [sourceforge.net/projects/djvu/files/GSDjVu](https://sourceforge.net/projects/djvu/files/GSDjVu/1.10/). Be sure to download ``gsdjvu``, not ``djvulibre`` (in other words, don't click on the big green button at the top of the file listing, but on the latest file instead).

3. Unarchive the file you just downloaded, and then change directory into it:

$ cd ~/Downloads
$ tar xvf gsdjvu-X.YY.tar.gz
$ cd gsdjvu-X.YY

4. Create a directory called ``BUILD``. It must be called ``BUILD``, so quell your creativity:

$ mkdir BUILD
$ cd BUILD

5. Download the additional source packages required to build the ``gsdjvu`` application. Specifically, you must download the source for ``ghostscript`` (you almost certainly already have this installed, but you need its source to build against). Additionally, your system must have source packages for ``jpeg``, ``libpng``, ``openjpeg``, and ``zlib``. If you believe your system already has the source packages for these projects, then you can run the build script; if the sources are not found, the script fails and lets you correct the error before trying again.

6. Run the interactive ``build-gsdjvu`` build script included in the download. This script unpacks the source files, patches ghostscript with the ``gdevdjvu`` driver, compiles ghostscript, and then prunes unnecessary files from the build results.

7. You can install ``gsdjvu`` anywhere in your path. If you don't know what your ``PATH`` variable is, you can see it with ``echo $PATH``. For example, to install it to the ``/usr/local`` prefix:

$ sudo cp -r BUILD/INST/gsdjvu /usr/local/lib64
$ cd /usr/local/bin
$ sudo ln -s ../lib64/gsdjvu/gsdjvu gsdjvu
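
Before moving on, it can be worth confirming that the shell can actually find everything. This is a small sketch using the standard ``command -v`` lookup; ``have`` and ``check_toolchain`` are hypothetical helper names, and the list of commands is simply the set of tools used in this article.

```shell
# have CMD: succeed if CMD resolves to something runnable in this shell
have() {
    command -v "$1" >/dev/null 2>&1
}

# check_toolchain: report which of the tools used in this article are
# missing from PATH; returns non-zero if any are absent
check_toolchain() {
    ct_missing=0
    for ct_cmd in gsdjvu djvudigital djvm djvused; do
        have "$ct_cmd" || { echo "missing: $ct_cmd" >&2; ct_missing=1; }
    done
    return "$ct_missing"
}
```

If ``check_toolchain`` prints nothing, the symlink landed somewhere on your ``PATH`` and the DjVuLibre tools are installed.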

### Converting a PDF to Djvu

Now that you've built the ghostscript driver, converting a PDF to Djvu is just one command:

$ djvudigital --lines mydocument.pdf mydocument.djvu

This transforms all pages, bookmarks, and embedded text in a PDF to a Djvu file. Using this tool, you can use convenient PDF functions from your applications but end up with Djvu files.
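
If you have a whole directory of PDFs, the same command can be wrapped in a loop. This is a sketch under my own assumptions: ``djvu_name`` and ``convert_all`` are hypothetical helpers, and the ``DJVUDIGITAL`` variable exists only so you can swap in another command for a dry run.

```shell
# djvu_name FILE.pdf: print the matching .djvu file name
djvu_name() {
    echo "${1%.pdf}.djvu"
}

# convert_all: convert every PDF in the current directory, skipping any
# that already have a .djvu next to them. Hypothetical helper.
convert_all() {
    for ca_pdf in *.pdf; do
        [ -e "$ca_pdf" ] || continue        # no PDFs: the glob stays literal
        ca_out=$(djvu_name "$ca_pdf")
        [ -e "$ca_out" ] && continue        # already converted
        "${DJVUDIGITAL:-djvudigital}" --lines "$ca_pdf" "$ca_out" || return 1
    done
}
```

Setting ``DJVUDIGITAL=echo`` before calling ``convert_all`` prints the arguments that would be passed for each file without converting anything.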

## Why Djvu

Djvu is a great additional document format for your archival arsenal. It seems silly to stuff a series of images into a PostScript format like PDF, or a format clearly meant mostly for text like EPUB, and so it's nice to have CBZ and Djvu as additional options. It might not be right for all of your documents, but it's a good one to get comfortable with and to use when it makes the most sense.

Happy formatting!