EPUB sucks
S. Gilles
2017-07-01
I have a large pile of epub files sitting around. After encountering
a few rendering glitches on my physical reader (as well as some
typos), I decided to recompile a few of them to perform some minor
edits, and maybe add some covers to the Project Gutenberg ones while
I was at it. I thought that since EPUB was built from well-known,
widely used technology like ZIP and XHTML, it would probably be
usable.
As it turns out, EPUB is a terrible, terrible format. In no particular
order:
o Conceptually, EPUB is “just a bunch of (X)HTML files in a ZIP,
with some metadata”. But there are absurdly strict restrictions
on that ZIP. The “mimetype” file has to come first in the ordering,
it must not be compressed, and it must have no metadata. For the
rest of the file, directory entries must be suppressed.
This is so that magic(5)-style detection works more easily, and
is how we get the magic
zip -X0 foo.epub mimetype
zip -Xur9D foo.epub *
sequence of commands. However, the structure of ZIP files already
provides a trivial solution: arbitrary archive comments can be
added at a predictable location (the end) to a ZIP file. These
have been abused to create hosts of polyglot files already. It's
just as easy to check for an uncompressed
mimetypeapplication/epub+zip
sequence at the beginning of the file as it is to ensure that
the file ends with the bytes
^@this_is_an_epub_file_okay?
so the EPUB spec could simply have mandated a particular comment
and stayed out of micromanaging ZIP arguments. The EPUB choice
sucks because it makes recreating the epub a nightmare: generic
tools like archivemount can't be trusted to faithfully preserve
an EPUB file while operating on it.
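For comparison, here is a rough Python sketch of both detection schemes, using only the standard zipfile module. The helper names (build_epub and the two looks_like_* checks) are made up for illustration, and the marker is just the example comment from above, not anything a spec mandates.

```python
import io
import zipfile

MIMETYPE = b"application/epub+zip"
# Hypothetical marker: the example archive comment from the text above.
MARKER = b"this_is_an_epub_file_okay?"


def build_epub(files, comment=b""):
    """Return the bytes of a ZIP laid out the way EPUB demands.

    `files` maps archive names to bytes. Only file entries are written,
    so the archive contains no directory entries.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # "mimetype" goes first and is stored, not deflated. A bare
        # ZipInfo carries no extra field.
        zf.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE,
                    compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
        # The alternative scheme: an archive comment, written at the end.
        zf.comment = comment
    return buf.getvalue()


def looks_like_epub_by_prefix(data):
    # The local file header is 30 bytes; the 8-byte name and the
    # 20 bytes of stored data follow immediately.
    return data[30:58] == b"mimetype" + MIMETYPE


def looks_like_epub_by_comment(data):
    # The comment is the last thing in the file, preceded by its 16-bit
    # little-endian length. For a short comment the high byte of that
    # length is the NUL shown as ^@ above.
    return data.endswith(b"\x00" + MARKER)
```

Both checks are a single comparison, but only the first one constrains how the rest of the archive is written.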
o Absurd redundancy of structure. The file I'm looking at now has:
- a META-INF/container.xml file, which contains a hierarchy of
files, but the hierarchy is trivial and just points to
content.opf.
- inside content.opf, there is a <manifest> object, containing
a list of all content-bearing files in the EPUB
- below that, there is a <spine toc="ncx"> object which is just
a list of ids previously established by the manifest.
- right below that is a <guide> object, which doesn't use the
ids from manifest.
- a toc.ncx file with exactly one entry,
- and that entry is a link to a table of contents created in
HTML as part of the book, bypassing the whole point of toc.ncx.
The <manifest> lists all objects in the file, which is rather
useless even in a streaming situation. The <spine> designates
an ordering of a subset of the <manifest>, and the <guide> applies
special types to elements of <manifest>. These could easily be
compressed into one object. That object could be the toc.ncx
file.
EPUB 3 makes this even worse, I hear.
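To make the indirection concrete, here is a small Python sketch of what a reader has to do just to find the reading order. The sample content.opf is invented for illustration (it is not the file described above), and reading_order and guide_targets are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"

# Invented sample, trimmed to the three structures discussed above.
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch2" href="ch2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
  <guide>
    <reference href="ch1.xhtml" type="text" title="Start"/>
  </guide>
</package>
"""


def reading_order(opf_text):
    """Resolve each <spine> idref through the <manifest> to a file name."""
    root = ET.fromstring(opf_text)
    hrefs = {item.get("id"): item.get("href")
             for item in root.iter(OPF + "item")}
    return [hrefs[ref.get("idref")] for ref in root.iter(OPF + "itemref")]


def guide_targets(opf_text):
    """The <guide> skips the ids and names files directly."""
    root = ET.fromstring(opf_text)
    return [ref.get("href") for ref in root.iter(OPF + "reference")]
```

Two lookups through two naming schemes, all to recover a list of files in order.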
o Poorly constructed website. Searching for epub specifications
brings up <http://www.idpf.org/epub/20/spec/OPF_2.0_latest.htm>,
but since EPUB 3 exists, I tried changing that URL to
<http://www.idpf.org/epub/30/spec/OPF_3.0_latest.htm>. But that
URL gives a 301, pointing to itself for an endless loop. I'm not
even sure how someone managed to do that. (The actual specification
for 3.0.1 is at
<http://www.idpf.org/epub/301/spec/epub-publications.html>.)
o Investigating that specification yields the following example
for the title of LOTR. Instead of something like
<metadata
title-main="The Fellowship of the Ring"
collection="The Lord of the Rings"
title-expanded="THE LORD OF THE RINGS, Part One: The Fellowship of the Ring"
/>
the EPUB spec has opted for the following:
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title id="t1">The Fellowship of the Ring</dc:title>
<meta refines="#t1" property="title-type">main</meta>
<dc:title id="t2">The Lord of the Rings</dc:title>
<meta refines="#t2" property="title-type">collection</meta>
<dc:title id="t3">THE LORD OF THE RINGS, Part One: The Fellowship of the Ring</dc:title>
<meta refines="#t3" property="title-type">expanded</meta>
…
</metadata>
Those jury-rigged id= and refines= attributes are nothing short of
insane genius: it takes skill to start with XML and build a
specification in which it is possible to write a property that
modifies the wrong object, or no object at all.
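As a sketch of the failure mode, the following Python finds refines= values that point at nothing. The function name dangling_refines and the broken sample are my own inventions; it assumes every refines value is a plain "#id" fragment, as in the example above.

```python
import xml.etree.ElementTree as ET

# Invented sample: the second <meta> refines an id that does not exist.
BROKEN_METADATA = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title id="t1">The Fellowship of the Ring</dc:title>
  <meta refines="#t1" property="title-type">main</meta>
  <meta refines="#t9" property="title-type">collection</meta>
</metadata>
"""


def dangling_refines(metadata_xml):
    """Return every refines= value whose target id is not defined."""
    root = ET.fromstring(metadata_xml)
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    dangling = []
    for el in root.iter():
        target = el.get("refines")
        if target is None:
            continue
        # Assumption: the value is "#" followed by an id.
        name = target[1:] if target.startswith("#") else target
        if name not in ids:
            dangling.append(target)
    return dangling
```

The document parses as well-formed XML either way, so a check like this has to be bolted on by every tool that cares. With the attribute-based form, there would be nothing to dangle.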
o Specifying a cover for a music album, to a music player, is
pretty simple. You put a file called cover.jpg in the directory,
or perhaps folder.jpg if you belong to that camp.
For an EPUB document, the fastest way I can figure out is the
following: Put cover.jpg in the root, add it to the <manifest>
of content.opf, add something like
<meta name="cover" content="cover" />
to the <metadata> of content.opf, making sure that the content
attribute is the same as the id from the manifest. When that doesn't
work, make a cover.xhtml file with appropriate css to reference
cover.jpg as an <img> element (make sure to see whether ‘width=100%’
or ‘height=100%’ is appropriate, since there's no way to easily
scale-to-fit preserving aspect-ratios in lowest-common-denominator
HTML+CSS), add THAT to the manifest, then add it to the <guide>
element of content.opf via something like
<reference href="cover.xhtml" type="cover" title="Cover" />
Just to save you some time, you can't put
<reference href="cover.jpg" type="cover" title="Cover" />
in the guide directly: each entry must be an “OPS Content
Document”, which is their name for “An XHTML document that
conforms to our DTD”.
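The first half of that procedure (the <manifest> entry plus the <meta name="cover"> element) can be sketched in Python as below. The add_cover helper is hypothetical, it assumes the cover is a JPEG already sitting next to content.opf, and it does not attempt the cover.xhtml and <guide> fallback.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
OPF = "{" + OPF_NS + "}"


def add_cover(opf_text, item_id="cover", href="cover.jpg"):
    """Return content.opf text with a cover image registered in it."""
    # Keep OPF as the default namespace when serializing. Note that
    # this registration is global to ElementTree.
    ET.register_namespace("", OPF_NS)
    root = ET.fromstring(opf_text)
    metadata = root.find(OPF + "metadata")
    manifest = root.find(OPF + "manifest")
    # Step 1: list the image in the manifest under some id.
    ET.SubElement(manifest, OPF + "item",
                  {"id": item_id, "href": href, "media-type": "image/jpeg"})
    # Step 2: point a <meta name="cover"> at that same id.
    ET.SubElement(metadata, OPF + "meta",
                  {"name": "cover", "content": item_id})
    return ET.tostring(root, encoding="unicode")
```

Copying cover.jpg into the archive is still a separate step, subject to all the ZIP rules from the first point.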
This format was not meant for humans to work with; it was meant for
companies to charge other companies to churn out.