EPUB sucks

S. Gilles

2017-07-01

I have a large pile of epub files sitting around. After encountering
a few rendering glitches on my physical reader (as well as some
typos), I decided to recompile a few of them to perform some minor
edits, and maybe add some covers to the Project Gutenberg ones while
I was at it. I thought that since EPUB was built from well-known,
widely used technology like ZIP and XHTML, it would probably be
usable.

As it turns out, EPUB is a terrible, terrible format. In no particular
order:

o Conceptually, EPUB is “just a bunch of (X)HTML files in a ZIP,
  with some metadata”. But there are absurdly strict restrictions
  on that ZIP. The “mimetype” file has to come first in the ordering,
  it must not be compressed, and it must have no metadata. For the
  rest of the file, directory entries must be suppressed.

  This is so that libmagic(5)-style detection works more easily, and
  is how we get the magic

      zip -X0 foo.epub mimetype
      zip -Xur9D foo.epub *

  sequence of commands. However, the structure of ZIP files already
  provides a trivial solution: arbitrary archive comments can be
  added at a predictable location (the end) to a ZIP file. These
  have been abused to create hosts of polyglot files already. It's
  just as easy to check for an uncompressed

      mimetypeapplication/epub+zip

  sequence at the beginning of the file as it is to ensure that
  the file ends with the bytes

      ^@this_is_an_epub_file_okay?

  so the EPUB spec could simply have mandated a particular comment
  and stayed out of micromanaging ZIP arguments. The EPUB choice
  sucks because it makes recreating the epub a nightmare: generic
  tools like archivemount can't be trusted to faithfully preserve
  an EPUB file while operating on it.
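
  A minimal sketch in Python of both detection schemes, using the
  standard zipfile module. The helper names (build_epub,
  has_epub_magic, has_epub_comment) are made up for illustration,
  and the comment string is the hypothetical one from above, not
  anything the EPUB spec defines. It relies on the fixed part of a
  ZIP local file header being 30 bytes, so the name and contents of
  an uncompressed first entry with no extra field sit at a known
  offset.

```python
import io
import zipfile

MIMETYPE = b"application/epub+zip"
# Hypothetical marker comment from the text above; not part of any spec.
COMMENT = b"this_is_an_epub_file_okay?"


def build_epub(files, comment=b""):
    """Return ZIP bytes with 'mimetype' first, stored, and no extra field."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # A bare ZipInfo carries an empty extra field; ZIP_STORED means
        # the entry is written uncompressed.
        zf.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE,
                    compress_type=zipfile.ZIP_STORED)
        # Only file entries are written, so no directory entries appear.
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
        zf.comment = comment
    return buf.getvalue()


def has_epub_magic(blob):
    """The check EPUB mandates: name + content right after the 30-byte header."""
    return blob[30:58] == b"mimetype" + MIMETYPE


def has_epub_comment(blob):
    """The alternative check: the archive ends with a fixed comment.

    The comment is preceded by its 2-byte little-endian length; for a
    26-byte comment the high byte is NUL, hence the leading NUL here.
    """
    return blob.endswith(b"\x00" + COMMENT)


blob = build_epub({"META-INF/container.xml": b"<container/>"}, comment=COMMENT)
print(has_epub_magic(blob), has_epub_comment(blob))
```

  Either check is a single slice or suffix comparison; the
  difference is that the second one would survive any tool that
  rewrites the archive, as long as that tool keeps the comment.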

o Absurd redundancy of structure. The file I'm looking at now has:

   - a META-INF/container.xml file, which contains a hierarchy of
     files, but the hierarchy is trivial and just points to
     content.opf.

   - inside content.opf, there is a <manifest> object, containing
     a list of all content-bearing files in the EPUB

   - below that, there is a <spine toc="ncx"> object which is just
     a list of ids previously established by the manifest.

   - right below that is a <guide> object, which doesn't use the
     ids from manifest.

   - a toc.ncx file with exactly one entry,

   - and that entry is a link to a table of contents created in
     HTML as part of the book, bypassing the whole point of toc.ncx.

  The <manifest> lists all objects in the file, which is rather
  useless even in a streaming situation.  The <spine> designates
  an ordering of a subset of the <manifest>, and the <guide> applies
  special types to elements of <manifest>.  These could easily be
  compressed into one object. That object could be the toc.ncx
  file.

  EPUB 3 makes this even worse, I hear.
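
  To make the chain of indirection concrete, here is a rough sketch
  in Python that follows it with the standard xml.etree module. The
  two sample documents and the function names (opf_path,
  reading_order, guide_types) are invented for illustration; only
  the element names, attribute names and namespaces are taken from
  the container and OPF 2.0 formats.

```python
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}

# Made-up sample files, shaped like the ones described above.
CONTAINER = """
<container version="1.0"
           xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="toc" href="toc.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="toc"/>
    <itemref idref="ch1"/>
  </spine>
  <guide>
    <reference href="toc.xhtml" type="toc" title="Contents"/>
  </guide>
</package>
"""


def opf_path(container_xml):
    """Hop 1: container.xml only says where the package file lives."""
    root = ET.fromstring(container_xml.strip())
    return root.find("c:rootfiles/c:rootfile", NS).get("full-path")


def reading_order(opf_xml):
    """Hops 2 and 3: the spine holds ids, the manifest maps ids to files."""
    root = ET.fromstring(opf_xml.strip())
    hrefs = {item.get("id"): item.get("href")
             for item in root.iterfind("opf:manifest/opf:item", NS)}
    return [hrefs[ref.get("idref")]
            for ref in root.iterfind("opf:spine/opf:itemref", NS)]


def guide_types(opf_xml):
    """The guide skips the ids and names files by href directly."""
    root = ET.fromstring(opf_xml.strip())
    return {ref.get("href"): ref.get("type")
            for ref in root.iterfind("opf:guide/opf:reference", NS)}


print(opf_path(CONTAINER), reading_order(OPF), guide_types(OPF))
```

  Three lookups across two files produce what amounts to one
  ordered list of files with an optional type on each.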

o Poorly constructed website. Searching for epub specifications
  brings up <http://www.idpf.org/epub/20/spec/OPF_2.0_latest.htm>,
  but since EPUB 3 exists, I tried changing that URL to
  <http://www.idpf.org/epub/30/spec/OPF_3.0_latest.htm>. But that
  URL gives a 301, pointing to itself for an endless loop. I'm not
  even sure how someone managed to do that. (The actual specification
  for 3.0.1 is at
  <http://www.idpf.org/epub/301/spec/epub-publications.html>.)

o Investigating that specification yields the following example
  for the title of LOTR. Instead of something like

      <metadata
          title-main="The Fellowship of the Ring"
          collection="The Lord of the Rings"
          title-expanded="THE LORD OF THE RINGS, Part One: The Fellowship of the Ring"
       />

  the EPUB spec has opted for the following:

      <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title id="t1">The Fellowship of the Ring</dc:title>
          <meta refines="#t1" property="title-type">main</meta>

          <dc:title id="t2">The Lord of the Rings</dc:title>
          <meta refines="#t2" property="title-type">collection</meta>

          <dc:title id="t3">THE LORD OF THE RINGS, Part One: The Fellowship of the Ring</dc:title>
          <meta refines="#t3" property="title-type">expanded</meta>
          …
      </metadata>

  Those jury-rigged id= and refines= attributes are nothing short
  of insane genius: it takes skill to start with XML and build a
  specification in which it is possible to write a property that
  modifies the wrong object, or no object at all.
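
  As a small illustration of the "no object at all" case, here is a
  sketch in Python that looks for refines= values with no matching
  id=. The sample metadata and the function name dangling_refines
  are made up for this example; the check is something a consumer
  has to bolt on, because nothing in XML itself ties the two
  attributes together.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the second <meta> points at an id that does not exist.
METADATA = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title id="t1">The Fellowship of the Ring</dc:title>
  <meta refines="#t1" property="title-type">main</meta>
  <meta refines="#t9" property="title-type">collection</meta>
</metadata>
"""


def dangling_refines(metadata_xml):
    """Return every refines= value whose target id is not in the document."""
    root = ET.fromstring(metadata_xml.strip())
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    dangling = []
    for el in root.iter():
        target = el.get("refines")
        if target is None:
            continue
        # refines= is written as a fragment reference: "#" plus the id.
        wanted = target[1:] if target.startswith("#") else target
        if wanted not in ids:
            dangling.append(target)
    return dangling


print(dangling_refines(METADATA))
```

  A property that refines the wrong, but existing, element passes
  this check untouched; no mechanical test can catch that one.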

o Specifying a cover for a music album, to a music player, is
  pretty simple. You put a file called cover.jpg in the directory,
  or perhaps folder.jpg if you belong to that camp.

  For an EPUB document, the fastest way I can figure out is the
  following: Put cover.jpg in the root, add it to the <manifest>
  of content.opf, add something like

      <meta name="cover" content="cover" />

  to the <metadata> of content.opf, making sure that the content
  attribute is the same as the id from the manifest. When that doesn't
  work, make a cover.xhtml file with appropriate css to reference
  cover.jpg as an <img> element (make sure to see whether ‘width=100%’
  or ‘height=100%’ is appropriate, since there's no way to easily
  scale-to-fit preserving aspect-ratios in lowest-common-denominator
  HTML+CSS), add THAT to the manifest, then add it to the <guide>
  element of content.opf via something like

      <reference href="cover.xhtml" type="cover" title="Cover" />

  Just to save you some time, you can't put

      <reference href="cover.jpg" type="cover" title="Cover" />

  in the guide directly: each entry must be an “OPS Content
  Document”, which is their name for “An XHTML document that
  conforms to our DTD”.
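
  For comparison with "look for cover.jpg", here is roughly what a
  reader has to do to honor the first, <meta name="cover"> method,
  sketched in Python. The sample package and the function name
  cover_href are invented for illustration, and this covers only
  that one convention, not the <guide> fallback.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Made-up sample package with the cover wired up as described above.
OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata>
    <meta name="cover" content="cover"/>
  </metadata>
  <manifest>
    <item id="cover" href="cover.jpg" media-type="image/jpeg"/>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>
"""


def cover_href(opf_xml):
    """Follow <meta name="cover"> to a manifest id, then to a file name."""
    root = ET.fromstring(opf_xml.strip())
    meta = root.find("opf:metadata/opf:meta[@name='cover']", NS)
    if meta is None:
        return None
    for item in root.iterfind("opf:manifest/opf:item", NS):
        if item.get("id") == meta.get("content"):
            return item.get("href")
    # The content= value did not match any manifest id.
    return None


print(cover_href(OPF))
```

  If the content attribute and the manifest id drift apart by a
  single character, the lookup quietly returns nothing.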

This format was not meant for humans to work with; it was meant for
companies to charge other companies to churn out.