(2023-04-11) Sharballs > Tarballs
---------------------------------
Nowadays, when people hear the words "archive file", they usually think of
something like .zip or .7z (or, if they are completely braindead, something
like .rar) files that contain some directory structure in compressed form.
Most of them associate archiving with compression and don't have the
slightest clue that these are two completely different processes, and that
even the ZIP format (as handled by Info-ZIP) supports the "store" method
that doesn't compress any data. Those who live in a healthier kind of
environment surely know about tarballs, but still not every one of them
understands, or intuitively gets, why these tarballs often have two separate
suffixes in their names, like .tar.gz, .tar.bz2, .tar.xz and so on. And only
those who have seen and worked with .cpio.gz files definitely know the truth
about this, because otherwise they wouldn't be able to create a single file
in that format.

And the truth is that compression algorithms don't work with filesystem
structures like directories and files themselves. They only work with
continuous streams of data. Turning the former into the latter is the sole
task of an archiver. There is a whole lot of software that only archives
files and directories without any compression, with tar, ar and cpio being
the most famous and popular examples. Yes, modern GNU tar can automatically
call a compressor (gzip, bzip2, xz) if we tell it to, but it still is a
fully separate stage. We can gunzip a .tar.gz file and still work with the
bare .tar file as if it had never been created with the gzipping option.
This is why
archive formats are NOT the same as compression formats, and are an
interesting topic on their own.
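
Just to make the separation tangible, here's a throwaway example (the file
and directory names are made up) of doing and undoing each stage on its own:

Archive, then compress:
tar -cf my-dir.tar my-dir
gzip -9 my-dir.tar

Decompress and inspect the bare tarball:
gzip -d my-dir.tar.gz
tar -tf my-dir.tar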

By the way, I'm not really sure why tar won out over cpio for general usage.
The cpio format itself is more straightforward, doesn't require 512-byte
block alignment, and nowadays allows (and even recommends) using plain
ASCII-based file headers, making it a fully plaintext format in case your
files are also plaintext. The only _major_ difference is that the cpio
command itself (which, in its GNU version, even supports tar/ustar format
creation!), in
the archival mode, only accepts the _full_ list of files/directories from
the standard input and outputs the resulting stream to the standard output.
In the extraction mode, it accepts the stream from the standard input. I
like this behavior more than tar's, because it is implemented in the true
Unix way and serves the initial purposes of cpio (passing complex directory
structures over the network or between linear-access storage devices like
tapes) much better. The tar command could always simulate this experience,
but its default mode is accepting the flags and the archive file first, and
files/directories to add (in the archival mode) afterwards. And, unlike
cpio, if you add a single directory to the list, tar will automatically add
all the underlying elements recursively. Maybe not having to use the find
command for this purpose made tar more appealing to noobs, as well as not
having to pipe the output to gzip or whatever for further compression, as
it's just a matter of a single-letter flag you pass to the tar command.
That's probably why tarballs, whether compressed or not, became a de facto
standard in the modern Unix-like ecosystem, despite cpio being much more
suitable for backup-restore scenarios.
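
For the sake of comparison, here's how the same hypothetical directory would
be packed and unpacked with each tool; note how cpio takes the full file
list from the standard input while tar recurses into the directory on its
own:

With cpio:
find my-dir | cpio -o > my-dir.cpio
cpio -id < my-dir.cpio

With tar:
tar -cf my-dir.tar my-dir
tar -xf my-dir.tar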

But what if I told you that there exists an archive format that's even more
noob-friendly in terms of unpacking (kinda like SFX-type archives in
Faildows), has minimal dependencies for creating the archives and zero
dependencies for unpacking them, and is portable across all POSIX-compliant
systems? Interested? Well, this format is called shar (SHell ARchive) and it
doesn't have a single specification... well, because it's just a generated
shell script that recreates initial files from the data stored in it. So,
there is no separate shar unpacker, the files unpack themselves when passed
to sh. And all the major differences between various shar flavors boil down
to how they store the data internally, what dependencies the archiver
requires and what dependencies the self-unpacking script requires.
Historically, the shar archiver was a shell script too, but most current
shar versions are written in C, and the shell scripts generated by them
depend on the echo and mkdir commands and on monsters like sed and
uudecode. I personally don't support this approach, as
echo can have some caveats in different OS implementations, uuencode and
uudecode might not be installed at all and sed is a Turing-complete language
by itself. That's why I naturally decided to create my own version of shar.
As a shell script, of course, but, for the first time in all these years,
not a Bash-specific one.
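
To give an idea of what such a self-unpacking script can look like, here is
a tiny made-up example in the spirit of the approach described further
below (it is not the output of any particular shar implementation):

#!/bin/sh
mkdir -p my-dir
printf '%b' 'hello, world\x0a' > my-dir/hello.txt

Feeding this to sh recreates my-dir/hello.txt; that's all there is to
"unpacking".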

Writing a shar clone may seem like a very straightforward task until you start
thinking about minimizing external dependencies as much as possible. I
decided that the minimum requirement for my shar is that it must work at
least on Busybox and on the bare KaiOS 2.5.x/Android 6 ADB shell with Toybox
or whatever it has there. By "work" I mean both archiving and extraction.
The questions I put before myself were:

1) What to use instead of echo?
2) What to use instead of uuencode to pack binary files?
3) How to read binaries in a non-Bash-specific way?
4) How to ensure we don't have duplicate input files and directories?
5) How to ensure we don't have EOF markers in our packed content?

And the answer to the first two questions came almost instantly: printf.
Alas, POSIX printf doesn't have the wonderful %q specifier that would solve
90% of my problems, and I didn't want to make Bash a dependency even for
packing only. As for the EOF markers all current shell-only shar versions
use, we could use them with variable reading and some end-of-line
manipulation, and this is what I tried first. But Android's shell reminded
me that this is the case when dumber is smarter. So instead of using EOF
markers, I ditched this approach altogether and wrote a function to
serialize any file into a series of shell printf calls by a fixed chunk
length (because we don't want to overflow the 130K command buffer, do we?).
And this function also addresses question number three: use the read builtin
in as standard a way as possible, with an empty IFS value, and read the
file byte by byte. It is slow but reliable. Then, using another printf call,
the ASCII code of the byte is retrieved, and then, depending on its value,
it is output "as is" or as a \x-sequence, hex-encoded. With what? With
printf, of course! Now, since the final printf call in the shar file that
actually unrolls the bytes will be called with the %b specifier, we must
also make sure that all single quotes and backslashes are passed in there
hex-encoded as well. That's another two conditions added to our loop.
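
To make this a bit more concrete, here is a rough sketch of such a
serializer. It is not the literal lshar.sh code: it assumes a read builtin
that supports the -n 1 option (true at least for Busybox ash and mksh), a
C/POSIX locale so that everything is handled as single bytes, a made-up
chunk size, and it glosses over NUL bytes, which shell variables can't hold:

serialize() { # $1 - path of the file to serialize into printf calls
  chunk='' len=0
  printf ": > '%s'\n" "$1"                  # truncate the target on unpack
  while IFS= read -r -n 1 ch || [ -n "$ch" ]; do
    if [ -z "$ch" ]; then                   # read -n 1 yields an empty var for a newline
      code=10
    else
      code=$(printf '%d' "'$ch")            # ASCII code via the leading-quote trick
    fi
    # hex-encode control bytes, non-ASCII bytes, single quotes and backslashes
    # so that the unpacker's printf '%b' restores them verbatim
    if [ "$code" -lt 32 ] || [ "$code" -gt 126 ] || \
       [ "$code" -eq 39 ] || [ "$code" -eq 92 ]; then
      ch=$(printf '\\x%02x' "$code")
    fi
    chunk="$chunk$ch" len=$((len + 1))
    if [ "$len" -ge 4096 ]; then            # flush in fixed chunks to stay well below
      printf "printf '%%b' '%s' >> '%s'\n" "$chunk" "$1"   # the command buffer limit
      chunk='' len=0
    fi
  done < "$1"
  [ -n "$chunk" ] && printf "printf '%%b' '%s' >> '%s'\n" "$chunk" "$1"
}

Each emitted line is one of those printf '%b' calls; the shar file is simply
all of them concatenated after the directory-creation preamble.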

Once a proper serializer is created, that's already 80% of the job done. Now, as
shar traditionally accepts the exhaustive list of input files from the
command line arguments only, and they can come from various sources, there's
no guarantee that the input won't contain duplicates, and, of course, we
don't want duplicates in our archive. This is where we must make use of an
external command dependency, namely, sort -u. Some might argue that
sort|uniq might be more portable but I've actually never seen any sort
command version - in GNU/Linux, macOS, Busybox or even Toybox - that
wouldn't support the -u flag. Looks portable enough to me, at least
from the archive creation standpoint. Apart from that, I'd like to make the
shar script create the entire directory structure _before_ writing any files
into it, so a separate loop to do this was implemented.
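
In sketch form, and under the assumptions that the full input list arrives
as positional parameters and that paths contain no newlines or single
quotes (the real lshar.sh may well do this differently):

# deduplicate the input list; sort -u is the only external dependency here
list=$(printf '%s\n' "$@" | sort -u)

# emit all the mkdir -p commands before any file data, so the whole
# directory structure exists by the time the first file gets written
printf '%s\n' "$list" | while IFS= read -r entry; do
  [ -d "$entry" ] && printf "mkdir -p '%s'\n" "$entry"
done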

And that actually is it. The entire archiving script, lshar.sh, which I have
published in the downloads section of my main hoi.st Gophermap, is exactly
60 SLOC of simple and well-commented code that is portable across various
shells. And, just like the very first versions of shar, this script also is
released into public domain. I guess this will be my primary tool for
publishing new code and the code migrated from Git repos (something I
already suggested in my previous post, by the way). Obviously, just like
with any other shar, Lshar can be combined with gzip or a similar tool to
achieve compression. Examples:

Archive and compress:
find my-dir | xargs sh lshar.sh | gzip -9 > my-sharball.shar.gz

Decompress and unroll:
gzip -d -c my-sharball.shar.gz | sh

Note that, due to the nature of the script, unrolling is always fast but
archiving isn't. Since the serializer processes one byte at a time, it's not
the fastest thing in the world (and on KaiOS phones, it's very noticeable), so
I'm probably going to walk the path of the original shar creators and write
a portable ANSI C89 version of the same tool at some point in the future.
For the time being though, it serves its purpose and also is a cool example
of working with individual bytes in shell scripts in a way that isn't
Bash-specific.

Using tarballs to show respect to the Unix way? Switch to sharballs if you
truly love it.

--- Luxferre ---