Protocol pondering intensifies, Pt II
-------------------------------------

In the previous post in this series[1] I compared request formats for
gopher and HTTP and thought a bit about what a good anonymous document
system actually needs. I ended up deciding that the answer was
nothing more than gopher already provides. In this post I'll continue
that discussion, focussing instead on the response format.

Recall that a gopher server's response to a request consists of
nothing more than the content. What does HTTP look like? Here's a
quite light real world example, obtained by requesting the /index.html
path from grex.org:

---------
HTTP/1.0 200 OK
Server: nginx
Date: Fri, 14 Jun 2019 19:16:18 GMT
Content-Type: text/html
Content-Length: 45
Last-Modified: Sat, 21 Apr 2018 12:23:32 GMT
Connection: close
ETag: "5adb2d44-2d"
Accept-Ranges: bytes

<html><body><h1>It works!</h1></body></html>
---------

The very first part, HTTP/1.0, is of course the protocol version.
Notice that this was the last component of the request, but it's the
first component of the response. What's all that about? Anyhow, in
general I think it makes a heck of a lot of sense for a response to
a request to use the same protocol version as the request, which the
client of course is already aware of, so this is dead weight. The
next part, "200", is a status code, indicating whether or not the
request was successful or triggered an error. It's followed by a
human-friendly version of the machine-friendly status code, in this
case simply "OK". There are lots and lots of status codes in HTTP[2]!

Then we have a bunch of headers, which look just like the request
headers from last post. There's a lot of dead weight in here, for
simple purposes. The "Date:" is in there for cache-related reasons. In
a protocol without caching, this is useless. Specifying the "Server:"
software and version serves no useful purpose, and many webserver
admins actually disable this feature to avoid giving away hints about
which vulnerabilities might be applicable to their server. The
"Content-Length:" is useful for when a single TCP/IP connection is used
for multiple request/response pairs. There's some overhead involved
in setting up and tearing down these connections, and as webpages
started to trigger more and more requests - to fetch stylesheets, and
scripts, and images - this overhead added up to a non-trivial part of
total time for a website to render. Re-using connections is one
solution to this, and it means that the client needs to know when the
server is done responding, and it can do this by counting bytes until
the entire "Content-Length" has been received. A *better* solution to
this problem is to Stop Making So Many Damn Requests, which means the
server can signal the end of the content by just closing the
connection, rendering this header useless.
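
To make the difference concrete, here's a rough Python sketch of the
two ways a client can know a response is finished. The function names
are my own inventions for illustration, and this is not lifted from
any real client.

```python
import socket


def read_until_close(sock: socket.socket) -> bytes:
    """Gopher style: keep reading until the server closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # an empty read means the peer closed its end
            break
        chunks.append(chunk)
    return b"".join(chunks)


def read_exactly(sock: socket.socket, length: int) -> bytes:
    """Content-Length style: count bytes, leaving the connection open."""
    data = b""
    while len(data) < length:
        chunk = sock.recv(length - len(data))
        if not chunk:
            raise ConnectionError("connection closed before the body arrived")
        data += chunk
    return data
```

The second function only earns its keep if the connection is going to
be re-used afterwards. If it isn't, the first one does the job with
no header at all.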

Is there anything of value in here? I think so! The status code is
interesting. Did you know that gopher has no real way to signal an
error? You might be thinking "Hey, what about item type 3?", but the
thing about item type 3 is, well, it's an item type. When do we see
item types? In gopher menus, and in gopher menus only. If a gopher
client sends a request for what it thinks should be a text file, but
it's followed a misspelled selector and the file doesn't exist, the
client isn't going to try to parse the response as a menu, so it's not
going to have any way to recognise the error item type. Indeed, if
you request a non-existent selector from a gopher server, it'll say
something to you like "Error: File or directory not found!" (this is
what Gophernicus will say), but it's only you, as a human, who
hopefully reads English, who can recognise this as an error. A simple
script has *no* way to distinguish this situation from a totally
successful transaction. Because of this, it's e.g. impossible to
write a script to crawl a gopherhole looking for broken links. Well,
maybe not impossible, but certainly non-trivial: you could figure out
the particular server's idiosyncratic choice of error message by
requesting a couple of randomly-generated long selectors which are
highly unlikely to be in actual use, and use the most common response
as the "404 equivalent string". Needless to say, this is not exactly
simple. It's perhaps not the biggest problem in the world, but
it's certainly a shortcoming of gopher which could very easily be
avoided.
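
Here's roughly what that workaround might look like in Python. This
is only a sketch: `fetch` is a hypothetical function which takes a
selector and returns whatever bytes the server sent back, and the
function names and the choice of three probes are arbitrary.

```python
import random
import string
from collections import Counter
from typing import Callable


def guess_error_response(fetch: Callable[[str], bytes], probes: int = 3) -> bytes:
    """Guess a server's "not found" response by probing random selectors.

    `fetch` is a hypothetical callable mapping a selector to the raw
    bytes the server sent back.
    """
    responses = []
    for _ in range(probes):
        # A long random selector is highly unlikely to be in actual use
        selector = "/" + "".join(random.choices(string.ascii_lowercase, k=40))
        responses.append(fetch(selector))
    # The most common response is our best guess at the error message
    return Counter(responses).most_common(1)[0][0]


def looks_broken(fetch: Callable[[str], bytes], selector: str,
                 error_response: bytes) -> bool:
    """Treat a link as broken if it returns the guessed error response."""
    return fetch(selector) == error_response
```

Even this falls over if the server echoes the requested selector back
inside its error message, because then no two error responses are
identical and the exact comparison never matches.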

But much more interesting and important is the "Content-Type" header.
Gopher, frankly, sucks at signalling content type. If you've arrived
at a document via a gopher menu, then you know its item type. What if
you want to request a document directly, not by following an item in a
menu? Maybe your friend has told you the selector in an email or via
XMPP. Maybe you bookmarked it last month. You can request that
document by just sending the path and a <CR><LF>, but how do you know
what kind of content you're getting back? If you don't somehow know
it in advance, you need to figure it out for yourself by looking at it
hard ("you" here are a gopher client, not an end user). This is the
reason that gopher has its very own unique URL scheme with its own RFC
(RFC4266), where the itemtype is introduced as an extra component of
the path. You need to write
gopher://zaibatsu.circumlunar.space/1/~solderpunk instead of just
gopher://zaibatsu.circumlunar.space/~solderpunk because with the latter
option your client would have no idea whether or not it should try to
parse what comes back as a menu, display it as text or save it as a
binary file. This problem is also the reason that if you write a
gopher client with bookmark support, you need to store the item type
along with host, port and selector. Neither of these things is
terribly hard, but they are examples of small, inelegant extra hoops
which have to be jumped through because gopher, in this respect, is
*too* simple. It's too simple to straightforwardly handle a
perfectly reasonable situation like "I'd like to fetch this document
from this server but I've never seen it appear in a menu because my
friend just emailed me the link". To me it makes a *lot* of sense
that the *only* piece of information you should need to request and
then make use of a resource is that resource's path. That seems,
well, simple.
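
For illustration, here's a minimal Python sketch of pulling the item
type out of a gopher URL in the way RFC4266 describes. The function
name is made up, and it ignores search strings and other corner
cases, so treat it as a sketch rather than a complete parser.

```python
from urllib.parse import urlsplit, unquote


def parse_gopher_url(url: str):
    """Split a gopher URL into (host, port, item_type, selector)."""
    parts = urlsplit(url)
    if parts.scheme != "gopher":
        raise ValueError("not a gopher URL")
    host = parts.hostname
    port = parts.port or 70      # 70 is gopher's default port
    path = parts.path[1:]        # drop the "/" which follows the host
    if not path:
        return host, port, "1", ""   # no path at all: assume a menu
    # The first character of the path is the item type, the rest is
    # the selector.
    return host, port, path[0], unquote(path[1:])
```

Feed this the second URL above, the one without the item type, and it
will cheerfully report an item type of "~" and a selector of
"solderpunk". The parser has no way to tell that the item type was
left out.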

This problem in gopher is more widespread than just not knowing what
item type a document is. Even if you *know* that a path points to an
item type 0 text file, you can have problems. One of the earliest bug
reports I got after releasing VF-1 turned out to be the result of
floodgap.com using iso-8859-1 text encoding to support accented
characters in some of their content. VF-1 had just assumed that
everything on gopher was ASCII, which turned out to be very wrong.
There are a lot of encodings out in the wild on gopher. Standard
gopher has no way of telling you what they are. The only way to write
a client which can Just Go Anyway is to use some kind of third party
library to try to "sniff" the encoding (VF-1 uses Chardet[3] for
this). That's a hard problem, which is never guaranteed to be
solvable, and is only possible using a big slab of natural language
corpus statistics. This requirement massively flies in the face of
the RFC1436-enshrined philosophy that "intelligence is held by the
server". When all a protocol does is shovel a bunch of bytes down
your throat and say "you figure out what this is and what to do with
it!", you need a *very* intelligent client for it to really work out
in all conditions. I don't think it makes much sense to have every
client repeat exactly the same set of expensive computations
after requesting a document in order to figure out information that
the server *already knows*, but didn't share.
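
To show what the client is reduced to, here's a deliberately naive
Python sketch of guessing an encoding using only the standard
library. This is *not* what VF-1 does (it uses Chardet, as mentioned
above); the function name and the list of candidate encodings are
assumptions of mine.

```python
def decode_guessing(raw: bytes, candidates=("utf-8", "iso-8859-1")):
    """Try each candidate encoding in turn; return (text, encoding)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to UTF-8, replacing bad bytes
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

The catch is that iso-8859-1 will "successfully" decode any sequence
of bytes at all, so text in some third encoding comes out as garbage
with no error raised. Doing better than this is exactly where the
statistics come in.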

There's a saner alternative to this, and it's for the server to tell
the client, succinctly, what it's actually getting. This can be
implemented with a very small increase in protocol complexity, which
can result in a very large decrease in client complexity.

Consider the following as a response format, in a hypothetical
protocol which retains gopher's bare bones request format:

----------
<STATUS><TAB><MIMETYPE><TAB><ENCODING><CR><LF>
<Actual content>
----------

A concrete example:

----------
200 text/plain utf-8
Hello, world!
----------

The text encoding could be optional for non-text MIME types. We could
get away from having to specify an encoding at all if this protocol
specified "Thou shalt use UTF-8 and no other encoding shalt thou use",
saving us ~5 bytes, but I dunno if that's too authoritarian. Yes, you
can represent any language you like in UTF-8, but some languages can
be represented more compactly in other encodings, and it seems like a
good thing to provide the ability to minimise the number of bytes sent
over the network. Isn't that also part of the spirit of a minimalist
protocol? A compromise: if you use UTF-8, it's valid to leave off the
third component of the response header. UTF-8 is the implicit
default, but other encodings are possible for a tiny extra cost.
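
As a sanity check on how little work this asks of a client, here's a
Python sketch of parsing such a response, including the "UTF-8 if
omitted" compromise. The function name is made up, and returning
None as the encoding for non-text types is my own assumption about
how "optional" might play out.

```python
def parse_response(raw: bytes):
    """Split a raw response into (status, mimetype, encoding, body)."""
    header, separator, body = raw.partition(b"\r\n")
    if not separator:
        raise ValueError("no header line found")
    fields = header.decode("ascii").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError("malformed header")
    status, mimetype = fields[0], fields[1]
    if len(fields) == 3:
        encoding = fields[2]
    elif mimetype.startswith("text/"):
        encoding = "utf-8"   # the implicit default for text
    else:
        encoding = None      # assumption: no encoding for non-text types
    return status, mimetype, encoding, body
```

The body is handed back as raw bytes, so the same function works for
text and binary content alike. Decoding is then a one-liner for
anything which came back with an encoding.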

For the sake of fully specifying a system, including a navigation
solution, without any further discussion or design, let's keep
gopher's menu system as is, and introduce a new pseudo-MIME type for
it, like text/menu or something. I'm not saying this is a great idea,
it just provides a complete concrete example to talk about for the
rest of this post.

If we give gopher a complexity score of 1 and full-blown HTTP a
complexity score of 100, I don't see how this new protocol can be
reasonably scored higher than 10. It's still absolutely trivial to
write a client for this protocol, a nice little weekend project. You
can memorise the protocol easily so you don't need to look up a
complicated RFC to remind yourself of some detail while coding. You
can still cobble together a client out of standard unix utilities:
the response header is guaranteed to be one line long, so you can
just pipe what you get from the network through `tail -n +2` to cut it
off. I'm not sure if that would work for binary files, admittedly,
but for something vaguely gopher-like that's an edge case anyway. You
could even still use telnet as a client for this protocol if you
wanted to. Yes, you would see one short line of noise at the top of
each file, but that's a heck of a lot better than seeing a full set of
HTTP headers and I guarantee you'd get used to it and stop even
consciously seeing it after a day of practice. None of the extra
information in this header represents any threat to a user's privacy.
The network overhead is around 20 bytes per request, which is less
than 1% of the size of a typical phlog post.
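
To back up the "weekend project" claim, here's a sketch of more or
less the entire network side of a client in Python. The protocol is
hypothetical, so the function names are too, and there's no default
port because no port has been assigned to anything.

```python
import socket


def request(sock: socket.socket, selector: str):
    """Send a selector over a connected socket; return (fields, body)."""
    sock.sendall(selector.encode("utf-8") + b"\r\n")
    # The server signals the end of the content by closing the
    # connection, so just read until there is nothing left.
    with sock.makefile("rb") as stream:
        raw = stream.read()
    # Binary-safe equivalent of `tail -n +2`: split off the first line
    header, _, body = raw.partition(b"\r\n")
    return header.decode("ascii").split("\t"), body


def fetch(host: str, port: int, selector: str, timeout: float = 10.0):
    """Open a connection, make one request and return (fields, body)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return request(sock, selector)
```

Because the header is split off at the first <CR><LF> and the rest
is never treated as lines of text, this works for binary files too.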

Compared to gopher, this protocol can:

* Use standard URLs without embedded item types, without any
ambiguity.
* Serve plain text in any encoding under the sun, without ambiguity
that would otherwise force the client to waste computational effort
trying to identify the encoding.
* Serve any kind of non-text content under the sun, without ambiguity
that would otherwise force the client to waste computational effort
trying to identify the binary file format, and without being forced
to categorise the content as one of a small number of pre-defined
item types which are either hopelessly vague or, in 2019, just kind of
whacky (e.g. gopher item type 5, "PC-DOS binary file of some sort").
* Precisely indicate error conditions in a machine-readable way. In
the example above I just copied HTTP's "200" status for
"everything's fine", but in reality HTTP's three digit status codes
are surely overkill for anything vaguely gopher-like. Status codes
could probably be a single character. I haven't thought too much
about applications of these. We *could* go nuts, implementing
redirects and all sorts, but I'm not really keen. From time to time
there are complaints on the gopher mailing list about badly behaved
crawlers making too many requests per second and overloading
servers, so a "too many requests, try again later" error code would
seem a practical thing. I can't imagine any situation where 99.9% of
requests wouldn't result in one of just 3 or 4 statuses. It should be
possible to learn all the status codes by heart easily.

This protocol is not as simple as gopher, but I would argue its power
to weight ratio is substantially greater. It's still very simple, and
it's still totally harmless. Crucially, it's non-extensible: the
response header is not open ended, like HTTP's is, so people can't
just add in whatever extra junk they like. I don't want to say that
extensibility is a bad thing, it's often a very smart engineering
solution to some particular problem, but I think I do want to say that
extensibility is the enemy of intentionally brutal simplicity.
Optional extra cruft will inevitably accumulate and then become a de
facto requirement.

In the third and final post in this series, I'll address
possible solutions to the problem of navigation.

[1] gopher://zaibatsu.circumlunar.space:70/0/~solderpunk/phlog/protocol-pondering-intensifies.txt
[2] https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
[3] https://chardet.github.io/