Protocol pondering intensifies
------------------------------

This is the first of a three-part epic miniseries of, well, ponderings
of new gopher-like protocols.  If you're not into that kind of thing,
feel free to ignore them.  If you *are*, grab a coffee or beer or
something and get comfy!

Any sort of document retrieval protocol needs to specify two things:
the format of a client's request, and the format of a server's
response.  Judged on these two things alone, Gopher is impossible to
make simpler or more minimal.  It literally is the simplest thing that
could possibly work, and this aspect of Gopher can barely even be
considered to have been "designed".  All of the actual *decisions* are in
other details like the item type system and the menu format.  The
request and response formats are pure Void.

For those unfamiliar with the protocol, when you came here, your
client connected to zaibatsu.circumlunar.space on port 70 and said:

----------
~solderpunk/phlog/protocol-pondering-intensifies.txt<CR><LF>
----------

(<CR> and <LF> are the ASCII carriage return and line feed
characters; read up on ASCII if this is new to you.)  That's it.
Just a unique identifier for the document you want (a "selector" in
gopher lingo, equivalent in every way to a "path" in HTTP), plus an
unambiguous way of terminating the request, and nothing more.  Both
of those things are essential, and any functional protocol will have
them (we'll see them in HTTP shortly); Gopher has nothing else.  Real
ultimate simplicity!

In response the server said, well, the contents of this file, and
that's it.  According to RFC 1436, the server ought to include a
terminating line of ".<CR><LF>", but the RFC also says that "The
client should be prepared for the server closing the connection
without sending the Lastline", and a lot of modern servers do seem to
leave it out.

That's all there is, folks!  It's this brutal minimalism which means
you can use telnet as a gopher client with only mild discomfort.
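Seriously, try it: type the selector, hit enter, and the file comes
straight back (telnet's connection chatter omitted here, and the
response truncated):

----------
$ telnet zaibatsu.circumlunar.space 70
~solderpunk/phlog/protocol-pondering-intensifies.txt
Protocol pondering intensifies
------------------------------
...
----------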

Let's bump the complexity up a bit!  What does an HTTP request look
like?  The simplest valid one would be:

----------
GET /~solderpunk/phlog/protocol-pondering-intensifies.txt HTTP/1.0<CR><LF>
<CR><LF>
----------

UPDATE 17/06/2019: Thanks to Mastodon user @[email protected] for
pointing out that these posts originally contained
standards-non-compliant HTTP/1.1 requests!

What's the extra baggage here?  The "GET" is called an HTTP method,
and tells the server that we want to, well, get the document at the
specified path, as opposed to, e.g., uploading something new there or
deleting something which is already there.  This is actually another
real, core protocol-level difference between HTTP and Gopher - gopher
is a strictly consumption-oriented protocol.  It's for reading, not
writing.  You may have seen gopher guestbooks around the place, and
surely those involve writing?  They're a clever hack using gopher's
search functionality.  From the point of view of the protocol, you're
actually "searching" for your guestbook comment, and the server just
does something decidedly non-searchlike with your query of 256
characters or fewer.  There's also Alex Schroeder's Oddmuse wiki
server[1], which has a gopher mode using a non-standard item type to
allow writes.
But I'm getting off track, back to the HTTP request!  After the
"GET", there's the path, no different to gopher, really.  Finally,
the "HTTP/1.0" is a protocol version number; it tells the server
we're using HTTP 1.0 and not a later or earlier version.  This
information is useful if a protocol changes substantially over its
lifetime.  If the protocol is fixed in stone, then it's dead weight.
Why the blank second line, containing only <CR><LF>?  The above is
the simplest possible HTTP request, but you can add a lot of extra
optional stuff, in the form of "headers".  Because you can add as few
or as many headers as you like, the number of lines in an HTTP
request is variable, and so a blank line is needed to unambiguously
end the request.  A slightly fancier request might look like this:

----------
GET /~solderpunk/phlog/protocol-pondering-intensifies.txt HTTP/1.0<CR><LF>
If-Modified-Since: Wed, 12 Jun 2019 01:02:03 GMT<CR><LF>
Accept-Language: en-US, de<CR><LF>
<CR><LF>
----------

This request has two headers, and it says "Send me this phlog post,
but only if it's changed since the last time I fetched it a few days
ago (if it hasn't changed, I'll use my cached copy), and only if you
have a version in US English or German".  This is still a *very*
minimal HTTP request.  A modern browser like Firefox or Chrome will
probably jam at least a dozen headers into a typical request.
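
For symmetry with the gopher sketch above, here's that conditional
request spoken over a raw socket in Python.  The hostname is
hypothetical, and note the blank line doing the work of ending a
variable-length request:

----------
import socket

HOST = "example.com"  # hypothetical HTTP server
PATH = "/~solderpunk/phlog/protocol-pondering-intensifies.txt"

request = (
    "GET " + PATH + " HTTP/1.0\r\n"
    + "If-Modified-Since: Wed, 12 Jun 2019 01:02:03 GMT\r\n"
    + "Accept-Language: en-US, de\r\n"
    + "\r\n"  # the blank line: unambiguous end of the request
)

with socket.create_connection((HOST, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

# The status line tells us whether we got fresh content ("200 OK")
# or should keep using our cached copy ("304 Not Modified").
print(response.split(b"\r\n", 1)[0].decode("ascii"))
----------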

What's wrong with request headers?  Well, there is nothing
fundamentally wrong with them.  A lot of the headers in HTTP are
related to caching, and caching is neither dumb nor evil.  If all
your content is small and it changes rarely then caching is totally
unnecessary, but for a fully general purpose protocol it makes sense.
I don't think there is any need for caching in gopher, and I don't
advocate adding it to a hypothetical new protocol to sit somewhere
"between gopher and the web".  The language preference thing seems
like a nice idea, but in practice I've never seen it actually used.
Every multilingual website I've ever visited makes you play "hunt the
tiny flag icon" to change the language, so in reality it's more dead
weight.  A lot of HTTP headers fall into one of these two categories:
genuinely useful stuff for a sufficiently aspirational protocol, or
good intentions which are rarely used in practice.

However, request headers are also the mechanism for a lot of the
nastiness in the modern web.  The "User-Agent" header is how your
browser tells the server which browser version you're using, which is
None of Their Damn Business and is only something the server actually
needs to know if different clients have substantially different ways
of handling the same response, which is a Really Dumb Idea.  The
"Referer" header is how your browser tells the server which *other*
webpage linked you to the one you're actually requesting, which is yet
more None of Their Damn Business (it has an arguably valid application
in preventing "hot linking" of images, but that's not a big concern
for anything vaguely gopher-like).  And, of course, the "Cookie"
header is half of how cookies work - cookies come *into* your browser
via a "Set-Cookie" HTTP *response* header (more on those in the next
entry), and are then sent back via a "Cookie" header in subsequent
requests to the same server.  Even if you wrote a browser which never
sent any of these Three Evil Headers, it turns out that the unique
combination of seemingly-harmless headers you might send, about your
cache status and your encoding preferences and language preferences
and bla-bla-bla, can act as a nearly-unique browser "fingerprint" to
facilitate tracking (a problem widely publicised by the EFF with their
"Panopticlick" site[2] back in 2010.  Really, only 2010?  It feels
older to me...).
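
If the fingerprinting point seems abstract, here's a toy illustration
in Python.  The header values are hypothetical, and a real
fingerprinter draws on far more signals than this, but the idea is
the same:

----------
import hashlib

# A hypothetical set of "harmless" request headers.
headers = {
    "Accept": "text/html,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.7,de;q=0.3",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0)",
}

# Serialise in a stable order and hash: any two users who differ in
# even one preference get a completely different digest.
material = "\n".join(k + ": " + v for k, v in sorted(headers.items()))
print(hashlib.sha256(material.encode("utf-8")).hexdigest())
----------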

So, request headers have a lot to answer for.  Do we need to ban them
outright, or can we just put strong restrictions in place, maybe limit
ourselves to one or two or three request headers which are obviously
harmless?  Well, if we want to have anything even remotely like a
strong guarantee of anonymity and untrackability, we'd need to insist
on a principle like "almost all requests for the same resource from
different clients should look almost exactly the same".  And if we
meet that condition, then I think request headers become dead weight.
If everybody is specifying more or less the same information, then
just let that information become an unspoken assumption of the
protocol and drop the headers.  So, in a protocol which is supposed to
be anonymous, I just don't see any place for request headers.  In this
respect, gopher gets it exactly right and I see no reason to advocate
anything other than keeping exactly the same request format in the
Future Protocol of Truth and Glory.

Strictly forbidding request headers breaks any possible round trip
for information from the server to the client and back, like cookies.
(well, not quite: an unscrupulous server can always inject
pseudo-cookies into paths.  I've written about this elsewhere[3].  It
keeps me up at night, but there's no way to guard against it, so
that's that).  By breaking this connection, the decision to leave out
request headers renders any and all possible *response* headers
harmless.  Response headers are already less scary in general,
simply because we don't have to worry about clients tracking servers
in the same way we do about servers tracking clients.  But if
there's *any* way for information
that rides in on a response header to make it back to the server,
even if it's seemingly harmless information, that channel can and
eventually will be abused for tracking.  In HTTP, this has happened
with "Etag" headers[4].  ETags are a kind of lightweight checksum
intended for cache validation, but they have been used as part of
so-called "super-cookies", where different clients are sent slightly
different ETags for the same resource.  Then, even if you delete that
site's cookies, if you don't also clear your browser cache, the site
can recognise you from the Etag and send you the *same* cookie back.
Insidious!  So even seemingly harmless response headers can in fact be
Pure Evil if there is a back channel.  Breaking that channel lets us
relax when thinking about response headers - which is good, because I
think that they're actually the place where genuinely useful
enhancements can be made, compared to just sending the content and
nothing else, which is how gopher works.  More on this in another
entry soon!
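
Before that, though, the promised sketch of the ETag supercookie
trick.  This is hypothetical server-side logic boiled down to its
essence in Python; a real tracker would sit inside a web server and
use a database:

----------
import uuid

# etag -> client identity.
seen = {}

def serve(if_none_match=None):
    """Handle a request for one fixed resource (hypothetical logic)."""
    if if_none_match in seen:
        # The browser revalidated its cache: we know who this is,
        # even if they deleted our cookies.
        return "304 Not Modified", if_none_match, seen[if_none_match]
    # New visitor: mint a unique ETag for the *same* content.
    etag = uuid.uuid4().hex
    seen[etag] = "visitor-" + str(len(seen) + 1)
    return "200 OK", etag, seen[etag]

# First visit: nothing cached, so no If-None-Match header is sent.
status, etag, who = serve()
print(status, who)               # 200 OK visitor-1
# Later visit: the browser dutifully echoes the ETag back.
status, etag, who = serve(if_none_match=etag)
print(status, who)               # 304 Not Modified visitor-1
----------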

[1] https://oddmuse.org/wiki/Gopher_Server
[2] https://panopticlick.eff.org/
[3] gopher://zaibatsu.circumlunar.space:70/0/~solderpunk/phlog/on-gopher-conservatism.txt
[4] https://lucb1e.com/rp/cookielesscookies/