Gemini maps 3
-------------

Whew! Simplicity ain't simple, folks.

Sean is not happy with the newly proposed [text|link] syntax.
Nevertheless, there are now items using that syntax up at
gemini.conman.org. In fact, both the new syntax and the old tab-based
syntax are used there, which is smart! Clients written for one will
just display the other as raw text without any kind of error, so this
is a wonderful way to maintain compatibility during the transitional
period. I really appreciate Sean being a good sport and including the
new links even though he's not a fan.

The objection to the new syntax is that [, ] and | are ASCII printable
characters which a user might reasonably want to include in the text
portion of a link, or which could even conceivably need to appear in
the link part, if they were used in, e.g. a filename. It would be
easy to say "those are dumb characters to use in a filename, don't do
that", but arbitrary restrictions like this are unattractive and it
would be nice to avoid them where possible.

It seems obvious that *any* proposed syntax is going to have some
disadvantage or limitation which means some won't like it. The
sensible thing to do is to consider the severity and frequency of
these problems and choose a syntax whose inevitable shortcomings
minimise those things as much as possible.

I'm still very convinced that tabs are a bad idea due to the
impossibility of unambiguously parsing an intended link with your
eyeballs. They're also problematic because when using a mouse to
copy and paste text in a terminal environment, the tab/space
distinction can easily get lost. I can very easily see these kinds of
issues causing confusion and frustration and broken links time and
time again. I think that syntax would ultimately cause more pain
more often than a limitation like "thou shalt not use these three
slightly unusual characters in your link text or filenames", so I
still think this latest step has been in the right direction.

And actually, I'm not sure the problem is as bad as "thou shalt
not...".

The new spec says that a link line must begin with a [ and end with a
]. That doesn't conflict at all with arbitrary additional instances
of either character. My two Python clients recognise and parse links
like this:

if line[0] == "[" and line[-1] == "]" and line.count("|") == 1:
    text, link = line[1:-1].split("|")

That code will happily handle a text (or link) component of
"[[]][lolz!][[]]", no worries. And not because I was smart and wrote
clever code that could handle it. I wrote the most straightforward
code possible for this and it just happens to be totally robust to [
and ] characters appearing anywhere else. So, I think the problem is
in fact limited entirely to |.

Using a | anywhere in the text or link component will result in my
code above not recognising a line as a link, so we need to think about
that.
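
To make both halves of that concrete, here is a minimal sketch of the
same check wrapped in a function. The name parse_link_simple and the
example.org URLs are made up for illustration, and it uses
startswith/endswith rather than indexing only so that an empty line
doesn't raise an error:

```python
def parse_link_simple(line):
    """Return (text, link) for a [text|link] line, or None otherwise.

    Hypothetical helper: accepts a line only if it is wrapped in
    [ and ] and contains exactly one | character.
    """
    if line.startswith("[") and line.endswith("]") and line.count("|") == 1:
        # Strip the outer brackets, then split on the single pipe.
        text, link = line[1:-1].split("|")
        return text, link
    return None


# Extra brackets in the text part are harmless...
print(parse_link_simple("[[[]][lolz!][[]]|gemini://example.org/]"))
# ...but an extra pipe means the line is not recognised as a link.
print(parse_link_simple("[a | b|gemini://example.org/]"))
```

The first call should give back the bracket-heavy text untouched,
while the second should come back as None.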

The "link" part of [text|link] is supposed to be a URL. Previously I
said that absolute URLs were definitely okay and I would think about
relative ones. Now that I've written a 100 line client which handles
relative URLs, I'm convinced that supporting them is not really very
difficult at all, so let's say they're allowed.
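
For what it's worth, here is one way relative links could be resolved
in Python. This is only a sketch, not the code from my client. It
assumes the page's own URL starts with gemini://, and it leans on a
workaround: urllib.parse.urljoin only resolves relative references
for schemes it knows about, so the sketch borrows "http" for the
join and swaps the scheme back afterwards. The function name and the
example.org URLs are invented:

```python
from urllib.parse import urljoin, urlsplit


def resolve_link(base_url, link):
    """Resolve a possibly relative link against the current page's URL.

    Hypothetical helper. Assumes base_url starts with "gemini://".
    """
    # A link that carries its own scheme is already absolute.
    if urlsplit(link).scheme:
        return link
    if not base_url.startswith("gemini://"):
        raise ValueError("this sketch only handles gemini:// base URLs")
    # Workaround: do the join under a scheme urljoin understands.
    http_base = "http://" + base_url[len("gemini://"):]
    joined = urljoin(http_base, link)
    return "gemini://" + joined[len("http://"):]


print(resolve_link("gemini://example.org/docs/index", "../maps/3"))
```

The scheme swap is ugly, but it avoids reimplementing the path
merging rules by hand.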

Absolute URLs are defined in RFC1738, and the | character (and,
incidentally, the [ and ] characters) are specified there as being
"unsafe" characters which must be escaped in URLs. Relative URLs have
their own spec in RFC1808, which doesn't explicitly discuss
"unsafe" characters, but based on a cursory skim (and expectation of
reasonable sanity in these RFCs) I don't think those characters
suddenly become safe in a relative context. So, it seems to me that
it is actually totally invalid to use a | in the right-hand part of a
link with the new syntax.

In passing - I'm not thrilled that invoking those URL RFCs means that
a really robust Gemini client/server is now probably going to have to
do all sorts of fiddly escaping/unescaping to cover edge cases. But,
using a non-standard definition of URLs is a sure road to madness, and
there will be existing libraries for this stuff anyway.
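
As a rough idea of what that escaping looks like with an existing
library, here is a small sketch using quote and unquote from Python's
standard urllib.parse module. The filename is invented, and a real
server or client would need to decide which parts of a URL to escape
and when:

```python
from urllib.parse import quote, unquote

# A made-up filename containing all three troublesome characters.
filename = "notes|draft[1].txt"

# Percent-encode everything that isn't safe in a URL path,
# leaving "/" alone so that directory separators survive.
escaped = quote(filename, safe="/")
print(escaped)

# Unescaping recovers the original name.
print(unquote(escaped))
```

After escaping, the link part contains no literal |, [ or ] at all.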

Now, strictly speaking, if there are guaranteed to be no |s in the
right-hand part of a link, this totally disambiguates any use of them
in the left-hand part: you just take everything after the *last* | as
the URL and everything before it as the text. The following code:

if line[0] == "[" and line[-1] == "]" and "|" in line:
    text, link = line[1:-1].rsplit("|", 1)

should allow arbitrary use of [, ] and | in the text part of a link
without problems, as long as the link part is a valid URL with any
occurrence of | escaped.
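
Here is that idea as a small sketch you can poke at. The function
name parse_link and the sample lines are hypothetical, and it assumes
the link part really is a valid URL with any | written as %7C:

```python
def parse_link(line):
    """Return (text, link) for a [text|link] line, or None otherwise.

    Hypothetical helper: everything after the LAST | is taken as the
    URL, so the text part may contain any number of | characters.
    """
    if line.startswith("[") and line.endswith("]") and "|" in line:
        # maxsplit=1 from the right: split only on the final pipe.
        text, link = line[1:-1].rsplit("|", 1)
        return text, link
    return None


print(parse_link("[cat file | sort|gemini://example.org/pipes]"))
print(parse_link("[odd name|gemini://example.org/a%7Cb]"))
```

The pipe in the first line's text stays in the text, and the escaped
pipe in the second line's URL never gets treated as a separator.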

But I'm not really thrilled about this, because it means that you need
to be just a little bit careful when parsing links. The above makes
it look easier than I think it will be in the average case - Python's
rsplit method makes this trivial but some languages lack an equivalent
and in general this approach is probably going to add a line of code
or three.
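
To give a feel for the extra work, here is roughly what the split
looks like when written by hand, the way you might in a language
that has no built-in "split from the right". It is written in Python
only for consistency, and the function name is made up:

```python
def split_on_last(s, sep):
    """Split s around the last occurrence of the single character sep.

    Returns (before, after), or None if sep does not occur in s.
    Hypothetical stand-in for what rsplit(sep, 1) does for us.
    """
    # Walk backwards from the end until we hit the separator.
    i = len(s) - 1
    while i >= 0 and s[i] != sep:
        i -= 1
    if i < 0:
        return None
    return s[:i], s[i + 1:]


print(split_on_last("cat file | sort|gemini://example.org/pipes", "|"))
```

Not a lot of code, but it is a loop and an index to get right
instead of a single library call.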

It's still, though, much better than saying "if you want to use a | in
your link text, escape it as \|". I surely don't want to go down that
route.

The alternative is to say "thou shalt not use |s in your link names",
which would allow a simple "ordinary split" on the one and only |
character. This would be kind of a shame, and I'd rather not have
this kind of limitation in place. That said, I don't think this
particular limitation can be considered deal-breakingly bad. The use
of | as a separator in linky contexts is not uncommon, so people using
geomyidae and MediaWiki are living with that restriction, and they
don't precisely seem miserable about it.

I don't know which of these is the lesser of two evils. Opinions very
welcome!

It's true that the original tab-based syntax was easier to parse - and
also much easier to write. It's certainly a selling point that the
tab key is *right there* on the keyboard, and even non-technical folk
who don't know that "pipe" means | know exactly where tab is. Just
writing that makes me feel bad about the new syntax! But one of the
lessons I think that gopher has for us is that there's such a thing as
what ratfactor called "the wrong kind of simplicity" - simplicity
which results in ambiguity and/or disproportionate extra work. I
think that a link syntax which is very quick and easy to write and
parse but also enables the user to very quickly and easily make
mistakes which are hard to spot and fix might just be the wrong kind
of simplicity.

I regret immediately implementing the new syntax just because I
personally found the argument for it very compelling, without taking
longer to check in with the other people with a stake in this,
assuming that it would be unobjectionable. In the spirit of my
recent "slowing down" post, I'm not going to immediately scramble to
adopt a third syntax. Instead, let's talk about it and try to build
consensus. But I don't want us to get bogged down forever on this
one detail. Sean, if the above discussion has adequately addressed
your concerns, let me know and we'll stick with the new syntax. If
you really still don't like it, we can consider yet a third option
(I'm not willing to go back to the first tab option due to its severe
shortcomings), but I insist that the third one will be the last one,
so that we can move on to more interesting and important questions.

If anybody has a proposal, please write it up and let me know. If
you phlog about it, please share the link with me via email or
Mastodon because this conversation has extended beyond the small
subspace of gopherspace which I routinely keep tabs on and I may not
see it. I will not consider any proposal which has the whitespace
ambiguity problem of our first syntax, or which is even a little bit
more difficult to parse correctly than the second syntax. Please
don't propose a change just because you think different syntax would
"look nicer". A new proposal should be motivated entirely by
practical considerations regarding ease of writing (which I think
probably has to rule out otherwise interesting options like the
"separator" ASCII control characters), parsing (please, nothing which
makes people even *consider* using a regex), lack of ambiguity, etc.

Here's something to ponder: using a printable ASCII character like |
to separate the text from the link actually isn't necessary, as
unencoded whitespace is forbidden in valid URLs. So, once a line has
been identified as being a link, it's possible to just split it on
whitespace and take the last component as the URL. The following is
unambiguous, right?

[Mare Tranquillitatis People's Circumlunar Zaibatsu gemini://zaibatsu.circumlunar.space]

This would allow | and indeed anything else to appear in the text
part (or even, RFC-breakingly, in the URL), no ambiguity problems.
Given this insight, in fact, the only thing we need on top of
<NAME><WHITESPACE><LINK> is some "garnish" to aid in recognition as a
link. This could be something bracketing that whole construct, as
above, or just a distinctive character combination in front.
Any of the following seems like it would work:

#! Mare Tranquillitatis People's Circumlunar Zaibatsu gemini://zaibatsu.circumlunar.space
@~ Mare Tranquillitatis People's Circumlunar Zaibatsu gemini://zaibatsu.circumlunar.space
=> Mare Tranquillitatis People's Circumlunar Zaibatsu gemini://zaibatsu.circumlunar.space

This is very easy to recognise. Instead of:

if line[0] == "[" and line[-1] == "]" and line.count("|") == 1:

one would need just, e.g.:

if line.startswith("#! "):

That's actually *much* nicer to look at, IMHO. This syntax is also
very obviously not intended to be used in-line, which is another
advantage over [text|link]. I have to say, to me this kind of syntax
seems to avoid the worst problems of both the earlier proposals. It
also feels "lighter" somehow, I guess because all the actual syntax is
concentrated in one place on the left, instead of being spread all
along the line. I'm not attached to any particular pair of characters
at the start. Can people see real problems with this approach?
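
For anybody who wants to experiment, here is a sketch of a complete
parser for this style of link line. Everything specific in it is an
assumption for the sake of the example: the "=> " prefix is just one
of the candidates above, the function name is invented, and I have
arbitrarily chosen to reject lines which have a URL but no text:

```python
LINK_PREFIX = "=> "  # assumption: any distinctive prefix would do


def parse_prefixed_link(line):
    """Return (text, link) for a prefixed link line, or None otherwise.

    Hypothetical helper: the last whitespace-separated component is
    taken as the URL, everything before it as the text.
    """
    if not line.startswith(LINK_PREFIX):
        return None
    # Split once, from the right, on any run of whitespace.
    parts = line[len(LINK_PREFIX):].rsplit(None, 1)
    if len(parts) != 2:
        # Only one component: no text part, so not treated as a link here.
        return None
    text, link = parts
    return text.strip(), link


print(parse_prefixed_link(
    "=> Mare Tranquillitatis People's Circumlunar Zaibatsu "
    "gemini://zaibatsu.circumlunar.space"
))
print(parse_prefixed_link("=> a | b [c] gemini://example.org/"))
```

Pipes and brackets in the text part pass straight through, since the
only thing the parser cares about is the final run of whitespace.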