* * * * *

A long, slightly rambling, but deeply technical entry follows, so if you
aren't interested in software internationalization, character sets and
variable type systems, you might want to skip this entry entirely; you have
been warned.

> > how could Planet RDF (Resource Description Framework) do things better?
> >
>
> Danny, you can see the problem above.
>
> Is the Planet RDF code available? I would gladly provide a patch.
>
> The problem is not in your input feed. Your feed contains the following
> character: ’. Planet RDF converts that into binary, which I will express in
> hex: xC3A2C280C299. The correct formulation would be xE28099. I've got a
> good idea about what is going on under the covers, as:
>
> * â => xC3A2
> * € => xC280
> * ™ => xC299
>
> In other words, for some reason, Planet RDF is effectively doing an
> iso-8859-1 to utf-8 conversion on utf-8 data.
>

“Sam Ruby [1]”

It's a mess.

Part of the problem I'm sure is that not many programmers are used to i18n
(Internationalization). And part of it is not many programmers are aware of
what data is in what format at what part of the processing. And that is
related to variable types in programming.

The first problem is i18n. I won't say it's trivial, but it does take some
forethought to handle it. Thirty years ago it was simple—you used the
character set of whatever country you were in and that was that. Since
computers work with numbers, a “character set” is just a mapping of values
(say, 65) to a visual representation of a character, a “glyph” (say, “A”).
Here in the US (United States), we used ASCII (American Standard Code for
Information Interchange), where in this case, the character “A” is internally
represented with the value 65 (and the lower case “a” by a different value,
97, since technically it is a different glyph). The character “A” in, say,
Freedonia might be stored as the value 193.

No problem. Well, unless your software was sold internationally—then you had
the problem of handling different character sets. And there are a lot of them.

So a solution (and it's a pretty good solution) is to use some internal
representation of every possible character set as the program is running
(say, Unicode [2], which was defined to handle pretty much any written
language) and do any conversions on input and output to the locally defined
character set (I've found the GNU iconv library [3] to be pretty easy to use
and it can handle a ton of different character sets). Yes, there are some
really obscure corner cases doing this, but for the most part, it'll handle
perhaps 95% of all i18n issues.
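The decode-on-input, encode-on-output idea—and the double-conversion bug from
the quote above—can be sketched in Python (using Python's built-in codecs in
place of GNU iconv):

```python
# Decode bytes from the locale's character set into Unicode on input,
# work in Unicode internally, encode back out on output.
def to_internal(raw: bytes, local_charset: str) -> str:
    return raw.decode(local_charset)

def to_external(text: str, local_charset: str) -> bytes:
    return text.encode(local_charset)

# "’" (right single quote, U+2019) is the three bytes E2 80 99 in UTF-8:
assert to_internal(b"\xe2\x80\x99", "utf-8") == "\u2019"
assert to_external("\u2019", "utf-8") == b"\xe2\x80\x99"

# The Planet RDF bug: treat UTF-8 bytes as if they were ISO-8859-1,
# then re-encode as UTF-8, and the data gets double-encoded:
mangled = to_external(to_internal(b"\xe2\x80\x99", "iso-8859-1"), "utf-8")
assert mangled == b"\xc3\xa2\xc2\x80\xc2\x99"   # the xC3A2C280C299 from above
```

Run backwards, the same round trip explains why the broken output *looks* like
â€™ when rendered: each UTF-8 byte was promoted to its own character.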

But web based documents—or rather, documents based upon HTML (HyperText
Markup Language)/XML (eXtensible Markup Language)—present a wrinkle. HTML/XML
use the “<” character (value of 60) to mark the beginning of a tag, like <P>
(in this case, this particular tag denotes the start of a paragraph). But
what if the content (that is, the non-tag portion) actually requires a literal
“<”? What then? Well, HTML/XML use another character, “&”, to encode
characters that you would otherwise be unable to use. So, to get a “<” you
would write it (or “encode it”) as &lt;, which means that to use a “&” in
your document, you need to write it as &amp;.
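As a rough sketch, the escaping rule is a few lines of Python (the standard
html module does the same substitutions; note that “&” must be escaped first,
or the “&” in “&lt;” would itself get escaped):

```python
import html

def escape(text: str) -> str:
    # "&" first, then the angle brackets:
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))

assert escape("a < b & c") == "a &lt; b &amp; c"
# The standard library agrees:
assert html.escape("a < b & c", quote=False) == "a &lt; b &amp; c"
```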

But you can use this form, called “entities” (or HTML entities) to include
other characters that might otherwise not be part of your local character
set, such as the much nicer typographical quotes “” (compare to the
computerish ""), or even the star in the middle of Wal★Mart.

Now, most of these entities, like &amp;, are named, such as &ldquo; for the
nice looking opening double quote, or &rarr; for a right arrow. But a lot
don't have names, like the star in the middle of Wal★Mart. To get those, you
need to use a numeric value, like &#9733;.
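The numeric form is simply the character's Unicode code point written in
decimal, which is easy to check in Python:

```python
import html

# &#9733; names the character with Unicode code point 9733:
assert ord("★") == 9733
assert chr(9733) == "★"

# html.unescape resolves named and numeric entities alike:
assert html.unescape("&#9733;") == "★"
assert html.unescape("&ldquo;quote&rdquo;") == "\u201cquote\u201d"
```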

But, where do you get these values?

Well, they're Unicode values.

So, even if you're writing your page in a Slavic language (using, for
instance, the character set defined as ISO (International Standards
Organization)-8859-5 [4]), the “á” can still be encoded as &#225;, using the
Unicode value for “á” (since ISO-8859-5 doesn't contain that glyph). To
further confound things (and seeing how for this example we're using
ISO-8859-5) the character “Љ” can be inserted as a character with value 169
(whereas if viewed as an ISO-8859-1 character, it would be “©”) or as &#1033;
(which would work regardless of the character set you are using for your
document).
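That ambiguity is easy to reproduce in Python, where “iso-8859-5” and
“iso-8859-1” are the codec names for the two character sets:

```python
# The same byte, 169 (0xA9), means different things in different charsets:
byte_169 = bytes([169])
assert byte_169.decode("iso-8859-5") == "Љ"   # Cyrillic Lje, U+0409
assert byte_169.decode("iso-8859-1") == "©"   # copyright sign, U+00A9

# The numeric entity &#1033; is charset-independent because 1033 is
# the Unicode code point of "Љ":
assert ord("Љ") == 1033
```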

Got it?

If not, you're in good company with lots of other programmers out there. And
that's even before we get to the issue of submitting text to a CGI (Common
Gateway Interface) script (I'll get to that later).

And that leads to the second problem, where programmers don't know what's
what where in their program. And that's related to variable types. Or rather,
the lack of variable types. Or rather, the rather lackadaisical approach to
typing most modern computer languages have today (Python? Perl? I'm looking
at you).

In today's modern languages, unlike slightly older languages like C or
Pascal, you don't really need to declare your variables before you use them—
you can just use them. Sure, you can predeclare them (and if you are smart,
you do) but you don't have to, and even when you do, you just declare that
foobar is a variable, not what it can hold (well, granted, in Perl, you have
to tell it if the variable can contain a single value, or a list of values,
or a table of values). As Lisp programmers are wont to say: “a variable
doesn't have a type, values do.”

Now, why is that such a problem?

Well, other than muddle-headed thinking [5], there's overhead at runtime in
determining whether the operation is permitted on the type of the value held
in a variable and, if not, what can be done with the value to make it the
right type for the operation in question. So, in the following Perl code:

> $result = $a + $b;
>

at run time the computer has to determine the types of the values of the
variables $a and $b—if numeric, it can just add them; if strings, it has to
convert the strings to numeric values and then add them. And what if $a
contains the string “one” and $b contains the string “two”? $result will have
the numeric value of 0, because “one” and “two” do not convert to numeric
values (the ol' “Garbage in, garbage out” saying in Computer Science; oh, you
were expecting maybe “onetwo”? That's a different operator altogether).
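As a rough illustration (not Perl's exact coercion rules), the kind of
runtime type dispatch that + has to perform could look like this in Python:

```python
def perl_like_add(a, b):
    # At run time, inspect each value's type and coerce strings to
    # numbers; a string with no numeric value becomes 0, which is
    # (roughly) what Perl does.
    def to_num(v):
        if isinstance(v, (int, float)):
            return v
        try:
            return float(v)
        except (TypeError, ValueError):
            return 0   # "one" has no numeric value: garbage in, 0 out
    return to_num(a) + to_num(b)

assert perl_like_add(1, 2) == 3
assert perl_like_add("3", "4") == 7.0
assert perl_like_add("one", "two") == 0   # not "onetwo"; that's Perl's "." operator
```

All of that inspection and conversion happens on every single addition, which
is the runtime overhead the paragraph above is talking about.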

But even if we just stick with strings (and HTML processing is a lot of
string handling) we're not really out of the woods. Even if we were to use a
computer language with stronger typing (say, Ada) that wouldn't even help us,
because the type systems of any language today just aren't up to it.

For instance, given a simple HTML form that accepts input, we can feed it the
following:

> “This—is–a★test”
>

Now, when the data is submitted, the browser will encode the data in yet
another encoding scheme (application/x-www-form-urlencoded [6] to be
precise), which can contain HTML encoded entities (see above) in any number
of character sets (although, unless otherwise stated, in the same character
set the browser thinks the page that has the form uses). Using the latest
version of Firefox (1.5 as of this writing), if instructed to send the data
using charset US-ASCII the data I get is:

> %26%238220%3BThis%26%238212%3Bis%26%238211%3Ba%26%239733%3Btest%26%238221%3B
>

Which, when decoded, is:

> “This—is–a★test”
>

But, if the data is sent as charset ISO-8859-1:

> %93This%97is%96a%26%239733%3Btest%94
>

Which, when decoded, is:

> “This—is–a★test”
>

The quotes and dashes (the “mdash” and “ndash” respectively) are single
characters but the star is still sent as an HTML entity. Yet if the browser
is told to send the data as UTF (Unicode Transformation Format)-8:

> %E2%80%9CThis%E2%80%94is%E2%80%93a%E2%98%85test%E2%80%9D
>

Which, when decoded, is:

> “This—is–a★test”
>
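
For what it's worth, all three submissions above can be decoded mechanically;
here is a Python sketch (one wrinkle: browsers in practice treat ISO-8859-1
as windows-1252, which is where the %93/%94 quote bytes come from):

```python
import html
from urllib.parse import unquote, unquote_to_bytes

expected = "\u201cThis\u2014is\u2013a\u2605test\u201d"   # “This—is–a★test”

# US-ASCII submission: everything non-ASCII arrives as HTML numeric entities.
ascii_form = "%26%238220%3BThis%26%238212%3Bis%26%238211%3Ba%26%239733%3Btest%26%238221%3B"
assert html.unescape(unquote(ascii_form)) == expected

# UTF-8 submission: plain %-escaped UTF-8 bytes, no entities needed.
utf8_form = "%E2%80%9CThis%E2%80%94is%E2%80%93a%E2%98%85test%E2%80%9D"
assert unquote(utf8_form, encoding="utf-8") == expected

# "ISO-8859-1" submission: windows-1252 bytes for the quotes and dashes,
# plus an entity for the star that the charset cannot encode at all.
latin1_form = "%93This%97is%96a%26%239733%3Btest%94"
decoded = unquote_to_bytes(latin1_form).decode("windows-1252")
assert html.unescape(decoded) == expected
```

Note the layering: %-decoding first, then charset decoding, then entity
resolution—three separate encodings stacked on one form field.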

So, what does all this have to do with variable types and the weakness of
typechecking even in a language like Ada?

Well, if we could declare our variables with not only the type (in this case,
“string”) but with additional clarifications on said type (say, “charset
ISO-8859-5 with HTML entities” or “charset UTF-8 encoded as
application/x-www-form-urlencoded”) it would clarify the data flow through a
lot of web based software and prevent stupid mistakes [7] (and yes, it would
piss off a lot of muddle-headed programmers, but this would at least force
them to think for once).
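As a sketch of the idea (the class names and the refuse-to-mix rule here are
inventions for illustration, not any real library):

```python
class TaggedStr(str):
    """A string that knows what charset/encoding convention it carries."""
    tag = "unknown"

    def __add__(self, other):
        # Refuse to silently combine strings carrying different taggings,
        # the kind of mix-up behind the Planet RDF bug.
        if isinstance(other, TaggedStr) and other.tag != self.tag:
            raise TypeError(f"cannot mix {self.tag} with {other.tag}")
        return type(self)(str(self) + str(other))

class Utf8Str(TaggedStr):
    tag = "utf-8"

class Latin1HtmlStr(TaggedStr):
    tag = "iso-8859-1 with HTML entities"

a = Utf8Str("caf\u00e9 ")
b = Utf8Str("au lait")
assert a + b == "caf\u00e9 au lait"

try:
    a + Latin1HtmlStr("caf&eacute;")
except TypeError as e:
    print(e)   # cannot mix utf-8 with iso-8859-1 with HTML entities
```

A real solution would track the tag through every conversion function, but
even this crude version turns a silent data corruption into a loud error at
the point of the mistake.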

Update a few moments after posting

All this talk about encoding issues, and I still blew it and had to manually
fix some encoding issues on this page.

Sigh.


[1] http://www.intertwingly.net/blog/2004/10/20/Attractive-
[2] http://en.wikipedia.org/wiki/Unicode
[3] http://www.gnu.org/software/libiconv/
[4] http://en.wikipedia.org/wiki/Iso-8859
[5] gopher://gopher.conman.org/0Phlog:2006/01/26.2
[6] http://www.w3.org/TR/html4/interact/forms.html#form-content-
[7] http://intertwingly.net/blog/2006/07/20/Stop-That/

Email author at [email protected]