Building character

* * * * *

I've been working on mod_blog [1] for the past few days. Bob, who runs the
Friday D&D (Dungeons and Dragons) game, has a site that used to use Blogger
[2] but due to reliability problems (as well as certain security issues
related to FTP (File Transfer Protocol)) I switched them over to mod_blog.

It's installed, but the input portion is … shall we say … less than user
friendly. Frankly, it suits my needs, and it suited Mark's needs (back
when he had a blog) and a variation of it suits Spring [3], but that's
because the three of us know HTML (HyperText Markup Language) and aren't
intimidated by it. But most of the other players don't know how, or don't
care, to type their entries in HTML, so I've been adding code to allow them to
input plain text and have it converted to HTML.
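
A minimal sketch of that kind of plain-text-to-HTML conversion, in Python. This is not mod_blog's code: the function name text_to_html and the rule that a blank line separates paragraphs are assumptions made for illustration.

```python
import html
import re


def text_to_html(text: str) -> str:
    """Turn plain text into HTML paragraphs (hypothetical helper).

    Assumption: one or more blank lines separate paragraphs, and line
    breaks inside a paragraph are just word wrapping.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n".join(
        # Collapse internal whitespace, then escape &, <, > and quotes.
        "<p>" + html.escape(" ".join(p.split())) + "</p>"
        for p in paragraphs
        if p.strip()
    )


print(text_to_html("Hello <world>\n\nSecond & last\nline"))
```

Escaping happens after the paragraph split, so anything a player types that looks like markup ends up displayed as text rather than interpreted as HTML.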

It seems like it would be straightforward, but it isn't. Like I said, I'm
comfortable with HTML and the input side of things reflected that comfort.
The first problem was adding a text-based (for lack of a better term) plug-in.
The second (and more annoying) problem is the utter horror that is Buffers.

In my CGI (Common Gateway Interface) library (which I've been developing for
easily seven years) I have a concept of a “buffer,” which is 1) horribly
misnamed—it's not so much a “buffer” as it is a “stream”—either a stream of
input or a stream of output and 2) is so buggy as to be nearly useless. The
problem stems from my attempts to support both “read” and “write” methods on
a given buffer (did I mention it's really a stream and not a buffer?) and
that's why they're so buggy as to be nearly useless. There's only this vague
notion of reading and writing and detecting the end of the stream. I've been
having to go in kludging up fixes for the various buffer modules I have,
allowing me to write to a buffer, then back up and start reading from said
buffer. And 3) what the XXXX was I thinking when I created LineBuffers? They
shouldn't exist, period. I think.

Anyway, it's not pretty, and the fixes are rather ad-hoc and I have to “know”
which type of buffer I'm dealing with and what I can and can't do with it.
Which defeats the purpose of abstracting things behind a “buffer” in the
first place.

And then there's the third problem. This is a doozy and it affects nearly
every other blogging software out there as well. It's the “copy-n-paste [4]”
problem (the linked article explains why a certain page appears corrupted,
but it ultimately stemmed from a copy-n-paste operation). What exactly is the
“copy-n-paste” problem?

It stems from character encodings [5] and the lack of character encoding
information when you copy-n-paste text between applications that have
different ideas of what character encoding to expect. For instance, view a
Microsoft Word document using the character set WINDOWS-1252, and copy that
into a website that might be expecting ISO-8859-1, UTF-8 or even US-ASCII
(like who ever would use US-ASCII? Sheesh!). You know what you can expect?

Γªρßåγ€

It really bugs me when I see stuff like “sensorâ€and” on a page.
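
To see where a string like “sensorâ€and” can come from, here is a minimal Python sketch. It assumes the text contained a right curly quote that was encoded as UTF-8 and then read as WINDOWS-1252; the actual history of any given page may differ, and the variable names are illustrative.

```python
# What was typed: a word, a right curly quote (U+201D), another word.
original = "sensor\u201dand"

# The sending side serialises the text as UTF-8: the quote becomes the
# three bytes E2 80 9D.
raw = original.encode("utf-8")

# The receiving side assumes WINDOWS-1252 instead. E2 maps to "â" and 80
# maps to "€". Byte 9D has no WINDOWS-1252 meaning, so it is dropped
# here (an illustrative choice; real software may show a box or nothing).
garbled = raw.decode("cp1252", errors="ignore")

print(garbled)
```

One character went in and two unrelated ones came out, and nothing in the bytes themselves says which reading was intended.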

I could try to use the ACCEPT-CHARSET [6] attribute of the <FORM> [7] tag in
HTML 4.01 (HyperText Markup Language v4.01) [8], but really, that's just a
“hint” to the browser on what to send—it doesn't actually have to pay
attention to that at all. To get around that little problem I'm playing
around with GNU's (GNU's Not Unix) [9] libiconv [10] (a large library to
convert from one character encoding scheme to another) in an attempt to
prevent this problem. You input the text, then I scan it, attempting to
classify which character encoding scheme is in use (right now it only
detects US-ASCII, ISO-8859-1, WINDOWS-1252 and UTF-8), then convert whatever
I find into Unicode [11] (specifically UCS-4), then back to US-ASCII, using
HTML numeric entities for anything outside of US-ASCII.
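
The classify-then-convert pipeline can be sketched roughly as follows. This is a Python stand-in using the built-in codecs rather than libiconv, and the detection heuristics (and the names classify and to_ascii_html) are assumptions for illustration, not the rules mod_blog actually uses.

```python
def classify(data: bytes) -> str:
    """Guess which of four encodings the bytes are in (heuristic sketch)."""
    if all(b < 0x80 for b in data):
        return "us-ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Assumption: bytes 0x80-0x9F are control codes in ISO-8859-1 and
    # rarely appear in real text, while WINDOWS-1252 puts curly quotes
    # and dashes there, so their presence suggests WINDOWS-1252.
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    return "iso-8859-1"


def to_ascii_html(data: bytes) -> str:
    """Decode to Unicode, then emit US-ASCII with numeric entities."""
    text = data.decode(classify(data), errors="replace")
    # xmlcharrefreplace turns each non-ASCII character into &#NNNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html(b"\x93hi\x94"))                  # WINDOWS-1252 curly quotes
print(to_ascii_html("caf\u00e9".encode("utf-8")))   # UTF-8 e-acute
```

Because the result is pure US-ASCII, it displays the same no matter which character encoding the page is later served with. The guess can still be wrong, since some byte sequences are valid in more than one of these encodings.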

All that, just to support easy text entry into a blog, without having it look
like crap in case someone decides to copy-n-paste from some other
application.

And that's why adding a text plug-in isn't that straightforward.

Update early Sunday morning, December 5th, 2004

Some more details [12].

[1] https://boston.conman.org/about/technical.html
[2] http://www.blogger.com/
[3] http://www.springdew.com/
[4] http://www.intertwingly.net/blog/2004/09/23/Copy-and-Paste
[5] http://gedcom-parse.sourceforge.net/doc/encoding.html#The_character_encoding_problem
[6] http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset
[7] http://www.w3.org/TR/html4/interact/forms.html#edef-FORM
[8] http://www.w3.org/TR/html4/
[9] http://www.gnu.org/
[10] http://www.gnu.org/software/libiconv/
[11] http://www.unicode.com/
[12] gopher://gopher.conman.org/0Phlog:2004/12/05.1
