* * * * *

                      Code and Data, Theory and Practice

> From: Steve Crane <XXXXXXXXXXXXXXXXXXXXX>
> To: Sean Conner <[email protected]>
> Subject: @siwisdom twitter feed
> Date: Sat, 1 Jan 2011 15:46:14 +0200
>
> Hi Sean,
>
> Are you aware that the quotation marks in the @siwisdom
> <http://twitter.com/siwisdom [1]> tweets display as &ldquo; and &rdquo; in
> clients like TweetDeck? Perhaps you should switch to using regular ASCII
> (American Standard Code for Information Interchange) double quotes.
>
> Regards and Happy New Year.
>

Yes, I'm aware. They show up on the main Twitter page [2] as well, and there
isn't much I can do about it, other than sticking exclusively with ASCII and
forgoing the nice typographic characters. It appears to be related to this
rabbit hole [3], only in a way that's completely different.

What's going on is explained in this post:

> We have to encode HTML (HyperText Markup Language) entities to prevent XSS
> (Cross Site Scripting) attacks. Sorry about the lost characters.
>

“counting messages: characters vs. bytes, HTML entities - Twitter Development
Talk | Google Groups [4]”

And XSS (Cross Site Scripting) [5] has nothing to do with attacking one
website from another, but everything to do with the proliferation of
character encoding schemes and the desire to fling bits of executable code
(also known as ``Javascript'') along with our bits of non-executable data
(aka ``HTML''). The problem is keeping the bits of executable code (aka
``Javascript'') from showing up where they aren't expected.

But in the case of Twitter, I don't think they actually understand how their
own stack works. Or they just took the easy way out, so that any of the
``special'' characters in HTML, like ``&'', ``<'' and ``>'', are automatically
converted to their HTML entity equivalents ``&amp;'', ``&lt;'' and ``&gt;''.
That turns the ``&'' in my ``&ldquo;'' into ``&amp;'', and the browser then
displays the entity name instead of the quotation mark. Otherwise, to
sanitize the input, they would need to do the following:

 1. get the raw input from the HTML form
 2. convert the input from the transport encoding (usually URL encoding [6]
    but it could be something else, depending upon the form)
 3. possibly convert the string from whatever character set the browser
    sent (say, WINDOWS-1251, because Microsoft is like that) into one the
    program can work with more easily (say, UTF-8)
 4. if HTML is allowed, sanitize the HTML by
      1. removing unsupported or dangerous tags, like <SCRIPT>, <EMBED> and
         <OBJECT>
      2. removing dangerous attributes like STYLE or ONMOUSEOVER
      3. checking remaining attributes (like HREF) for dangerous content
         (like javascript:alert('1 h@v3 h@cxx0r3d ur c0mput3r!!!!!!!11111'))

 5. escape the data to work with it properly (or else face the wrath of
    Little Bobby Tables' mother [7]).
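
A few of the steps above can be sketched in Python. This is only a rough
illustration under stated assumptions, not how Twitter (or anyone else)
actually does it: the function names, the SAFE_SCHEMES allow-list and the
``posts'' table are all hypothetical, and the tag and attribute stripping in
step 4 is left out entirely because it needs a real HTML parser.

```python
import html
import sqlite3
from urllib.parse import unquote_to_bytes, urlsplit

# Hypothetical allow-list ("" means a relative URL); a real policy needs more care.
SAFE_SCHEMES = {"", "http", "https", "mailto"}


def decode_form_value(raw: str, charset: str = "utf-8") -> str:
    """Steps 2 and 3: undo URL encoding, then decode the bytes as `charset`."""
    data = unquote_to_bytes(raw.replace("+", " "))  # '+' means space in form data
    return data.decode(charset)


def href_is_safe(href: str) -> bool:
    """One small part of step 4: reject schemes such as javascript: in an HREF."""
    return urlsplit(href.strip()).scheme.lower() in SAFE_SCHEMES


def escape_for_html(text: str) -> str:
    """Step 5, HTML context: escape &, <, > and both quote characters."""
    return html.escape(text, quote=True)


def store(db: sqlite3.Connection, text: str) -> None:
    """Step 5, SQL context: a placeholder keeps the data out of the SQL text."""
    db.execute("INSERT INTO posts (body) VALUES (?)", (text,))


if __name__ == "__main__":
    # Curly quotes sent as WINDOWS-1251 bytes 0x93 and 0x94, URL encoded.
    text = decode_form_value("%93%3Cb%3Ehi%3C%2Fb%3E%94", "windows-1251")
    safe = escape_for_html(text)
    # Keep the printed result in US-ASCII by using numeric character references.
    print(safe.encode("ascii", "xmlcharrefreplace").decode("ascii"))
```

Note that the escaping happens last and depends on where the data is going:
the same string is escaped one way for HTML and handed to the database
through a placeholder, never by pasting it into the SQL text.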

Fail to do any of those steps and, well … “1 h@v3 h@cxx0r3d ur
c0mput3r!!!!!!!11111”. And besides, I'm probably missing some sanitizing step
somewhere.

Now, I could convert the input I give to Twitter to UTF-8 and avoid HTML
entities entirely, but then I would have to convert my blog engine to UTF-8
(because I display my Twitter feed in the sidebar) and while it may work just
fine with UTF-8, I haven't tested it with UTF-8 data. And I would prefer to
keep it in US (United States)-ASCII to avoid any nasty surprises.

Besides, I shouldn't have to do this, because that's why HTML entities were
designed in the first place—as a way of presenting characters when a
character set doesn't support those characters!
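
To make that concrete, here is a small Python sketch (my assumption of the
mechanics, using only the standard html module; the sample tweet text is made
up) showing what an entity is supposed to become, what blanket escaping does
to it instead, and how non-ASCII text can be squeezed back into ASCII with
numeric character references.

```python
import html

# An ASCII-only tweet that relies on entities for the curly quotes.
tweet = "&ldquo;Simplify.&rdquo;"

# What a browser does with the entities: resolve them to the real characters.
rendered = html.unescape(tweet)

# What blanket escaping does first: the '&' itself becomes '&amp;' ...
escaped = html.escape(tweet, quote=False)

# ... so the browser ends up showing the entity name, not the character.
displayed = html.unescape(escaped)

# Going the other way: non-ASCII text expressed in pure ASCII.
ascii_safe = rendered.encode("ascii", "xmlcharrefreplace")

print(escaped)
print(displayed)
print(ascii_safe.decode("ascii"))
```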

Hey—wait a second … what's this river doing here [8]?

[1] http://twitter.com/siwisdom
[2] http://twitter.com/siwisdom
[3] gopher://gopher.conman.org/0Phlog:2006/08/08.1
[4] http://groups.google.com/group/twitter-development-
[5] http://en.wikipedia.org/wiki/Cross-site_scripting
[6] http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
[7] http://xkcd.com/327/
[8] http://answers.yahoo.com/question/index?qid=20070422170939AAVdNV5

Email author at [email protected]