Subj : Re: not all is lost but far too much for far too long
To : Ozz Nixon
From : Rob Swindell
Date : Wed Jul 03 2019 02:44 pm
Re: Re: not all is lost but far too much for far too long
By: Ozz Nixon to Maurice Kinal on Fri Jun 28 2019 09:23 pm
> On 2019-06-28 02:01:09 +0000, Maurice Kinal -> Torsten Bamberg said:
>
> FTN Header versus actual message body conveying Unicode.
>
> When I telnet to a SQL server that speaks Unicode only, it always
> returns the following characters (pascal): #239#187#191
Using telnet to connect to services that don't speak Telnet is generally a bad idea. Use netcat (nc) instead.
> When I telnet to a web page that speaks Unicode, it too returns
> #239#187#191 plus the <!doctype html> etc.
>
>
> So... would it not stand true that systems that are posting UTF8 do the
> same introduction on the message body? Then authors *know* it
> potentially has Unicode and leave it damn well alone, and also parse it
> based upon UTF8 instead of 8bit char...
It's an idea. But that's not how *other* charsets/encodings work and certainly not how MIME-encoded messages (e.g. email) work - header fields are used instead.
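For illustration, here is a minimal Python sketch of how a MIME reader learns the charset from a header field rather than from the body. The sample message, addresses and variable names are made up for the example; only the standard-library "email" package is used.

```python
from email import message_from_bytes

# Hypothetical sample message: the charset is declared in the
# Content-Type header; the body carries no marker of its own.
raw = (
    b"From: sender@example.org\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Gr\xc3\xbc\xc3\x9fe\r\n"
)

msg = message_from_bytes(raw)

# The reader asks the header which charset applies...
charset = msg.get_content_charset()

# ...and only then decodes the raw body bytes accordingly.
# Falling back to us-ascii when no charset is declared is the MIME default.
body = msg.get_payload(decode=True).decode(charset or "us-ascii")
```

The body bytes are left untouched; everything the reader needs to know is in the header.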
> This is how I am coding things here, just based upon NexusSQL,
> PremierSQL, MS SQL, Apache and Nexus Web Service. I do not have access
> to my Oracle box nor the MySQL 5 server to see if they do the same
> during the initial connection negotiation(s).
>
> A quick google: It's the utf8 byte order mark.
That's actually a misnomer (there is no "byte order" in UTF-8). The actual Unicode code point is Zero Width No-Break Space:
https://www.compart.com/en/unicode/U+FEFF
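A quick way to check this, sketched in Python (variable names are just for the example): the three bytes #239#187#191 are simply the UTF-8 encoding of U+FEFF, and only in UTF-16 does that code point's byte sequence depend on byte order.

```python
import codecs
import unicodedata

# The three bytes from the Pascal notation #239#187#191
bom = bytes([239, 187, 191])
is_utf8_bom = (bom == codecs.BOM_UTF8)

# Decoded as UTF-8, they are a single code point
ch = bom.decode("utf-8")
print(f"U+{ord(ch):04X}", unicodedata.name(ch))  # U+FEFF ZERO WIDTH NO-BREAK SPACE

# In UTF-16 the same code point serializes differently per byte order,
# which is where the "byte order mark" name comes from
utf16_le = "\ufeff".encode("utf-16-le")  # b'\xff\xfe'
utf16_be = "\ufeff".encode("utf-16-be")  # b'\xfe\xff'

# A plain UTF-8 decoder keeps the character in the text;
# Python's "utf-8-sig" codec strips it if present
data = bom + "<!doctype html>".encode("utf-8")
text_plain = data.decode("utf-8")
text_sig = data.decode("utf-8-sig")
```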
> Some editors save the
> BOM inside the file (in order to be used as a header) which regularly
> causes confusion because it is optional.
>
> So, if we wanted to help enforce at a reader (or even tosser level) how
> to handle, I would offer this up as a required BOM to the message body
> that is UTF8.
And why is that better than a header field ("control paragraph" as defined in FTS-5003) which indicates UTF-8?
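As a rough sketch of the header-field approach, the Python below looks for a CHRS control paragraph (a line starting with Ctrl-A, in the "CHRS: <identifier> <level>" form described in FTS-5003) and decodes the body accordingly. The function names, the CP437 fallback and the sample message are assumptions made for the example, not anything mandated by the spec or taken from a real tosser.

```python
def detect_charset(raw: bytes, default: str = "cp437") -> str:
    """Return the charset named in a CHRS control paragraph, if any.

    The cp437 default is an assumption for this sketch.
    """
    for line in raw.split(b"\r"):
        if line.startswith(b"\x01CHRS:"):
            parts = line[len(b"\x01CHRS:"):].split()
            if parts:
                # First token is the identifier, second is the level
                return parts[0].decode("ascii", "replace")
    return default


def decode_body(raw: bytes) -> str:
    """Decode the visible text, leaving control paragraphs out."""
    charset = detect_charset(raw)
    lines = [l for l in raw.split(b"\r") if not l.startswith(b"\x01")]
    return b"\n".join(lines).decode(charset, errors="replace")


# Hypothetical message body: one control paragraph, one line of text
sample = b"\x01CHRS: UTF-8 4\rGr\xc3\xbc\xc3\x9fe\r"
charset = detect_charset(sample)
text = decode_body(sample)
```

Nothing is added to the visible text, and the same mechanism covers every other charset as well, not just UTF-8.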