A System-Independent Way of Transferring Special Characters
Draft III
Tomas Gradin,
2:200/108@fidonet
Status of this document:
This FSC suggests a proposed protocol for the FidoNet(r) community,
and requests discussion and suggestions for improvements.
Distribution of this document is unlimited.
Fido and FidoNet are registered marks of Tom Jennings and Fido
Software.
Contents
Introduction
How does it work?
Advantages and problems
Technical description
The fallback method of displaying an extra character
How to use I51 in mail
Acknowledgements
Appendix A - The Latin-1 standard
Appendix B - A list of combined characters
Appendix C - Sample code
Appendix D - Comments on the base set
Appendix E - Comments on the escape character
Appendix F - During the change to I51, co-existence with other methods
Appendix G - Comments to the author
Introduction
This document proposes a method for transferring characters, including
accented and otherwise special ones, in ordinary FidoNet messages. It
is the result of some of the thoughts put forward in the discussion
of foreign characters at TechCon I, as well as extensive discussions
in the Swedish equivalent of NET_DEV.
The proposed standard will allow for the transmission of all variants
of letters in the latin alphabet, as well as several special
characters commonly used. At the same time the standard makes
inclusion of additional characters painless. The standard implements a
way of automatically displaying these characters as closely as
possible on systems that don't yet support them, using the built-in
fallback method described in this document.
One main advantage of this standard is that even though it uses a
well-spread character set as its base, it is not limited to that set.
It is therefore possible to include as many characters as needed. The
only restriction is that the additional characters implemented should
be based on the latin alphabet.
How does it work?
The base character set used in this standard is ISO 8859-1, commonly
known as 'ISO Latin-1'. All characters present in that set are used
as is. The advantages of this character set are well known, and will
not be discussed in this document. However, the most obvious advantage
of Latin-1 is that characters can be easily case shifted.
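The case-shifting property can be illustrated with a short sketch. The
function name latin1_toupper is hypothetical and not part of this
proposal; the sketch only assumes the ISO 8859-1 layout, in which the
small letters sit 0x20 above their capitals.

```c
/* Hypothetical helper (not defined by this proposal): upper-case one
   ISO 8859-1 code by clearing the 0x20 offset where a capital exists. */
unsigned char latin1_toupper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 0x20);
    /* 0xe0-0xfe are accented small letters, except 0xf7 (division sign) */
    if (c >= 0xe0 && c <= 0xfe && c != 0xf7)
        return (unsigned char)(c - 0x20);
    /* 0xdf (sharp s) and 0xff (y with diaeresis) have no capital in
       Latin-1, so they are returned unchanged */
    return c;
}
```

For instance, 0xe5 (a with ring above) becomes 0xc5 (A with ring above).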
All accented and special characters not present in the base set are
considered 'extra' characters, and are obtained by using a form of
character combination. To let message editors etc. know when to
combine characters, and when not to, all combination sequences are
preceded by a special 'escape' character. This escape character is
0x02, ie. ^B (STX).
Advantages and problems
A system that strips eight-bit characters before displaying them poses
no problem: such a system does not support this proposed standard at
the moment, and once it does (which I hope most systems eventually
will), the hi-bit characters will be treated as they should.
A system that maps eight-bit characters to other characters will, as
long as it does not support this method, display the extra characters
transmitted with it as strange characters.
* The method will never break anything fully FTS-compliant.
* It will give strange characters on systems that don't support this
method, but that is not worse than the current situation.
* It will give systems supporting this method the ability to
transfer national, accented and special characters to systems on
other computer platforms (ie. the characters look the same on a PC
and a Macintosh).
* Systems that support this method, but are implemented on computers
that cannot display certain characters, will automatically show the
closest character the computer can provide, if the character in
question is one of the extended ones.
For the 96 hi-bit characters developers hopefully will include the
needed translation tables in their programs. Such tables can be
provided upon request.
* Conferences on FidoNet in English will be minimally affected,
since the English language seldom uses other characters than those
in pure ASCII. The possibility to use other characters will
however be present, if needed. Those that frequently use special
characters will benefit a lot, without causing trouble for those
that don't.
* In fact, the minimum requirement to be I51-compatible is that your
system can handle Latin-1 codes, plus the I51 fallback. When the
base set of I51 (ie. Latin-1) is implemented, you can obtain full
I51 compliance by just adding I51 fallback. After that, you can
choose which ones of the I51 extra characters to implement, if any
at all. The automatic fall-back system takes care of the rest for
you! The additional work to get a Latin-1 compatible system to
fully support I51 is indeed negligible.
Technical description
The format of a representation of an extra character is as follows:
<escape character><modifier><base character>
I will be using 0x02 as escape character in the examples below. It
will however be represented with a '.', since it is non-printable.
Examples:
02 2d 7e (.-~) will display as an about equals sign ('≈').
02 24 50 (.$P) is used to represent a peseta symbol ('₧').
02 02 represents a single 02, if that code ever is needed in a
message. I propose that the use of 0x02 in messages for other reasons
than in this method of character transmission should be prohibited.
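The format above could be produced by something like the following
sketch. The names I51_ESC, i51_put_extra and i51_put_literal_esc are
hypothetical, chosen for illustration only; the caller is assumed to
supply a buffer that is large enough.

```c
#include <stddef.h>

#define I51_ESC 0x02   /* escape character (STX) */

/* Write <escape><modifier><base> to out; returns the bytes written. */
size_t i51_put_extra(unsigned char *out, unsigned char modifier,
                     unsigned char base)
{
    out[0] = I51_ESC;
    out[1] = modifier;
    out[2] = base;
    return 3;
}

/* Write a single literal 0x02 as the doubled escape 02 02. */
size_t i51_put_literal_esc(unsigned char *out)
{
    out[0] = I51_ESC;
    out[1] = I51_ESC;
    return 2;
}
```

Calling i51_put_extra(buf, '$', 'P') stores the three bytes 02 24 50,
the peseta sequence listed in Appendix B.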
The fallback method of displaying an extra character
If the system where you are implementing this method of special
character transmission doesn't support a certain extra character, the
following procedure should be used. To display a special character as
closely as possible, just skip the modifier! Ie. the sequence 02
67 61 (.ga) is displayed as 'a', 02 5e 73 as 's'. It is therefore
preferred that the FTSC, in assigning sequences to any additional
characters, take this into account.
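The fallback rule can be sketched as below. This is only an
illustration: the name i51_fallback is hypothetical, and dropping a
sequence that is cut off at the end of the text is an assumption, since
this document does not say how to treat truncated sequences.

```c
#include <stddef.h>

#define I51_ESC 0x02   /* escape character (STX) */

/* Copy src to dst, replacing every escape sequence with its base
   character. dst must hold at least strlen(src) + 1 bytes.
   Returns the number of characters written, excluding the NUL. */
size_t i51_fallback(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        if (*src != I51_ESC) {
            dst[n++] = *src++;            /* ordinary character */
        } else if (src[1] == I51_ESC) {
            dst[n++] = I51_ESC;           /* 02 02 is a literal 02 */
            src += 2;
        } else if (src[1] == '\0' || src[2] == '\0') {
            break;                        /* truncated: drop (assumption) */
        } else {
            dst[n++] = src[2];            /* skip escape and modifier */
            src += 3;
        }
    }
    dst[n] = '\0';
    return n;
}
```

With this sketch, the bytes 02 67 61 followed by 'b' come out as "ab".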
How to use I51 in mail
In transit mail in I51 format _must_ be passed on un-altered, per
FTS-0001. However, it is possible to store messages locally in any
desired format. As long as the BBS programs don't have options for
users to change their character setup and representation, this may be
desirable.
The I51 method of representing special characters is also allowed in
headers of messages, provided that account is taken of the fact that
the extra characters occupy more bytes than the 'normal' characters.
Since the character codes 0x80 - 0x9f are undefined in ISO 8859-1,
their presence in an I51 message is prohibited, unless they are
defined in an FTS document (eg. 'soft CR').
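Both rules above lend themselves to simple checks, sketched here. The
function names are hypothetical. Treating 0x8d (the FTS-0001 soft CR)
as the only permitted code in the 0x80 - 0x9f range is an assumption
made for this example; a real implementation would follow whatever the
FTS documents define.

```c
#include <stddef.h>

#define I51_ESC     0x02   /* escape character (STX) */
#define FTS_SOFT_CR 0x8d   /* assumed to be the only allowed C1 code */

/* Count the characters a reader sees, as opposed to the bytes stored.
   Useful when filling fixed-size header fields. */
size_t i51_display_len(const unsigned char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        if (s[0] == I51_ESC && s[1] == I51_ESC)
            s += 2;                       /* 02 02: one literal 02 */
        else if (s[0] == I51_ESC && s[1] != '\0' && s[2] != '\0')
            s += 3;                       /* escape, modifier, base */
        else
            s += 1;                       /* ordinary character */
        n++;
    }
    return n;
}

/* Return 1 if s holds no prohibited 0x80 - 0x9f code, otherwise 0. */
int i51_codes_ok(const unsigned char *s)
{
    for (; *s != '\0'; s++)
        if (*s >= 0x80 && *s <= 0x9f && *s != FTS_SOFT_CR)
            return 0;
    return 1;
}
```

A name containing one extra character thus needs three bytes of header
space for a single displayed character.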
Acknowledgements
I would like to thank those present at TechCon I (in Antwerp, Belgium,
July 1990) during the discussion of foreign characters for the
fundamental ideas that led to this proposal.
I would also like to thank all those that have made comments on this
document, both in netmail and echomail.
Appendix A - The Latin-1 standard
The following list comprises the hi-bit characters present in the
Latin-1 standard, which is used as the base set of I51.
hex dec  chr  description                        PC   chr  (codepage) *
a0  160       non-breaking space                 ff        (437)
a1  161  ¡    inverted exclamation mark          ad   ¡    (437)
a2  162  ¢    cent sign                          bd   ¢    (850)
a3  163  £    pound sign                         9c   £    (437)
a4  164  ¤    currency sign                      cf   ¤    (850)
a5  165  ¥    yen sign                           be   ¥    (850)
a6  166  ¦    broken bar                         dd   ¦    (850)
a7  167  §    paragraph sign                     f5   §    (850) *
a8  168  ¨    diaeresis                          f9   ¨    (850)
a9  169  ©    copyright sign                     b8   ©    (850)
aa  170  ª    feminine ordinal indicator         a6   ª    (437)
ab  171  «    left angle quotation mark          ae   «    (437)
ac  172  ¬    not sign                           aa   ¬    (437)
ad  173       soft hyphen                        f0        (850)
ae  174  ®    registered trade mark sign         a9   ®    (850)
af  175  ¯    macron                             ee   ¯    (850)
b0  176  °    degree sign                        f8   °    (437)
b1  177  ±    plus-minus sign                    f1   ±    (437)
b2  178  ²    superscript two                    fd   ²    (437)
b3  179  ³    superscript three                  fc   ³    (850)
b4  180  ´    acute accent                       ef   ´    (850)
b5  181  µ    small greek letter mu              e6   µ    (437)
b6  182  ¶    pilcrow sign                       f4   ¶    (850) *
b7  183  ·    middle dot                         fa   ·    (437)
b8  184  ¸    cedilla                            f7   ¸    (850)
b9  185  ¹    superscript one                    fb   ¹    (850)
ba  186  º    masculine ordinal indicator        a7   º    (437)
bb  187  »    right angle quotation mark         af   »    (437)
bc  188  ¼    vulgar fraction one quarter        ac   ¼    (437)
bd  189  ½    vulgar fraction one half           ab   ½    (437)
be  190  ¾    vulgar fraction three quarters     f3   ¾    (850)
bf  191  ¿    inverted question mark             a8   ¿    (437)
c0  192  À    A with grave accent                b7   À    (850)
c1  193  Á    A with acute accent                b5   Á    (850)
c2  194  Â    A with circumflex accent           b6   Â    (850)
c3  195  Ã    A with tilde                       c7   Ã    (850)
c4  196  Ä    capital letter A with diaeresis    8e   Ä    (437)
c5  197  Å    capital letter A with ring above   8f   Å    (437)
c6  198  Æ    ligature AE                        92   Æ    (437)
c7  199  Ç    C with cedilla                     80   Ç    (437)
c8  200  È    E with grave accent                d4   È    (850)
c9  201  É    E with acute accent                90   É    (437)
ca  202  Ê    E with circumflex accent           d2   Ê    (850)
cb  203  Ë    E with diaeresis                   d3   Ë    (850)
cc  204  Ì    I with grave accent                de   Ì    (850)
cd  205  Í    I with acute accent                d6   Í    (850)
ce  206  Î    I with circumflex accent           d7   Î    (850)
cf  207  Ï    I with diaeresis                   d8   Ï    (850)
d0  208  Ð    Icelandic Eth                      d1   Ð    (850)
d1  209  Ñ    N with tilde                       a5   Ñ    (437)
d2  210  Ò    O with grave accent                e3   Ò    (850)
d3  211  Ó    O with acute accent                e0   Ó    (850)
d4  212  Ô    O with circumflex accent           e2   Ô    (850)
d5  213  Õ    O with tilde                       e5   Õ    (850)
d6  214  Ö    O with diaeresis                   99   Ö    (437)
d7  215  ×    multiplication sign                9e   ×    (850)
d8  216  Ø    slash O                            9d   Ø    (850)
d9  217  Ù    U with grave accent                eb   Ù    (850)
da  218  Ú    U with acute accent                e9   Ú    (850)
db  219  Û    U with circumflex accent           ea   Û    (850)
dc  220  Ü    U with diaeresis                   9a   Ü    (437)
dd  221  Ý    Y with acute accent                ed   Ý    (850)
de  222  Þ    capital Icelandic Thorn            e8   Þ    (850)
df  223  ß    small german letter sharp s        e1   ß    (437)
e0  224  à    a with grave accent                85   à    (437)
e1  225  á    a with acute accent                a0   á    (437)
e2  226  â    a with circumflex accent           83   â    (437)
e3  227  ã    a with tilde                       c6   ã    (850)
e4  228  ä    a with diaeresis                   84   ä    (437)
e5  229  å    a with ring above                  86   å    (437)
e6  230  æ    small ae-ligature                  91   æ    (437)
e7  231  ç    c with cedilla                     87   ç    (437)
e8  232  è    e with grave accent                8a   è    (437)
e9  233  é    e with acute accent                82   é    (437)
ea  234  ê    e with circumflex accent           88   ê    (437)
eb  235  ë    e with diaeresis                   89   ë    (437)
ec  236  ì    i with grave accent                8d   ì    (437)
ed  237  í    i with acute accent                a1   í    (437)
ee  238  î    i with circumflex                  8c   î    (437)
ef  239  ï    i with diaeresis                   8b   ï    (437)
f0  240  ð    small Icelandic Eth                d0   ð    (850)
f1  241  ñ    n with tilde                       a4   ñ    (437)
f2  242  ò    o with grave accent                95   ò    (437)
f3  243  ó    o with acute accent                a2   ó    (437)
f4  244  ô    o with circumflex accent           93   ô    (437)
f5  245  õ    o with tilde                       e4   õ    (850)
f6  246  ö    o with diaeresis                   94   ö    (437)
f7  247  ÷    division sign                      f6   ÷    (437)
f8  248  ø    small o slash                      9b   ø    (850)
f9  249  ù    u with grave accent                97   ù    (437)
fa  250  ú    u with acute accent                a3   ú    (437)
fb  251  û    u with circumflex accent           96   û    (437)
fc  252  ü    u with diaeresis                   81   ü    (437)
fd  253  ý    y with acute accent                ec   ý    (850)
fe  254  þ    small icelandic thorn              e7   þ    (850)
ff  255  ÿ    y with diaeresis                   98   ÿ    (437)
* The pilcrow and paragraph signs are also found in CP 437, at 0x14 and
0x15 respectively. All characters with CP listed as 437 have the same
codes in CP 850 - thus, viewing this list with CP set to 850 will give
all the right characters.
Appendix B - A list of combined characters
The following list contains the escaped representations of the
majority of the IBM PC's special and accented characters not present in
the base set, as well as some others. To standardize how a certain
additional character is to be represented the FTSC will publish a list
of such characters, similar to this one. The use of other combination
sequences than the ones approved by the FTSC is discouraged.
hex string bytes character description character (PC codepage)
02 20 30 . 0 superscript zero -
02 20 34 . 4 superscript four -
02 20 35 . 5 superscript five -
02 20 36 . 6 superscript six -
02 20 37 . 7 superscript seven -
02 20 38 . 8 superscript eight -
02 20 39 . 9 superscript nine -
02 2e 30 ..0 subscript zero -
02 20 69 . i dot-less i d5 ı (850)
02 20 49 . I I with dot -
02 20 6e . n superscript n fc ⁿ (437)
02 22 55 ."U U with double acute accent -
02 22 75 ."u u with double acute accent -
02 2e 31 ..1 subscript one -
02 2e 32 ..2 subscript two -
02 2e 33 ..3 subscript three -
02 2e 34 ..4 subscript four -
02 2e 35 ..5 subscript five -
02 2e 36 ..6 subscript six -
02 2e 37 ..7 subscript seven -
02 2e 38 ..8 subscript eight -
02 2e 39 ..9 subscript nine -
02 24 50 .$P peseta sign 9e ₧ (437)
02 24 66 .$f guilder sign 9f ƒ (437)
02 2c 41 .,A A with cedilla -
02 2c 45 .,E E with cedilla -
02 2c 53 .,S S with cedilla -
02 2c 61 .,a a with cedilla -
02 2c 65 .,e e with cedilla -
02 2c 73 .,s s with cedilla -
02 2d 3c .-< equal or less than f3 ≤ (437)
02 2d 3d .-= defined as f0 ≡ (437)
02 2d 3e .-> equal or greater than f2 ≥ (437)
02 2d 7e .-~ about equal f7 ≈ (437)
02 2d 43 .-C complement of -
02 2d 49 .-I part of lot ee ε (437)
02 2d 53 .-S Polish S with dash -
02 2d 5a .-Z Polish Z with dash -
02 2d 73 .-s Polish s with dash -
02 2d 7a .-z Polish z with dash -
02 2e 53 ..S Polish S with dot -
02 2e 5a ..Z Polish Z with dot -
02 2e 73 ..s Polish s with dot -
02 2e 7a ..z Polish z with dot -
02 2f 4c ./L Polish L slash -
02 2f 6c ./l Polish l slash -
02 5e 47 .^G G with inversed circ. accent -
02 5e 53 .^S S with inversed circ. accent -
02 5e 67 .^g g with inversed circ. accent -
02 5e 73 .^s s with inversed circ. accent -
02 67 47 .gG capital gamma e2 Γ (437)
02 67 61 .ga alpha e0 α (437)
02 74 6d .tm trade mark sign -
<end of list>
The number enclosed in brackets is the IBM PC codepage number. A
hyphen denotes a character that does not exist on the IBM PC.
Appendix C - Sample code
Here is some sample C code. The first function combines sequences into
their proper representation in IBM PC codepage 437, the second does
the reverse, ie. converts characters not found in the I51 base set to
their combination sequences.
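#include <string.h>

#define I51_ESC '\x02'   /* the I51 escape character (STX) */

/* Combine I51 sequences in s, in place, into IBM PC codepage 437.
   Sequences not listed here fall back to their base character. */
void cmbch(char *s)
{
    size_t z, x, sl;

    sl = strlen(s);
    for (z = 0, x = 0; x <= sl; z++, x++) {    /* <= also copies the NUL */
        if (s[x] != I51_ESC) {
            s[z] = s[x];
        } else if (s[x + 1] == I51_ESC) {      /* 02 02 is a literal 02 */
            s[z] = s[++x];
        } else if (s[x + 1] == '\0' || s[x + 2] == '\0') {
            s[z] = '\0';                       /* truncated sequence: stop */
            break;
        } else {
            switch (s[++x]) {
            case '-': switch (s[++x]) {
                      case '<': s[z] = '\xf3'; break;  /* equal or less than */
                      case '=': s[z] = '\xf0'; break;  /* defined as */
                      case '>': s[z] = '\xf2'; break;  /* equal or greater than */
                      case '~': s[z] = '\xf7'; break;  /* about equal */
                      case 'I': s[z] = '\xee'; break;  /* part of lot */
                      default:  s[z] = s[x];   break;  /* fallback */
                      } break;
            case 'g': switch (s[++x]) {
                      case 'G': s[z] = '\xe2'; break;  /* capital gamma */
                      case 'a': s[z] = '\xe0'; break;  /* alpha */
                      default:  s[z] = s[x];   break;  /* fallback */
                      } break;
            default:  s[z] = s[++x];                   /* fallback */
            }
        }
    }
}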
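The reverse direction mentioned in the text above might look roughly
like the following sketch. The name expch is hypothetical, and only the
CP437 codes that cmbch() produces are covered; other PC hi-bit
characters would additionally need translating to their Latin-1 codes,
which this sketch does not attempt.

```c
#include <stddef.h>

/* Expand the CP437 codes handled by cmbch() back into I51 sequences.
   dst must hold up to 3 * strlen(src) + 1 bytes.
   Returns the number of bytes written, excluding the NUL. */
size_t expch(const char *src, char *dst)
{
    size_t n = 0;

    for (; *src != '\0'; src++) {
        const char *seq = NULL;           /* bytes following the escape */

        switch ((unsigned char)*src) {
        case 0xf3: seq = "-<"; break;     /* equal or less than */
        case 0xf0: seq = "-="; break;     /* defined as */
        case 0xf2: seq = "->"; break;     /* equal or greater than */
        case 0xf7: seq = "-~"; break;     /* about equal */
        case 0xee: seq = "-I"; break;     /* part of lot */
        case 0xe2: seq = "gG"; break;     /* capital gamma */
        case 0xe0: seq = "ga"; break;     /* alpha */
        case 0x02: seq = "\x02"; break;   /* literal 02 is doubled */
        default:   break;
        }
        if (seq == NULL) {
            dst[n++] = *src;              /* pass everything else through */
            continue;
        }
        dst[n++] = '\x02';
        dst[n++] = seq[0];
        if (seq[1] != '\0')
            dst[n++] = seq[1];
    }
    dst[n] = '\0';
    return n;
}
```

Unlike cmbch(), this cannot work in place, since each converted
character grows from one byte to three.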
The code necessary to translate between I51 hibit characters and any
ordinary 8 bit character set is trivial and left as an exercise to
the reader..:-)
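For readers who would rather not do the exercise, a table-driven
translation could be sketched as follows. Only a handful of rows from
Appendix A are included; the names are hypothetical, and returning '?'
for a code missing from the table is an assumption of this example, not
something this document prescribes.

```c
#include <stddef.h>

/* One row of a Latin-1 to CP437 translation table. */
struct i51_map {
    unsigned char latin1;
    unsigned char cp437;
};

/* A few sample rows taken from Appendix A; a real table would list
   every hi-bit code the target system can display. */
static const struct i51_map latin1_to_cp437[] = {
    {0xa1, 0xad}, {0xa3, 0x9c}, {0xc4, 0x8e}, {0xc5, 0x8f},
    {0xd6, 0x99}, {0xe4, 0x84}, {0xe5, 0x86}, {0xf6, 0x94}
};

/* Translate one Latin-1 code to CP437. */
unsigned char latin1_to_pc(unsigned char c)
{
    size_t i;
    size_t count = sizeof latin1_to_cp437 / sizeof latin1_to_cp437[0];

    if (c < 0x80)
        return c;                         /* ASCII is identical */
    for (i = 0; i < count; i++)
        if (latin1_to_cp437[i].latin1 == c)
            return latin1_to_cp437[i].cp437;
    return '?';                           /* not in table (assumption) */
}
```

The reverse table is built the same way with the two columns swapped.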
Appendix D - Comments on the base set
It is of course possible to use any character set as the base set,
even pure 7-bit ASCII. Earlier revisions of this standard were in fact
based on ASCII. However, using ASCII as the base set would require
all non-ASCII characters to be encoded. That would cause a lot of
unnecessary trouble for almost all foreign languages, and is not
desirable. No one would want all 'strange' characters of his language
to be encoded, just because 'we can't use 8 bits'. Mail sessions are
conducted in 8 bit, packets contain 8 bit data - so we can.
Then, of course, it is unwise not to use an 8 bit set as the base set,
since it will save a lot of space compared to a 7 bit set, not to
mention a lot of trouble. It is my belief that among 8 bit sets ISO
8859-1 is the most well-spread and common around, and that qualifies
it to be the proposed base set of this standard.
Appendix E - Comments on the escape character
The escape character can in fact be almost any character, if proper
measures are taken to keep the ordinary use of the chosen character
possible at the same time. To avoid too much trouble, it is
wise to select a character seldom found in mail. 0x01 would be a
perfect escape character, were it not for the fact that it is already
used for other purposes. The next character, however, is currently
unused. I therefore felt it wise to use 0x02 as the escape character
in this standard. There are several advantages related to the use of
this character as the escape character. There are of course other
characters (eg. '\' or '~') that could be used, but there are reasons
not to use them. '\', for instance, is commonly used in Europe to
represent a national character, and is therefore not well suited. The
'~' on the other hand is not often used, but can't be used as an
escape character due to the fact that it itself is an accent (see
below).
Appendix F - During the change to I51, co-existence with other methods
Any message in which the I51 standard is used (whether with extra
codes present or not) will, during a limited period of time, have the
following kludge line in it:
^AI51<cr>
With this kludge line present, a message editor at once will know that
a certain message should be 'de-I51-ified'. How to interpret messages
lacking this line is up to you to decide. However, should you find a 0x02
in a message lacking the kludge line, the message is to be considered
an I51 message.
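The detection rule above can be sketched as follows. The function name
is hypothetical, and requiring the kludge to start a line (the start of
the text, or right after a CR or LF) is an assumption based on how
kludge lines are normally placed.

```c
#include <string.h>

/* Return 1 if the message text should be treated as I51, else 0. */
int is_i51_message(const char *msg)
{
    const char *p = msg;

    /* Any 0x02 marks the message as I51, kludge line or not. */
    if (strchr(msg, '\x02') != NULL)
        return 1;

    /* Otherwise look for the ^AI51<cr> kludge at the start of a line. */
    while ((p = strstr(p, "\x01I51\r")) != NULL) {
        if (p == msg || p[-1] == '\r' || p[-1] == '\n')
            return 1;
        p++;
    }
    return 0;
}
```

A message for which this returns 0 falls under the "up to you" case,
and the editor is free to apply its own default.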
When a non-I51 message is quoted, its contents should be translated to
the corresponding I51 codes, if possible. Characters not found in the
I51 standard (as defined in this document) are to be ignored, unless a
similar I51 representation can be found.
Appendix G - Comments to the author
Please feel free to contact me on 2:200/108 if you have any questions,
comments or suggestions regarding this document, or anything
associated with it. I appreciate any suggestions on additional
'extra' characters to be added to this standard.