Document: FSC-0051
Version:  003
Date:     25-Feb-91


                                   I51

        A System-Independent Way of Transferring Special Characters

                                Draft III

                              Tomas Gradin,
                            2:200/108@fidonet



Status of this document:

    This FSC suggests  a proposed protocol  for the FidoNet(r)  community,
    and   requests   discussion   and   suggestions   for    improvements.
    Distribution of this document is unlimited.

    Fido  and  FidoNet  are  registered  marks  of  Tom  Jennings and Fido
    Software.


Contents

    Introduction
    How does it work?
    Advantages and problems
    Technical description
    The fallback method of displaying an extra character
    How to use I51 in mail
    Acknowledgements
    Appendix A - The Latin-1 standard
    Appendix B - A list of combined characters
    Appendix C - Sample code
    Appendix D - Comments on the base set
    Appendix E - Comments on the escape character
    Appendix F - During the change to I51, co-existence with other methods
    Appendix G - Comments to the author


Introduction

    This document proposes a method for transferring characters, including
    accented and otherwise special ones, in ordinary FidoNet messages. It
    is the result of some of the thoughts put forward in the discussion of
    foreign characters at TechCon I, as well as extensive discussions in
    the Swedish equivalent of NET_DEV.

    The proposed standard allows for the transmission of all variants of
    letters in the Latin alphabet, as well as several commonly used
    special characters. At the same time, the standard makes inclusion of
    additional characters painless. It also provides a way of
    automatically displaying these characters as closely as possible on
    systems that don't yet support them, using the built-in fallback
    method described in this document.

    One main advantage of this standard is that even though it uses a
    widespread character set as its base, it is not limited to that set.
    It is therefore possible to include as many characters as needed. The
    only restriction is that the additional characters implemented should
    be based on the Latin alphabet.


How does it work?

    The base character set used  in this standard is ISO  8859-1, commonly
    known as 'ISO Latin-1'.  All  characters present in that set are  used
    as is. The advantages of this  character set are well known, and  will
    not be discussed in this document. However, the most obvious advantage
    of Latin-1 is that characters can be easily case shifted.
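
    The case-shifting property can be sketched as follows. In Latin-1 the
    accented letters 0xc0-0xde and 0xe0-0xfe differ only by 0x20, just as
    'A'-'Z' and 'a'-'z' do. The function names are illustrative
    assumptions and not part of this proposal.

```c
/* Illustrative helpers; the names are assumptions, not part of I51. */

/* Return the upper-case form of a Latin-1 character code. */
unsigned char l1_toupper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 0x20);
    /* 0xe0-0xfe are lower-case letters, except 0xf7 (division sign) */
    if (c >= 0xe0 && c <= 0xfe && c != 0xf7)
        return (unsigned char)(c - 0x20);
    return c;   /* 0xdf and 0xff have no upper-case form in Latin-1 */
}

/* Return the lower-case form of a Latin-1 character code. */
unsigned char l1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 0x20);
    /* 0xc0-0xde are upper-case letters, except 0xd7 (multiplication) */
    if (c >= 0xc0 && c <= 0xde && c != 0xd7)
        return (unsigned char)(c + 0x20);
    return c;
}
```

    Note that the sharp s (0xdf) and y with diaeresis (0xff) are returned
    unchanged, since their upper-case forms are not in the base set.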

    All accented and  special characters not  present in the  base set are
    considered 'extra'  characters, and  are obtained  by using  a form of
    character combination.   To let  message editors  etc.   know when  to
    combine characters,  and when  not to,  all combination  sequences are
    preceded by a  special 'escape' character.   This escape character  is
    0x02, ie. ^B (STX).


Advantages and problems

    A system that strips eight bit characters when displaying them is no
    problem, since it doesn't support this proposed standard at present.
    When it eventually does (which I hope most systems will), the hi-bit
    characters will be treated as they should.

    On a system that treats eight bit characters as other characters,
    extra characters transmitted with the proposed method will look
    strange for as long as the system doesn't support this method.

      * The method will never break anything fully FTS-compliant.

      * It will give strange characters on systems that don't support this
        method, but that is not worse than the current situation.

      * It  will  give  systems  supporting  this  method  the  ability to
        transfer national, accented and  special characters to systems  on
        other computer platforms (ie. the characters look the same on a PC
        and a Macintosh).

      * Systems that support this method, but are implemented on computers
        that can't display certain characters, will automatically show
        the closest character the computer can provide, if the character
        in question is one of the extra ones. For the 96 hi-bit
        characters, developers will hopefully include the needed
        translation tables in their programs. Such tables can be provided
        upon request.

      * Conferences  on  FidoNet  in  English will be  minimally affected,
        since the English language seldom uses other characters than those
        in  pure  ASCII.  The  possibility  to  use  other characters will
        however be present, if  needed. Those that frequently  use special
        characters will benefit a  lot, without causing trouble  for those
        that don't.

      * In fact, the minimum requirement to be I51-compatible is that your
        system can handle Latin-1 codes, plus the I51 fallback. Once the
        base set of I51 (ie. Latin-1) is implemented, you obtain full I51
        compliance by just adding the I51 fallback. After that, you can
        choose which of the I51 extra characters to implement, if any at
        all. The automatic fallback system takes care of the rest for
        you! The additional work needed to make a Latin-1 compatible
        system fully support I51 is indeed negligible.


Technical description

    The format of a representation of an extra character is as follows:

<escape character><modifier><base character>

    I will be  using 0x02 as  escape character in  the examples below.  It
    will however be represented with a '.', since it is non-printable.

 Examples:

    02 2d 7e (.-~) will display as an about equals sign ('≈').

    02 24 50 (.$P) is used to represent a peseta symbol ('₧').

    02  02  represents  a  single  02,  if  that  code ever is needed in a
    message. I propose that the use of 0x02 in messages for other  reasons
    than in this method of character transmission should be prohibited.
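
    The format above could be read from a byte stream roughly as in the
    following sketch. The structure and function names are assumptions
    made for illustration; only the three-byte layout and the 02 02 rule
    come from this document.

```c
#include <stddef.h>

#define I51_ESC 0x02            /* escape character, ^B (STX) */

/* One decoded unit of an I51 byte stream (hypothetical type). */
struct i51_unit {
    unsigned char modifier;     /* 0 for an ordinary character */
    unsigned char base;         /* the character, or the base character */
};

/* Read one unit from buf[0..len).  Returns the number of bytes used,
   or 0 if len is 0 or an escape sequence is cut short. */
size_t i51_next(const unsigned char *buf, size_t len, struct i51_unit *u)
{
    if (len == 0)
        return 0;
    if (buf[0] != I51_ESC) {            /* ordinary character */
        u->modifier = 0;
        u->base = buf[0];
        return 1;
    }
    if (len >= 2 && buf[1] == I51_ESC) { /* 02 02 is a single 02 */
        u->modifier = 0;
        u->base = I51_ESC;
        return 2;
    }
    if (len < 3)
        return 0;                       /* truncated sequence */
    u->modifier = buf[1];               /* <escape><modifier><base> */
    u->base = buf[2];
    return 3;
}
```

    A caller would loop over the message, advancing by the returned count
    and stopping when it is 0.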


The fallback method of displaying an extra character

    If the system on which you are implementing this method of special
    character transmission doesn't support a certain extra character, the
    following procedure should be used. To display a special character as
    closely as possible, just skip the modifier! Ie. the sequence 02 67 61
    (.ga) is displayed as 'a', and 02 5e 73 (.^s) as 's'. It is therefore
    preferred that the FTSC, in assigning sequences to any additional
    characters, takes this into account.
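
    A minimal sketch of the fallback, assuming a NUL-terminated string
    that may be modified in place. The function name is hypothetical, and
    dropping a sequence that is cut short at the end of the string is a
    choice made here, not something this document prescribes.

```c
#include <stddef.h>

#define I51_ESC '\x02'          /* escape character, ^B (STX) */

/* Replace every escape sequence in s by its base character, in place.
   Returns the new length of s. */
size_t i51_fallback(char *s)
{
    size_t r = 0, w = 0;

    while (s[r] != '\0') {
        if (s[r] != I51_ESC) {
            s[w++] = s[r++];                /* ordinary character */
        } else if (s[r + 1] == I51_ESC) {
            s[w++] = I51_ESC;               /* 02 02 is a single 02 */
            r += 2;
        } else if (s[r + 1] != '\0' && s[r + 2] != '\0') {
            s[w++] = s[r + 2];              /* skip escape and modifier */
            r += 3;
        } else {
            break;                          /* truncated sequence: drop */
        }
    }
    s[w] = '\0';
    return w;
}
```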


How to use I51 in mail

    In-transit mail in I51 format _must_ be passed on unaltered, per
    FTS-0001. However, it is possible to store messages locally in any
    desired format. As long as BBS programs don't have options for users
    to change their character setup and representation, this may be
    desirable.

    The I51 method of representing special characters is also allowed in
    message headers, provided that account is taken of the fact that the
    extra characters occupy more bytes than the 'normal' characters.
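
    The difference between displayed length and stored length in a header
    field might be checked as in the sketch below. The function names are
    assumptions, and the field size is supplied by the caller; no
    particular header layout is assumed here.

```c
#include <stddef.h>
#include <string.h>

#define I51_ESC '\x02'          /* escape character, ^B (STX) */

/* Number of characters a reader sees for the I51 string s. */
size_t i51_display_len(const char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        if (*s == I51_ESC && s[1] == I51_ESC)
            s += 2;             /* 02 02 is a single 02 */
        else if (*s == I51_ESC && s[1] != '\0' && s[2] != '\0')
            s += 3;             /* escape, modifier, base */
        else
            s += 1;             /* ordinary character */
        n++;
    }
    return n;
}

/* Nonzero if s, plus its terminating NUL, fits in field_size bytes. */
int i51_fits_field(const char *s, size_t field_size)
{
    return strlen(s) + 1 <= field_size;
}
```

    For instance, a name written with one extra character followed by five
    ordinary ones displays as six characters but occupies eight bytes,
    plus the terminating NUL.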

    Since the character codes 0x80 - 0x9f are undefined in ISO 8859-1,
    their presence in an I51 message is prohibited, unless defined in an
    FTS document (eg. 'soft CR').
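
    A check for prohibited codes could look like the following sketch. It
    assumes that the soft CR (0x8d) is the only code in the range that an
    FTS document defines; that assumption, like the function name, is made
    for illustration only and should be checked against the FTS documents
    in force.

```c
#include <stddef.h>

#define SOFT_CR 0x8d            /* assumed to be the only exception */

/* Return the offset of the first byte in 0x80-0x9f that is not the
   soft CR, or -1 if the buffer contains none. */
long i51_find_prohibited(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] >= 0x80 && buf[i] <= 0x9f && buf[i] != SOFT_CR)
            return (long)i;
    return -1;
}
```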


Acknowledgements

    I would like to thank those present at TechCon I (in Antwerp, Belgium,
    July 1990) during the discussion of foreign characters for the
    fundamental ideas that led to this proposal.

    I would also like to thank  all those that have made comments  on this
    document, both in netmail and echomail.


Appendix A - The Latin-1 standard

    The  following  list  comprises  the  hi-bit characters present in the
    Latin-1 standard, which is used as the base set of I51.

 hex dec    char  character description            character (PC codepage) *

  a0 160          non-breaking space               ff   (437)
  a1 161     ¡    inverted exclamation mark        ad ¡ (437)
  a2 162     ¢    cent sign                        bd ¢ (850)
  a3 163     £    pound sign                       9c £ (437)
  a4 164     ¤    currency sign                    cf ¤ (850)
  a5 165     ¥    yen sign                         be ¥ (850)
  a6 166     ¦    broken bar                       dd ¦ (850)
  a7 167     §    paragraph sign                   f5 § (850) *
  a8 168     ¨    diaeresis                        f9 ¨ (850)
  a9 169     ©    copyright sign                   b8 © (850)
  aa 170     ª    feminine ordinal indicator       a6 ª (437)
  ab 171     «    left angle quotation mark        ae « (437)
  ac 172     ¬    not sign                         aa ¬ (437)
  ad 173     ­    soft hyphen                      f0 ­ (850)
  ae 174     ®    registered trade mark sign       a9 ® (850)
  af 175     ¯    macron                           ee ¯ (850)
  b0 176     °    degree sign                      f8 ° (437)
  b1 177     ±    plus-minus sign                  f1 ± (437)
  b2 178     ²    superscript two                  fd ² (437)
  b3 179     ³    superscript three                fc ³ (850)
  b4 180     ´    acute accent                     ef ´ (850)
  b5 181     µ    small greek letter mu            e6 µ (437)
  b6 182     ¶    pilcrow sign                     f4 ¶ (850) *
  b7 183     ·    middle dot                       fa · (437)
  b8 184     ¸    cedilla                          f7 ¸ (850)
  b9 185     ¹    superscript one                  fb ¹ (850)
  ba 186     º    masculine ordinal indicator      a7 º (437)
  bb 187     »    right angle quotation mark       af » (437)
  bc 188     ¼    vulgar fraction one quarter      ac ¼ (437)
  bd 189     ½    vulgar fraction one half         ab ½ (437)
  be 190     ¾    vulgar fraction three quarters   f3 ¾ (850)
  bf 191     ¿    inverted question mark           a8 ¿ (437)
  c0 192     À    A with grave accent              b7 À (850)
  c1 193     Á    A with acute accent              b5 Á (850)
  c2 194     Â    A with circumflex accent         b6 Â (850)
  c3 195     Ã    A with tilde                     c7 Ã (850)
  c4 196     Ä    capital letter A with diaeresis  8e Ä (437)
  c5 197     Å    capital letter A with ring above 8f Å (437)
  c6 198     Æ    ligature AE                      92 Æ (437)
  c7 199     Ç    C with cedilla                   80 Ç (437)
  c8 200     È    E with grave accent              d4 È (850)
  c9 201     É    E with acute accent              90 É (437)
  ca 202     Ê    E with circumflex accent         d2 Ê (850)
  cb 203     Ë    E with diaeresis                 d3 Ë (850)
  cc 204     Ì    I with grave accent              de Ì (850)
  cd 205     Í    I with acute accent              d6 Í (850)
  ce 206     Î    I with circumflex accent         d7 Î (850)
  cf 207     Ï    I with diaeresis                 d8 Ï (850)
  d0 208     Ð    Icelandic Eth                    d1 Ð (850)
  d1 209     Ñ    N with tilde                     a5 Ñ (437)
  d2 210     Ò    O with grave accent              e3 Ò (850)
  d3 211     Ó    O with acute accent              e0 Ó (850)
  d4 212     Ô    O with circumflex accent         e2 Ô (850)
  d5 213     Õ    O with tilde                     e5 Õ (850)
  d6 214     Ö    O with diaeresis                 99 Ö (437)
  d7 215     ×    multiplication sign              9e × (850)
  d8 216     Ø    slash O                          9d Ø (850)
  d9 217     Ù    U with grave accent              eb Ù (850)
  da 218     Ú    U with acute accent              e9 Ú (850)
  db 219     Û    U with circumflex accent         ea Û (850)
  dc 220     Ü    U with diaeresis                 9a Ü (437)
  dd 221     Ý    Y with acute accent              ed Ý (850)
  de 222     Þ    capital Icelandic Thorn          e8 Þ (850)
  df 223     ß    small german letter sharp s      e1 ß (437)
  e0 224     à    a with grave accent              85 à (437)
  e1 225     á    a with acute accent              a0 á (437)
  e2 226     â    a with circumflex accent         83 â (437)
  e3 227     ã    a with tilde                     c6 ã (850)
  e4 228     ä    a with diaeresis                 84 ä (437)
  e5 229     å    a with ring above                86 å (437)
  e6 230     æ    small ae-ligature                91 æ (437)
  e7 231     ç    c with cedilla                   87 ç (437)
  e8 232     è    e with grave accent              8a è (437)
  e9 233     é    e with acute accent              82 é (437)
  ea 234     ê    e with circumflex accent         88 ê (437)
  eb 235     ë    e with diaeresis                 89 ë (437)
  ec 236     ì    i with grave accent              8d ì (437)
  ed 237     í    i with acute accent              a1 í (437)
  ee 238     î    i with circumflex                8c î (437)
  ef 239     ï    i with diaeresis                 8b ï (437)
  f0 240     ð    small Icelandic Eth              d0 ð (850)
  f1 241     ñ    n with tilde                     a4 ñ (437)
  f2 242     ò    o with grave accent              95 ò (437)
  f3 243     ó    o with acute accent              a2 ó (437)
  f4 244     ô    o with circumflex accent         93 ô (437)
  f5 245     õ    o with tilde                     e4 õ (850)
  f6 246     ö    o with diaeresis                 94 ö (437)
  f7 247     ÷    division sign                    f6 ÷ (437)
  f8 248     ø    small o slash                    9b ø (850)
  f9 249     ù    u with grave accent              97 ù (437)
  fa 250     ú    u with acute accent              a3 ú (437)
  fb 251     û    u with circumflex accent         96 û (437)
  fc 252     ü    u with diaeresis                 81 ü (437)
  fd 253     ý    y with acute accent              ec ý (850)
  fe 254     þ    small icelandic thorn            e7 þ (850)
  ff 255     ÿ    y with diaeresis                 98 ÿ (437)

* The pilcrow  and paragraph signs  are also found  in CP 437,  at 0x14 and
 0x15 respectively.  All  characters with CP listed  as 437 have the  same
 codes in CP 850 -  thus, viewing this list with  CP set to 850 will  give
 all the right characters.
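
    As a sketch of the kind of translation table meant here, the
    following maps the hi-bit half of the base set to IBM PC codepage
    437. The names are assumptions. Entries are taken from the codes
    marked (437) in the list, plus the CP 437 positions of the paragraph
    and pilcrow signs (0x15, 0x14) and of the cent and yen signs (0x9b,
    0x9d; the codes bd and be in the list are their CP 850 positions).
    Characters that CP 437 lacks are left to a caller-supplied default.

```c
/* CP 437 code for each Latin-1 code 0xa0-0xff; 0 means that CP 437
   has no such character. */
static const unsigned char l1_cp437[96] = {
    /* a0 */ 0xff, 0xad, 0x9b, 0x9c, 0x00, 0x9d, 0x00, 0x15,
    /* a8 */ 0x00, 0x00, 0xa6, 0xae, 0xaa, 0x00, 0x00, 0x00,
    /* b0 */ 0xf8, 0xf1, 0xfd, 0x00, 0x00, 0xe6, 0x14, 0xfa,
    /* b8 */ 0x00, 0x00, 0xa7, 0xaf, 0xac, 0xab, 0x00, 0xa8,
    /* c0 */ 0x00, 0x00, 0x00, 0x00, 0x8e, 0x8f, 0x92, 0x80,
    /* c8 */ 0x00, 0x90, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    /* d0 */ 0x00, 0xa5, 0x00, 0x00, 0x00, 0x00, 0x99, 0x00,
    /* d8 */ 0x00, 0x00, 0x00, 0x00, 0x9a, 0x00, 0x00, 0xe1,
    /* e0 */ 0x85, 0xa0, 0x83, 0x00, 0x84, 0x86, 0x91, 0x87,
    /* e8 */ 0x8a, 0x82, 0x88, 0x89, 0x8d, 0xa1, 0x8c, 0x8b,
    /* f0 */ 0x00, 0xa4, 0x95, 0xa2, 0x93, 0x00, 0x94, 0xf6,
    /* f8 */ 0x00, 0x97, 0xa3, 0x96, 0x81, 0x00, 0x00, 0x98
};

/* Translate one base set character to CP 437; dflt is returned for
   characters that cannot be represented. */
unsigned char l1_to_cp437(unsigned char c, unsigned char dflt)
{
    unsigned char pc;

    if (c < 0x80)
        return c;               /* ASCII is common to both sets */
    if (c < 0xa0)
        return dflt;            /* 0x80-0x9f: undefined in ISO 8859-1 */
    pc = l1_cp437[c - 0xa0];
    return pc != 0 ? pc : dflt;
}
```

    A table for CP 850 would have no empty entries, since every character
    in the list above exists in that codepage.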


Appendix B - A list of combined characters

    The following list contains the escaped representations of the
    majority of the IBM PC's special and accented characters not present
    in the base set, as well as some others. To standardize how a certain
    additional character is to be represented, the FTSC will publish a
    list of such characters, similar to this one. The use of combination
    sequences other than the ones approved by the FTSC is discouraged.

 hex string   bytes   character description          character (PC codepage)

 02 20 30     . 0     superscript zero               -
 02 20 34     . 4     superscript four               -
 02 20 35     . 5     superscript five               -
 02 20 36     . 6     superscript six                -
 02 20 37     . 7     superscript seven              -
 02 20 38     . 8     superscript eight              -
 02 20 39     . 9     superscript nine               -
 02 20 69     . i     dot-less i                     d5 ı (850)
 02 20 49     . I     I with dot                     -
 02 20 6e     . n     superscript n                  fc ⁿ (437)
 02 22 55     ."U     U with double acute accent     -
 02 22 75     ."u     u with double acute accent     -
 02 2e 30     ..0     subscript zero                 -
 02 2e 31     ..1     subscript one                  -
 02 2e 32     ..2     subscript two                  -
 02 2e 33     ..3     subscript three                -
 02 2e 34     ..4     subscript four                 -
 02 2e 35     ..5     subscript five                 -
 02 2e 36     ..6     subscript six                  -
 02 2e 37     ..7     subscript seven                -
 02 2e 38     ..8     subscript eight                -
 02 2e 39     ..9     subscript nine                 -
 02 24 50     .$P     peseta sign                    9e ₧ (437)
 02 24 66     .$f     guilder sign                   9f ƒ (437)
 02 2c 41     .,A     A with cedilla                 -
 02 2c 45     .,E     E with cedilla                 -
 02 2c 53     .,S     S with cedilla                 -
 02 2c 61     .,a     a with cedilla                 -
 02 2c 65     .,e     e with cedilla                 -
 02 2c 73     .,s     s with cedilla                 -
 02 2d 3c     .-<     less than or equal to          f3 ≤ (437)
 02 2d 3d     .-=     identical to (defined as)      f0 ≡ (437)
 02 2d 3e     .->     greater than or equal to       f2 ≥ (437)
 02 2d 7e     .-~     about equal                    f7 ≈ (437)
 02 2d 43     .-C     complement of                  -
 02 2d 49     .-I     element of                     ee ε (437)
 02 2d 53     .-S     Polish S with dash             -
 02 2d 5a     .-Z     Polish Z with dash             -
 02 2d 73     .-s     Polish s with dash             -
 02 2d 7a     .-z     Polish z with dash             -
 02 2e 53     ..S     Polish S with dot              -
 02 2e 5a     ..Z     Polish Z with dot              -
 02 2e 73     ..s     Polish s with dot              -
 02 2e 7a     ..z     Polish z with dot              -
 02 2f 4c     ./L     Polish L slash                 -
 02 2f 6c     ./l     Polish l slash                 -
 02 5e 47     .^G     G with inversed circ. accent   -
 02 5e 53     .^S     S with inversed circ. accent   -
 02 5e 67     .^g     g with inversed circ. accent   -
 02 5e 73     .^s     s with inversed circ. accent   -
 02 67 47     .gG     capital gamma                  e2 Γ (437)
 02 67 61     .ga     alpha                          e0 α (437)
 02 74 6d     .tm     trade mark sign                -

<end of list>

    The number enclosed in parentheses is the IBM PC codepage number. A
    hyphen denotes a character that does not exist on the IBM PC.
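
    Since the list is expected to grow, a table-driven lookup may be
    easier to extend than nested switch statements. The sketch below
    holds the sequences from the list that exist in codepage 437 and
    falls back to the base character for everything else. The type and
    function names are assumptions made for illustration.

```c
#include <stddef.h>

/* One entry of the combination list (hypothetical type). */
struct i51_extra {
    char          modifier;
    char          base;
    unsigned char cp437;        /* code in IBM PC codepage 437 */
};

/* The sequences from the list that have a CP 437 character. */
static const struct i51_extra i51_extras[] = {
    { ' ', 'n', 0xfc },         /* superscript n */
    { '$', 'P', 0x9e },         /* peseta sign */
    { '$', 'f', 0x9f },         /* guilder sign */
    { '-', '<', 0xf3 },
    { '-', '=', 0xf0 },
    { '-', '>', 0xf2 },
    { '-', '~', 0xf7 },         /* about equal */
    { '-', 'I', 0xee },
    { 'g', 'G', 0xe2 },         /* capital gamma */
    { 'g', 'a', 0xe0 }          /* alpha */
};

/* Return the CP 437 code for <modifier><base>, or the base character
   itself (the I51 fallback) if the sequence is not in the table. */
unsigned char i51_to_cp437(char modifier, char base)
{
    size_t i;

    for (i = 0; i < sizeof i51_extras / sizeof i51_extras[0]; i++)
        if (i51_extras[i].modifier == modifier &&
            i51_extras[i].base == base)
            return i51_extras[i].cp437;
    return (unsigned char)base;
}
```

    Adding a newly approved sequence then only means adding one line to
    the table.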


Appendix C - Sample code

    Here is some sample C code. The first function combines sequences into
    their proper representation  in IBM PC  codepage 437, the  second does
    the reverse, ie. converts characters not found in the I51 base set  to
    their combination sequences.

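#include <string.h>

#define I51_ESC '\x02'                  /* I51 escape character, ^B (STX) */

/* Combine the I51 sequences in s, in place, into IBM PC codepage 437.
   Sequences without a CP 437 equivalent fall back to the base character. */
void   cmbch(char *s)
{
   size_t  z, x, sl;

   sl = strlen(s);
   for (z = 0, x = 0; x < sl; z++, x++) {
       if (s[x] != I51_ESC)
           s[z] = s[x];                    /* ordinary character */
       else if (s[x + 1] == I51_ESC)
           s[z] = s[++x];                  /* 02 02 is a single 02 */
       else if (x + 2 >= sl)
           break;                          /* truncated sequence: drop it */
       else
           switch (s[++x]) {
               case ' ':   switch (s[++x]) {
                   case 'n':   s[z] = '\xfc'; break;
                   default:    s[z] = s[x]; break;
               } break;
               case '$':   switch (s[++x]) {
                   case 'P':   s[z] = '\x9e'; break;
                   case 'f':   s[z] = '\x9f'; break;
                   default:    s[z] = s[x]; break;
               } break;
               case '-':   switch (s[++x]) {
                   case '<':   s[z] = '\xf3'; break;
                   case '=':   s[z] = '\xf0'; break;
                   case '>':   s[z] = '\xf2'; break;
                   case '~':   s[z] = '\xf7'; break;
                   case 'I':   s[z] = '\xee'; break;
                   default:    s[z] = s[x]; break;
               } break;
               case 'g':   switch (s[++x]) {
                   case 'G':   s[z] = '\xe2'; break;
                   case 'a':   s[z] = '\xe0'; break;
                   default:    s[z] = s[x]; break;
               } break;
               default:    s[z] = s[++x];  /* fallback: skip the modifier */
           }
   }
   s[z] = '\0';
}

/* Encode the CP 437 string s as I51 in t, and return t.  The result can
   be up to three times as long as s, so t must have room for
   3 * strlen(s) + 1 bytes and must not overlap s. */
char *encode(const char *s, char *t)
{
   char *start = t;

   for (; *s; s++)
       switch ((unsigned char)*s) {
           case 0x02:   *t++ = I51_ESC; *t++ = I51_ESC; break;
           case 0xfc:   *t++ = I51_ESC; *t++ = ' '; *t++ = 'n'; break;
           case 0x9e:   *t++ = I51_ESC; *t++ = '$'; *t++ = 'P'; break;
           case 0x9f:   *t++ = I51_ESC; *t++ = '$'; *t++ = 'f'; break;
           case 0xf3:   *t++ = I51_ESC; *t++ = '-'; *t++ = '<'; break;
           case 0xf0:   *t++ = I51_ESC; *t++ = '-'; *t++ = '='; break;
           case 0xf2:   *t++ = I51_ESC; *t++ = '-'; *t++ = '>'; break;
           case 0xf7:   *t++ = I51_ESC; *t++ = '-'; *t++ = '~'; break;
           case 0xee:   *t++ = I51_ESC; *t++ = '-'; *t++ = 'I'; break;
           case 0xe2:   *t++ = I51_ESC; *t++ = 'g'; *t++ = 'G'; break;
           case 0xe0:   *t++ = I51_ESC; *t++ = 'g'; *t++ = 'a'; break;
           default:     *t++ = *s;
       }
   *t = '\0';
   return (start);
}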

     The code necessary to translate between I51 hi-bit characters and any
     ordinary 8 bit character set is trivial and left as an exercise to
     the reader. :-)


Appendix D - Comments on the base set

    It is of course possible to use any character set as the base set,
    even pure 7-bit ASCII. Earlier revisions of this standard were in fact
    based on ASCII. But using ASCII as the base set would require all
    non-ASCII characters to be encoded. That would cause a lot of
    unnecessary trouble for almost all foreign languages, and is not
    desirable. No one would want all 'strange' characters of his language
    to be encoded, just because 'we can't use 8 bits'. Mail sessions are
    conducted in 8 bit, packets contain 8 bit data - so we can.

    It is therefore unwise not to use an 8 bit set as the base set, since
    it saves a lot of space compared to a 7 bit set, not to mention a lot
    of trouble. It is my belief that among 8 bit sets ISO 8859-1 is the
    most widespread and common, and that qualifies it to be the proposed
    base set of this standard.


Appendix E - Comments on the escape character

    The escape character can in fact be almost any character, if proper
    measures are taken to keep the ordinary use of the chosen character
    possible at the same time. To avoid too much trouble, it is wise to
    select a character seldom found in mail. 0x01 would be a
    perfect escape character, were it not for the fact that it is  already
    used for  other purposes.  The next  character, however,  is currently
    unused. I therefore felt it wise  to use 0x02 as the escape  character
    in this standard. There are  several advantages related to the  use of
    this character  as the  escape character.  There are  of course  other
    characters (eg.  '\' or '~') that could be used, but there are reasons
    not to use  them.  '\',  for instance, is  commonly used in  Europe to
    represent a national character, and is therefore not well suited.  The
    '~' on  the other  hand is  not often  used, but  can't be  used as an
    escape character  due to  the fact  that it  itself is  an accent (see
    below).


Appendix F - During the change to I51, co-existence with other methods

    Any message  in which  the I51  standard is  used (whether  with extra
    codes present or not) will, during a limited period of time, have  the
    following kludge line in it:

^AI51<cr>

    With this kludge line present, a message editor will know at once that
    a certain message should be 'de-I51-ified'. How to interpret messages
    lacking this line is up to you to decide. However, should you find a
    0x02 in a message lacking the kludge line, the message is to be
    considered an I51 message.
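
    The detection rule could be sketched as below. The function name is
    an assumption, as is the requirement that the kludge line starts
    either at the very beginning of the message text or directly after a
    <cr>.

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if the message text body[0..len) is to be treated as I51:
   it contains the ^AI51<cr> kludge line, or any 0x02 at all. */
int i51_detect(const char *body, size_t len)
{
    static const char kludge[] = "\x01" "I51\r";
    const size_t klen = sizeof kludge - 1;
    size_t i;

    for (i = 0; i < len; i++) {
        if (body[i] == '\x02')
            return 1;                   /* any 0x02 implies I51 */
        if ((i == 0 || body[i - 1] == '\r') &&
            len - i >= klen && memcmp(body + i, kludge, klen) == 0)
            return 1;                   /* kludge at start of a line */
    }
    return 0;
}
```

    Messages for which this returns 0 are the ones whose interpretation
    is left to the implementation.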

    When a non-I51 message is quoted, its contents should be translated to
    the corresponding I51 codes, if possible. Characters not found in  the
    I51 standard (as defined in this document) are to be ignored, unless a
    similar I51 representation can be found.


Appendix G - Comments to the author

    Please feel free to contact me on 2:200/108 if you have any questions,
    comments  or   suggestions  regarding   this  document,   or  anything
    associated  with  it.   I  appreciate  any  suggestions  on additional
    'extra' characters to be added to this standard.