CURRENT_MEETING_REPORT_
Reported by Borka Jerman-Blazic/Jozef Stefan Institute
Minutes of the UCS Character Set BOF (UCS)
Introduction
A brief introductory tutorial was given by Borka Jerman-Blazic. She
described some of the problems that arise on the network because of the
lack of support for the national character sets needed to input,
output, process and display text written in the languages used around
the world. She stressed the need to preserve character integrity
across the network. The requirement to process and interchange
different character sets correctly is especially relevant for Internet
services that deal with names of persons or organizations.
Presentation of the Problems
Peter Svanberg gave a short overview of the level of support for
non-ASCII character sets in different Internet protocols. Some of the
protocols were identified as hostile to 8-bit characters. Among them
are: DNS, SMTP, FTP, NNTP, WAIS, MIME Text/Enhanced, NFS, AFS, Whois,
URN, Gopher, etc. The more recently developed protocols such as MIME
part 1 and part 2 as well as some currently on-going projects such as
Whois++, as mentioned by Simon Spero, support 16-bit coding and the
repertoires provided by such coding. He also mentioned that several
IETF groups developing new protocols/services consider the importance of
the proper support of the character sets to be a problem. The level of
support for extended character sets in some protocols used on the
Internet is included in the Annex below.
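As an illustration of what "hostile to 8-bit characters" can mean in
practice, the following minimal sketch simulates a transport that clears
the high bit of every octet. The function name strip_high_bit and the
sample word are hypothetical and were not part of the presentation; the
behaviour of any particular protocol implementation is an assumption.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel that clears the top bit of every octet."""
    return bytes(b & 0x7F for b in data)


original = "smörgås"
sent = original.encode("latin-1")  # ISO 8859-1: one octet per character
received = strip_high_bit(sent).decode("ascii")

# The two non-ASCII letters silently turn into unrelated ASCII letters.
print(original, "->", received)
```

Pure ASCII text passes through such a channel unchanged, which is why
the damage often goes unnoticed until national characters are used.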
The next speaker was Masataka Ohta. He presented his view regarding the
idea that the International Universal Coding system be recommended for
use over the Internet. He identified five properties which are required
to be present in the recommended coding system:
1. Identity for encoding and decoding, which he understands as unique
mapping between particular graphic character and its code (bit
combination);
2. Causality, understood as independence of a processed coded
character from the other incoming characters in the data stream;
3. Finite state recognition, state dependence of the code required for
presentation/display of multi-octet coded data;
4. Finite resynchronizability, which means that the state of the
automaton can be determined uniquely by reading a fixed, finite
number of octets; and
5. Equality, requirement that a character coded with a different
coding system can always be recognized as the same character.
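Property 4 can be sketched with a multi-octet encoding in which
continuation octets are distinguishable from leading octets. The
example below uses Python's built-in UTF-8 codec as a stand-in; that
the encodings discussed at the BOF behave the same way is an
assumption, and the helper name resync is hypothetical.

```python
def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos.

    In UTF-8 every continuation octet has the bit pattern 10xxxxxx, so a
    reader dropped into the middle of a stream only has to skip a small,
    bounded number of octets to find the start of the next character.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "aé€b".encode("utf-8")  # octets: 61 | C3 A9 | E2 82 AC | 62

# Offset 2 falls inside the two-octet character, offset 4 inside the
# three-octet character; in both cases decoding can resume cleanly.
for offset in (2, 4):
    start = resync(data, offset)
    print(offset, "->", start, repr(data[start:].decode("utf-8")))
```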
Masataka looked for the required properties in ISO 10646 and found out
that full ISO 10646 (UCS4) satisfies none of the required properties.
He also pointed out that ISO 10646 level 1 satisfies all of the
required properties for the European languages.
He proposed an extension to the existing UCS code system consisting of
five additional bits, intended to overcome these deficiencies of the UCS
coding system. The discussion showed that the proposed solution lies
outside the mainstream of development of standard character set codes
and their application in computing systems. One possible solution to
the problems identified by Masataka could be to use the whole model of
UCS, i.e., the four envisaged octets, which define not only the row and
cell position of a character in the Basic Multilingual Plane of
ISO 10646 but also the additional planes and groups.
There was a proposal that the required five additional bits be coded as
a private plane in the UCS scheme. John Klensin noted that such an
approach could clash with the reassignment of such a plane in the
standardization process of ISO JTC1/SC2. In the discussion the problem
of the handling of bidirectional text was also identified. Masataka
said that one of the five additional bits in his scheme is intended to
be used for indication of bidirectional text.
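To make the four-octet model concrete, the sketch below splits a
character's code value into the group, plane, row and cell octets of
the UCS-4 form. The type and function names are hypothetical, and
Python's ord() is used here on the assumption that its code values
coincide with UCS code positions for the characters shown.

```python
from typing import NamedTuple


class UCS4Position(NamedTuple):
    group: int
    plane: int
    row: int
    cell: int


def ucs4_position(ch: str) -> UCS4Position:
    """Split a character's code value into its four UCS-4 octets."""
    group, plane, row, cell = ord(ch).to_bytes(4, "big")
    return UCS4Position(group, plane, row, cell)


# Both characters sit in group 0, plane 0 (the Basic Multilingual Plane);
# only the row and cell octets differ.
print(ucs4_position("A"))
print(ucs4_position("\u017d"))  # LATIN CAPITAL LETTER Z WITH CARON
```

For characters in the Basic Multilingual Plane the first two octets are
always zero, which is why a two-octet form carrying only row and cell
is sufficient there.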
Harald Alvestrand pointed out that what is happening now is a sort of
transition period between 8-bit coding and the 16-bit coding provided by
UCS. A parallel approach to supporting different national character
sets is ``character switching,'' which is enabled by the code extension
technique of ISO 2022. It was clear that this scheme is not of
practical use for the Internet except in special cases, e.g., the
Japanese e-mail solution.
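The character switching described above can be observed with Python's
iso2022_jp codec, used here as one example of an ISO 2022 based
encoding; that it matches the Japanese e-mail solution mentioned at the
BOF in every detail is an assumption. The sample text is hypothetical.

```python
ESC_TO_KANJI = b"\x1b$B"  # escape sequence designating JIS X 0208
ESC_TO_ASCII = b"\x1b(B"  # escape sequence designating ASCII

text = "abc 日本語 xyz"
encoded = text.encode("iso2022_jp")

# The octet stream stays 7-bit clean; the meaning of each octet depends
# on which escape sequence was seen last.
print(encoded)
print("7-bit only:", all(b < 0x80 for b in encoded))
print("switches to kanji:", ESC_TO_KANJI in encoded)
print("switches back to ASCII:", ESC_TO_ASCII in encoded)
```

Because the interpretation of an octet depends on earlier escape
sequences, a reader that joins the stream in the middle cannot decode
it reliably, which is the state dependence raised in the discussion.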
Conclusions
The attendees then discussed possible work items that would result if
the IESG approves the formation of a working group. The chair
identified several documents which deal with character set problems such
as: RFC 1345, ``Character Mnemonics & Character Sets,'' the
Internet-Draft, ``X.400 use of extended character sets,'' and the
Internet-Draft, ``Characters and character sets for various languages.''
John Klensin pointed out that special precautions have to be taken in
the recommendation of UTF-2 as a data interchange method over the
Internet in connection with the possible assignments of additional
coding planes by JTC1/SC2. He also recommended the use of a mailing
list already working within IETF,
[email protected]. The
mailing list of the RARE working group on character sets could be added
to that mailing list. Other items were discussed and proposed by the
BOF attendees. It was decided that the IESG will be asked to consider
the possibility of setting up a working group to produce the following:
o A document defining how UCS can be used in a uniform way in
Internet protocols, especially taking into consideration the UTF-2
encoding of UCS. The document will provide guidance to other
protocols which have to deal with these items over the Internet.
o A document identifying the languages and the characters required
for coding text written in a particular natural language (a sort of
guideline for services dealing with multilinguality such as NIR
service based on the usage of plain text).
o A document defining a tool for coded character set conversion to be
provided within services such as an e-mail user agent, including a
fall-back representation of incoming characters that are outside
the supported character repertoire of the receiver.
o A proposal for extending the mandatory issues which have to be
covered in the RFC standardization process to include character set
consideration and support.
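The fall-back representation mentioned in the third work item could
look roughly like the following sketch. It is only one possible
policy, not something agreed at the BOF: the function name
to_repertoire, the choice of "?" as the substitute, and the use of
Unicode decomposition to find a base letter are all assumptions.

```python
import unicodedata


def to_repertoire(text: str, charset: str = "ascii") -> str:
    """Map text into the receiver's repertoire, with simple fall-backs."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)  # already representable: keep as is
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        # Fall-back 1: decompose and drop combining marks.
        base = "".join(
            c
            for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        try:
            base.encode(charset)
            out.append(base if base else "?")
        except UnicodeEncodeError:
            # Fall-back 2: no usable base letter, substitute a marker.
            out.append("?")
    return "".join(out)


print(to_repertoire("Blažič"))
print(to_repertoire("smörgås", "latin-1"))
print(to_repertoire("日本"))
```

A real tool would probably need per-language transliteration tables,
since dropping diacritics is not acceptable in every language.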
Annex
The level of support for extended character sets in some Internet
Standard protocols.
 _________________________________________________________________________
| CharSet |                      | CharSet |                              |
| Support | Protocol             | Support | ``Next Generation'' Protocol |
|_________|______________________|_________|______________________________|
|    1    | SMTP                 |    3    | ESMTP                        |
|    1    | RFC822               |    4    | MIME part 1 + part 2         |
|    1    | DNS                  |         |                              |
|    2    | FTP                  |         |                              |
|    3    | Telnet               |         |                              |
|    2    | NNTP                 |         |                              |
|    2    | Finger               |         |                              |
|    2    | POP3                 |         |                              |
|    2    | IMAP2                |    3    | IMAP2bis                     |
|    1    | NFS                  |         |                              |
|    1    | AFS                  |         |                              |
|    2    | MIME Text/Enhanced   |         |                              |
|    ?    | MIME Text/simplemail |         |                              |
|    3    | STIF                 |         |                              |
|    2    | Gopher               |    3    | Gopher+                      |
|    1    | WAIS                 |         |                              |
|    ?    | Prospero             |         |                              |
|    2    | HTML                 |         |                              |
|    2    | Whois                |    3    | Whois++                      |
|    2    | URL                  |         |                              |
|    2    | URN                  |         |                              |
|    3    | URM                  |         |                              |
|_________|______________________|_________|______________________________|
Legend:
1 -- hostile to 8-bit characters
2 -- no support for different character sets
3 -- some support for different character sets
4 -- well thought-out support for different character sets
5 -- uniform treatment of all characters
Attendees
Harald Alvestrand
[email protected]
Piet Bovenga
[email protected]
Maria Dimou-Zacharova
[email protected]
Tim Dixon
[email protected]
Olle Jarnefors
[email protected]
Borka Jerman-Blazic
[email protected]
Tomaz Kalin
[email protected]
John Klensin
[email protected]
Pekka Kytolaakso
[email protected]
Thomas Lenggenhager
[email protected]
Jun Matsukata
[email protected]
Keith Moore
[email protected]
Masataka Ohta
[email protected]
Geir Pedersen
[email protected]
Luc Rooijakkers
[email protected]
Rickard Schoultz
[email protected]
Milan Sova
[email protected]
Simon Spero
[email protected]
Peter Svanberg
[email protected]
Guido van Rossum
[email protected]