<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>UTF-8 and Unicode FAQ</TITLE>
<LINK REL=Up HREF="http://www.cl.cam.ac.uk/~mgk25/">
<META NAME="keywords" CONTENT="Unicode, ISO 10646-1, UCS, X11, X Window
System, Linux, Unix, POSIX, character sets, ISO 8859-1, xterm">
<META NAME="description" CONTENT="All you need to know to use
Unicode/UTF-8 on Unix and Linux systems.">
</HEAD>
<BODY BGCOLOR="#efefef" TEXT="#000000">
<H1>UTF-8 and Unicode FAQ for Unix/Linux</H1>

<P>by <A HREF="http://www.cl.cam.ac.uk/~mgk25/">Markus Kuhn</A>

<P><B>This text is a very comprehensive one-stop information resource
on how you can use Unicode/UTF-8 on POSIX systems (Linux, Unix). You
will find here both introductory information for every user and
detailed references for the experienced developer.</B>

<P><B>Unicode is well on the way to replacing ASCII and Latin-1 at
all levels within a few years. It allows you not only to handle text in
practically any script and language used on this planet, it also
provides you with a comprehensive set of mathematical and technical
symbols that will simplify scientific information exchange.</B>

<P><B>The UTF-8 encoding allows Unicode to be used in a convenient and
backwards compatible way in environments that, like Unix, were
designed entirely around ASCII. UTF-8 is the way in which Unicode is
used under Unix, Linux, and similar systems. It is now time to make
sure that you are well familiar with it and that your software
supports UTF-8 smoothly.</B>

<H2>Contents</H2>

<UL>

<LI><A HREF="#ucs">What are UCS and ISO 10646?</A>
<LI><A HREF="#comb">What are combining characters?</A>
<LI><A HREF="#levels">What are UCS implementation levels?</A>
<LI><A HREF="#national">Has UCS been adopted as a national standard?</A>
<LI><A HREF="#unicode">What is Unicode?</A>
<LI><A HREF="#diffs">So what is the difference between Unicode and ISO 10646?</A>
<LI><A HREF="#utf-8">What is UTF-8?</A>
<LI><A HREF="#examples">Where do I find nice UTF-8 example files?</A>
<LI><A HREF="#ucsutf">What different encodings are there?</A>
<LI><A HREF="#lang">What programming languages do support Unicode?</A>
<LI><A HREF="#linux">How should Unicode be used under Linux?</A>
<LI><A HREF="#mod">How do I have to modify my software?</A>
<LI><A HREF="#c">C support for Unicode and UTF-8</A>
<LI><A HREF="#activate">How should the UTF-8 mode be activated?</A>
<LI><A HREF="#getxterm">How do I get a UTF-8 version of xterm?</A>
<LI><A HREF="#xterm">How much of Unicode does xterm support?</A>
<LI><A HREF="#fonts">Where do I find ISO 10646-1 X11 fonts?</A>
<LI><A HREF="#term">What are the issues related to UTF-8 terminal emulators?</A>
<LI><A HREF="#apps">What UTF-8 enabled applications are already available?</A>
<FONT COLOR="#ff0000">[UPDATED]</FONT>
<LI><A HREF="#patches">What patches to improve UTF-8 support are available?</A>
<LI><A HREF="#libs">Are there free libraries for dealing with Unicode available?</A>
<LI><A HREF="#widgets">What is the status of Unicode support for various X widget libraries?</A>
<LI><A HREF="#wip">What packages with UTF-8 support are currently under development?</A>
<LI><A HREF="#solaris">How does UTF-8 support work under Solaris?</A>
<LI><A HREF="#ps">How are Postscript glyph names related to UCS codes?</A>
<LI><A HREF="#subsets">Are there any well-defined UCS subsets?</A>
<LI><A HREF="#conv">What issues are there to consider when converting encodings</A>
<LI><A HREF="#x11">Is X11 ready for Unicode?</A>
<LI><A HREF="#lists">Are there any good mailing lists on these issues?</A>
<LI><A HREF="#refs">Further References</A>

</UL>

<H2><A NAME="ucs">What are UCS and ISO 10646?</A></H2>

<P>The international standard <B>ISO 10646</B> defines the
<B>Universal Character Set (UCS)</B>. UCS is a superset of all other
character set standards. It guarantees round-trip compatibility with
other character sets: if you convert any text string to UCS and then
back to the original encoding, no information will be lost.

<P>UCS contains the characters required to represent practically all
known languages. This includes not only the Latin, Greek, Cyrillic,
Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese,
Japanese and Korean Han ideographs as well as scripts such as
Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo,
Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian,
Ogham, Myanmar, Sinhala, Thaana, Yi, and others. For scripts not yet
covered, research on how to best encode them for computer usage is
still going on and they will be added eventually. This includes not
only <A HREF="http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n1639/n1639.htm"
>Cuneiform</A>, <A
HREF="http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n1637/n1637.htm"
>Hieroglyphs</A> and various Indo-European languages, but even some
selected artistic scripts such as Tolkien's <A
HREF="http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n1641/n1641.htm"
>Tengwar</A> and <A
HREF="http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n1642/n1642.htm"
>Cirth</A>. UCS also covers a large number of graphical,
typographical, mathematical and scientific symbols, including those
provided by TeX, Postscript, APL, MS-DOS, MS-Windows, Macintosh, OCR
fonts, as well as many word processing and publishing systems, and
more are being added.

<P>ISO 10646 formally defines a 31-bit character set. However, of this
huge code space, so far characters have been assigned only to the
first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is
called the <B>Basic Multilingual Plane (BMP)</B> or Plane 0. The
characters that are expected to be encoded outside the 16-bit BMP
all belong to rather exotic scripts (e.g., Hieroglyphs) that are only
used by specialists for historic and scientific purposes. Current
plans suggest that there will never be characters assigned outside the
21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over
one million potential future characters. The ISO 10646-1 standard was
first published in 1993 and defines the architecture of the character
set and the content of the BMP. A second part, ISO 10646-2, which
defines characters encoded outside the BMP, is under preparation, but
it might take a few years until it is finished. New characters are
still being added to the BMP on a continuous basis, but the existing
characters will not be changed any more and are stable.

<P>UCS assigns to each character not only a code number but also an
official name. A hexadecimal number that represents a UCS or Unicode
value is commonly preceded by "U+" as in U+0041 for the character
"Latin capital letter A". The UCS characters U+0000 to U+007F are
identical to those in US-ASCII (ISO 646 IRV) and the range U+0000 to
U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to
U+F8FF and also larger ranges outside the BMP are reserved for private
use.

<P>The full name of the UCS standard is

<BLOCKQUOTE>
International Standard ISO/IEC 10646-1, Information technology --
Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane. Second edition,
International Organization for Standardization, Geneva, 2000-09-15.
</BLOCKQUOTE>

<P>It can be <A
HREF="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819">ordered
online from ISO</A> as a set of PDF files on CD-ROM for 80 CHF (~53
EUR, ~45 USD, ~32 GBP).

<H2><A NAME="comb">What are combining characters?</A></H2>

<P>Some code points in UCS have been assigned to <B>combining
characters</B>. These are similar to the non-spacing accent keys on a
typewriter. A combining character is not a full character by itself.
It is an accent or other diacritical mark that is added to the
previous character. This way, it is possible to place any accent on
any character. The most important accented characters, like those used
in the orthographies of common languages, have codes of their own in
UCS to ensure backwards compatibility with older character sets.
Accented characters that have their own code position, but could also
be represented as a pair of another character followed by a combining
character, are known as <B>precomposed characters</B>. Precomposed
characters are available in UCS for backwards compatibility with older
encodings such as ISO 8859 that had no combining characters. The
combining character mechanism allows accents and other diacritical
marks to be added to any character, which is especially important for
scientific notations such as mathematical formulae and the
International Phonetic Alphabet, where any possible combination of a
base character and one or several diacritical marks could be needed.

<P>Combining characters follow the character which they modify. For
example, the German umlaut character Ä ("Latin capital letter A
with diaeresis") can either be represented by the precomposed UCS code
U+00C4, or alternatively by the combination of a normal "Latin capital
letter A" followed by a "combining diaeresis": U+0041 U+0308. Several
combining characters can be applied when it is necessary to stack
multiple accents or add combining marks both above and below the base
character. For example with the Thai script, up to two combining
characters are needed on a single base character.
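
<P>As a minimal illustration (assuming a UTF-8 locale and a font that
covers these characters), the following C fragment prints the umlaut
once in precomposed and once in decomposed form; the escape sequences
are simply the UTF-8 encodings of U+00C4 and of U+0041 U+0308:

<PRE>  #include &lt;stdio.h>

  int main()
  {
    /* U+00C4 encodes in UTF-8 as 0xC3 0x84 */
    printf("precomposed: \xc3\x84\n");
    /* U+0041 U+0308 encodes as 0x41 followed by 0xCC 0x88 */
    printf("decomposed:  A\xcc\x88\n");
    return 0;
  }
</PRE>

<P>Both lines should look identical on a terminal emulator that
supports combining characters.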

<H2><A NAME="levels">What are UCS implementation levels?</A></H2>

<P>Not all systems are expected to support all the advanced mechanisms
of UCS such as combining characters. Therefore, ISO 10646 specifies
the following three implementation levels:

<DL>

<DT>Level 1<DD>Combining characters and Hangul Jamo characters (a
special, more complicated encoding of the Korean script, where Hangul
syllables are coded as two or three subcharacters) are not supported.

<DT>Level 2<DD>Like level 1, however in some scripts, a fixed list of
combining characters is now allowed (e.g., for Hebrew, Arabic,
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Thai and Lao). These scripts cannot be represented
adequately in UCS without support for at least certain combining
characters.

<DT>Level 3<DD>All UCS characters are supported, such that for example
mathematicians can place a tilde or an arrow (or both) on any
arbitrary character.

</DL>

<H2><A NAME="national">Has UCS been adopted as a national standard?</A></H2>

<P>Yes, a number of countries have published national adoptions of ISO
10646-1:1993, sometimes after adding additional annexes with
cross-references to older national standards and specifications of
various national implementation subsets:

<UL>

<LI>China: GB 13000.1-93

<LI>Japan: JIS X 0221-1995

<LI>Korea: KS X 1005-1:1995 (includes ISO 10646-1:1993 amendments 1-7)

</UL>

<H2><A NAME="unicode">What is Unicode?</A></H2>

<P>Historically, there have been two independent attempts to create a
single unified character set. One was the ISO 10646 project of the <A
HREF="http://www.iso.ch/">International Organization for
Standardization (ISO)</A>, the other was the <A
HREF="http://www.unicode.org/">Unicode Project</A> organized by a
consortium of (initially mostly US) manufacturers of multi-lingual
software. Fortunately, the participants of both projects realized
around 1991 that two different unified character sets are not what the
world needs. They joined their efforts and worked together on creating
a single code table. Both projects still exist and publish their
respective standards independently, however the Unicode Consortium and
ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode
and ISO 10646 standards compatible and they closely coordinate any
further extensions. Unicode 1.1 corresponded to ISO 10646-1:1993 and
Unicode 3.0 corresponds to ISO 10646-1:2000. All Unicode versions
since 2.0 are compatible, only new characters will be added, no
existing characters will be removed or renamed in the future.

<P>The Unicode Standard can be ordered like any normal book, for
instance via <A
HREF="http://www.amazon.com/exec/obidos/ASIN/0201616335/mgk25">amazon.com</A>
for around 50&nbsp;USD:

<BLOCKQUOTE>
The Unicode Consortium: <A
HREF="http://www.amazon.com/exec/obidos/ASIN/0201616335/mgk25">The
Unicode Standard, Version 3.0</A>,<BR> Reading, MA, Addison-Wesley
Developers Press, 2000,<BR> ISBN 0-201-61633-5.
</BLOCKQUOTE>

<P>If you work frequently with text processing and character sets, you
definitely should get a copy. It is also available <A
HREF="http://www.unicode.org/unicode/uni2book/u2.html">online</A> now.

<H2><A NAME="diffs">So what is the difference between Unicode and ISO 10646?</A></H2>

<P>The <A
HREF="http://www.unicode.org/unicode/standard/standard.html">Unicode
Standard</A> published by the Unicode Consortium contains exactly the
ISO 10646-1 Basic Multilingual Plane at implementation level 3. All
characters are at the same positions and have the same names in both
standards.

<P>The Unicode Standard additionally defines much more of the semantics
associated with some of the characters and is in general a better
reference for implementors of high-quality typographic publishing
systems. Unicode specifies algorithms for rendering presentation forms
of some scripts (say Arabic), handling of bi-directional texts that
mix for instance Latin and Hebrew, algorithms for sorting and string
comparison, and much more.

<P>The ISO 10646 standard on the other hand is not much more than a
simple character set table, comparable to the well-known ISO 8859
standard. It specifies some terminology related to the standard,
defines some encoding alternatives, and it contains specifications of
how to use UCS in connection with other established ISO standards such
as ISO 6429 and ISO 2022. There are other closely related ISO
standards, for instance ISO 14651 on sorting UCS strings. A nice
feature of the ISO 10646-1 standard is that it provides CJK example
glyphs in five different style variants, while the Unicode standard
shows the CJK ideographs only in a Chinese variant.

<H2><A NAME="utf-8">What is UTF-8?</A></H2>

<P>UCS and Unicode are first of all just code tables that assign
integer numbers to characters. There exist several alternatives for
how a sequence of such characters or their respective integer values
can be represented as a sequence of bytes. The two most obvious
encodings store Unicode text as sequences of either 2-byte or 4-byte
units. The official terms for these encodings are UCS-2 and UCS-4,
respectively. Unless otherwise specified, the most significant byte
comes first in these (Bigendian convention). An ASCII or Latin-1 file
can be transformed into a UCS-2 file by simply inserting a 0x00 byte
in front of every ASCII byte. If we want to have a UCS-4 file, we have
to insert three 0x00 bytes instead before every ASCII byte.
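
<P>This transformation is trivial to express in code. Here is a
minimal sketch (the function name is merely illustrative):

<PRE>  #include &lt;stddef.h>  /* size_t */

  /* Expand an ASCII string of n bytes into Bigendian UCS-2.
     The output buffer must provide 2*n bytes. Sketch only. */
  void ascii_to_ucs2be(const unsigned char *in, size_t n,
                       unsigned char *out)
  {
    size_t i;
    for (i = 0; i &lt; n; i++) {
      out[2*i]     = 0x00;    /* most significant byte first */
      out[2*i + 1] = in[i];   /* the original ASCII byte */
    }
  }
</PRE>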

<P>Using UCS-2 (or UCS-4) under Unix would lead to very severe
problems. Strings in these encodings can contain, as part of many
wide characters, bytes like '\0' or '/', which have a special meaning in
filenames and other C library function parameters. In addition, the
majority of UNIX tools expect ASCII files and can't read 16-bit words
as characters without major modifications. For these reasons,
<B>UCS-2</B> is not a suitable external encoding of <B>Unicode</B> in
filenames, text files, environment variables, etc.

<P>The <B>UTF-8</B> encoding defined in ISO 10646-1:2000 <A
HREF="ucs/ISO-10646-UTF-8.html">Annex D</A> and also described in <A
HREF="ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt">RFC
2279</A> as well as section 3.8 of the Unicode 3.0 standard does not
have these problems. It is clearly the way to go for using
<B>Unicode</B> under Unix-style operating systems.

<P>UTF-8 has the following properties:

<UL>

<LI>UCS characters U+0000 to U+007F (ASCII) are encoded simply as
bytes 0x00 to 0x7F (ASCII compatibility). This means that files and
strings which contain only 7-bit ASCII characters have the same
encoding under both ASCII and UTF-8.

<LI>All UCS characters >U+007F are encoded as a sequence of several
bytes, each of which has the most significant bit set. Therefore, no
ASCII byte (0x00-0x7F) can appear as part of any other character.

<LI>The first byte of a multibyte sequence that represents a non-ASCII
character is always in the range 0xC0 to 0xFD and it indicates how
many bytes follow for this character. All further bytes in a multibyte
sequence are in the range 0x80 to 0xBF. This allows easy
resynchronization and makes the encoding stateless and robust against
missing bytes.

<LI>All possible 2<SUP>31</SUP> UCS codes can be encoded.

<LI>UTF-8 encoded characters may theoretically be up to six bytes
long; 16-bit BMP characters, however, are at most three bytes long.

<LI>The sorting order of Bigendian UCS-4 byte strings is preserved.

<LI>The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

</UL>

<P>The following byte sequences are used to represent a character. The
sequence to be used depends on the Unicode number of the character:

<P><DIV ALIGN=CENTER><TABLE BORDER=1>
<TR><TD>U-00000000 - U-0000007F:
<TD>0<I>xxxxxxx</I>
<TR><TD>U-00000080 - U-000007FF:
<TD>110<I>xxxxx</I> 10<I>xxxxxx</I>
<TR><TD>U-00000800 - U-0000FFFF:
<TD>1110<I>xxxx</I> 10<I>xxxxxx</I> 10<I>xxxxxx</I>
<TR><TD>U-00010000 - U-001FFFFF:
<TD>11110<I>xxx</I> 10<I>xxxxxx</I> 10<I>xxxxxx</I> 10<I>xxxxxx</I>
<TR><TD>U-00200000 - U-03FFFFFF:
<TD>111110<I>xx</I> 10<I>xxxxxx</I> 10<I>xxxxxx</I> 10<I>xxxxxx</I>
10<I>xxxxxx</I>
<TR><TD>U-04000000 - U-7FFFFFFF:
<TD>1111110<I>x</I> 10<I>xxxxxx</I> 10<I>xxxxxx</I> 10<I>xxxxxx</I>
10<I>xxxxxx</I> 10<I>xxxxxx</I>
</TABLE></DIV>

<P>The <I>xxx</I> bit positions are filled with the bits of the
character code number in binary representation. The rightmost <I>x</I>
bit is the least-significant bit. Only the shortest possible multibyte
sequence which can represent the code number of the character can be
used. Note that in multibyte sequences, the number of leading 1 bits
in the first byte is identical to the number of bytes in the entire
sequence.

<P><B>Examples:</B> The Unicode character <SAMP>U+00A9 = 1010
1001</SAMP> (copyright sign) is encoded in UTF-8 as

<PRE>
   11000010 10101001 = 0xC2 0xA9
</PRE>

<P> and character <SAMP>U+2260 = 0010 0010 0110 0000</SAMP> (not equal
to) is encoded as:

<PRE>
   11100010 10001001 10100000 = 0xE2 0x89 0xA0
</PRE>
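
<P>The bit-packing rules in the table above translate directly into
code. The following C function is a minimal encoder sketch (its name
and interface are illustrative, not part of any standard library); it
emits the shortest possible sequence for any UCS value up to
U-7FFFFFFF:

<PRE>  /* Encode UCS value c as UTF-8 into buf (up to 6 bytes).
     Returns the number of bytes written. Sketch only. */
  int utf8_encode(unsigned long c, unsigned char *buf)
  {
    if (c &lt; 0x80) {                /* 0xxxxxxx */
      buf[0] = c;
      return 1;
    } else if (c &lt; 0x800) {        /* 110xxxxx 10xxxxxx */
      buf[0] = 0xC0 | (c >> 6);
      buf[1] = 0x80 | (c & 0x3F);
      return 2;
    } else if (c &lt; 0x10000) {      /* 1110xxxx 10xxxxxx 10xxxxxx */
      buf[0] = 0xE0 | (c >> 12);
      buf[1] = 0x80 | ((c >> 6) & 0x3F);
      buf[2] = 0x80 | (c & 0x3F);
      return 3;
    } else if (c &lt; 0x200000) {     /* 4-byte sequence */
      buf[0] = 0xF0 | (c >> 18);
      buf[1] = 0x80 | ((c >> 12) & 0x3F);
      buf[2] = 0x80 | ((c >> 6) & 0x3F);
      buf[3] = 0x80 | (c & 0x3F);
      return 4;
    } else if (c &lt; 0x4000000) {    /* 5-byte sequence */
      buf[0] = 0xF8 | (c >> 24);
      buf[1] = 0x80 | ((c >> 18) & 0x3F);
      buf[2] = 0x80 | ((c >> 12) & 0x3F);
      buf[3] = 0x80 | ((c >> 6) & 0x3F);
      buf[4] = 0x80 | (c & 0x3F);
      return 5;
    } else {                       /* 6-byte sequence, up to U-7FFFFFFF */
      buf[0] = 0xFC | (c >> 30);
      buf[1] = 0x80 | ((c >> 24) & 0x3F);
      buf[2] = 0x80 | ((c >> 18) & 0x3F);
      buf[3] = 0x80 | ((c >> 12) & 0x3F);
      buf[4] = 0x80 | ((c >> 6) & 0x3F);
      buf[5] = 0x80 | (c & 0x3F);
      return 6;
    }
  }
</PRE>

<P>Calling <SAMP>utf8_encode(0x2260, buf)</SAMP> produces exactly the
three bytes 0xE2 0x89 0xA0 shown in the example above.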

<P>The official name and spelling of this encoding is UTF-8, where UTF
stands for <B>U</B>CS <B>T</B>ransformation <B>F</B>ormat. Please do
not write UTF-8 in any documentation text in other ways (such as utf8
or UTF_8), unless of course you refer to a variable name and not the
encoding itself.

<P><B>An important note for developers of UTF-8 decoding routines:</B>
For security reasons, a UTF-8 decoder <A
HREF="http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html">must
not</A> accept UTF-8 sequences that are longer than necessary to
encode a character. For example, the character U+000A (line feed) must
be accepted from a UTF-8 stream <B>only</B> in the form 0x0A, but not
in any of the following five possible overlong forms:

<PRE>
 0xC0 0x8A
 0xE0 0x80 0x8A
 0xF0 0x80 0x80 0x8A
 0xF8 0x80 0x80 0x80 0x8A
 0xFC 0x80 0x80 0x80 0x80 0x8A
</PRE>

<P>Any overlong UTF-8 sequence could be abused to bypass UTF-8
substring tests that look only for the shortest possible encoding. All
overlong UTF-8 sequences start with one of the following byte
patterns:

<P><DIV ALIGN=CENTER><TABLE BORDER=1>
<TR><TD>1100000<I>x</I> (10<I>xxxxxx</I>)
<TR><TD>11100000 100<I>xxxxx</I> (10<I>xxxxxx</I>)
<TR><TD>11110000 1000<I>xxxx</I> (10<I>xxxxxx</I> 10<I>xxxxxx</I>)
<TR><TD>11111000 10000<I>xxx</I> (10<I>xxxxxx</I> 10<I>xxxxxx</I>
10<I>xxxxxx</I>)
<TR><TD>11111100 100000<I>xx</I> (10<I>xxxxxx</I> 10<I>xxxxxx</I>
10<I>xxxxxx</I> 10<I>xxxxxx</I>)
</TABLE></DIV>

<P>Also note that the code positions U+D800 to U+DFFF (UTF-16
surrogates) as well as U+FFFE and U+FFFF must not occur in normal
UTF-8 or UCS-4 data. UTF-8 decoders should treat them like malformed
or overlong sequences for safety reasons.
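
<P>A decoder that honours these rules could look roughly like the
following C sketch (the function name and error convention are
illustrative); it rejects the bytes 0xFE/0xFF, stray continuation
bytes, overlong sequences, UTF-16 surrogates, and U+FFFE/U+FFFF:

<PRE>  /* Decode one UTF-8 character starting at s. Returns the UCS
     value and stores the number of bytes consumed in *len, or
     returns -1 on malformed or unsafe input. Sketch only. */
  long utf8_decode(const unsigned char *s, int *len)
  {
    long c;
    int i, n;

    if (s[0] &lt; 0x80)      { c = s[0];        n = 1; }
    else if (s[0] &lt; 0xC0) return -1;   /* stray continuation byte */
    else if (s[0] &lt; 0xE0) { c = s[0] & 0x1F; n = 2; }
    else if (s[0] &lt; 0xF0) { c = s[0] & 0x0F; n = 3; }
    else if (s[0] &lt; 0xF8) { c = s[0] & 0x07; n = 4; }
    else if (s[0] &lt; 0xFC) { c = s[0] & 0x03; n = 5; }
    else if (s[0] &lt; 0xFE) { c = s[0] & 0x01; n = 6; }
    else return -1;                    /* 0xFE and 0xFF never occur */

    for (i = 1; i &lt; n; i++) {
      if ((s[i] & 0xC0) != 0x80)
        return -1;                     /* missing continuation byte */
      c = (c &lt;&lt; 6) | (s[i] & 0x3F);
    }

    /* reject overlong sequences: the value must require n bytes */
    if ((n == 2 && c &lt; 0x80)    || (n == 3 && c &lt; 0x800) ||
        (n == 4 && c &lt; 0x10000) || (n == 5 && c &lt; 0x200000) ||
        (n == 6 && c &lt; 0x4000000))
      return -1;

    /* reject UTF-16 surrogates and U+FFFE/U+FFFF */
    if ((c >= 0xD800 && c &lt;= 0xDFFF) || c == 0xFFFE || c == 0xFFFF)
      return -1;

    *len = n;
    return c;
  }
</PRE>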

<P><A HREF="ucs/examples/UTF-8-test.txt">Markus Kuhn's UTF-8 decoder
stress test file</A> contains a systematic collection of malformed and
overlong UTF-8 sequences and will help you to verify the robustness of
your decoder.

<H2><A NAME="examples">Where do I find nice UTF-8 example files?</A></H2>

<P>A few interesting UTF-8 example files for tests and demonstrations
are:

<UL>

<LI><A HREF="http://www.columbia.edu/kermit/utf8.html">UTF-8
Sampler</A> web page by the Kermit project

<LI><A HREF="ucs/examples/">Markus Kuhn's example plain-text
files</A>, including among others the classic <A
HREF="ucs/examples/UTF-8-demo.txt">demo</A>, <A
HREF="ucs/examples/UTF-8-test.txt">decoder test</A>, <A
HREF="ucs/examples/TeX.txt">TeX repertoire</A>, <A
HREF="ucs/wgl4.txt">WGL4 repertoire</A>, <A HREF="eurotest/">euro test
pages</A>, and Robert Brady's <A
HREF="ucs/examples/lyrics-ipa.txt">IPA lyrics</A>.

<LI><A
HREF="http://www.macchiato.com/unicode/Unicode_transcriptions.html"
>Unicode Transcriptions</A>

</UL>

<H2><A NAME="ucsutf">What different encodings are there?</A></H2>

<P>Both the UCS and Unicode standards are first of all large tables
that assign to every character an integer number. If you use the term
"UCS", "ISO 10646", or "Unicode", this just refers to a mapping
between characters and integers. This does not yet specify how to
store these integers as a sequence of bytes in memory.

<P>ISO 10646-1 defines the UCS-2 and UCS-4 encodings. These are
sequences of 2 bytes and 4 bytes per character, respectively. ISO
10646 was from the beginning designed as a 31-bit character set (with
possible code positions ranging from U-00000000 to U-7FFFFFFF),
however, only very recently have characters been assigned beyond the
Basic Multilingual Plane (BMP), that is beyond the first
2<SUP>16</SUP> character positions (see ISO 10646-2 and <A
HREF="http://www.unicode.org/unicode/reports/tr27/">Unicode 3.1</A>).
UCS-4 can represent all UCS and Unicode characters, UCS-2 can
represent only those from the BMP (U+0000 to U+FFFF).

<P>"Unicode" originally implied that the encoding was UCS-2 and it
initially didn't make any provisions for characters outside the BMP
(U+0000 to U+FFFF). When it became clear that more than 64k characters
would be needed for certain special applications (historic alphabets
and ideographs, mathematical and musical typesetting, etc.), Unicode
was turned into a sort of 21-bit character set with possible code
points in the range U-00000000 to U-0010FFFF. The 2&#215;1024
surrogate characters (U+D800 to U+DFFF) were introduced into the BMP
to allow 1024&#215;1024 non-BMP characters to be represented as a
sequence of two 16-bit surrogate characters. This way <A
HREF="ucs/ISO-10646-UTF-16.html">UTF-16</A> was born, which represents
the extended "21-bit" Unicode in a way backwards compatible with
UCS-2. The term <A
HREF="http://www.unicode.org/unicode/reports/tr19/">UTF-32</A> was
introduced in Unicode to mean a 4-byte encoding of the extended
"21-bit" Unicode. UTF-32 is the exact same thing as UCS-4, except that
by definition UTF-32 is never used to represent characters above
U-0010FFFF, while UCS-4 can cover all 2<SUP>31</SUP> code positions up
to U-7FFFFFFF.
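
<P>The arithmetic behind surrogate pairs is straightforward, as the
following C sketch shows (the function name is illustrative):

<PRE>  /* Split a character in the range U-00010000 to U-0010FFFF
     into its UTF-16 surrogate pair. Sketch only. */
  void utf16_surrogates(unsigned long c,
                        unsigned int *hi, unsigned int *lo)
  {
    c -= 0x10000;               /* now a 20-bit value */
    *hi = 0xD800 | (c >> 10);   /* high surrogate: upper 10 bits */
    *lo = 0xDC00 | (c & 0x3FF); /* low surrogate: lower 10 bits */
  }
</PRE>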

<P>In addition to all that, <A HREF="#utf-8">UTF-8</A> was introduced
to provide an ASCII backwards compatible multi-byte encoding. The
definitions of UTF-8 in UCS and Unicode actually differ slightly,
because in UCS, up to 6-byte long UTF-8 sequences are possible to
represent characters up to U-7FFFFFFF, while in Unicode only up to
4-byte long UTF-8 sequences are defined to represent characters up to
U-0010FFFF. The difference is in essence the same as between UCS-4 and
UTF-32, except that no two different names have been introduced for
UTF-8 covering the UCS and Unicode ranges.

<P>No endianness is implied by UCS-2, UCS-4, UTF-16, and UTF-32, though
ISO 10646-1 says that Bigendian should be preferred unless otherwise
agreed. It has become customary to append the letters "BE" (Bigendian,
high-byte first) and "LE" (Littleendian, low-byte first) to the
encoding names in order to explicitly specify a byte order.

<P>In order to allow the automatic detection of the byte order, it has
become customary on some platforms (notably Win32) to start every
Unicode file with the character U+FEFF (ZERO WIDTH NO-BREAK SPACE),
also known as the Byte-Order Mark (BOM). Its byte-swapped equivalent
U+FFFE is not a valid Unicode character, therefore it helps to
unambiguously distinguish the Bigendian and Littleendian variants of
UTF-16 and UTF-32.
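
<P>For an encoding converter, BOM-based detection amounts to a few
byte comparisons at the start of the input, as in this sketch (note
that the UTF-32LE test must come before the UTF-16LE test, because a
UTF-32LE signature begins with a UTF-16LE one):

<PRE>  #include &lt;stddef.h>

  /* Guess the Unicode encoding variant of a buffer from a
     leading BOM, if any. Sketch only. */
  const char *detect_bom(const unsigned char *p, size_t n)
  {
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
      return "UTF-8";
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 &&
        p[2] == 0xFE && p[3] == 0xFF)
      return "UTF-32BE";
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE &&
        p[2] == 0x00 && p[3] == 0x00)
      return "UTF-32LE";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
      return "UTF-16BE";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
      return "UTF-16LE";
    return "unknown";              /* no BOM present */
  }
</PRE>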

<P>A full-featured character encoding converter will have to provide
the following 13 encoding variants of Unicode and UCS:

<BLOCKQUOTE>
UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE, UCS-4BE, UTF-8, UTF-16,
UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE
</BLOCKQUOTE>

<P>Where no byte order is explicitly specified, use the byte order of
the CPU on which the conversion takes place and in an input stream
swap the byte order whenever U+FFFE is encountered. The difference
between outputting UCS-4 versus UTF-32 and UTF-16 versus UCS-2 lies in
handling out-of-range characters. The fallback mechanism for
non-representable characters has to be activated in UTF-32 (for
characters > U-0010FFFF) or UCS-2 (for characters > U+FFFF) even where
UCS-4 or UTF-16 respectively would offer a representation.

<P>Really just of historic interest are <A
HREF="http://www.itscj.ipsj.or.jp/ISO-IR/178.pdf">UTF-1</A>, <A
HREF="ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2152.txt">UTF-7</A>,
<A HREF="http://www.unicode.org/unicode/reports/tr6/">SCSU</A> and a
dozen other less widely publicised UCS encoding proposals with various
properties, none of which ever enjoyed any significant use. Their use
should be avoided.

<P>A good encoding converter will also offer options for adding or
removing the BOM:

<UL>

<LI>Unconditionally prefix the output text with U+FEFF.

<LI>Prefix the output text with U+FEFF unless it is already there.

<LI>Remove the first character if it is U+FEFF.

</UL>

<P>It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB
0xBF) as a signature to mark the beginning of a UTF-8 file. This
practice should definitely <STRONG>not</STRONG> be used on POSIX
systems for several reasons:

<UL>

<LI>On POSIX systems, the locale, and not magic file type codes, defines
the encoding of plain text files. Mixing the two concepts would add a
lot of complexity and break existing functionality.

<LI>Adding a UTF-8 signature at the start of a file would interfere
with many established conventions such as the kernel looking for "#!"
at the beginning of a plaintext executable to locate the appropriate
interpreter.

<LI>Handling BOMs properly would add undesirable complexity even to
simple programs like <SAMP>cat</SAMP> or <SAMP>grep</SAMP> that mix
contents of several files into one.

</UL>

<P>In addition to the encoding alternatives, Unicode also specifies
various <A
HREF="http://www.unicode.org/unicode/reports/tr15/">Normalization
Forms</A>, which provide reasonable subsets of Unicode, especially to
remove encoding ambiguities caused by the presence of precomposed and
compatibility characters:

<UL>

<LI><B>Normalization Form D (NFD):</B> Split up (decompose)
precomposed characters into combining sequences where possible,
e.g. use U+0041 U+0308 (LATIN CAPITAL LETTER A, COMBINING DIAERESIS)
instead of U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS). Also avoid
deprecated characters, e.g. use U+0041 U+030A (LATIN CAPITAL LETTER A,
COMBINING RING ABOVE) instead of U+212B (ANGSTROM SIGN).

<LI><B>Normalization Form C (NFC):</B> Use precomposed characters
instead of combining sequences where possible, e.g. use U+00C4 ("Latin
capital letter A with diaeresis") instead of U+0041 U+0308 ("Latin
capital letter A", "combining diaeresis"). Also avoid deprecated
characters, e.g. use U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE)
instead of U+212B (ANGSTROM SIGN).<BR><EM>NFC is the preferred form for
Linux and WWW.</EM>

<LI><B>Normalization Form KD (NFKD):</B> Like NFD, but avoid in
addition the use of compatibility characters, e.g. use "fi" instead of
U+FB01 (LATIN SMALL LIGATURE FI).

<LI><B>Normalization Form KC (NFKC):</B> Like NFC, but avoid in
addition the use of compatibility characters, e.g. use "fi" instead of
U+FB01 (LATIN SMALL LIGATURE FI).

</UL>

<P>A full-featured character encoding converter should also offer
conversion between normalization forms. Care should be used with
mapping to NFKD or NFKC, as semantic information might be lost (for
instance U+00B2 (SUPERSCRIPT TWO) maps to 2) and extra mark-up
information might have to be added to preserve it (e.g.,
<SAMP>&lt;SUP>2&lt;/SUP></SAMP> in HTML).

<H2><A NAME="lang">What programming languages do support Unicode?</A></H2>

<P>More recent programming languages, developed after around 1993,
already have special data types for Unicode/ISO 10646-1
characters. This is the case with Ada95, Java, TCL, Perl, Python, C#
and others.

<P>ISO C 90 specifies mechanisms to handle multi-byte encoding and
wide characters. These facilities were improved with <A
HREF="http://www.lysator.liu.se/c/na1.html">Amendment 1 to ISO C
90</A> in 1994 and even further improvements were made in the new <A
HREF="volatile/ISO-C-FDIS.1999-04.txt">ISO C 99</A> standard. These
facilities were designed originally with various East-Asian encodings
in mind. On the one hand, they are slightly more sophisticated than
would be necessary to handle UCS (handling of "shift sequences"), but
on the other hand lack support for more advanced aspects of UCS (combining
characters, etc.). UTF-8 is an example encoding for what the ISO C
standard calls a multi-byte encoding and the type <VAR>wchar_t</VAR>,
which is in modern environments usually a signed 32-bit integer, can
be used to hold Unicode characters.

<P>Unfortunately, <VAR>wchar_t</VAR> was already widely used for
various Asian 16-bit encodings throughout the 1990s, therefore the ISO
C 99 standard could for backwards compatibility not be changed any
more to require <VAR>wchar_t</VAR> to be used with UCS, like Java and
Ada95 managed to do. However, the C compiler can at least signal to an
application that <VAR>wchar_t</VAR> is guaranteed to hold UCS values
in all locales by defining the macro <SAMP>__STDC_ISO_10646__</SAMP>
to be an integer constant of the form <VAR>yyyymm</VAR>L (for example,
200009L for ISO/IEC 10646-1:2000; the year and month refer to the
version of ISO/IEC 10646 and its amendments that have been
implemented).
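
<P>A program can test this macro at compile time, as in this minimal
sketch:

<PRE>  #include &lt;stdio.h>

  int main()
  {
  #ifdef __STDC_ISO_10646__
    printf("wchar_t holds UCS values (version %ld)\n",
           (long) __STDC_ISO_10646__);
  #else
    printf("wchar_t encoding is implementation-defined\n");
  #endif
    return 0;
  }
</PRE>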

<H2><A NAME="linux">How should Unicode be used under Linux?</A></H2>

<P>Before UTF-8 emerged, Linux users all over the world had to use
various different language-specific extensions of ASCII. Most popular
were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8
/ ISO 8859-5 / CP1251 in Russia, EUC and Shift-JIS in Japan, etc. This
made the exchange of files difficult and application software had to
worry about various small differences between these encodings. Support
for these encodings was usually incomplete, untested, and
unsatisfactory, because the application developers rarely used all
these encodings themselves.

<P>Because of these difficulties, the major Linux distributors and
application developers now foresee and hope that Unicode will
eventually replace all these older legacy encodings, primarily in the
UTF-8 form. UTF-8 will be used in

<UL>
<LI>text files (source code, HTML files, email messages, etc.)
<LI>file names
<LI>standard input and standard output, pipes
<LI>environment variables
<LI>cut and paste selection buffers
<LI>telnet, modem, and serial port connections to terminal emulators
<LI>and in any other places where byte sequences used to be interpreted
in ASCII
</UL>

<P>In UTF-8 mode, terminal emulators such as xterm or the Linux
console driver transform every keystroke into the corresponding UTF-8
sequence and send it to the stdin of the foreground process.
Similarly, any output of a process on stdout is sent to the terminal
emulator, where it is processed with a UTF-8 decoder and then
displayed using a 16-bit font.

<P>Full Unicode functionality with all bells and whistles (e.g.
high-quality typesetting of the Arabic and Indic scripts) can only be
expected from sophisticated multi-lingual word-processing packages.
What Linux will use on a broad base to replace ASCII and the other
8-bit character sets is far simpler. Linux terminal emulators and
command line tools will in the first step only switch to UTF-8. This
means that only a Level 1 implementation of ISO 10646-1 is used (no
combining characters), and only scripts that need no further
processing support are covered, such as Latin, Greek, Cyrillic,
Armenian, Georgian, CJK, and many scientific symbols. At this level,
UCS support is very comparable to ISO 8859 support and the only
significant difference is that we now have thousands of different
characters available, that characters can be represented by multibyte
sequences, and that ideographic Chinese/Japanese/Korean characters
require two terminal character positions (double-width).

<P>Combining characters might also be supported under Linux eventually
(there is even some experimental terminal emulator support available
today), but even then the precomposed characters should be preferred
over combining character sequences where available. More formally, the
preferred way of encoding text in Unicode under Linux should be
<EM>Normalization Form C</EM> as defined in <A
HREF="http://www.unicode.org/unicode/reports/tr15/">Unicode Technical
Report #15</A>.

<P>One influential non-POSIX PC operating system vendor (whom we shall
leave unnamed here) suggested that all Unicode files should start with
the character ZERO WIDTH NO-BREAK SPACE (U+FEFF), which in this role is
also referred to as the "signature" or "byte-order mark (BOM)", in
order to identify the encoding and byte-order used in a file.
Linux/Unix does <STRONG>not</STRONG> use any BOMs and signatures. They
would break far too many existing ASCII-file syntax conventions. On
POSIX systems, the selected locale already identifies the encoding
expected in all input and output files of a process. It has also been
suggested to call UTF-8 files without a signature "UTF-8N" files, but
this non-standard term is usually not used in the POSIX world.

<P>Before you start using UTF-8 under Linux, update your installation
to use glibc 2.2 and XFree86 4.0.3 or newer. This is the case for
example starting with the SuSE 7.1 and Red Hat 7.1 distributions.
Earlier Linux distributions lack UTF-8 locale support and ISO 10646-1
X11 fonts.

<H2><A NAME="mod">How do I have to modify my software?</A></H2>

<P>If you are a developer, there are two approaches to add UTF-8
support, which I will call soft and hard conversion. In soft
conversion, data is kept in its UTF-8 form everywhere and only very
few software changes are necessary. In hard conversion, UTF-8 data
that the program reads will be converted into wide-character arrays
using standard C library functions and will be handled as such
everywhere inside the application. Strings will only be converted back
to UTF-8 at output time.

<P>Most applications can do just fine with only soft conversion. This
is what makes the introduction of UTF-8 on Unix feasible at all. For
example, programs such as <SAMP>cat</SAMP> and <SAMP>echo</SAMP> do
not have to be modified at all. They can remain completely ignorant as
to whether their input and output is ISO 8859-2 or UTF-8, because they
handle just byte streams without processing them. They only recognize
ASCII characters and control codes such as <SAMP>'\n'</SAMP> which do
not change in any way under UTF-8. Therefore the UTF-8 encoding and
decoding is done for these applications completely in the terminal
emulator.

<P>A small modification will be necessary for all programs that
determine the number of characters in a string by counting the bytes.
In UTF-8 mode, they must not count any bytes in the range 0x80 - 0xBF,
because these are just continuation bytes and not characters of their
own. C's <SAMP>strlen(s)</SAMP> counts the number of bytes, but not
necessarily the number of characters in a string correctly. Instead,
<SAMP>mbstowcs(NULL,s,0)</SAMP> can be used to count characters if a
UTF-8 locale has been selected.

<P>The <SAMP>strlen</SAMP> function does not have to be replaced where
the result is used as a byte count, for example to allocate a suitably
sized buffer for a string. The second most common use of
<SAMP>strlen</SAMP> is to predict how many columns the cursor of the
terminal will advance if a string is printed out. With UTF-8, a
character count will also not be satisfactory to predict column width,
because ideographic characters (Chinese, Japanese, Korean) will occupy
two column positions. To determine the width of a string on the
terminal screen, it is necessary to decode the UTF-8 sequence and then
use the <SAMP>wcwidth</SAMP> function to test the display width of each
character.
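
<P>Here is a minimal sketch that demonstrates both techniques; it
assumes that a UTF-8 locale is installed and that
<SAMP>wcswidth()</SAMP> is available (as on XSI-conforming systems):

<PRE>  #include &lt;stdio.h>
  #include &lt;stdlib.h>
  #include &lt;string.h>
  #include &lt;locale.h>
  #include &lt;wchar.h>

  int main()
  {
    const char *s = "caf\xc3\xa9";   /* "café" in UTF-8 */
    wchar_t wcs[64];

    setlocale(LC_CTYPE, "");         /* e.g., with LANG=en_GB.UTF-8 */

    /* strlen counts bytes, mbstowcs(NULL,s,0) counts characters */
    printf("bytes: %lu, characters: %lu\n",
           (unsigned long) strlen(s),
           (unsigned long) mbstowcs(NULL, s, 0));

    /* decode first, then ask for the terminal column width */
    if (mbstowcs(wcs, s, 64) != (size_t) -1)
      printf("columns: %d\n", wcswidth(wcs, 64));

    return 0;
  }
</PRE>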

<P>For instance, the <SAMP>ls</SAMP> program had to be modified,
because it has to know the column widths of filenames to format the
table layout in which the directories are presented to the user.
Similarly, all programs that assume somehow that the output is
presented in a fixed-width font and format it accordingly have to
learn how to count columns in UTF-8 text. Editor functions such as
deleting a single character have to be slightly modified to delete all
bytes that might belong to one character. Affected are for instance
editors (<SAMP>vi</SAMP>, <SAMP>emacs</SAMP>, <SAMP>readline</SAMP>,
etc.) as well as programs that use the <SAMP>ncurses</SAMP> library.

<P>Any Unix-style kernel can do fine with soft conversion and needs
only very minor modifications to fully support UTF-8. Most kernel
functions that handle strings (e.g. file names, environment variables,
etc.) are not affected at all by the encoding. Modifications might be
necessary in the following places:

<UL>

<LI>The console display and keyboard driver (another VT100 emulator)
has to encode and decode UTF-8 and should support at least some subset
of the Unicode character set. This has been available in Linux
since kernel 1.2 (send ESC %G to the console to activate UTF-8 mode).

<LI>External file system drivers such as VFAT and WinNT have to
convert file name character encodings. UTF-8 has to be added to the
list of already available conversion options, and the
<SAMP>mount</SAMP> command has to tell the kernel driver that user
processes shall see UTF-8 file names. Since VFAT and WinNT already use
Unicode anyway, UTF-8 has the advantage of guaranteeing a lossless
conversion here.

<LI>The tty driver of any POSIX system supports a "cooked" mode, in
which some primitive line editing functionality is available. In order
to allow the character erase function to work properly,
<SAMP>stty</SAMP> has to set a UTF-8 mode in the tty driver such that
it does not count continuation bytes in the range 0x80-0xBF as
characters. There exist some <A
HREF="ftp://ftp.ilog.fr/pub/Users/haible/utf8/">Linux patches</A> for
<SAMP>stty</SAMP> and the kernel tty driver from Bruno Haible.

</UL>

<H2><A NAME="c">C support for Unicode and UTF-8</A></H2>

<P>Starting with GNU glibc 2.2, the type <SAMP>wchar_t</SAMP> is
officially intended to be used only for 32-bit ISO 10646 values,
independent of the currently used locale. This is signalled to
applications by the definition of the <SAMP>__STDC_ISO_10646__</SAMP>
macro as required by ISO C99. The ISO C multi-byte conversion
functions (<SAMP>mbsrtowcs()</SAMP>, <SAMP>wcsrtombs()</SAMP>, etc.)
are fully implemented in glibc 2.2 or higher and can be used to
convert between <SAMP>wchar_t</SAMP> and any locale-dependent
multibyte encoding, including UTF-8, ISO 8859-1, etc.

<P>For example, you can write

<PRE>  #include &lt;stdio.h>
 #include &lt;locale.h>

 int main()
 {
   if (!setlocale(LC_CTYPE, "")) {
     fprintf(stderr, "Can't set the specified locale! "
             "Check LANG, LC_CTYPE, LC_ALL.\n");
     return 1;
   }
   printf("%ls\n", L"Sch�ne Gr��e");
   return 0;
 }
</PRE>

<P>Call this program with the locale setting <SAMP>LANG=de_DE</SAMP>
and the output will be in ISO 8859-1. Call it with
<SAMP>LANG=de_DE.UTF-8</SAMP> and the output will be in UTF-8. The
<SAMP>%ls</SAMP> format specifier in <SAMP>printf</SAMP> calls
<SAMP>wcsrtombs</SAMP> in order to convert the wide character argument
string into the locale-dependent multi-byte encoding.

<H2><A NAME="activate">How should the UTF-8 mode be activated?</A></H2>

<P>If your application is soft converted and does not use the standard
locale-dependent C multibyte routines (<SAMP>mbsrtowcs()</SAMP>,
<SAMP>wcsrtombs()</SAMP>, etc.) to convert everything into
<SAMP>wchar_t</SAMP> for processing, then it might have to find out in
some way whether it is supposed to assume that the text data it
handles is in some 8-bit encoding (like ISO 8859-1, where 1 byte = 1
character) or UTF-8. Hopefully, in a few years everyone will only be
using UTF-8 and you can just make it the default, but until then both
the classical 8-bit sets and UTF-8 will have to be supported.

<P>The first wave of applications with UTF-8 support used a whole lot
of different command line switches to activate their respective UTF-8
modes, for instance the famous <SAMP>xterm -u8</SAMP>. That turned out
to be a very bad idea. Having to remember a special command line
option or other configuration mechanism for <EM>every</EM> application
is very tedious, which is why command line options are
<STRONG>not</STRONG> the proper way of activating a UTF-8 mode.

<P>The proper way to activate UTF-8 is the POSIX locale mechanism. A
locale is a configuration setting that contains information about
culture-specific conventions of software behaviour, including the
character encoding, the date/time notation, alphabetic sorting rules,
the measurement system and common office paper size, etc. The names of
locales usually consist of <A
HREF="http://lcweb.loc.gov/standards/iso639-2/iso639jac.html">ISO
639-1</A> language and <A HREF=
"http://www.din.de/gremien/nas/nabd/iso3166ma/codlstp1/en_listp1.html">ISO
3166-1</A> country codes, sometimes with additional encoding names or
other qualifiers.

<P>You can get a list of all locales installed on your system (usually
in <SAMP>/usr/lib/locale/</SAMP>) with the command <SAMP>locale
-a</SAMP>. Set the environment variable <SAMP>LANG</SAMP> to the name
of your preferred locale. When a C program executes the
<SAMP>setlocale(LC_CTYPE, "")</SAMP> function, the library will test
the environment variables <SAMP>LC_ALL</SAMP>, <SAMP>LC_CTYPE</SAMP>,
and <SAMP>LANG</SAMP> in that order, and the first one of these that
has a value will determine which locale data is loaded for the
<SAMP>LC_CTYPE</SAMP> category (which controls the multibyte
conversion functions). The locale data is split up into separate
categories. For example, <SAMP>LC_CTYPE</SAMP> defines the character
encoding and <SAMP>LC_COLLATE</SAMP> defines the string sorting order.
The <SAMP>LANG</SAMP> environment variable is used to set the default
locale for all categories, but the <SAMP>LC_*</SAMP> variables can be
used to override individual categories. Don't worry too much about the
country identifiers in the locales. Locales such as <SAMP>en_GB</SAMP>
(English in Great Britain) and <SAMP>en_AU</SAMP> (English in
Australia) differ usually only in the <SAMP>LC_MONETARY</SAMP>
category (name of currency, rules for printing monetary amounts),
which practically no Linux application ever uses.
<SAMP>LC_CTYPE=en_GB</SAMP> and <SAMP>LC_CTYPE=en_AU</SAMP> have
exactly the same effect.

<P>You can query the name of the character encoding in your current
locale with the command <SAMP>locale charmap</SAMP>. This should say
<SAMP>UTF-8</SAMP> if you successfully picked a UTF-8 locale in the
LC_CTYPE category. The command <SAMP>locale -m</SAMP> provides a list
with the names of all installed character encodings.

<P>If you use exclusively C library multibyte functions to do all the
conversion between the external character encoding and the
<SAMP>wchar_t</SAMP> encoding that you use internally, then the C
library will take care of using the right encoding according to
<SAMP>LC_CTYPE</SAMP> for you and your program does not even have to
know explicitly what the current multibyte encoding is.

<P>However, if you prefer not to do everything using the libc
multi-byte functions (e.g., because you think this would require too
many changes in your software or is not efficient enough), then your
application has to find out for itself when to activate the UTF-8
mode. To do this, on any X/Open compliant systems, where <A HREF=
"http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html"
><SAMP>&lt;langinfo.h></SAMP></A> is available, you can use a line
such as

<PRE>  utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);</PRE>

<P>in order to detect whether the current locale uses the UTF-8
encoding. You of course have to add a <SAMP>setlocale(LC_CTYPE,
"")</SAMP> at the beginning of your application to set the locale
according to the environment variables first. The standard function
call <SAMP>nl_langinfo(CODESET)</SAMP> is also what <SAMP>locale
charmap</SAMP> calls to find the name of the encoding specified by the
current locale for you. It is available on pretty much every modern
Unix, except for FreeBSD, which unfortunately still has quite abysmal
locale support. If you need an autoconf test for the availability of
<SAMP>nl_langinfo(CODESET)</SAMP>, here is the one Bruno Haible
suggested:

<PRE>======================== m4/codeset.m4 ================================
#serial AM1

dnl From Bruno Haible.

AC_DEFUN([AM_LANGINFO_CODESET],
[
 AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
   [AC_TRY_LINK([#include &lt;langinfo.h>],
     [char* cs = nl_langinfo(CODESET);],
     am_cv_langinfo_codeset=yes,
     am_cv_langinfo_codeset=no)
   ])
 if test $am_cv_langinfo_codeset = yes; then
   AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
     [Define if you have &lt;langinfo.h> and nl_langinfo(CODESET).])
 fi
])
=======================================================================
</PRE>

<P>[You could also try to query the locale environment variables
yourself without using <SAMP>setlocale()</SAMP>. In the sequence
<SAMP>LC_ALL</SAMP>, <SAMP>LC_CTYPE</SAMP>, <SAMP>LANG</SAMP>, look
for the first of these environment variables that has a value. Make
the UTF-8 mode the default (still overridable by command line
switches) when this value contains the substring <SAMP>UTF-8</SAMP>,
as this indicates reasonably reliably that the C library has been
asked to use a UTF-8 locale. An example code fragment that does this
is

<PRE>  /* needs &lt;stdlib.h> for getenv() and &lt;string.h> for strstr() */
  char *s;
 int utf8_mode = 0;

 if (((s = getenv("LC_ALL"))   && *s) ||
     ((s = getenv("LC_CTYPE")) && *s) ||
     ((s = getenv("LANG"))     && *s)) {
   if (strstr(s, "UTF-8"))
     utf8_mode = 1;
 }

</PRE>

<P>This relies of course on all UTF-8 locales having the name of the
encoding in their name, which is not always the case, therefore the
<SAMP>nl_langinfo()</SAMP> query is clearly the better method. If you
are concerned that calling <SAMP>nl_langinfo()</SAMP> might not
be portable enough (e.g., FreeBSD still doesn't have it), then use <A
HREF="http://www.gnu.org/software/libiconv/">libcharset</A>, which is
a portable library for determining the current locale's character
encoding. That's also what several of the GNU packages use. There is
also a portable public domain <A
HREF="ucs/langinfo.c"><SAMP>nl_langinfo(CODESET)</SAMP> emulator</A>
for systems that don't have the real thing, and you can use the <A
HREF="ucs/norm_charmap.c"><SAMP>norm_charmap()</SAMP></A> function to
standardize the output of the <SAMP>nl_langinfo(CODESET)</SAMP> on
different platforms.]

<H2><A NAME="getxterm">How do I get a UTF-8 version of xterm?</A></H2>

<P>The <A HREF="http://dickey.his.com/xterm/xterm.html">xterm</A>
version that comes with <A HREF="http://www.xfree86.org/">XFree86</A>
4.0 or higher (maintained by <A HREF="http://dickey.his.com/">Thomas
Dickey</A>) already includes UTF-8 support. To activate it, start
xterm in a UTF-8 locale and use a font with <SAMP>iso10646-1</SAMP>
encoding, for instance with

<PRE>  LC_CTYPE=en_GB.UTF-8 xterm \
   -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
</PRE>

<P>and then cat some example file, such as <A
HREF="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/">UTF-8-demo.txt</A>
in the newly started xterm and enjoy what you see.

<P>If you are not using XFree86 4.0 or newer, then you can
alternatively download the <A
HREF="ftp://dickey.his.com/xterm/xterm.tar.gz">latest xterm
development version</A> separately and compile it yourself with
"<SAMP>./configure --enable-wide-chars ; make</SAMP>" or alternatively
with "<SAMP>xmkmf; make Makefiles; make; make install; make
install.man</SAMP>".

<P>If you do not have UTF-8 locale support available, use command line
option <SAMP>-u8</SAMP> when you invoke xterm to switch input and
output to UTF-8.

<H2><A NAME="xterm">How much of Unicode does xterm support?</A></H2>

<P>Xterm in XFree86 4.0.1 only supported Level 1 (no combining
characters) of ISO 10646-1 with a fixed character width and
left-to-right writing direction. In other words, the terminal
semantics were basically the same as for ISO 8859-1, except that the
terminal could now decode UTF-8 and access 16-bit characters.

<P>With XFree86 4.0.3, two important functions were added:

<UL>

<LI>automatic switching to a double-width font for CJK ideographs
<LI>simple overstriking combining characters

</UL>

<P>If the selected normal font is <VAR>X</VAR>&#215;<VAR>Y</VAR> pixels
large, then xterm will now also attempt to load a
<VAR>2X</VAR>&#215;<VAR>Y</VAR> pixel font (same XLFD, except
for a doubled value of the <SAMP>AVERAGE_WIDTH</SAMP> property). It
will use this font to represent all Unicode characters that have been
assigned the <EM>East Asian Wide (W)</EM> or <EM>East Asian FullWidth
(F)</EM> property in <A HREF=
"http://www.unicode.org/unicode/reports/tr11/">Unicode Technical
Report #11</A>.

<P>The following fonts coming with XFree86 4.x are suitable for
display of Japanese and Korean Unicode text with terminal emulators
and editors:

<PRE>
 6x13    -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
 6x13B   -Misc-Fixed-Bold-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
 6x13O   -Misc-Fixed-Medium-O-SemiCondensed--13-120-75-75-C-60-ISO10646-1
 12x13ja -Misc-Fixed-Medium-R-Normal-ja-13-120-75-75-C-120-ISO10646-1

 9x18    -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
 9x18B   -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1
 18x18ja -Misc-Fixed-Medium-R-Normal-ja-18-120-100-100-C-180-ISO10646-1
 18x18ko -Misc-Fixed-Medium-R-Normal-ko-18-120-100-100-C-180-ISO10646-1
</PRE>

<P>Some simple support for nonspacing or enclosing combining
characters (i.e., those with <A HREF=
"ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html#General Category"
>general category code</A> Mn or Me in the <A HREF=
"ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt">Unicode
database</A>) is now also available, which is implemented by just
overstriking (logical OR-ing) a base-character glyph with up to two
combining-character glyphs. This produces acceptable results for
accents below the base line and accents on top of small characters. It
also works well for example for Thai fonts that were specifically
designed for use with overstriking. However, the results might not be
fully satisfactory for combining accents on top of tall characters in
some fonts, especially with the fonts of the "fixed" family, therefore
precomposed characters will continue to be preferable where available.

<P>The following fonts coming with XFree86 4.x are suitable for
display of Latin etc. combining characters (extra head-space); other
fonts will only look nice with combining accents on small x-high
characters:

<PRE>
 6x12    -Misc-Fixed-Medium-R-Semicondensed--12-110-75-75-C-60-ISO10646-1
 9x18    -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
 9x18B   -Misc-Fixed-Bold-R-Normal--18-120-100-100-C-90-ISO10646-1
</PRE>

<P>The following fonts coming with XFree86 4.x are suitable for
display of Thai combining characters:

<PRE>
 6x13    -Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1
 9x15    -Misc-Fixed-Medium-R-Normal--15-140-75-75-C-90-ISO10646-1
 9x15B   -Misc-Fixed-Bold-R-Normal--15-140-75-75-C-90-ISO10646-1
 10x20   -Misc-Fixed-Medium-R-Normal--20-200-75-75-C-100-ISO10646-1
 9x18    -Misc-Fixed-Medium-R-Normal--18-120-100-100-C-90-ISO10646-1
</PRE>

<P><B>A note for programmers of text mode applications:</B>

<P>With support for CJK ideographs and combining characters, the
output of xterm behaves a little bit like that of a proportional
font, because a Latin/Greek/Cyrillic/etc. character requires one
column position, a CJK ideograph two, and a combining character zero.

<P>The Open Group's <A
HREF="http://www.UNIX-systems.org/online.html">Single UNIX
Specification</A> specifies the two C functions <A
HREF="ucs/wcwidth.html">wcwidth()</A> and <A
HREF="ucs/wcswidth.html">wcswidth()</A> that allow an application to
test how many column positions a character will occupy:

<PRE>  #include &lt;wchar.h>
 int wcwidth(wchar_t wc);
 int wcswidth(const wchar_t *pwcs, size_t n);
</PRE>

<P><A HREF="ucs/wcwidth.c">Markus Kuhn's free wcwidth()
implementation</A> can be used by applications on platforms where the C
library does not yet provide a suitable function.

<P>Xterm will for the foreseeable future probably not support the
following functionality, which you might expect from a more
sophisticated full Unicode rendering engine:

<UL>

<LI>bidirectional output of Hebrew and Arabic characters
<LI>substitution of
<A HREF="http://www.unicode.org/unicode/uni2book/ch08.pdf">Arabic</A>
presentation forms
<LI>substitution of
<A HREF="http://www.unicode.org/unicode/uni2book/ch09.pdf">Indic</A>/Syriac
ligatures
<LI>Hangul Jamo
<LI>arbitrary stacks of combining characters

</UL>

<P>Hebrew and Arabic users will therefore have to use application
programs that reverse and left-pad Hebrew and Arabic strings before
sending them to the terminal. In other words, the bidirectional
processing has to be done by the application and not by xterm. The
situation for Hebrew and Arabic improves over ISO 8859 at least in the
form of the availability of precomposed glyphs and presentation forms.
It is far from clear at the moment whether bidirectional support
should really go into xterm and how precisely this should work. Both
<A HREF= "http://www.ecma.ch/ecma1/STAND/ECMA-048.HTM">ISO 6429 =
ECMA-48</A> and the <A
HREF="http://www.unicode.org/unicode/reports/tr9/">Unicode bidi
algorithm</A> provide alternative starting points. See also <A
HREF="http://www.ecma.ch/ecma1/TECHREP/E-TR-053.HTM">ECMA Technical
Report TR/53</A>.

<P>If you plan to support bidirectional text output in your
application, have a look at either Dov Grobgeld's <A HREF=
"http://fribidi.sourceforge.net/">FriBidi</A> or Mark Leisher's <A
HREF= "http://crl.nmsu.edu/~mleisher/ucdata.html">Pretty Good Bidi
Algorithm</A>, two free implementations of the Unicode bidi algorithm.

<P>Xterm currently does not support the Arabic, Syriac, Hangul Jamo,
or Indic text formatting algorithms, although Robert Brady has
published some <A
HREF="http://www.zepler.org/~rwb197/xterm/">experimental patches</A>
towards bidi support. It is still unclear whether it is feasible or
preferable to do this in a VT100 emulator at all. Applications can
apply the Arabic and Hangul formatting algorithms themselves easily,
because xterm allows them to output all the necessary presentation
forms. For Indic scripts, the X font mechanism at the moment does not
even support the encoding of the necessary ligature variants, so there
is little xterm could offer anyway. Applications requiring Indic or
Syriac output are better off using a proper Unicode X11 rendering library
such as <A HREF= "http://www.pango.org/">Pango</A> instead of a VT100
emulator like xterm.

<H2><A NAME="fonts">Where do I find ISO 10646-1 X11 fonts?</A></H2>

<P>Quite a number of Unicode fonts have become available for X11 over
the past few months, and the list is growing quickly:

<UL>

<LI>Markus Kuhn together with a number of other volunteers has
extended the old <SAMP>-misc-fixed-*-iso8859-1</SAMP> fonts that come
with X11 towards a repertoire that covers all European characters
(Latin, Greek, Cyrillic, intl. phonetic alphabet, mathematical and
technical symbols, in some fonts even Armenian, Georgian, Katakana,
Thai, and more). For more information see the <A
HREF="ucs-fonts.html">Unicode fonts and tools for X11</A> page. These
fonts are now also distributed with <A
HREF="http://www.xfree86.org/">XFree86</A> 4.0.1 or higher.

<LI>Markus has also prepared <A HREF=
"http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts-75dpi100dpi.tar.gz">ISO
10646-1 versions of all the Adobe and B&amp;H BDF fonts in the X11R6.4
distribution</A>. These fonts already contained the full Postscript
font repertoire (around 30 additional characters, mostly those also
used by CP1252 MS-Windows, e.g. smart quotes, dashes, etc.), which
were, however, not available under the ISO 8859-1 encoding. They are
now
all accessible in the ISO 10646-1 version, along with many additional
precomposed characters covering ISO 8859-1,2,3,4,9,10,13,14,15. These
fonts are now also distributed with <A
HREF="http://www.xfree86.org/">XFree86</A> 4.1 or higher.

<LI>XFree86 4.0 comes with an <A
HREF="http://www.dcs.ed.ac.uk/home/jec/programs/xfsft/">integrated
TrueType font engine</A> that can make available any Apple/Microsoft
font to your X application in the ISO 10646-1 encoding.

<LI>Some future XFree86 release might also remove most old BDF fonts
from the distribution and replace them with ISO 10646-1 encoded
versions. The X server will be extended with an automatic encoding
converter that creates other font encodings such as ISO 8859-* from
the ISO 10646-1 font file on-the-fly when such a font is requested by
old 8-bit software. Modern software should preferably use the ISO
10646-1 font encoding directly.

<LI><A HREF="ftp://crl.nmsu.edu/CLR/multiling/unicode/fonts/">ClearlyU
(cu12)</A> is a 12 point, 100 dpi proportional ISO 10646-1 BDF font
for X11 with over 3700 characters by <A
HREF="mailto:[email protected]">Mark Leisher</A> (<A
HREF="http://crl.nmsu.edu/~mleisher/cu/cu-examples.html">example
images</A>).

<LI>The <A HREF="http://openlab.ring.gr.jp/efont/">Electronic Font
Open Laboratory</A> in Japan is also working on a family of Unicode
bitmap fonts.

<LI>Dmitry Yu. Bolkhovityanov created a <A
HREF="http://www.inp.nsk.su/~bolkhov/files/fonts/univga/index.html">Unicode
VGA font</A> in BDF for use by text mode IBM PC emulators etc.

<LI>Roman Czyborra's <A HREF="http://czyborra.com/unifont/">GNU Unicode
font</A> project works on collecting a complete and free
8&#215;16/16&#215;16 pixel Unicode font. It currently covers over
34000 characters.

<LI><A HREF=
"ftp://ftp.x.org/contrib/fonts/etl-unicode.tar.gz">etl-unicode</A> is
an ISO 10646-1 BDF font prepared by <A
HREF="mailto:[email protected]">Primoz Peterlin</A>.

<LI>George Williams has created a <A
HREF="http://bibliofile.mc.duke.edu/gww/fonts/Unicode.html">Type1
Unicode font family</A>, which is also available in BDF. He also
developed the <A
HREF="http://bibliofile.mc.duke.edu/gww/FreeWare/PfaEdit/">PfaEdit</A>
Postscript and bitmap font editor.

<LI><A
HREF="http://www.evertype.com/emono/">EversonMono</A> is a
shareware monospaced font with over 3000 European glyphs, also
available from the <A HREF="ftp://dkuug.dk/CEN/TC304/EversonMono10646"
>DKUUG server</A>.

<LI><A HREF="http://members.nbci.com/langkjer/">Birger Langkjer</A>
has prepared a <A
HREF="http://members.nbci.com/langkjer/unicode.bdf.gz">Unicode VGA
Console Font</A> for Linux.

<LI><A HREF="http://www.microsoft.com/typography/fontpack/default.htm"
>Microsoft's fontpack</A> also contains a number of free TrueType
Unicode fonts.

<LI>Christoph Singer had a <!-- A HREF=
"http://www.ccss.de/slovo/unifonts.htm" -->list of freely available
Unicode TrueType fonts<!-- /A-->.

<LI>Alan Wood has a list of <A
HREF="http://www.hclrss.demon.co.uk/unicode/fontsbyrange.html">Microsoft
fonts that support various Unicode ranges</A>.

</UL>

<P>Unicode X11 font names end with <SAMP>-ISO10646-1</SAMP>. This is
now the officially <A HREF="ftp://ftp.x.org/pub/DOCS/registry"
>registered</A> value for the <A HREF=
"ftp://sunsite.doc.ic.ac.uk/packages/X11/pub/R6.4/xc/doc/hardcopy/XLFD/xlfd.PS.gz"
>X Logical Font Descriptor (XLFD)</A> fields
<SAMP>CHARSET_REGISTRY</SAMP> and <SAMP>CHARSET_ENCODING</SAMP> for
all Unicode and ISO 10646-1 16-bit fonts. The
<SAMP>*-ISO10646-1</SAMP> fonts contain some unspecified subset of the
entire Unicode character set, and users have to make sure that
whatever font they select covers the subset of characters they need.
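
<P>For example, an application can request the Unicode version of the
classic 6x13 fixed font simply by using an XLFD name whose last two
fields are <SAMP>iso10646-1</SAMP>. A minimal sketch (assuming the
misc-fixed fonts mentioned above are installed):

<PRE> #include &lt;X11/Xlib.h>
 #include &lt;stdio.h>

 int main(void)
 {
   Display *dpy = XOpenDisplay(NULL);
   XFontStruct *font;

   if (!dpy)
     return 1;
   /* the last two XLFD fields are CHARSET_REGISTRY and CHARSET_ENCODING */
   font = XLoadQueryFont(dpy,
       "-misc-fixed-medium-r-normal--13-120-75-75-c-60-iso10646-1");
   printf(font ? "Unicode font found\n" : "Unicode font not found\n");
   XCloseDisplay(dpy);
   return 0;
 }
</PRE>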

<P>The <SAMP>*-ISO10646-1</SAMP> fonts usually also specify a
<SAMP>DEFAULT_CHAR</SAMP> value that points to a special non-Unicode
glyph for representing any character that is not available in the font
(usually a dashed box, the size of an H, located at 0x00). This
ensures that users at least see clearly that there is an unsupported
character. The smaller fixed-width fonts such as 6x13 etc. for xterm
will never be able to cover all of Unicode, because many scripts such
as Kanji can only be represented in considerably larger pixel sizes
than those widely used by European users. Typical Unicode fonts for
European usage will contain only subsets of between 1000 and 3000
characters, such as the <A
HREF="http://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf">CEN MES-3
repertoire</A>.

<P>You might notice that in the *-ISO10646-1 fonts the <A
HREF="ucs/quotes.html">shapes of the ASCII quotation marks</A> have
changed slightly, to bring them in line with the standards and
practice on other platforms.

<H2><A NAME="term">What are the issues related to UTF-8 terminal emulators?</A></H2>

<P><A HREF="http://vt100.net/">VT100</A> terminal emulators accept ISO
2022 (=<A
HREF="http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM">ECMA-35</A>) ESC
sequences in order to switch between different character sets.

<P>UTF-8 is in the sense of ISO 2022 an "other coding system" (see
section 15.4 of ECMA 35). UTF-8 is outside the ISO 2022
SS2/SS3/G0/G1/G2/G3 world, so if you switch from ISO 2022 to UTF-8,
all SS2/SS3/G0/G1/G2/G3 state becomes meaningless until you leave
UTF-8 and switch back to ISO 2022. UTF-8 is a stateless encoding, i.e.
a self-terminating short byte sequence determines completely which
character is meant, independent of any switching state. G0 and G1 in
ISO 10646-1 are those of ISO 8859-1, and G2/G3 do not exist in ISO
10646, because every character has a fixed position and no switching
takes place. With UTF-8, your terminal cannot remain stuck in some
strange graphics-character mode after you accidentally dump a binary
file to it. This makes a terminal in UTF-8 mode much more robust than
one in ISO 2022 mode, and it is therefore useful to have a way of
locking a terminal into UTF-8 mode such that it cannot accidentally go
back to the ISO 2022 world.

<P>The ISO 2022 standard specifies a range of ESC % sequences for
leaving the ISO 2022 world (designation of other coding system, DOCS),
and a number of such sequences have been registered for <A
HREF="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/ISO-10646-UTF-8.html">UTF-8</A>
in section 2.8 of the <A
HREF="http://www.itscj.ipsj.or.jp/ISO-IR/">ISO 2375 International
Register of Coded Character Sets</A>:

<UL>

<LI><SAMP>ESC %G</SAMP> activates UTF-8 with an unspecified
implementation level from ISO 2022, in a way that still allows a
return to ISO 2022.

<LI><SAMP>ESC %@</SAMP> goes back from UTF-8 to ISO 2022 in case
UTF-8 had been entered via <SAMP>ESC %G</SAMP>.

<LI><SAMP>ESC %/G</SAMP> switches to UTF-8 Level 1 with no return.
<LI><SAMP>ESC %/H</SAMP> switches to UTF-8 Level 2 with no return.
<LI><SAMP>ESC %/I</SAMP> switches to UTF-8 Level 3 with no return.

</UL>
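
<P>An application that wants to lock the terminal into UTF-8 mode
could emit these sequences for instance as follows (only a sketch;
whether the sequences have any effect depends of course on the
terminal emulator in use):

<PRE> #include &lt;stdio.h>

 /* switch the terminal from ISO 2022 into UTF-8 mode (ESC %G) ... */
 void enter_utf8(void) { fputs("\033%G", stdout); fflush(stdout); }

 /* ... and back again (ESC %@) */
 void leave_utf8(void) { fputs("\033%@", stdout); fflush(stdout); }
</PRE>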

<P>While a terminal emulator is in UTF-8 mode, any ISO 2022 escape
sequences such as for switching G2/G3 etc. are ignored. The only ISO
2022 sequence on which a terminal emulator might act in UTF-8 mode is
<SAMP>ESC %@</SAMP> for returning from UTF-8 back to the ISO 2022
scheme.

<P>UTF-8 still allows you to use C1 control characters such as CSI,
even though UTF-8 also uses bytes in the range 0x80-0x9F. It is
important to understand that a terminal emulator in UTF-8 mode must
apply the UTF-8 decoder to the incoming byte stream
<STRONG>before</STRONG> interpreting any control characters. C1
characters are UTF-8 decoded just like any other character above
U+007F; the control character CSI (U+009B), for example, arrives in
UTF-8 mode as the two-byte sequence 0xC2 0x9B.

<P>Many text-mode applications available today expect to speak to the
terminal using a legacy encoding or to use ISO 2022 sequences for
switching terminal fonts. In order to use such applications within a
UTF-8 terminal emulator, it is possible to use a conversion layer that
will translate between ISO 2022 and UTF-8 on the fly. One such utility
is Juliusz Chroboczek's <A
HREF="http://www.pps.jussieu.fr/~jch/software/luit/">luit</A>. If all
you need is ISO 8859 support in a UTF-8 terminal, you can also use
Michael Schroeder's <A
HREF="ftp://ftp.uni-erlangen.de/pub/utilities/screen/">screen</A>
(version 3.9.9 or newer). As implementing ISO 2022 is a complex and
error-prone task, it is better not to attempt it yourself: implement
only UTF-8 and point users who need ISO 2022 at luit (or screen).
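
<P>For example, to run a legacy ISO 8859-2 application inside a UTF-8
xterm, you can start it through luit roughly like this (see the luit
manual page for the exact option syntax; <SAMP>mc</SAMP> is just an
arbitrary example application):

<PRE> luit -encoding ISO-8859-2 mc
</PRE>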

<H2><A NAME="apps">What UTF-8 enabled applications are already available?</A></H2>

<UL>

<LI><A HREF= "http://dickey.his.com/xterm/xterm.html">xterm</A> as
shipped with XFree86 4.0 or higher (compile with "<SAMP>./configure
--enable-wide-chars ; make</SAMP>" and use a UTF-8 locale or command
line option <SAMP>-u8</SAMP> when you invoke xterm).

<LI><FONT COLOR="#ff0000">NEW:</FONT> <A
HREF="http://www.vim.org/">Vim</A> (a popular clone of the classic vi
editor) supports UTF-8 with wide characters and up to two combining
characters starting with version 6.0 (2001-09-26). Read Bram
Moolenaar's <A HREF=
"http://mail.nl.linux.org/linux-utf8/2000-07/msg00036.html"
>announcement</A> for details.

<LI><A HREF="http://www.yudit.org/">Yudit 2.0</A> is Gaspar Sinai's
free X11 Unicode editor.

<LI><A HREF="http://www.inf.fu-berlin.de/~wolff/mined.html">Mined
2000</A> by <A HREF="http://www.inf.fu-berlin.de/~wolff/">Thomas
Wolff</A> is a UTF-8 capable text editor.

<LI><A HREF="http://cooledit.org/">Cooledit</A> offers UTF-8 and UCS
support starting with version 3.15.0.

<LI><A HREF="http://www-stud.enst.fr/~bellard/qemacs/">QEmacs</A> is a
small editor for use on UTF-8 terminals.

<LI><A HREF="http://www.greenwoodsoftware.com/less/">less</A> version
346 or later has UTF-8 support. (Unfortunately, version 358 still has
a <A HREF="http://mail.nl.linux.org/linux-utf8/2001-05/msg00022.html">bug</A>
related to the handling of UTF-8 characters and backspace
underlining/boldification as used by nroff/man, for which a <A
HREF="http://mail.nl.linux.org/linux-utf8/2001-05/msg00023.html">patch</A>
is available.)

<LI><A HREF="http://www.columbia.edu/kermit/ckermit.html">C-Kermit
7.0</A> supports UTF-8 as the transfer, terminal, and file character
set.

<LI><A HREF="http://www.perl.org/">Perl</A> has some <A
HREF="http://rf.net/~james/perli18n.html">core UTF-8 support</A>
starting with version 5.6 when requested with "use utf8;", which means
that strings are stored as UTF-8 (and tagged as UTF-8) and length()
returns characters instead of bytes. A lot of work on enhancing UTF-8
support is still going on at the moment (see also the <A
HREF="mailto:[email protected]">[email protected]</A>
mailing list). Read <SAMP>perldoc perlunicode</SAMP> and <SAMP>perldoc
utf8</SAMP> to see the capabilities and limitations. Perl 5.8 with
revised and much improved Unicode support will be released soon.

<LI><A HREF="http://www.python.org/1.6/">Python 1.6</A> now has <A
HREF="http://www.lemburg.com/files/python/unicode-proposal.txt">Unicode
support</A> integrated.

<LI><A HREF="http://dev.scriptics.com/">Tcl/Tk</A> started using <A
HREF="http://dev.scriptics.com/doc/howto/i18n.html">Unicode as its
base character set</A> with version 8.1. ISO10646-1 fonts are
supported in Tk 8.3.3 or newer.

<LI><A HREF="http://www.beedub.com/exmh/">Exmh</A> is a GUI frontend
for the MH mail system and supports Unicode starting with version
2.1.1 if Tcl/Tk 8.3.3 or newer is used. To be able to display UTF-8
email, make sure you have the *-iso10646-1 fonts installed and add to
Xdefaults the line "exmh.mimeUCharsets: utf-8".

<LI><A HREF="http://clisp.cons.org">CLISP</A> can work with all
multi-byte encodings (including UTF-8) and with the functions
<SAMP>char-width</SAMP> and <SAMP>string-width</SAMP> there is an API
comparable to <SAMP>wcwidth()</SAMP> and <SAMP>wcswidth()</SAMP>
available.

<LI><A HREF="http://mlterm.sourceforge.net/">mlterm</A> is a
multi-lingual terminal emulator that supports UTF-8 among many other
encodings, combining characters, XIM.

<LI><A HREF="http://www.cs.yorku.ca/~oz/wily/">Wily</A> started out as
a Unix implementation of the Plan9 Acme editor and is a
mouse-oriented, text-based working environment for programmers.

<LI><A
HREF="http://hawkwind.utcs.utoronto.ca:8001/mlists/sam.html">Sam</A>
is the Plan9 UTF-8 editor, similar to vi and also available for Linux
and Win32. (<A HREF="http://plan9.bell-labs.com/plan9dist/">Plan9</A> was
the first operating system that <A HREF=
"ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz">switched
completely to UTF-8 as its character encoding</A>.)

<LI><A HREF="http://www.cs.usyd.edu.au/~matty/9term/">9term</A> by
<A HREF="http://www.cs.usyd.edu.au/~matty/">Matty Farrow</A> is a
Unix port of the Unicode/UTF-8 terminal emulator of the Plan9
operating system.

<LI><A HREF="ftp://ftp.dcs.ed.ac.uk/pub/jec/programs/">ucm-0.1</A> is
<A HREF="http://www.pps.jussieu.fr/~jch/">Juliusz Chroboczek</A>'s
Unicode Character Map, a little tool that lets you select and paste
any Unicode character into your application.

<LI><A
HREF="http://www.geocities.com/CapeCanaveral/Lab/5735/1/txtbdf2ps.html">txtbdf2ps</A>
by Serge Winitzki is a Perl script to print UTF-8 plaintext to
PostScript using BDF pixel fonts.

<LI><A HREF="http://st-www.cs.uiuc.edu/users/chai/figlet.html">FIGlet
2.2</A> or newer is a tool to output banner text in large letters
using monospaced characters as block graphics elements.

<LI><A HREF="http://www.rano.org/">Edmund Grimley Evans</A> extended
the <A HREF="http://www.msu.edu/user/pfaffben/">BOGL</A> Linux
framebuffer graphics library with UCS font support and built a simple
UTF-8 console terminal emulator called <SAMP>bterm</SAMP> with it.

<LI>The <A HREF="http://www.mutt.org/">Mutt</A> email client has
worked since version 1.3.24 in UTF-8 locales, if it is used with
sufficiently recent versions of ncurses and slang.

<LI><A HREF="http://www.abisource.com/">Abiword</A>.

<LI><A HREF="http://www.postgresql.org/">PostgreSQL</A> 7.1 has full
support for UTF-8, both as the frontend encoding, and as the backend
storage encoding. Data conversion between frontend and backend
encodings is performed automatically.

</UL>

<H2><A NAME="patches">What patches to improve UTF-8 support are available?</A></H2>

<UL>

<LI>A collection of UTF-8 patches for bash and other tools as well as
a UTF-8 support status list is in Bruno Haible's <A
HREF="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-4.html">Unicode-HOWTO</A>.

<LI>Bruno Haible has also prepared <A
HREF="ftp://ftp.ilog.fr/pub/Users/haible/utf8/">various patches</A>
for stty, the Linux kernel tty, etc.

<LI>The Advanced Utility Development subgroup of the Li18nux project
has prepared various <A
HREF="http://www.li18nux.org/subgroups/utildev/dli18npatch2.html"
>internationalization patches</A> for tools such as bash, cut, fold,
glibc, join, sed, uniq, xterm, etc. that might improve UTF-8 support.

<LI>Miyashita Hisashi has written <A
HREF="ftp://ftp.m17n.org/pub/mule/Mule-UCS/">MULE-UCS</A>, a character
set translation package for Emacs 20.6 or higher, which can translate
between the Mule encoding (used internally by Emacs) and ISO 10646.

<LI>Otfried Cheong provides on his <A
HREF="http://www.cs.uu.nl/~otfried/Mule/">Unicode encoding for GNU
Emacs</A> page an extension to MULE-UCS that covers the entire BMP by
adding <SAMP>utf-8</SAMP> as another Emacs character set. His page
also contains a short installation guide for MULE-UCS.

<LI><A HREF="http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/">UTF-8
xemacs patch</A> by Tomohiko Morioka.

<LI>The <A
HREF="http://www2u.biglobe.ne.jp/~hsaka/w3m/">multilingualization
patch (w3m-m17n)</A> for the text-mode web browser <A
HREF="http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/eng/">w3m</A>
allows you to view documents in all the common encodings on a UTF-8
terminal like xterm (also switch the option "Use alternate expression
with ASCII for entity" to OFF after pressing "o"). Another <A
HREF="http://pub.ks-and-ks.ne.jp/prog/w3mmee/">multilingual version
(w3mmee)</A> is available as well (I haven't tried that yet).

<LI><A HREF="http://noa.tm/utf-8/">Daniel Resare</A> has a web page on
what he had to do to make RedHat Linux 7.1 better suited for using
UTF-8.

</UL>

<H2><A NAME="libs">Are there free libraries for dealing with Unicode available?</A></H2>

<UL>

<LI>Ulrich Drepper's <A HREF="http://sources.redhat.com/glibc/">GNU C
library glibc 2.2.x</A> contains full multi-byte locale support for
UTF-8, a Unicode sorting order algorithm, and it can recode into many
other encodings. All recent Linux distributions already come with
glibc 2.2.2, so you definitely should upgrade if you are still using
an earlier Linux C library.

<LI>The <A HREF="http://oss.software.ibm.com/icu/">International
Components for Unicode (ICU)</A> (formerly <A
HREF="http://www.alphaworks.ibm.com/tech/icu/">IBM Classes for
Unicode</A>) have become probably the most powerful cross-platform
standard library for more advanced Unicode character processing
functions.

<LI>X.Net's <A HREF="http://www.xnetinc.com/xiua/">xIUA</A> is a
package designed to retrofit existing code for ICU support by
providing locale management so that users do not have to modify
internal calling interfaces to pass locale parameters. It uses more
familiar APIs; for example, to collate you call xiua_strcoll(), and it
is thread-safe.

<LI><A HREF="http://crl.nmsu.edu/~mleisher/">Mark Leisher</A>'s UCData
Unicode character property and bidi library as well as his
<SAMP>wchar_t</SAMP> support test code.

<LI>Bruno Haible's <A
HREF="http://www.gnu.org/software/libiconv/">libiconv</A>
character-set conversion library provides an <A HREF=
"http://www.opengroup.org/onlinepubs/007908799/xsh/iconv.h.html"
>iconv()</A> implementation, for use on systems which don't have one,
or whose implementation cannot convert from/to Unicode (see the short
example after this list).

<BR>It also contains the libcharset character-encoding query library,
which allows applications to determine in a highly portable way the
character encoding of the current locale, avoiding the portability
concerns of using <A HREF=
"http://www.opengroup.org/onlinepubs/007908799/xsh/langinfo.h.html"
>nl_langinfo(CODESET)</A> directly.

<LI><A HREF="ftp://ftp.ilog.fr/pub/Users/haible/utf8/">Bruno Haible's
libutf8</A> provides various functions for handling UTF-8 strings,
especially for platforms that do not yet offer proper UTF-8 locales.

<LI><A HREF="mailto:[email protected]">Tom Tromey</A>'s <A HREF=
"http://people.redhat.com/otaylor/pango-mirror/download/libunicode-0.4.tar.gz"
>libunicode</A> library is part of the Gnome Desktop project, but can
be built independently of Gnome. It contains various character class
and conversion functions. (<A
HREF="http://cvs.gnome.org/lxr/source/libunicode/">CVS</A>)

<LI><A HREF="http://fribidi.sourceforge.net/">FriBidi</A> is Dov
Grobgeld's free implementation of the Unicode bidi algorithm.

<LI><A HREF="http://czyborra.com/arabjoin/">Arabjoin</A> is Roman
Czyborra's little Perl tool that takes Arabic UTF-8 text (encoded in
the U+06<VAR>xx</VAR> Arabic block in logical order) as input,
performs Arabic glyph joining, and outputs a UTF-8 octet stream that
is arranged in visual order. This gives readable results when
formatted with a simple Unicode renderer like xterm or yudit that does
not handle Arabic differently but simply outputs all glyphs in
left-to-right order.

<LI><A HREF="http://www.w3.org/International/charlint/">Charlint</A>
is a character normalization tool for the <A
HREF="http://www.w3.org/TR/charmod/">W3C character model</A>.

<LI><A HREF="ucs/wcwidth.c">Markus Kuhn's free wcwidth()
implementation</A> can be used by applications on platforms where the
C library does not yet provide an equivalent function to find out how
many column positions a character or string will occupy on a UTF-8
terminal emulator screen.

<LI>Markus Kuhn's <A HREF="download/transtab.tar.gz">transtab</A> is a
transliteration table for applications that have to make a best-effort
conversion from Unicode to ASCII or some 8-bit character set. It
contains a comprehensive list of substitution strings for Unicode
characters, comparable to the fallback notations that people use
commonly in email and on typewriters to represent unavailable
characters. The table comes in <A
HREF="volatile/ISO-14652.pdf">ISO/IEC TR 14652</A> format, to allow
simple inclusion into POSIX locale definition files.

</UL>
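
<P>The iconv() interface mentioned above is simple to use. Here is a
minimal conversion sketch (assuming a glibc 2.2 or libiconv
environment; error handling abbreviated):

<PRE> #include &lt;iconv.h>
 #include &lt;stdio.h>
 #include &lt;string.h>

 int main(void)
 {
   char in[] = "Gr\xfc\xdf Gott";   /* ISO 8859-1 input */
   char out[64];
   char *inptr = in, *outptr = out;
   size_t inleft = strlen(in), outleft = sizeof(out) - 1;
   iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");

   if (cd == (iconv_t) -1)
     return 1;
   if (iconv(cd, &amp;inptr, &amp;inleft, &amp;outptr, &amp;outleft) == (size_t) -1)
     return 1;
   *outptr = '\0';
   puts(out);                       /* the same string, now in UTF-8 */
   iconv_close(cd);
   return 0;
 }
</PRE>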

<H2><A NAME="widgets">What is the status of Unicode support for various X widget libraries?</A></H2>

<UL>

<LI><A HREF="http://www.pango.org/">Pango - Unicode and Complex Text
Processing</A> is a project to add full-featured Unicode support to <A
HREF="http://www.gtk.org/">GTK+</A>.

<LI><A HREF="http://www.trolltech.com/company/announce/qt-200.html">Qt
2.0</A> and newer supports the use of *-ISO10646-1 fonts.

<LI>A <A HREF="http://oksid.ch/fltk-utf/">UTF-8 extension</A> for the
<A HREF="http://www.fltk.org/">Fast Light Tool Kit</A> was prepared by
Jean-Marc Lienher, based on his Xutf8 Unicode display library.

</UL>

<H2><A NAME="wip">What packages with UTF-8 support are currently under development?</A></H2>

<UL>

<LI>Native Unicode support is planned for Emacs 22. If you are
interested in contributing/testing, please ask <A
HREF="mailto:[email protected]">Eli Zaretskii</A> to put you onto the
<samp>emacs-unicode</samp><samp>&#64;gnu.org</samp> mailing list.

<LI>The <A HREF="http://linuxconsole.sourceforge.net/">Linux Console
Project</A> works on a complete revision of the VT100 emulator built
into the Linux kernel, which will improve the simplistic UTF-8 support
already there.

</UL>

<H2><A NAME="solaris">How does UTF-8 support work under Solaris?</A></H2>

<P>Starting with Solaris 2.8, UTF-8 is at least partially supported.
To use it, just set one of the UTF-8 locales, for instance by typing

<PRE> setenv LANG en_US.UTF-8
</PRE>

in a C shell.

<P>Now the <SAMP>dtterm</SAMP> terminal emulator can be used to input
and output UTF-8 text and the <SAMP>mp</SAMP> print filter will print
UTF-8 files on PostScript printers. The <SAMP>en_US.UTF-8</SAMP>
locale is at the moment supported by Motif and CDE desktop
applications and libraries, but not by OpenWindows, XView, and
OPENLOOK DeskSet applications and libraries.

<P>For more information, read Sun's <A HREF=
"http://docs.sun.com:80/ab2/coll.45.13/I18NDG/@Ab2PageView/10821?Ab2Lang=C&amp;Ab2Enc=iso-8859-1"
>Overview of en_US.UTF-8 Locale Support</A> web page.

<H2><A NAME="ps">How are Postscript glyph names related to UCS codes?</A></H2>

<P>See Adobe's <A
HREF="http://partners.adobe.com/asn/developer/type/unicodegn.html">Unicode
and Glyph Names</A> guide.

<H2><A NAME="subsets">Are there any well-defined UCS subsets?</A></H2>

<P>With over 40000 characters, a full and complete Unicode
implementation is an enormous project. However, it is often sufficient
(especially for the European market) to implement only a few hundred
or thousand characters, as before, and still enjoy the convenience of
reaching all required characters through one single, simple encoding:
Unicode. A number of different UCS subsets have already been
established:

<UL>

<LI>The <A HREF=
"http://partners.adobe.com/asn/developer/opentype/appendices/wgl4.html">Windows
Glyph List 4.0 (WGL4)</A> is a set of 650 characters that covers all
the 8-bit MS-DOS, Windows, Mac, and ISO code pages that Microsoft had
used before. All Windows fonts now cover at least the WGL4 repertoire.
WGL4 is a superset of CEN MES-1. (<A HREF="ucs/wgl4.txt">WGL4 test
file</A>).

<LI>Three <A
HREF="http://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf">European
UCS subsets MES-1, MES-2, and MES-3</A> have been defined by the
European standards committee CEN/TC304 in CWA 13873:

<UL>

<LI>MES-1 is a very small Latin subset with only 335 characters. It
contains exactly all characters found in ISO 6937 plus the EURO SIGN.
This means MES-1 contains all characters of ISO 8859 parts
1,2,3,4,9,10,15. [Note: If your aim is to provide only the cheapest
and simplest reasonable Central European UCS subset, I would implement
MES-1 plus the following important 14 additional characters found in
Windows code page 1252 but not in MES-1: U+0192, U+02C6, U+02DC,
U+2013, U+2014, U+201A, U+201E, U+2020, U+2021, U+2022, U+2026,
U+2030, U+2039, U+203A.]

<LI>MES-2 is a Latin/Greek/Cyrillic/Armenian/Georgian subset with 1052
characters. It covers every language and every 8-bit code page used in
Europe (not just the EU!) and European language countries. It also
adds a small collection of mathematical symbols for use in technical
documentation. MES-2 is a superset of MES-1. If you are developing
only for a European or Western market, MES-2 is the recommended
repertoire. [Note: For bizarre committee-politics reasons, the
following eight WGL4 characters are missing from MES-2: U+2113,
U+212E, U+2215, U+25A1, U+25AA, U+25AB, U+25CF, U+25E6. If you
implement MES-2, you should definitely also add those and then you can
claim WGL4 conformance in addition.]

<LI>MES-3 is a very comprehensive UCS subset with 2819 characters. It
simply includes every UCS collection that seemed of potential use to
European users. This is for the more ambitious implementors. MES-3 is
a superset of MES-2 and WGL4.

</UL>

<LI>JIS X 0221-1995 specifies 7 non-overlapping UCS subsets for
Japanese users:

<UL>

<LI>Basic Japanese (6884 characters): JIS X 0208-1997, JIS X 0201-1997

<LI>Japanese Non-ideographic Supplement (1913 characters): JIS X
0212-1990 non-kanji, plus various other non-kanji

<LI>Japanese Ideographic Supplement 1 (918 characters): some JIS X
0212-1990 kanji

<LI>Japanese Ideographic Supplement 2 (4883 characters): remaining JIS
X 0212-1990 kanji

<LI>Japanese Ideographic Supplement 3 (8745 characters): remaining
Chinese characters

<LI>Full-width Alphanumeric (94 characters): for compatibility

<LI>Half-width Katakana (63 characters): for compatibility

</UL>

<LI>The ISO 10646 standard splits up its repertoire into a number of
<A HREF= "http://www.evertype.com/standards/iso10646/ucs-collections.html"
>collections</A> that can be used to define and document implemented
subsets. Unicode defines similar, but not quite identical, <A
HREF="ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt">blocks</A> of
characters, which correspond to sections in the Unicode standard.

<LI><A HREF="ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc1815.txt">RFC
1815</A> is a memo written in 1995 by someone who obviously didn't
like ISO 10646 and was unaware of JIS X 0221-1995. It discusses a UCS
subset called "ISO-10646-J-1" consisting of 14 UCS collections, some
of which are intersected with JIS X 0208. This is just what a
particular font in an old Japanese Windows NT version from 1995
happened to implement. RFC 1815 is completely obsolete and irrelevant
today and should best be ignored.

<LI>Markus Kuhn has defined in the <A
HREF="download/ucs-fonts.tar.gz">ucs-fonts.tar.gz</A> README three UCS
subsets TARGET1, TARGET2, TARGET3 that are sensible extensions of the
corresponding MES subsets and that were the basis for the completion
of this xterm font package.

</UL>

<P>Markus Kuhn's <A HREF="download/uniset.tar.gz">uniset</A> Perl script
allows convenient set arithmetic over UCS subsets for anyone who wants
to define a new one or wants to check coverage of an implementation.

<H2><A NAME="conv">What issues are there to consider when converting encodings</A></H2>

<P>The Unicode Consortium maintains a <A
HREF="http://www.unicode.org/Public/MAPPINGS/">collection of mapping
tables</A> between Unicode and various older encoding standards. It is
important to understand that these tables alone are only suitable for
converting text from the older encodings to Unicode. Conversion in the
opposite direction from Unicode to a legacy character set requires
non-injective (= many-to-one) extensions of these mapping tables.
Several Unicode characters have to be mapped to a single code point in
a legacy encoding. This is necessary, because some legacy encodings
distinguished characters that others unified. The Unicode Consortium
does not currently maintain standard many-to-one tables for this
purpose, but such tables can easily be generated from available
normalization information.

<P>Here are some examples of the many-to-one mappings that have to be
handled when converting from Unicode into something else:

<P><DIV ALIGN=CENTER><TABLE BORDER=1 CELLPADDING=5>
<TR><TH>UCS characters<TH>equivalent character<TH>in target code
<TR><TD>U+00B5 MICRO SIGN<BR>U+03BC GREEK SMALL LETTER MU
   <TD>0xB5<TD>ISO 8859-1
<TR><TD>U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE<BR>U+212B ANGSTROM SIGN
   <TD>0xC5<TD>ISO 8859-1
<TR><TD>U+03A9 GREEK CAPITAL LETTER OMEGA<BR>U+2126 OHM SIGN
   <TD>0xEA<TD>CP437
<TR><TD>U+005C REVERSE SOLIDUS<BR>U+FF3C FULLWIDTH REVERSE SOLIDUS
   <TD>0x2140<TD>JIS X 0208
</TABLE></DIV>
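
<P>A converter can handle such entries with a small exception table on
top of the trivial identity mapping. Here is a minimal sketch for ISO
8859-1 output (the helper function is purely illustrative, not part of
any library):

<PRE> #include &lt;stddef.h>

 /* many-to-one entries that a UCS -> ISO 8859-1 converter needs in
    addition to the identity mapping of U+0000..U+00FF */
 static const struct { unsigned int ucs; unsigned char latin1; } extra[] = {
   { 0x03BC, 0xB5 },  /* GREEK SMALL LETTER MU -> MICRO SIGN byte */
   { 0x212B, 0xC5 },  /* ANGSTROM SIGN -> A WITH RING ABOVE byte  */
 };

 /* return the ISO 8859-1 byte for a UCS code, or -1 if unmappable */
 int ucs_to_latin1(unsigned int ucs)
 {
   size_t i;

   if (ucs &lt; 0x100)
     return (int) ucs;  /* ISO 8859-1 equals the first 256 UCS codes */
   for (i = 0; i &lt; sizeof(extra) / sizeof(extra[0]); i++)
     if (extra[i].ucs == ucs)
       return extra[i].latin1;
   return -1;
 }
</PRE>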

<P>The <A
HREF="http://www.unicode.org/Public/UNIDATA/UnicodeData.html">Unicode
database</A> contains in field 5 the Character Decomposition
Mapping, which can be used to generate the above example mappings
automatically. As a rule, the output of a Unicode-to-Something
converter should not depend on whether the Unicode input has first
been converted into <A
HREF="http://www.unicode.org/unicode/reports/tr15/">Normalization Form
C</A> or not. For equivalence information on Chinese, Japanese, and
Korean Han/Kanji/Hanja characters, use the <A
HREF="http://www.unicode.org/charts/unihan.html">Unihan database</A>.

<P>The Unicode mapping tables also have to be slightly modified
sometimes to preserve information in combination encodings. For
example, the standard mappings provide round-trip compatibility for
conversion chains ASCII to Unicode to ASCII as well as for JIS X 0208
to Unicode to JIS X 0208. However, the EUC-JP encoding covers the
union of ASCII and JIS X 0208, and the UCS repertoire covered by the
ASCII and JIS X 0208 mapping tables overlaps for one character, namely
U+005C REVERSE SOLIDUS. EUC-JP converters therefore have to use a
slightly modified JIS X 0208 mapping table, such that the JIS X 0208
code 0x2140 (0xA1 0xC0 in EUC-JP) gets mapped to U+FF3C FULLWIDTH
REVERSE SOLIDUS. This way, round-trip compatibility from EUC-JP to
Unicode to EUC-JP can be guaranteed without any loss of information.
<A HREF=
"http://www.unicode.org/unicode/reports/tr11/#Recommendation">Unicode
Standard Annex #11: East Asian Width</A> provides further guidance on
this issue.

<P>In addition to just using standard normalization mappings,
developers of code converters can also offer transliteration support.
Transliteration is the conversion of a Unicode character into a
graphically and/or semantically similar character in the target code,
even if the two are distinct characters in Unicode after
normalization. Examples of transliteration:

<P><DIV ALIGN=CENTER><TABLE BORDER=1 CELLPADDING=5>
<TR><TH>UCS characters<TH>equivalent character<TH>in target code
<TR><TD>U+0022 QUOTATION MARK<BR>U+201C LEFT DOUBLE QUOTATION MARK<BR>
       U+201D RIGHT DOUBLE QUOTATION MARK<BR>
       U+201E DOUBLE LOW-9 QUOTATION MARK<BR>
       U+201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
   <TD>0x22<TD>ISO 8859-1
</TABLE></DIV>

<P>The Unicode Consortium does not provide or maintain any standard
transliteration tables. Which transliterations are appropriate can in
some cases depend on language, application field, and even personal
preference. Available Unicode transliteration tables include
for example those found in Bruno Haible's <A
HREF="http://www.gnu.org/software/libiconv/">libiconv</A>,
the <A HREF="http://sources.redhat.com/glibc/">glibc 2.2</A> locales,
and Markus Kuhn's <A HREF="download/transtab.tar.gz">transtab</A>
package.

<H2><A NAME="x11">Is X11 ready for Unicode?</A></H2>

<P>The <A HREF="ftp://ftp.x.org/pub/R6.6/">X11 R6.6 release</A> (2001)
is the latest version of the X Consortium's sample implementation of
the X11 Window System standards. The bulk of the <A
HREF="ftp://ftp.x.org/pub/R6.6/xc/doc/hardcopy/">current X11
standards</A> and the sample implementation pre-date widespread
interest into Unicode under Unix. There are a number of problems and
inconveniences for Unicode users in both that really should be fixed
in the next X11 release:

<UL>

<LI><P><B>UTF-8 cut and paste:</B> The <A
HREF="ftp://ftp.x.org/pub/R6.6/xc/doc/hardcopy/ICCCM/icccm.PS.gz">ICCCM</A>
standard does not specify how to transfer UCS strings in selections.
Some vendors have added UTF-8 as yet another encoding to the existing
<A HREF= "ftp://ftp.x.org/pub/R6.6/xc/doc/hardcopy/CTEXT/ctext.PS.gz"
>COMPOUND_TEXT</A> mechanism (CTEXT). This is not a good solution for
at least the following reasons:</P>

<UL>

<LI>CTEXT is a rather complicated ISO 2022 mechanism and Unicode
offers the opportunity to provide not just another add-on to CTEXT,
but to replace the entire monster with something far simpler, more
convenient, and equally powerful.

<LI>Many existing applications can communicate selections via CTEXT,
but do not support a newly added UTF-8 option. A user of CTEXT has to
decide whether to use the old ISO 2022 encodings or the new UTF-8
encoding, but both cannot be offered simultaneously. In other words,
adding UTF-8 to CTEXT seriously breaks backwards compatibility with
existing CTEXT applications.

<LI>The current CTEXT specification even explicitly forbids the
addition of UTF-8 in section 6: "ISO registered 'other coding systems'
are not used in Compound Text; extended segments are the only
mechanism for non-2022 encodings."

</UL>

<P><A HREF="http://www.dcs.ed.ac.uk/home/jec/">Juliusz Chroboczek</A>
has written an <A
HREF="http://www.pps.jussieu.fr/~jch/software/UTF8_STRING/"
>Inter-Client Exchange of Unicode Text</A> draft proposal for an
extension of the ICCCM to handle UTF-8 selections with a new
UTF8_STRING atom that can be used as a property type and selection
target. This clean approach fixes all of the above problems.
UTF8_STRING is just as state-less and easy to use as the existing
STRING atom (which is reserved exclusively for ISO 8859-1 strings and
therefore not usable for UTF-8), and adding a new selection target
allows applications to offer selections in both the old CTEXT and the
new UTF8_STRING format simultaneously, which maximizes
interoperability. The use of UTF8_STRING can be negotiated between the
selection holder and requestor, leading to no compatibility issues
whatsoever. Markus Kuhn has prepared an <A HREF="ucs/icccm.diff">ICCCM
patch</A> that adds the necessary definition to the standard. Current
status: The UTF8_STRING atom has now been officially <A
HREF="ftp://ftp.x.org/pub/DOCS/registry">registered</A> with X.Org,
and an update of the ICCCM is expected for the next release.

<LI><A NAME="xfontstruct"><B>Inefficient font data structures:</B></A>
The Xlib API and X11 protocol data structures used for representing
font metric information are extremely inefficient when handling
sparsely populated fonts. The most common way of accessing a font in
an X client is a call to XLoadQueryFont(), which allocates memory for
an XFontStruct and fetches its content from the server. XFontStruct
contains an array of XCharStruct entries (12 bytes each). The size of
this array is the code position of the last character minus the code
position of the first character plus one. Therefore, any
"*-iso10646-1" font that contains both U+0020 and U+FFFD will cause an
XCharStruct array with 65502 elements to be allocated (even for
CharCell fonts), which requires 786 kilobytes of client-side memory
and data transmission, even if the font contains only a thousand
characters.

<P>A few workarounds have been used so far:</P>

<UL>

<LI>The non-Asian <SAMP>-misc-fixed-*-iso10646-1</SAMP> fonts that
come with XFree86 4.0 contain no characters above U+31FF. This reduces
the memory requirement to 153 kilobytes, which is still bad, but much
less so. (There are actually many useful characters above U+31FF
present in the BDF files, waiting for the day when this problem will
be fixed, but they currently all have an encoding of -1 and are
therefore ignored by the X server. If you need these characters, then
just install the <A HREF="ucs-fonts.html">original fonts</A> without
applying the <SAMP>bdftruncate</SAMP> script.)

<LI>Starting with XFree86 4.0.3, the truncation of a BDF font can also
be done by specifying a character code subrange at the end of the
XLFD, as described in the <A
HREF="ftp://ftp.x.org/pub/R6.4/xc/doc/hardcopy/XLFD/xlfd.PS.gz">XLFD
specification</A>, section 3.1.2.12. For example,
<PRE>
-Misc-Fixed-Medium-R-Normal--20-200-75-75-C-100-ISO10646-1[0x1200_0x137f]
</PRE>
will load only the Ethiopic part of this BDF font with a
correspondingly nice and small XFontStruct. Earlier X server versions
will simply ignore the font subset brackets and will give you the full
font, so there is no compatibility problem with using that.

<LI>Bruno Haible has written a BIGFONT protocol extension for XFree86
4.0, which uses a compressed transmission of XCharStruct from server
to client and also uses shared memory in Xlib between several clients
which have loaded the same font.

</UL>

<P>These workarounds do not solve the underlying problem that
XFontStruct is unsuitable for sparsely populated fonts, but they do
provide a significant efficiency improvement without requiring any
changes in the API or client source code. One real solution would be
to extend or substitute XFontStruct with something slightly more
flexible that contains a sorted list or hash table of characters as
opposed to an array. Such a redesign of XFontStruct would at the same
time also be the opportunity to add the urgently needed provisions for
combining characters and ligatures.

<P>Another approach would be to introduce a new font encoding, which
could be called for instance "ISO10646-C" (the C stands for
combining, complex, compact, or character-glyph mapped, as you
prefer). In this encoding, the numbers assigned to each glyph are
really font-specific glyph numbers and are not equivalent to any UCS
character code positions. The information necessary to do a
character-to-glyph mapping would have to be stored in new,
yet-to-be-standardized font properties. This new font encoding would
be used by applications together with a few efficient C functions
that perform the character-to-glyph code mapping:

<UL>

<LI><SAMP>makeiso10646cglyphmap(XFontStruct *font, iso10646cglyphmap
*map)</SAMP>

<BR>Reads the character-to-glyph mapping table from the font
properties into a compact and efficient in-memory representation.

<LI><SAMP>freeiso10646cglyphmap(iso10646cglyphmap *map)</SAMP>

<BR>Frees that in-memory representation.

<LI><SAMP>mbtoiso10646c(char *string, iso10646cglyphmap *map, XChar2b
*output)</SAMP>

<BR><SAMP>wctoiso10646c(wchar_t *string, iso10646cglyphmap *map,
XChar2b *output)</SAMP><BR>These take a Unicode character string and
convert it into a <SAMP>XChar2b</SAMP> glyph string suitable for
output by <SAMP>XDrawString16</SAMP> with the ISO10646-C font from
which the <SAMP>iso10646cglyphmap</SAMP> was extracted.

</UL>

<P>ISO10646-C fonts would still be limited to at most 64 <A
HREF="http://www.iec.ch/tclet6.pdf">kibi</A>glyphs, but these can come
from anywhere in UCS, not just from the BMP. This solution also easily
provides for glyph substitution, such that we can finally handle the
Indic fonts. It solves the huge-XFontStruct problem of ISO10646-1, as
XFontStruct now grows proportionally with the number of glyphs, not
with the highest character code. It could also provide for simple
overstriking combining characters, but then the glyphs for combining
characters would have to be stored with negative width inside an
ISO10646-C font. It can even provide support for variable combining
accent positions, by having several alternative combining glyphs with
accents at different heights for the same combining character; the
ligature substitution tables would then encode which combining glyph
to use with which base character.

<P>TODO: write specification for ISO10646-C properties, write sample
implementations of the mapping routines, and add these to xterm, GTK,
and other applications and libraries. Any volunteers?

<LI><B>Keysyms:</B> The keysyms defined at the moment cover only a
tiny repertoire of Unicode. Markus Kuhn has suggested (and implemented
in xterm) that any UCS character in the range U-00000000 to U-00FFFFFF
can be represented by a keysym value in the range 0x01000000 to
0x01ffffff. This admittedly does not cover the entire 31-bit space of
UCS, but it does cover all the characters up to U-0010FFFF, which can
be represented by UTF-16, and more, and it is very unlikely that
higher UCS codes will ever be assigned by ISO (in fact there are
proposals to remove the code space above U-0010FFFF from ISO 10646 in
the future). So to get Unicode character U+ABCD, you can directly use
keysym 0x0100abcd (a small code sketch of this rule follows after
this list). See also the file <A
HREF="ucs/keysym2ucs.c">keysym2ucs.c</A> in the xterm source code for
a suggested conversion table between the classical keysyms and UCS,
something which should also go into the X11 standard. Markus also
wrote a proposed draft revision of the X protocol standard <A
HREF="ucs/X11.keysyms">Appendix A: KEYSYM Encoding</A> (<A
HREF="ucs/X11.keysyms.pdf">PDF</A>) that adds a UCS cross reference
table.

<LI><B>Combining characters:</B> The X11 specification does not
support combining characters in any way. The font information lacks
the data necessary to perform high-quality automatic accent placement
(as it is found for example in all TeX fonts). Various people have
experimented with implementing simplest overstriking combining
characters using zero-width characters with ink on the left side of
the origin, but details of how to do this exactly are unspecified
(e.g., are zero-width characters allowed in CharCell and Monospaced
fonts?) and this is therefore not yet widely established practice.

<LI><B>Ligatures:</B> The Indic scripts need font file formats that
support ligature substitution, which is at the moment just as
completely out of the scope of the X11 specification as are combining
characters.

<LI><B>UTF-8 locales:</B> The X11 R6.4 sample implementation did not
contain any support for UTF-8 locales. There is an old UTF locale, but
it is incomplete and uses the now obsolete <A
HREF="http://www.itscj.ipsj.or.jp/ISO-IR/178.pdf">UTF-1</A> encoding.
Implementing a UTF-8 locale not only requires the usual encoding
conversion routines, but also various keyboard entry methods, ranging
from mapping the existing ISO 8859 and keysym keyboards to UCS,
through vastly extended support for the compose key and <A
HREF="volatile/ISO-14755.pdf">ISO 14755</A> hexadecimal entry of
arbitrary characters, to input methods for Hangul and Han
characters.

<LI><B>Sample implementation:</B> A number of comprehensive Unicode
standard fonts as well as Unicode support for classic standard tools
such as xterm, xfontsel, the window managers, etc. should be added to
the sample implementation. Some work on this part has already been
done within XFree86, other work is currently delayed by the fact that
the previous points have not yet been resolved.

</UL>
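
<P>The keysym rule suggested above is trivial to implement. Here is a
minimal sketch (the real keysym2ucs.c in the xterm sources
additionally maps the classical keysyms through a lookup table):

<PRE> typedef unsigned long KeySym;   /* effectively an unsigned long, as in &lt;X11/X.h> */

 /* map a UCS character to the proposed Unicode keysym range,
    e.g. U+ABCD -> keysym 0x0100abcd */
 KeySym ucs2keysym(unsigned long ucs)
 {
   return 0x01000000 | ucs;
 }

 /* map such a keysym back to UCS; classical keysyms below
    0x01000000 would need a table lookup instead */
 long keysym2ucs(KeySym keysym)
 {
   if ((keysym &amp; 0xff000000) == 0x01000000)
     return keysym &amp; 0x00ffffff;
   return -1;
 }
</PRE>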

<P>Several XFree86 team members are trying to work on these issues
with <A HREF="http://www.x.org/">X.Org</A>, which is the official
successor of the X Consortium and the Opengroup as the custodian of
the X11 standards and the sample implementation. But things are moving
rather slowly. Support for UTF8_STRING, UCS keysyms, and ISO10646-1
extensions of the core fonts will hopefully make it into R6.6.1 in
2001-Q4. With regard to the other font related problems, the solution
will probably be to dump the old server-side font mechanisms entirely
and use instead <A HREF="http://xfree86.org/~keithp/">Keith
Packard's</A> new <A HREF="http://xfree86.org/~keithp/render/">X
Render Extension</A>. Another work-in-progress is a new <A
HREF="http://www.xfree86.org/pipermail/i18n/2001-December/002727.html"
>Standard Type Services (ST)</A> framework that Sun has been working
on and plans to donate to XFree86 and X.org soon.

<H2><A NAME="lists">Are there any good mailing lists on these issues?</A></H2>

<P>You should certainly be on the <SAMP>[email protected]</SAMP>
mailing list. That's the place to meet for everyone interested in
working towards better UTF-8 support for GNU/Linux or Unix systems and
applications. To subscribe, send a message to <A
HREF="mailto:[email protected]?Subject=subscribe"
>[email protected]</A> with the subject
<SAMP>subscribe</SAMP>. You can also browse the <A
HREF="http://mail.nl.linux.org/linux-utf8/">linux-utf8 archive</A>.

<P>There is also the <A HREF=
"http://www.unicode.org/unicode/consortium/distlist.html"
><SAMP>[email protected]</SAMP></A> mailing list, which is the best
way of finding out what the authors of the Unicode standard and a lot
of other gurus have to say. To subscribe, send to <A
HREF="mailto:[email protected]">[email protected]</A>
a message with the subject line "subscribe" and the text "subscribe
<VAR>[email protected]</VAR> unicode".

<P>The relevant mailing lists for discussions about Unicode support in
Xlib and the X server are the <A
HREF="http://XFree86.Org/mailman/listinfo/fonts">[email protected]</A>
and <A
HREF="http://XFree86.Org/mailman/listinfo/i18n">[email protected]</A>
mailing lists.

<H2><A NAME="refs">Further References</A></H2>

<UL>

<LI>Bruno Haible's <A HREF=
"ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html">Unicode
HOWTO</A>.

<LI><A
HREF="http://www.amazon.com/exec/obidos/ASIN/0201616335/mgk25">The
Unicode Standard, Version 3.0</A>, Addison-Wesley, 2000. You
definitely should have a copy of the standard if you are doing
anything related to fonts and character sets.

<LI>Ken Lunde's <A
HREF="http://www.amazon.com/exec/obidos/ASIN/1565922247/mgk25"> CJKV
Information Processing</A>, O'Reilly &amp; Associates, 1999. This is
clearly the best book available if you are interested in East Asian
character sets.

<LI><A HREF="http://www.unicode.org/unicode/reports/"
>Unicode Technical Reports</A>

<LI>Mark Davis' <A HREF="http://www.unicode.org/unicode/faq/" >Unicode
FAQ</A>

<LI><A
HREF="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819">ISO/IEC
10646-1:2000</A>

<LI><A HREF="http://people.netscape.com/ftang/i18n.html">Frank Tang's
I�t�rn�ti�n�liz�ti�n Secrets</A>

<LI><A HREF="http://www-106.ibm.com/developerworks/unicode/">IBM's Unicode
Zone</A>

<LI><A HREF=
"http://www.sun.com/software/white-papers/wp-unicode/">Unicode Support
in the Solaris 7 Operating Environment</A>

<LI>The USENIX paper by Rob Pike and Ken Thompson on the <A
HREF="ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz">introduction
of UTF-8 under Plan9</A> reports on the experience gained when <A
HREF="http://plan9.bell-labs.com/plan9dist/">Plan9</A>, back in 1992,
became the first operating system to migrate completely to UTF-8
(which was at the time still called UTF-2). A must-read!

<LI><A HREF="http://www.li18nux.net/">Li18nux</A> is a project
initiated by several Linux distributors to enhance Unicode support for
Linux. It has recently published the <A HREF=
"http://www.li18nux.net/li18nux2k/" >Li18nux 2000 Globalization
Specification</A> as well as some <A
HREF="http://www.li18nux.org/subgroups/utildev/dli18npatch.html">patches</A>.

<LI>The <A HREF="http://www.UNIX-systems.org/online.html">Online
Single Unix Specification</A> contains definitions of all the ISO C
Amendment 1 function, plus extensions such as wcwidth().

<LI>The Open Group's summary of <A
HREF="http://www.unix-systems.org/version2/whatsnew/login_mse.html">ISO
C Amendment 1</A>.

<LI><A HREF="http://sources.redhat.com/glibc/">GNU libc</A>

<LI><A HREF="http://lct.sourceforge.net/">The Linux Console Tools</A>

<LI>The Unicode Consortium <A
HREF="ftp://ftp.unicode.org/Public/UNIDATA/">character database</A>
and <A HREF="ftp://ftp.unicode.org/Public/MAPPINGS/">character set
conversion tables</A> are an essential resource for anyone developing
Unicode related tools.

<LI>Other conversion tables are available from <A HREF=
"http://www.microsoft.com/globaldev/reference/WinCP.asp">Microsoft</A>
and  <A HREF="ftp://dkuug.dk/i18n/WG15-collection/charmaps/">Keld
Simonsen's WG15 archive</A>.

<LI>Michael Everson's <A
HREF="http://www.evertype.com/sc2wg2.html">Unicode and JTC1/SC2/WG2
Archive</A> contains online versions of many of the more recent ISO
10646-1 amendments, plus many other goodies. See also his <A
HREF="http://www.evertype.com/standards/iso10646/ucs-roadmap.html"
>Roadmaps to the Universal Character Set</A>.

<LI>An introduction into <A
HREF="http://www.stadlar.is/TC304/guidecharactersets/guideannexb.html"
>The Universal Character Set (UCS)</A>.

<LI>Otfried Cheong's essay on <A
HREF="http://www.cs.uu.nl/~otfried/Mule/unihan.html">Han Unification
in Unicode</A>

<LI>The <A HREF="http://www.ams.org/STIX/">AMS STIX</A> project is
working on revising and extending the mathematical characters for
Unicode 4.0 and ISO 10646-2.

<LI>Jukka Korpela's <A
HREF="http://www.malibutelecom.com/yucca/shy.html">Soft hyphen (SHY) -
a hard problem?</A> is an excellent discussion of the controversy
surrounding U+00AD.

<LI>James Briggs' <A HREF="http://rf.net/~james/perli18n.html">Perl,
Unicode and I18N FAQ</A>.

<LI>Mark Davis discusses in <A HREF=
"http://www-106.ibm.com/developerworks/library/utfencodingforms/">Forms
of Unicode</A> the tradeoffs between UTF-8, UTF-16, and UCS-4 (now
also called UTF-32 for political reasons).

<LI>Alan Wood has a good page on <A
HREF="http://www.hclrss.demon.co.uk/unicode/">Unicode and Multilingual
Support in Web Browsers and HTML</A>.

<LI><A HREF="http://anubis.dkuug.dk/jtc1/sc22/WG20/docs/projects"
>ISO/JTC1/SC22/WG20</A> produced various Unicode related standards
such as the <A
HREF="http://anubis.dkuug.dk/jtc1/sc22/WG20/docs/projects/n731-fdis14651.pdf"
>International String Ordering (ISO 14651)</A> and the <A
HREF="http://anubis.dkuug.dk/jtc1/sc22/WG20/docs/n690.pdf"
>Cultural Convention Specification TR (ISO TR 14652)</A> (an extension
of the POSIX locale format that covers for example transliteration of
wide character output).

<LI><A HREF="http://www.cse.cuhk.edu.hk/~irg/">ISO/JTC1/SC2/WG2/IRG</A>
(Ideographic Rapporteur Group)

<LI>The <A HREF="http://www.eki.ee/letter/">Letter Database</A>
answers queries on languages, character sets and names, as does the <A
HREF="http://zvon.org/other/charSearch/PHP/search.php">Zvon Character
Search</A>.

<LI>China has specified in <A
HREF="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf"
>GB 18030</A> a new encoding of UCS for use in Chinese government
systems that is backwards-compatible with the widely used GB 2312 and
GBK encodings for Chinese. It seems though that the first version
(released 2000-03) is somewhat buggy and will likely go through a
couple more revisions, so use with care. GB 18030 is probably more of
a temporary migration path to UCS and will probably not survive for
long against UTF-8 or UTF-16, even in Chinese government systems.

<LI><A
HREF="http://www.info.gov.hk/digital21/eng/hkscs/introduction.html">Hong
Kong Supplementary Character Set (HKSCS)</A>

<LI>Proceedings of the International Unicode Conferences: <A
HREF="http://www.unicode.org/iuc/iuc13/papers.html">ICU13</A>, <A
HREF="http://www.unicode.org/iuc/iuc14/papers.html">ICU14</A>, <A
HREF="http://www.unicode.org/iuc/iuc15/papers.html">ICU15</A>, <A
HREF="http://www.unicode.org/iuc/iuc16/papers.html">ICU16</A>, <A
HREF="http://www.unicode.org/iuc/iuc17/papers.html">ICU17</A>, <A
HREF="http://www.unicode.org/iuc/iuc18/papers.html">ICU18</A>, etc.

</UL>

<P>I add new material to this document very frequently, so please
check it regularly or ask <A HREF=
"http://www.netmind.com/URL-minder/new/register.html">Netminder</A> to
notify you of any changes. <A
HREF="mailto:[email protected]">Suggestions</A> for
improvement, as well as advertisement in the freeware community for
better UTF-8 support, are very welcome. UTF-8 use under Linux is quite
new, so expect a lot of progress here in the next few months.

<P>Special thanks to Ulrich Drepper, Bruno Haible, Robert Brady,
Juliusz Chroboczek, Shuhei Amakawa and many others for valuable
comments, and to SuSE GmbH, N&uuml;rnberg, for their support.

<P><A HREF="http://www.cl.cam.ac.uk/~mgk25/">Markus Kuhn</A>
&lt;[email protected]><BR><SMALL>created 1999-06-04 -- last
modified 2002-01-08 --
http://www.cl.cam.ac.uk/~mgk25/unicode.html</SMALL>
</BODY>
</HTML>