Network Working Group                                         M. Crispin
Request for Comments: 5051                      University of Washington
Category: Standards Track                                   October 2007


        i;unicode-casemap - Simple Unicode Collation Algorithm

Status of This Memo

  This document specifies an Internet standards track protocol for the
  Internet community, and requests discussion and suggestions for
  improvements.  Please refer to the current edition of the "Internet
  Official Protocol Standards" (STD 1) for the standardization state
  and status of this protocol.  Distribution of this memo is unlimited.

Abstract

  This document describes "i;unicode-casemap", a simple case-
  insensitive collation for Unicode strings.  It provides equality,
  substring, and ordering operations.

1.  Introduction

  The "i;ascii-casemap" collation described in [COMPARATOR] is quite
  simple to implement and provides case-independent comparisons for the
  26 Latin alphabetics.  It is specified as the default and/or baseline
  comparator in some application protocols, e.g., [IMAP-SORT].

  However, the "i;ascii-casemap" collation does not produce
  satisfactory results with non-ASCII characters.  It is possible, with
  a modest extension, to provide a more sophisticated collation with
  greater multilingual applicability than "i;ascii-casemap".  This
  extension provides case-independent comparisons for a much greater
  number of characters.  It also collates characters with diacriticals
  alongside their non-diacritical forms.

  This collation, "i;unicode-casemap", is intended to be an alternative
  to, and preferred over, "i;ascii-casemap".  It does not replace the
  "i;basic" collation described in [BASIC].

2.  Unicode Casemap Collation Description

  The "i;unicode-casemap" collation is a simple collation which is
  case-insensitive in its treatment of characters.  It provides
  equality, substring, and ordering operations.  The validity test
  operation returns "valid" for any input.





Crispin                     Standards Track                     [Page 1]

RFC 5051                   i;unicode-casemap                October 2007


  This collation allows strings in arbitrary (and mixed) character
  sets, as long as the character set for each string is identified and
  it is possible to convert the string to Unicode.  Strings which have
  an unidentified character set and/or cannot be converted to Unicode
  are not rejected, but are treated as binary.

  Each input string is prepared by converting it to a "titlecased
  canonicalized UTF-8" string according to the following steps, using
  UnicodeData.txt ([UNICODE-DATA]):

     (1) A Unicode codepoint is obtained from the input string.

         (a) If the input string is in a known charset that can be
             converted to Unicode, a sequence in the string's charset
             is read and checked for validity according to the rules of
             that charset.  If the sequence is valid, it is converted
             to a Unicode codepoint.  Note that for input strings in
             UTF-8, the UTF-8 sequence must be valid according to the
             rules of [UTF-8]; e.g., overlong UTF-8 sequences are
             invalid.

         (b) If the input string is in an unknown charset, or an
             invalid sequence occurs in step (1)(a), conversion ceases.
             No further preparation is performed, and any partial
             preparation results are discarded.  The original string is
             used unchanged with the i;octet comparator.

     (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
         are performed on the resulting codepoint from step (1)(a).

         (a) If the codepoint has a titlecase property in
             UnicodeData.txt (this is normally the same as the
             uppercase property), the codepoint is converted to the
             codepoints in the titlecase property.

         (b) If the resulting codepoint from (2)(a) has a decomposition
             property of any type in UnicodeData.txt, the codepoint is
             converted to the codepoints in the decomposition property.
             This step is recursively applied to each of the resulting
             codepoints until no more decomposition is possible
             (effectively Normalization Form KD).

         Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
         has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
         WITH SMALL LETTER Z WITH CARON).  Codepoint U+01C5 has a
         decomposition property of U+0044 (LATIN CAPITAL LETTER D)
         U+017E (LATIN SMALL LETTER Z WITH CARON).  U+017E has a
         decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
         (COMBINING CARON).  None of U+0044, U+007A, or U+030C has a
         decomposition property.  Therefore, U+01C4 is converted to
         U+0044 U+007A U+030C by this step.

     (3) The resulting codepoint(s) from step (2) is/are appended, in
         UTF-8 format, to the "titlecased canonicalized UTF-8" string.

     (4) Repeat from step (1) until there is no more data in the input
         string.
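
  As an illustration only, steps (1) through (4) might be sketched in
  Python as follows.  The names prepare, simple_titlecase, and
  decompose are hypothetical and are not defined by this
  specification.  The sketch assumes Python's standard unicodedata
  module, whose tables follow the Unicode version bundled with the
  interpreter rather than any particular revision of UnicodeData.txt.
  It also approximates the titlecase property with str.title(),
  keeping the original codepoint whenever the result is not exactly
  one codepoint.

```python
import unicodedata
from typing import Optional


def simple_titlecase(ch: str) -> str:
    """Approximate step (2)(a) for a single codepoint.

    Assumption: str.title() applies full case mappings, so any result
    that is not exactly one codepoint is ignored and the input is kept.
    """
    titled = ch.title()
    return titled if len(titled) == 1 else ch


def decompose(ch: str) -> str:
    """Step (2)(b): recursively apply the decomposition property."""
    field = unicodedata.decomposition(ch)
    if not field:
        return ch
    # Skip a leading tag such as "<compat>"; the rest are hex codepoints.
    parts = [chr(int(tok, 16)) for tok in field.split()
             if not tok.startswith("<")]
    return "".join(decompose(part) for part in parts)


def prepare(data: bytes, charset: Optional[str]) -> bytes:
    """Return the prepared UTF-8 string, or the input unchanged (1)(b)."""
    if charset is None:
        return data
    try:
        # Step (1)(a): strict decoding rejects invalid sequences,
        # including overlong UTF-8.
        text = data.decode(charset, errors="strict")
        # Steps (2) and (3): titlecase, decompose, append as UTF-8.
        prepared = "".join(decompose(simple_titlecase(ch)) for ch in text)
        return prepared.encode("utf-8")
    except (LookupError, UnicodeError):
        # Step (1)(b): unknown charset or invalid sequence; discard any
        # partial result and use the original octets.
        return data


print(prepare("\u01c4".encode("utf-8"), "utf-8"))  # b'Dz\xcc\x8c'
print(prepare(b"\xc0\x80", "utf-8"))               # b'\xc0\x80'
```

  The first call reproduces the U+01C4 example from step (2).  The
  second passes an overlong UTF-8 sequence, which is returned
  unchanged for comparison with i;octet.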

  Following the above preparation process on each string, the equality,
  ordering, and substring operations are as for i;octet.
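
  A minimal sketch of the three operations, assuming both operands
  have already been prepared as described above, is shown below in
  Python.  The function names are hypothetical.  Python compares bytes
  objects octet by octet as unsigned values, which is the behavior
  this sketch relies on.

```python
def octet_equal(a: bytes, b: bytes) -> bool:
    """Equality: the two prepared strings are identical octet strings."""
    return a == b


def octet_order(a: bytes, b: bytes) -> int:
    """Ordering: -1, 0, or 1 by octet-wise lexicographic comparison."""
    return (a > b) - (a < b)


def octet_substring(needle: bytes, haystack: bytes) -> bool:
    """Substring: needle occurs as a contiguous run of octets."""
    return needle in haystack


# Prepared forms of "hello" and "Help" (ASCII letters are uppercased).
print(octet_equal(b"HELLO", b"HELLO"))    # True
print(octet_order(b"HELLO", b"HELP"))     # -1
print(octet_substring(b"ELL", b"HELLO"))  # True
```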

  It is permitted to use an alternative implementation of the above
  preparation process if it produces the same results.  For example, it
  may be more convenient for an implementation to convert all input
  strings to a sequence of UTF-16 or UTF-32 values prior to performing
  any of the step (2) actions.  Similarly, if all input strings are (or
  are convertible to) Unicode, it may be possible to use UTF-32 as an
  alternative to UTF-8 in step (3).

     Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
     because UTF-16 surrogates will cause i;octet to collate codepoints
     U+E000 through U+FFFF after non-BMP codepoints.
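
  The effect of the encoding choice on octet ordering can be checked
  with a short Python sketch.  The helper name octet_before is
  hypothetical.  It compares U+E000, a BMP codepoint above the
  surrogate range, with U+10000, the first non-BMP codepoint.

```python
bmp_high = "\ue000"      # BMP codepoint above the surrogate range
non_bmp = "\U00010000"   # first non-BMP codepoint


def octet_before(a: str, b: str, encoding: str) -> bool:
    """True if a sorts before b when compared octet by octet."""
    return a.encode(encoding) < b.encode(encoding)


# UTF-8 and UTF-32 preserve codepoint order under octet comparison.
print(octet_before(bmp_high, non_bmp, "utf-8"))      # True
print(octet_before(bmp_high, non_bmp, "utf-32-be"))  # True
# UTF-16 encodes U+10000 with surrogates starting at 0xD8, which sort
# before the 0xE0 lead octet of U+E000.
print(octet_before(bmp_high, non_bmp, "utf-16-be"))  # False
```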

  This collation is not locale sensitive.  Consequently, care should be
  taken when using OS-supplied functions to implement this collation.
  Functions such as strcasecmp and toupper are sometimes locale
  sensitive and may inconsistently casemap letters.

  The i;unicode-casemap collation is well suited to use with many
  Internet protocols and computer languages.  Use with natural language
  is often inappropriate; even though the collation apparently supports
  languages such as Swahili and English, in real-world use it tends to
  mis-sort a number of types of string:

  o  people and place names containing scripts that are not collated
     according to "alphabetical order".
  o  words with characters that have diacriticals.  However,
     i;unicode-casemap generally does a better job than i;ascii-casemap
     for most (but not all) languages.  For example, German umlaut
     letters will sort correctly, but some Scandinavian letters will
     not.
  o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
     in English).
  o  strings containing other non-letter symbols; e.g., euro and pound
     sterling symbols, quotation marks other than '"', dashes/hyphens,
     etc.

3.  Unicode Casemap Collation Registration

  <?xml version='1.0'?>
  <!DOCTYPE collation SYSTEM 'collationreg.dtd'>
  <collation rfc="5051" scope="global" intendedUse="common">
  <identifier>i;unicode-casemap</identifier>
  <title>Unicode Casemap</title>
  <operations>equality order substring</operations>
  <specification>RFC 5051</specification>
  <owner>IETF</owner>
  <submitter>[email protected]</submitter>
  </collation>

4.  Security Considerations

  The security considerations for [UTF-8], [STRINGPREP], and
  [UNICODE-SECURITY] apply and are normative to this specification.

  The results from this comparator will vary depending upon the
  implementation for several reasons.  Implementations MUST consider
  whether these possibilities are a problem for their use case:

  1) New characters added in Unicode may have decomposition or
     titlecase properties that will not be known to an implementation
     based upon an older revision of Unicode.  This impacts step (2).

  2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
     does not require normalization of out-of-order diacriticals.
     However, an implementation MAY use an NFKD library routine that
     does such normalization.  This impacts step (2)(b) and possibly
     also step (1)(a), and is an issue only with ill-formed UTF-8
     input.

  3) The set of charsets handled in step (1)(a) is open-ended.  UTF-8
     (and, by extension, US-ASCII) are the only mandatory-to-implement
     charsets.  This impacts step (1)(a).

     Implementations SHOULD, as far as feasible, support all the
     charsets they are likely to encounter in the input data, in order
     to avoid poor collation caused by the fall through to the (1)(b)
     rule.

  4) Other charsets may have revisions which add new characters that
     are not known to an implementation based upon an older revision.
     This impacts step (1)(a) and possibly also step (1)(b).
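
  The difference described in item 2) above can be illustrated with a
  Python sketch.  The function name step_2b is hypothetical, and the
  sketch assumes Python's standard unicodedata module.  The sample
  string places U+0301 (COMBINING ACUTE ACCENT, combining class 230)
  before U+0323 (COMBINING DOT BELOW, combining class 220), which is
  not canonical order.

```python
import unicodedata


def step_2b(text: str) -> str:
    """Recursive decomposition only, with no reordering of marks."""
    def decompose(ch: str) -> str:
        field = unicodedata.decomposition(ch)
        if not field:
            return ch
        parts = [chr(int(tok, 16)) for tok in field.split()
                 if not tok.startswith("<")]
        return "".join(decompose(part) for part in parts)
    return "".join(decompose(ch) for ch in text)


sample = "a\u0301\u0323"  # diacriticals in non-canonical order

recursive = step_2b(sample)
library = unicodedata.normalize("NFKD", sample)

print(recursive == sample)          # True: order left as given
print(library == "a\u0323\u0301")   # True: marks reordered by NFKD
print(recursive.encode("utf-8") == library.encode("utf-8"))  # False
```

  Two implementations that differ in this way will produce different
  prepared strings for such input, and therefore different comparison
  results.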

  An attacker may create input that is ill-formed or in an unknown
  charset, with the intention of impacting the results of this
  comparator or exploiting other parts of the system which process this
  input in different ways.  Note, however, that even well-formed data
  in a known charset can impact the result of this comparator in
  unexpected ways.  For example, an attacker can substitute U+0041
  (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
  U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
  non-match of strings which visually appear the same and/or causing
  the string to appear elsewhere in a sort.
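
  The substitution example above can be checked with a short Python
  sketch, assuming the standard unicodedata module.  None of the three
  codepoints has a decomposition property, and each is unchanged by
  titlecasing, so their prepared forms are simply their UTF-8
  encodings and remain distinct.

```python
import unicodedata

# LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA,
# CYRILLIC CAPITAL LETTER A
lookalikes = ["\u0041", "\u0391", "\u0410"]

for ch in lookalikes:
    # Preparation leaves each codepoint unchanged.
    assert unicodedata.decomposition(ch) == ""
    assert ch.title() == ch

prepared = [ch.encode("utf-8") for ch in lookalikes]
print(prepared)            # [b'A', b'\xce\x91', b'\xd0\x90']
print(len(set(prepared)))  # 3
```

  Because the octet strings differ, the three visually similar
  characters neither compare equal nor sort together.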

5.  IANA Considerations

  The i;unicode-casemap collation defined in section 2 has been added
  to the registry of collations defined in [COMPARATOR].

6.  Normative References

  [COMPARATOR]          Newman, C., Duerst, M., and A. Gulbrandsen,
                        "Internet Application Protocol Collation
                        Registry", RFC 4790, February 2007.

  [STRINGPREP]          Hoffman, P. and M. Blanchet, "Preparation of
                        Internationalized Strings ("stringprep")", RFC
                        3454, December 2002.

  [UTF-8]               Yergeau, F., "UTF-8, a transformation format of
                        ISO 10646", STD 63, RFC 3629, November 2003.

  [UNICODE-DATA]        <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>

                        Although the UnicodeData.txt file referenced
                        here is part of the Unicode standard, it is
                        subject to change as new characters are added
                        to Unicode and errors are corrected in Unicode
                        revisions.  As a result, it may be less stable
                        than might otherwise be implied by the
                        standards status of this specification.

  [UNICODE-SECURITY]    Davis, M. and M. Suignard, "Unicode Security
                        Considerations", February 2006,
                        <http://www.unicode.org/reports/tr36/>.

7.  Informative References

  [BASIC]               Newman, C., Duerst, M., and A. Gulbrandsen,
                        "i;basic - the Unicode Collation Algorithm",
                        Work in Progress, March 2007.

  [IMAP-SORT]           Crispin, M. and K. Murchison, "Internet Message
                        Access Protocol - SORT and THREAD Extensions",
                        Work in Progress, September 2007.

Author's Address

  Mark R. Crispin
  Networks and Distributed Computing
  University of Washington
  4545 15th Avenue NE
  Seattle, WA  98105-4527

  Phone: +1 (206) 543-5762
  EMail: [email protected]

Full Copyright Statement

  Copyright (C) The IETF Trust (2007).

  This document is subject to the rights, licenses and restrictions
  contained in BCP 78, and except as set forth therein, the authors
  retain all their rights.

  This document and the information contained herein are provided on an
  "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
  OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
  THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
  OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
  THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
  WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

  The IETF takes no position regarding the validity or scope of any
  Intellectual Property Rights or other rights that might be claimed to
  pertain to the implementation or use of the technology described in
  this document or the extent to which any license under such rights
  might or might not be available; nor does it represent that it has
  made any independent effort to identify any such rights.  Information
  on the procedures with respect to rights in RFC documents can be
  found in BCP 78 and BCP 79.

  Copies of IPR disclosures made to the IETF Secretariat and any
  assurances of licenses to be made available, or the result of an
  attempt made to obtain a general license or permission for the use of
  such proprietary rights by implementers or users of this
  specification can be obtained from the IETF on-line IPR repository at
  http://www.ietf.org/ipr.

  The IETF invites any interested party to bring to its attention any
  copyrights, patents or patent applications, or other proprietary
  rights that may cover technology that may be required to implement
  this standard.  Please address the information to the IETF at
  [email protected].