<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>HTML Document Representation</title>
<link rel="previous" href="conform.html">
<link rel="next" href="types.html">
<link rel="contents" href="cover.html#toc">
<link rel="stylesheet" type="text/css" href=
"
http://www.w3.org/StyleSheets/TR/W3C-REC">
<link rel="STYLESHEET" href="style/default.css" type="text/css">
</head>
<body>
<div class="navbar" align="center"> <a href="conform.html">previous</a>
<a href="types.html">next</a> <a href="cover.html#minitoc">
contents</a> <a href="index/elements.html">elements</a> <a href=
"index/attributes.html">attributes</a> <a href="index/list.html">
index</a>
<hr></div>
<h1 align="center"><a name="h-5">5</a> HTML Document Representation</h1>
<div class="subtoc">
<p><strong>Contents</strong></p>
<ol>
<li><a class="tocxref" href="#h-5.1">The Document Character Set</a></li>
<li><a class="tocxref" href="#h-5.2">Character encodings</a>
<ol>
<li><a class="tocxref" href="#h-5.2.1">Choosing an encoding</a>
<ul>
<li><a class="tocxref" href="#h-5.2.1.1">Notes on specific encodings</a></li>
</ul>
</li>
<li><a class="tocxref" href="#h-5.2.2">Specifying the character
encoding</a></li>
</ol>
</li>
<li><a class="tocxref" href="#h-5.3">Character references</a>
<ol>
<li><a class="tocxref" href="#h-5.3.1">Numeric character references</a></li>
<li><a class="tocxref" href="#h-5.3.2">Character entity references</a></li>
</ol>
</li>
<li><a class="tocxref" href="#h-5.4">Undisplayable characters</a></li>
</ol>
</div>
<p>In this chapter, we discuss how HTML documents are represented on a computer
and over the Internet.</p>
<p>The section on the <a href="#doc-char-set">document character set</a>
addresses the issue of what abstract <em>characters</em> may be part of an HTML
document. Characters include the Latin letter "A", the Cyrillic letter "I", the
Chinese character meaning "water", etc.</p>
<p>The section on <a href="#encodings">character encodings</a> addresses the
issue of how those characters may be <em>represented</em> in a file or when
transferred over the Internet. As some character encodings cannot directly
represent all characters an author may want to include in a document, HTML
offers other mechanisms, called <a href="#entities">character references</a>,
for referring to any character.</p>
<p>Since there are a great number of characters throughout human languages, and
a great variety of ways to represent those characters, proper care must be
taken so that documents may be understood by user agents around the world.</p>
<h2><a name="h-5.1">5.1</a> <a name="doc-char-set">The Document Character
Set</a></h2>
<p>To promote interoperability, SGML requires that each application (including
HTML) specify its <span class="index-def" title="document character
set|SGML::document character set"><a name="didx-document_character_set"><dfn>
document character set.</dfn></a></span> A document character set consists
of:</p>
<ul>
<li><span class="index-def" title="character repertoire"><a name="repertoire">
<dfn>A Repertoire</dfn></a></span>: A set of abstract <span class="index-inst"
title="characters::abstract"><a name="idx-characters">characters,</a></span>,
such as the Latin letter "A", the Cyrillic letter "I", the Chinese character
meaning "water", etc.</li>
<li><span class="index-def" title="code position"><a name="code-position"><dfn>
Code positions</dfn></a></span>: A set of integer references to characters in
the repertoire.</li>
</ul>
<p>Each SGML document (including each HTML document) is a sequence of
characters from the repertoire. Computer systems identify each character by its
code position; for example, in the ASCII character set, code positions 65, 66,
and 67 refer to the characters 'A', 'B', and 'C', respectively.</p>
<p>The ASCII character set is not sufficient for a global information system
such as the Web, so HTML uses the much more complete character set called the
<span class="index-inst" title="universal character set"><a name=
"idx-universal_character_set"><i>Universal Character Set (UCS)</i>,</a></span>
defined in <span class="index-inst" title="document character set::ISO10646"><a
name="idx-document_character_set-1" rel="biblioentry" href=
"./references.html#ref-ISO10646" class="normref">[ISO10646].</a></span> This
standard defines a repertoire of thousands of characters used by communities
all over the world.</p>
<p>The character set defined in <a rel="biblioentry" href=
"./references.html#ref-ISO10646" class="normref">[ISO10646]</a> is
character-by-character <span class="index-inst" title="document character
set::equivalence of ISO10646 and UNICODE"><a name=
"idx-document_character_set-2">equivalent</a></span> to Unicode (<a rel=
"biblioentry" href="./references.html#ref-UNICODE" class=
"normref">[UNICODE]</a>). Both of these standards are updated from time to time
with new characters, and the amendments should be consulted at the respective
Web sites. In the current specification, "[ISO10646]" is used to refer to the
document character set while "[UNICODE]" is reserved for references to the
Unicode <a href="struct/dirlang.html#bidirection">bidirectional text
algorithm.</a></p>
<p>The document character set, however, does not suffice to allow user agents
to correctly interpret HTML documents as they are typically exchanged --
encoded as a sequence of bytes in a file or during a network transmission. User
agents must also know the specific <a href="#encodings">character encoding</a>
that was used to transform the document character stream into a byte
stream.</p>
<h2><a name="h-5.2">5.2</a> <a name="encodings">Character encodings</a></h2>
<p>What this specification calls a <span class="index-def" title="character
encoding"><a name="didx-character_encoding"><dfn>character
encoding</dfn></a></span> is known by different names in other specifications
(which may cause some confusion). However, the concept is largely the same
across the Internet. Also, protocol headers, attributes, and parameters
referring to character encodings share the same name -- "charset" -- and use
the same values from the <a href="references.html#ref-IANA" class="normref">
[IANA]</a> registry (see <a rel="biblioentry" href=
"./references.html#ref-CHARSETS" class="informref">[CHARSETS]</a> for a
complete list).</p>
<p>The "charset" parameter identifies a character encoding, which is a method
of converting a sequence of bytes into a sequence of characters. This
conversion fits naturally with the scheme of Web activity: servers send HTML
documents to user agents as a stream of bytes; user agents interpret them as a
sequence of characters. The conversion method can range from simple one-to-one
correspondence to complex switching schemes or algorithms.</p>
<p>A simple one-byte-per-character encoding technique is not sufficient for
text strings over a character repertoire as large as <a rel="biblioentry" href=
"./references.html#ref-ISO10646" class="normref">[ISO10646]</a>. There are
several different encodings of parts of <a rel="biblioentry" href=
"./references.html#ref-ISO10646" class="normref">[ISO10646]</a> in addition to
encodings of the entire character set (such as UCS-4).</p>
<h3><a name="h-5.2.1">5.2.1</a> <span class="index-inst" title="character
encoding::choice of"><a name="idx-character_encoding-1">Choosing an
encoding</a></span></h3>
<p>Authoring tools (e.g., text editors) may encode HTML documents in the
character encoding of their choice, and the choice largely depends on the
conventions used by the system software. These tools may employ any convenient
encoding that covers most of the characters contained in the document, provided
the encoding is <a href="#spec-char-encoding">correctly labeled.</a> Occasional
characters that fall outside this encoding may still be represented by <a href=
"#entities">character references</a>. These always refer to the document
character set, not the character encoding.</p>
<p>Servers and proxies may change a character encoding (called <em>
transcoding</em>) on the fly to meet the requests of user agents (see section
14.2 of <a rel="biblioentry" href="./references.html#ref-RFC2616" class=
"normref">[RFC2616]</a>, the "Accept-Charset" HTTP request header). Servers and
proxies do not have to serve a document in a character encoding that covers the
entire document character set.</p>
<p><span class="index-inst" title="character encoding::common examples"><a
name="idx-character_encoding-2">Commonly</a></span> used character encodings on
the Web include ISO-8859-1 (also referred to as "Latin-1"; usable for most
Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a
Japanese encoding), EUC-JP (another Japanese encoding), and UTF-8 (an encoding
of ISO 10646 using a different number of bytes for different characters). Names
for character encodings are case-insensitive, so that for example "SHIFT_JIS",
"Shift_JIS", and "shift_jis" are equivalent.</p>
<p>This specification does not mandate which character encodings a user agent
must support.</p>
<p><a href="conform.html#conformance">Conforming user agents</a> must correctly
map to ISO 10646 all characters in any character encodings that they recognize
(or they must behave as if they did).</p>
<h4>Notes on specific encodings<a name="h-5.2.1.1"> </a></h4>
<p>When HTML text is transmitted in <span class="index-inst" title="character
encoding::UTF-16|UTF-16"><a name="idx-character_encoding-3">UTF-16</a></span>
(charset=UTF-16), text data should be transmitted in network byte order
("big-endian", high-order byte first) in accordance with <a rel="biblioentry"
href="./references.html#ref-ISO10646" class="normref">[ISO10646]</a>, Section
6.3 and <a rel="biblioentry" href="./references.html#ref-UNICODE" class=
"normref">[UNICODE]</a>, clause C3, page 3-1.</p>
<p>Furthermore, to maximize chances of proper interpretation, it is recommended
that documents transmitted as UTF-16 always begin with a ZERO-WIDTH
NON-BREAKING SPACE character (hexadecimal FEFF, also called Byte Order Mark
(BOM)) which, when byte-reversed, becomes hexadecimal FFFE, a character
guaranteed never to be assigned. Thus, a user-agent receiving a hexadecimal
FFFE as the first bytes of a text would know that bytes have to be reversed for
the remainder of the text.</p>
<p>The <span class="index-inst" title="character encoding::UTF-1|UTF-1"><a
name="idx-character_encoding-4">UTF-1</a></span> transformation format of <a
rel="biblioentry" href="./references.html#ref-ISO10646" class="normref">
[ISO10646]</a> (registered by IANA as ISO-10646-UTF-1), should not be used. For
information about ISO 8859-8 and the bidirectional algorithm, please consult
the section on <a href="struct/dirlang.html#bidi88598">bidirectionality and
character encoding</a>.</p>
<h3><a name="h-5.2.2">5.2.2</a> <span class="index-inst" title="character
encoding::specification of"><a name="spec-char-encoding">Specifying the
character encoding</a></span></h3>
<p>How does a server determine which character encoding applies for a document
it serves? Some servers examine the first few bytes of the document, or check
against a database of known files and encodings. Many modern servers give Web
masters more control over charset configuration than old servers do. Web
masters should use these mechanisms to send out a "charset" parameter whenever
possible, but should take care not to identify a document with the wrong
"charset" parameter value.</p>
<p>How does a user agent know which character encoding has been used? The
server should provide this information. The most straightforward way for a
server to inform the user agent about the character encoding of the document is
to use the "charset" parameter of the <span class="index-inst" title=
"HTTP::Content-Type header|Content-Type header"><a name="idx-HTTP">
"Content-Type"</a></span> header field of the HTTP protocol (<a rel=
"biblioentry" href="./references.html#ref-RFC2616" class=
"normref">[RFC2616]</a>, sections 3.4 and 14.17) For example, the following
HTTP header announces that the character encoding is EUC-JP:</p>
<pre>
Content-Type: text/html; charset=EUC-JP
</pre>
<p>Please consult the section on <a href="conform.html">conformance</a> for the
definition of <a href="conform.html#text-html">text/html</a>.</p>
<p>The HTTP protocol (<a rel="biblioentry" href="./references.html#ref-RFC2616"
class="normref">[RFC2616]</a>, section 3.7.1) mentions ISO-8859-1 as a default
character encoding when the "charset" parameter is absent from the
"Content-Type" header field. In practice, this recommendation has proved
useless because some servers don't allow a "charset" parameter to be sent, and
others may not be configured to send the parameter. Therefore, user agents must
not assume any default value for the "charset" parameter.</p>
<p>To address server or configuration limitations, HTML documents may include
explicit information about the document's character encoding; the <a href=
"struct/global.html#edef-META" class="noxref"><samp class="einst">
META</samp></a> element can be used to provide user agents with this
information.</p>
<p>For example, to specify that the character encoding of the current document
is "EUC-JP", a document should include the following <a href=
"struct/global.html#edef-META" class="noxref"><samp class="einst">
META</samp></a> declaration:</p>
<pre class="example">
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
</pre>
<p>The <a href="struct/global.html#edef-META" class="noxref"><samp class=
"einst">META</samp></a> declaration must only be used when the character
encoding is organized such that ASCII-valued bytes stand for ASCII characters
(at least until the <a href="struct/global.html#edef-META" class="noxref"><samp
class="einst">META</samp></a> element is parsed). <a href=
"struct/global.html#edef-META" class="noxref"><samp class="einst">
META</samp></a> declarations should appear as early as possible in the <a href=
"struct/global.html#edef-HEAD" class="noxref"><samp class="einst">
HEAD</samp></a> element.</p>
<p>For cases where neither the HTTP protocol nor the <a href=
"struct/global.html#edef-META" class="noxref"><samp class="einst">
META</samp></a> element provides information about the character encoding of a
document, HTML also provides the <a href="struct/links.html#adef-charset"
class="noxref"><samp class="ainst">charset</samp></a> attribute on several
elements. By combining these mechanisms, an author can greatly improve the
chances that, when the user retrieves a resource, the user agent will recognize
the character encoding.</p>
<p>To sum up, conforming user agents must observe the following <span class=
"index-inst" title="character encoding::user agent's determination of"><a name=
"idx-character_encoding-6">priorities</a></span> when determining a document's
<span class="index-inst" title="character encoding::default|default::character
encoding"><a name="idx-character_encoding-7">character encoding</a></span>
(from highest priority to lowest):</p>
<ol>
<li>An HTTP "charset" parameter in a "Content-Type" field.</li>
<li>A <a href="struct/global.html#edef-META" class="noxref"><samp class=
"einst">META</samp></a> declaration with "http-equiv" set to "Content-Type" and
a value set for "charset".</li>
<li>The <a href="struct/links.html#adef-charset" class="noxref"><samp class=
"ainst">charset</samp></a> attribute set on an element that designates an
external resource.</li>
</ol>
<p>In addition to this list of priorities, the user agent may use heuristics
and user settings. For example, many user agents use a heuristic to distinguish
the various encodings used for Japanese text. Also, user agents typically have
a user-definable, local default character encoding which they apply in the
absence of other indicators.</p>
<p>User agents may provide a mechanism that allows users to override incorrect
"charset" information. However, if a user agent offers such a mechanism, it
should only offer it for browsing and not for editing, to avoid the creation of
Web pages marked with an incorrect "charset" parameter.</p>
<div class="note">
<p><em><strong>Note.</strong> If, for a specific application, it becomes
necessary to refer to characters outside <a rel="biblioentry" href=
"./references.html#ref-ISO10646" class="normref">[ISO10646]</a>, characters
should be assigned to a private zone to avoid conflicts with present or future
versions of the standard. This is highly discouraged, however, for reasons of
portability.</em></p>
</div>
<h2><a name="h-5.3">5.3</a> <a name="entities">Character references</a></h2>
<p>A given character encoding may not be able to express all characters of the
document character set. For such encodings, or when hardware or software
configurations do not allow users to input some document characters directly,
authors may use SGML <span class="index-def" title="character reference"><a
name="didx-character_reference">character references.</a></span> Character
references are a character encoding-independent mechanism for entering any
character from the document character set.</p>
<p>Character references in HTML may appear in two forms:</p>
<ul>
<li>Numeric character references (either decimal or hexadecimal).</li>
<li>Character entity references.</li>
</ul>
<p>Character references within <span class="index-inst" title=
"comments::character references in"><a name="idx-comments">comments</a></span>
have no special meaning; they are comment data only.</p>
<div class="note">
<p><em><strong>Note.</strong> HTML provides other ways to present character
data, in particular <a href="./struct/objects.html">inline images</a>.</em></p>
</div>
<div class="note">
<p><em><strong>Note.</strong> In SGML, it is possible to eliminate the final
";" after a character reference in some cases (e.g., at a line break or
immediately before a tag). In other circumstances it may not be eliminated
(e.g., in the middle of a word). We strongly suggest using the ";" in all cases
to avoid problems with user agents that require this character to be
present.</em></p>
</div>
<h3><a name="h-5.3.1">5.3.1</a> Numeric character references</h3>
<p><span class="index-def" title="numeric character reference"><a name=
"didx-numeric_character_reference"><dfn>Numeric character
references</dfn></a></span> specify the <a href="#code-position">code
position</a> of a character in the document character set. Numeric character
references may take two forms:</p>
<ul>
<li>The syntax "&#<em>D</em>;", where <em>D</em> is a decimal number,
refers to the ISO 10646 decimal character number <em>D</em>.</li>
<li>The syntax "&#x<em>H</em>;" or "&#X<em>H</em>;", where <em>H</em>
is a hexadecimal number, refers to the ISO 10646 hexadecimal character number
<em>H</em>. Hexadecimal numbers in numeric character references are <span
class="index-inst" title="case::of numeric character references"><a name=
"idx-case">case-insensitive.</a></span></li>
</ul>
<p>Here are some examples of numeric character references:</p>
<div class="example">
<ul>
<li>&#229; (in decimal) represents the letter "a" with a small circle above
it (used, for example, in Norwegian).</li>
<li>&#xE5; (in hexadecimal) represents the same character.</li>
<li>&#Xe5; (in hexadecimal) represents the same character as well.</li>
<li>&#1048; (in decimal) represents the Cyrillic capital letter "I".</li>
<li>&#x6C34; (in hexadecimal) represents the Chinese character for
water.</li>
</ul>
</div>
<div class="note">
<p><em><strong>Note.</strong> Although the hexadecimal representation is not
defined in <a rel="biblioentry" href="./references.html#ref-ISO8879" class=
"normref">[ISO8879]</a>, it is expected to be in the revision, as described in
<a rel="biblioentry" href="./references.html#ref-WEBSGML" class="normref">
[WEBSGML]</a>. This convention is particularly useful since character standards
generally use hexadecimal representations.</em></p>
</div>
<h3><a name="h-5.3.2">5.3.2</a> Character entity references</h3>
<p>In order to give authors a more intuitive way of referring to characters in
the document character set, HTML offers a set of <span class="index-def" title=
"character entity references"><a name="didx-character_entity_references"><dfn>
character entity references.</dfn></a></span> Character entity references use
symbolic names so that authors need not remember <a href="#code-position">code
positions.</a> For example, the character entity reference &aring; refers
to the lowercase "a" character topped with a ring; "&aring;" is easier to
remember than &#229;.</p>
<p>HTML 4 does not define a character entity reference for every character in
the document character set. For instance, there is no character entity
reference for the Cyrillic capital letter "I". Please consult the <a href=
"./sgml/entities.html">full list of character references</a> defined in HTML
4.</p>
<p>Character entity references are <span class="index-inst" title="case::of
character entity reference"><a name="idx-case-1">case-sensitive.</a></span>
Thus, &Aring; refers to a different character (uppercase A, ring) than
&aring; (lowercase a, ring).</p>
<p>Four character entity references deserve special mention since they are
frequently used to escape special characters:</p>
<ul>
<li>"&lt;" represents the < sign.</li>
<li>"&gt;" represents the > sign.</li>
<li>"&amp;" represents the & sign.</li>
<li>"&quot; represents the " mark.</li>
</ul>
<p>Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning of a tag
(start tag open delimiter). Similarly, authors should use "&gt;" (ASCII
decimal 62) in text instead of ">" to avoid problems with older user agents
that incorrectly perceive this as the end of a tag (tag close delimiter) when
it appears in quoted attribute values.</p>
<p>Authors should use "&amp;" (ASCII decimal 38) instead of "&" to
avoid confusion with the beginning of a character reference (entity reference
open delimiter). Authors should also use "&amp;" in attribute values since
character references are allowed within <a href="types.html#type-cdata">
CDATA</a> attribute values.</p>
<p>Some authors use the character entity reference "&quot;" to encode
instances of the double quote mark (") since that character may be used to
delimit attribute values.</p>
<h2><a name="h-5.4">5.4</a> <a name="undisplayable">Undisplayable
characters</a></h2>
<p>A user agent may not be able to <span class="index-inst" title=
"characters::rendering undisplayable"><a name="idx-characters-1">
render</a></span> all characters in a document meaningfully, for instance,
because the user agent lacks a suitable font, a character has a value that may
not be expressed in the user agent's internal character encoding, etc.</p>
<p>Because there are many different things that may be done in such cases, this
document does not prescribe any specific behavior. Depending on the
implementation, <span class="index-inst" title="characters::handling
undisplayable"><a name="idx-characters-2">undisplayable characters</a></span>
may also be handled by the underlying display system and not the application
itself. In the absence of more sophisticated behavior, for example tailored to
the needs of a particular script or language, we recommend the following
behavior for user agents:</p>
<ol>
<li>Adopt a clearly visible, but unobtrusive mechanism to alert the user of
missing resources.</li>
<li>If missing characters are presented using their numeric representation, use
the hexadecimal (not decimal) form since this is the form used in character set
standards.</li>
</ol>
<div class="navbar" align="center">
<hr><a href="conform.html">previous</a> <a href="types.html">next</a>
<a href="cover.html#minitoc">contents</a> <a href=
"index/elements.html">elements</a> <a href="index/attributes.html">
attributes</a> <a href="index/list.html">index</a></div>
</body>
</html>