libgrapheme.sh - libgrapheme - unicode string library
git clone git://git.suckless.org/libgrapheme
---
libgrapheme.sh (6347B)
---
cat << EOF
.Dd ${MAN_DATE}
.Dt LIBGRAPHEME 7
.Os suckless.org
.Sh NAME
.Nm libgrapheme
.Nd unicode string library
.Sh SYNOPSIS
.In grapheme.h
.Sh DESCRIPTION
The
.Nm
library provides functions to properly handle Unicode strings according
to the Unicode specification in regard to character, word, sentence and
line segmentation and case detection and conversion.
.Pp
Unicode strings are made up of user-perceived characters (so-called
.Dq grapheme clusters ,
see
.Sx MOTIVATION )
that are composed of one or more Unicode codepoints, which in turn
are encoded in one or more bytes in an encoding like UTF-8.
.Pp
There is a widespread misconception that it is enough to simply
determine codepoints in a string and treat them as user-perceived
characters to be Unicode compliant.
While this may work in some cases, this assumption quickly breaks down,
especially for non-Western languages and decomposed Unicode strings
where user-perceived characters are usually represented using multiple
codepoints.
.Pp
Despite this complicated multilevel structure of Unicode strings,
.Nm
provides methods to work with them at the byte level (i.e. UTF-8
.Sq char
arrays) while also offering codepoint-level methods.
Additionally, it is a
.Dq freestanding
library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
a standard library.
This makes it easy to use in bare metal environments.
.Pp
Every documented function's manual page provides a self-contained
example illustrating the possible usage.
.Sh SEE ALSO
.Xr grapheme_decode_utf8 3 ,
.Xr grapheme_encode_utf8 3 ,
.Xr grapheme_is_character_break 3 ,
.Xr grapheme_is_lowercase 3 ,
.Xr grapheme_is_lowercase_utf8 3 ,
.Xr grapheme_is_titlecase 3 ,
.Xr grapheme_is_titlecase_utf8 3 ,
.Xr grapheme_is_uppercase 3 ,
.Xr grapheme_is_uppercase_utf8 3 ,
.Xr grapheme_next_character_break 3 ,
.Xr grapheme_next_character_break_utf8 3 ,
.Xr grapheme_next_line_break 3 ,
.Xr grapheme_next_line_break_utf8 3 ,
.Xr grapheme_next_sentence_break 3 ,
.Xr grapheme_next_sentence_break_utf8 3 ,
.Xr grapheme_next_word_break 3 ,
.Xr grapheme_next_word_break_utf8 3 ,
.Xr grapheme_to_lowercase 3 ,
.Xr grapheme_to_lowercase_utf8 3 ,
.Xr grapheme_to_titlecase 3 ,
.Xr grapheme_to_titlecase_utf8 3 ,
.Xr grapheme_to_uppercase 3 ,
.Xr grapheme_to_uppercase_utf8 3
.Sh STANDARDS
.Nm
is compliant with the Unicode ${UNICODE_VERSION} specification.
.Sh MOTIVATION
The idea behind every character encoding scheme like ASCII or Unicode
is to express abstract characters (which can be thought of as shapes
making up a written language).
ASCII, for instance, which comprises the range 0 to 127, assigns the
number 65 (0x41) to the abstract character
.Sq A .
This number is called a
.Dq codepoint ,
and all codepoints of an encoding make up its so-called
.Dq code space .
.Pp
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
first 128 codepoints are identical to ASCII's.
The additional codepoints are needed as Unicode's goal is to express
all writing systems of the world.
To give an example, the abstract character
.Sq \[u00C4]
is not expressible in ASCII, given no ASCII codepoint has been assigned
to it.
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
.Pp
One may assume that this process is straightforward, but as more and
more codepoints were assigned to abstract characters, the Unicode
Consortium (which defines the Unicode standard) faced a problem:
Many (mostly non-European) languages have such a large number of
abstract characters that it would exhaust the available Unicode code
space if one tried to assign a codepoint to each abstract character.
The solution to that problem is best introduced with an example: Consider
the abstract character
.Sq \[u01DE] ,
which is
.Sq A
with an umlaut and a macron added to it.
In this sense, one can consider
.Sq \[u01DE]
as a two-fold modification (namely
.Dq add umlaut
and
.Dq add macron )
of the
.Dq base character
.Sq A .
.Pp
The Unicode Consortium adopted this idea by assigning codepoints to
modifications.
For example, the codepoint 0x308 represents adding an umlaut and 0x304
represents adding a macron, and thus, the codepoint sequence
.Dq 0x41 0x308 0x304 ,
namely the base character
.Sq A
followed by the umlaut and macron modifiers, represents the abstract
character
.Sq \[u01DE] .
As a side note, the single codepoint 0x1DE was also assigned to
.Sq \[u01DE] ,
which is a good example of the fact that there can be multiple
representations of a single abstract character in Unicode.
.Pp
Expressing a single abstract character with multiple codepoints solved
the code space exhaustion problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.).
A sequence (which can also have the length 1) of codepoints that belong
together this way and represent an abstract character is called a
.Dq grapheme cluster .
.Pp
In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string.
A good example of this is a terminal text editor, which needs to
properly align characters on a grid.
This is pretty simple with ASCII strings, where you just count the number
of bytes (as each byte is a codepoint and each codepoint is a grapheme
cluster).
With Unicode strings, it is a common mistake to simply adapt the
ASCII approach and count the number of codepoints.
This is wrong, as, for example, the sequence
.Dq 0x41 0x308 0x304 ,
while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
.Sq \[u01DE] .
.Pp
The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29).
It is based on a complex ruleset and lookup tables and determines if a
grapheme cluster ends or is continued between two codepoints.
Libraries like ICU and libunistring, which also offer this functionality,
are often bloated, incorrect, difficult to use or not reasonably
statically linkable.
.Pp
Analogously, the standard provides algorithms to separate strings by
words, sentences and lines, convert cases and compare strings.
The motivation behind
.Nm
is to make Unicode handling suck less and abide by the UNIX philosophy.
.Sh AUTHORS
.An Laslo Hunhold Aq Mt [email protected]
EOF