LIBGRAPHEME(7) - Miscellaneous Information Manual

# NAME

**libgrapheme** - unicode string library

# SYNOPSIS

**#include <grapheme.h>**

# DESCRIPTION

The
**libgrapheme**
library provides functions to properly handle Unicode strings according
to the Unicode specification in regard to character, word, sentence and
line segmentation as well as case detection and conversion.

Unicode strings are made up of user-perceived characters (so-called
"grapheme clusters",
see
*MOTIVATION*)
that are composed of one or more Unicode codepoints, which in turn
are encoded in one or more bytes in an encoding like UTF-8.

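To make the lowest of these layers concrete, the following sketch encodes
a single codepoint into UTF-8 bytes.
It is not the implementation used by the library; the name
`utf8_encode_sketch` and its convention of returning 0 for invalid input
are assumptions made for this illustration.
The library's own interface is documented in grapheme\_encode\_utf8(3).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only (hypothetical helper, not part of libgrapheme):
 * encode the codepoint cp as UTF-8 into buf, which must hold 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * (0xD800-0xDFFF) or lies outside the code space (> 0x10FFFF).
 */
size_t
utf8_encode_sketch(uint_least32_t cp, unsigned char buf[4])
{
	if (cp <= 0x7F) {
		/* 1 byte: 0xxxxxxx (identical to ASCII) */
		buf[0] = (unsigned char)cp;
		return 1;
	} else if (cp <= 0x7FF) {
		/* 2 bytes: 110xxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp <= 0xFFFF) {
		if (cp >= 0xD800 && cp <= 0xDFFF) {
			/* surrogates cannot be encoded in UTF-8 */
			return 0;
		}
		/* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	} else if (cp <= 0x10FFFF) {
		/* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xF0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}

	/* beyond the Unicode code space */
	return 0;
}
```

With this sketch, the codepoint 0x41 ('A') yields the single byte 0x41,
while the codepoint 0xC4 ('Ä') yields the two bytes 0xC3 0x84.
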
There is a widespread misconception that it is enough to simply
determine codepoints in a string and treat them as user-perceived
characters to be Unicode compliant.
While this may work in some cases, this assumption quickly breaks,
especially for non-Western languages and decomposed Unicode strings
where user-perceived characters are usually represented using multiple
codepoints.

Despite this complicated multilevel structure of Unicode strings,
**libgrapheme**
provides functions to work with them at the byte level (i.e. UTF-8
'char'
arrays) while also offering codepoint-level functions.
Additionally, it is a
"freestanding"
library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
a standard library. This makes it easy to use in bare metal environments.

Every documented function's manual page provides a self-contained
example illustrating the possible usage.

# SEE ALSO

grapheme\_decode\_utf8(3),
grapheme\_encode\_utf8(3),
grapheme\_is\_character\_break(3),
grapheme\_is\_lowercase(3),
grapheme\_is\_lowercase\_utf8(3),
grapheme\_is\_titlecase(3),
grapheme\_is\_titlecase\_utf8(3),
grapheme\_is\_uppercase(3),
grapheme\_is\_uppercase\_utf8(3),
grapheme\_next\_character\_break(3),
grapheme\_next\_character\_break\_utf8(3),
grapheme\_next\_line\_break(3),
grapheme\_next\_line\_break\_utf8(3),
grapheme\_next\_sentence\_break(3),
grapheme\_next\_sentence\_break\_utf8(3),
grapheme\_next\_word\_break(3),
grapheme\_next\_word\_break\_utf8(3),
grapheme\_to\_lowercase(3),
grapheme\_to\_lowercase\_utf8(3),
grapheme\_to\_titlecase(3),
grapheme\_to\_titlecase\_utf8(3),
grapheme\_to\_uppercase(3),
grapheme\_to\_uppercase\_utf8(3)

# STANDARDS

**libgrapheme**
is compliant with the Unicode 15.0.0 specification.

# MOTIVATION

The idea behind every character encoding scheme like ASCII or Unicode
is to express abstract characters (which can be thought of as shapes
making up a written language). ASCII, for instance, which comprises the
range 0 to 127, assigns the number 65 (0x41) to the abstract character
'A'.
This number is called a
"codepoint",
and all codepoints of an encoding make up its so-called
"code space".

Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
first 128 codepoints are identical to ASCII's. The additional
codepoints are needed as Unicode's goal is to express all writing systems
of the world.
To give an example, the abstract character
'Ä'
is not expressible in ASCII, given no ASCII codepoint has been assigned
to it.
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).

One may assume that this process is straightforward, but as more and
more codepoints were assigned to abstract characters, the Unicode
Consortium (which defines the Unicode standard) faced a problem:
Many (mostly non-European) languages have such a large number of
abstract characters that it would exhaust the available Unicode code
space if one tried to assign a codepoint to each abstract character.
The solution to that problem is best introduced with an example: Consider
the abstract character
'Ǟ',
which is
'A'
with an umlaut and a macron added to it.
In this sense, one can consider
'Ǟ'
as a two-fold modification (namely
"add umlaut"
and
"add macron")
of the
"base character"
'A'.

The Unicode Consortium adopted this idea by assigning codepoints to
modifications.
For example, the codepoint 0x308 represents adding an umlaut and 0x304
represents adding a macron, and thus, the codepoint sequence
"0x41 0x308 0x304",
namely the base character
'A'
followed by the umlaut and macron modifiers, represents the abstract
character
'Ǟ'.
As a side-note, the single codepoint 0x1DE was also assigned to
'Ǟ',
which is a good example of the fact that there can be multiple
representations of a single abstract character in Unicode.

Expressing a single abstract character with multiple codepoints solved
the code space exhaustion problem, and the concept has been greatly
expanded since its first introduction (emojis, joiners, etc.). A sequence
(which can also have the length 1) of codepoints that belong together
this way and represent an abstract character is called a
"grapheme cluster".

In many applications it is necessary to count the number of
user-perceived characters, i.e. grapheme clusters, in a string.
A good example of this is a terminal text editor, which needs to
properly align characters on a grid.
This is pretty simple with ASCII strings, where you just count the number
of bytes (as each byte is a codepoint and each codepoint is a grapheme
cluster).
With Unicode strings, it is a common mistake to simply adopt the
ASCII approach and count the number of codepoints.
This is wrong, as, for example, the sequence
"0x41 0x308 0x304",
while made up of 3 codepoints, is a single grapheme cluster and
represents the user-perceived character
'Ǟ'.

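The mistaken approach described above can be sketched in a few lines of
plain C.
The helper below is hypothetical (the name `count_codepoints_sketch` is
an assumption, not part of the library) and assumes its input is valid
UTF-8; it counts codepoints by skipping UTF-8 continuation bytes.

```c
#include <stddef.h>

/*
 * Illustrative sketch only (hypothetical helper, not part of libgrapheme):
 * count the codepoints in the UTF-8 string str of length len bytes by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes that str is valid UTF-8.
 */
size_t
count_codepoints_sketch(const char *str, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		if (((unsigned char)str[i] & 0xC0) != 0x80) {
			/* lead byte or ASCII byte: a new codepoint starts here */
			n++;
		}
	}

	return n;
}
```

For the decomposed form of 'Ǟ', the 5-byte string `"A\xCC\x88\xCC\x84"`,
this returns 3, and for the precomposed form, the 2-byte string
`"\xC7\x9E"`, it returns 1, although both represent exactly one
user-perceived character.
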
The proper way to segment a string into user-perceived characters
is to segment it into its grapheme clusters by applying the Unicode
grapheme cluster breaking algorithm (UAX #29).
It is based on a complex ruleset and lookup tables and determines if a
grapheme cluster ends or is continued between two codepoints.
Libraries like ICU and libunistring, which also offer this functionality,
are often bloated, incorrect, difficult to use or not reasonably
statically linkable.

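The idea of deciding between two codepoints whether a cluster continues
can be hinted at with a deliberately reduced sketch.
It is neither the UAX #29 algorithm nor the library's implementation: as
an explicit assumption, it treats only the combining diacritical marks
0x300 to 0x36F as continuing a cluster and ignores every other rule
(Hangul syllables, emoji sequences, CR LF and so on).
The names `is_combining_sketch` and `count_clusters_sketch` are
hypothetical; for real input see grapheme\_next\_character\_break(3).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch only (hypothetical helper, not part of libgrapheme):
 * ASSUMPTION: only the block "Combining Diacritical Marks"
 * (0x300-0x36F) is treated as extending the preceding codepoint.
 */
static int
is_combining_sketch(uint_least32_t cp)
{
	return cp >= 0x300 && cp <= 0x36F;
}

/*
 * Count clusters in the codepoint array cp of length len under the
 * reduced rule above. This is NOT compliant with UAX #29.
 */
size_t
count_clusters_sketch(const uint_least32_t *cp, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		/*
		 * a new cluster starts at the first codepoint and at
		 * every codepoint that is not a combining mark
		 */
		if (i == 0 || !is_combining_sketch(cp[i])) {
			n++;
		}
	}

	return n;
}
```

Under this reduced rule, the sequence "0x41 0x308 0x304" counts as one
cluster, just like the single codepoint 0x1DE, while "0x41 0x42" counts
as two.
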
Analogously, the standard provides algorithms to separate strings by
words, sentences and lines, convert cases and compare strings.
The motivation behind
**libgrapheme**
is to make Unicode handling suck less and abide by the UNIX philosophy.

# AUTHORS

Laslo Hunhold ([[email protected]](mailto:[email protected]))

suckless.org - 2022-10-06