| index.md - sites - public wiki contents of suckless.org | |
| --- | |
| index.md (6288B) | |
| --- | |
| 1 LIBGRAPHEME(7) - Miscellaneous Information Manual | |
| 2 | |
| 3 # NAME | |
| 4 | |
| 5 **libgrapheme** - Unicode string library | |
| 6 | |
| 7 # SYNOPSIS | |
| 8 | |
| 9 **#include <grapheme.h>** | |
| 10 | |
| 11 # DESCRIPTION | |
| 12 | |
| 13 The | |
| 14 **libgrapheme** | |
| 15 library provides functions to properly handle Unicode strings according | |
| 16 to the Unicode specification in regard to character, word, sentence and | |
| 17 line segmentation and case detection and conversion. | |
| 18 | |
| 19 Unicode strings are made up of user-perceived characters (so-called | |
| 20 "grapheme clusters", | |
| 21 see | |
| 22 *MOTIVATION*) | |
| 23 that are composed of one or more Unicode codepoints, which in turn | |
| 24 are encoded in one or more bytes in an encoding like UTF-8. | |
| 25 | |
| 26 There is a widespread misconception that it is enough to simply | |
| 27 determine codepoints in a string and treat them as user-perceived | |
| 28 characters to be Unicode compliant. | |
| 29 While this may work in some cases, this assumption quickly breaks, | |
| 30 especially for non-Western languages and decomposed Unicode strings | |
| 31 where user-perceived characters are usually represented using multiple | |
| 32 codepoints. | |
| 33 | |
| 34 Despite this complicated multilevel structure of Unicode strings, | |
| 35 **libgrapheme** | |
| 36 provides methods to work with them at the byte-level (i.e. UTF-8 | |
| 37 'char' | |
| 38 arrays) while also offering codepoint-level methods. | |
| 39 Additionally, it is a | |
| 40 "freestanding" | |
| 41 library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on | |
| 42 a standard library. This makes it easy to use in bare metal environments. | |
| 43 | |
| 44 Every documented function's manual page provides a self-contained | |
| 45 example illustrating the possible usage. | |
| 46 | |
| 47 # SEE ALSO | |
| 48 | |
| 49 grapheme\_decode\_utf8(3), | |
| 50 grapheme\_encode\_utf8(3), | |
| 51 grapheme\_is\_character\_break(3), | |
| 52 grapheme\_is\_lowercase(3), | |
| 53 grapheme\_is\_lowercase\_utf8(3), | |
| 54 grapheme\_is\_titlecase(3), | |
| 55 grapheme\_is\_titlecase\_utf8(3), | |
| 56 grapheme\_is\_uppercase(3), | |
| 57 grapheme\_is\_uppercase\_utf8(3), | |
| 58 grapheme\_next\_character\_break(3), | |
| 59 grapheme\_next\_character\_break\_utf8(3), | |
| 60 grapheme\_next\_line\_break(3), | |
| 61 grapheme\_next\_line\_break\_utf8(3), | |
| 62 grapheme\_next\_sentence\_break(3), | |
| 63 grapheme\_next\_sentence\_break\_utf8(3), | |
| 64 grapheme\_next\_word\_break(3), | |
| 65 grapheme\_next\_word\_break\_utf8(3), | |
| 66 grapheme\_to\_lowercase(3), | |
| 67 grapheme\_to\_lowercase\_utf8(3), | |
| 68 grapheme\_to\_titlecase(3), | |
| 69 grapheme\_to\_titlecase\_utf8(3), | |
| 70 grapheme\_to\_uppercase(3), | |
| 71 grapheme\_to\_uppercase\_utf8(3) | |
| 72 | |
| 73 # STANDARDS | |
| 74 | |
| 75 **libgrapheme** | |
| 76 is compliant with the Unicode 15.0.0 specification. | |
| 77 | |
| 78 # MOTIVATION | |
| 79 | |
| 80 The idea behind every character encoding scheme like ASCII or Unicode | |
| 81 is to express abstract characters (which can be thought of as shapes | |
| 82 making up a written language). ASCII for instance, which comprises the | |
| 83 range 0 to 127, assigns the number 65 (0x41) to the abstract character | |
| 84 'A'. | |
| 85 This number is called a | |
| 86 "codepoint", | |
| 87 and all codepoints of an encoding make up its so-called | |
| 88 "code space". | |
| 89 | |
| 90 Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its | |
| 91 first 128 codepoints are identical to ASCII's. The additional | |
| 92 codepoints are needed as Unicode's goal is to express all writing systems | |
| 93 of the world. | |
| 94 To give an example, the abstract character | |
| 95 'Ä' | |
| 96 is not expressible in ASCII, given no ASCII codepoint has been assigned | |
| 97 to it. | |
| 98 It can be expressed in Unicode, though, with the codepoint 196 (0xC4). | |
| 99 | |
| 100 One may assume that this process is straightforward, but as more and | |
| 101 more codepoints were assigned to abstract characters, the Unicode | |
| 102 Consortium (which defines the Unicode standard) faced a problem: | |
| 103 Many (mostly non-European) languages have such a large number of | |
| 104 abstract characters that it would exhaust the available Unicode code | |
| 105 space if one tried to assign a codepoint to each abstract character. | |
| 106 The solution to that problem is best introduced with an example: Consider | |
| 107 the abstract character | |
| 108 'Ǟ', | |
| 109 which is | |
| 110 'A' | |
| 111 with an umlaut and a macron added to it. | |
| 112 In this sense, one can consider | |
| 113 'Ǟ' | |
| 114 as a two-fold modification (namely | |
| 115 "add umlaut" | |
| 116 and | |
| 117 "add macron") | |
| 118 of the | |
| 119 "base character" | |
| 120 'A'. | |
| 121 | |
| 122 The Unicode Consortium adopted this idea by assigning codepoints to | |
| 123 modifications. | |
| 124 For example, the codepoint 0x308 represents adding an umlaut and 0x304 | |
| 125 represents adding a macron, and thus, the codepoint sequence | |
| 126 "0x41 0x308 0x304", | |
| 127 namely the base character | |
| 128 'A' | |
| 129 followed by the umlaut and macron modifiers, represents the abstract | |
| 130 character | |
| 131 'Ǟ'. | |
| 132 As a side-note, the single codepoint 0x1DE was also assigned to | |
| 133 'Ǟ', | |
| 134 which is a good example of the fact that there can be multiple | |
| 135 representations of a single abstract character in Unicode. | |
| 136 | |
| 137 Expressing a single abstract character with multiple codepoints solved | |
| 138 the code space exhaustion problem, and the concept has been greatly | |
| 139 expanded since its first introduction (emojis, joiners, etc.). A sequence | |
| 140 (which can also have the length 1) of codepoints that belong together | |
| 141 this way and represent an abstract character is called a | |
| 142 "grapheme cluster". | |
| 143 | |
| 144 In many applications it is necessary to count the number of | |
| 145 user-perceived characters, i.e. grapheme clusters, in a string. | |
| 146 A good example of this is a terminal text editor, which needs to | |
| 147 properly align characters on a grid. | |
| 148 This is simple with ASCII strings, where you just count the number | |
| 149 of bytes (as each byte is a codepoint and each codepoint is a grapheme | |
| 150 cluster). | |
| 151 With Unicode strings, it is a common mistake to simply adopt the | |
| 152 ASCII approach and count the number of codepoints. | |
| 153 This is wrong, as, for example, the sequence | |
| 154 "0x41 0x308 0x304", | |
| 155 while made up of 3 codepoints, is a single grapheme cluster and | |
| 156 represents the user-perceived character | |
| 157 'Ǟ'. | |
| 158 | |
| 159 The proper way to segment a string into user-perceived characters | |
| 160 is to segment it into its grapheme clusters by applying the Unicode | |
| 161 grapheme cluster breaking algorithm (UAX #29). | |
| 162 It is based on a complex ruleset and lookup-tables and determines if a | |
| 163 grapheme cluster ends or is continued between two codepoints. | |
| 164 Libraries like ICU and libunistring, which also offer this functionality, | |
| 165 are often bloated, incorrect, difficult to use or not reasonably | |
| 166 statically linkable. | |
| 167 | |
| 168 Analogously, the standard provides algorithms to separate strings by | |
| 169 words, sentences and lines, convert cases and compare strings. | |
| 170 The motivation behind | |
| 171 **libgrapheme** | |
| 172 is to make Unicode handling suck less and abide by the UNIX philosophy. | |
| 173 | |
| 174 # AUTHORS | |
| 175 | |
| 176 Laslo Hunhold | |
| 177 | |
| 178 suckless.org - 2022-10-06 |