| libgrapheme.sh - libgrapheme - unicode string library | |
| git clone git://git.suckless.org/libgrapheme | |
| --- | |
| libgrapheme.sh (6347B) | |
| --- | |
| 1 cat << EOF | |
| 2 .Dd ${MAN_DATE} | |
| 3 .Dt LIBGRAPHEME 7 | |
| 4 .Os suckless.org | |
| 5 .Sh NAME | |
| 6 .Nm libgrapheme | |
| 7 .Nd unicode string library | |
| 8 .Sh SYNOPSIS | |
| 9 .In grapheme.h | |
| 10 .Sh DESCRIPTION | |
| 11 The | |
| 12 .Nm | |
| 13 library provides functions to properly handle Unicode strings according | |
| 14 to the Unicode specification in regard to character, word, sentence and | |
| 15 line segmentation and case detection and conversion. | |
| 16 .Pp | |
| 17 Unicode strings are made up of user-perceived characters (so-called | |
| 18 .Dq grapheme clusters , | |
| 19 see | |
| 20 .Sx MOTIVATION ) | |
| 21 that are composed of one or more Unicode codepoints, which in turn | |
| 22 are encoded in one or more bytes in an encoding like UTF-8. | |
| 23 .Pp | |
| 24 There is a widespread misconception that it is enough to simply | |
| 25 determine codepoints in a string and treat them as user-perceived | |
| 26 characters to be Unicode compliant. | |
| 27 While this may work in some cases, this assumption quickly breaks, | |
| 28 especially for non-Western languages and decomposed Unicode strings | |
| 29 where user-perceived characters are usually represented using multiple | |
| 30 codepoints. | |
| 31 .Pp | |
| 32 Despite this complicated multilevel structure of Unicode strings, | |
| 33 .Nm | |
| 34 provides methods to work with them at the byte-level (i.e. UTF-8 | |
| 35 .Sq char | |
| 36 arrays) while also offering codepoint-level methods. | |
| 37 Additionally, it is a | |
| 38 .Dq freestanding | |
| 39 library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on | |
| 40 a standard library. This makes it easy to use in bare-metal environments. | |
| 41 .Pp | |
| 42 Every documented function's manual page provides a self-contained | |
| 43 example illustrating the possible usage. | |
| 44 .Sh SEE ALSO | |
| 45 .Xr grapheme_decode_utf8 3 , | |
| 46 .Xr grapheme_encode_utf8 3 , | |
| 47 .Xr grapheme_is_character_break 3 , | |
| 48 .Xr grapheme_is_lowercase 3 , | |
| 49 .Xr grapheme_is_lowercase_utf8 3 , | |
| 50 .Xr grapheme_is_titlecase 3 , | |
| 51 .Xr grapheme_is_titlecase_utf8 3 , | |
| 52 .Xr grapheme_is_uppercase 3 , | |
| 53 .Xr grapheme_is_uppercase_utf8 3 , | |
| 54 .Xr grapheme_next_character_break 3 , | |
| 55 .Xr grapheme_next_character_break_utf8 3 , | |
| 56 .Xr grapheme_next_line_break 3 , | |
| 57 .Xr grapheme_next_line_break_utf8 3 , | |
| 58 .Xr grapheme_next_sentence_break 3 , | |
| 59 .Xr grapheme_next_sentence_break_utf8 3 , | |
| 60 .Xr grapheme_next_word_break 3 , | |
| 61 .Xr grapheme_next_word_break_utf8 3 , | |
| 62 .Xr grapheme_to_lowercase 3 , | |
| 63 .Xr grapheme_to_lowercase_utf8 3 , | |
| 64 .Xr grapheme_to_titlecase 3 , | |
| 65 .Xr grapheme_to_titlecase_utf8 3 , | |
| 66 .Xr grapheme_to_uppercase 3 , | |
| 67 .Xr grapheme_to_uppercase_utf8 3 | |
| 68 .Sh STANDARDS | |
| 69 .Nm | |
| 70 is compliant with the Unicode ${UNICODE_VERSION} specification. | |
| 71 .Sh MOTIVATION | |
| 72 The idea behind every character encoding scheme like ASCII or Unicode | |
| 73 is to express abstract characters (which can be thought of as shapes | |
| 74 making up a written language). ASCII for instance, which comprises the | |
| 75 range 0 to 127, assigns the number 65 (0x41) to the abstract character | |
| 76 .Sq A . | |
| 77 This number is called a | |
| 78 .Dq codepoint , | |
| 79 and all codepoints of an encoding make up its so-called | |
| 80 .Dq code space . | |
| 81 .Pp | |
| 82 Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its | |
| 83 first 128 codepoints are identical to ASCII's. The additional | |
| 84 codepoints are needed as Unicode's goal is to express all writing systems | |
| 85 of the world. | |
| 86 To give an example, the abstract character | |
| 87 .Sq \[u00C4] | |
| 88 is not expressible in ASCII, given no ASCII codepoint has been assigned | |
| 89 to it. | |
| 90 It can be expressed in Unicode, though, with the codepoint 196 (0xC4). | |
| 91 .Pp | |
| 92 One may assume that this process is straightforward, but as more and | |
| 93 more codepoints were assigned to abstract characters, the Unicode | |
| 94 Consortium (which defines the Unicode standard) faced a problem: | |
| 95 Many (mostly non-European) languages have such a large number of | |
| 96 abstract characters that it would exhaust the available Unicode code | |
| 97 space if one tried to assign a codepoint to each abstract character. | |
| 98 The solution to that problem is best introduced with an example: Consider | |
| 99 the abstract character | |
| 100 .Sq \[u01DE] , | |
| 101 which is | |
| 102 .Sq A | |
| 103 with an umlaut and a macron added to it. | |
| 104 In this sense, one can consider | |
| 105 .Sq \[u01DE] | |
| 106 as a two-fold modification (namely | |
| 107 .Dq add umlaut | |
| 108 and | |
| 109 .Dq add macron ) | |
| 110 of the | |
| 111 .Dq base character | |
| 112 .Sq A . | |
| 113 .Pp | |
| 114 The Unicode Consortium adopted this idea by assigning codepoints to | |
| 115 modifications. | |
| 116 For example, the codepoint 0x308 represents adding an umlaut and 0x304 | |
| 117 represents adding a macron, and thus, the codepoint sequence | |
| 118 .Dq 0x41 0x308 0x304 , | |
| 119 namely the base character | |
| 120 .Sq A | |
| 121 followed by the umlaut and macron modifiers, represents the abstract | |
| 122 character | |
| 123 .Sq \[u01DE] . | |
| 124 As a side-note, the single codepoint 0x1DE was also assigned to | |
| 125 .Sq \[u01DE] , | |
| 126 which is a good example of the fact that there can be multiple | |
| 127 representations of a single abstract character in Unicode. | |
| 128 .Pp | |
| 129 Expressing a single abstract character with multiple codepoints solved | |
| 130 the code space exhaustion problem, and the concept has been greatly | |
| 131 expanded since its first introduction (emojis, joiners, etc.). A sequence | |
| 132 (which can also have the length 1) of codepoints that belong together | |
| 133 this way and represents an abstract character is called a | |
| 134 .Dq grapheme cluster . | |
| 135 .Pp | |
| 136 In many applications it is necessary to count the number of | |
| 137 user-perceived characters, i.e. grapheme clusters, in a string. | |
| 138 A good example of this is a terminal text editor, which needs to | |
| 139 properly align characters on a grid. | |
| 140 This is pretty simple with ASCII strings, where you just count the number | |
| 141 of bytes (as each byte is a codepoint and each codepoint is a grapheme | |
| 142 cluster). | |
| 143 With Unicode strings, it is a common mistake to simply adapt the | |
| 144 ASCII approach and count the number of codepoints. | |
| 145 This is wrong, as, for example, the sequence | |
| 146 .Dq 0x41 0x308 0x304 , | |
| 147 while made up of 3 codepoints, is a single grapheme cluster and | |
| 148 represents the user-perceived character | |
| 149 .Sq \[u01DE] . | |
| 150 .Pp | |
| 151 The proper way to segment a string into user-perceived characters | |
| 152 is to segment it into its grapheme clusters by applying the Unicode | |
| 153 grapheme cluster breaking algorithm (UAX #29). | |
| 154 It is based on a complex ruleset and lookup tables and determines whether a | |
| 155 grapheme cluster ends or is continued between two codepoints. | |
| 156 Libraries like ICU and libunistring, which also offer this functionality, | |
| 157 are often bloated, incorrect, difficult to use or not reasonably | |
| 158 statically linkable. | |
| 158 statically linkable. | |
| 159 .Pp | |
| 160 Analogously, the standard provides algorithms to separate strings by | |
| 161 words, sentences and lines, convert cases and compare strings. | |
| 162 The motivation behind | |
| 163 .Nm | |
| 164 is to make Unicode handling suck less and abide by the UNIX philosophy. | |
| 165 .Sh AUTHORS | |
| 166 .An Laslo Hunhold Aq Mt [email protected] | |
| 167 EOF |
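
The MOTIVATION section of the manual above distinguishes bytes, codepoints and grapheme clusters. The following is a minimal sketch of that three-level structure in freestanding-style C. It does not use libgrapheme and it does not implement the UAX #29 algorithm: as a loudly simplified stand-in, it only treats codepoints in the Combining Diacritical Marks block (U+0300 to U+036F) as extending the previous cluster. The function names (`sketch_decode`, `sketch_count_codepoints`, `sketch_count_clusters`) are hypothetical and chosen for illustration only; for real segmentation, see the functions listed under SEE ALSO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one codepoint from the UTF-8 string s of len > 0 bytes and
 * store it in *cp. Returns the number of bytes consumed.
 * ASSUMPTION: s is well-formed UTF-8; invalid input is not detected.
 */
size_t
sketch_decode(const char *s, size_t len, uint_least32_t *cp)
{
	const unsigned char *u = (const unsigned char *)s;
	size_t n, i;

	assert(s != NULL && cp != NULL && len > 0);

	if (u[0] < 0x80) {                    /* 1 byte: 0xxxxxxx */
		*cp = u[0];
		n = 1;
	} else if ((u[0] & 0xE0) == 0xC0) {   /* 2 bytes: 110xxxxx */
		*cp = u[0] & 0x1F;
		n = 2;
	} else if ((u[0] & 0xF0) == 0xE0) {   /* 3 bytes: 1110xxxx */
		*cp = u[0] & 0x0F;
		n = 3;
	} else {                              /* 4 bytes: 11110xxx */
		*cp = u[0] & 0x07;
		n = 4;
	}
	if (n > len)
		n = len; /* truncated sequence: consume what is left */
	for (i = 1; i < n; i++)
		*cp = (*cp << 6) | (u[i] & 0x3F);

	return n;
}

/* Count the codepoints in the first len bytes of s. */
size_t
sketch_count_codepoints(const char *s, size_t len)
{
	uint_least32_t cp;
	size_t off, count = 0;

	for (off = 0; off < len; off += sketch_decode(s + off, len - off, &cp))
		count++;

	return count;
}

/*
 * Count clusters under the SIMPLIFIED rule that a codepoint in
 * U+0300..U+036F extends the previous cluster (if there is one).
 * This is NOT the UAX #29 algorithm, only an illustration.
 */
size_t
sketch_count_clusters(const char *s, size_t len)
{
	uint_least32_t cp;
	size_t off, count = 0;

	for (off = 0; off < len; ) {
		off += sketch_decode(s + off, len - off, &cp);
		if (count == 0 || cp < 0x300 || cp > 0x36F)
			count++;
	}

	return count;
}
```

For the sequence 0x41 0x308 0x304 from the manual, encoded in UTF-8 as the 5 bytes `"A\xCC\x88\xCC\x84"`, `sketch_count_codepoints` returns 3 and `sketch_count_clusters` returns 1. The precomposed form 0x1DE, encoded as the 2 bytes `"\xC7\x9E"`, yields 1 for both. The simplified rule breaks down for anything outside that one block (emoji sequences, joiners, Hangul and so on), which is the gap the full ruleset and lookup tables are meant to close.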