libgrapheme is an extremely simple freestanding C99 library providing
utilities for properly handling strings according to the latest
Unicode standard 15.0.0. It offers fully Unicode-compliant

* __grapheme cluster__ (i.e. user-perceived character) __segmentation__
* __word segmentation__
* __sentence segmentation__
* detection of permissible __line break opportunities__
* __case detection__ (lower-, upper- and title-case)
* __case conversion__ (to lower-, upper- and title-case)

on UTF-8 strings and codepoint arrays, both of which can also be
NUL-terminated.

The necessary lookup tables are automatically generated from the Unicode
standard data (contained in the tarball) and heavily compressed. Over
10,000 automatically generated conformance tests and over 150 unit tests
ensure conformance and correctness.

There is no complicated build system involved; everything is done using
one POSIX-compliant Makefile. All you need is a C99 compiler, since
the lookup-table generators and compressors, which are only run at
build time, are also written in C99.
The resulting library is freestanding and thus does not even depend on a
standard library being present at runtime, making it a suitable choice
for bare-metal applications.

It is also much smaller and faster than the other established Unicode
string libraries (ICU, GNU's libunistring, libutf8proc).

Development
-----------
You can [browse](//git.suckless.org/libgrapheme) the source code
repository or get a copy with the following command:

	git clone https://git.suckless.org/libgrapheme

Download
--------
libgrapheme follows the [semantic versioning](https://semver.org/) schem…

* [libgrapheme-2.0.2](//dl.suckless.org/libgrapheme/libgrapheme-2.0.2.ta…
* [libgrapheme-1.0.0](//dl.suckless.org/libgrapheme/libgrapheme-1.0.0.ta…

Getting Started
---------------
Automatically configuring and installing libgrapheme via

	./configure
	make install

will install the header grapheme.h and both the static library
libgrapheme.a and the dynamic library libgrapheme.so (with symlinks) in
the respective folders. The conformance and unit tests can be run with

	make test

and comparative benchmarks against libutf8proc (which is the only Unicode
library compliant enough to be comparable) can be run with

	make benchmark

You can access the manual [here](man/) or via libgrapheme(7) by typing

	man libgrapheme

and looking at the referenced pages, e.g.
[grapheme\_next\_character\_break_utf8(3)](man/grapheme_next_character_b…
Each page contains code examples and an extensive description. To give
one example that is also given in the manuals, the following code
separates a given string 'Tëst 👨‍👩‍👦 🇺🇸 नी ந…
into its user-perceived characters:

	#include <grapheme.h>
	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		/* UTF-8 encoded input */
		char *s = "T\xC3\xABst \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0"
		          "\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA6 \xF0"
		          "\x9F\x87\xBA\xF0\x9F\x87\xB8 \xE0\xA4\xA8\xE0"
		          "\xA5\x80 \xE0\xAE\xA8\xE0\xAE\xBF!";
		size_t ret, len, off;

		printf("Input: \"%s\"\n", s);

		/* print each grapheme cluster with byte-length */
		printf("grapheme clusters in NUL-delimited input:\n");
		for (off = 0; s[off] != '\0'; off += ret) {
			ret = grapheme_next_character_break_utf8(s + off…
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s +…
		}
		printf("\n");

		/* do the same, but this time string is length-delimited…
		len = 17;
		printf("grapheme clusters in input delimited to %zu byte…
		for (off = 0; off < len; off += ret) {
			ret = grapheme_next_character_break_utf8(s + off…
			printf("%2zu bytes | %.*s\n", ret, (int)ret, s +…
		}

		return 0;
	}

This code can be compiled with (the parenthesised -static is optional)

	cc (-static) -o example example.c -lgrapheme

and the output is

	Input: "Tëst 👨‍👩‍👦 🇺🇸 नी நி!"
	grapheme clusters in NUL-delimited input:
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	18 bytes | 👨‍👩‍👦
	 1 bytes |  
	 8 bytes | 🇺🇸
	 1 bytes |  
	 6 bytes | नी
	 1 bytes |  
	 6 bytes | நி
	 1 bytes | !

	grapheme clusters in input delimited to 17 bytes:
	 1 bytes | T
	 2 bytes | ë
	 1 bytes | s
	 1 bytes | t
	 1 bytes |  
	11 bytes | 👨‍👩

Motivation
----------
The goal of this project is to be a suckless and statically linkable
alternative to the existing bloated, complicated, overscoped and/or
incorrect solutions for Unicode string handling (ICU, GNU's
libunistring, libutf8proc, etc.), motivating more hackers to properly
handle Unicode strings in their projects and allowing this even in
embedded applications.

The problem can be easily seen when looking at the sizes of the respecti…
libraries: The ICU library (libicudata.a, libicui18n.a, libicuio.a,
libicutest.a, libicutu.a, libicuuc.a) is around 38MB and libunistring
(libunistring.a) is around 2MB, which is unacceptable for static
linking. Both take many minutes to compile even on a good computer and
require a lot of dependencies, including Python for ICU. On
the other hand, libgrapheme (libgrapheme.a) only weighs in at around 300K
and is compiled (including Unicode data parsing and compression) in
under a second, requiring nothing but a C99 compiler and POSIX make(1).

Some libraries, like libutf8proc and libunistring, are incorrect because
they base their APIs on assumptions that haven't been true for years
(e.g. offering stateless grapheme cluster segmentation even though the
underlying algorithm is not stateless). In addition,
libutf8proc's UTF-8 decoder is unsafe, as it allows overlong encodings
that can be easily used for exploits.

ICU and libunistring offer a lot of functions, and most of their weight
comes from locale data provided by the Unicode standard, which is applied
implementation-specifically (!) for some things. In such cases, however,
the same standard always defines a sane 'default' behaviour as an
alternative that is satisfying in 99% of the cases and which you can
rely on.

For some languages, for instance, it is necessary to have a dictionary
on hand to always accurately determine when a word begins and ends. The
defaults provided by the standard, though, already do a great job
respecting the language's boundaries in the general case and are not too
taxing in terms of performance.

Author
------
* Laslo Hunhold ([email protected])

Please contact me if you have information that could be added to this pa…