Implement the Unicode Bidirectional Algorithm (UAX #9) - libgrapheme - unicode …
git clone git://git.suckless.org/libgrapheme
---
commit 5998352d2d2e6e37531548f8e986abae5ff8ef02
parent dd15fea026c3e0b389381ae8cc08e0f39fa1a8f7
Author: Laslo Hunhold <[email protected]>
Date: Tue, 25 Oct 2022 13:20:47 +0200

Implement the Unicode Bidirectional Algorithm (UAX #9)

To be frank, I had never heard about this until I started learning more
about Unicode, but it is an absolute must for all languages that are
written from right to left (Hebrew, Arabic, Farsi, etc.) and for any
case where you mix RTL and LTR languages.

The Unicode Bidirectional Algorithm is the normative procedure you apply
to a string to obtain embedding levels, which can then be used to reorder
the string such that you obtain the proper reading direction. The
central aspect is that strings are always stored in logical order in
memory and only reordered for presentation on the screen.

Currently, the main implementations of the algorithm are ICU and GNU
fribidi, and as usual it's pretty convoluted to use them. There are many
memory allocations, kitchen-sink madness and legacy cruft, but the demand
is there (there's even a bidi-patch for dwm[0]).

What's special about this implementation? There are no memory
allocations at runtime. The user provides an array of 32-bit integers,
which is then filled with the embedding levels. The levels themselves
only range from -1 to 125 (by the standard!) and would fit in a signed
8-bit integer, but the algorithm naturally needs a scratchpad to store
processing data.
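
As a rough illustration of why a 32-bit cell per character is convenient,
here is a minimal sketch that keeps the signed level in the low 8 bits
and leaves the upper 24 bits as scratch space. The layout and the names
cell_pack, cell_level and cell_scratch are assumptions made up for this
example; they are not taken from the libgrapheme sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical cell layout (an assumption for illustration only):
 *   bits 0-7   embedding level as a signed 8-bit value
 *   bits 8-31  scratch data used while the algorithm runs
 */
static uint_least32_t
cell_pack(int_least8_t level, uint_least32_t scratch)
{
	/* -1 is the smallest level value this sketch expects */
	assert(level >= -1);

	return ((scratch & 0xFFFFFFu) << 8) |
	       ((uint_least32_t)level & 0xFFu);
}

static int_least8_t
cell_level(uint_least32_t cell)
{
	uint_least32_t raw = cell & 0xFFu;

	/* undo the two's-complement wrap for negative levels */
	return (int_least8_t)((raw >= 128) ? (int)raw - 256 : (int)raw);
}

static uint_least32_t
cell_scratch(uint_least32_t cell)
{
	return (cell >> 8) & 0xFFFFFFu;
}
```

With such a layout the caller allocates one array up front, the
algorithm works inside it, and the levels can be read back from the same
cells once processing is done.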

A complication of the algorithm is that, at some point, you have to
break the paragraph into lines, and the line breaks affect the level
determination. GNU fribidi and ICU make this very complicated and hard
to understand. The API is not final as you see it here, but the final
process will be (each number corresponding to a function):
   1) "preprocessing" the string, covering the part of the algorithm
      that does not depend on the line breaks
   2) determining the embedding levels for a line
      (by specifying the preprocessed data buffer and an output
      level-buffer)
   3) reordering a line (by specifying the preprocessed data buffer
      and an output string that is allowed to be the input string)
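
To make step 3 more concrete, here is a minimal sketch of the reordering
rule L2 from UAX #9: from the highest level down to the lowest odd
level, reverse every contiguous run of characters at that level or
higher. The function name reorder_line and its signature are
hypothetical and do not reflect the API in this commit; the sketch
assumes the levels have already been resolved and are all non-negative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* reverse the half-open range cp[from..to) in place */
static void
reverse(uint_least32_t *cp, size_t from, size_t to)
{
	uint_least32_t tmp;

	while (from + 1 < to) {
		to--;
		tmp = cp[from];
		cp[from] = cp[to];
		cp[to] = tmp;
		from++;
	}
}

/*
 * hypothetical helper: reorder one line of codepoints in place,
 * given one resolved (non-negative) level per codepoint
 */
static void
reorder_line(uint_least32_t *cp, const int_least8_t *lev, size_t len)
{
	int max = 0, min_odd = 127, l;
	size_t i, start;

	/* find the highest level and the lowest odd level */
	for (i = 0; i < len; i++) {
		if (lev[i] > max) {
			max = lev[i];
		}
		if (lev[i] % 2 == 1 && lev[i] < min_odd) {
			min_odd = lev[i];
		}
	}

	/*
	 * runs are located via the original levels; this is valid as
	 * every run at a higher level is nested in a run at a lower one
	 */
	for (l = max; l >= min_odd; l--) {
		for (i = 0; i < len;) {
			if (lev[i] < l) {
				i++;
				continue;
			}
			start = i;
			while (i < len && lev[i] >= l) {
				i++;
			}
			reverse(cp, start, i);
		}
	}
}
```

For the logical sequence a, b, space, C, D, E with the levels
0, 0, 0, 1, 1, 1 this leaves the first three codepoints untouched and
turns the tail into E, D, C. Since the reversal happens in place, input
and output can be the same buffer.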

Conformance is obviously a high priority: There are literally over a
million automatic conformance tests for the bidirectional algorithm, split
across the files BidiTest.txt and BidiCharacterTest.txt, which are
automatically parsed into the header gen/bidirectional-test.h.
Currently, only BidiTest.txt is used for tests (all of which we pass),
because bracket pairs have not been implemented yet. This and (maybe)
Arabic shaping are what is left to be implemented, but this here is
already a big step.

One more note: Yes, the data files are very large, but they compress
down very well and the tarball stays below 800K. It's very important
to me that there's no need to pull any data from the web for compilation
or testing for obvious reasons.

[0]:https://dwm.suckless.org/patches/bidi/

Signed-off-by: Laslo Hunhold <[email protected]>
Diff is too large, output suppressed.