\iffalse

\iffalse
# The `Lua-UCA` package
\fi

This package adds support for the [Unicode collation algorithm](https://unicode.org/reports/tr10/) to Lua 5.3 and later.
It is mainly intended for use with Lua\TeX\ and a working \TeX\ distribution, but it can also work as a standalone
Lua module. In that case, you need to install the required [Lua-uni-algos](https://github.com/latex3/lua-uni-algos) package
by hand.

## Usage

To sort a table using Czech collation rules:

    kpse.set_program_name "luatex"
    local ducet = require "lua-uca.lua-uca-ducet"
    local collator = require "lua-uca.lua-uca-collator"
    local languages = require "lua-uca.lua-uca-languages"

    local collator_obj = collator.new(ducet)
    -- load Czech rules
    collator_obj = languages.cs(collator_obj)

    local t = {"cihla", "chochol", "hudba", "jasan", "čáp"}

    table.sort(t, function(a,b)
      return collator_obj:compare_strings(a,b)
    end)

    for _, v in ipairs(t) do
      print(v)
    end

The output:

> cihla
> čáp
> hudba
> chochol
> jasan
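
For comparison, here is a minimal sketch of the same list sorted without Lua-UCA. It assumes the default C locale, where Lua's `<` operator on strings compares bytes. It does not treat `ch` as a single letter, and `č` ends up after all ASCII letters. The variable name `plain` is only illustrative.

```lua
-- Same word list as above, sorted without Lua-UCA.
local plain = {"cihla", "chochol", "hudba", "jasan", "čáp"}

-- Lua compares strings with strcoll(); in the default "C" locale this is
-- plain byte order, which knows nothing about Czech collation rules.
table.sort(plain)

for _, v in ipairs(plain) do
  print(v)
end
```

Here `chochol` sorts before `cihla` and `čáp` moves to the end, which is why a collator is needed for Czech.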

More usage examples can be found in the source repository of this package on [GitHub](https://github.com/michal-h21/lua-uca).
% See `HACKING.md` file in the repo for more information.

## Use with Xindex processor

[Xindex](https://www.ctan.org/pkg/xindex) is a flexible index processor written
in Lua by Herbert Voß. It has built-in `Lua-UCA` support starting with version
`0.23`. The support can be requested using the `-u` option:

    xindex -u -l no -c norsk filename.idx

## Use with LuaJIT

The default version of `lua-uca-ducet` fails with LuaJIT. You can use an alternative version of this file, `lua-uca-ducet-jit`.
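
One way to choose between the two files at run time is sketched below. The helper name `ducet_module_name` is hypothetical, and the module path `lua-uca.lua-uca-ducet-jit` is an assumption made by analogy with the `require` calls above; adjust it to match your installation.

```lua
-- Pick the DUCET module name depending on the interpreter.
-- `is_luajit` is passed in explicitly so that the choice is easy to test.
local function ducet_module_name(is_luajit)
  if is_luajit then
    -- assumed module path, by analogy with "lua-uca.lua-uca-ducet"
    return "lua-uca.lua-uca-ducet-jit"
  end
  return "lua-uca.lua-uca-ducet"
end

-- LuaJIT defines the global table `jit`; plain Lua does not.
local name = ducet_module_name(type(jit) == "table")
print(name)
-- local ducet = require(name)  -- uncomment where Lua-UCA is installed
```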

## Change sorting rules

The simplest way to change the default sorting order is to use the
`tailor_string` method of the `collator_obj` object. It updates the collator object using
a special syntax, which is a subset of the format used by the [Unicode locale data
markup
language](https://www.unicode.org/reports/tr35/tr35-collation.html#Orderings).

    collator_obj:tailor_string "&a<b"

Full example with Czech rules:

    kpse.set_program_name "luatex"
    local ducet = require "lua-uca.lua-uca-ducet"
    local collator = require "lua-uca.lua-uca-collator"
    local languages = require "lua-uca.lua-uca-languages"

    local collator_obj = collator.new(ducet)
    local tailoring = function(s) collator_obj:tailor_string(s) end

    tailoring "&c<č<<<Č"
    tailoring "&h<ch<<<cH<<<Ch<<<CH"
    tailoring "&R<ř<<<Ř"
    tailoring "&s<š<<<Š"
    tailoring "&z<ž<<<Ž"

Note that the letter sequences `ch`, `Ch`, `cH` and `CH` will be sorted after `h`.

It is also possible to expand a letter to multiple letters, as in this example for DIN 2:

    tailoring "&Ö=Oe"
    tailoring "&ö=oe"
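
When a language needs many rules, it can be convenient to keep them in a table and apply them in one pass. The following is only a sketch: the helper name `apply_tailorings` is hypothetical, and it relies solely on the `tailor_string` method described above.

```lua
-- Apply a list of tailoring rules to a collator, in order.
-- `collator` is expected to provide tailor_string(), as `collator_obj` does.
local function apply_tailorings(collator, rules)
  for _, rule in ipairs(rules) do
    collator:tailor_string(rule)
  end
  return collator
end

-- Rules kept as plain data, here the DIN 2 expansions shown above.
local din2_rules = {
  "&Ö=Oe",
  "&ö=oe",
}
```

With Lua-UCA loaded, the call would be `apply_tailorings(collator_obj, din2_rules)`.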

Some languages, like Norwegian, sort uppercase letters before lowercase. This
can be enabled using the `collator_obj:uppercase_first()` method:

    local tailoring = function(s) collator_obj:tailor_string(s) end
    collator_obj:uppercase_first()
    tailoring("&D<<đ<<<Đ<<ð<<<Ð")
    tailoring("&th<<<þ")
    tailoring("&TH<<<Þ")
    tailoring("&Y<<ü<<<Ü<<ű<<<Ű")
    tailoring("&ǀ<æ<<<Æ<<ä<<<Ä<ø<<<Ø<<ö<<<Ö<<ő<<<Ő<å<<<Å<<<aa<<<Aa<<<AA")
    tailoring("&oe<<œ<<<Œ")

Some languages, for example Canadian French, sort accents backwards, like gêne < gëne < gêné.
In this case, you can set the `collator_obj.accents_backward` field to `true`.

% More information on a new language support is in the `HACKING.md`
% document in the [`Lua-UCA` Github repo](https://github.com/michal-h21/lua-uca/blob/master/HACKING.md).

### Script reordering

Many languages sort their own script before all other scripts. As
Latin-based scripts are sorted first by default, it is necessary to reorder
scripts in such cases.

The `collator_obj:reorder` function takes a table of scripts that need to be reordered.
For example, Cyrillic can be sorted before Latin using:

    collator_obj:reorder {"cyrillic"}

In German or Czech, numbers should be sorted after all other characters. This can be done using:

    collator_obj:reorder {"others", "digits"}

The special keyword "others" means that the scripts that follow it in the table
will be sorted at the very end.
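
To check the effect of a reordering, it helps to sort a small mixed list before and after the call. The helper below is a sketch, not part of Lua-UCA: the name `sorted_copy` is hypothetical, and it only relies on the `compare_strings` method used in the Usage section.

```lua
-- Return a sorted copy of `words`, leaving the input table untouched.
-- `collator` is expected to provide compare_strings(), as `collator_obj` does.
local function sorted_copy(collator, words)
  local copy = {}
  for i, word in ipairs(words) do
    copy[i] = word
  end
  table.sort(copy, function(a, b)
    return collator:compare_strings(a, b)
  end)
  return copy
end

-- Usage with Lua-UCA installed (not run here):
--   collator_obj:reorder {"others", "digits"}
--   local result = sorted_copy(collator_obj, {"10 let", "abeceda", "zebra"})
--   -- entries starting with digits are expected to come last
```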

## Headers for index entries

In some languages, for example Czech, multiple letters may count as one
character. This is the case of the *ch* character.

Lua-UCA provides the `collator_obj:get_lowest_char()` function. It returns a table of Unicode codepoints
for the correct first character in a given language, which can be used, for example, as an index header.

    local czech = collator.new(ducet)
    languages.cs(czech)
    -- first we need to convert string to codepoints
    local codepoints = czech:string_to_codepoints("Chrobák")
    local first_char = czech:get_lowest_char(codepoints)
    -- it should print letters "ch"
    print(utf8.char(table.unpack(first_char)))
    -- you can also specify position of the character
    local second_char = czech:get_lowest_char(codepoints, 2)
    -- it should print letter "h", as it is second codepoint in the string
    print(utf8.char(table.unpack(second_char)))
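
Building on this, index entries can be grouped under their headers. The following is a rough sketch under stated assumptions: the function name `group_by_header` is hypothetical, the input is assumed to be sorted already, and only the `string_to_codepoints` and `get_lowest_char` methods shown above are used.

```lua
-- Group words under index headers derived from their first collation letter.
-- `collator` is expected to provide string_to_codepoints() and
-- get_lowest_char(), such as the `czech` collator created above.
-- Returns a table mapping header -> list of words, and the headers in the
-- order in which they were first seen.
local function group_by_header(collator, words)
  local groups, order = {}, {}
  for _, word in ipairs(words) do
    local codepoints = collator:string_to_codepoints(word)
    local first = collator:get_lowest_char(codepoints)
    local header = utf8.char(table.unpack(first))
    if not groups[header] then
      groups[header] = {}
      order[#order + 1] = header
    end
    table.insert(groups[header], word)
  end
  return groups, order
end
```

With the Czech collator, `group_by_header(czech, sorted_words)` would be expected to place entries such as "Chrobák" under a separate `ch` header instead of under `c`.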

## Unicode normalization

By default, no Unicode normalization is used internally. You can explicitly request normalization, which uses the
[Uninormalize package](https://ctan.org/pkg/uninormalize?lang=en). Note that it significantly increases the
processing time.

There are two normalization methods, NFC and NFD. They can be enabled using
the `collation.use_nfc()` and `collation.use_nfd()` functions.
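
As background, the sketch below shows why the two forms exist. It uses only the standard Lua 5.3 `utf8` library and does not call Lua-UCA. The same visible character "é" can be stored as one precomposed codepoint (the form NFC produces) or as a base letter followed by a combining accent (the form NFD produces). Whether this difference affects a particular sort depends on the input data.

```lua
-- Two ways to store "é".
local composed = utf8.char(0xE9)             -- U+00E9, precomposed (NFC form)
local decomposed = "e" .. utf8.char(0x301)   -- "e" + U+0301 combining acute (NFD form)

-- The strings look the same when printed but are different codepoint sequences.
print(composed, decomposed)
print(composed == decomposed)
print(utf8.len(composed), utf8.len(decomposed))
```

Normalizing all input to one form removes this ambiguity, at the cost of the extra processing time mentioned above.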

# What is missing

- Algorithm for setting implicit sort weights of characters that are not explicitly listed in DUCET.
- Special handling of CJK scripts.

\iffalse
# Copyright

Michal Hoftich, 2021–2024. See LICENSE file for more details.

\fi