NAME
Unicode::Truncate - Unicode-aware efficient string truncation
SYNOPSIS
use utf8;
use Unicode::Truncate;
truncate_egc("hello world", 7);
## returns "hell…"
truncate_egc("hello world", 7, '');
## returns "hello w"
truncate_egc('深圳', 7);
## returns "深…"
truncate_egc("née Jones", 5);
## returns "n…" (not "ne…", even in NFD)
truncate_egc("\xff", 10);
## throws exception:
## "input string not valid UTF-8 (detected at byte offset 0 in truncate_egc)"
my $str = "hello world";
truncate_egc_inplace($str, 8);
## $str is now "hello…"
DESCRIPTION
This module is for truncating UTF-8 encoded Unicode text to particular
byte lengths while inflicting the least amount of data corruption
possible. The resulting truncated string will be no longer than your
specified number of bytes (after UTF-8 encoding).
All truncated strings will continue to be valid UTF-8: it won't cut in
the middle of a UTF-8 encoded code-point. Furthermore, if your text
contains combining diacritical marks, this module will not cut in
between a diacritical mark and the base character. It will in general
try to preserve what users perceive as whole characters, with as little
mutilation as possible at the truncation site.
The "truncate_egc" function truncates only between extended grapheme
clusters
<https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Characters_grapheme_clusters_and_glyphs>
(as defined by Unicode TR29
<http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>
version 7.0.0).
The "truncate_egc_inplace" function is identical to "truncate_egc"
except that the input string will be modified so that no copying occurs.
If you pass in a read-only value it will throw an exception.
Eventually I'd like to support other boundaries such as words and
sentences. Those functions will be named "truncate_word" and so on.
RATIONALE
Of course in a perfect world we would only need to worry about the
amount of space some text takes up on the screen. In the real world,
however, we often have to (or want to) make sure things fit within
certain byte-size limits. Many databases, network protocols, and file-formats
require honouring byte-length restrictions. Even if they automatically
truncate for you, are they doing it properly and consistently? On many
file-systems, file and directory names are subject to byte-size limits.
Many APIs that use C structs have fixed limits as well. You may even
wish to do things like guarantee that a collection of news headlines
will fit in a single ethernet packet.
I knew I had to write this module after I asked Tom Christiansen about
the best way to truncate unicode to fit in fixed-byte fields and he got
angry and told me to never do that. :)
Why not just use "substr" on a string before UTF-8 encoding it? The main
problem with that is the number of bytes that an encoded string will
consume is not known until after you encode it. It depends on how many
"high" code-points are in the string, how "high" those code-points are,
the normalisation form chosen, and (relatedly) how many combining marks
are used. Even with perl unicode strings (i.e. before encoding), using
"substr" can cut in front of combining marks.
Truncating post-encoding may result in invalid UTF-8 partials at the end
of your string, as well as cutting in front of combining marks.
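The points above can be illustrated with core Perl only. This is a minimal sketch, not code from this module; the variable names are made up for the example, and it relies only on Encode's "encode" and "decode".

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three spellings of roughly the same visible text.
my $ascii = "nee";            # plain ASCII
my $nfc   = "n\x{e9}e";       # "née" with precomposed U+00E9 (NFC)
my $nfd   = "ne\x{301}e";     # "née" as e + combining acute U+0301 (NFD)

# Encoded sizes differ even though each string shows three characters.
my @sizes = map { length encode('UTF-8', $_) } $ascii, $nfc, $nfd;
print "@sizes\n";             # prints "3 4 5"

# substr on the decoded string cuts in front of the combining mark.
my $cut_chars = substr($nfd, 0, 2);    # "ne": the accent is gone

# substr on the encoded bytes can split a multi-byte sequence.
my $cut_bytes = substr(encode('UTF-8', $nfc), 0, 2);    # "n" plus a lone 0xC3
my $still_valid = eval {
    decode('UTF-8', $cut_bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
print $still_valid ? "valid\n" : "invalid UTF-8\n";     # prints "invalid UTF-8"
```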
One interesting aspect of unicode's combining marks is that there is no
specified limit to the number of combining marks that can be applied. So
in some interpretations a single character/grapheme/whatever can take up
an arbitrarily large number of bytes. However, there are various
recommendations such as the Unicode UAX15-D3
<http://www.unicode.org/reports/tr15/#UAX15-D3> "stream-safe" limit of
30. Reportedly the largest known "legitimate" use is a 1 base + 8
combining marks grapheme used in a Tibetan script.
ELLIPSIS
When a string is truncated, "truncate_egc" indicates this by appending
an ellipsis. The length of the truncated content including the ellipsis
is guaranteed to be no greater than the byte size limit you specified.
By default the ellipsis is the character U+2026 (…); however, you can use
any other string by passing it in as the third argument. The ellipsis
string must not contain invalid UTF-8 (it can be encoded or can contain
perl high code-points, up to you). Note the default ellipsis consumes 3
bytes in UTF-8 encoding, which is the same as 3 periods in a row.
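The byte arithmetic behind this can be checked with core Perl. The following is only a sketch of the budget calculation, using illustrative variable names; it does not call this module.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $limit    = 7;              # byte limit, as in the SYNOPSIS
my $ellipsis = "\x{2026}";     # the default ellipsis, U+2026

# The ellipsis is counted against the limit after UTF-8 encoding.
my $ellipsis_bytes = length encode('UTF-8', $ellipsis);    # 3
my $three_periods  = length encode('UTF-8', "...");        # also 3

# Whatever remains is available for the truncated content itself.
my $content_budget = $limit - $ellipsis_bytes;             # 4

print "$ellipsis_bytes $three_periods $content_budget\n";  # prints "3 3 4"
```

With a limit of 7 this leaves 4 bytes for content, which matches the "hell…" result in the SYNOPSIS; with an empty ellipsis all 7 bytes are available, giving "hello w".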
IMPLEMENTATION
This module uses the ragel <http://www.colm.net/open-source/ragel/>
state machine compiler to parse/validate UTF-8 and to determine the
presence of combining characters. Ragel is nice because we can determine
the truncation location with a single pass through the data in an
optimised C loop.
One of the requirements of this module was to additionally validate
UTF-8 encoding. This is so you can run it against strings with or
without having decoded them with "Encode::decode" first. This module
will throw exceptions if the strings to be truncated aren't valid UTF-8. This
property lets us minimise the number of times a user-supplied string is
"decoded". With this module, you can accept an arbitrary string from a
web request (say), validate that it is UTF-8, truncate it if necessary,
and write it out to a DB, all with only a single pass over the data.
As mentioned, this module will not scan further than it needs to in
order to determine the truncation location. So creating a short
truncation of a really long string doesn't require traversing the entire
string. However, this module won't validate that the bytes beyond its
truncation location are valid UTF-8.
Another purpose of this module is to be a "proof of concept" for
Inline::Module::LeanDist and Inline::Filters::Ragel. This distribution
concept was of course heavily inspired by Inline::Module.
SEE ALSO
Unicode-Truncate github repo
<https://github.com/hoytech/Unicode-Truncate>
Although efficient, as discussed above, "substr" will not be able to
give you a guaranteed byte-length output (if done pre-encoding) and will
corrupt text (pre or post-encoding).
There are several similar modules, such as Text::Truncate,
String::Truncate, and Text::Elide, but they are all essentially wrappers
around "substr" and are subject to its limitations.
A reasonable "99%" solution is to encode your string as UTF-8, truncate
at the byte-level with "substr", decode with "Encode::FB_QUIET", and
then re-encode it to UTF-8. This will ensure that the output is always
valid UTF-8, but will still risk corrupting unicode text that contains
combining marks.
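A minimal sketch of that "99%" approach, using only core Encode, might look like the following. The helper name "truncate_bytes_quiet" is hypothetical and is not provided by this module or by Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper for illustration; not part of Unicode::Truncate.
# Takes a decoded perl string and returns at most $max_bytes of UTF-8.
sub truncate_bytes_quiet {
    my ($text, $max_bytes) = @_;

    # Encode, then cut at the byte level (may leave a partial sequence).
    my $bytes = substr(encode('UTF-8', $text), 0, $max_bytes);

    # FB_QUIET stops at the first malformed sequence and returns what
    # was decoded up to that point, dropping the trailing partial.
    my $chars = decode('UTF-8', $bytes, Encode::FB_QUIET);

    return encode('UTF-8', $chars);
}

# A cut inside the 2-byte "é" is repaired: only "n" survives.
my $repaired = truncate_bytes_quiet("n\x{e9}e", 2);

# In NFD the same limit keeps "ne" but silently loses the accent.
my $stripped = truncate_bytes_quiet("ne\x{301}e", 2);
```

The second call shows the remaining weakness: the output is valid UTF-8, but the base letter has been separated from its combining mark.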
Ricardo Signes suggested an algorithm using Unicode::GCString which
would also be correct but likely less efficient.
It may be possible to use the regexp engine's "\X" combined with "(?{})"
in some way but I haven't been able to figure that out.
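For comparison, a simpler (and likely slower) pure-perl sketch can walk the string one "\X" match at a time instead of using "(?{})". This is only an illustration under stated assumptions: the helper name "truncate_by_cluster" is hypothetical, it adds no ellipsis, it expects an already-decoded string, and perl's "\X" follows the Unicode version bundled with your perl, which may differ from the TR29 version this module uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper for illustration; not part of Unicode::Truncate.
sub truncate_by_cluster {
    my ($text, $max_bytes) = @_;    # $text is a decoded perl string
    my ($out, $used) = ('', 0);

    # \X matches one extended grapheme cluster; \G continues each
    # match where the previous one ended.
    while ($text =~ /\G(\X)/g) {
        my $cluster = $1;
        my $size    = length encode('UTF-8', $cluster);

        # Stop before the first cluster that would exceed the limit.
        last if $used + $size > $max_bytes;

        $out  .= $cluster;
        $used += $size;
    }

    return encode('UTF-8', $out);
}

# "e" + U+0301 is one 3-byte cluster, so a 2-byte limit keeps only "n".
my $short = truncate_by_cluster("ne\x{301}e Jones", 2);
```

Unlike this sketch, the module validates and truncates encoded input in a single pass without decoding it first.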
BUGS
Of course I can't test this module on all the writing systems of the
world so I don't know the severity of the corruption in all situations.
It's possible that the corruption can be minimised in additional ways
without sacrificing the simplicity or efficiency of the algorithm. If
you have any ideas please let me know and I'll try to incorporate them.
Eventually I'd like to truncate on other boundaries specified by
unicode, such as word, sentence, and line.
It would be nice to be able to apply an EGC limit such as 30.
This module doesn't handle the UTF-16 surrogate range in the grapheme
properties files because "Encode::encode" isn't encoding them the way
I'd need them to. That's OK because these aren't valid UTF-8 anyway.
Perl internally supports characters outside what is officially unicode.
This module only works with the official UTF-8 range so if you are using
this perl extension (perhaps for some sort of non-unicode sentinel
value) this module will throw an exception indicating invalid UTF-8
encoding (which is more of a feature than a bug given this module's
primary purpose of validating and truncating untrusted, user-provided
text).
AUTHOR
Doug Hoyte, "<[email protected]>"
COPYRIGHT & LICENSE
Copyright 2014-2015 Doug Hoyte.
This module is licensed under the same terms as perl itself.