NAME
   Unicode::Truncate - Unicode-aware efficient string truncation

SYNOPSIS
       use utf8;
       use Unicode::Truncate;

       truncate_egc("hello world", 7);
       ## returns "hell…"

       truncate_egc("hello world", 7, '');
       ## returns "hello w"

       truncate_egc('深圳市', 7);
       ## returns "深…"

       truncate_egc("née Jones", 5);
       ## returns "n…" (not "ne…", even in NFD)

       truncate_egc("\xff", 10);
       ## throws exception:
       ##   "input string not valid UTF-8 (detected at byte offset 0 in truncate_egc)"

       my $str = "hello world";
       truncate_egc_inplace($str, 8);
       ## $str is now "hello…"

DESCRIPTION
   This module is for truncating UTF-8 encoded Unicode text to particular
   byte lengths while inflicting the least amount of data corruption
   possible. The resulting truncated string will be no longer than your
   specified number of bytes (after UTF-8 encoding).

   All truncated strings will continue to be valid UTF-8: it won't cut in
   the middle of a UTF-8 encoded code-point. Furthermore, if your text
   contains combining diacritical marks, this module will not cut in
   between a diacritical mark and the base character. In general it tries
   to preserve what users perceive as whole characters, with as little
   mutilation as possible at the truncation site.

   The "truncate_egc" function truncates only between extended grapheme
   clusters
   <https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Charac
   ters_grapheme_clusters_and_glyphs> (as defined by Unicode TR29
   <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>
   version 7.0.0).

   The "truncate_egc_inplace" function is identical to "truncate_egc"
   except that it modifies the input string directly, so no copy is made.
   If you pass in a read-only value it will throw an exception.

   Eventually I'd like to support other boundaries such as words and
   sentences. Those functions will be named "truncate_word" and so on.

RATIONALE
   In a perfect world we would only need to worry about the amount of
   space some text takes up on the screen. In the real world we often
   have to, or want to, make sure things fit within byte-size limits.
   Many databases, network protocols, and file formats
   require honouring byte-length restrictions. Even if they automatically
   truncate for you, are they doing it properly and consistently? On many
   file-systems, file and directory names are subject to byte-size limits.
   Many APIs that use C structs have fixed limits as well. You may even
   wish to do things like guarantee that a collection of news headlines
   will fit in a single ethernet packet.

   I knew I had to write this module after I asked Tom Christiansen about
   the best way to truncate Unicode to fit in fixed-byte fields and he got
   angry and told me to never do that. :)

   Why not just use "substr" on a string before UTF-8 encoding it? The main
   problem is that the number of bytes an encoded string will consume is
   not known until after you encode it. It depends on how many "high"
   code-points are in the string, how "high" those code-points are, the
   normalisation form chosen, and (relatedly) how many combining marks
   are used. Even with Perl Unicode strings (i.e. before encoding),
   "substr" can cut in front of combining marks.

   Truncating post-encoding may result in invalid UTF-8 partials at the end
   of your string, as well as cutting in front of combining marks.
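
   The two failure modes above can be reproduced with core Perl alone. The
   following is a minimal sketch, not part of this module: the helper
   names "utf8_byte_length" and "is_valid_utf8" are hypothetical and exist
   only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: number of bytes a character string needs in UTF-8.
sub utf8_byte_length {
    my ($str) = @_;
    return length(encode_utf8($str));
}

# Hypothetical helper: true if the byte string is well-formed UTF-8
# (by Perl's own, fairly lax, rules).
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # utf8::decode works in place, so use a copy
    return utf8::decode($copy) ? 1 : 0;
}

# Same character count (3), different encoded sizes.
for my $s ("abc", "n\x{E9}e", "\x{6DF1}\x{5733}\x{5E02}") {
    printf "%d characters -> %d bytes\n", length($s), utf8_byte_length($s);
}

# Character-level substr keeps "ne" but drops the combining acute accent.
my $nfd       = "ne\x{301}e";          # "nee" with an accent, in NFD
my $cut_chars = substr($nfd, 0, 2);    # the accent is silently lost

# Byte-level substr can stop in the middle of a 3-byte code point.
my $cut_bytes = substr(encode_utf8("\x{6DF1}\x{5733}"), 0, 4);
printf "byte-level cut is %s UTF-8\n",
    is_valid_utf8($cut_bytes) ? "valid" : "invalid";
```

   Neither cut reports an error, which is why this kind of corruption
   tends to go unnoticed.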

   One interesting aspect of Unicode's combining marks is that there is no
   specified limit to the number of combining marks that can be applied
   to a base character. So in some interpretations a single
   character/grapheme/whatever can take up an arbitrarily large number of
   bytes. However, there are various recommendations such as the Unicode
   UAX15-D3 <http://www.unicode.org/reports/tr15/#UAX15-D3> "stream-safe"
   limit of 30 consecutive combining marks. Reportedly the largest known
   "legitimate" use is a 1 base + 8 combining marks grapheme used in a
   Tibetan script.

ELLIPSIS
   When a string is truncated, "truncate_egc" indicates this by appending
   an ellipsis. The length of the truncated content including the ellipsis
   is guaranteed to be no greater than the byte size limit you specified.

   By default the ellipsis is the character U+2026 (…), but you can use
   any other string by passing it in as the third argument. The ellipsis
   string must be valid UTF-8: it can either be already encoded or
   contain Perl high code-points, whichever you prefer. Note that the
   default ellipsis consumes 3 bytes in UTF-8 encoding, the same as 3
   periods in a row.
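
   Because the ellipsis counts towards the limit, it reduces the room left
   for content. The sketch below shows the arithmetic with core Perl only.
   The helper "content_budget" is hypothetical, and the assumption that
   the budget is simply the limit minus the encoded ellipsis size is
   inferred from the SYNOPSIS outputs rather than taken from the module's
   source.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: bytes left for content once the ellipsis is paid for.
# Assumes budget = limit - encoded ellipsis size (inferred, not confirmed).
sub content_budget {
    my ($limit, $ellipsis) = @_;
    return $limit - length(encode_utf8($ellipsis));
}

my $limit = 7;
for my $ellipsis ("\x{2026}", "...", "") {
    printf "ellipsis of %d bytes leaves %d of %d bytes for content\n",
        length(encode_utf8($ellipsis)),
        content_budget($limit, $ellipsis),
        $limit;
}
```

   With a limit of 7, the default ellipsis leaves 4 bytes ("hell") and an
   empty ellipsis leaves all 7 ("hello w"), matching the SYNOPSIS.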

IMPLEMENTATION
   This module uses the Ragel <http://www.colm.net/open-source/ragel/>
   state machine compiler to parse/validate UTF-8 and to determine the
   presence of combining characters. Ragel is nice because we can determine
   the truncation location with a single pass through the data in an
   optimised C loop.

   One of the requirements of this module was to additionally validate
   UTF-8 encoding. This is so you can run it against strings with or
   without having decoded them with "Encode::decode" first. This module
   will throw exceptions if the strings to be truncated aren't UTF-8. This
   property lets us minimise the number of times a user-supplied string is
   "decoded". With this module, you can accept an arbitrary string from a
   web request (say), validate that it is UTF-8, truncate it if necessary,
   and write it out to a DB, all with only a single pass over the data.

   This module will not scan further than it needs to in order to
   determine the truncation location, so creating a short truncation of
   a really long string doesn't require traversing the entire string.
   As a consequence, it won't validate that the bytes beyond the
   truncation location are valid UTF-8.

   Another purpose of this module is to be a "proof of concept" for
   Inline::Module::LeanDist and Inline::Filters::Ragel. This distribution
   concept was of course heavily inspired by Inline::Module.

SEE ALSO
   Unicode-Truncate GitHub repo
   <https://github.com/hoytech/Unicode-Truncate>

   Although efficient, as discussed above, "substr" will not be able to
   give you a guaranteed byte-length output (if done pre-encoding) and will
   corrupt text (pre or post-encoding).

   There are several similar modules such as Text::Truncate,
   String::Truncate, Text::Elide but they are all essentially wrappers
   around "substr" and are subject to its limitations.

   A reasonable "99%" solution is to encode your string as UTF-8, truncate
   at the byte-level with "substr", decode with "Encode::FB_QUIET", and
   then re-encode it to UTF-8. This will ensure that the output is always
   valid UTF-8, but will still risk corrupting unicode text that contains
   combining marks.
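
   The "99%" recipe above might look like the following sketch. It uses
   only the core Encode module; the function name "truncate_utf8_approx"
   is hypothetical and is not provided by this module or by Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: takes a character string, returns UTF-8 bytes
# no longer than $max_bytes. Valid UTF-8, but not grapheme-aware.
sub truncate_utf8_approx {
    my ($str, $max_bytes) = @_;

    my $bytes = encode('UTF-8', $str);
    return $bytes if length($bytes) <= $max_bytes;

    # Cut at the byte level; this may leave a partial code point at the end.
    my $cut = substr($bytes, 0, $max_bytes);

    # FB_QUIET makes decode stop quietly at the first malformed sequence
    # and return only what it managed to decode before that point.
    my $chars = decode('UTF-8', $cut, Encode::FB_QUIET());

    return encode('UTF-8', $chars);
}

# A cut inside the second 3-byte character keeps only the first one.
my $cjk = truncate_utf8_approx("\x{6DF1}\x{5733}", 4);

# The output is valid UTF-8, but the combining accent has been stripped
# from its base letter: "e" + U+0301 becomes a bare "e".
my $bare = truncate_utf8_approx("e\x{301}", 2);
```

   The second call illustrates the remaining 1%: the result is well-formed
   but no longer says what the author wrote.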

   Ricardo Signes suggested an algorithm using Unicode::GCString which
   would also be correct but likely less efficient.

   It may be possible to use the regexp engine's "\X" combined with "(?{})"
   in some way but I haven't been able to figure that out.
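
   Without "(?{})", a plain loop over "\X" matches gives a slow but simple
   pure-Perl approximation. The sketch below is an assumption-laden
   illustration, not this module's algorithm: "truncate_by_clusters" is a
   hypothetical name, it requires an already-decoded character string, it
   scans the whole input, and it relies on whichever Unicode version your
   perl's "\X" implements.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: takes a decoded character string, returns UTF-8
# bytes no longer than $max_bytes, cutting only between \X clusters.
sub truncate_by_clusters {
    my ($str, $max_bytes, $ellipsis) = @_;
    $ellipsis = "\x{2026}" unless defined $ellipsis;

    my $bytes = encode_utf8($str);
    return $bytes if length($bytes) <= $max_bytes;

    my $tail   = encode_utf8($ellipsis);
    my $budget = $max_bytes - length($tail);
    return '' if $budget < 0;    # assumption: give up if even the ellipsis won't fit

    my $out = '';
    while ($str =~ /(\X)/g) {
        my $cluster = encode_utf8($1);
        last if length($out) + length($cluster) > $budget;
        $out .= $cluster;
    }
    return $out . $tail;
}

# "hello world" at 7 bytes: four content bytes plus the 3-byte ellipsis.
my $short = truncate_by_clusters("hello world", 7);

# NFD input: the "e" + U+0301 cluster does not fit, so it is dropped whole.
my $nfd = truncate_by_clusters("ne\x{301}e Jones", 5);
```

   Unlike the byte-level recipes, this never separates a base character
   from its combining marks, at the cost of encoding every cluster
   individually.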

BUGS
   Of course I can't test this module on all the writing systems of the
   world so I don't know the severity of the corruption in all situations.
   It's possible that the corruption can be minimised in additional ways
   without sacrificing the simplicity or efficiency of the algorithm. If
   you have any ideas please let me know and I'll try to incorporate them.

   Eventually I'd like to truncate on other boundaries specified by
   Unicode, such as word, sentence, and line.

   It would be nice to be able to cap the number of combining marks in a
   single EGC, for example at the stream-safe limit of 30.

   This module doesn't handle the UTF-16 surrogate range in the grapheme
   properties files because "Encode::encode" isn't encoding them the way
   I'd need them to. That's OK because these aren't valid UTF-8 anyway.

   Perl internally supports characters outside what is officially Unicode.
   This module only works with the official UTF-8 range, so if you are
   using this Perl extension (perhaps for some sort of non-Unicode
   sentinel value) this module will throw an exception indicating invalid
   UTF-8 encoding. That is more of a feature than a bug given this
   module's primary purpose of validating and truncating untrusted,
   user-provided text.

AUTHOR
   Doug Hoyte, "<[email protected]>"

COPYRIGHT & LICENSE
   Copyright 2014-2015 Doug Hoyte.

   This module is licensed under the same terms as perl itself.