=pod
=head1 NAME
README for dta-tokwrap - programs, scripts, and perl modules for DTA XML corpus tokenization
=cut
##======================================================================
=pod
=head1 DESCRIPTION
This package contains various utilities for
tokenization of DTA "base-format" XML documents.
See L</INSTALLATION> for requirements and installation instructions,
L</USAGE> for a brief introduction to the high-level command-line interface,
and L</TOOLS> for an overview of the individual tools included in this distribution.
=cut
##======================================================================
=pod
=head1 INSTALLATION
=cut
##--------------------------------------------------------------
=pod
=head2 Requirements
=cut
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod
=head3 C Libraries
=over 4
=item expat
tested version(s): 1.95.8, 2.0.1
=item libxml2
tested version(s): 2.7.3, 2.7.8
=item libxslt
tested version(s): 1.1.24, 1.1.26
=back
=cut
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod
=head3 Perl Modules
See F<DTA-TokWrap/README.txt> for a full list of required
perl modules.
=cut
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod
=head3 Development Tools
=over 4
=item C compiler
tested version(s): gcc / linux: v4.3.3, 4.4.6
=item GNU flex (development only)
tested version(s): 2.5.33, 2.5.35
Only needed if you plan on making changes to the lexer sources.
=item GNU autoconf (SVN only)
tested version(s): 2.61, 2.67
Required for building from SVN sources.
=item GNU automake (SVN only)
tested version(s): 1.9.6, 1.11.1
Required for building from SVN sources.
=back
=cut
##--------------------------------------------------------------
=pod
=head2 Building from SVN
To build this package from SVN sources, you must first run the shell command:
bash$ sh ./autoreconf.sh
from the distribution root directory B<BEFORE> running F<./configure>.
Building from SVN sources requires additional development tools to be present
on the build system; see L</"Development Tools"> above.
Then follow the instructions in L</"Building from Source">.
=cut
##--------------------------------------------------------------
=pod
=head2 Building from Source
To build and install the entire package, issue the following commands to the shell:
bash$ cd dta-tokwrap-0.01 # (or wherever you unpacked this distribution)
bash$ sh ./configure # configure the package
bash$ make # build the package
bash$ make install # install the package on your system
More details on the top-level installation process can be found in
the file F<INSTALL> in the distribution root directory.
More details on building and installing the DTA::TokWrap perl module included in this distribution
can be found in the F<perlmodinstall(1)> manpage.
=cut
##======================================================================
=pod
=head1 USAGE
The perl program L<dta-tokwrap.perl|/dta-tokwrap.perl> installed from the F<DTA-TokWrap/>
distribution subdirectory provides a flexible high-level command-line interface
to the tokenization of DTA XML documents.
=cut
##--------------------------------------------------------------
=pod
=head2 Input Format
The L<dta-tokwrap.perl|dta-tokwrap.perl> script takes as its input DTA "base-format" XML files,
which are simply (TEI-conformant) UTF-8 encoded XML files with one C<E<lt>cE<gt>>
element per character:
=over 4
=item *
the document B<MUST> be encoded in UTF-8,
=item *
all text nodes to be tokenized should be descendants of a C<E<lt>textE<gt>> element,
and may optionally be immediate daughters of a C<E<lt>cE<gt>> element
(XPath C<//text//text()|//text//c/text()>). C<E<lt>cE<gt>> elements may not be nested.
Prior to dta-tokwrap v0.38, C<E<lt>cE<gt>> elements were required.
=back
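For orientation, a minimal input document satisfying the above constraints might
look roughly as follows; the surrounding TEI scaffolding is purely illustrative,
and only the UTF-8 encoding and the placement of the text nodes beneath a
C<E<lt>textE<gt>> element matter here:

  <?xml version="1.0" encoding="UTF-8"?>
  <TEI>
    <teiHeader><!-- document metadata --></teiHeader>
    <text>
      <body>
        <p>Im Anfang war das Wort.</p>
      </body>
    </text>
  </TEI>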
=cut
##--------------------------------------------------------------
=pod
=head2 Example: Tokenizing a single XML file
Assume we wish to tokenize a single DTA "base-format" XML file F<doc1.xml>.
Issue the following command to the shell:
bash$ dta-tokwrap.perl doc1.xml
This will create a number of output files, including:
=over 4
=item F<doc1.t.xml>
"Master" tokenizer output file encoding sentence boundaries, token boundaries,
and tokenizer-provided token analyses. Source for various stand-off annotation formats.
This format can also be passed directly to and from the L<DTA::CAB(3pm)|DTA::CAB>
analysis suite using the L<DTA::CAB::Format::XmlNative(3pm)|DTA::CAB::Format::XmlNative>
formatter class.
=back
=cut
##--------------------------------------------------------------
=pod
=head2 Example: Tokenizing multiple XML files
Assume we wish to tokenize a corpus of three DTA "base-format" XML files
F<doc1.xml>, F<doc2.xml>, and F<doc3.xml>.
This is as easy as:
bash$ dta-tokwrap.perl doc1.xml doc2.xml doc3.xml
For each input document specified on the command line,
master output files and stand-off annotation files will be created.
See L<"the dta-tokwrap.perl manpage"|dta-tokwrap.perl> for more details.
=head2 Example: Tracing execution progress
Assume we wish to tokenize a large corpus of XML input files F<doc*.xml>,
and would like to have some feedback on the progress of the
tokenization process.
Try:
bash$ dta-tokwrap.perl -verbose=1 doc*.xml
or:
bash$ dta-tokwrap.perl -verbose=2 doc*.xml
or even:
bash$ dta-tokwrap.perl -traceAll doc*.xml
=cut
##--------------------------------------------------------------
=pod
=head2 Example: From TEI to TCF and Back
Assume we have a TEI-like document F<doc.tei.xml> which we want
to encode as TCF in the file F<doc.tei.tcf>, using only whitespace tokenizer "hints", but
not actually tokenizing the document yet. This can be accomplished by:
 $ dta-tokwrap.perl -t=tei2tcf -weak-hints doc.tei.xml
If the output should instead be written to STDOUT, just call:
 $ dta-tokwrap.perl -t=tei2tcf -weak-hints -dO=tcffile=- doc.tei.xml
Assume that the resulting TCF document has undergone further processing
(e.g. via L<WebLicht|
http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main_Page>)
to produce an annotated TCF document F<doc.out.tcf>.
Selected TCF layers (in particular the C<tokens> and C<sentences> layers) can then be spliced back into the TEI document as
F<doc.out.xml> by calling:
$ dta-tokwrap.perl -t=tcf2tei doc.out.tcf -dO=tcffile=doc.out.tcf -dO=tcfcwsfile=doc.out.xml
=cut
##======================================================================
=pod
=head1 TOOLS
This section provides a brief overview of the individual tools included
in the dta-tokwrap distribution.
=cut
##--------------------------------------------------------------
=pod
=head2 Perl Scripts & Programs
The perl scripts and programs included with this distribution are installed
by default in F</usr/local/bin> and/or wherever your perl installs
scripts by default (e.g. in C<`perl -MConfig -e 'print $Config{installsitescript}'`>).
=over 4
=item dta-tokwrap.perl
Top-level wrapper script for document tokenization
using the L<DTA::TokWrap|DTA::TokWrap> perl API.
=item dtatw-add-c.perl
Script to insert C<E<lt>cE<gt>> elements and/or
C<xml:id> attributes for such elements into an
XML document which does not yet contain them.
Guaranteed not to clobber any existing C<//c/@xml:id> attributes;
missing C<//c/@xml:id> attributes are generated by a simple document-global counter
("c1", "c2", ..., "c65536").
Not really needed as of dta-tokwrap v0.38, since C<E<lt>cE<gt>> elements are no longer required.
See L<"the dtatw-add-c.perl manpage"|dtatw-add-c.perl> for more details.
=item dtatw-cids2local.perl
Script to convert C<//c/@xml:id> attributes to page-local encoding.
Never really used.
See L<"the dtatw-cids2local.perl manpage"|dtatw-cids2local.perl> for more details.
=item dtatw-add-ws.perl
Script to splice C<E<lt>sE<gt>> and C<E<lt>wE<gt>> elements encoded from a standoff (.t.xml or .u.xml) XML file
into the I<original> "base-format" (.chr.xml) file, producing a .cws.xml file.
A tad too generous with partial word segments, due to strict adjacency and boundary criteria.
In earlier versions of dta-tokwrap, this functionality was split between the scripts
C<dtatw-add-w.perl> and C<dtatw-add-s.perl>, which required only an I<id-compatible>
base-format (.chr.xml) file as the splice target. As of dta-tokwrap v0.35, the splice target
base-format file must be the I<original> source file itself, since the current implementation
uses byte offsets to perform the splice.
See L<"the dtatw-add-ws.perl manpage"|dtatw-add-ws.perl> for more details.
=item dtatw-splice.perl
Script to splice generic standoff attributes and/or content into a base file;
useful e.g. for merging flat DTA::CAB standoff analyses into TEI-structured
*.cws.xml files.
See L<"the dtatw-splice.perl manpage"|dtatw-splice.perl> for more details.
=item dtatw-get-ddc-attrs.perl
Script to insert DDC-relevant attributes extracted from a base file into a *.t.xml file,
producing a pre-DDC XML format file (by convention *.ddc.t.xml, a subset of the *.t.xml
format).
See L<"the dtatw-get-ddc-attrs.perl manpage"|dtatw-get-ddc-attrs.perl> for more details.
=item dtatw-get-header.perl
Simple script to extract a single header element from an XML file (e.g. for later
inclusion in a DDC XML format file).
See L<"the dtatw-get-header.perl manpage"|dtatw-get-header.perl> for more details.
See L<"the dtatw-get-header.perl manpage"|dtatw-get-header.perl> for more details.
=item dtatw-pn2p.perl
Script to insert C<E<lt>pE<gt>...E<lt>/pE<gt>> wrappers for C<//s/@pn> key attributes
in "flat" F<*.t.xml> files.
=item dtatw-xml2ddc.perl
Script to convert *.ddc.t.xml files and optional headers to DDC-XML format.
See L<"the dtatw-xml2ddc.perl manpage"|dtatw-xml2ddc.perl> for more details.
=item dtatw-t-check.perl
Simple script to check the consistency of tokenizer output (F<*.t>) offset and length
fields against the input (F<*.txt>) file.
=item dtatw-rm-c.perl
Script to remove C<E<lt>cE<gt>> elements from an XML document.
Regex hack, fast but not exceedingly robust, use with caution.
See also L</"dtatw-rm-c.xsl">
=item dtatw-rm-w.perl
Fast regex hack to remove C<E<lt>wE<gt>> elements from an XML document.
=item dtatw-rm-s.perl
Fast regex hack to remove C<E<lt>sE<gt>> elements from an XML document.
=item dtatw-rm-lb.perl
Script to remove C<E<lt>lbE<gt>> (line-break) elements from an XML document,
replacing them with newlines.
Regex hack, fast but not robust, use with caution.
See also L</"dtatw-rm-lb.xsl">
=item dtatw-lb-encode.perl
Encodes newlines under C<//text//text()> in an XML document as C<E<lt>lbE<gt>> (line-break) elements
using high-level file heuristics only.
Regex hack, fast but not robust, use with caution.
See also L</"dtatw-ensure-lb.perl">, L</"dtatw-add-lb.xsl">, L</"dtatw-rm-lb.perl">.
=item dtatw-ensure-lb.perl
Script to ensure that all C<//text//text()> newlines in an XML document are explicitly encoded
with C<E<lt>lbE<gt>> (line-break) elements, using optional file-, element-,
and line-level heuristics.
Robust but slow, since it actually parses XML input documents.
See also L</"dtatw-lb-encode.perl">, L</"dtatw-add-lb.xsl">, L</"dtatw-rm-lb.perl">.
=item dtatw-tt-dictapply.perl
Script to apply a type-"dictionary" in one-word-per-line (.tt) format to a
token corpus in one-word-per-line (.tt) format. Especially useful together with
standard UNIX utilities such as cut, grep, sort, and uniq (see the example following this list).
=item dtatw-cabtt2xml.perl
Script to convert DTA::CAB::Format::TT (one-word-per-line with variable analysis
fields identified by conventional prefixes) files to the expanded F<.t.xml> format used
by dta-tokwrap. The expanded format should be identical to that used by the
DTA::CAB::Format::Xml class. See also L</"dtatw-txml2tt.xsl">.
=item file-substr.perl
Script to extract a portion of a file,
specified by byte offset and length.
Useful for debugging index files created by other tools.
=back
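Since the one-word-per-line (.tt) format is plain TAB-separated text, the scripts
above combine naturally with standard UNIX tools. For instance, a hypothetical
pipeline (file names invented for illustration) for extracting a frequency list of
token types from a tokenized corpus file, assuming comment lines are marked by a
leading "%%" as in moot's native format:

  bash$ grep -v '^%%' doc1.t | cut -f1 | sort | uniq -c | sort -rn > doc1.types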
=cut
##--------------------------------------------------------------
=pod
=head2 GNU make build system template
The distribution directory F<make/> contains a "template"
for using GNU F<make> to
organize the conversion of large corpora with
the dta-tokwrap utilities. This is useful because:
=over 4
=item *
F<make>'s intuitive, easy-to-read syntax provides a
wonderful vehicle for user-defined configuration files,
obviating the need to remember the names of all 64
(at last count)
L<dta-tokwrap.perl|dta-tokwrap.perl> options,
=item *
F<make>
is very good at tracking complex dependencies of the sort
that exist between the various temporary files generated
by the dta-tokwrap utilities,
=item *
F<make>
jobs can be made "robust" simply by adding a C<-k>
(C<--keep-going>) to the command-line,
and
=item *
last but certainly not least,
F<make>
has built-in support for parallelization of complex
tasks by means of the C<-j N> (C<--jobs=N>) option,
allowing us to take advantage of multiprocessor systems.
=back
By default, the contents of the distribution F<make/>
subdirectory are installed to F</usr/local/share/dta-tokwrap/make/>.
See the comments at the top of F<make/User.mak> for instructions.
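A purely hypothetical session sketching the intended workflow (directory names are
invented for illustration; the available targets and variables are documented in the
template itself):

  bash$ cp -a /usr/local/share/dta-tokwrap/make ./mycorpus-build
  bash$ cd mycorpus-build
  bash$ $EDITOR User.mak    # point the template at your corpus, set options
  bash$ make -k -j4         # build everything: 4 parallel jobs, keep going on errors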
=cut
##--------------------------------------------------------------
=pod
=head2 Perl Modules
=over 4
=item L<DTA::TokWrap|DTA::TokWrap>
Top-level tokenization-wrapper module, used by L<dta-tokwrap.perl|dta-tokwrap.perl>.
=item L<DTA::TokWrap::Document|DTA::TokWrap::Document>
Object-oriented wrapper for documents to be processed.
=item L<DTA::TokWrap::Processor|DTA::TokWrap::Processor>
Abstract base class for elementary document-processing operations.
=back
See the L<DTA::TokWrap::Intro(3pm)|DTA::TokWrap::Intro> manpage for more details
on included modules, APIs, calling conventions, etc.
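To verify that the modules are installed and visible to your perl, a quick one-liner
suffices (assuming the top-level module defines the conventional C<$VERSION> package
variable):

  bash$ perl -MDTA::TokWrap -e 'print "$DTA::TokWrap::VERSION\n"'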
=cut
##--------------------------------------------------------------
=pod
=head2 XSL stylesheets
The XSL stylesheets included with this distribution are installed
by default in F</usr/local/share/dta-tokwrap/stylesheets>.
=over 4
=item dtatw-add-lb.xsl
Replaces newlines with C<E<lt>lb/E<gt>> elements in the input document.
=item dtatw-assign-cids.xsl
Assigns missing C<//c/@xml:id> attributes using the XSL C<generate-id()> function.
=item dtatw-rm-c.xsl
Removes C<E<lt>cE<gt>> elements from the input document.
Slow but robust.
=item dtatw-rm-lb.xsl
Replaces C<E<lt>lb/E<gt>> elements with newlines.
=item dtatw-txml2tt.xsl
Converts "master" tokenized XML output format (F<*.t.xml>) to
TAB-separated one-word-per-line format
(F<*.mr.t>
aka F<*.t>
aka F<*.tt>
aka "tt"
aka "CSV"
aka DTA::CAB::Format::TT
aka "TnT"
aka "TreeTagger"
aka "vertical"
aka "moot-native"
aka ...).
See the F<mootfiles(5)> manpage for basic format details, and
see the top of the XSL script for some influential transformation parameters.
=back
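The stylesheets can be applied with any standalone XSLT processor; for example, with
F<xsltproc> from libxslt (file names here are purely illustrative):

  bash$ xsltproc /usr/local/share/dta-tokwrap/stylesheets/dtatw-rm-c.xsl doc1.xml > doc1.noc.xml

Stylesheet parameters (e.g. those of F<dtatw-txml2tt.xsl>) can be set via F<xsltproc>'s
C<--param> and C<--stringparam> options.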
=cut
##--------------------------------------------------------------
=pod
=head2 C Programs
Several C programs are included with the distribution.
These are used by the L<dta-tokwrap.perl|dta-tokwrap.perl> script
to perform various intermediate document processing operations,
and should not need to be called by the user directly.
B<Caveat Scriptor>: The following programs are meant for
internal use by the C<DTA::TokWrap> modules only, and their
names, calling conventions, and very presence are subject to change
without notice.
=over 4
=item dtatw-mkindex
Splits input document F<doc.xml>
into
a "character index" F<doc.cx> (CSV),
a "structural index" F<doc.sx> (XML),
and a
"text index" F<doc.tx> (UTF-8 text).
=item dtatw-rm-namespaces
Removes namespaces from any XML document by
renaming "C<xmlns>" attributes to "C<xmlns_>"
and "C<xmlns:*>" attributes to "C<xmlns_*>".
Useful because XSL's namespace handling is annoyingly slow and ugly.
=item dtatw-tokenize-dummy
Dummy C<flex> tokenizer. Useful for testing.
=item dtatw-txml2sxml
Converts "master" tokenized XML output format (F<*.t.xml>) to
sentence-level stand-off XML format (F<*.s.xml>).
=item dtatw-txml2wxml
Converts "master" tokenized XML output format (F<*.t.xml>) to
token-level stand-off XML format (F<*.w.xml>).
=item dtatw-txml2axml
Converts "master" tokenized XML output format (F<*.t.xml>) to
token-analysis-level stand-off XML format (F<*.a.xml>).
=back
=cut
##======================================================================
=pod
=head1 SEE ALSO
perl(1).
=head1 AUTHOR
Bryan Jurish E<lt>[email protected]E<gt>
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2009-2018 by Bryan Jurish
This package is free software. Redistribution and modification
of C portions of this package are subject to the terms of
version 3 or greater of the GNU Lesser General Public License; see the
files COPYING and COPYING.LESSER which came with the distribution for details.
Redistribution and/or modification of the Perl portions of this package
are subject to the same terms as Perl itself, either Perl version 5.24.1 or,
at your option, any later version of Perl 5 you may have available.
=cut