=pod
=head1 NAME
README for dta-tokwrap - programs, scripts, and perl modules for DTA XML corpus tokenization
=cut
##======================================================================
=pod
=head1 DESCRIPTION
This package contains various utilities for
tokenization of DTA "base-format" XML documents.
See L</INSTALLATION> for requirements and installation instructions,
L</USAGE> for a brief introduction to the high-level command-line interface,
and L</TOOLS> for an overview of the individual tools included in this distribution.
=cut
##======================================================================
=pod
=head1 INSTALLATION
=cut
##--------------------------------------------------------------
=pod
=head2 Requirements
=cut
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod
=head3 C Libraries
=over 4
=item expat
tested version(s): 1.95.8, 2.0.1
=item libxml2
tested version(s): 2.7.3, 2.7.8
=item libxslt
tested version(s): 1.1.24, 1.1.26
=back
=cut
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod
=head3 Perl Modules
See F<DTA-TokWrap/README.txt> for a full list of required
perl modules.
=cut
##~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
=pod
=head3 Development Tools
=over 4
=item C compiler
tested version(s): gcc / linux: v4.3.3, 4.4.6
=item GNU flex (development only)
tested version(s): 2.5.33, 2.5.35
Only needed if you plan on making changes to the lexer sources.
=item GNU autoconf (SVN only)
tested version(s): 2.61, 2.67
Required for building from SVN sources.
=item GNU automake (SVN only)
tested version(s): 1.9.6, 1.11.1
Required for building from SVN sources.
=back
=cut
##--------------------------------------------------------------
=pod
=head2 Building from SVN
To build this package from SVN sources, you must first run the shell command:
bash$ sh ./autoreconf.sh
from the distribution root directory B<BEFORE> running F<./configure>.
Building from SVN sources requires additional development tools to be present
on the build system; see L</"Development Tools"> above.
Then follow the instructions in L</"Building from Source">.
=cut
##--------------------------------------------------------------
=pod
=head2 Building from Source
To build and install the entire package, issue the following commands to the shell:
bash$ cd dta-tokwrap-0.01 # (or wherever you unpacked this distribution)
bash$ sh ./configure # configure the package
bash$ make # build the package
bash$ make install # install the package on your system
More details on the top-level installation process can be found in
the file F<INSTALL> in the distribution root directory.
More details on building and installing the DTA::TokWrap perl module included in this distribution
can be found in the F<perlmodinstall(1)> manpage.
=cut
##======================================================================
=pod
=head1 USAGE
The perl program L<dta-tokwrap.perl|/dta-tokwrap.perl> installed from the F<DTA-TokWrap/>
distribution subdirectory provides a flexible high-level command-line interface
to the tokenization of DTA XML documents.
=cut
##--------------------------------------------------------------
=pod
=head2 Input Format
The L<dta-tokwrap.perl|dta-tokwrap.perl> script takes as its input DTA "base-format" XML files,
which are simply (TEI-conformant) UTF-8 encoded XML files with one C<E<lt>cE<gt>>
element per character:
=over 4
=item *
the document B<MUST> be encoded in UTF-8,
=item *
all text nodes to be tokenized should be descendants of a C<E<lt>textE<gt>> element,
and may optionally be immediate daughters of a C<E<lt>cE<gt>> element
(XPath C<//text//text()|//text//c/text()>). C<E<lt>cE<gt>> elements may not be nested.
Prior to dta-tokwrap v0.38, C<E<lt>cE<gt>> elements were required.
=back
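For orientation, a minimal input document satisfying the above constraints might
look roughly as follows; the surrounding TEI scaffolding is purely illustrative,
and only the UTF-8 encoding and the placement of the text nodes beneath a
C<E<lt>textE<gt>> element matter here:

  <?xml version="1.0" encoding="UTF-8"?>
  <TEI>
    <teiHeader><!-- document metadata --></teiHeader>
    <text>
      <body>
        <p>Im Anfang war das Wort.</p>
      </body>
    </text>
  </TEI>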
=cut
##--------------------------------------------------------------
=pod
=head2 Example: Tokenizing a single XML file
Assume we wish to tokenize a single DTA "base-format" XML file F<doc1.xml>.
Issue the following command to the shell:
bash$ dta-tokwrap.perl doc1.xml
This will create a number of output files, including:
=over 4
=item F<doc1.t.xml>
"Master" tokenizer output file encoding sentence boundaries, token boundaries,
and tokenizer-provided token analyses. Source for various stand-off annotation formats.
This format can also be passed directly to and from the L<DTA::CAB(3pm)|DTA::CAB>
analysis suite using the L<DTA::CAB::Format::XmlNative(3pm)|DTA::CAB::Format::XmlNative>
formatter class.
=back
=cut
##--------------------------------------------------------------
=pod
=head2 Example: Tokenizing multiple XML files
Assume we wish to tokenize a corpus of three DTA "base-format" XML files
F<doc1.xml>, F<doc2.xml>, and F<doc3.xml>.
This is as easy as:
bash$ dta-tokwrap.perl doc1.xml doc2.xml doc3.xml
For each input document specified on the command line,
master output files and stand-off annotation files will be created.
See L<"the dta-tokwrap.perl manpage"|dta-tokwrap.perl> for more details.
=head2 Example: Tracing execution progress
Assume we wish to tokenize a large corpus of XML input files F<doc*.xml>,
and would like to have some feedback on the progress of the
tokenization process.
Try:
bash$ dta-tokwrap.perl -verbose=1 doc*.xml
or:
bash$ dta-tokwrap.perl -verbose=2 doc*.xml
or even:
bash$ dta-tokwrap.perl -traceAll doc*.xml
=cut
##--------------------------------------------------------------
=pod
=head2 Example: From TEI to TCF and Back
Assume we have a TEI-like document F<doc.tei.xml> which we want
to encode as TCF in the file F<doc.tei.tcf>, using only whitespace tokenizer "hints", but
not actually tokenizing the document yet. This can be accomplished by:
 $ dta-tokwrap.perl -t=tei2tcf -weak-hints doc.tei.xml
If the output should instead be written to STDOUT, just call:
 $ dta-tokwrap.perl -t=tei2tcf -weak-hints -dO=tcffile=- doc.tei.xml
Assume that the resulting TCF document has undergone further processing
(e.g. via L<WebLicht|
http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main_Page>)
to produce an annotated TCF document F<doc.out.tcf>.
Selected TCF layers (in particular the C<tokens> and C<sentences> layers) can then be spliced back into the TEI document as
F<doc.out.xml> by calling:
$ dta-tokwrap.perl -t=tcf2tei doc.out.tcf -dO=tcffile=doc.out.tcf -dO=tcfcwsfile=doc.out.xml
=cut
##======================================================================
=pod
=head1 TOOLS
This section provides a brief overview of the individual tools included
in the dta-tokwrap distribution.
=cut
##--------------------------------------------------------------
=pod
=head2 Perl Scripts & Programs
The perl scripts and programs included with this distribution are installed
by default in F</usr/local/bin> and/or wherever your perl installs
scripts by default (e.g. in C<`perl -MConfig -e 'print $Config{installsitescript}'`>).
=over 4
=item dta-tokwrap.perl
Top-level wrapper script for document tokenization
using the L<DTA::TokWrap|DTA::TokWrap> perl API.
=item dtatw-add-c.perl
Script to insert C<E<lt>cE<gt>> elements and/or
C<xml:id> attributes for such elements into an
XML document which does not yet contain them.
Guaranteed not to clobber any existing C<//c/@xml:id> attributes;
missing C<//c/@xml:id> attributes are generated by a simple document-global counter
("c1", "c2", ..., "c65536").
Not really needed as of dta-tokwrap v0.38, since C<E<lt>cE<gt>> elements are no longer required.
See L<"the dtatw-add-c.perl manpage"|dtatw-add-c.perl> for more details.
=item dtatw-cids2local.perl
Script to convert C<//c/@xml:id> attributes to page-local encoding.
Never really used.
See L<"the dtatw-cids2local.perl manpage"|dtatw-cids2local.perl> for more details.
=item dtatw-add-ws.perl
Script to splice C<E<lt>sE<gt>> and C<E<lt>wE<gt>> elements encoded from a standoff (.t.xml or .u.xml) XML file
into the I<original> "base-format" (.chr.xml) file, producing a .cws.xml file.
A tad too generous with partial word segments, due to strict adjacency and boundary criteria.
In earlier versions of dta-tokwrap, this functionality was split between the scripts
C<dtatw-add-w.perl> and C<dtatw-add-s.perl>, which required only an I<id-compatible>
base-format (.chr.xml) file as the splice target. As of dta-tokwrap v0.35, the splice target
base-format file must be the I<original> source file itself, since the current implementation
uses byte offsets to perform the splice.
See L<"the dtatw-add-ws.perl manpage"|dtatw-add-ws.perl> for more details.
=item dtatw-splice.perl
Script to splice generic standoff attributes and/or content into a base file;
useful e.g. for merging flat DTA::CAB standoff analyses into TEI-structured
*.cws.xml files.
See L<"the dtatw-splice.perl manpage"|dtatw-splice.perl> for more details.
=item dtatw-get-ddc-attrs.perl
Script to insert DDC-relevant attributes extracted from a base file into a *.t.xml file,
producing a pre-DDC XML format file (by convention *.ddc.t.xml, a subset of the *.t.xml
format).
See L<"the dtatw-get-ddc-attrs.perl manpage"|dtatw-get-ddc-attrs.perl> for more details.
=item dtatw-get-header.perl
Simple script to extract a single header element from an XML file (e.g. for later
inclusion in a DDC XML format file).
See L<"the dtatw-get-header.perl manpage"|dtatw-get-header.perl> for more details.
See L<"the dtatw-get-header.perl manpage"|dtatw-get-header.perl> for more details.
=item dtatw-pn2p.perl
Script to insert C<E<lt>pE<gt>...E<lt>/pE<gt>> wrappers for C<//s/@pn> key attributes
in "flat" F<*.t.xml> files.
=item dtatw-xml2ddc.perl
Script to convert *.ddc.t.xml files and optional headers to DDC-XML format.
See L<"the dtatw-xml2ddc.perl manpage"|dtatw-xml2ddc.perl> for more details.
=item dtatw-t-check.perl
Simple script to check the consistency of tokenizer output (F<*.t>) offset and length
fields against the input (F<*.txt>) file.
=item dtatw-rm-c.perl
Script to remove C<E<lt>cE<gt>> elements from an XML document.
Regex hack, fast but not exceedingly robust, use with caution.
See also L</"dtatw-rm-c.xsl">
=item dtatw-rm-w.perl
Fast regex hack to remove C<E<lt>wE<gt>> elements from an XML document.
=item dtatw-rm-s.perl
Fast regex hack to remove C<E<lt>sE<gt>> elements from an XML document.
=item dtatw-rm-lb.perl
Script to remove C<E<lt>lbE<gt>> (line-break) elements from an XML document,
replacing them with newlines.
Regex hack, fast but not robust, use with caution.
See also L</"dtatw-rm-lb.xsl">
=item dtatw-lb-encode.perl
Encodes newlines under C<//text//text()> in an XML document as C<E<lt>lbE<gt>> (line-break) elements
using high-level file heuristics only.
Regex hack, fast but not robust, use with caution.
See also L</"dtatw-ensure-lb.perl">, L</"dtatw-add-lb.xsl">, L</"dtatw-rm-lb.perl">.
=item dtatw-ensure-lb.perl
Script to ensure that all C<//text//text()> newlines in an XML document are explicitly encoded
with C<E<lt>lbE<gt>> (line-break) elements, using optional file-, element-,
and line-level heuristics.
Robust but slow, since it actually parses XML input documents.
See also L</"dtatw-lb-encode.perl">, L</"dtatw-add-lb.xsl">, L</"dtatw-rm-lb.perl">.
=item dtatw-tt-dictapply.perl
Script to apply a type-"dictionary" in one-word-per-line (.tt) format to a
token corpus in one-word-per-line (.tt) format. Especially useful together with
standard UNIX utilities such as cut, grep, sort, and uniq (see the example following this list).
=item dtatw-cabtt2xml.perl
Script to convert DTA::CAB::Format::TT (one-word-per-line with variable analysis
fields identified by conventional prefixes) files to the expanded F<.t.xml> format used
by dta-tokwrap. The expanded format should be identical to that used by the
DTA::CAB::Format::Xml class. See also L</"dtatw-txml2tt.xsl">.
=item file-substr.perl
Script to extract a portion of a file,
specified by byte offset and length.
Useful for debugging index files created by other tools.
=back
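Since the one-word-per-line (.tt) format is plain TAB-separated text, the scripts
above combine naturally with standard UNIX tools. For instance, a hypothetical
pipeline (file names invented for illustration) for extracting a frequency list of
token types from a tokenized corpus file, assuming comment lines are marked by a
leading "%%" as in moot's native format:

  bash$ grep -v '^%%' doc1.t | cut -f1 | sort | uniq -c | sort -rn > doc1.types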
=cut
##--------------------------------------------------------------
=pod
=head2 GNU make build system template
The distribution directory F<make/> contains a "template"
for using GNU F<make> to
organize the conversion of large corpora with
the dta-tokwrap utilities. This is useful because:
=over 4
=item *
F<make>'s intuitive, easy-to-read syntax provides a
wonderful vehicle for user-defined configuration files,
obviating the need to remember the names of all 64
(at last count)
L<dta-tokwrap.perl|dta-tokwrap.perl> options,
=item *
F<make>
is very good at tracking complex dependencies of the sort
that exist between the various temporary files generated
by the dta-tokwrap utilities,
=item *
F<make>
jobs can be made "robust" simply by adding a C<-k>
(C<--keep-going>) to the command-line,
and
=item *
last but certainly not least,
F<make>
has built-in support for parallelization of complex
tasks by means of the C<-j N> (C<--jobs=N>) option,
allowing us to take advantage of multiprocessor systems.
=back
By default, the contents of the distribution F<make/>
subdirectory are installed to F</usr/local/share/dta-tokwrap/make/>.
See the comments at the top of F<make/User.mak> for instructions.
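A purely hypothetical session sketching the intended workflow (directory names are
invented for illustration; the available targets and variables are documented in the
template itself):

  bash$ cp -a /usr/local/share/dta-tokwrap/make ./mycorpus-build
  bash$ cd mycorpus-build
  bash$ $EDITOR User.mak    # point the template at your corpus, set options
  bash$ make -k -j4         # build everything: 4 parallel jobs, keep going on errors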
=cut
##--------------------------------------------------------------
=pod
=head2 Perl Modules
=over 4
=item L<DTA::TokWrap|DTA::TokWrap>
Top-level tokenization-wrapper module, used by L<dta-tokwrap.perl|dta-tokwrap.perl>.
=item L<DTA::TokWrap::Document|DTA::TokWrap::Document>
Object-oriented wrapper for documents to be processed.
=item L<DTA::TokWrap::Processor|DTA::TokWrap::Processor>
Abstract base class for elementary document-processing operations.
=back
See the L<DTA::TokWrap::Intro(3pm)|DTA::TokWrap::Intro> manpage for more details
on included modules, APIs, calling conventions, etc.
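To verify that the modules are installed and visible to your perl, a quick one-liner
suffices (assuming the top-level module defines the conventional C<$VERSION> package
variable):

  bash$ perl -MDTA::TokWrap -e 'print "$DTA::TokWrap::VERSION\n"'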
=cut
##--------------------------------------------------------------
=pod
=head2 XSL stylesheets
The XSL stylesheets included with this distribution are installed
by default in F</usr/local/share/dta-tokwrap/stylesheets>.
=over 4
=item dtatw-add-lb.xsl
Replaces newlines with C<E<lt>lb/E<gt>> elements in the input document.
=item dtatw-assign-cids.xsl
Assigns missing C<//c/@xml:id> attributes using the XSL C<generate-id()> function.
=item dtatw-rm-c.xsl
Removes C<E<lt>cE<gt>> elements from the input document.
Slow but robust.
=item dtatw-rm-lb.xsl
Replaces C<E<lt>lb/E<gt>> elements with newlines.
=item dtatw-txml2tt.xsl
Converts "master" tokenized XML output format (F<*.t.xml>) to
TAB-separated one-word-per-line format
(F<*.mr.t>
aka F<*.t>
aka F<*.tt>
aka "tt"
aka "CSV"
aka DTA::CAB::Format::TT
aka "TnT"
aka "TreeTagger"
aka "vertical"
aka "moot-native"
aka ...).
See the F<mootfiles(5)> manpage for basic format details, and
see the top of the XSL script for some influential transformation parameters.
=back
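The stylesheets can be applied with any standalone XSLT processor; for example, with
F<xsltproc> from libxslt (file names here are purely illustrative):

  bash$ xsltproc /usr/local/share/dta-tokwrap/stylesheets/dtatw-rm-c.xsl doc1.xml > doc1.noc.xml

Stylesheet parameters (e.g. those of F<dtatw-txml2tt.xsl>) can be set via F<xsltproc>'s
C<--param> and C<--stringparam> options.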
=cut
##--------------------------------------------------------------
=pod
=head2 C Programs
Several C programs are included with the distribution.
These are used by the L<dta-tokwrap.perl|dta-tokwrap.perl> script
to perform various intermediate document processing operations,
and should not need to be called by the user directly.
B<Caveat Scriptor>: The following programs are meant for
internal use by the C<DTA::TokWrap> modules only, and their
names, calling conventions, and very presence are subject to change
without notice.
=over 4
=item dtatw-mkindex
Splits input document F<doc.xml>
into
a "character index" F<doc.cx> (CSV),
a "structural index" F<doc.sx> (XML),
and a
"text index" F<doc.tx> (UTF-8 text).
=item dtatw-rm-namespaces
Removes namespaces from any XML document by
renaming "C<xmlns>" attributes to "C<xmlns_>"
and "C<xmlns:*>" attributes to "C<xmlns_*>".
Useful because XSL's namespace handling is annoyingly slow and ugly.
=item dtatw-tokenize-dummy
Dummy C<flex> tokenizer. Useful for testing.
=item dtatw-txml2sxml
Converts "master" tokenized XML output format (F<*.t.xml>) to
sentence-level stand-off XML format (F<*.s.xml>).
=item dtatw-txml2wxml
Converts "master" tokenized XML output format (F<*.t.xml>) to
token-level stand-off XML format (F<*.w.xml>).
=item dtatw-txml2axml
Converts "master" tokenized XML output format (F<*.t.xml>) to
token-analysis-level stand-off XML format (F<*.a.xml>).
=back
=cut
##======================================================================
=pod
=head1 SEE ALSO
perl(1).
=head1 AUTHOR
Bryan Jurish E<lt>[email protected]E<gt>
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2009-2018 by Bryan Jurish
This package is free software. Redistribution and modification
of C portions of this package are subject to the terms of
version 3 or greater of the GNU Lesser General Public License; see the
files COPYING and COPYING.LESSER which came with the distribution for details.
Redistribution and/or modification of the Perl portions of this package
are subject to the same terms as Perl itself, either Perl version 5.24.1 or,
at your option, any later version of Perl 5 you may have available.
=cut