README for Array::Suffix
=========================
Array::Suffix is a perl module to determine variable length ngrams
from large corpora using the data structure suffix arrays.
The module:
1. Provides an easy to use interface to determine ngrams from a
corpus. Some of the basic functionality include:
* returns variable length ngrams
* allow for a stop list
* allows for a frequency cutoff
* allows for a remove cutoff
REQUIREMENTS
===================================
This module REQUIRES that the following software be download
and installed.
--Programming Languages
Perl (version 5.8.5 or better)
INSTALLATION
=========================
There are multiple ways to install this package.
1. You can use CPAN.pm to install Array::Suffix.
To install type the following:
perl -MCPAN -e 'install Array-Suffix'
2. Or you can install this yourself.
To install this module type the following:
perl Makefile.PL
make
make test
make install
PROGRAM :
=========================
array-suffix-driver.pl
This program takes as input a flat ASCII text file and outputs all
Ngrams, or token sequences of length 'n', where the value of 'n'
can be decided by the user, and the frequency of the ngram.
Using array-suffix-driver.pl
The most basic way of running this program is the following:
% array-suffix-driver.pl output.txt input.txt
where input.txt is the input text file in which to find
the Ngrams and output.txt is the output file into which
count.pl will put all the Ngrams with their frequencies.
Changing the Length of Ngrams
The default ngram size is 2. This can be changed by using
the parameter option --ngram N, where N is the number of
tokens in each ngram. For example, to find all the trigrams
in the file input.txt, you would running program:
%count.pl --ngram 3 output.txt input.txt
Using User-Provided Token Definitions:
The default token definitions are:
\w+ -> this matches a contiguous sequence of
alpha-numeric characters
[\.,;:\?!] -> this matches a single punctuation mark
The default token definitions can be over-ridden by using
the option:
--token FILE
where FILE is the name of the file containing the regular
expressions on which the token definitions will be based.
Each regular expression in this FILE should be:
1. on a line of its own
2. should be delimited by the forward slash '/'.
3. should be valid Perl regular expressions
Removing character strings
This option
--nontoken FILE
allows a user to define regular expressions that
will match strings that should not be considered as tokens.
These strings will be removed from the data and not counted
or included in Ngrams.
The --nontoken option is recommended when there are predictable
sequences of characters that you know should not be included as
tokens for purposes of counting Ngrams, finding collocations, etc.
For example, if mark-up symbols like <s>, <p>, [item], [/ptr]
exist in text being processed, you may want to include those
in your list of nontoken items so they are discarded. If not,
a simple regex such as /\w+/ will match with 's', 'p', 'item',
'ptr' from these tags, leading to confusing results.
The FILE following the nontoken option file should contain Perl
regular expressions delimited by forward slashes '/' that define
non-tokens. Multiple expressions may be placed on separate lines
or be separated via the '|' (Perl 'or') as in /regex1|regex2|../
The following are some of the examples of valid non-token
definitions:
/<\/?s|p>/ : will remove xml tags like <s>, <p>, </s>, </p>.
/\[\w+\]/ : will remove all words which appear in square
brackets like [p], [item], [123] and so on.
The program will first remove any string from the input data that
matches the non-token regular expression, and only then will match
the remaining data against the token definitions.
The Output Format
Assume that the following are the contents of the input text file to
array-suffix-driver.pl; let us call the file test.txt:
first line of text
second line
and a third line of text
Assume that array-suffix-driver.pl is run in its most general
mode:
% array-suffix-driver.pl test.out test.txt
The output will contain all the bigrams found in the file test.txt
using the default tokens as specified above. The contents of the
output file test.out would be:
11
line<>of<>2
of<>text<>2
second<>line<>1
line<>and<>1
and<>a<>1
a<>third<>1
first<>line<>1
third<>line<>1
text<>second<>1
The number on the first line, 11, indicates that there were
11 bigrams in test.txt
Following are the bigrams that were found in the test.txt file
delimited by the diamond sign, "<>". Therefore the first bigram
is line<>of<>, make up of the tokens "line" and "of" in that
order. After the diamond following the last token there is a
number, this number denotes how many times this bigram occurred
in the text.
The Marginals Option
To obtain the a partial set of marginal counts for the bigram
the option:
--marginals
must be set. This option outputs the individual frequency counts
of each token in the ngram. Let us use our example from above
but run the array-suffix-driver.pl program as follows:
% array-suffix-driver.pl --marginals test.out test.txt
The output will contain all the bigrams found i the file test.txt
using the default tokens as specified above, their frequency
counts and the number of times each of the tokens in the bigram
occurred in their respective positions. The contents of the
output file test.out would be:
11
line<>of<>2 3 2
of<>text<>2 2 2
second<>line<>1 1 3
line<>and<>1 3 1
and<>a<>1 1 1
a<>third<>1 1 1
first<>line<>1 1 3
third<>line<>1 1 3
text<>second<>1 1 1
The first number after the bigram is the frequency of the bigram
seen in test.out. The second number after the bigram is the
number of times the first token was seen in the first position
of all the bigrams and the second number is the number of times
the second token was seen in the second position of all the
bigrams.
Stoplists
The user may "stop" the Ngrams formed by array-suffix-driver.pl
by providing a list of stop-tokens through the option:
--stop FILE.
Each stop token in FILE should be a Perl regular expression that
occurs on a line by itself. This expression should be delimited
by forward slashes, as in /REGEX/. All regular expression
capabilities in Perl are supported except for regular expression
modifiers (like the "i" /REGEX/i).
The following are a few examples of valid entries in the stop list.
/^\d+$/
/\bthe\b/
/\b[Tt][Hh][Ee]\b/
/^and$/
/\bor\b/
/^be(ing)?$/
There are two modes in which a stop list can be used,
AND and OR. The default mode is AND, which means that
an Ngram must be made up entirely of words from the
stoplist before it is eliminated. The OR mode eliminates
an Ngram if any of the words that make up the Ngram
are found in the stoplist.
Removing Low Frequency Ngrams:
We allow the user to either remove or to not display low
frequency Ngrams. The user can remove low frequency Ngrams
by using the option :
--remove N
by which all Ngrams that occur less than n times are
removed. The Ngram and the individual frequency counts are
adjusted accordingly upon the removal of these Ngrams.
The user can choose not to display low frequency Ngrams by
using the option :
--frequency N,
by which Ngrams that occur less than n times are not
displayed in the output. Note that this differs from the
remove option above in that the frequency counts are not
changed.
COPYRIGHT AND LICENCE
=========================
Copyright (C) 2004-2007, Bridget T. McInnes
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.
Note: a copy of the GNU Free Documentation License is available
on the web at L<
http://www.gnu.org/copyleft/fdl.html> and is
included in this distribution as FDL.txt.
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.