README for Array::Suffix
=========================

Array::Suffix is a Perl module that determines variable-length ngrams
from large corpora using suffix arrays.

The module provides an easy-to-use interface for determining ngrams
from a corpus. Its basic functionality includes:

  *  returning variable-length ngrams
  *  allowing for a stop list
  *  allowing for a frequency cutoff
  *  allowing for a remove cutoff

REQUIREMENTS
===================================

This module REQUIRES that the following software be downloaded
and installed.

--Programming Languages
Perl (version 5.8.5 or better)

INSTALLATION
=========================

There are multiple ways to install this package.

1. You can use CPAN.pm to install Array::Suffix.

  To install type the following:

       perl -MCPAN -e 'install Array::Suffix'

2. Or you can install this yourself.

  To install this module type the following:

     perl Makefile.PL
     make
     make test
     make install

PROGRAM
=========================

array-suffix-driver.pl

 This program takes as input a flat ASCII text file and outputs all
 Ngrams, or token sequences of length 'n', where the value of 'n'
 can be decided by the user, and the frequency of the ngram.

 Using array-suffix-driver.pl

       The most basic way of running this program is the following:

           % array-suffix-driver.pl output.txt input.txt

           where input.txt is the input text file in which to find
           the Ngrams and output.txt is the output file into which
           array-suffix-driver.pl will put all the Ngrams with their
           frequencies.

 Changing the Length of Ngrams

       The default ngram size is 2. This can be changed by using
       the option --ngram N, where N is the number of tokens in
       each ngram. For example, to find all the trigrams in the
       file input.txt, you would run the program as follows:

            % array-suffix-driver.pl --ngram 3 output.txt input.txt

 Using User-Provided Token Definitions:

       The default token definitions are:

       \w+         -> this matches a contiguous sequence of
                      alpha-numeric characters

       [\.,;:\?!]  -> this matches a single punctuation mark

       The default token definitions can be over-ridden by using
       the option:

            --token FILE

       where FILE is the name of the file containing the regular
       expressions on which the token definitions will be based.

       Each regular expression in this FILE should be:
             1. on a line of its own
             2. delimited by forward slashes '/'
             3. a valid Perl regular expression
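
       As a rough illustration of what the default token definitions
       match, the sketch below applies them to a short string. The
       helper tokenize() is hypothetical and not part of Array::Suffix,
       and trying \w+ before the punctuation class is an assumption
       about how the definitions are ordered.

```perl
use strict;
use warnings;

# The two default token definitions from this README, combined into one
# alternation. Trying \w+ first is an assumption made for this sketch.
my $token_re = qr/\w+|[\.,;:\?!]/;

# tokenize() is a hypothetical helper, not part of Array::Suffix.
sub tokenize {
    my ($text) = @_;
    my @tokens = $text =~ /($token_re)/g;   # every match, left to right
    return @tokens;
}

print join(' ', tokenize("first line, of text!")), "\n";
# prints: first line , of text !
```

       Characters that match neither definition, such as spaces, are
       simply skipped in this sketch.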

 Removing character strings

       This option

            --nontoken FILE

       allows a user to define regular expressions that
       will match strings that should not be considered as tokens.
       These strings will be removed from the data and not counted
       or included in Ngrams.

       The --nontoken option is recommended when there are predictable
       sequences of characters that you know should not be included as
       tokens for purposes of counting Ngrams, finding collocations, etc.

       For example, if mark-up symbols like <s>, <p>, [item], [/ptr]
       exist in text being processed, you may want to include those
       in your list of nontoken items so they are discarded. If not,
       a simple regex such as /\w+/ will match with 's', 'p', 'item',
       'ptr' from these tags, leading to confusing results.

       The FILE following the --nontoken option should contain Perl
       regular expressions delimited by forward slashes '/' that define
       non-tokens. Multiple expressions may be placed on separate lines
       or be separated via the '|' (Perl 'or') as in /regex1|regex2|../

       The following are some of the examples of valid non-token
       definitions:

               /<\/?(s|p)>/ : will remove xml tags like <s>, <p>, </s>, </p>.

               /\[\w+\]/  : will remove all words which appear in square
                            brackets like [p], [item], [123] and so on.

       The program will first remove any string from the input data that
       matches the non-token regular expression, and only then will match
       the remaining data against the token definitions.
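
       The order of operations described above (strip non-tokens first,
       then tokenize) can be sketched as follows. The helper
       tokens_without_markup() and the two non-token patterns are
       hypothetical examples, and deleting each match outright (rather
       than replacing it with a space) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical non-token patterns in the style described in this README.
my @nontoken_res = (qr/<\/?[sp]>/, qr/\[\/?\w+\]/);
my $token_re     = qr/\w+|[\.,;:\?!]/;   # default token definitions

# Hypothetical helper: delete non-token matches, then tokenize the rest.
sub tokens_without_markup {
    my ($text) = @_;
    for my $re (@nontoken_res) {
        $text =~ s/$re//g;
    }
    my @tokens = $text =~ /($token_re)/g;
    return @tokens;
}

my $marked_up = "<s> first [item] line </s>";

# With non-token removal: only the real words survive.
print join(' ', tokens_without_markup($marked_up)), "\n";   # first line

# Without it, \w+ picks up the tag names as tokens.
print join(' ', $marked_up =~ /($token_re)/g), "\n";   # s first item line s
```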

 The Output Format

       Assume that the following are the contents of the input text file to
       array-suffix-driver.pl; let us call the file test.txt:

               first line of text
               second line
               and a third line of text

        Assume that array-suffix-driver.pl is run in its most basic
        mode, with no options:

                % array-suffix-driver.pl test.out test.txt

        The output will contain all the bigrams found in the file test.txt
        using the default tokens as specified above. The contents of the
        output file test.out would be:

               11
               line<>of<>2
               of<>text<>2
               second<>line<>1
               line<>and<>1
               and<>a<>1
               a<>third<>1
               first<>line<>1
               third<>line<>1
               text<>second<>1

        The number on the first line, 11, indicates that there were
        11 bigrams in test.txt.

        Following are the bigrams that were found in the test.txt file,
        with tokens delimited by the diamond sign, "<>". Therefore the
        first bigram is line<>of<>, made up of the tokens "line" and
        "of" in that order. After the diamond following the last token
        there is a number; this number denotes how many times this
        bigram occurred in the text.
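
        The counts in this example can be reproduced with a small
        sketch. It uses a plain hash rather than the suffix array that
        the module is built on, count_ngrams() is a hypothetical
        helper, and the alphabetical ordering of equally frequent
        bigrams is an assumption, so the order of the tied lines may
        differ from the listing above.

```perl
use strict;
use warnings;

# count_ngrams() is a hypothetical helper, not part of Array::Suffix.
# It slides a window of $n tokens over the token list and counts each
# window, writing it in the token<>token<> format shown in this README.
sub count_ngrams {
    my ($n, @tokens) = @_;
    my %freq;
    for my $i (0 .. $#tokens - $n + 1) {
        my $ngram = join('<>', @tokens[$i .. $i + $n - 1]) . '<>';
        $freq{$ngram}++;
    }
    return %freq;
}

my $text   = "first line of text\nsecond line\nand a third line of text\n";
my @tokens = $text =~ /(\w+|[\.,;:\?!])/g;   # default token definitions
my %freq   = count_ngrams(2, @tokens);

# Total number of bigram occurrences (first line of the output).
my $total = 0;
$total += $_ for values %freq;
print "$total\n";

# Most frequent first; ties broken alphabetically (an assumption).
for my $ngram (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
    print "$ngram$freq{$ngram}\n";
}
```

        Note that the window runs across line breaks, which is why
        text<>second<> appears in the output.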

 The Marginals Option

        To obtain a partial set of marginal counts for the bigram,
        the option:

            --marginals

        must be set. This option outputs the individual frequency counts
        of each token in the ngram. Let us use our example from above
        but run the array-suffix-driver.pl program as follows:

                % array-suffix-driver.pl --marginals test.out test.txt

        The output will contain all the bigrams found in the file test.txt
        using the default tokens as specified above, their frequency
        counts and the number of times each of the tokens in the bigram
        occurred in their respective positions. The contents of the
        output file test.out would be:

               11
               line<>of<>2 3 2
               of<>text<>2 2 2
               second<>line<>1 1 3
               line<>and<>1 3 1
               and<>a<>1 1 1
               a<>third<>1 1 1
               first<>line<>1 1 3
               third<>line<>1 1 3
               text<>second<>1 1 1

         The first number after the bigram is the frequency of the bigram
         in test.txt. The second number is the number of times the first
         token was seen in the first position of all the bigrams, and the
         third number is the number of times the second token was seen in
         the second position of all the bigrams.
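
         A minimal sketch of how these three numbers relate, using the
         tokens of test.txt. The hashes and the helper marginal_line()
         are hypothetical names for illustration and say nothing about
         how the module computes the counts internally.

```perl
use strict;
use warnings;

my @tokens = qw(first line of text second line and a third line of text);

# %bigram: frequency of each bigram
# %left:   times a token occurs in the first position of any bigram
# %right:  times a token occurs in the second position of any bigram
my (%bigram, %left, %right);
for my $i (0 .. $#tokens - 1) {
    my ($w1, $w2) = @tokens[$i, $i + 1];
    $bigram{"$w1<>$w2<>"}++;
    $left{$w1}++;
    $right{$w2}++;
}

# marginal_line() is a hypothetical helper that formats one output line.
sub marginal_line {
    my ($w1, $w2) = @_;
    my $key = "$w1<>$w2<>";
    return "$key$bigram{$key} $left{$w1} $right{$w2}";
}

print marginal_line('line', 'of'), "\n";       # line<>of<>2 3 2
print marginal_line('second', 'line'), "\n";   # second<>line<>1 1 3
```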


 Stoplists

         The user may "stop" the Ngrams formed by array-suffix-driver.pl
         by providing a list of stop-tokens through the option:

             --stop FILE

         Each stop token in FILE should be a Perl regular expression that
         occurs on a line by itself. This expression should be delimited
         by forward slashes, as in /REGEX/. All regular expression
         capabilities in Perl are supported except for regular expression
         modifiers (like the "i" in /REGEX/i).

         The following are a few examples of valid entries in the stop list.

               /^\d+$/
               /\bthe\b/
               /\b[Tt][Hh][Ee]\b/
               /^and$/
               /\bor\b/
               /^be(ing)?$/

         There are two modes in which a stop list can be used,
         AND and OR. The default mode is AND, which means that
         an Ngram must be made up entirely of words from the
         stoplist before it is eliminated. The OR mode eliminates
         an Ngram if any of the words that make up the Ngram
         are found in the stoplist.
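
         The difference between the two modes can be sketched as
         follows. The stop list, is_stop_token() and is_stopped() are
         hypothetical and only illustrate the AND/OR rule stated above;
         how the mode is selected on the command line is not covered
         here.

```perl
use strict;
use warnings;

# A small hypothetical stop list, in the style of the entries above.
my @stop_res = (qr/^the$/, qr/^of$/, qr/^and$/);

# True if the token matches at least one stop expression.
sub is_stop_token {
    my ($token) = @_;
    my $hits = grep { $token =~ $_ } @stop_res;
    return $hits > 0;
}

# AND: eliminate only if every token in the ngram is a stop token.
# OR:  eliminate if at least one token in the ngram is a stop token.
sub is_stopped {
    my ($mode, @ngram) = @_;
    my $stopped = grep { is_stop_token($_) } @ngram;
    return $mode eq 'OR' ? $stopped > 0 : $stopped == @ngram;
}

for my $ngram ([qw(of the)], [qw(line of)], [qw(first line)]) {
    for my $mode (qw(AND OR)) {
        my $verdict = is_stopped($mode, @$ngram) ? 'eliminated' : 'kept';
        print "$mode @$ngram: $verdict\n";
    }
}
```

         In this sketch "of the" is eliminated in both modes, "line of"
         only in OR mode, and "first line" in neither.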



 Removing Low Frequency Ngrams:

          The user can either remove low frequency Ngrams or simply
          not display them. To remove low frequency Ngrams, use the
          option:

               --remove N

          by which all Ngrams that occur fewer than N times are
          removed. The Ngram and the individual frequency counts are
          adjusted accordingly upon the removal of these Ngrams.

          To leave low frequency Ngrams out of the output, use the
          option:

               --frequency N

          by which Ngrams that occur fewer than N times are not
          displayed in the output. Note that this differs from the
          remove option above in that the frequency counts are not
          changed.
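
          The distinction can be illustrated with a toy frequency
          table. The helpers apply_remove(), displayable() and total()
          are hypothetical, and the sketch only models the total count,
          not the per-token marginals.

```perl
use strict;
use warnings;

my %freq = ('line<>of<>' => 2, 'of<>text<>' => 2, 'and<>a<>' => 1);

# Sum of all ngram frequencies in a table.
sub total {
    my (%counts) = @_;
    my $sum = 0;
    $sum += $_ for values %counts;
    return $sum;
}

# Models --remove N: ngrams below the cutoff are deleted from the
# table, so any count derived from the table changes too.
sub apply_remove {
    my ($n, %counts) = @_;
    delete $counts{$_} for grep { $counts{$_} < $n } keys %counts;
    return %counts;
}

# Models --frequency N: the table is untouched; the cutoff only
# decides which ngrams are shown.
sub displayable {
    my ($n, %counts) = @_;
    my @shown = grep { $counts{$_} >= $n } keys %counts;
    @shown = sort @shown;
    return @shown;
}

my %removed = apply_remove(2, %freq);
my @shown   = displayable(2, %freq);

print "after remove 2: total ", total(%removed), "\n";      # total 4
print "with frequency 2: total ", total(%freq),
      ", shown ", scalar(@shown), "\n";                     # total 5, shown 2
```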


COPYRIGHT AND LICENCE
=========================

Copyright (C) 2004-2007, Bridget T. McInnes

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.

Note: a copy of the GNU Free Documentation License is available
on the web at http://www.gnu.org/copyleft/fdl.html and is
included in this distribution as FDL.txt.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.