WAIT 1.6

                 Copyright (c) 1996, Ulrich Pfeifer

------------------------------------------------------------------------
   This program is free software; you can redistribute it and/or
   modify it under the same terms than Perl itself.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
------------------------------------------------------------------------

This software is not actively maintained by it's author.

For more two years now I tried to steal some time to clean this up
without any luck. So I decided to pass the baton on. I consider the
input part pretty satisfying. The query part - despite being operable
and useful - needs a major overhaul. To provide a forum for further
discussions an to coordinate further developement, I did setup a
mailinglist.  Drop me a line if you want to participate.

Ulrich Pfeifer <[email protected]>

------------------------------------------------------------------------
NAME
   WAIT - a rewrite of the freeWAIS-sf engine in Perl

Status of this document
   I started writing down some information about the implementation
   before I forget them in my spare time. The stuff is incomplete
   at least. Any additions, corrections, ... welcome.

PURPOSE
   As you might know, I developed and maintained freeWAIS-sf (with
   the help of many people in The Net). FreeWAIS-sf is based on
   freeWAIS maintained by the Clearing House for Network
   Information Retrieval (CNIDR) which in turn is based on wais-8-
   b5 implemented by Thinking Machine et al. During this long
   history - implementation started about 1989 - many people
   contributed to the distribution and added features not foreseen
   by the original design. While the system fulfills its task now,
   the code has reached a state where adding new features is nearly
   impossible and even fixing longstanding bugs and removing
   limitations has become a very time consuming task.

   Therefore I decided to pass the maintenance to WSC Inc. and
   built a new system from scratch. For obvious reasons I choosed
   Perl as implementation language.

DESCRIPTION
   The central idea of the system is to provide a framework and the
   building blocks for any indexing and search system the users
   might want to build. Obviously the framework limits the class of
   system which can be build.

          +------+     +-----+     +------+
      ==> |Access| ==> |Parse| ==> |      |
          +------+     +-----+     |      |
                          ||       |      |     +-----+
                          ||       |Filter| ==> |Index|
                          \/       |      |     +-----+
         +-------+     +-----+     |      |
      <= |Display| <== |Query| <-> |      |
         +-------+     +-----+     +------+

   A collection (aka table) is defined by the instances of the
   access and parse module together with the filter definitions. At
   query time in addition a query and a display module must be
   choosen.

 Access

   The access module defines which documents where members of a
   database. Usually an access module is a tied hash, whose keys
   are the Ids of the documents (did = document id) and whose
   values are the documents themselves. The indexing process loops
   over the keys using `FIRSTKEY' and `NEXTKEY'. Documents are
   retrieved with `FETCH'.

   By convention access modules should be members of the
   `WAIT::Document' hierarchy. Have a look at the
   `WAIT::Document::Split' module to get the idea.

 Parse

   The task parse module is to split the documents into logical
   parts via the `split' method. E.g. the `WAIT::Parse::Nroff'
   splits manuals piped through nroff(1) into the sections *name*,
   *synopsis*, *options*, *description*, *author*, *example*,
   *bugs*, *text*, *see*, and *environment*. Here is the
   implementation of `WAIT::Parse::Base' which handes documents
   with a pretty simple tagged format:

     AU: Pfeifer, U.; Fuhr, N.; Huynh, T.
     TI: Searching Structured Documents with the Enhanced Retrieval
         Functionality of freeWAIS-sf and SFgate
     ER: D. Kroemker
     BT: Computer Networks and ISDN Systems; Proceedings of the third
         International World-Wide Web Conference
     PN: Elsevier
     PA: Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
     PP: 1027-1036
     PY: 1995

     sub split {                     # called as method
       my %result;
       my $fld;

       for (split /\n/, $_[1]) {
         if (s/^(\S+):\s*//) {
           $fld = lc $1;
         }
         $result{$fld} .= $_ if defined $fld;
       }
       return \%result;
     }

   Since the original document cannot be reconstructed from its
   attributes, we need a second method (*tag*) which marks the
   regions of the document with tags for the different attributes.
   This tagged form is used by the display module to hilight search
   terms in the documents. Besides the tags for the attributes, the
   method might assign the special tags `_b' and `_i' for
   indicating bold and italic regions.

     sub tag {
       my @result;
       my $tag;

       for (split /\n/, $_[1]) {
         next if /^\w\w:\s*$/;
         if (s/^(\S+)://) {
           push @result, {_b => 1}, "$1:";
           $tag = lc $1;
         }
         if (defined $tag) {
           push @result, {$tag => 1}, "$_\n";
         } else {
           push @result, {}, "$_\n";
         }
       }
       return @result;               # we don't go for speed
     }

   Obviously one could implement `split' via `tag'. The reason for
   having two functions is speed. We need to call `split' for each
   document when indexing a collection. Therefore speed is
   essential. On the other hand, `tag' is called in order to
   display a single document and may be a little slower. It may
   care about tagging bold and italic regions. See
   `WAIT::Parse::Nroff' how this might decrease performance.

 Filter definition

   From the Information Retrieval perspective, the hardest part of
   the system is the filter module. The database administrator
   defines for each attribute, how the contents should be processed
   before it is stored in the index. Usually the processing
   contains steps to restrict the character set, case
   transformation, splitting to words and transforming to word
   stems. In WAIT these steps are defined naturally as a pipeline
   of processing steps. The pipelines are made up by functions in
   the package WAIT::Filter which is pre-populated by the most
   common functions but may be extended any time.

   The equivalent for a typical freeWAIS-sf processing would be
   this pipeline:

           [ 'isotr', 'isolc', 'split2', 'stop', 'Stem']

   The function `isotr' replaces unknown characters by blanks.
   `isolc' transforms to lower case. `split2' splits into words and
   removes words shorter than two characters. `stop' removes the
   freeWAIS-sf stopwords and `Stem' applies the Porter algorithm
   for computing the stem of the words.

   The filter definition for a collection defines a set of piplines
   for the attributes and modifies the pipelines which should be
   used for prefix and interval searches.

   Here is a complete example:

     my $stem  = [{
                   'prefix'    => ['unroff', 'isotr', 'isolc'],
                   'intervall' => ['unroff', 'isotr', 'isolc'],
                  },'unroff', 'isotr', 'isolc', 'split2', 'stop', 'Stem'];
     my $text  = [{
                   'prefix'    => ['unroff', 'isotr', 'isolc'],
                   'intervall' => ['unroff', 'isotr', 'isolc'],
                  },
                   'unroff', 'isotr', 'isolc', 'split2', 'stop'];
     my $sound = ['unroff', 'isotr', 'isolc', 'split2', 'Soundex'];

     my $spec  = [
         'name'         => $stem,
         'synopsis'     => $stem,
         'bugs'         => $stem,
         'description'  => $stem,
         'text'         => $stem,
         'environment'  => $text,
         'example'      => $text,  'example' => $stem,
         'author'       => $sound, 'author'  => $stem,
        ]