[ IMPORTANT --> Be sure to read the section on requirements! ]

    This is a very ALPHA-TEST implementation of a thesaurus for GNU
Emacs.  Although it is not complete, I'm not sure when or if I'll have
the time to spiff it up.  As a result, I'm posting what I have here (is
anyone else working on something similar?).  It's copyrighted and is
being released under the GNU General Public License (see the end of this
file for more details).  Note that only this interface falls under the
GNU General Public License; the thesaurus itself has a completely
separate and independent "copyright".

    The Emacs-Lisp functions in this package allow you to query a
thesaurus for synonyms of a word.  For example, you can ask Emacs to
quickly display a thesaurus entry for "editor":

-------------------------------------------------------------------------------
***** Word: editor

    #593. Book. -- N. booklet; writing, work, volume, tome, opuscule;
tract, tractate; livret; brochure, libretto, handbook, codex, manual,
pamphjlet, enchiridion, circular, publication; chap book.
    part, issue, number livraison; album, portfolio; periodical, serial,
magazine, ephemeris, annual, journal.
    paper, bill, sheet, broadsheet; leaf, leaflet; fly leaf, page; quire,
ream
    chapter, section head, article paragraph, passage, clause.
    folio, quarto, octavo; duodecimo, sextodecimo, octodecimo.
    encyclopedia; encompilation;  library, bibliotheca; press &c.
(publication) 531.
    writer, author, litterateur, essayist, journalism; pen, scribbler, the
scribbling race; literary hack, Grub-street writer; writerr for the press,
gentleman of the press, representative of the press; adjective jerker,
diaskeaus, ghost, hack writer, ink slinger; publicist; reporter, penny a
liner; editor, subeditor; playwright &c. 599; powt &c. 597.
    bookseller, publisher; bibliopole, bibliopolist; librarian; bookstore,
bookseller's shop.
    knowledge of books, bibliography; book learning &c. (knowledge) 490.
    Phr. "among the giant fossils of my past" [E. B. Browning]; craignez
tout d'un auteur en courroux; "for authors nobler palms remain" [Pope]; "I
lived to write and wrote to live" [Rogers]; "look in thy heart and write"
[Sidney]; "there is no Past so long as Books shall live" [Bulwer Lytton);
"the public mind is the creation of the Master-Writers" [Disraeli]; volumes
that I prize above my dukedom" [Tempest].
-------------------------------------------------------------------------------


*******************************************************************************
***** REQUIREMENTS:

    To use this, you need the following (besides the files that came
with this README file):

* A copy of the thesaurus itself (which is not included with this README
 file).  Thanks to Project Gutenberg, a copy of the 1911 Roget's
 Thesaurus has been made available via anonymous ftp from
 mrcnext.cso.uiuc.edu [ 128.174.201.12 ] (please ftp the file during
 off-hours -- at times OTHER THAN 10:00 AM to 6:00 PM Central Standard
 Time (Daylight in summer)).  It's in the directory "/etext":

       -rw-r--r--  1 24       micro    1377400 Jun 19 18:08 roget11.txt
       -rw-r--r--  1 24       micro     592247 Jun 19 18:13 roget11.zip

 You only need one of these, as roget11.zip is roget11.txt in a .ZIP
 file.  Note the sizes, however: roget11.txt is about 1.3MB, and
 roget11.zip is less than half that.

* A copy of Perl 4.0, compiled with dbm/ndbm support, as the thesaurus
 indexing and low-level access routines are written as Perl scripts
 (this was done to avoid having to load the entire 1.3MB thesaurus into
 Emacs, bloating its process size).  Part of the index is stored as a
 dbm database, and so dbm/ndbm support must be compiled into Perl.

* While building the index (an index must be built from the raw
 thesaurus data), it is recommended that your system have plenty of
 free RAM and swap space, as a single 10-12 megabyte process is created
 during the indexing process.  Once the index is created, you need far
 fewer resources to access the thesaurus.

* You need about two megabytes of free disk space.  The thesaurus
 occupies about 1.3MB, and the index files occupy another half
 megabyte or so.


    Installation instructions are given below.


*******************************************************************************
***** USAGE:

    The GNU Emacs interface provides three functions:

       thesaurus-lookup-word
            This function will prompt for a word to look up, and all entries
            that begin with this word will be displayed.  To display only
            the entry that contains exactly this word, specify a prefix
            argument (e.g., C-u).

       thesaurus-lookup-word-in-text
            This function will extract the word under the cursor and run
            `thesaurus-lookup-word' upon it.  A prefix argument can be
            specified to force the display of only the entry that contains
            this word.

       thesaurus-show-words
            This function will prompt for a word and will display all words
            in the thesaurus that begin with this word.

These functions should be bound to some key sequences; however, this
package does not do this.  You'll have to do it yourself (e.g., with
`global-set-key' in your .emacs file).

    There is also a shell-command-line interface to the thesaurus
(which is what the GNU Emacs interface uses).  Using the "th" Perl
script, you can query the thesaurus for a number of things:

       th <word> [<word> ...]
               Search the thesaurus for all entries that begin with
               "<word>".  Multiple words can be specified here.

       th -V <word> [<word> ...]
               Search the thesaurus for all entries that begin with
               "<word>".  All displayed entries are separated by a line
               of dashes.

       th -W <word> [<word> ...]
               Search the thesaurus for the entry that contains
               "<word>" exactly.

       th -w <word> [<word> ...]
               Display all words in the thesaurus that begin with
               "<word>".

       th -w -v <word> [<word> ...]
               Display all words in the thesaurus that begin with
               "<word>".  Alongside each word, the numbers of the
               entries that contain the word are displayed.

       th -n <number>
               Display thesaurus entry number "<number>".  Unlike a
               word, only one number can be specified.

Generally, you will want to pipe the output to more(1) or less(1).
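
    As a rough sketch of the advice above, a small shell function can
run "th -V" and send its output through a pager.  The name "thl" and the
fallback to more(1) when $PAGER is unset are assumptions made for this
example; they are not part of the package.

```shell
# Hypothetical helper (not part of this package): look up one or more
# words with "th -V" and page the result.
# Uses $PAGER if it is set, and falls back to more(1) otherwise.
thl() {
    th -V "$@" | ${PAGER:-more}
}
```

With this defined (and "th" on your $PATH), "thl editor" pages through
every entry that begins with "editor".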



*******************************************************************************
***** PROBLEMS:

    Error handling needs work.  Nothing is output if a word is not
found in the thesaurus.

    The scripts are simple-minded, and occasionally "screw-up"
(fortunately, this seems to be rare).

    There are typos in the thesaurus, which can cause the scripts to
mis-index very small parts of the thesaurus.

    The scripts used to build the indices are inefficient and are
unbelievably poorly written.  Fortunately, this doesn't really matter,
as the index creation process is a one-time task.  Looking up words
in the thesaurus is quite fast.

    The thesaurus is stored in an uncompressed form.  I thought about
breaking the thesaurus apart and storing each entry as a separate,
compressed, file, but this method loses some information in the
thesaurus (which cannot currently be accessed by these routines).  It
might be interesting to try compiling Perl with GNU dbm and storing the
entire thesaurus as a monolithic gdbm database (one entry per datum).


*******************************************************************************
***** INSTALLATION INSTRUCTIONS:

    The following assumes that you are familiar with Emacs and that you
have installed a copy of Perl 4.0.

1. Create a directory to hold all of the files.

2. Copy the files that came with this README file into that directory.

3. Copy the thesaurus into that directory.

4. cd to that directory.

5. Link the thesaurus to the name "roget.txt".  For example, if the
  copied thesaurus is called "roget11.txt", you can use a symbolic link
  (if your system supports symbolic links):

       ln -s roget11.txt roget.txt

  or you can use hard links:

       ln roget11.txt roget.txt

6. Run the script "makeindex".  This script runs the other scripts to
  build the index files.  On most modern machines with adequate
  resources, it'll take about 5-10 minutes to run (less, on fast
  machines, and more, on slow machines).  Note that a single 10-12
  megabyte process is created during the indexing procedures, so be
  sure that you have plenty of free RAM (otherwise, you'll go into swap
  h*ll, and this procedure could take hours).  Once everything is done,
  three files will be created:

       -rw-r--r--   1 root     other       4096 Dec 17 20:50 offsets.dir
       -rw-r--r--   1 root     other      16384 Dec 17 20:50 offsets.pag
       -rw-r--r--   1 root     other     492733 Dec 17 20:49 word-index

  (The file sizes may be different on your machine.)

  The files that begin with "offset" comprise a dbm/ndbm database of
  thesaurus file byte offsets versus entry number (i.e., for a given
  entry number, the corresponding file byte offset of the beginning of
  that entry is stored).  This means that, if the thesaurus file is
  ever edited or changed, you MUST re-execute the "makeindex" script to
  rebuild the indices.

  The file "word-index" is an ASCII text file of words and entry
  numbers, e.g.:

       creditor: 805
       creed: 484
       creek: 198 343 348

  In this example, the word "creditor" is mentioned in entry #805,
  "creed" is mentioned in entry #484, and "creek" is mentioned in
  entries #198, #343, and #348.

7. Edit the file "th", and change the line (around line 69):

       $thesaurus_dir = "/usr/local/lib/roget";

  Change "/usr/local/lib/roget" to point to the directory containing
  the thesaurus and index files.

8. Add this directory to your $PATH.  If you don't, you won't be able to
  run the "th" command (Emacs needs this).

9. Edit your .emacs file and add this directory to your load-path.  Also
  add a line like the following:

       (load-library "thesaurus")

10. That's it.
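
    The "word-index" format described in step 6 (a word, a colon, and
then a space-separated list of entry numbers) is simple enough to search
with standard tools.  The sketch below only illustrates the format; it
is not how "th" itself does its lookups.  The function names are made up
for this example, and the file is assumed to follow the "word: n n n"
layout exactly.

```shell
# Hypothetical helpers for inspecting the "word-index" file directly.

# Usage: entries_for <word> <index-file>
# Prints the entry numbers listed for exactly <word>.
entries_for() {
    awk -F': ' -v w="$1" '$1 == w { print $2 }' "$2"
}

# Usage: words_starting <prefix> <index-file>
# Prints every indexed word that begins with <prefix>.
words_starting() {
    awk -F': ' -v p="$1" 'index($1, p) == 1 { print $1 }' "$2"
}
```

With the three sample lines from step 6 saved in a file named
"word-index", "entries_for creek word-index" prints "198 343 348", and
"words_starting cree word-index" prints "creed" and "creek" on separate
lines.  The numbers can then be passed to "th -n", one at a time.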


*******************************************************************************
***** Legal foo:

-------------------------------------------------------------------------------
These thesaurus indexing and accessing routines are copyrighted.
Copyright (C) 1991 Darryl Okahata ([email protected])

NOTE THAT THE THESAURUS ITSELF HAS A COMPLETELY SEPARATE AND INDEPENDENT
"COPYRIGHT".  SEE THE THESAURUS FOR DETAILS.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 1, or (at your option)
any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-------------------------------------------------------------------------------

    -- Darryl Okahata
       Internet: [email protected]

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion or policy of Hewlett-Packard or of the
little green men that have been following him all day.