[ IMPORTANT --> Be sure to read the section on requirements! ]
This is a very ALPHA-TEST implementation of a thesaurus for GNU
Emacs. Although it is not complete, I'm not sure when or if I'll have
the time to spiff it up. As a result, I'm posting what I have here (is
anyone else working on something similar?). It's copyrighted and is
being released under the GNU General Public License (see the end of
this file for more details). Note that only this interface falls under
the GNU General Public License; the thesaurus itself has a completely
separate and independent "copyright".
The Emacs-Lisp functions in this package allow you to query a
thesaurus for synonyms of a word. For example, you can ask Emacs to
quickly display a thesaurus entry for "editor":
-------------------------------------------------------------------------------
***** Word: editor
#593. Book. -- N. booklet; writing, work, volume, tome, opuscule;
tract, tractate; livret; brochure, libretto, handbook, codex, manual,
pamphjlet, enchiridion, circular, publication; chap book.
part, issue, number livraison; album, portfolio; periodical, serial,
magazine, ephemeris, annual, journal.
paper, bill, sheet, broadsheet; leaf, leaflet; fly leaf, page; quire,
ream
chapter, section head, article paragraph, passage, clause.
folio, quarto, octavo; duodecimo, sextodecimo, octodecimo.
encyclopedia; encompilation; library, bibliotheca; press &c.
(publication) 531.
writer, author, litterateur, essayist, journalism; pen, scribbler, the
scribbling race; literary hack, Grub-street writer; writerr for the press,
gentleman of the press, representative of the press; adjective jerker,
diaskeaus, ghost, hack writer, ink slinger; publicist; reporter, penny a
liner; editor, subeditor; playwright &c. 599; powt &c. 597.
bookseller, publisher; bibliopole, bibliopolist; librarian; bookstore,
bookseller's shop.
knowledge of books, bibliography; book learning &c. (knowledge) 490.
Phr. "among the giant fossils of my past" [E. B. Browning]; craignez
tout d'un auteur en courroux; "for authors nobler palms remain" [Pope]; "I
lived to write and wrote to live" [Rogers]; "look in thy heart and write"
[Sidney]; "there is no Past so long as Books shall live" [Bulwer Lytton);
"the public mind is the creation of the Master-Writers" [Disraeli]; volumes
that I prize above my dukedom" [Tempest].
-------------------------------------------------------------------------------
*******************************************************************************
***** REQUIREMENTS:
To use this, you need the following (besides the files that came
with this README file):
* A copy of the thesaurus itself (which is not included with this README
file). Thanks to Project Gutenberg, a copy of the 1911 Roget's
Thesaurus has been made available via anonymous ftp from
mrcnext.cso.uiuc.edu [ 128.174.201.12 ] (please ftp the file during
off-hours -- at times OTHER THAN 10:00 AM to 6:00 PM Central Standard
Time (Daylight in summer)). It's in the directory "/etext":
-rw-r--r-- 1 24 micro 1377400 Jun 19 18:08 roget11.txt
-rw-r--r-- 1 24 micro 592247 Jun 19 18:13 roget11.zip
You only need one of these, as roget11.zip is roget11.txt in a .ZIP
file. Note the sizes, however: about 1.3MB for the text file and about
0.6MB for the .ZIP file.
* A copy of Perl 4.0, compiled with dbm/ndbm support. The thesaurus
indexing and low-level access routines are written as Perl scripts
(this was done to avoid having to load the entire 1.3MB thesaurus into
Emacs, bloating its process size), and part of the index is stored as
a dbm database.
* While building the index (an index must be built from the raw
thesaurus data), it is recommended that your system have plenty of
free RAM and swap space, as a single 10-12 megabyte process is created
during the indexing process. Once the index is created, you need far
fewer resources to access the thesaurus.
* You need about two megabytes of free disk space. The thesaurus
occupies about 1.3MB, and the index files occupy another half
megabyte or so.
Installation instructions are given below.
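Before building the index, it may be worth confirming that the
installed perl can actually open a dbm database. The following is only
a sketch and is not part of this package: the function name
`check_perl_dbm` and the scratch directory are made up for
illustration, and it assumes a Bourne-compatible shell with mktemp(1)
available.

```shell
#!/bin/sh
# Hypothetical helper (not part of this package): report whether the
# perl found in $PATH can create a dbm database.
check_perl_dbm() {
    # Work in a scratch directory so no stray database files are left behind.
    tmpdir=$(mktemp -d) || return 2
    # dbmopen() fails (or aborts) when perl was built without dbm/ndbm.
    if perl -e 'dbmopen(%h, $ARGV[0], 0644) || exit 1; $h{"k"} = "v"; dbmclose(%h);' \
            "$tmpdir/probe" >/dev/null 2>&1; then
        echo "perl dbm support: ok"
        status=0
    else
        echo "perl dbm support: missing (or perl not found)"
        status=1
    fi
    rm -rf "$tmpdir"
    return $status
}

check_perl_dbm || echo "Rebuild perl with dbm/ndbm before running makeindex."
```

The check only exercises perl itself; it says nothing about whether
there is enough RAM or disk space for the indexing step.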
*******************************************************************************
***** USAGE:
The GNU Emacs interface provides three functions:
thesaurus-lookup-word
This function will prompt for a word to look up, and all entries
that begin with this word will be displayed. To display only the
entry for exactly this word, specify a prefix argument (e.g., C-u).
thesaurus-lookup-word-in-text
This function will extract the word under the cursor and run
`thesaurus-lookup-word' upon it. A prefix argument can be specified
to force the display of only the entry for exactly this word.
thesaurus-show-words
This function will prompt for a word and will display all words
in the thesaurus that begin with this word.
These functions should be bound to some key sequences; however, this
package does not do this. You'll have to do it yourself.
There is also a shell-command-line interface to the thesaurus
(which is what the GNU Emacs interface uses). Using the "th" Perl
script, you can query the thesaurus for a number of things:
th <word> [<word> ...]
Search the thesaurus for all entries that begin with
"<word>". Multiple words can be specified here.
th -V <word> [<word> ...]
Search the thesaurus for all entries that begin with
"<word>". All displayed entries are separated by a line
of dashes.
th -W <word> [<word> ...]
Search the thesaurus for the entry that contains
"<word>" exactly.
th -w <word> [<word> ...]
Display all words in the thesaurus that begin with
"<word>".
th -w -v <word> [<word> ...]
Display all words in the thesaurus that begin with
"<word>". Alongside each word, the numbers of the
entries that contain the word are displayed.
th -n <number>
Display thesaurus entry number "<number>". Unlike a
word, only one number can be specified.
Generally, you will want to pipe the output to more(1) or less(1).
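For instance, a small shell function can send the output through a
pager automatically. This is only a sketch: the name `thp` is made up
here, it assumes a Bourne-compatible shell with "th" already in your
$PATH, and it falls back to more(1) when $PAGER is unset.

```shell
# Hypothetical convenience wrapper (not part of this package).
# Runs "th" with the given arguments and pages the result.
thp() {
    # $PAGER is left unquoted on purpose so that a value such as
    # "less -R" is split into a command and its options.
    th "$@" | ${PAGER:-more}
}

# Example (uncomment to try; requires an installed thesaurus):
# thp -V editor
```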
*******************************************************************************
***** PROBLEMS:
Error handling needs work. Nothing is output if a word is not
found in the thesaurus.
The scripts are simple-minded, and occasionally screw up
(fortunately, this seems to be rare).
There are typos in the thesaurus, which can cause the scripts to
mis-index very small parts of the thesaurus.
The scripts used to build the indices are inefficient and are
unbelievably poorly written. Fortunately, this doesn't really matter,
as the index creation process is a one-time task. Looking up words
in the thesaurus is quite fast.
The thesaurus is stored in an uncompressed form. I thought about
breaking the thesaurus apart and storing each entry as a separate,
compressed file, but this method loses some information in the
thesaurus (which cannot currently be accessed by these routines). It
might be interesting to try compiling Perl with GNU dbm and storing the
entire thesaurus as a monolithic gdbm database (one entry per datum).
*******************************************************************************
***** INSTALLATION INSTRUCTIONS:
The following assumes that you are familiar with Emacs and that you
have installed a copy of Perl 4.0.
1. Create a directory to hold all of the files.
2. Copy the files that came with this README file into that directory.
3. Copy the thesaurus into that directory.
4. cd to that directory.
5. Link the thesaurus to the name "roget.txt". For example, if the
copied thesaurus is called "roget11.txt", you can use a symbolic link
(if your system supports symbolic links):
ln -s roget11.txt roget.txt
or you can use hard links:
ln roget11.txt roget.txt
6. Run the script "makeindex". This script runs the other scripts to
build the index files. On most modern machines with adequate
resources, it'll take about 5-10 minutes to run (less, on fast
machines, and more, on slow machines). Note that a single 10-12
megabyte process is created during the indexing procedures, so be
sure that you have plenty of free RAM (otherwise, you'll go into swap
h*ll, and this procedure could take hours). Once everything is done,
three files will be created:
-rw-r--r-- 1 root other 4096 Dec 17 20:50 offsets.dir
-rw-r--r-- 1 root other 16384 Dec 17 20:50 offsets.pag
-rw-r--r-- 1 root other 492733 Dec 17 20:49 word-index
(The file sizes may be different on your machine.)
The files that begin with "offset" comprise a dbm/ndbm database of
thesaurus file byte offsets versus entry number (i.e., for a given
entry number, the corresponding file byte offset of the beginning of
that entry is stored). This means that, if the thesaurus file is
ever edited or changed, you MUST re-execute the "makeindex" script to
rebuild the indices.
The file "word-index" is an ASCII text file of words and entry
numbers, e.g.:
creditor: 805
creed: 484
creek: 198 343 348
In this example, the word "creditor" is mentioned in entry #805,
"creed" is mentioned in entry #484, and "creek" is mentioned in
entries #198, #343, and #348.
7. Edit the file "th", and find the line (around line 69):
$thesaurus_dir = "/usr/local/lib/roget";
Change "/usr/local/lib/roget" to point to the directory containing
the thesaurus and index files.
8. Add this directory to your $PATH. If you don't, you won't be able to
run the "th" command (Emacs needs this).
9. Edit your .emacs file and add this directory to your load-path. Also
add a line like the following:
(load-library "thesaurus")
10. That's it.
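Since "word-index" is plain text in the format shown in step 6, it can
also be inspected without "th". The following is a rough sketch,
assuming each line has the form "word: number number ..." exactly as in
that example. The function name `entries_for` and the sample file are
made up for illustration, and the real "th" script may well do its
lookups differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of this package): print the entry
# numbers listed for one word in a word-index style file.
#   usage: entries_for <word> <index-file>
entries_for() {
    # Split each line at ": "; field 1 is the word, field 2 the numbers.
    awk -F': ' -v w="$1" '$1 == w { print $2 }' "$2"
}

# Build a tiny sample index from the three lines shown in step 6.
sample=$(mktemp) || exit 1
cat > "$sample" <<'EOF'
creditor: 805
creed: 484
creek: 198 343 348
EOF

entries_for creek "$sample"      # prints: 198 343 348
entries_for creditor "$sample"   # prints: 805
rm -f "$sample"
```

Each number printed this way could then be passed to "th -n" to display
the corresponding entry.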
*******************************************************************************
***** Legal foo:
-------------------------------------------------------------------------------
These thesaurus indexing and accessing routines are copyrighted.
Copyright (C) 1991 Darryl Okahata ([email protected])
NOTE THAT THE THESAURUS ITSELF HAS A COMPLETELY SEPARATE AND INDEPENDENT
"COPYRIGHT". SEE THE THESAURUS FOR DETAILS.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 1, or (at your option)
any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-------------------------------------------------------------------------------
-- Darryl Okahata
Internet: [email protected]
DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion or policy of Hewlett-Packard or of the
little green men that have been following him all day.