The CollectingParser is an example of how to use the SGML library (which
comes standard with Python) to parse HTML documents.  It works with
Python 1.3; it has not yet been tested against 1.4.

Usage instructions are in the comments at the head of
CollectingParser.py.  Here's an example of what it can do:

% python CollectingParser.py http://www.python.org/sigs/web-sig/
URL: http://www.python.org/sigs/web-sig/
Title: Web SIG - Using Python for handling the World Wide Web
Size: 2939
Tables: 0
Frames: 0
Java: 0
Forms: 0
Isindex: 0
Foreground color:
Link color: #0000FF
Background color: #FFFFFF
Background image: 1
Images:
       http://www.python.org/../pics/ArrowLeft.gif
       http://www.python.org/../pics/ArrowRight.gif
       http://www.python.org/sigs/web-sig/HTMLgen_banner.gif
       http://www.python.org/../pics/ArrowLeft.gif
       http://www.python.org/../pics/ArrowRight.gif
Links:
       http://www.python.org/../ (Home)
       http://www.python.org/../python/ (Software)
       http://www.python.org/../doc/ (Documentation)
       http://www.python.org/../psa/ (PSA)
       http://www.python.org/../workshops/ (Workshops)
       http://www.python.org/../sigs/ (SIGs)
       http://www.python.org/../locator/ (Search)
       http://www.python.org/sigs/web-sig/HTMLgen.beta.tar.gz (download it here)
       http://www.python.org/sigs/web-sig/mission (list mission statement)
       mailto:[email protected] ([email protected])
       mailto:[email protected] ([email protected])
       http://www.python.org/../ (Home)
       http://www.python.org/../python/ (Software)
       http://www.python.org/../doc/ (Documentation)
       http://www.python.org/../psa/ (PSA)
       http://www.python.org/../workshops/ (Workshops)
       http://www.python.org/../sigs/ (SIGs)
       http://www.python.org/../locator/ (Search)
Words: sig, python, handling, wide, software, documentation, psa,
workshops, sigs, sig, python, handling, wide, august, 15th, robin,
friedrich, released, beta, version, htmlgen, module, download, 160k, gzip,
tar, file, containing, test, script, supporting, files, complete,
documentation, set, created, daniel, larsson, gendoc, package, changes,
set, classes, flexible, table, generation, rewrite, list, classes, support,
full, nesting, probably, beta, release, important, addresses, list,
content, submissions, addr, sig, python, org, subscriptions, addr, sig,
request, python, org, list, admin, addr, sig, admin, python, org, list,
owner, addr, sig, owner, python, org, get, instructions, list, send,
message, containing, word, help, body, subscriptions, address, sig,
request, python, org, contact, list, owner, need, individual, help, click,
see, list, mission, statement, comments, send, email, webmaster, python,
org, questions, python, send, email, python, help, python, org, software,
documentation, psa, workshops, sigs
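
The kind of collection shown in the listing above can be sketched roughly
as follows.  This is not the CollectingParser code.  The SGML module
(sgmllib) is no longer part of Python 3, so the sketch swaps in
html.parser from the current standard library.  The class name
LinkCollector and its attributes are made up for illustration, and it
gathers only the title, the image URLs, and the links with their anchor
text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for CollectingParser; not the original class."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.images = []      # absolute image URLs, in document order
        self.links = []       # (absolute URL, anchor text) pairs
        self._in_title = False
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # pieces of anchor text seen so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self._href = urljoin(self.base_url, attrs["href"])
            self._text = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._text.append(data)


# Small made-up page, used only to exercise the sketch.
page = (
    "<html><head><title>Web SIG</title></head><body>"
    '<img src="HTMLgen_banner.gif">'
    '<a href="mission">list mission statement</a>'
    "</body></html>"
)
collector = LinkCollector("http://www.python.org/sigs/web-sig/")
collector.feed(page)
print("Title:", collector.title)
print("Images:", collector.images)
print("Links:", collector.links)
```

Note that urljoin resolves "../" segments while joining, so URLs produced
by this sketch will not look exactly like the ones in the listing above.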

Contents of the tar file:

CollectingParser.py       the CollectingParser class
Stopwords.py              a list of stopwords to ignore
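
How a stopword list feeds into the "Words:" output might look like the
sketch below.  The contents of Stopwords.py are not shown here, so the
STOPWORDS set and the collect_words function are assumptions made for
illustration, not the actual code in the tar file.

```python
import re

# Hypothetical stopword set; the real list lives in Stopwords.py.
STOPWORDS = {
    "a", "the", "for", "of", "to", "is", "it", "this",
    "using", "world", "web",
}


def collect_words(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphanumeric runs, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in stopwords]


title = "Web SIG - Using Python for handling the World Wide Web"
print(", ".join(collect_words(title)))
```

Keeping the list in its own module means it can be extended without
touching the parser.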

Comments and suggestions are welcome.  I hope this is useful to some.

Tessa Lau
[email protected]