Path: news.ruhr-uni-bochum.de!news.rhrz.uni-bonn.de!RRZ.Uni-Koeln.DE!news.gtn.com!blackbush.xlink.net!howland.erols.net!newsfeed.internetmci.com!newsfeed.direct.ca!nntp.teleport.com!usenet
From: [email protected] (Steffen Beyer)
Newsgroups: comp.lang.perl.announce,comp.lang.perl.misc
Subject: ANNOUNCE: generate tree representation of WWW site
Followup-To: comp.lang.perl.misc
Date: 1 Sep 1996 21:39:24 GMT
Organization: sd&m GmbH & Co. KG Munich, Germany
Lines: 112
Approved: [email protected] (comp.lang.perl.announce)
Message-ID: <[email protected]>
Reply-To: [email protected] (Steffen Beyer)
NNTP-Posting-Host: kelly.teleport.com
X-Disclaimer: The "Approved" header verifies header information for article transmission and does not imply approval of content.
Xref: news.ruhr-uni-bochum.de comp.lang.perl.announce:412 comp.lang.perl.misc:43786
Recently I have written a Perl script to generate a tree representation
of a complete WWW site, or subtrees thereof, which I think might be
useful to others as well.
Its purpose is to give visitors to your web site a useful overview of all
the pages you offer, where they are located, and which of them they have
already visited.
Please find more details about this script in the following excerpt from
the README file that accompanies it!
If you're interested, please download the script from
http://www.sdm.de/e/www/hilfe/gen_tree-1.1.tar.gz or from
ftp://..../..../CPAN/authors/id/STBEY/gen_tree-1.1.tar.gz
on any CPAN (= Comprehensive Perl Archive Network) ftp server near you.
(See "The Perl 5 Module List" by Tim Bunce and Andreas Koenig in
news:comp.lang.perl.modules for a list of CPAN ftp servers.)
Most important: Enjoy! :-)
Requirements:
Perl version 5.002 or higher. Your web pages must be compatible with the
Apache HTTP server regarding the syntax of server side includes and
server side image maps.
What does it do:
This script scans the tree (more precisely: the directed graph) of HTML
pages of a web site. (It's not always a tree, because cycles and loops
are possible!)
It starts at the home page of that site (called the "root page" here)
and follows all hyperlinks in a recursive descent (breadth first, so
that the resulting representation comes out in the expected order).
(You can also scan just a subtree of your web site if you want.)
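In much simplified terms, the scanning phase works roughly like the
sketch below; the names (@queue, %seen) are invented for the example,
and details like relative path resolution are ignored:

    use strict;

    my @queue = ('index.html');    # the "root page"
    my %seen;                      # the real script keys this on
                                   # device/inode (see below)
    while (@queue) {
        my $file = shift(@queue);  # shift() = FIFO = breadth first
        next if $seen{$file}++;    # already visited?
        local $/;                  # slurp mode
        open(PAGE, $file) or next;
        my $html = <PAGE>;
        close(PAGE);
        # very naive extraction of local hyperlinks:
        while ($html =~ m!<A\s+HREF="([^":?#]+)"!gi) {
            push(@queue, $1);
        }
    }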
Since it scans files in the file system of the host where the web site
resides, it is confined to pages that physically reside on one host (!).
The web server (HTTP daemon) of the web site is NOT used at all (!).
(That's also why it doesn't use the libwww (LWP) module.)
Cycles and loops are recognized by uniquely identifying each page
through the device and inode numbers of its corresponding file.
Therefore, this script is confined to UNIX hosts, or hosts where the
device and inode numbers returned by "stat" serve the same purpose
as they do under UNIX.
One could lift this latter restriction by using checksums for
identification instead; this is not 100% reliable, however.
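The identification itself takes just two fields of "stat"; a sketch
(the sub name is invented):

    # Identify a page by device and inode numbers: two different
    # paths to the same file (e.g. via symbolic links) yield the
    # same key, which is how cycles are detected.
    sub page_id {
        my ($file) = @_;
        my ($dev, $ino) = (stat($file))[0, 1];
        return defined($dev) ? "$dev:$ino" : undef;
    }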
When scanning of the web site is complete, an HTML page is generated
which contains one hyperlink to each of the pages found.
(The parse tree that is built in memory during the scanning phase is
traversed in a recursive descent, this time depth first, to yield a
tree in the expected layout.)
The tree structure of the web site is reflected in this page by the
indentation of these hyperlinks.
The text displayed in these hyperlinks is extracted from the
<TITLE> ... </TITLE> tags of the corresponding pages.
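A simplified sketch of this output pass; the node layout with 'file',
'title' and 'kids' fields is invented for the example and is not the
script's real data structure:

    # Depth-first traversal, printing one indented hyperlink per page:
    sub print_tree {
        my ($node, $depth) = @_;
        print '&nbsp;' x (4 * $depth),
              '<A HREF="', $node->{'file'}, '">',
              $node->{'title'}, "</A><BR>\n";
        foreach my $kid (@{$node->{'kids'}}) {
            print_tree($kid, $depth + 1);
        }
    }

    # Extracting the link text from a page's source:
    sub page_title {
        my ($html) = @_;
        my ($title) = $html =~ m!<TITLE>\s*(.*?)\s*</TITLE>!is;
        return defined($title) ? $title : '(untitled)';
    }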
Supported features:
This script is capable of executing server side includes and of analyzing
server side image maps (client side image maps wouldn't be very hard to
add). Their syntax must be compatible with the Apache HTTP server.
This way, no important hyperlinks are missed. (Many home pages consist of
an image map and nothing else!)
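To illustrate (this is not the script's actual code), handling these
two features boils down to something like the following simplified
sketch, which covers only the 'file=' form of includes and the plain
Apache map file format (sub names invented):

    # Expand Apache-style server side includes in a page's source:
    sub expand_ssi {
        my ($html, $dir) = @_;
        $html =~ s{<!--#include\s+file="([^"]+)"\s*-->}{
            my $text = '';
            if (open(INC, "$dir/$1")) {
                local $/;    # slurp
                $text = <INC>;
                close(INC);
            }
            $text;
        }gie;
        return $html;
    }

    # Collect the target URLs from an Apache server side image map
    # (the second field of every non-comment line):
    sub map_links {
        my ($mapfile) = @_;
        my @links;
        open(MAP, $mapfile) or return ();
        while (<MAP>) {
            next if /^\s*(#|$)/;
            my @field = split(' ');
            push(@links, $field[1]) if @field >= 2;
        }
        close(MAP);
        return @links;
    }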
It is also able to analyze CGI scripts, simply by calling them and
analyzing their output. (Therefore, no HTTP server is needed!)
Passing variable parameters to CGI scripts is not supported; passing
constants to all CGI scripts via environment variables is possible,
however.
(Passing variable parameters (like query strings) is conceptually
problematic: imagine getting back a list of hyperlinks (possibly a
different one for each query) from a full text search CGI script on
your web site!)
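Calling a CGI script directly might look like this sketch; the
environment variables shown are the standard CGI ones, and 'menu.cgi'
is a made-up example, not part of the script:

    # "Constant" parameters go in via the environment, the output
    # comes back via backticks (i.e. the script's stdout):
    $ENV{'REQUEST_METHOD'} = 'GET';
    $ENV{'QUERY_STRING'}   = '';    # no variable parameters
    my $output = `./cgi-bin/menu.cgi`;
    $output =~ s/^.+?\r?\n\r?\n//s; # drop the CGI header block
    # ...then analyze $output like any other HTML page.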
While the web site is being scanned, a detailed log file is written. Most
of the time, it's a very good idea to read it because it lets you discover
flaws in your web site that often go unnoticed otherwise!
The files generated by this script (the log file and the output file) are
never overwritten: instead, older versions are archived by appending an
ever-increasing number to their file names.
This way, you can always go back to a previous state if anything bad
should ever happen.
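The archiving scheme is simple enough to sketch in a few lines (the
sub name is invented):

    # Never overwrite: move an existing file aside by appending the
    # next free version number to its name.
    sub archive {
        my ($file) = @_;
        return unless -e $file;
        my $n = 1;
        $n++ while -e "$file.$n";
        rename($file, "$file.$n")
            or die "Can't archive '$file': $!\n";
    }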
To see a working example of a page generated by this script, direct your
browser to
http://www.sdm.de/e/www/hilfe/
Yours,
--
Steffen Beyer ________________________ C:\ONGRATLN.W95 _______________________
mailto:[email protected] |s |d &|m | software design & management GmbH&Co.KG
phone: +49 89 63812-244 | | | | Thomas-Dehler-Str. 27
fax: +49 89 63812-150 | | | | 81737 Munich, Germany.