* * * * *
A Google spiders
In checking the log files for this site [1] I've noticed that Google [2] has
finally found it and has spent the past few days spidering through it.
There are a few thousand links for it to follow (out of what? A million
potential URLs (Uniform Resource Locators) on this site? I know the
Electric King James [3] has over fifteen million URLs [4]). For instance,
there are three URLs just for the years, plus 12 for each year, one per
month (okay, so there are only 11 for this year, but close enough), so
that's 39 URLs. Each day that has an entry adds at least one more URL, and
while I may have skipped a day or two here and there, let's say there's an
average of 300 such days per year, so that's about 900 there. And if you
assume an average of two entries per day (remember, you can retrieve the
entire day, or just an entry), that's another 600 per year, or 1,800, so
we're now up to nearly 3,000 URLs that Google has to crawl through (with
lots of duplication).
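Just to check that back-of-the-envelope math, here's a quick sketch of the
same estimate in Python (the per-year figures are the rough guesses from
above, not actual counts pulled from the server):
#-----------------------------
# Rough count of crawlable URLs on this site, using the
# guesses above (not actual numbers from the server).
#-----------------------------
years           = 3     # the archive spans three years
months_per_year = 12    # one URL per month view
days_per_year   = 300   # days with at least one entry (rough average)
entries_per_day = 2     # average entries per day

year_urls  = years                                      # 3
month_urls = years * months_per_year                    # 36
day_urls   = years * days_per_year                      # ~900
entry_urls = years * days_per_year * entries_per_day    # ~1,800

print(year_urls + month_urls + day_urls + entry_urls)   # 2739 -- nearly 3,000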
robots.txt for bible.conman.org [5]
#-----------------------------
# Go away---we don't want you
# to endlessly spider this
# site.
#-----------------------------
User-agent: *
Disallow: /
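A quick way to sanity-check those two lines (a sketch using Python's
standard robotparser module; the sample URLs are just illustrative):
#-----------------------------
# Sanity check: the robots.txt above should block every
# user agent from every path on the site.
#-----------------------------
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Both of these print False -- nothing is fetchable by anyone.
print(rp.can_fetch("Googlebot", "http://bible.conman.org/"))
print(rp.can_fetch("*", "http://bible.conman.org/some/entry"))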
There's a reason I don't allow web robots/spiders to the Electric King James
[6]—it would take way too long to index the site (if indeed, the spider in
question was even aware of all the possible URLs) and my machine isn't all
that powerful to begin with (it being a 33MHz 486 and all). But I feel that
there is a research problem lurking here that some enterprising Master's or
Ph.D. candidate could tackle: how best to spider a site that allows multiple
views per document.
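To make the problem concrete: a naive spider that de-duplicates by hashing
whole pages won't notice that the per-entry views are just pieces of the
whole-day view, but hashing the individual entries would catch it. A rough,
purely illustrative sketch (the URLs and page contents are made up, and this
is nowhere near a real answer to the research problem):
#-----------------------------
# Purely illustrative: one naive way a spider might skip
# redundant views of the same content.
#-----------------------------
import hashlib

# The same two entries are reachable as one whole-day view and
# as two single-entry views.
views = {
    "/2000/12/15":   ["entry one", "entry two"],  # whole-day view
    "/2000/12/15.1": ["entry one"],               # first entry alone
    "/2000/12/15.2": ["entry two"],               # second entry alone
}

seen = set()
worth_indexing = []
# Visit the larger views first so the smaller ones become redundant.
for url, entries in sorted(views.items(), key=lambda kv: -len(kv[1])):
    hashes = {hashlib.sha1(e.encode()).hexdigest() for e in entries}
    if not hashes <= seen:          # does this view contain anything new?
        worth_indexing.append(url)
    seen |= hashes

print(worth_indexing)               # ['/2000/12/15'] -- one URL, not three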
[1]
http://boston.conman.org/
[2]
http://www.google.com/
[3]
http://literature.conman.org/bible/
[4]
gopher://gopher.conman.org/0Phlog:2000/08/31.2
[5]
http://bible.conman.org/
[6]
http://literature.conman.org/bible/