* * * * *
Indexing weblogs
I see that Nick Denton [1] is launching a new venture [2] that seems to be
centered around marketing and weblog indexing; specifically, thoughts about
weblog indexing.
I've talked [3] about this a bit, but if a dedicated search engine wants to
successfully scan a weblog there are a few ways to go about it.
One, grab the RSS (Rich Site Summary) [4] file for the weblog and index the
links from that. That will allow you to populate the search engine with the
permanent links for the entries. Another thing it will allow you to do is
properly index the appropriate entries. Google [5] does a good job of
indexing pages, but a rather poor one of indexing individual entries of a
weblog, since it generally views pages as one entity and not as a possible
collection of entities. So if I mention, say, “hot dogs” on the first of
the month, “wet paper towels” on the fifteenth and “ugly gargoyles at Notre
Dame” on the last day of the month, someone looking for “hot wet gargoyles”
at Google [6] is going to find the page that archives that month.
That is probably not what either I or the searcher in question wants.
Well, unless I'm looking for disturbing search request [7] material, but I
digress.
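
As a rough sketch of that first step (pulling the permanent links out of the
RSS file), something like the following Python would do, assuming the feed is
well-formed XML with <item> elements that each carry a <link>. The function
names and the sample feed are made up for illustration; the second entry ID
in particular is hypothetical.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip any '{namespace}' prefix so namespaced feeds look the same."""
    return tag.rsplit('}', 1)[-1]

def permalinks_from_rss(rss_text):
    """Return (title, link) pairs for every <item> that has a <link>."""
    root = ET.fromstring(rss_text)
    entries = []
    for elem in root.iter():
        if local_name(elem.tag) != 'item':
            continue
        # Collect the item's direct children by local tag name.
        fields = {local_name(child.tag): (child.text or '').strip()
                  for child in elem}
        if fields.get('link'):
            entries.append((fields.get('title', ''), fields['link']))
    return entries

# Hypothetical feed, for illustration only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example weblog</title>
    <link>http://www.example.net/</link>
    <item>
      <title>Hot dogs</title>
      <link>http://www.example.net/200204-index.html#31415926</link>
    </item>
    <item>
      <title>Gargoyles</title>
      <link>http://www.example.net/200204-index.html#27182818</link>
    </item>
  </channel>
</rss>"""

for title, link in permalinks_from_rss(SAMPLE_RSS):
    print(title, link)
```

The channel's own <link> is skipped on purpose, since only the per-entry links
are worth indexing under.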
Even if the permanent links point to a portion of a page, the link would be
something like
http://www.example.net/200204-index.html#31415926
Which points to a part of the page at
http://www.example.net/200204-index.html
And somewhere on that page is an anchor tag with the ID of “31415926” which
is most likely at the top of the entry in question. From there you index
until you hit the next named anchor tag that matches another entry in the RSS
(Rich Site Summary) file.
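
The anchor-to-anchor indexing described above might look something like this
in Python. It is only a sketch: it assumes the entry anchors are either
<a name="..."> tags or elements with a matching id attribute, and the class
name, function name and sample page are all invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

class EntrySplitter(HTMLParser):
    """Collect page text into one bucket per known entry anchor."""

    def __init__(self, anchor_ids):
        super().__init__()
        self.wanted = set(anchor_ids)
        self.current = None   # anchor of the entry being read, if any
        self.chunks = {}      # anchor -> list of text fragments

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        anchor = attrs.get('id') or (attrs.get('name') if tag == 'a' else None)
        # Only anchors that match an entry from the RSS file start a new bucket.
        if anchor in self.wanted:
            self.current = anchor
            self.chunks.setdefault(anchor, [])

    def handle_data(self, data):
        # Text before the first known anchor is ignored.
        if self.current is not None:
            self.chunks[self.current].append(data)

def split_entries(html_text, permalinks):
    """Map each permalink fragment to the text of its entry on the page."""
    ids = [urldefrag(link).fragment for link in permalinks]
    parser = EntrySplitter([i for i in ids if i])
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace so each entry is one clean string.
    return {anchor: ' '.join(' '.join(parts).split())
            for anchor, parts in parser.chunks.items()}

# Hypothetical archive page and links, for illustration only.
PAGE = """<html><body>
<h1>April 2002</h1>
<a name="31415926"></a><h2>Monday</h2><p>I had hot dogs for lunch.</p>
<a name="27182818"></a><h2>Tuesday</h2><p>Ugly gargoyles at Notre Dame.</p>
</body></html>"""

LINKS = [
    "http://www.example.net/200204-index.html#31415926",
    "http://www.example.net/200204-index.html#27182818",
]

for anchor, text in split_entries(PAGE, LINKS).items():
    print(anchor, text)
```

Each entry's text can then be indexed under its own permanent link rather than
under the archive page as a whole.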
And if you hit a site like mine, the RSS (Rich Site Summary) [8] file will
have links that bring up individual pages for each entry.
Now, you might still have to contend with a weblog that doesn't have an RSS
(Rich Site Summary) file, but then you could just fall back to indexing
between named anchor points anyway and use heuristics to figure out what may
be the permanent links to index under.
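
One possible heuristic, and it is purely my assumption rather than anything a
real search engine is known to do: treat a named anchor as a permanent link
only if the page itself also links to that anchor, since weblogs tend to link
each entry to itself. A minimal Python sketch, with invented names and a
made-up sample page:

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class AnchorScanner(HTMLParser):
    """Record named anchors and the fragments this page links to on itself."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = urldefrag(page_url).url
        self.names = []        # named anchors, in document order
        self.targets = set()   # fragments of hrefs that point back at this page

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        if attrs.get('name'):
            self.names.append(attrs['name'])
        if attrs.get('href'):
            # Resolve relative links, then split off the fragment.
            base, fragment = urldefrag(urljoin(self.page_url, attrs['href']))
            if fragment and base == self.page_url:
                self.targets.add(fragment)

def guess_permalinks(page_url, html_text):
    """Return likely permanent links: named anchors the page links to itself."""
    scanner = AnchorScanner(page_url)
    scanner.feed(html_text)
    scanner.close()
    return [urljoin(scanner.page_url, '#' + name)
            for name in scanner.names if name in scanner.targets]

# Hypothetical archive page, for illustration only.
PAGE_NO_RSS = """<html><body>
<a name="top"></a><h1>April 2002</h1>
<a name="31415926"></a><h2>Monday</h2><p>Hot dogs.</p>
<a href="200204-index.html#31415926">permanent link</a>
<a name="27182818"></a><h2>Tuesday</h2><p>Gargoyles.</p>
<a href="#27182818">permanent link</a>
</body></html>"""

for link in guess_permalinks("http://www.example.net/200204-index.html",
                             PAGE_NO_RSS):
    print(link)
```

The "top" anchor is dropped because nothing on the page links to it, which is
the sort of navigation anchor you would not want to index under anyway.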
I'm sure that people looking for “hot wet gargoyles” will thank you.
[1]
http://www.nickdenton.org/
[2]
http://www.nickdenton.org/newventure.htm
[3]
gopher://gopher.conman.org/0Phlog:2002/03/28.1
[4]
http://www.webreference.com/xml/column13/
[5]
http://www.google.com/
[6]
http://www.google.com/search?hl=en&q=hot+wet+gargoyles
[7]
http://searchrequests.weblogs.com/
[8]
https://boston.conman.org/bostondiaries.rss