* * * * *
Ramblings about search engine optimizations and bandwidth utilization
For the past week or so, I've been playing around with search [1] engine [2]
optimizations [3] (that last link is so I know what not to do) and poring
through the log files.
The last time I made a major search engine optimization to my site was four
years ago [4], and the reason for that optimization was to get rid of the
disturbing search requests [5] that were plaguing the log files (and my mind)
at the time. It also had the added benefit of reducing the amount of
“duplicate content” on my site. A search engine like Google [6] would skip
indexing the monthly archives (as well as the front page) but would index the
individual entries. The end result: no more disturbing search requests, and
better results for people actually looking for stuff.
But it didn't reduce all the duplicate content. There was still the small
problem of /2000/1/1.1 having the same content as /2000/01/01.1 (note the
leading zeros). Technically, they are two separate pages, each with a unique
URL (Uniform Resource Locator), although internally, the leading zero is
ignored by my blogging engine [7] and it would happily serve up the page
under either location.
Now, that particular duplicate content issue is something I've known about
since I started writing mod_blog; I had code to distinguish between the two
requests, but I never wrote the code to do anything about it. Until last
week. Now, go to /2000/1/1.1 and you'll get a permanent redirect to
/2000/01/01.1. This change should further reduce the amount of “duplicate
content” on my site, as well as reduce the number of hits from web spiders
indexing my site. (The redirection still doesn't happen under one very
specific condition, but fixing that pretty much requires a complete overhaul
of some very old code, and it's such a seldom-used bit of code that I'm not
terribly worried about it.)
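The canonicalization itself is simple enough. mod_blog is written in C, but a sketch of the idea in Python (all names here are illustrative, not the engine's actual code) looks something like this:

```python
import re

# Entry URLs look like /year/month/day.entry, where month and day may or
# may not carry a leading zero.
ENTRY_URL = re.compile(r'^/(\d{4})/(\d{1,2})/(\d{1,2})\.(\d+)$')

def canonical_path(path):
    """Return the zero-padded form of an entry URL, or None if the path
    doesn't match the /year/month/day.entry pattern."""
    m = ENTRY_URL.match(path)
    if m is None:
        return None
    year, month, day, entry = m.groups()
    return f"/{year}/{int(month):02d}/{int(day):02d}.{entry}"

def handle_request(path):
    """Issue a permanent redirect when the request isn't canonical;
    otherwise serve the page as usual."""
    canonical = canonical_path(path)
    if canonical is not None and canonical != path:
        return 301, canonical   # 301 Moved Permanently to the padded URL
    return 200, path            # already canonical (or not an entry URL)
```

So a request for /2000/1/1.1 gets a 301 pointing at /2000/01/01.1, and a well-behaved spider will drop the unpadded URL from its index.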
I'm a bit concerned about the spiders because of some other information I've
pulled out of the log files. My archive of log files (at least, of this
blog [8]) goes back to October of 2001 [9], and using some homegrown tools, I
generated (with the help of GNUPlot [10]) this graph of the growth of my site
over the past six years:
[Graph of traffic growth at The Boston Diaries] [11]
In red, you see the number of raw hits to this site (with the scale along the
left-hand side), with some explosive growth in early 2006 and again in just
the last few months. In green you see the actual bytes transferred (with its
scale along the right-hand side)—pretty steady up until January of 2006, when
it goes vertical, and it goes vertical again in just the past few months.
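The homegrown log crunching behind a graph like that doesn't need to be fancy. A sketch (assuming Apache-style log lines; this is not the actual tool) that tallies hits and bytes per month, ready to feed to gnuplot:

```python
import re
from collections import defaultdict

# Pull the date, status, and response size out of an Apache-style log
# line, e.g.:  ... [13/Aug/2007:10:00:00 -0400] "GET / HTTP/1.1" 200 1234
LOG_LINE = re.compile(
    r'\[(\d{2})/(\w{3})/(\d{4}):[^\]]+\] "[^"]*" \d{3} (\d+|-)')

def monthly_totals(lines):
    """Return two dicts keyed by 'YYYY-Mon': raw hit counts and total
    bytes transferred."""
    hits = defaultdict(int)
    bytes_sent = defaultdict(int)
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue            # skip malformed lines
        day, month, year, size = m.groups()
        key = f"{year}-{month}"
        hits[key] += 1
        if size != '-':         # '-' means no body was sent (e.g. a 304)
            bytes_sent[key] += int(size)
    return hits, bytes_sent
```

Dump the two dicts as columns in a text file and gnuplot can plot hits against the left axis and bytes against the right.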
And I'm at a loss to explain the sudden explosion of bandwidth usage on my
site—unless it's a lot of people hot linking [12] to images on this site (and
yes, that does happen quite often), or a vast increase in the number of
spiders indexing my site (for the past few months, Yahoo's [13] Slurp [14]
has been generating about 40,000 hits a month).
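Both suspects are easy enough to separate in the logs: spiders like Slurp announce themselves in the User-Agent header, and hotlinked images show up as image requests with an off-site Referer. A rough sketch (assuming Apache's combined log format; the hostname and field handling here are illustrative):

```python
import re

# The tail of a combined-format log line:
#   "REQUEST" STATUS SIZE "REFERER" "USER-AGENT"
COMBINED = re.compile(
    r'"(?P<req>[^"]*)" \d{3} (?:\d+|-) "(?P<ref>[^"]*)" "(?P<ua>[^"]*)"')

def classify(line, my_host="boston.conman.org"):
    """Bucket a log line as spider traffic, a hotlinked image, or other."""
    m = COMBINED.search(line)
    if m is None:
        return "unparsed"
    if "Slurp" in m.group("ua"):       # Yahoo's spider names itself
        return "spider"
    request = m.group("req").split()
    path = request[1] if len(request) > 1 else ""
    ref = m.group("ref")
    if path.endswith((".png", ".jpg", ".gif")) \
            and ref not in ("-", "") and my_host not in ref:
        return "hotlink"               # image fetched from someone else's page
    return "other"
```

Summing the response sizes per bucket would show which of the two is actually eating the bandwidth.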
I may no longer have disturbing search requests, but I now have a disturbing
use of bandwidth.
[1]
http://seo-theory.com/wordpress/
[2]
http://www.seobook.com/
[3]
http://seoblackhat.com/
[4]
gopher://gopher.conman.org/0Phlog:2003/08/14.1
[5]
http://www.disturbingsearchrequests.com/
[6]
http://www.google.com/
[7]
https://boston.conman.org/about/
[8]
https://boston.conman.org/
[9]
gopher://gopher.conman.org/1Phlog:2001/10
[10]
http://www.gnuplot.info/
[11]
gopher://gopher.conman.org/IPhlog:2007/08/13/growth.png
[12]
http://altlab.com/hotlinking.html
[13]
http://www.yahoo.com/
[14]
http://en.wikipedia.org/wiki/Yahoo!_Slurp
Email author at
[email protected]