The Ins and Outs of Calculating Weblog Traffic
Obligatory Sidebar Links
John Scalzi: Whatever 4/8/2002 [1] (what started this all)
Jeff Jarvis—Numbers game [2]
Matt Welch: Numbers for “B.S.” Detectors [3]
Glenn Reynolds—InstaPundit [4]
Tom Tomorrow—A hobby, not a profession [5]
The buzz in Bloggerton is about numbers: the number of readers a blog has,
and it's not an easy number to calculate. Over the past few months I've been
measuring myself against Sean Tevis [6], a fellow South Florida blogger (whom
I actually met in real life once). For a while we were pretty much at parity,
but over the past month or so he's taken off. As he states (as of today),
he is getting 4,000 visits and 10,000 page views per month.
And I'm wondering just how he's calculating that.
So here we go. Raw counts for The Boston Diaries [7]: January 2002: 14,297
requests. February 2002: 8,035 requests. And March 2002: 7,860 requests. Yes,
there's a rather big drop between January and February, but that can be
accounted for: 5,870 requests in January were from easily identifiable search
engine robots (4,726 from one alone), and 14,297 minus 5,870 leaves 8,427,
right in line with the following months. If we rerun the count for just the
popular browsers (basically, any agent reporting itself as Mozilla, which
covers Netscape, Mozilla, Opera and IE; yes, that skips Lynx, but the number
of hits via Lynx that aren't me is minuscule for the rough estimates I'm
doing here) and only pages (or files) that were successfully served up, we
get: January 2002: 5,880 requests. February 2002: 6,089 requests. And March
2002: 5,292 (ouch).
Now, I'm generating this by going over the raw logs with a custom program I
wrote that filters records and prints out selected fields (to make them
easier to grep through). Those last figures, for instance, were done with:
escanlog -status 200 -agent boston.conman.org | grep Mozilla | wc -l
escanlog is the program I wrote; here I instructed it to only consider
requests that completed successfully (-status 200) and to print just the user
agent field (-agent) from the log file in question (boston.conman.org). grep
and wc are standard Unix programs that search for patterns and count
characters, words, or lines (lines, in this invocation, thanks to the -l
option).
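If you don't have escanlog handy, something close is possible with standard
Unix tools alone. A sketch, under one loud assumption: that the server writes
Apache-style common logs, where the status code lands in field 9 when the
line is split on whitespace:

# count successful requests that mention Mozilla anywhere on the line
# (assumes Apache common/combined log format; status code is field 9)
awk '$9 == 200' access.log | grep -c Mozilla

It's rougher than the escanlog version, since grep matches Mozilla anywhere
on the line rather than just in the agent field, but for ballpark figures it
lands in the same place.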
But those figures are, again, misleading. They include images, requests for
the RSS (Rich Site Summary) [8] file, the CSS (Cascading Style Sheet) [9]
file; extraneous stuff that doesn't really constitute an actual page view.
Going over the logs again, this time only taking into account pages (most
likely) viewed by humans, we get: January 2002: 1,805. February 2002: 2,090.
March 2002: 1,538 (ow! But it's still an improvement over December 2001 at
1,090).
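In pipeline terms, that second pass might look like the sketch below, again
assuming Apache-style logs (the requested path would be field 7; the
extension list is purely illustrative):

# keep successful requests whose path isn't an image, feed, or stylesheet
awk '$9 == 200 && $7 !~ /\.(gif|jpg|png|css|rss)$/' access.log | grep -c Mozilla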
Oh wait, one more variable to control for: those counts include my own
visits. Remove those, and the results are: December 2001 (since I included it
above): 1,009. January 2002: 1,673 (well, Rob [10] and Spring [11] are also
being excluded; yeah, that's why I had over 100 visits from myself). February
2002: 1,909. March 2002: 1,328 (oooh).
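Excluding yourself is just one more condition on the same sketch; 192.0.2.1
below is a documentation placeholder, not my actual address:

# additionally drop requests from my own (stand-in) address
awk '$9 == 200 && $7 !~ /\.(gif|jpg|png|css|rss)$/ && $1 != "192.0.2.1"' access.log | grep -c Mozilla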
Now, I can pretty much guarantee that those figures up there represent unique
visits. A more interesting question would be the number of repeat (or
regular) visits. This is tougher, since most ISP (Internet Service
Provider)s dish out dynamic IP (Internet Protocol) addresses whenever someone
reconnects, but I don't think it's impossible to get a ballpark figure.
Taking the previous results, pulling out the unique IP addresses, and sorting
by visit count (a sketch of the pipeline appears after the March figures
below), I see for January 2002 (cutting the list off below 5 visits per
address):
197 65.116.145.137
92 208.55.254.110
63 211.101.236.143
45 63.173.190.16
30 64.129.118.129
30 24.52.32.105
20 211.101.236.79
19 24.4.252.167
15 208.60.8.130
11 65.58.147.103
11 164.77.128.210
10 64.131.172.241
9 66.157.2.122
8 207.49.213.174
7 65.2.207.3
6 65.207.131.180
6 64.39.15.82
6 12.39.254.108
5 64.30.224.30
5 63.251.87.214
5 212.250.100.122
5 209.214.129.196
5 208.1.105.145
5 204.89.226.65
5 196.41.28.43
5 130.74.211.63
And so on. Easily a dozen repeat readers, but there are probably more. One
way to find them would be to count visits per block of IP addresses (most
users would fall into a range of addresses, usually along a classic C
block), and by doing that, I get:
197 65.116.145
92 208.55.254
83 211.101.236
45 63.173.190
30 64.129.118
30 24.52.32
21 64.12.96
21 24.4.252
18 208.60.8
11 65.58.147
11 208.1.105
11 196.41.28
11 164.77.128
10 64.131.172
10 152.163.189
9 66.157.2
9 216.10.44
8 65.207.131
8 212.250.100
8 207.49.213
8 205.188.209
8 205.188.208
7 65.2.207
7 195.163.203
6 64.39.15
6 12.39.254
5 64.30.224
5 63.251.87
5 24.51.202
5 209.214.129
5 204.89.226
5 130.74.211
5 129.74.252
Hmmm … not much difference, really. Rerunning for last month (March), I get:
86 208.55.254
56 65.214.36
22 64.131.172
21 12.164.38
20 218.45.21
20 211.101.236
19 66.176.111
17 207.200.84
17 129.74.186
16 66.27.11
16 151.203.23
14 64.12.96
12 64.129.118
12 216.76.209
11 196.41.28
10 66.27.63
10 64.30.224
10 64.231.69
10 194.222.60
9 64.152.245
9 24.51.200
9 205.188.209
8 64.90.36
8 152.163.188
7 64.158.38
7 208.60.8
6 205.188.208
6 199.44.53
6 199.174.3
6 199.174.0
6 195.163.203
6 194.82.103
5 64.34.18
5 64.210.248
5 24.71.223
5 24.52.32
5 207.158.192
5 207.114.208
5 204.89.226
5 151.100.29
5 128.242.197
5 12.225.219
Oh, let's call it two dozen repeat readers and be done with it.
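Both counts above are, at heart, the classic sort/uniq idiom; a sketch, with
the same Apache-log assumption as before (the client address in field 1):

# visits per individual address, most frequent first
awk '{print $1}' access.log | sort | uniq -c | sort -rn

# visits per class C block: strip the last octet, then count
awk '{print $1}' access.log | cut -d. -f1-3 | sort | uniq -c | sort -rn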
This is an interesting topic, and I would still like to know how Sean Tevis
[12] calculates his stats.
[1] http://scalzi.com/w020408.htm
[2] http://www.buzzmachine.com/2002_04_01_crisis_archive.html#75207564
[3] http://mattwelch.com/old/2002_04_07_archive.html#75208430
[4] http://instapundit.blogspot.com/?/2002_04_07_instapundit_archive.html#75200115
[5] http://www.thismodernworld.com/weblog/archive/2002_04_07_bloggera.html#75206563
[6] http://www.tevis.net/
[7] https://boston.conman.org/
[8] https://boston.conman.org/bostondiaries.rss
[9] https://boston.conman.org/bdstyle.css
[10] http://www.tragic-smurfs.com/
[11] http://www.springdew.com/
[12] http://www.tevis.net/
Email author at
[email protected]