* * * * *

                   Some more observations about the MJ12Bot

I received another reply from MJ12Bot [1] about their badly written bot [2]
and it just said the person responsible for handling enquiries was out of the
office for the day and I should expect a response tomorrow. We shall see. In
the meantime, I decided to check some of the other bots hitting my site and
see how well they fare, request-wise. I'm using the logs from last month
for this, so these results are for 30 days of traffic.
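
As a rough sketch of how per-bot request counts like the ones below could be
pulled out of a month of logs: the post does not say what the log format is,
so this assumes Apache "combined" format, and the name top_agents is made up
for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: Apache "combined" log format. The real log layout is not
# given in the post, so treat this pattern as a placeholder.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def top_agents(lines, n=10):
    """Return (agent, requests, percent of all requests) for the n busiest agents."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:  # silently skip lines that do not fit the assumed format
            counts[m.group("agent")] += 1
    total = sum(counts.values())
    return [
        (agent, hits, round(100 * hits / total))
        for agent, hits in counts.most_common(n)
    ]
```

Feeding it an open log file, as in top_agents(open("access.log")), would give
the rows of a table like the one that follows. The file name is hypothetical.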

Table: Top 10 bots hitting The Boston Diaries
requests  percentage  user agent
------------------------------------------------------------------------------
   46334          19  The Knowledge AI
   38097          16  Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
   17130           7  Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
   15928           7  Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
   12358           5  Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
    8929           4  Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
    8908           4  Gigabot
    7872           3  Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
    6942           3  Barkrowler/0.9 (+http://www.exensa.com/crawl)
    4737           2  istellabot/t.1.13
------------------------------------------------------------------------------
  167235          70  Total (out of 239641)

So let's see some results:

Table: Results of bot queries
Bot                  200     %   301    %  304    %  400    %  403    %   404     %  410    %  500    %  Total      %
---------------------------------------------------------------------------------------------------------------------
The Knowledge AI   42676  92.1  3352  7.2    0  0.0  127  0.3    4  0.0   170   0.4    5  0.0    0  0.0  46334  100.0
SemrushBot/3~bl    36088  94.7  1873  4.9    0  0.0  110  0.3    0  0.0    21   0.1    5  0.0    0  0.0  38097  100.0
BLEXBot/1.0        16633  97.1   208  1.2  124  0.7  114  0.7    0  0.0    46   0.3    5  0.0    0  0.0  17130  100.0
AhrefsBot/6.1      15840  99.4    78  0.5    0  0.0    4  0.0    0  0.0     5   0.0    0  0.0    1  0.0  15928   99.9
bingbot/2.0        12304  99.6    35  0.3    0  0.0    6  0.0    0  0.0     3   0.0    5  0.0    0  0.0  12353   99.9
MegaIndex.ru/2.0    8412  94.2   456  5.1    0  0.0   24  0.3    0  0.0    36   0.4    1  0.0    0  0.0   8929  100.0
Gigabot             8428  94.6   448  5.0    0  0.0   23  0.3    0  0.0     7   0.1    2  0.0    0  0.0   8908  100.0
MJ12bot/v1.4.8      2015  25.6   175  2.2    0  0.0    2  0.0    0  0.0  5680  72.2    0  0.0    0  0.0   7872  100.0
Barkrowler/0.9      6604  95.1   300  4.3    0  0.0   10  0.1    0  0.0    28   0.4    0  0.0    0  0.0   6942   99.9
istellabot/t.1.13   4705  99.3    28  0.6    0  0.0    0  0.0    0  0.0     0   0.0    0  0.0    4  0.1   4737  100.0
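
The percentage columns are just each status count divided by the bot's total.
A small sketch of that arithmetic, using the MJ12bot counts from the table
(the function name status_shares is invented for this example):

```python
def status_shares(counts):
    """Map each HTTP status to its share of the total, as a percentage to 1 decimal."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: round(100 * hits / total, 1) for status, hits in counts.items()}

# Non-zero MJ12bot/v1.4.8 counts, copied from the table above.
mj12 = {200: 2015, 301: 175, 400: 2, 404: 5680}
print(status_shares(mj12))  # {200: 25.6, 301: 2.2, 400: 0.0, 404: 72.2}
```

Since each share is rounded on its own, a row can add up to 99.9 rather than
100.0.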

Percentage-wise, of the top 10 bots hitting my blog (and in fact, these are
the top ten clients hitting my blog), MJ12Bot is just bad, with 72% of its
requests being bad. It's hard to say what the second worst one is, but I'll
have to give it to “The Knowledge AI” bot (and my search-fu is failing me in
finding anything about this one). Percentage-wise, it's about on par with the
others, but some of its requests are also rather odd:

* /%22
* /%22https:/www.brevardnc.org/business-directory/5474/rockys-soda-shop/%22
* /%22http:/brevardnc.org/%22
* /%22https:/www.greenvillesc.gov/%22
* /%22https:/en.m.wikipedia.org/wiki/Caesars_Head_State_Park/%22
* /%22https:/www.transylvaniacounty.org/town-of-rosman/%22

It appears to be a problem similar to MJ12Bot's, but one that doesn't happen
nearly as often.
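
For what it's worth, %22 is the percent-encoding of a double quote, so once
decoded these paths start with a quote character followed by what looks like
a full URL. One guess (an assumption, not something the logs confirm) is that
the bot picked up a quoted link target and treated it as a relative path. A
minimal sketch for spotting such requests, with made-up helper names:

```python
from urllib.parse import unquote

def decode_path(path):
    """Undo percent-encoding so the literal characters in the request show."""
    return unquote(path)

def starts_with_quote(path):
    """True when the decoded path begins with /" (a quote right after the slash)."""
    return decode_path(path).startswith('/"')

# Two of the odd requests listed above, plus an ordinary path for contrast.
for path in ("/%22", "/%22http:/brevardnc.org/%22", "/2013/01/02/menamena"):
    print(decode_path(path), starts_with_quote(path))
```

The first two print True and the last prints False, so a check like this could
separate the quote-mangled requests from ordinary 404s.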

Now, this isn't to say I don't have some legitimate “not found” (404)
results. I did come across some actual valid 404 results on my own blog:

* /2004/08/18/[email protected]
* /2012/08/10/HREF
* /2013/01/02/menamena
* /2013/02/01/HREF
* /2014/05/04/HREF
* /2015/02/10/B000FBJCJE
* /2015/07/10/mailtp:[email protected]

Some are typos, some are placeholders for links I forgot to add. And those I
can fix. I just wish someone would fix MJ12Bot. Not because it's bogging down
my site with unwanted traffic, but because it's just bad at what it does.

[1] https://mj12bot.com/
[2] gopher://gopher.conman.org/0Phlog:2019/07/09.1

Email author at [email protected]