* * * * *

         How can a “commercial grade” web robot be so badly written?

Alex Schroeder was checking the status of web requests [1], and it made me
wonder about the stats on my own server. One quick script later and I had
some numbers:

Table: Status of requests for boston.conman.org so far this month
Status  result  requests        percent
------------------------------
200     OKAY    53457   82.83
206     PARTIAL_CONTENT 12      0.02
301     MOVE_PERM       2421    3.75
304     NOT_MODIFIED    6185    9.58
400     BAD_REQUEST     101     0.16
401     UNAUTHORIZED    147     0.23
404     NOT_FOUND       2000    3.10
405     METHOD_NOT_ALLOWED      41      0.06
410     GONE    5       0.01
500     INTERNAL_ERROR  173     0.27

------------------------------
Total   -       64542   100.01

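
The “one quick script” for a tally like this can be sketched in Python, assuming the server writes Apache-style combined-format logs (the sample lines and log layout here are assumptions, not my actual logs):

```python
from collections import Counter

def status_counts(lines):
    """Tally HTTP status codes from Apache combined-format log lines."""
    counts = Counter()
    for line in lines:
        # Fields: ... "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
        parts = line.split('"')
        if len(parts) < 3:
            continue
        after = parts[2].split()  # the status code follows the request string
        if after and after[0].isdigit():
            counts[int(after[0])] += 1
    return counts

# Works on an open file just as well: status_counts(open("access.log"))
sample = [
    '203.0.113.7 - - [09/Jul/2019:17:49:02 -0400] "GET / HTTP/1.1" 200 5120 "-" "bot"',
    '203.0.113.7 - - [09/Jul/2019:17:49:03 -0400] "GET /nope HTTP/1.1" 404 0 "-" "bot"',
]
counts = status_counts(sample)
total = sum(counts.values())
for status in sorted(counts):
    print(f"{status}\t{counts[status]}\t{100 * counts[status] / total:.2f}")
```

The percentages are rounded to two places, which is why a column like the one above can sum to 100.01 instead of 100.
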
I'll have to check the INTERNAL_ERRORs and look into those 12 PARTIAL_CONTENT
responses, but the rest seem okay. I was curious to see what was being
requested that I didn't have, when I noticed that the MJ12Bot [2] was
producing the majority of the NOT_FOUND responses.

Yes, sadly, most of the traffic around here is from bots [3]. Lots and lots
of bots.

Table: Top agents requesting pages
requests        percentage      user agent
------------------------------
16952   26      The Knowledge AI
9159    14      Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
5633    9       Mozilla/5.0 (compatible; VelenPublicWebCrawler/1.0; +https://velen.io)
4272    7       Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
4046    6       Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
3170    5       Mozilla/5.0 (compatible; Go-http-client/1.1; [email protected])
2146    3       Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
1197    2       Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, [email protected])
1146    2       istellabot/t.1.13

------------------------------
47721   74      Total (out of 64542)

But it's been that way for years now. C'est la vie.
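
Grouping by agent is the same sort of one-liner over the same assumed combined-format logs; in that format the user agent is the last quoted field (again, the sample lines below are made up for illustration):

```python
from collections import Counter

def agent_counts(lines):
    """Tally requests per User-Agent from Apache combined-format log lines."""
    counts = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) >= 6:
            counts[parts[5]] += 1  # sixth quote-split field is the user agent
    return counts

sample = [
    '203.0.113.7 - - [t] "GET / HTTP/1.1" 200 1 "-" "The Knowledge AI"',
    '203.0.113.8 - - [t] "GET /a HTTP/1.1" 404 0 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    '203.0.113.8 - - [t] "GET /b HTTP/1.1" 404 0 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
]
for agent, n in agent_counts(sample).most_common():
    print(n, agent)
```
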

So I started looking closer at MJ12Bot and the requests it was generating,
and … they were odd:

* //%22http://www.thomasedison.com//%22
* //%22https://github.com/spc476/NaNoGenMo-2018/blob/master/run.lua/%22
* //%22/2018/08/24.1/%22
* //%22https://kottke.org/19/04/life-sized-lego-electronics/%22

And so on. As they describe it:

> Why do you keep crawling 404 or 301 pages?
>
> We have a long memory and want to ensure that temporary errors, website
> down pages or other temporary changes to sites do not cause irreparable
> changes to your site profile when they shouldn't. Also if there are still
> links to these pages they will continue to be found and followed. Google
> have published a statement since they are also asked this question, their
> reason is of course the same as ours and their answer can be found here:
> Google 404 policy. [4]
>

But those requests? They point to a real problem with their bot. Looking over
the requests, I see that they're all pages I've linked to, but for whatever
reason, their bot is requesting those remote pages from my server. Worse yet,
they're quoted! The %22 parts are an encoded double quote. It's as if their
bot saw “<A HREF="http://www.thomasedison.com">” and treated the whole
attribute, quotes and all, as a link on my server, escaping the quotes when
making the request!
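
You can see the decoding for yourself with Python's standard urllib (the path here is one of the actual requests from the list above):

```python
from urllib.parse import unquote

# %22 is the URL percent-encoding of the double-quote character.
path = '//%22http://www.thomasedison.com//%22'
print(unquote(path))  # //"http://www.thomasedison.com//"
```
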

Pssst! MJ12Bot! Quotes are optional! Both “<A
HREF="http://www.thomasedison.com">” and “<A
HREF=http://www.thomasedison.com>” are equivalent!

Sigh.

Annoyed, I sent them the following email:

> From: Sean Conner <[email protected]>
> To: [email protected]
> Subject: Your robot is making bogus requests to my webserver
> Date: Tue, 9 Jul 2019 17:49:02 -0400
>
> I've read your page on the mj12 bot, and I don't necessarily mind the 404s
> your bot generates, but I think there's a problem with your bot making
> totally bogus requests, such as:
>
> //%22https://www.youtube.com/watch?v=LnxSTShwDdQ%5C%22
> //%22https://www.zaxbys.com//%22
> //%22/2003/11/%22
> //%22gopher://auzymoto.net/0/glog/post0011/%22
> //%22https://github.com/spc476/NaNoGenMo-2018/blob/master/valley.l/%22
>
> I'm not a proxy server, so requesting a URL will not work, and even if I
> was a proxy server, the request itself is malformed so badly that I have to
> conclude your programmers are incompetent and don't care.
>
> Could you at the very least fix your robot so it makes proper requests?
>

I then received a canned reply saying that they have, in fact, received my
email and are looking into it.

Nice.

But I did a bit more investigation, and the results aren't pretty:

Table: Requests and results for MJ12Bot
Status  result  number  percentage
------------------------------
200     OKAY    505     23.34
301     MOVE_PERM       4       0.18
404     NOT_FOUND       1655    76.48

------------------------------
Total   -       2164    100.00

So not only are they responsible for 83% of the bad requests I've seen, but
nearly 77% of the requests they make are bad!

Just amazing programmers they have!

[1] https://alexschroeder.ch/wiki/2019-07-09_Web_Requests
[2] http://mj12bot.com/
[3] https://en.wikipedia.org/wiki/Internet_bot
[4] https://www.seroundtable.com/google-404-memory-16616.html

Email author at [email protected]