* * * * *
Notes on blocking the MJ12Bot
The MJ12Bot [1] is the first robot listed in Wikipedia's [2] robots.txt
[3] file, which I find amusing for obvious reasons [4]. In the Hacker News
comments [5] there's a thread [6] specifically about the MJ12Bot, and I
replied to a comment about blocking it [7]. Blocking by address isn't that
easy, because it's a distributed bot that used 136 unique IP (Internet
Protocol) addresses in the last month alone. Because of that comment, I
decided I should expand on some of those numbers here.
The first table counts the addresses seen from January through June 2019, to
show they're not all from a single netblock. The address format “A.B.C.D”
will represent a unique IP address, like 172.16.15.2; “A.B.C” will represent
the IP addresses 172.16.15.0 to 172.16.15.255; “A.B” will represent the range
172.16.0.0 to 172.16.255.255 and finally “A” will represent the range
172.0.0.0 to 172.255.255.255.
Table: Number of distinct IP addresses used by MJ12Bot in 2019 when hitting my site
Address format          number
------------------------------
A.B.C.D                    312
A.B.C                      256
A.B                         86
A                           53
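As a rough sketch of how counts like these could be derived (not necessarily
how the numbers above were produced), the following Python groups a list of
IPv4 addresses by their first one, two, three and four octets and counts the
distinct values at each depth. The function name prefix_counts and the sample
addresses are made up for illustration; real input would come from whatever
log records the bot's requests.

```python
def prefix_counts(addresses):
    """Count distinct IPv4 prefixes at each octet depth.

    Returns a dict keyed by the address formats used in the tables:
    "A.B.C.D", "A.B.C", "A.B" and "A".
    """
    formats = ["A.B.C.D", "A.B.C", "A.B", "A"]
    seen = {fmt: set() for fmt in formats}
    for addr in addresses:
        octets = addr.strip().split(".")
        if len(octets) != 4:
            continue  # skip anything that is not a dotted-quad IPv4 address
        # depth 4 is the full address, depth 1 is just the first octet
        for depth, fmt in zip((4, 3, 2, 1), formats):
            seen[fmt].add(".".join(octets[:depth]))
    return {fmt: len(seen[fmt]) for fmt in formats}


# Hypothetical sample data, not taken from any real log.
sample = [
    "172.16.15.2",
    "172.16.15.9",
    "172.16.20.1",
    "172.31.0.5",
    "10.0.0.1",
    "172.16.15.2",  # duplicate, counted once
]

counts = prefix_counts(sample)
for fmt, n in counts.items():
    print(f"{fmt:<10}{n}")
```

For the six sample lines this gives 5 distinct addresses, 4 distinct “A.B.C”
prefixes, 3 distinct “A.B” prefixes and 2 distinct “A” prefixes.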
Next are the unique addresses from all of 2018 used by MJ12Bot:
Table: Number of distinct IP addresses used by MJ12Bot in 2018 when hitting my site
Address format          number
------------------------------
A.B.C.D                    474
A.B.C                      370
A.B                        125
A                           66
This wide distribution could easily explain why Wikipedia found that it
ignored rate limits. Each individual node of MJ12Bot probably followed the
rate limit, but it's a hard problem to coordinate among … what? 500 machines
around the world?
It seems the best bet is to ban MJ12Bot via robots.txt:
-----[ data ]-----
User-agent: MJ12bot
Disallow: /
-----[ END OF LINE ]-----
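One way to sanity-check a rule like this before deploying it is to feed it to
Python's standard urllib.robotparser module and ask whether a given agent may
fetch a given path. This is only a sketch: the helper name is_allowed, the
path /index.html and the agent name SomeOtherBot are made up for illustration,
and it only shows how a parser reads the rule. Whether a crawler actually
honors robots.txt is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# The same two-line rule shown above.
RULES = """\
User-agent: MJ12bot
Disallow: /
"""


def is_allowed(rules, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# MJ12bot is shut out of everything; agents not named in the file are unaffected.
print(is_allowed(RULES, "MJ12bot", "/index.html"))
print(is_allowed(RULES, "SomeOtherBot", "/index.html"))
```

The check passes the bare agent name rather than a full User-Agent header,
since this parser matches on the name before the first slash.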
While I haven't added MJ12Bot to my own robots.txt [8] file, it hasn't hit my
site since they removed me from their crawl list [9], so it appears it can be
tamed.
[1] https://mj12bot.com/
[2] https://www.wikipedia.org/
[3] https://en.wikipedia.org/robots.txt
[4] gopher://gopher.conman.org/0Phlog:2019/07/09-12
[5] https://news.ycombinator.com/item?id=20453189
[6] https://news.ycombinator.com/item?id=20453542
[7] https://news.ycombinator.com/item?id=20455003
[8] http://boston.conman.org/robots.txt
[9] gopher://gopher.conman.org/0Phlog:2019/07/12.1