Let's look at some bots that aren't the MJ12Bot

* * * * *

I think it's time I stop blogging about work after my previous post. Work is
getting a tad too depressing to think about, and my cynical side is saying
that it won't matter where I go; it'd be more or less the same, with a higher
probability of forced Microsoft Windows use. So instead of that depressing
topic, let's take a look at something much lighter: the current state of
Internet robots crawling my various sites!

Two weeks later and there are still bots attempting to follow endless
redirections [1]. I thought maybe I could attempt to figure out a contact,
but alas, they're coming from all over the place (and yes, I'm finally naming
IP (Internet Protocol) addresses):

Table: Top 20 Gemini based bots caught in redirection Hell

IP address          # requests
------------------------------
18.134.208.136             933
18.132.248.127             850
3.8.92.131                 817
18.169.194.52              745
3.8.210.87                 728
13.40.97.54                715
18.170.56.106              713
3.8.134.65                 682
35.176.22.93               681
18.130.231.183             681
13.40.67.85                667
13.40.137.233              666
18.132.46.166              659
3.8.24.209                 641
18.170.107.207             637
13.40.155.157              634
35.178.170.215             577
18.130.216.34              573
13.40.145.207              572
35.179.76.79               564
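
A tally like the one above takes only a few lines of Python. This is a
minimal sketch, not the tooling actually used here: it assumes each log line
starts with the client's IP address followed by whitespace (a hypothetical
format), and `top_clients` is just an illustrative name.

```python
from collections import Counter

def top_clients(lines, n=20):
    """Return the n most frequent client addresses as (address, count) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:                  # skip blank lines
            counts[fields[0]] += 1  # assumption: first field is the client IP
    return counts.most_common(n)
```

Feeding it an open log file (`top_clients(open(path))`, with `path` being
wherever the log lives) would give the rows in descending order of requests.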

They're all pretty much from Amazon Web Services [2], so who knows who is
running these bots. Just blocking them is too easy a solution. At this point,
I'd like to do something to get their attention (as if thousands of the links
they are crawling suddenly being listed as “gone” isn't enough of a clue). I
don't necessarily mind bots crawling my sites, unless they're doing stupid
things. I shall have to think on this a bit more.
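
One way to confirm that an address belongs to Amazon is to compare it against
the list of prefixes AWS publishes as `ip-ranges.json`. The sketch below
assumes that document has already been downloaded and parsed into a Python
dict, and that its IPv4 entries sit in a `prefixes` list with `ip_prefix`,
`region` and `service` keys; `aws_matches` is a made-up name for illustration.

```python
import ipaddress

def aws_matches(address, ip_ranges):
    """Return (prefix, region, service) for each AWS prefix containing address.

    ip_ranges is the parsed JSON document; only the IPv4 "prefixes" list
    is consulted in this sketch.
    """
    addr = ipaddress.ip_address(address)
    matches = []
    for entry in ip_ranges.get("prefixes", []):
        network = ipaddress.ip_network(entry["ip_prefix"])
        if addr in network:
            matches.append((entry["ip_prefix"], entry["region"], entry["service"]))
    return matches
```

An empty result means the address is not in any listed prefix. Even a match
only names the region and service, not whoever rented the machine.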

I also had high hopes that I could stop empty requests to my Gemini server
(which aren't allowed at all by the specification [3]) by returning a
non-standard response code with the text “Not a gopher server” but alas, that
is still happening. Does nobody bother checking the results of their bots
running? I guess not.
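
For context, a Gemini request is a single URL terminated by CRLF, and a
response header is a two-digit status, a space, some text, and CRLF. An empty
line, on the other hand, is a perfectly good gopher request for the top-level
menu, which is presumably where these requests come from. The check can be
sketched as below. Note the swap: this sketch uses the specification's status
59 ("bad request") instead of the non-standard code mentioned above, and
`reject_if_empty` is a hypothetical helper, not the server's real code.

```python
from typing import Optional

CRLF = b"\r\n"

def reject_if_empty(raw: bytes) -> Optional[bytes]:
    """Return a Gemini error response for an empty request, else None."""
    # Only the first line matters; strip stray whitespace such as a bare LF.
    line = raw.split(CRLF, 1)[0].strip()
    if not line:
        # Assumption: status 59 stands in for the non-standard code.
        return b"59 Not a gopher server" + CRLF
    return None
```

A return value of `None` means the request line is non-empty and can go on
to normal URL parsing.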

And speaking of gopher, it's better there than Gemini. Yes, there are a few
agents that are attempting to use TLS (Transport Layer Security) [4], but
fortunately, they cache previous failures so it's not every request. There
are a few bots out there trying to exploit RDP (Remote Desktop Protocol) (not
much I can do about those) and a few that are confused into thinking my
gopher site is actually my Gemini site sans TLS (What?). But I can live with
155 failed gopher requests out of 10,423 over the past month.
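
The three kinds of stray traffic are easy to tell apart from the first few
bytes of a connection. The following is only a sketch under stated
assumptions, not how the server here does it: a TLS handshake record starts
with byte 0x16 followed by major version 0x03, an RDP connection request
starts with a TPKT header (0x03 0x00), and a confused Gemini client sends a
`gemini://` URL in the clear. The name `classify_request` is illustrative.

```python
def classify_request(first_bytes: bytes) -> str:
    """Guess what kind of client sent the first bytes of a connection."""
    if first_bytes[:2] == b"\x16\x03":
        return "tls"     # TLS handshake record: type 0x16, major version 3
    if first_bytes[:2] == b"\x03\x00":
        return "rdp"     # TPKT header (version 3, reserved 0) used by RDP
    if first_bytes.lower().startswith(b"gemini://"):
        return "gemini"  # a Gemini URL sent without TLS
    return "gopher"      # anything else is treated as a gopher selector
```

Counting the non-"gopher" results over a month of connections would give the
sort of failure tally quoted above.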

And while I'm checking bots, I can't forget the web crawlers. And not much
has changed on that front since July 2019 [5] except that MJ12Bot [6] has
kept their promise never to crawl my site again [7]. The Knowledge AI (which
I cannot find any information on) is still the number one agent, with 68,000
requests in Debtember 2021, followed by 21,000 requests from Amazonbot [8].
And it seems that the bots in general are making fewer requests to
nonexistent pages (I mean, back in June 2019, The Knowledge AI made 170 bad
requests; last month, 1).

So, with the exception of bots stuck in redirection Hell in Gemini, things on
the crawler front are looking pretty good.

[1] gopher://gopher.conman.org/0Phlog:2021/12/29.1
[2] https://aws.amazon.com/
[3] https://gemini.circumlunar.space/docs/specification.gmi
[4] gopher://gopher.conman.org/0Phlog:2021/09/28.1
[5] gopher://gopher.conman.org/0Phlog:2019/07/11.1
[6] https://mj12bot.com/
[7] gopher://gopher.conman.org/0Phlog:2019/07/12.1
[8] https://developer.amazon.com/support/amazonbot

Email author at [email protected]