* * * * *

          A different approach to blocking bad webbots by IP address

Obligatory Sidebar Links

* Complaints
  * Please stop externalizing your costs directly into my face [1]
    (Hacker News [2])
  * LLM (Large Language Model) crawlers continue to DDoS SourceHut [3]
    (Lobsters [4])
  * FOSS infrastructure is under attack by AI companies [5] (Hacker News [6])

* Solutions
  * Anubis: self hostable scraper defense software [7] (Lobsters [8] and
    Hacker News [9])
  * Trapping misbehaving bots in an AI Labyrinth [10] (Lobsters [11])
  * GitHub - sequentialread/pow-bot-deterrent: A proof-of-work based bot
    deterrent. Lightweight, self-hosted and copyleft licensed [12]
    (Lobsters [13])



Complaints about web crawlers run by LLM-based companies, as well as some
specific solutions for blocking them, have been making the rounds in the past
few days. I was
curious to see just how many were hitting my web site, so I ran a few queries
over the log files. To ensure consistent results, I decided to query the log
file for last month:

Table: Quick summary of results for February 2025
total requests  468439
unique IP (Internet Protocol)s  24654

Table: Top 10 requests per IP
IP      Requests
------------------------------
4.231.104.62    43242
198.100.155.33  26650
66.55.200.246   9057
74.80.208.170   8631
74.80.208.59    8407
216.244.66.239  5998
4.227.36.126    5832
20.171.207.130  5817
8.29.198.26     4946
8.29.198.25     4807

(Note: I'm not concerned about protecting any privacy here. Given the number
of results, there is no way these are individuals. These are all companies
hitting my site, and if companies are mining their data for my information,
I'm going to do the same to them. So there.)
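
For the curious, here's a rough sketch of the kind of tally involved. My
actual queries ran over my own log setup, so this version just assumes an
Apache-style access log (the filename is hypothetical) where the client IP is
the first field of each line:

  #!/usr/bin/env python3
  # A rough sketch of the per-IP tally.  It assumes an Apache-style
  # access log (the filename here is hypothetical) where the first
  # field of each line is the client IP address.
  from collections import Counter

  counts = Counter()
  with open("access.log") as log:
      for line in log:
          fields = line.split(maxsplit=1)
          if fields:                         # skip any blank lines
              counts[fields[0]] += 1

  print("total requests", sum(counts.values()))
  print("unique IPs    ", len(counts))
  for ip, n in counts.most_common(10):       # top 10 requests per IP
      print(ip, n, sep="\t")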

But it became apparent that it's hard to determine which requests are coming
from a single entity. A company can clearly employ a large pool of IP
addresses to crawl the web, and it's hard to figure out which IPs are under
the control of which company.

Or is it?

An idea suddenly hit me, a stray thought from the days when I was wearing a
network admin hat [14]: I recalled that BGP (Border Gateway Protocol) routing
basically knows the boundaries of networks, as it's based on policy routing
via ASN (Autonomous System Number)s. Could I map IP addresses to ASNs? A
quick search and I found my answer: yes! [15] Within a few minutes, I had
converted a list of 24,654 unique IP addresses to 1,490 unique networks. I
was then able to rework my initial query to include the ASN (or rather, the
human-readable version instead of just the number):

Table: Requests per IP/ASN
IP      Requests        AS (Autonomous System)
------------------------------
4.231.104.62    43242   MICROSOFT-CORP-MSN-AS-BLOCK, US
198.100.155.33  26650   OVH, FR
66.55.200.246   9057    BIDDEFORD1, US
74.80.208.170   8631    CSTL, US
74.80.208.59    8407    CSTL, US
216.244.66.239  5998    WOW, US
4.227.36.126    5832    MICROSOFT-CORP-MSN-AS-BLOCK, US
20.171.207.130  5817    MICROSOFT-CORP-MSN-AS-BLOCK, US
8.29.198.26     4946    FEEDLY-DEVHD, US
8.29.198.25     4807    FEEDLY-DEVHD, US
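
The conversion itself is easy to script. Here's a rough sketch of a bulk
lookup against the service's whois interface [15], assuming the documented
begin/end bulk format and the default “AS | IP | AS Name” reply lines:

  # A sketch of a bulk IP-to-ASN lookup against the whois interface
  # of the mapping service [15].  Wrap the list of IPs in "begin" and
  # "end"; each reply line after the one-line header should look like
  # "AS | IP | AS Name".
  import socket

  def bulk_asn_lookup(ips):
      query = "begin\n" + "\n".join(ips) + "\nend\n"
      with socket.create_connection(("whois.cymru.com", 43)) as s:
          s.sendall(query.encode())
          reply = b""
          while chunk := s.recv(4096):
              reply += chunk
      ip_to_as = {}
      for line in reply.decode().splitlines()[1:]:   # skip the header
          fields = [f.strip() for f in line.split("|")]
          if len(fields) == 3:                       # "AS | IP | AS Name"
              asn, ip, asname = fields
              ip_to_as[ip] = (asn, asname)
      return ip_to_as

  print(bulk_asn_lookup(["4.231.104.62", "8.29.198.26"]))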

Now, I was curious as to how they identified themselves, so I reran the query
to include the user agent string. The top eight identified themselves
consistently:

Table: Requests per Agent
Agent   Requests
------------------------------
Go-http-client/2.0      43236
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/132.0.0.0 Safari/537.36   26650
WF search/Nutch-1.12    9057
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)  8631
Mozilla/5.0 (compatible; ImagesiftBot; +imagesift.com)  8407
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; [email protected])        5998
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)  5832
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)  5817

The last two, however, had changing user agent strings:

Table: Identifiers for 8.29.198.26
Agent   Requests
------------------------------
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; )  1667
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; )   1419
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; )       938
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; )      811
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; )       94
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; )      17

Table: Identifiers for 8.29.198.25
Agent   Requests
------------------------------
Feedly/1.0 (+https://feedly.com/poller.html; 16 subscribers; )  1579
Feedly/1.0 (+https://feedly.com/poller.html; 6 subscribers; )   1481
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 6 subscribers; )       905
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 16 subscribers; )      741
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 8 subscribers; )       90
Feedly/1.0 (+http://www.feedly.com/fetcher.html; 37 subscribers; )      11

I'm not sure what the difference is between polling and fetching (checking
the URLs shows two identical pages, differing only in “Poller” and
“Fetcher”). But looking deeper into that is for another post [16].

The next query I ran was to see how many IPs (that hit my site in February)
map to a particular ASN, and the top 10 are:

Table: IPs per AS
AS      Count
------------------------------
ALIBABA-CN-NET Alibaba US Technology Co., Ltd., CN      4034
AMAZON-02, US   1733
HWCLOUDS-AS-AP HUAWEI CLOUDS, HK        1527
GOOGLE-CLOUD-PLATFORM, US       996
COMCAST-7922, US        895
AMAZON-AES, US  719
TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN      635
MICROSOFT-CORP-MSN-AS-BLOCK, US 615
AS-VULTR, US    599
ATT-INTERNET4, US       472

So Alibaba US crawled my site from 4,034 different IP addresses. I haven't
done the query to figure out how many requests each ASN made, but it should
be straightforward to replace the IP address with the ASN to get a better
count of which company is crawling my site the hardest.
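
It would be something along these lines, reusing the per-IP counts and the
IP-to-ASN map from the earlier sketches:

  # A sketch of that follow-up query: fold the per-IP request counts
  # into per-ASN totals, given the Counter and the IP-to-ASN map
  # built by the earlier sketches.
  from collections import Counter

  def requests_per_asn(counts, ip_to_as):
      totals = Counter()
      for ip, n in counts.items():
          _asn, asname = ip_to_as.get(ip, ("NA", "unknown"))
          totals[asname] += n
      return totals

  # e.g. requests_per_asn(counts, bulk_asn_lookup(counts)).most_common(10)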

And now I'm thinking: instead of ad-hoc banning of single IP addresses, or
blocking huge swaths of IP addresses (like 47.0.0.0/8), might it not be
better to block per ASN? The IP-to-ASN mapping service I found makes it quite
easy to get the ASN of an IP address (and to map the ASN to a human-readable
name). Instead of, for example, blocking 101.32.0.0/16, 119.28.0.0/16,
43.128.0.0/14, 43.153.0.0/16 and 49.51.0.0/16 (which isn't an exhaustive list
by any means), just block IPs belonging to ASN 132203, otherwise known as
“TENCENT-NET-AP-CN Tencent Building, Kejizhongyi Avenue, CN.”

I don't know how effective that idea is, but the IP-to-ASN site I found does
offer the information via DNS (Domain Name System), so it shouldn't be that
hard to do.
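
As a rough sketch of what that lookup might look like (this one assumes the
third-party dnspython library, and the origin.asn.cymru.com and
asn.cymru.com zones the service documents):

  # A sketch of per-ASN blocking via the DNS interface of the mapping
  # service [15], using the third-party dnspython library.  An IPv4
  # address is looked up by reversing its octets under
  # origin.asn.cymru.com; the human-readable name comes from a second
  # lookup under asn.cymru.com.
  import dns.resolver

  def asn_of(ip):
      rev = ".".join(reversed(ip.split(".")))
      answer = dns.resolver.resolve(rev + ".origin.asn.cymru.com", "TXT")
      # reply format: "ASN | BGP prefix | CC | registry | allocated"
      first = answer[0].to_text().strip('"').split(" | ")[0]
      return first.split()[0]   # a multi-origin prefix may list several ASNs

  def name_of(asn):
      answer = dns.resolver.resolve("AS" + asn + ".asn.cymru.com", "TXT")
      # reply format: "ASN | CC | registry | allocated | AS name"
      return answer[0].to_text().strip('"').split(" | ")[-1]

  BLOCKED_ASNS = { "132203" }   # the Tencent example from above

  def should_block(ip):
      return asn_of(ip) in BLOCKED_ASNS

  print(should_block("43.153.0.1"))   # an address from a range listed above

One would want to cache the results instead of doing a DNS lookup per
request, but that's a detail.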

[1] https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
[2] https://news.ycombinator.com/item?id=43397361
[3] https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
[4] https://lobste.rs/s/dmuad3/mitigating_sourcehut_s_partial_outage
[5] https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
[6] https://news.ycombinator.com/item?id=43422413
[7] https://anubis.techaro.lol/
[8] https://lobste.rs/s/sknzdg/anubis_self_hostable_scraper_defense
[9] https://news.ycombinator.com/item?id=43427679
[10] https://blog.cloudflare.com/ai-labyrinth/
[11] https://lobste.rs/s/ybyno6/trapping_misbehaving_bots_ai_labyrinth
[12] https://github.com/sequentialread/pow-bot-deterrent
[13] https://lobste.rs/s/fvvcmv/pow_bot_deterrent_proof_work_based_bot
[14] gopher://gopher.conman.org/0Phlog:2006/05/06.1
[15] https://www.team-cymru.com/ip-asn-mapping
[16] gopher://gopher.conman.org/0Phlog:2025/03/21.4
---

Discussions about this page

Lazy Reading for 2025/04/13 – DragonFly BSD Digest
 https://www.dragonflydigest.com/2025/04/13/lazy-reading-for-2025-04-13/

Email author at [email protected]