/~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~\

Title: Scraper bots can suck a duck...
Date: April 8, 2025
Mood: Annoyed

|~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~|

 I've been fighting with AI scrapers hitting my web server relentlessly for
over a week now. It started just before the end of March, when I found out I
was near my 1 /Terabyte/ bandwidth limit. Checking the logs, I found that my
hidden story archive blog was getting hit by hundreds upon hundreds of requests
every hour.
Rate limiting helped for about a minute, but then it started getting worse.

 I then found a way to send back /444/ errors (no response) by user agent, and
implemented that. It's been a /massive/ help on its own, especially when
Fail2Ban can't keep up, but it's still not perfect. I'm still getting malware
bots trying to get in based on recent vulnerabilities in WordPress and Laravel,
a number of git credential scanners, and "cybersecurity research" scanners that
I didn't exactly give permission to hit my VPS.
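
 For the curious, here's roughly the shape of it. This is a minimal sketch
assuming nginx (444 is nginx's "close the connection without responding" code),
and the bot names in the regex are just placeholders for whatever shows up in
your own logs:

    # in the http {} block: flag suspicious user agents
    map $http_user_agent $blocked_agent {
        default                                  0;
        ~*(GPTBot|ClaudeBot|CCBot|Bytespider)    1;
    }

    # in the server {} block: drop flagged requests without any response
    if ($blocked_agent) {
        return 444;
    }

 The nice part is that nothing gets sent back at all, so the bots burn a
connection and get zero content to chew on.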

 In the end, all that's left is my homepage, and no other subdomains are up.
I had to take down the archive blog, as well as the new self-hosted blog I was
planning on starting. They're not hitting my gopher server at least, but it's
gotten to the point that I'm becoming paranoid about /that/ as well.

 I know LLMs have legitimate uses, and the companies involved in their abuse
need to be taken out back and given a good dose of "Operation: OLD YELLER".
But scraper bots that don't follow robots.txt or ai.txt rules, or announce what
they are in their user agents, should be litigated into financial ruin and
their management given prison sentences under the US CFAA (or other countries'
equivalents). They should also be forced to pay for any bandwidth
overages caused by their bots, because hitting pages literally 20-30 times an
hour--if not even more--is beyond unreasonable. Outside of news orgs and social
media, most sites are lucky to update /once a day/.

 This is making me want to look into NNCP stuff[0] even more. I already have
Offpunk installed on my Linux laptop[^0], and /that/ made me want to consider
setting up gemini on the VPS as well for the first time. I'm not really into
the idea of gemini-for-the-sake-of-gemini, but I do support more options for
the smolnet.

 We'll see how things go, though. I'm still spending most of my time offline
and trying to de-stress, and not really checking RSS feeds or {ph,gem}logs all
that much. I'll watch videos with my partner, or watch them to learn skills
that I understand better visually than by reading about them, but even then, I
use `yt-dlp` to save the video for later viewing.
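
 Nothing fancy there, just something along the lines of (URL and output
template being placeholders):

    yt-dlp -o "%(title)s.%(ext)s" "https://example.com/some-video"

 dropped into a watch-later folder, so it's there whenever I actually feel
like watching.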

 It's nice being disconnected from the world. It honestly is.

\~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~/

[0]: https://nncp.mirrors.quux.org/

|~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~|

[^0]: Debian is surprisingly holding stable so far, even after a couple of
     updates I dreaded would take my system down. Just wish I could turn off
     system-managed Python so I can use it the way /I/ want to.