Poisoning LLM web scrapers
2025-11-01
Last edit: 2025-11-01
---------------------
All servers available on the Internet are exposed to attacks, scraping, open po…
I consulted the logs of my web server managed by NGINX using a visual generated…
goaccess tool. The majority of HTTP requests come from unknown web browsers and crawler…
Most of these crawlers have questionable ethics, and they also pollute the serv…
## An aggressive solution
To counter these robots, I researched existing solutions. I then found a [Mastodon post](https://tldr.nettime.org/@asrg/113867412641585520) that lis…
instance. The common goal of these tools is to sabotage AIs.
Among them, I chose an aggressive solution called iocaine. It is a web server that generates a page containing garbage text, and within …
## Filtering non-human visitors
The aim is to redirect robots to iocaine. To do this, we first need to be able to identify them. For this, I took inspi…
I'd like to thank [Agate Blue](https://agate.blue) for writing an extremely det…
iocaine and its implementation with NGINX. I was inspired by some of his configuration…
I ended up with a reverse proxy capable of redirecting non-human visitors to a …
iocaine. Note that in my case, NGINX runs on the host system; it is not containerized…
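As a rough illustration of the routing decision described in this section, here is a minimal Python sketch. It assumes robots are recognized by their `User-Agent` header, which is one common approach but not necessarily the exact rule set used here. The pattern list, the `SITE_UPSTREAM` address and the `pick_upstream` function are hypothetical; only port `42069` comes from the deployment described later in this post. In practice the same decision is made inside the NGINX configuration, not in Python.

```python
import re

# Hypothetical, deliberately short list of crawler User-Agent tokens.
# A real deployment would rely on a maintained list instead.
BOT_PATTERN = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider", re.IGNORECASE)

SITE_UPSTREAM = "http://127.0.0.1:8080"      # assumed address of the real site
IOCAINE_UPSTREAM = "http://127.0.0.1:42069"  # iocaine port from the deployment section


def pick_upstream(user_agent: str) -> str:
    """Return the backend that should answer a request with this User-Agent."""
    if BOT_PATTERN.search(user_agent):
        return IOCAINE_UPSTREAM
    return SITE_UPSTREAM
```

Matching on the `User-Agent` alone only catches crawlers that announce themselves; anything pretending to be a regular browser would pass through such a filter.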
## My iocaine configuration
I kept the basic configuration to customize the web pages generated by the tool…
Uncle Tom's cabin](https://archive.org/details/uncletomscabinta0000stow/mode/2u…
.
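To give an intuition for how a source text such as a public-domain book can be turned into plausible-looking garbage, here is a minimal word-level Markov chain sketch in Python. It is not iocaine's implementation; the `build_chain` and `generate` functions are made up for illustration, and the only link to the actual tool is that its configuration takes text files as Markov sources.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def build_chain(text: str) -> Dict[str, List[str]]:
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain: Dict[str, List[str]] = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain: Dict[str, List[str]], length: int, seed: Optional[int] = None) -> str:
    """Walk the chain to produce `length` words of plausible nonsense."""
    if not chain or length <= 0:
        return ""
    rng = random.Random(seed)
    starts = sorted(chain)  # sorted so a given seed always yields the same text
    word = rng.choice(starts)
    output = [word]
    while len(output) < length:
        followers = chain.get(word)
        if followers:
            word = rng.choice(followers)
        else:
            # Dead end (a word that only ends the corpus): restart elsewhere.
            word = rng.choice(starts)
        output.append(word)
    return " ".join(output)
```

Feeding it a whole book instead of a single sentence gives text that looks locally coherent but carries no meaning, which is what makes it useless as training data.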
## My iocaine deployment
To deploy the solution I chose to use Docker Compose with a configuration base …
```yaml
services:
  iocaine:
    image: git.madhouse-project.org/iocaine/iocaine:2
    container_name: iocaine
    ports:
      - "127.0.0.1:42069:42069"
      - "127.0.0.1:42070:42070"
    volumes:
      - "./data:/data"
    environment:
      - IOCAINE__SERVER__BIND="0.0.0.0:42069"
      - IOCAINE__SOURCES__WORDS="/data/words.txt"
      - IOCAINE__SOURCES__MARKOV=["/data/text1.txt", "/data/text2.txt", "/data/…
      - IOCAINE__METRICS__ENABLE=true
      - IOCAINE__METRICS__BIND="0.0.0.0:42070"
      - IOCAINE__METRICS__LABELS=["Host","UserAgent"]
    restart: unless-stopped
    networks:
      - monitoring_prometheus_net

networks:
  monitoring_prometheus_net:
    external: true
```
The `monitoring_prometheus_net` network is used by Prometheus and Prometheus ex…
iocaine Prometheus exporter behind port `42070`. In this way, we can view the metrics …
iocaine via a dashboard in Grafana.
The value of `IOCAINE__SOURCES__WORDS` corresponds to a file which is the list …
## The robots are stuck
The service is now deployed and the NGINX configuration has been updated for th…
I've left iocaine running for around 20 hours already, and here's a Grafana dashboard showing so…
We can see that there have already been 128,644 requests made by robots that ha…
To get an idea of what this represents, imagine clicking 79 times on a word in …
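To put the request count in perspective, the average rate can be worked out from the two figures given above (128,644 requests in roughly 20 hours). The helper name `request_rates` is only illustrative, and since "around 20 hours" is approximate, so is the result.

```python
from typing import Tuple


def request_rates(total_requests: int, hours: float) -> Tuple[float, float]:
    """Return the average number of requests per hour and per second."""
    per_hour = total_requests / hours
    per_second = per_hour / 3600
    return per_hour, per_second


per_hour, per_second = request_rates(128_644, 20)
print(f"{per_hour:.0f} requests/hour, {per_second:.2f} requests/second")
# prints: 6432 requests/hour, 1.79 requests/second
```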
## Testing with a web browser
To test with a web browser, I used LibreWolf and then installed an extension to…
https://theobori.cafe. Here's what it looks like to be tricked.
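The same check could be scripted rather than done by hand. The sketch below uses only Python's standard library and assumes the filter keys on the `User-Agent` header; the `GPTBot/1.0` value in the usage comment and both function names are illustrative, and there is no guarantee that this exact string matches the rules in place.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Prepare a GET request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_preview(url: str, user_agent: str, limit: int = 300) -> str:
    """Send the request and return the first `limit` bytes of the body as text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read(limit).decode("utf-8", errors="replace")


# Example usage (performs a real HTTP request, so it is left commented out):
# print(fetch_preview("https://theobori.cafe", "GPTBot/1.0"))
```

Comparing the output for a crawler-like `User-Agent` with the output for a regular browser string should show whether the request was routed to the generated pages.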
## Conclusion
I've managed to prevent LLM scrapers from stealing my content without my permi…