# Poisoning LLM web scrapers

2025-11-01
Last edit: 2025-11-01

---------------------
All servers available on the Internet are exposed to attacks, scraping, open port scans, and other unwanted traffic.

I consulted the logs of my web server, managed by NGINX, using a visual generated with the `goaccess` tool. The majority of HTTP requests come from unknown web browsers and crawlers. Most of these crawlers have questionable ethics, and they also pollute the server logs.
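For reference, a typical invocation looks something like the following; the log path and output location are placeholders, not necessarily the ones I used:

```sh
# Generate a static HTML report from the NGINX access log.
goaccess /var/log/nginx/access.log --log-format=COMBINED -o /var/www/html/report.html
```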
## An aggressive solution

To counter these robots, I researched existing solutions. I then found a [Mastodon post](https://tldr.nettime.org/@asrg/113867412641585520) that lists tools designed to counter these crawlers. The common goal of these tools is to sabotage AIs.

Among them, I chose an aggressive solution called iocaine. It is a web server that generates a page containing garbage text and, within it, links leading to yet more garbage pages.
## Filtering non-human visitors

The aim is to redirect robots to iocaine. To do this, we first need to be able to identify them. For this, I took inspiration from existing configurations.

I'd like to thank [Agate Blue](https://agate.blue) for writing an extremely detailed article about iocaine and its implementation with NGINX. I was inspired by some of the configuration snippets presented there.

I ended up with a reverse proxy capable of redirecting non-human visitors to a local instance of iocaine. Note that in my case NGINX runs on the host system; it is not containerized, only iocaine runs in a container.
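As an illustration, here is a minimal sketch of that idea, not my exact configuration: a `map` on the `User-Agent` header flags known crawlers, and flagged requests are proxied to iocaine on `127.0.0.1:42069` (the port used in the deployment below). The server name, paths, and the list of user agents are placeholder assumptions.

```nginx
# Must live in the http {} context: flag requests whose User-Agent
# matches a known AI crawler (the list is illustrative, not exhaustive).
map $http_user_agent $is_ai_crawler {
    default                                                0;
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot)"  1;
}

server {
    listen 80;
    server_name example.org;           # placeholder
    root /var/www/example.org;         # placeholder

    location / {
        # Forward the original Host header so iocaine can label its metrics.
        proxy_set_header Host $host;

        # Flagged crawlers get iocaine's garbage maze; everyone else
        # gets the real content.
        if ($is_ai_crawler) {
            proxy_pass http://127.0.0.1:42069;
        }
        try_files $uri $uri/ =404;
    }
}
```

A filter based only on the announced `User-Agent` catches crawlers that identify themselves honestly; bots that impersonate regular browsers need additional heuristics.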
## My iocaine configuration

I kept the basic configuration and only customized the texts used to generate the web pages. For this, I used the book [Uncle Tom's Cabin](https://archive.org/details/uncletomscabinta0000stow/mode/2up).
## My iocaine deployment

To deploy the solution, I chose to use Docker Compose with the following configuration:
```yaml
services:
  iocaine:
    image: git.madhouse-project.org/iocaine/iocaine:2
    container_name: iocaine
    ports:
      - "127.0.0.1:42069:42069"
      - "127.0.0.1:42070:42070"
    volumes:
      - "./data:/data"
    environment:
      - IOCAINE__SERVER__BIND="0.0.0.0:42069"
      - IOCAINE__SOURCES__WORDS="/data/words.txt"
      - IOCAINE__SOURCES__MARKOV=["/data/text1.txt", "/data/text2.txt", "/data/text3.txt"]
      - IOCAINE__METRICS__ENABLE=true
      - IOCAINE__METRICS__BIND="0.0.0.0:42070"
      - IOCAINE__METRICS__LABELS=["Host","UserAgent"]
    restart: unless-stopped
    networks:
      - monitoring_prometheus_net

networks:
  monitoring_prometheus_net:
    external: true
```
The `monitoring_prometheus_net` network is used by Prometheus and Prometheus exporters; it lets Prometheus scrape the iocaine Prometheus exporter behind port `42070`. In this way, we can view iocaine's metrics via a dashboard in Grafana.
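As a sketch, the corresponding scrape job could look like the excerpt below, assuming Prometheus runs on the same Docker network and can reach the container by its service name; the job name and interval are arbitrary choices, not taken from my setup:

```yaml
# prometheus.yml (excerpt): scrape the iocaine metrics endpoint
scrape_configs:
  - job_name: "iocaine"
    scrape_interval: 30s
    static_configs:
      - targets: ["iocaine:42070"]  # container name on monitoring_prometheus_net
```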
The value of `IOCAINE__SOURCES__WORDS` corresponds to a file containing the word list used by the generator.
## The robots are stuck

The service is now deployed and the NGINX configuration has been updated accordingly.

I've left iocaine running for around 20 hours already, and here's a Grafana dashboard showing some of the collected metrics.

We can see that there have already been 128,644 requests made by robots that have fallen into the trap. To get an idea of what this represents, imagine clicking 79 times on a word in…
## Testing with a web browser

To test with a web browser, I used LibreWolf and then installed an extension to spoof the User-Agent header, before visiting https://theobori.cafe. Here's what it looks like to be tricked.
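The same check can be done from the command line with curl by sending a spoofed User-Agent; the crawler string below is only an example of something the filter might match:

```sh
# Pretend to be an AI crawler: the response should be iocaine's garbage page.
curl -A "GPTBot" https://theobori.cafe/

# With a regular browser User-Agent, the real site should come back instead.
curl -A "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" https://theobori.cafe/
```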
## Conclusion

I've managed to prevent LLM scrapers from stealing my content without my permission.