_____ __________  ______
 / ___// ____/ __ \/ ____/______  ______ ___  ____
 \__ \/ __/ / / / / / __/ ___/ / / / __ `__ \/ __ \
___/ / /___/ /_/ / /_/ / /  / /_/ / / / / / / /_/ /
/____/_____/\____/\____/_/   \__,_/_/ /_/ /_/ .___/
                                          /_/
╭⋟─────────────────────────────────────────────────────────────────────╮
|                                                                      |
|  TITLE: LLMs and Gopher                                              |
|                                                                      |
|  DATE: May 14, 2025                                                  |
|                                                                      |
|  AUTHOR: [email protected]                                          |
|                                                                      |
╰─────────────────────────────────────────────────────────────────────⋞╯

I was thinking about LLMs yesterday, because it's impossible not to if
you're in my line of work. I was working on content for a client that
likes really long, detailed articles, and they prefer to provide
outlines listing all of the things they'd like me to cover. I can tell
that the outlines are AI generated. That's not a bad thing in itself;
organization is one thing that LLMs do very well.

This particular outline, however, contained a phrase that I know
originated from me.

I see my own phrases in AI content fairly often, but this is the first
time I've ever seen one in an assignment from a client. It was a pretty
strange feeling.

One of the reasons I've enjoyed my experiences with gopher so much is
that it feels like a "safe" place on the Internet not yet polluted by
AI. Like so many other creators, I'm pretty peeved about my work being
scraped for someone else's gain.

It occurred to me, though, that there's actually nothing stopping AI
companies from scraping gopher for content. Gopher is a simple
protocol. Ask ChatGPT how to create a gophermap, and it'll give you the
right answer.
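
To give a sense of how simple: below is a minimal sketch in Python,
assuming the classic menu format from RFC 1436 (a type character and
display string, then selector, host and port separated by tabs). The
function names, the sample line and the host gopher.example.org are
made up for illustration, and the UTF-8 encoding is an assumption too.

```python
from typing import NamedTuple


class MenuItem(NamedTuple):
    item_type: str   # "0" = text file, "1" = menu, "i" = info line, ...
    display: str     # text shown to the reader
    selector: str    # string sent back to the server to fetch the item
    host: str
    port: int


def build_request(selector: str) -> bytes:
    # A gopher request is just the selector followed by CRLF.
    # UTF-8 is an assumption here; older servers may expect Latin-1.
    return selector.encode("utf-8") + b"\r\n"


def parse_menu_line(line: str) -> MenuItem:
    # Assumes a well-formed line with at least four tab-separated
    # fields; anything after the port is ignored.
    line = line.rstrip("\r\n")
    item_type, rest = line[0], line[1:]
    display, selector, host, port = rest.split("\t")[:4]
    return MenuItem(item_type, display, selector, host, int(port))


# Hypothetical menu line, not taken from a real server.
sample = "0About this hole\t/about.txt\tgopher.example.org\t70"
item = parse_menu_line(sample)
```

That is more or less the whole protocol from a client's point of view:
send a selector, read back either a menu in this format or the
document itself.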

There's vastly more content on the web, of course, but AI companies are
becoming increasingly desperate for more human-generated content with
which to feed their models. That's why Microsoft, Apple and other Big
Tech companies are making their AI training opt-out rather
than opt-in. It's becoming harder to find authentic human content that
hasn't been scraped already.

The web is more gummed up with AI-generated content than a lot of
people realize. Even former bastions of authentic human-written content
like Reddit are now full of AI text. LLMs are notoriously bad at
telling the difference between AI- and human-generated text -- and if
you keep training a model on AI-generated text, its output eventually
degrades, a failure known as model collapse (garbage in, garbage out).

Does the volume of text on gopher compare to what's on the web?
Obviously not. It's a vast trove of authentic human content, though,
and scraping it would be trivial.

I wonder if it's already happened.


╰─────────────────────────────────────────────────────────────────────⋞╯