# 20200714

## Motivation

[Erowid Recruiter](https://twitter.com/erowidrecruiter) is a fun
Twitter account powered by Markov chains.  How hard could it be to
recreate this?

## Erowid

A fascinating web 1.0 site with trip reports for seemingly every known
substance.

- Experiences list: https://www.erowid.org/experiences/exp.cgi?OldSort=PDA_RA&ShowViews=0&Cellar=0&Start=100&Max=1000
- Experience (HTML): https://www.erowid.org/experiences/exp.php?ID=005513
- Experience (LaTeX): https://www.erowid.org/experiences/exp_pdf.php?ID=093696&format=latex
- Blocked: https://blackhole.erowid.org/blocked.shtml

Insights so far over the last two weeks:

- There's no convenient option to download all reports and the website
 discourages downloading and analyzing them.  However, the experience
 search offers a customizable limit; by increasing it to the maximum,
 you can obtain all valid IDs.  This list is not consecutive, most
 likely due to the review process.
- It's possible to get blocked; play it safe and do no more than 5000
 requests per day (see the sketch after this list).
- HTML reports are unusually broken: key characters (angle brackets,
 ampersands, quotes) are not consistently escaped as HTML entities,
 there is little semantic formatting, copious comments suggest manual
 editing work, and closing tags are often omitted.
- LaTeX reports are mildly broken: quotes aren't consistently escaped
 and some HTML comments are halfway preserved in the export.
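
Given the maxed-out search results and the request budget, fetching
boils down to a slow loop over IDs.  Here's a minimal sketch in
Gauche, not the code actually used; it assumes a build with TLS
support (`:secure #t`), zero-padded six-digit IDs as in the URLs
above, and an existing `reports/` directory:

```scheme
(use rfc.http)

(define (fetch-report id)
  ;; Fetch one LaTeX report and store it if the request succeeded.
  (receive (status headers body)
      (http-get "www.erowid.org"
                (format "/experiences/exp_pdf.php?ID=~6,'0d&format=latex" id)
                :secure #t)
    (when (string=? status "200")
      (with-output-to-file (format "reports/~6,'0d.tex" id)
        (lambda () (display body))))))

(define (fetch-all ids)
  (for-each (lambda (id)
              (fetch-report id)
              (sys-sleep 20))   ; ~4320 requests/day, below the 5000 cap
            ids))
```

One request every 20 seconds works out to roughly 4300 requests per
day, comfortably below the self-imposed limit.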

I've contacted the Erowid Recruiter author and they revealed they
handpicked their favorite trip reports.  I don't really want to spend
that much time.  At first I tried downloading random HTML reports and
recorded in a database whether each access succeeded or triggered an
error.

I later learned that there's a LaTeX export and a full list of
experiences, with a subset marked as outstanding using one, two or
three stars.  I've downloaded them all, fixed some HTML comment
fuck-ups and wrote Scheme code to extract the report from the LaTeX
template and convert the LaTeX syntax to plaintext.  My first attempt
at doing Emacs-style text processing was comically verbose and not
successful, so I resorted to parsing a sequence of tokens, accounting
for LaTeX insanity, converting the tokens to TeX commands, then
interpreting those specially to turn them into minimally marked-up
text.
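
To illustrate the token-based approach, here's a minimal sketch with
names of my own choosing, not the actual code: LaTeX source is split
into command, brace and text tokens, and the "interpretation" here
simply drops everything but the text, whereas the real thing has to
handle groups, arguments and escapes:

```scheme
(use srfi-1)   ; span, filter

(define (tokenize str)
  ;; Produce (command . "name"), (brace . #\{), (brace . #\}) and
  ;; (text . "...") tokens from a LaTeX string.
  (let loop ((chars (string->list str)) (tokens '()))
    (cond
     ((null? chars) (reverse tokens))
     ((char=? (car chars) #\\)
      ;; a command: backslash followed by a run of letters
      (let-values (((name rest) (span char-alphabetic? (cdr chars))))
        (loop rest (cons (cons 'command (list->string name)) tokens))))
     ((memv (car chars) '(#\{ #\}))
      (loop (cdr chars) (cons (cons 'brace (car chars)) tokens)))
     (else
      ;; ordinary text up to the next special character
      (let-values (((text rest)
                    (span (lambda (c) (not (memv c '(#\\ #\{ #\})))) chars)))
        (loop rest (cons (cons 'text (list->string text)) tokens)))))))

(define (tokens->plaintext tokens)
  ;; Keep only the text tokens; a real interpreter would handle
  ;; commands like \emph or \par instead of discarding them.
  (apply string-append
         (map cdr (filter (lambda (t) (eq? (car t) 'text)) tokens))))
```

For example, `(tokens->plaintext (tokenize "I took \\emph{far} too
much"))` yields `"I took far too much"`.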

## Recruiter

I haven't found any good public data for this kind of email.  I bet
the real thing draws from personal email.  My plan is to instead use
generic spam from [this data set](http://untroubled.org/spam/), hence
the name "madads".  We'll see how this goes and how much cleaning is
required.

# 20200720

## Recruiter

Initially I thought I could just download all archives, extract them
and look at the data, but I underestimated just how many files there
are.  Take the 2011 archive for example: a file clocking in at almost
100M takes surprisingly long to decompress.  At 10% into
decompression and 300M of text files, I canceled it.  Instead I
grabbed the archives from 1998 to 2003, which extracted to a few
thousand files at a far more manageable 472M.

Some massaging is definitely required before further processing.
`file` recognizes most files as emails, except for some starting with
a "From <id>" line.  There is a later "From: <id>" line which is
clearly an email header, so I looked up some `sed` magic to delete the
first line if it has this pattern: `sed --in-place=bak '1{/From /d;}'
*/*.txt`.

The emails themselves have structure and can be parsed.  My plan is to
extract plain text whenever possible, falling back to making sense
of HTML if necessary.  [hato](https://github.com/ashinn/hato) seems to
be a good codebase to study for that.
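
As a rough sketch of the plain-text side, one could peek at the
headers with Gauche's `rfc.822` module and keep only messages that
advertise a plain-text body.  This is my assumption of a workable
approach, not something I've settled on:

```scheme
(use rfc.822)
(use srfi-13)   ; string-prefix?

(define (plaintext-message? file)
  ;; Read just the header block and check the Content-Type.
  ;; Messages without one are assumed to be plain text.
  (call-with-input-file file
    (lambda (port)
      (let* ((headers (rfc822-read-headers port))
             (type (rfc822-header-ref headers "content-type" "text/plain")))
        (string-prefix? "text/plain" (string-downcase type))))))
```

Multipart and HTML-only messages would then go through the HTML
fallback instead.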

# 20200803

## Recruiter

I even considered using Gauche's built-in email parser, but then a
friend reminded me that [mblaze](https://git.vuxu.org/mblaze/) is a
thing, a suite of tools for wrangling maildir-style mailboxes.  As
usual I immediately ran into a bug with `mshow` and got it fixed for
the 1.0 release.  Using some shell oneliners I extracted 33k plaintext
messages from 99k files, a far better success rate than expected.
That leaves text generation.

## Markov

This turned out easier than expected.  You take n-grams of a text,
split each n-gram into prefix (all but the last word) and suffix (the
last word) and track the seen combinations of prefix and suffix in a
hash table mapping each prefix to its seen suffixes.  To generate
text, pick a random prefix from the hash table, look up its suffixes,
pick a random one and combine the chosen prefix and suffix into a new
n-gram whose last n-1 words form the next prefix, then repeat the
process.  This can end prematurely if you end up with a prefix with no
suffixes (for example at the end of the text), but can otherwise be
repeated as often as needed to generate the necessary amount of text.
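
A minimal sketch of this in Gauche (names and details are mine, not
the actual madads code):

```scheme
(use srfi-1)    ; take
(use srfi-13)   ; string-join
(use srfi-27)   ; random-integer

(define (train words n)
  ;; Map each prefix (the first n-1 words of an n-gram) to the list
  ;; of suffixes seen directly after it.
  (let ((table (make-hash-table 'equal?)))
    (let loop ((ws words))
      (when (>= (length ws) n)
        (hash-table-push! table (take ws (- n 1)) (list-ref ws (- n 1)))
        (loop (cdr ws))))
    table))

(define (random-elt lst)
  (list-ref lst (random-integer (length lst))))

(define (generate table max-words)
  ;; Start from a random prefix, then repeatedly append a random
  ;; suffix and shift the prefix window forward by one word.
  (let ((start (random-elt (hash-table-keys table))))
    (let loop ((prefix start) (acc (reverse start)))
      (let ((suffixes (hash-table-get table prefix '())))
        (if (or (null? suffixes) (>= (length acc) max-words))
            (string-join (reverse acc) " ")
            (let ((suffix (random-elt suffixes)))
              (loop (append (cdr prefix) (list suffix))
                    (cons suffix acc))))))))
```

Given a list of whitespace-split words, `(generate (train words 3)
50)` produces roughly 50 words of trigram babble.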

Many tweaks are possible to this basic idea:

- Rather than just splitting up text on whitespace into words, do
 extra clean-up to deal with quotes and other funny characters.
- Save the generated hash table to disk and load it up again to avoid
 expensive recomputation.
- Find better starting points, for example by looking for sentence
 starters (capitalized words).
- Find better stopping points, for example by looking for sentence
 enders (punctuation).
- Combine several text files and find some way to judge which ones go
 particularly well together (perhaps looking for overlap between
 prefixes is a suitable metric?  see the sketch after this list).
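
For the last point, one cheap guess at such a metric (pure
speculation, nothing I've validated) would be the Jaccard index of the
prefix sets of two tables built with the `train` sketch above:

```scheme
(use srfi-1)   ; lset-intersection, lset-union

(define (prefix-overlap table-a table-b)
  ;; Share of prefixes the two corpora have in common: 0.0 means
  ;; disjoint prefix sets, 1.0 means identical ones.
  (let ((a (hash-table-keys table-a))
        (b (hash-table-keys table-b)))
    (exact->inexact
     (/ (length (lset-intersection equal? a b))
        (max 1 (length (lset-union equal? a b)))))))
```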