* * * * *

                             Waist deep in emails

I'm having a lot of fun writing the email indexing program [1], despite
having to code around a few broken mbox [2] files. I've also been surprised
at what I've found so far (not in the “oh, I forgot about that email!” way
but more in the “What the—?” way).

At first, I assumed that no email header would be longer than 64K (kilobytes)
[3], but no, that isn't big enough. I have an email with a header that is
81,162 bytes in size, and it has enough email addresses (in the Cc: header)
to populate a small mass-mailing list (and yes, it's spam).

I'm also tracking unique sets of headers and unique message bodies (via the
SHA1 [4] hashing function). There are 118 messages with the same body but
with different headers, and the amusing bit is that the emails in question
weren't spam! They're from a mailing list I used to run years ago where one
of the members apparently changed his email address, and for a period of time
each message that went out caused his automated system to send an update to
the list.

And of course, he didn't unsubscribe his old email address.

Heh.

The tracking was done to keep from indexing duplicate emails (since my
testing corpus is 1,600 mbox files, some of which may be backups—I don't know
which ones though, which is part of the reason I'm writing this program) so
in the end I should end up with a set of unique headers.
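
The tracking described above can be sketched roughly as follows. This is a
minimal illustration in Python, not the actual program: the function names
are made up for the example, and it assumes each message is already in memory
as raw bytes with LF line endings, with the headers ending at the first blank
line.

```python
import hashlib
from collections import defaultdict


def split_message(raw):
    """Split a raw message at the first blank line into (headers, body)."""
    headers, _, body = raw.partition(b"\n\n")
    return headers, body


def sha1_hex(data):
    """Upper-case hex SHA1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest().upper()


def find_header_collisions(messages):
    """Return {header hash: set of body hashes} for every header hash
    that was seen with more than one distinct body."""
    bodies_by_header = defaultdict(set)
    for raw in messages:
        headers, body = split_message(raw)
        bodies_by_header[sha1_hex(headers)].add(sha1_hex(body))
    return {h: b for h, b in bodies_by_header.items() if len(b) > 1}
```

Exact duplicates (same header hash, same body hash) collapse into a single
entry, so anything this returns is a set of headers that showed up with more
than one body.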

I got down to 16 emails with duplicate headers, but unique bodies.

That scared me.

A small digression: at this point, the program pulls each email out of the
mbox file, and writes the headers into one file (the original, plus a few I
add during processing, like the SHA1 hash results) and the body of the email
into another file (my dad likes to send me photos and videos in email, so the
bodies of those messages tend to be rather large, and I'm concentrating on
the headers at the moment). I currently end up with about 50M (megabytes) of
headers and almost a gigabyte-worth of email bodies. Now, continuing on …
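
For illustration, the split could look something like the sketch below. The
directory names (header_raw, body), the nine-digit file names, and the
X-SHA1-Header and X-SHA1-Body lines match what appears in the transcripts
that follow; everything else (the function name, hashing the header block
without its trailing blank line, LF line endings) is an assumption made for
the example.

```python
import hashlib
from pathlib import Path


def store_message(raw, index, root):
    """Write one message as header_raw/NNNNNNNNN and body/NNNNNNNNN."""
    # Headers end at the first blank line (an assumption of this sketch).
    headers, _, body = raw.partition(b"\n\n")
    header_hash = hashlib.sha1(headers).hexdigest().upper()
    body_hash = hashlib.sha1(body).hexdigest().upper()

    name = f"{index:09d}"
    header_dir = Path(root) / "header_raw"
    body_dir = Path(root) / "body"
    header_dir.mkdir(parents=True, exist_ok=True)
    body_dir.mkdir(parents=True, exist_ok=True)

    # Original headers first, then the headers added during processing.
    extra = (
        f"X-SHA1-Header: {header_hash}\n"
        f"X-SHA1-Body: {body_hash}\n"
    ).encode("ascii")
    (header_dir / name).write_bytes(headers + b"\n" + extra)
    (body_dir / name).write_bytes(body)
```

Keeping the bodies in their own files means the header files stay small
enough to grep through quickly.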

I pick one of the duplicate hashes, scan for it, and then check the messages:

> >find header_raw/ | xargs grep FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
> ./000008069:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
> ./000026823:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
> >grep X-SHA1 header_raw/000008069 header_raw/000026823
> header_raw/000008069:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
> header_raw/000008069:X-SHA1-Body: 5C823DD92D3DCDC5AD43953D72B1D60017A134D6
> header_raw/000026823:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
> header_raw/000026823:X-SHA1-Body: 85584F0167666BAA506E41A3D9ED927227F0FEF0
> >
>

(Note: I can't just grep PATTERN * because there are simply too many files,
over 45,000, and the expanded list exceeds the command line limit. That's
why I use find and xargs, which hands the file names to grep in batches.)
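
The same scan can be done without any command line at all, which sidesteps
the limit entirely. A minimal Python sketch (the function name is made up,
and it assumes the header files are small enough to read whole):

```python
import os


def files_containing(directory, needle):
    """Return the sorted paths of regular files in `directory` whose
    contents include the byte string `needle`."""
    matches = []
    # scandir yields entries one at a time, so the number of files in the
    # directory never has to fit into an argument list.
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                with open(entry.path, "rb") as fh:
                    if needle in fh.read():
                        matches.append(entry.path)
    return sorted(matches)
```

Calling files_containing("header_raw", b"FFCC3E0B...") with the full hash
would list the files that carry it.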

Okay, same headers, different body. Just what is going on here? I check the
bodies:

> >more body/000008069
> Status: RO
>
>
> Accept All Major Credit Cards!!!
>
> Don't be fooled by the copycats. We are one of the original company's
> offering merchant credit card services for all kinds of business's. [sic]
>
>

This isn't looking good—it looks like my header parsing code is missing a
header. What about the other email?

> >more body/000026823
> Status: RO
> Content-Length: 2815
> Lines: 104
>
>
> Accept All Major Credit Cards!!!
>
> Don't be fooled by the copycats. We are one of the original company's
> offering merchant credit card services for all kinds of business's. [sic]
>

Okay, check the mbox files to see what's messing up the header parsing. What
I find actually reassures me:

> From [email protected]  Wed Dec 12 14:13:00 2001
> Return-Path: <[email protected]>
> Received: from gig.armigeron.com ([204.29.162.10])
>         by conman.org (8.8.7/8.8.7) with ESMTP id OAA06543
>         for <[email protected]>; Wed, 12 Dec 2001 14:12:59 -0500
> Received: from mercury.aibusiness.net (emi.net [208.10.128.2]
>       (may be forged))
>         by gig.armigeron.com (8.11.0/8.11.0) with ESMTP id fBCJ8Aa31356
>         for <[email protected]>; Wed, 12 Dec 2001 14:08:10 -0500
> Received: from domainmail.ionet.net (domainmail.ionet.net [206.41.128.18])
>         by mercury.aibusiness.net (8.9.3/8.9.3) with ESMTP id NAA19835
>         for <[email protected]>; Wed, 12 Dec 2001 13:52:26 -0500
> Received: from kqyfqkpby.motor.com (r145h250.afo.net [209.149.145.250]
>       (may be forged))
>       by domainmail.ionet.net (8.9.1a/8.7.3) with SMTP id MAA02841;
>       Wed, 12 Dec 2001 12:38:11 -0600 (CST)
> Date: Wed, 12 Dec 2001 12:38:11 -0600 (CST)
> Message-Id: <[email protected]>
> From: "griffin" <[email protected]>
> Subject: No fee! Accept Credit Cards for the Holidays!      (bbjlm)
> Reply-To: [email protected]
> MIME-Version: 1.0
> X-Mailer: Mozilla 4.7 [en]C-CCK-MCD NSCPCD47  (Win98; I)
> Content-Type: text/plain
>
> Status: RO
>
>
> Accept All Major Credit Cards!!!
>

It wasn't my code (thank God! The parsing code is getting a bit convoluted at
this point), but some clueless spammer trying to add additional headers in
the body of the message (the other one was the same). So I'll assume the
other 14 “duplicates” are similar in nature—spammers trying to be clever.
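
One way to check that assumption without reading every message by hand is to
look for header-shaped lines at the very top of each body. The sketch below
is a heuristic, not part of the indexing program: the function names are
hypothetical, and a legitimate body that happens to open with something like
"Note: ..." would be flagged too.

```python
import re

# A header-shaped line: printable ASCII other than ':' for the name,
# then a colon, then a space, a tab, or the end of the line.
HEADER_LINE = re.compile(rb"^[!-9;-~]+:(?:[ \t]|$)")


def leading_pseudo_headers(body):
    """Return the header-shaped lines at the very start of a body, or an
    empty list if the body does not open with a block of such lines."""
    found = []
    for line in body.split(b"\n"):
        if not line.strip():
            break  # a blank line ends the candidate block
        if not HEADER_LINE.match(line):
            return []  # ordinary text, so not a pseudo-header block
        found.append(line)
    return found


def strip_pseudo_headers(body):
    """Drop a leading pseudo-header block and the blank lines after it."""
    lines = leading_pseudo_headers(body)
    if not lines:
        return body
    rest = body.split(b"\n")[len(lines):]
    return b"\n".join(rest).lstrip(b"\n")
```

Run against the two bodies shown earlier, the first yields one pseudo-header
(Status) and the second yields three (Status, Content-Length, Lines). Once
those are stripped, the remaining text could be hashed again to see whether
the two messages really are the same spam.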

And now, back to coding …

[1] gopher://gopher.conman.org/0Phlog:2009/06/01.1
[2] http://en.wikipedia.org/wiki/Mbox
[3] http://imranontech.com/2007/02/20/did-bill-gates-say-the-640k-line/
[4] http://en.wikipedia.org/wiki/SHA_hash_functions

Email author at [email protected]