* * * * *

                          Yet more thoughts on spam

Last month I switched email clients from Thunderbird [1] to mutt [2] (I found
Thunderbird to be too sluggish but that's a story for another entry) and
configured our primary email server to forward my mail directly to my
workstation, where procmail [3] can then filter it.

So now I can burn through mail in about half the time it used to take me.

I get a ton of email, most of it from the various servers (from root mostly)
and most of that is generated by the mail system itself, informing me that
it's found, yet again, another email infected with a virus (oh, easily 500 a
day) or it couldn't deliver a message (another 500 a day easy) or the multi-
thousand line output of logwatch [4] (each easily 15,000 lines of summary per
day).

So it was a simple matter to set up procmail to filter the messages (and, say,
automatically delete the virus warnings; I tried turning those off on the
servers themselves, but … well … control panels and hidden configuration
files, and I'm stuck getting them even though I don't care for them). Now,
since our mail goes through a dedicated spam filtering system that can mark
emails as spam, I thought it would be a good idea to simply delete those upon
receipt as well.

Only I kept receiving emails marked as spam.

>   31 N   Dec 01 trespassers@gre ( 306) [SPAM]  Breaking News
>

Puzzled, I moved the procmail configuration to delete such marked spam:

> :0:
>       * ^Subject: .*SPAM.*
>       in-TRASH
>

to the start of my .procmailrc, and yet I still got the emails. I bumped up
the verbosity of logging, and yes, some of the spam was actually being caught
and trashed, but not all of it.

What the heck?

In mutt I see:

> **From:** <[email protected] [5]>
>  **To:** <apache@XXXXXXXXXXX>
>  **Subject:** [SPAM] Breaking News
>  **Date:** Thu, 1 Dec 2005 22:49:10 +0200
>

But when I checked the actual raw email message …

> **From:** <[email protected] [6]>
>  **To:** <apache@XXXXXXXXXXX>
>  **Subject:** =?ascii?B?W1NQQU1dICBCcmVha2luZyBOZXdz?=
>  **Date:** Thu, 1 Dec 2005 22:49:10 +0200
>

That funky subject line? A form of MIME (Multipurpose Internet Mail
Extensions) encoding for email headers. In this case, the subject line uses
the US-ASCII character set and is encoded as base-64 [7]. procmail knows
nothing about MIME encodings. It's looking for “SPAM” in the subject line and
not finding it.
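
For what it's worth, undoing that encoding before matching only takes a few
lines. This is a minimal sketch in Python, not something procmail does on its
own; the names decode_subject and is_marked_spam are made up for
illustration, and the only library call is the standard library's
email.header.decode_header.

```python
from email.header import decode_header


def decode_subject(raw):
    """Decode a MIME encoded-word Subject value into readable text."""
    parts = []
    for chunk, charset in decode_header(raw):
        # Encoded words come back as bytes plus a charset name;
        # plain, unencoded text may already be a str.
        if isinstance(chunk, bytes):
            try:
                chunk = chunk.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Unknown charset label: fall back to ASCII with replacement.
                chunk = chunk.decode("ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


def is_marked_spam(raw_subject):
    """True if the decoded subject carries the filter's [SPAM] tag."""
    return "[SPAM]" in decode_subject(raw_subject)


# The raw subject line from the message above.
print(decode_subject("=?ascii?B?W1NQQU1dICBCcmVha2luZyBOZXdz?="))
```

Run against the raw subject above, this should print the same text mutt
displayed, at which point a plain search for the tag works again.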

Well now …

Obviously, I can add

> :0:
>       * ^Subject: =\?.*\?W1NQQU1dIC.*
>       in-TRASH
>

(“[SPAM]” encoded as base-64) to my .procmailrc file, but is there a better
way?
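
One caveat about matching the encoded text directly: base-64 packs three bytes
into four characters, so only the part covering the six bytes of the tag
itself (eight characters) stays the same from message to message. The
characters right after it depend on whatever follows the tag. A small Python
sketch shows the effect; the one-space subject is a made-up variant for
comparison, and this only considers base-64 encoded subjects that start with
the tag.

```python
import base64


def b64(text):
    """Base-64 encode ASCII text and return the result as a string."""
    return base64.b64encode(text.encode("ascii")).decode("ascii")


tag = b64("[SPAM]")                        # six bytes -> eight characters
two_spaces = b64("[SPAM]  Breaking News")  # the subject from the message above
one_space = b64("[SPAM] Breaking News")    # hypothetical variant, one space

print(tag)
print(two_spaces.startswith(tag), one_space.startswith(tag))
# Compare the two characters that follow the tag in each case.
print(two_spaces[:10], one_space[:10])
```

If that holds, a pattern that stops after the eight characters for the tag
would catch both variants, while a longer one only matches when the tag
happens to be followed by the same bytes.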

Sure, Bayesian filtering [8] is pretty cool, but I still think that a few
simple heuristics in place would help just as much.

One idea: check the character encoding of the incoming email. In my case, if
it isn't US-ASCII, ISO-8859-1 or UTF-8 (oh, might as well include
WINDOWS-1252 for those unfortunate friends that are abused by Microsoft), then
discard it. It doesn't matter if it's legitimate email if I don't understand
the language it's written in.

Now, with ISO-8859-1, UTF-8 or WINDOWS-1252, I still might not be able to
read the message (since ISO-8859-1 and WINDOWS-1252 cover western European
languages like French and German, and UTF-8 covers just about all written
languages), but my second idea should take care of that.
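
A rough sketch of the charset check in Python, using the standard email
module. The allow-list is my own assumption (in particular it takes the
Windows code page for western European languages to be windows-1252), and the
function names are invented for this example.

```python
import email

# Assumed allow-list; charset names are compared in lower case.
ALLOWED_CHARSETS = {"us-ascii", "ascii", "iso-8859-1", "utf-8", "windows-1252"}


def charsets_used(raw_message):
    """Return the set of charsets declared by the text parts of a message."""
    msg = email.message_from_string(raw_message)
    found = set()
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # A text part with no charset parameter defaults to US-ASCII.
            found.add(part.get_content_charset() or "us-ascii")
    return found


def readable_charset(raw_message, allowed=ALLOWED_CHARSETS):
    """True if every declared text charset is on the allow-list."""
    return charsets_used(raw_message) <= set(allowed)


# Hypothetical sample messages.
SAMPLE_OK = "Subject: hi\nContent-Type: text/plain; charset=UTF-8\n\nhello\n"
SAMPLE_BAD = "Subject: hi\nContent-Type: text/plain; charset=KOI8-R\n\nhello\n"

print(readable_charset(SAMPLE_OK), readable_charset(SAMPLE_BAD))
```

This only looks at what the message claims about itself; a sender can label
anything as UTF-8, which is where the next idea comes in.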

Second idea: spell check the incoming email.

No, seriously.

Take this bit of spam I received today:

> **lt** is really hard to recollect a company: the market is full of
> **sugqestions** and the information is overwhelming; but A GOOD CATCHY
> LOGO, STYLISH STATlONERY and OUTSTANDING **WEBSIT E wilI** make the task
> much easier.
>
> We do not promise that having ordered a **loqo** your company **wiIl
> automaticaIly** become a **worId Ieader**: it is quite clear that without
> good products **,effective** business **orqanization** and **practicable**
> aim it will be hot at nowadays market; but we do promise that your
> marketing efforts will become much more effective.
>

Twelve spelling errors (and one punctuation error, which I marked but am not
counting in the following statistic) for a 14% spelling error rate. And if
the email is in a different language, the spelling error rate will easily go
past 95%. So, if the share of misspelled words exceeds, say, 70%, delete it,
and if it's above, say, 5% (hey, we all make mistakes sometimes), mark it as
possible spam.
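
The two thresholds could be sketched like this in Python. The tiny word set
below is a stand-in I made up for the example; a real version would load a
proper word list (on many Unix systems there is one in /usr/share/dict/words,
but that is an assumption about the machine). The function names are
likewise invented.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")


def misspelling_rate(text, dictionary):
    """Fraction of words in text that are not found in dictionary."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)


def classify(text, dictionary, delete_above=0.70, flag_above=0.05):
    """Apply the two thresholds: delete, mark as possible spam, or keep."""
    rate = misspelling_rate(text, dictionary)
    if rate > delete_above:
        return "delete"
    if rate > flag_above:
        return "possible-spam"
    return "keep"


# Hypothetical, deliberately tiny dictionary for demonstration only.
WORDS = {"the", "market", "is", "full", "of", "and", "a", "good", "logo"}

print(classify("the market is full of sugqestions", WORDS))
```

With a dictionary this small nearly everything would be flagged, so the
quality of the word list matters at least as much as the thresholds.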

This would definitely piss off the V1@gr@ pushers.

Third idea: unless the sender is whitelisted, delete any email that contains
an attachment of any type (well, for me at least).
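
A sketch of that rule in Python, again with the standard email module. The
whitelist address and the function names are hypothetical, and "attachment"
here is taken to mean any MIME part that is marked as one or that carries a
filename.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical whitelist of senders allowed to send attachments.
WHITELIST = {"friend@example.com"}


def has_attachment(msg):
    """True if any part is marked as an attachment or carries a filename."""
    for part in msg.walk():
        if part.get_content_disposition() == "attachment" or part.get_filename():
            return True
    return False


def should_delete(raw_message, whitelist=WHITELIST):
    """Delete mail with attachments unless the sender is whitelisted."""
    msg = message_from_string(raw_message)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in whitelist:
        return False
    return has_attachment(msg)
```

The From header is trivially forged, so a whitelist keyed on it is a
convenience rather than a security measure.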

And this is before explicit filtering, Bayesian or otherwise.

I wonder just how hard something like that would be to write …

[1] http://www.mozilla.org/products/thunderbird/
[2] http://www.mutt.org/
[3] http://www.procmail.org/
[4] http://www.logwatch.org/
[5] mailto:[email protected]
[6] mailto:[email protected]
[7] http://en.wikipedia.org/wiki/Base64
[8] http://www.paulgraham.com/spam.html

Email author at [email protected]