* * * * *
Taking away the spam filter's Little Orphan Annie Secret Decoder Ring
A few months ago I wrote about some character encoding problems [1] I was
having, namely that character encoding is a real mess on the web. But
apparently, it's not a mess with email.
We have a dedicated computer that does nothing but filter spam (and the
statistics from that are depressing); you can add additional filtering via
regular expressions. Smirk has been receiving quite a bit of foreign spam,
stuff in Russian, Korean, Chinese, which he can't even read since it's in
Cyrillic, Wansung and Hangul. But some (if not most) of the email had
subject lines like:
> **Subject:** =?Windows-1251?B?amFlQGxlZWhvbS5uZXQg?=
>
where the character set is named right in the subject line, ahead of the
encoded text. So Smirk thought a regular expression like
^Subject: .*Windows-1251.* would work and filter out the spam in Cyrillic
(with appropriate regular expressions for Wansung and Hangul).
Only it didn't work.
It caught subject lines that had “Windows-1251” as part of the legitimate
subject line (I sent him a test message with the subject of “Did you get
Windows-1251 yet?”) but not if it was part of an encoding. Which meant only
one thing: the spam filtering system was **applying the regular expressions
to the decoded characters!**
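
To illustrate the difference, here is a minimal Python sketch of matching the
same pattern against the raw header versus the decoded one. It is not the
spam filter's actual code; it only assumes the filter decodes encoded
subject lines (the =?charset?B?...?= form) before applying the regular
expression. The helper names decode_subject and declared_charsets are made up
for this example.

```python
import re
from email.header import decode_header

# The two subject lines from the text, as they appear on the wire.
RAW_SUBJECT = "Subject: =?Windows-1251?B?amFlQGxlZWhvbS5uZXQg?="
PLAIN_SUBJECT = "Subject: Did you get Windows-1251 yet?"

# Smirk's regular expression.
PATTERN = re.compile(r"^Subject: .*Windows-1251.*")


def decode_subject(header_line):
    """Return the header with any encoded words turned into plain text.

    This stands in for what the filter appears to do before matching
    (an assumption, not confirmed behavior).
    """
    name, _, value = header_line.partition(": ")
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Encoded words come back as bytes plus the declared charset.
            parts.append(data.decode(charset or "ascii", errors="replace"))
        else:
            # Headers without encoded words come back as plain strings.
            parts.append(data)
    return name + ": " + "".join(parts)


def declared_charsets(header_line):
    """Return the character sets named inside the header's encoded words."""
    _, _, value = header_line.partition(": ")
    return [charset for _, charset in decode_header(value) if charset]


if __name__ == "__main__":
    for line in (RAW_SUBJECT, PLAIN_SUBJECT):
        print("raw match:    ", bool(PATTERN.search(line)))
        print("decoded match:", bool(PATTERN.search(decode_subject(line))))
        print("charsets:     ", declared_charsets(line))
```

Against the raw header the pattern matches the spam, but once the subject is
decoded the charset name is gone and only the plain-text test message still
matches, which is the behavior described above. A filter that exposed the
declared charset (as declared_charsets does here) could match on that
instead, but whether this one can is unknown.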
Well … that's certainly a surprise.
But it doesn't help the current problem. We're now waiting to hear back from
the company on whether that “feature” can be turned off.
[1] gopher://gopher.conman.org/1Phlog:2004/12/05