A Case Study in Spam Management

Sat March 8, 2008

Years ago, when I obtained my first domain name (komlenic.com), half of the reason was so that I could have the most logical email address possible: chris {at} komlenic {dot} com. At first, like all virgin addresses on the net, it received no spam whatsoever, but over time, through various known and unknown paths, the address found its way onto the spammers' lists. The sheer bulk of spam arriving daily eventually reached a level that I couldn't ignore any longer.

I had put a few ineffectual techniques in place over the years as the spam quantity grew, but I found these to be more of a figurative "finger in the dike" than a real solution that stemmed the tide. Though they made my email "manageable", they eventually could barely keep up with the growing problem. All spam was still being delivered to me (often taking several minutes to download, even on a good connection) and only then filtered on my local machine. Surely there were more efficient (and free) means of handling this spam. This article details the steps and methods I've identified and used in a tiered scheme to help manage spam.

Targeting the Biggest Problem: Catch-all

My assessment began by determining the single most defining characteristic of the spam I was receiving. The thought was that if I could identify and treat the largest problem, I would be able to cut the spam arriving in my inbox at least in half. My estimate of half actually turned out to be rather low.

For me, the big factor turned out to be the address to which the incoming spam was sent. A quick examination of 7 days' worth of email revealed 3651 spam messages, of which only 959 were actually addressed to real mailboxes that I use. (A "mailbox" is the portion of an email address located before the @ symbol.)

This means that nearly 75% of my incoming spam wasn't actually addressed to a real mailbox. How can this be? Catch-all email aliasing was at fault. Years ago, I had set up a catch-all address, so that any mail sent to the domain komlenic.com would end up in my inbox. When someone sent a message to christopher or kris or webmaster or beelzebub ...@komlenic.com, I'd ultimately still get the message.
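The arithmetic behind that figure can be checked in a few lines of Python. The two counts are the ones from the sample week above; the variable names are only for illustration.

```python
# Counts from the 7-day sample described above.
total_spam = 3651         # all spam messages received during the week
to_real_mailboxes = 959   # spam addressed to mailboxes actually in use

# Everything else only arrived because of the catch-all.
catch_all_only = total_spam - to_real_mailboxes
catch_all_share = catch_all_only / total_spam

print(f"{catch_all_only} of {total_spam} messages ({catch_all_share:.1%})")
```

That works out to 2692 messages, or a little under three quarters of the week's spam.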

A catch-all can actually be used as a spam-fighting tactic: when an email address is required at an online store or other website, you can enter somestore@yourdomain.com. Later, if you start receiving spam at that address, you'll know the store or site has probably sold your information to a spammer, and you can filter or block email to that address. I've used these "made-up" email addresses many times, but my problem was that I had never followed through on blocking the addresses that were now receiving spam on a daily basis.

Beyond that, I was receiving spam at many non-existent mailboxes that spammers often target in their attempt to "shotgun" their message, such as sales@, info@, or webmaster@. And then there were spam messages addressed to ridiculous mailboxes such as a1aaaazzz435@ or 6767676767@, which could only have been intended to be swept up by a catch-all scheme.

The easy solution: disable the catch-all, for an instant 75% reduction in spam volume. This did, however, require me to manually set up forwarders for the few "made-up" mailboxes that I've used over the years and that are still viable and useful. For example, all mail coming from my bank might be directed to bankname@komlenic.com, and mail from an online retailer that I frequently use, such as Amazon.com, might be going to amazon.com@komlenic.com. Disabling the catch-all without putting forwarders in place would "bounce" these messages back to the sender as undeliverable, and more importantly, I would never see them.

Disabling the catch-all also means that in the future, I'll either have to manually set up forwarders for new made-up mailboxes, or simply use a throw-away Gmail or Hotmail address for these purposes.
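To make the effect of replacing the catch-all with explicit forwarders concrete, here is a minimal Python sketch of the routing decision. It is only a model of the idea, not how any particular mail server is configured: the route function and the two tables are hypothetical, and the mailbox names are the examples used above.

```python
# Hypothetical tables, for illustration only.
REAL_MAILBOXES = {"chris"}
FORWARDERS = {
    "bankname": "chris",     # made-up address given to a bank
    "amazon.com": "chris",   # made-up address given to a retailer
}

def route(recipient, domain="komlenic.com"):
    """Return the mailbox to deliver to, or None if the message bounces."""
    local_part, _, rcpt_domain = recipient.lower().rpartition("@")
    if rcpt_domain != domain:
        return None
    if local_part in REAL_MAILBOXES:
        return local_part
    # With the catch-all disabled, unknown mailboxes are rejected
    # unless a forwarder has been set up for them.
    return FORWARDERS.get(local_part)

print(route("bankname@komlenic.com"))
print(route("a1aaaazzz435@komlenic.com"))
```

Mail to a forwarded address still reaches the real mailbox, while mail to an unknown one gets no destination at all, which is exactly the trade-off: every made-up address worth keeping has to be listed by hand.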

Server Side Filtering: SpamAssassin

Now that my inbox was receiving mail sent only to approved mailboxes, I still needed to reduce the spam that was coming through to these legitimate addresses. I knew my host had the acclaimed SpamAssassin available on the server, and found it a trivial matter to enable it for my domain. In my case, I was able to set SpamAssassin to delete all messages that were found to be "high ranking" spam (meaning that SpamAssassin scored the message as very likely to be spam). I would never see these messages.

I also chose to set SpamAssassin to deliver all messages that were considered "low ranking" spam. SpamAssassin allows delivery as normal, but adds a header to these messages which identifies them as probable spam. I can then choose to filter these locally in my email client, allowing me to still periodically check these messages to make sure no legitimate messages have been flagged as spam. (More on this below.)

After employing SpamAssassin as described above, I saw an additional ~10% reduction in delivered spam, but this would obviously have been higher had I chosen to let SpamAssassin delete low-ranking spam instead of delivering it.

Additionally, the other people who receive mail through my domain (in this case, family members) should also see a reduction in messages that are obviously spam.
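The two-tier setup described above can be sketched roughly as follows. This is an illustration only, not how any host actually implements it: the function names are made up, and the delete cutoff is an assumed value that depends on what the host lets you set (5.0 is SpamAssassin's default required score). The second function looks for the X-Spam-Flag header that SpamAssassin adds to messages it considers spam, which is the kind of marker a mail client can filter on.

```python
from email import message_from_string

# 5.0 is SpamAssassin's default required score. The delete cutoff is an
# assumption for this sketch and will differ from host to host.
TAG_AT = 5.0
DELETE_AT = 10.0

def server_action(score):
    """Decide what the server does with a message, given its spam score."""
    if score >= DELETE_AT:
        return "delete"    # high-ranking spam: never delivered
    if score >= TAG_AT:
        return "tag"       # low-ranking spam: delivered with a header
    return "deliver"       # not considered spam

def is_tagged_as_spam(raw_message):
    """True if the message carries an 'X-Spam-Flag: YES' header."""
    msg = message_from_string(raw_message)
    return (msg.get("X-Spam-Flag") or "").strip().upper() == "YES"

sample = (
    "From: someone@example.org\n"
    "Subject: Cheap watches\n"
    "X-Spam-Flag: YES\n"
    "\n"
    "Buy now.\n"
)

print(server_action(12.3), server_action(6.1), server_action(0.4))
print(is_tagged_as_spam(sample))
```

Keeping the "tag" tier separate from the "delete" tier is what leaves room to review borderline messages later instead of losing them on the server.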

Client Side Filtering: Configuring Mozilla Thunderbird

Thunderbird, the free open-source email client from Mozilla, has several settings and features that helped address the remaining 15% of spam still hitting my inbox.

For starters, under the Junk Settings for this account, I made sure that "Trust junk mail headers sent by:" was checked and set to SpamAssassin. Along with this, I chose to "move new junk messages to" a junk folder, and to "automatically delete junk mail older than 7 days". What all this does is grab the messages that SpamAssassin flagged as low-ranking spam and put them in a special junk folder, deleting any messages older than 7 days. It keeps them out of my inbox, while allowing me to check periodically for possible false-positives (legitimate messages that might have gotten flagged as possible spam).

To help eliminate the possibility of false-positives, I further set the junk settings to "Do not mark mail as junk if the sender is in: Collected Addresses". The Collected Addresses are just that: addresses collected from the emails that you send. So any mail coming from anyone you have sent mail to before will never be marked as junk.

At this point in the steps I've taken to reduce spam in my inbox, the following estimated (yet confirmed by real-world data) reductions have taken place:

75% - caught by eliminating catch-all email aliasing
10% - caught by letting SpamAssassin delete high-ranking spam on the server
10% - caught by letting Thunderbird trust SpamAssassin's low-ranking spam and moving it to a junk folder

That's a 95% reduction in spam hitting my inbox, with no undesirable side-effects and a near-zero chance of false-positives.
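As a quick sanity check, the tally can be written out in Python. The percentages are the estimates listed above; the weekly figure assumes that the 3651-message sample week is typical, which is only an assumption.

```python
# Estimated share of the original spam volume caught at each tier.
tiers = [
    ("catch-all disabled", 75),
    ("SpamAssassin deletes high-ranking spam", 10),
    ("Thunderbird moves SpamAssassin-tagged spam to junk", 10),
]

caught = sum(share for _, share in tiers)
remaining = 100 - caught

# Rough weekly figure, assuming the sample week is representative.
weekly_spam = 3651
left_per_week = round(weekly_spam * remaining / 100)

print(f"caught {caught}%, remaining {remaining}% (about {left_per_week} messages a week)")
```

So even after a 95% reduction, something on the order of 180 messages a week would still be left for the last tier to deal with.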

Tackling the Last 5%: Adaptive Junk Mail Filtering

Lastly, I use Thunderbird's Bayesian spam filtering to help catch and weed out the final 5% or less of spam that makes it through. Note: Thunderbird doesn't use the word "spam", substituting "junk mail" instead. The junk mail controls in Thunderbird are pretty self-explanatory, so I won't go into great detail here. If you need more information, please see Mozillazine's Junk Mail Controls page. I simply enabled the adaptive junk mail filtering for the account, and began flagging messages as either "junk" or "not junk". Thunderbird does the rest. There are, however, some tips for getting the adaptive junk mail filtering to behave better in Thunderbird 2 (the current version at the time of writing).

For starters, you should understand that in the beginning, this filter has little training data to work with, and a lot of messages may get through. I also diligently skim the messages it determines to be junk for false-positives, even though I've found relatively few occurrences. The other side of this is that it seems possible to over-train or incorrectly-train Thunderbird's adaptive junk mail filtering: after several months of training, for some reason Thunderbird often seems to start missing more and more junk messages for a lot of users (although still better than with no filter at all).

Why this occurs is partly unclear to me, but it appears to have something to do with the theory behind Bayesian filtering and the danger of having too large a data set with proportionately too much of one kind of data. In Thunderbird's case, if you receive and mark 400 messages as junk but only receive and mark 20 messages as not junk (or vice versa), the filter's accuracy will suffer. It requires many messages of each type to achieve near-perfect accuracy.
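One way to see why lopsided training data matters is a toy naive Bayes classifier. To be clear, this is a textbook sketch and not Thunderbird's actual algorithm; the two training messages and every name in it are invented for illustration. It only shows that the balance between the two kinds of training data changes the verdict on a message the filter has no real evidence about.

```python
import math
from collections import Counter

def train(messages):
    """Build a model from (words, is_junk) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    msg_counts = {True: 0, False: 0}
    for words, is_junk in messages:
        word_counts[is_junk].update(words)
        msg_counts[is_junk] += 1
    return word_counts, msg_counts

def junk_probability(model, words):
    """Posterior probability that a message is junk under naive Bayes."""
    word_counts, msg_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_msgs = msg_counts[True] + msg_counts[False]
    log_score = {}
    for label in (True, False):
        # Class prior: share of training messages with this label.
        score = math.log(msg_counts[label] / total_msgs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        log_score[label] = score
    return 1 / (1 + math.exp(log_score[False] - log_score[True]))

junk = (["cheap", "pills"], True)
ham = (["lunch", "meeting"], False)

balanced = train([junk] * 20 + [ham] * 20)
lopsided = train([junk] * 400 + [ham] * 20)

unknown = ["hello", "there"]  # no word seen in training
print(round(junk_probability(balanced, unknown), 3))
print(round(junk_probability(lopsided, unknown), 3))
```

With balanced training the unfamiliar message comes out at even odds, while with 400 junk examples against 20 good ones it is rated about 95% likely to be junk purely because of the imbalance. A real filter is more sophisticated than this, but the sketch gives a feel for why the proportions of the training data can matter as much as the quantity.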

Also, the nature and type of junk messages change over time, and stale training data that is months or years old can further throw off the accuracy rate.

The solution is to periodically reset the training data (perhaps every month, every other month, or every 3-6 months, depending on the volume of mail you receive), or simply to reset it when you notice the filter consistently not performing as well. You can do this by clicking the Reset Training Data button (found on the Privacy pane of the Options dialog). Most users report near-perfect accuracy within a few days of resetting their training data.

Other options that many users employ and report success with are tweaking the mail.adaptivefilters.junk_threshold preference and adding custom filters to help capture persistently stubborn types of junk mail. Remember that in a multi-tiered approach like the one I have employed, I'm really only working on filtering out the last 5% of junk messages here, so 100% accuracy isn't really required for this step.

Adding It All Up

After employing the techniques outlined above, I was able to all but eliminate spam from my inbox, with generally only one or two messages per week getting through. With the exception of Mozilla Thunderbird's adaptive junk mail filtering, the vast majority of spam in this case is being caught by means that, once implemented, require no maintenance or further input.