Spam e-mail has become an ever-increasing problem, and these days it is next to impossible to use e-mail without receiving large amounts of it. Various techniques exist to combat the problem: keyword-based filters, source blacklists, signature blacklists, source verification, and combinations of these, to name a few. All of them have problems: keyword filters need constant manual updating and are not very accurate; blacklists also need constant updating and will always lag behind the spammers.

Fortunately, just as we seemed to be losing the war on spam, a new technique appeared on the scene after a paper by Paul Graham: Bayesian filters, our last, best hope for spam-free inboxes. Without going into details on how they work (more information can be found here and here), they are based on statistical methods that give the probability of an e-mail belonging to a given class. Usually just two classes are used, spam and not-spam, but this is not a limitation of the technique; indeed, POPFile supports an arbitrary number of classes. The beauty of Bayesian filtering is that each individual user can train the filter simply by categorizing each received e-mail as either spam or not-spam. After the user has categorized a few e-mails, the filter begins to make this categorization by itself, usually with a very high level of accuracy. If the filter makes a mistake, the user re-categorizes the e-mail, and the filter learns from its mistake. No complicated maintenance is required after the filter is installed; it's so easy even grandma can use it.
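
To make the train-then-classify loop described above concrete, here is a minimal sketch of a naive Bayes text classifier in Python. It is not Paul Graham's exact formula, nor POPFile's implementation; it assumes a plain multinomial model with add-one smoothing, and the class name `NaiveBayesFilter`, its methods, and the sample messages are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Toy multinomial naive Bayes classifier (hypothetical, for illustration)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> occurrences
        self.message_counts = Counter()          # class -> messages trained
        self.vocabulary = set()                  # every word seen in training

    @staticmethod
    def tokenize(text):
        # Crude tokenizer: lowercase runs of letters, digits, "$" and "'".
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """Record one message under the class the user picked.

        Correcting a mistake is simply another call with the right label.
        """
        words = self.tokenize(text)
        self.word_counts[label].update(words)
        self.message_counts[label] += 1
        self.vocabulary.update(words)

    def probabilities(self, text):
        """Return {class: probability} for a message; needs prior training."""
        total_messages = sum(self.message_counts.values())
        # The extra 1 reserves probability mass for words never seen before.
        smoothing_space = len(self.vocabulary) + 1
        words = self.tokenize(text)
        scores = {}
        for label, n_messages in self.message_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Work in log space so long messages do not underflow to zero.
            score = math.log(n_messages / total_messages)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + smoothing_space))
            scores[label] = score
        # Turn log scores back into probabilities that sum to 1.
        top = max(scores.values())
        weights = {label: math.exp(s - top) for label, s in scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def classify(self, text):
        probs = self.probabilities(text)
        return max(probs, key=probs.get)


# The user categorizes a few messages by hand...
mail_filter = NaiveBayesFilter()
mail_filter.train("cheap pills buy now", "spam")
mail_filter.train("win money now click here", "spam")
mail_filter.train("meeting agenda for monday", "not-spam")
mail_filter.train("lunch on monday with the team", "not-spam")

# ...and the filter starts categorizing on its own.
print(mail_filter.classify("buy cheap pills"))
print(mail_filter.classify("agenda for the team meeting"))
```

Nothing in the sketch is specific to two classes: calling `train` with a third label, such as "work", adds a third class, which is the property that lets a filter sort mail into an arbitrary number of categories. A real filter would also need sturdier tokenization (headers, HTML, deliberately obfuscated words) and persistent storage for the counts.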

The link for this article located at Kristian Eide is no longer available.