There are two main types of spam, and they have different effects on Internet users.
Cancellable Usenet spam is a single message sent to 20 or more Usenet newsgroups. (Through long experience, Usenet users have found that any message posted to so many newsgroups is often not relevant to most or all of them.) Usenet spam is aimed at lurkers, people who read newsgroups but rarely or never post and give their address away. Usenet spam robs users of the utility of the newsgroups by overwhelming them with a barrage of advertising or other irrelevant posts. Furthermore, Usenet spam subverts the ability of system administrators and owners to manage the topics they accept on their systems.
I think it’s possible to stop spam, and that content-based filters are the way to do it.
The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that. Email spam targets individual users with direct mail messages. Email spam lists are often created by scanning Usenet postings, stealing Internet mailing lists, or searching the Web for addresses.
Email spams typically cost users money out-of-pocket to receive. Many people – anyone with measured phone service – read or receive their mail while the meter is running, so to speak. Spam costs them additional money. On top of that, it costs money for ISPs and online services to transmit spam, and these costs are transmitted directly to subscribers.
The statistical approach is not usually the first one people try when they write spam filters. Most hackers’ first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins Dear Friend or has a subject line that’s all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code.
But the real advantage of the Bayesian approach, of course, is that you know what you’re measuring.
Feature-recognizing filters like SpamAssassin assign a spam score to email. The Bayesian approach assigns an actual probability. The problem with a score is that no one knows what it means. The user doesn’t know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word sex in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it.
Based on my corpus, sex indicates a .97 probability of the containing email being a spam, whereas sexy indicates .99 probability. And Bayes’ Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.