Unless you've been living under a rock (or been on holiday) for the past fortnight, you'll have seen some of the reaction to Paul Graham's 'A Plan for Spam'. Most of it has focused on the Bayesian probability method he uses, some of it has argued that a more thorough lexical analysis is required, and thankfully, much of it has focused on the implementation of that Plan.
It seems to me, though, that a lot of effort also needs to be directed towards the data that we use to train up our spamassassins, our cloudmarks and the like. I come from a background in speech technology, and for at least twenty years, our maxim has been 'There's no data like more data'. I'd alter that slightly to read 'There's no data like more, relevant, data'.
What does relevant data mean in this case? Well, consider this analogy. Text-based language identification has been a research topic for years now. This page from Georgetown manages (very well) to deduce your language from a short string of its text. What the current crop of anti-spam filters is doing, essentially, is language verification. In other words, they're trying to work out whether your incoming message is written in proper English or in 'spam English'.
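To make that 'verification' idea concrete, here's a minimal sketch that scores a message under two word-frequency models and picks the more probable one. It's in the spirit of, but far cruder than, the Bayesian approach under discussion; the corpora, counts and smoothing here are invented purely for illustration.

```python
import math
from collections import Counter

def word_counts(corpus):
    """Count word frequencies over a list of messages."""
    counts = Counter()
    for message in corpus:
        counts.update(message.lower().split())
    return counts

def log_likelihood(message, counts, vocab_size):
    """Add-one smoothed log-probability of a message under one model."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in message.lower().split())

# Invented toy corpora -- a real filter needs far more of both.
ham_model = word_counts(["meeting moved to friday",
                         "draft of the report attached"])
spam_model = word_counts(["free pictures click now",
                          "click here for free offer now"])
vocab = len(set(ham_model) | set(spam_model))

def verdict(message):
    # Verification: score the message under BOTH models;
    # the more probable of the two wins.
    if log_likelihood(message, spam_model, vocab) > \
       log_likelihood(message, ham_model, vocab):
        return "spam"
    return "ham"
```

Note that the verdict depends as much on the non-spam model as on the spam one, which is rather the point of what follows.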
All well and good, you might say - cloudmark is collecting plenty of data, and geographically well-distributed data at that, which describes what spam looks like. In an ideal world, then, spamassassin could use these non-local corpora to improve its hit rate for spam, and reduce its false positives (those emails that aren't actually spam, but which are marked as such).
The trouble is that we need data not only for spam, but for non-spam too. To take a couple of typical false positives: how about an HTML email written in red (FF0000)? It might well be an urgent message from a work colleague - you don't want it thrown away because your spam filter thinks its contents might be obscene. There are word pairs that might lead a bigram analysis to conclude that something dodgy was going on - 'free' and 'pictures', for instance. In the wider context of the email, though, it might become apparent that you were tipping off a friend about a film-developing offer. Too bad if their spam filter has already discarded that information.
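Here's that failure mode in miniature. The 'suspicious' bigram list is invented for illustration: with spam-only evidence, the innocent message and the dodgy one are indistinguishable.

```python
def bigrams(text):
    """Adjacent word pairs in a message."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

# An invented piece of spam-only evidence: in a model trained
# without non-spam data, this pair looks damning on its own.
SUSPICIOUS = {("free", "pictures")}

innocent = "they develop your film and throw in free pictures"
spammy = "click here now for free pictures"

# Both messages trigger the same evidence; without a non-spam
# corpus there is nothing to tell them apart.
print(SUSPICIOUS & bigrams(innocent))  # not empty
print(SUSPICIOUS & bigrams(spammy))    # not empty
```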
A corpus of non-spam is a real benefit to spam detectors. In the context of the language verification example: if we want to verify that we're writing in English and not - say - Welsh, we need not only Welsh data but also lots of English data. We use some method to determine how likely the suspect email is to be written in Welsh, apply the same method to determine how likely it is to have been written in English, and the more probable of the two wins.
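A toy version of that English-versus-Welsh check, using character trigrams with crude add-one smoothing. The training scraps below are invented and absurdly small - real models need vastly more data, which is rather the article's point.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Character trigrams, with padding so word edges count too."""
    padded = f"  {text.lower()}  "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(samples):
    """Trigram counts over a (tiny, illustrative) training set."""
    counts = Counter()
    for sample in samples:
        counts.update(char_trigrams(sample))
    return counts

def score(text, counts):
    """Add-one smoothed log-likelihood of the text under one model."""
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + 1000))
               for g in char_trigrams(text))

# Invented scraps of training data -- far too little for real use.
english_model = train(["there is no data like more data",
                       "the cat sat on the mat"])
welsh_model = train(["mae hen wlad fy nhadau yn annwyl i mi",
                     "diolch yn fawr iawn"])

def identify(text):
    # The same method applied to both models; the more probable wins.
    if score(text, english_model) > score(text, welsh_model):
        return "English"
    return "Welsh"
```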
So what, you might say - what's so special about collecting a bank of text anyway? Well, don't tell that to the LDC or to ELRA. Between them, they have collected catalogues of speech, text and terminological resources that would cost you millions of dollars to buy in their entirety. ELRA, the European equivalent of America's LDC, even holds its own biennial conference to discuss the state of its art.
A free corpus of non-spam, in addition to a free corpus of spam, would be wonderful. Not only for the poor beleaguered end-user, fed up with having their mail munched before it reaches their desktop. Not only for the sysadmins, who could, with their users' permission, add all non-spam to the corpus on the company's server, tweaking the spam and non-spam models to perfection for that firm. No, the largest benefit is probably to the free software community in general - we would be seen to be providing everyone with the means to reduce the amount of unwanted junk in the inbox.
There are many issues here, of course, not least those of data protection and privacy. There are ways round this, though, be they the deletion of potentially sensitive information or the munging of URLs and email addresses. And in any case, it's highly unlikely that any spam prevention system would be released with the actual corpus itself - it's more likely to go out with a model (bigram, hash table, heptagram, whatever) derived from the corpus.
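Munging along those lines might look like the sketch below. The regexes and placeholder tokens are my own invention, not any particular tool's: addresses and URLs are swapped for neutral tokens, so the corpus keeps its statistical shape without leaking anyone's details.

```python
import re

def munge(text):
    """Replace email addresses and URLs with neutral placeholders."""
    # Hypothetical patterns for illustration -- a production munger
    # would need to be rather more thorough than this.
    text = re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "EMAILADDR", text)
    text = re.sub(r"https?://\S+", "URL", text)
    return text

print(munge("Mail bob@example.com or see http://example.com/offer now"))
# prints: Mail EMAILADDR or see URL now
```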
Steps are already being taken towards global spam corpora. The work of Vipul's Razor and its commercial offshoot is to be applauded. But if you're detecting spam, please don't neglect - or forget - that which is not.