The Purify Lexical Analysis Tool (PLAT). Using an English dictionary and an algorithm that analyzes an email’s content in terms of five parts of speech, PLAT determines which words in the message it considers “meaningful.” It uses only these words when training on, or rating, an email with its Bayesian type statistics filter. This is a far more accurate way to apply a statistics-based filter to email message content.
Plain Bayesian type filters don’t work as stand-alone, frontline spam filters.
Here is a quick Bayesian 101 tutorial. The user of a Bayesian type filter “trains” the filter by adding words from the user’s email content to the filter’s dictionary, categorizing each word as “good” or “bad.”
When rating an email as “good” or “bad,” the filter takes the words from the email’s content and, using simple statistics, calculates the likelihood of each word being “good” or “bad” based on the user’s previous “training.” The filter then combines the word scores into an aggregate and applies that score to the email message as a whole, usually as a probability (e.g., 0.88), identifying the message as “good” or “bad.”
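To make the tutorial concrete, here is a minimal sketch in Python of a Bayesian type word filter. It is not Purify’s code: the class name WordFilter, the tokenizer, the add-one smoothing, and the way the word scores are combined are all assumptions chosen for illustration.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class WordFilter:
    """Toy Bayesian type filter built from "good"/"bad" training (illustrative only)."""

    def __init__(self):
        self.counts = {"good": Counter(), "bad": Counter()}
        self.messages = {"good": 0, "bad": 0}

    def train(self, text, label):
        """Record the words of one message under the label "good" or "bad"."""
        self.messages[label] += 1
        # Count each distinct word once per message.
        self.counts[label].update(set(tokenize(text)))

    def word_probability(self, word):
        """Estimate how strongly a single word points to "bad" (0 to 1)."""
        # Add-one smoothing keeps unseen words neutral instead of 0 or 1.
        bad = (self.counts["bad"][word] + 1) / (self.messages["bad"] + 2)
        good = (self.counts["good"][word] + 1) / (self.messages["good"] + 2)
        return bad / (bad + good)

    def rate(self, text):
        """Combine the word scores into one rating for the whole message."""
        log_bad = log_good = 0.0
        for word in set(tokenize(text)):
            p = self.word_probability(word)
            log_bad += math.log(p)
            log_good += math.log(1.0 - p)
        # Same as prod(p) / (prod(p) + prod(1 - p)), computed in log space.
        return 1.0 / (1.0 + math.exp(min(log_good - log_bad, 700.0)))


spam_filter = WordFilter()
spam_filter.train("cheap pills buy now", "bad")
spam_filter.train("meeting notes attached for tomorrow", "good")
print(round(spam_filter.rate("buy cheap pills"), 2))
print(round(spam_filter.rate("notes for the meeting"), 2))
```

With only these two training messages, the first rating lands near 0.89 and the second near 0.11. Every word votes on its own, with no regard for the words around it, which is the weakness the rest of this article turns on.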
Why doesn’t this work?
Take this crass example. Compare the two phrases:
“She takes it up the stairs.”
“She takes it up the (vulgar reference human waste elimination point).”
The phrases are very different, almost to the point of causing a visceral reaction in us, aren’t they? But what is different? One word.
You can stretch this out to longer, more elaborate messages, but the underlying flaw remains. Because the two phrases share every word but one, by the time you have trained the vulgar phrase as bad you will also have trained most of the innocent phrase as bad, and vice versa.
Spammers have figured this out, to the point where they will add content from books to the bottom of their spam email to “poison” a Bayesian type filter.
In my first version of Purify, I tried adding a straight Bayesian filter to it to support its other filtering capabilities. At first, I was impressed with the results. Then the misclassifications started. So I started futzing around with how the statistics were produced.
First, I restricted my statistics to the 15 “most interesting” words, those with the highest probability of being “bad.” I believe Paul Graham used this approach in his “A Plan for Spam” article, and that Michael Tsai used it in his SpamSieve product. For the reasons mentioned above, this approach fell apart.
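A small sketch in Python shows why limiting the statistics to 15 words does not protect against poisoning. It follows the definition in Graham’s article, where “interesting” means farthest from the neutral 0.5 in either direction; how the first version of Purify ranked words may have differed. The function names and all of the per-word probabilities are invented for the example.

```python
import math


def most_interesting(word_probs, limit=15):
    """Return up to `limit` probabilities that lie farthest from the neutral 0.5."""
    ranked = sorted(word_probs.values(), key=lambda p: abs(p - 0.5), reverse=True)
    return ranked[:limit]


def combined_score(probabilities):
    """Combine word probabilities as prod(p) / (prod(p) + prod(1 - p))."""
    # Clamp to [0.01, 0.99] so no single word can force a 0 or a 1.
    clamped = [min(max(p, 0.01), 0.99) for p in probabilities]
    bad = math.prod(clamped)
    good = math.prod(1.0 - p for p in clamped)
    return bad / (bad + good)


def rate(word_probs, limit=15):
    """Rate a message using only its most interesting words."""
    return combined_score(most_interesting(word_probs, limit))


# Invented "bad" probabilities for the words of a spam message...
spam_words = {"viagra": 0.99, "pills": 0.95, "cheap": 0.90}
# ...and for innocent book text pasted underneath it.
book_words = {f"bookword{i}": 0.02 for i in range(15)}

plain = rate(spam_words)
poisoned = rate({**spam_words, **book_words})
print(plain > 0.99, poisoned < 0.01)
```

With these made-up numbers, the spam words alone rate above 0.99, while the same message padded with book text rates below 0.01, because the strongly “good” padding crowds most of the spam words out of the top 15.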
Next, I tried “biasing” the good words, giving them extra weight to reduce the possibility of a “false positive.” This started letting an unacceptable level of spam through.
I went round and round until I came to the conclusion that Bayesian filtering simply did not work by itself, and at best, it could be used with other technologies to evaluate an email message as “good” or “bad.”
So, why do so many people swear by these filters? Use. If you are Joe Average User, I’d estimate that over 80% of the email you receive is spam, so right out of the gate you have 80% accuracy “catching” spam. Your false positives and false negatives come out of the remaining 20% of your email, and will probably be relatively painless to deal with.
But what if you are a business owner? What if every email counts?
Purify offers several frontline, tried-and-true methods for filtering email, including the modified Bayesian type filter (PLAT) described here.
Purify Country Filter. Purify examines the SMTP headers of an email message and determines the email’s source domain. It looks up the country of the source domain’s IP address, and if the user has marked that country as blocked, the message is marked as spam. It also applies the same Country Filter to any embedded links in the email message!
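A rough sketch in Python of the header side of such a check. It rests on several assumptions that are not taken from Purify: it reads bracketed IPv4 addresses straight out of the Received headers rather than resolving a source domain, it does not look at embedded links, and the lookup table NETWORK_COUNTRIES is a stand-in for a real GeoIP database. The networks are reserved documentation ranges, and “XA” and “XB” are placeholder country codes.

```python
import ipaddress
import re
from email import message_from_string

# Hypothetical lookup table; a real filter would consult a GeoIP database.
NETWORK_COUNTRIES = {
    ipaddress.ip_network("203.0.113.0/24"): "XA",
    ipaddress.ip_network("198.51.100.0/24"): "XB",
}

# Matches IPv4 addresses written in square brackets, e.g. [203.0.113.7].
IP_PATTERN = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def country_of(ip_text):
    """Return the placeholder country code for an address, or None if unknown."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return None
    for network, country in NETWORK_COUNTRIES.items():
        if address in network:
            return country
    return None


def is_blocked(raw_message, blocked_countries):
    """True if any relay address in the Received headers maps to a blocked country."""
    message = message_from_string(raw_message)
    for received in message.get_all("Received", []):
        for ip_text in IP_PATTERN.findall(received):
            if country_of(ip_text) in blocked_countries:
                return True
    return False


raw = (
    "Received: from mail.example.net (mail.example.net [203.0.113.7]) by mx.example.org\n"
    "Subject: hello\n"
    "\n"
    "message body\n"
)
print(is_blocked(raw, {"XA"}))
```

Received headers can be forged by the sender, so a production filter would need to decide which of them to trust; this sketch trusts them all.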
Purify Blacklist. Purify supports a blacklist of phrases. If any of the phrases shows up in an email’s message content, the message is marked as spam. Blacklist entries can be plain text (viagra) or powerful regular expressions (v[i1!]agra). What if you don’t know how to use regular expressions? Enter your words into Purify’s Obfuscated Words Checker, and it will generate them for you!
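As an illustration of the idea, here is a small Python sketch of a phrase blacklist and of a generator for obfuscation-tolerant patterns. The substitution table and function names are hypothetical and say nothing about how Purify’s Obfuscated Words Checker works internally. One detail to watch when writing such patterns by hand: inside a character class the “|” is an ordinary character, so the alternatives are simply listed side by side, as in [i1!].

```python
import re

# Blacklist entries: one plain-text phrase and one regular expression.
BLACKLIST = [
    re.compile(r"viagra", re.IGNORECASE),
    re.compile(r"v[i1!]agra", re.IGNORECASE),
]

# Hypothetical look-alike characters spammers substitute for letters.
SUBSTITUTIONS = {"a": "a@4", "e": "e3", "i": "i1!", "o": "o0", "s": "s$5"}


def matches_blacklist(text, patterns=BLACKLIST):
    """True if any blacklist pattern occurs anywhere in the text."""
    return any(pattern.search(text) for pattern in patterns)


def obfuscated_pattern(word):
    """Build a regular expression that tolerates look-alike characters."""
    parts = []
    for char in word.lower():
        alternatives = SUBSTITUTIONS.get(char)
        if alternatives:
            # A character class listing the letter and its look-alikes.
            parts.append("[" + re.escape(alternatives) + "]")
        else:
            parts.append(re.escape(char))
    return "".join(parts)


print(matches_blacklist("Buy V1AGRA today"))
print(bool(re.fullmatch(obfuscated_pattern("viagra"), "v1@gr4")))
```

Both lines print True. A generated pattern can be compiled with re.IGNORECASE and appended to the blacklist like any hand-written entry.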
Purify Spam Reporting. Purify allows you to trace and report spammers to their Internet Service Providers. Many users have had great success reducing the volume of spam email that they receive by tracing and reporting spam. This is very easy to use. Report your spam with the click of a button.
Purify Email Designates. Purify lets you define your “proper name” for an account. If your name is “Betty Smith” and a spammer addresses a message to your account under a different proper name, such as “John Doe,” Purify will pick this up and mark the item as spam.
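The check can be sketched in a few lines of Python using the standard library’s address parser. The function name, the exact-match comparison, and the decision to let through messages that carry no display name at all are assumptions made for this example, not a description of Purify’s behavior.

```python
from email.utils import getaddresses


def misaddressed(to_header, my_address, my_name):
    """True if the message is sent to my address under someone else's name."""
    for name, address in getaddresses([to_header]):
        if address.lower() == my_address.lower() and name:
            return name.strip().lower() != my_name.strip().lower()
    # No display name was given for my address, so there is nothing to compare.
    return False


print(misaddressed('"John Doe" <betty@example.com>', "betty@example.com", "Betty Smith"))
print(misaddressed("Betty Smith <betty@example.com>", "betty@example.com", "Betty Smith"))
```

The first call prints True and the second prints False. A real filter would probably want a looser comparison, so that “Betty” or “B. Smith” is not treated as a stranger’s name.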
Purify Ignorelist. If you are tracing and reporting spam, and are not interested in seeing the abuse desk responses, you can ignore them using this feature.
Purify Whitelist. The user can accept email from an easily imported friends list, or by using whitelist phrases (“you have a payment!”).
Purify Bayesian type filtering featuring the Purify Lexical Analysis Tool (PLAT).
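The general idea behind PLAT, keeping only the “meaningful” words before the statistics are run, can be sketched in Python as follows. This is a guess at the shape of the technique only: the article does not say which five parts of speech PLAT uses or how it decides what is meaningful, so the miniature lexicon, the tags, and the MEANINGFUL set below are all invented.

```python
import re

# Hypothetical miniature lexicon mapping words to a part of speech.
# A real tool would use a full English dictionary.
LEXICON = {
    "she": "pronoun",
    "takes": "verb",
    "it": "pronoun",
    "up": "preposition",
    "the": "article",
    "stairs": "noun",
    "invoice": "noun",
    "overdue": "adjective",
    "your": "pronoun",
}

# Assumed choice of which parts of speech count as meaningful.
MEANINGFUL = {"noun", "verb", "adjective"}


def meaningful_words(text):
    """Keep only dictionary words whose part of speech is in MEANINGFUL."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if LEXICON.get(word) in MEANINGFUL]


print(meaningful_words("She takes it up the stairs."))
print(meaningful_words("Your invoice xqzt overdue"))
```

In this sketch, filler words and anything missing from the dictionary, such as random gibberish, never reach the statistics, so they can neither be trained on nor used to sway a rating.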
When email counts, count on Purify as your email filtering solution.