Monday, September 13, 2004

Fighting a bazillion ways to spell v1*A*g*ra...

Well... I've been thinking further about spam over the past few days... In particular, how there are so many ways to spell various spam words.

It reminded me of a project I worked on to clean up address information in a database of addresses that spanned the country... Customer Service Reps (CSRs) were essentially creating multiple records for the same address by entering data inconsistently. The data had been collected over several years with little validation, yet it was a key piece of data for searching the system...

We created multiple scripts to rate the likelihood that 2 addresses were actually the same, and to merge these addresses... Our system was case sensitive, so that:
123 Main St.
and
123 MAIN ST.

turned out to be the same address, and could be merged... We added further logic to consistently map Street to St., Drive to Dr., etc. (there were about 100 street suffixes...)
So we would also recommend merging
123 Main Street
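
Something like the case and suffix mapping above could be sketched in Python as follows. This is only an illustration, not the original scripts: the `SUFFIXES` table and the `normalize_address` name are made up here, and the real project had about 100 suffix entries rather than three.

```python
import re

# Hypothetical suffix table for illustration; the real one had ~100 entries.
SUFFIXES = {
    "street": "st",
    "drive": "dr",
    "avenue": "ave",
}


def normalize_address(address: str) -> str:
    """Lower-case, drop punctuation, and map long street suffixes to short forms."""
    # Remove anything that is not a letter, digit, underscore, or whitespace.
    cleaned = re.sub(r"[^\w\s]", "", address.lower())
    # Replace each word found in the suffix table; leave other words alone.
    return " ".join(SUFFIXES.get(word, word) for word in cleaned.split())


# All three spellings collapse to the same key, so they become merge candidates.
variants = ["123 Main St.", "123 MAIN ST.", "123 Main Street"]
keys = {normalize_address(v) for v in variants}
```

Note that this sketch maps every word, not just the last one, so a street actually named "Drive" would be shortened too. A real version would need to be more careful about word position.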

Now some cities also had East / West...
123 Main St. East and 123 Main St. West were probably not the same... But 123 Main St. and 123 Main St. East could be the same - if the postal / zip codes also matched... Consider that the CSR might type W, West, or, for data in Quebec, Ouest (or even just O) - so we needed mapping rules there too...
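
A rough sketch of the direction rule, assuming a trailing direction word and a postal code stored next to each address. The `DIRECTIONS` table and the function names are invented for this example. Treating two addresses with the same street and same direction as a match without looking at the postal code is a simplification of this sketch, not something the original scripts are known to have done.

```python
# Hypothetical mapping of what a CSR might type to one canonical direction.
DIRECTIONS = {
    "e": "E", "east": "E",
    "w": "W", "west": "W",
    "o": "W", "ouest": "W",  # Quebec data: Ouest / O means West
}


def split_direction(address: str):
    """Return (street part, canonical direction or None)."""
    words = address.lower().replace(".", "").split()
    if words and words[-1] in DIRECTIONS:
        return " ".join(words[:-1]), DIRECTIONS[words[-1]]
    return " ".join(words), None


def may_be_same(addr_a: str, postal_a: str, addr_b: str, postal_b: str) -> bool:
    """Decide whether two addresses are merge candidates."""
    street_a, dir_a = split_direction(addr_a)
    street_b, dir_b = split_direction(addr_b)
    if street_a != street_b:
        return False
    if dir_a == dir_b:
        return True
    if dir_a is not None and dir_b is not None:
        return False  # East vs. West: probably different places
    # Only one side has a direction: fall back on the postal / zip code.
    return postal_a.replace(" ", "").upper() == postal_b.replace(" ", "").upper()
```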

Then there were the typos... Young St. vs. Younge St. vs. Youngge St... We tweaked the script to modify the score if the names matched after removing vowels or duplicate letters (two g's in a row)... Matches increased the likelihood that much more...
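
The vowel / duplicate-letter trick might look roughly like this. The `skeleton` and `typo_bonus` names and the 0.25 bonus are assumptions made up for the example; the actual scoring weights from the project aren't given here.

```python
import re


def skeleton(name: str) -> str:
    """Reduce a street name to a rough consonant skeleton."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    # Collapse runs of the same letter first ("gg" -> "g") ...
    collapsed = re.sub(r"(.)\1+", r"\1", letters)
    # ... then drop the vowels.
    return re.sub(r"[aeiou]", "", collapsed)


def typo_bonus(name_a: str, name_b: str, bonus: float = 0.25) -> float:
    """Extra score to add when two names share a skeleton (hypothetical weight)."""
    return bonus if skeleton(name_a) == skeleton(name_b) else 0.0


# "Young", "Younge" and "Youngge" all reduce to the same skeleton.
skeletons = {skeleton(n) for n in ["Young", "Younge", "Youngge"]}
```

A skeleton match is only a hint (plenty of different names share consonants), which is why it adjusts the score instead of deciding the merge on its own.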

This same kind of logic could be applied to spam... Map \/ to V, map all of the accented characters to the plain letters they might stand in for (we did something similar for accented characters in our Quebec addresses as well)... 1 becomes i, strip spaces, remove asterisks, etc., etc...

Pretty soon you end up with "buyviagranow" and you can then pick out the evil phrases and you're laughing...
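
As a minimal sketch of that idea, assuming Python's standard `unicodedata` module for the accent stripping. The substitution tables and the one-word `BLOCKLIST` are placeholders invented for the example; a real filter would need far more of both.

```python
import re
import unicodedata

# Hypothetical substitution tables; a real filter would need many more entries.
MULTI_CHAR = {"\\/": "v"}           # the two characters \ and / drawn as a V
SINGLE_CHAR = str.maketrans({"1": "i"})
BLOCKLIST = ("viagra",)             # placeholder list of "evil phrases"


def canonicalize(text: str) -> str:
    """Undo common obfuscations and squeeze the text into one plain string."""
    text = text.lower()
    for disguised, plain in MULTI_CHAR.items():
        text = text.replace(disguised, plain)
    # Split accented letters into base letter + combining mark, drop the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.translate(SINGLE_CHAR)
    # Strip spaces, asterisks, and any other non-alphanumeric filler.
    return re.sub(r"[^a-z0-9]", "", text)


def flagged_phrases(text: str) -> list:
    """Return the blocklisted phrases found in the canonical form."""
    canonical = canonicalize(text)
    return [phrase for phrase in BLOCKLIST if phrase in canonical]
```

Because spaces are stripped, a phrase can also turn up across a word boundary in innocent text, so in practice a hit like this is probably better used as one input to a score than as an outright verdict.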

Our technique for finding matches was pretty strong - unfortunately it turned out that the CSRs were taking advantage of the way addresses got stored separately to work around a system / design flaw... Weeks of migration work had to be backed out... But the mapping exercise was still an interesting / challenging examination of handling / validating / mapping relatively complex bad data...

6 Comments:

Blogger Helen said...

Someone spam me with chocolate...

December 16, 2004 1:54 PM  
Anonymous Anonymous said...

DSPAM already handles this excellently. You can see the paper published there on Bayesian Noise Reduction, and you can download *and use right away* the state-of-the-art DSPAM anti-spam engine.

-- Asheesh Laroia.

December 18, 2004 4:32 AM  
Anonymous Anonymous said...

Regarding spam misspellings and such, check out http://www.adamlyon.com/spam/ (specifically the "obfuscated words" section and the reference file: http://www.adamlyon.com/spam/spam_filter_regex.txt )

December 18, 2004 5:03 AM  
Blogger supergugler said...

Think about the CPU cycles this takes if your mailserver handles thousands of emails per day!

December 18, 2004 6:49 AM  
Anonymous Anonymous said...

How about you filter anything with accented characters and asterisks in general?

What kind of formal email address subject line would include that in the first place?

What kind of informal email address subject line would include that in the first place?

Friends, coworkers, bosses, no one in the right mind would have ***** all over the place and accented characters. That's just plain stupid.

December 18, 2004 12:29 PM  
Anonymous Anonymous said...

I'm Portuguese and here we use lots of accented characters :-)

cheers,
António

December 21, 2004 5:40 AM  
