Friday, September 17, 2004

Starting a business - things to consider...

I've been talking to various friends, and I'm constantly reading blogs about people who are out of work and thinking of starting up their own business... As someone who's started a couple of businesses, I don't claim to know everything, but I've learned a thing or two, either directly or through observations of clients I have worked with...

First of all, for most people, money doesn't just fall into their laps - if it did, humans would have all just evolved giant laps to catch all of the money, and we'd all be happy... Most people who get money have to be smart, hardworking, and/or lucky (and combinations of these are a bonus)...

It's one thing to gather stats on how many people use / benefit from a free service, but how many people will actually PAY for that service? I wanted to look around for a good example, and came across an excellent one, SourceForge... It's been around for 5 years, has gobs of users, and provides a truly valuable service... How much money did they actually collect? Well, looking at the supporter log, we can see that of the bazillion SourceForge users (many of whom have been using the service for years...), 55 made contributions in July (and publicized it...)... No idea how much the contributions are for, or how many weren't publicized, but if 100 people total made donations, someone needing $3500 / month would have needed the average donation to be $35... And if they were incurring expenses (I suspect SourceForge doesn't just run itself, it's got hardware / software / additional labor expenses / etc...), and the SF people were counting just on these donations, they'd be in a world of trouble...

Think that the guy running the corner store is raking in dough? We'd all be running corner stores if that was the case ... He has competition, he has margins, he has expenses, he probably works over 16 hours a day, he's probably just getting by... If he's driving a Mercedes, it would be thanks to years and years of hard work, struggling to make ends meet.

Think about everyone you interact with in a day... How many stores do you walk by without going in? How many do you go into without buying anything?

All business plans should be realistic and include not just what you think you can get, but also a worst case scenario, and a plan for what happens if that were to occur...

I remember talking with a friend last year, who had a nice little severance package, and was starting a business (with a great idea, clear idea of who his customers were, and why they needed the product)... One of the phrases that came out of the mouth of either himself or his wife was something to the effect that he'll just work 20 hours a week, and make as much money as he did as a salaried employee, if not more... My wife was with me, and immediately looked at me, giving me a "just wait..." look in her eyes... His business is now starting to do well... but he's lucky to be home by midnight most days...

I'm not saying that starting a business is impossible - millions of people do it, and many of them succeed... I'm just saying to be prepared to put in some effort, and be prepared to handle many of the crap jobs that need to be handled at a new / growing business...

Monday, September 13, 2004

Fighting a bazillion ways to spell v1*A*g*ra...

Well... I've been thinking about spam further the past few days... In particular, how there are so many ways to spell various spam words.

It reminded me of a project I was working on to do data clean-up of some address information for a database of addresses that spanned the country... Customer Service Reps (CSRs) were essentially creating multiple address records by entering data inconsistently. This data had been collected over several years with little validation, but was a key piece of data for searching the system...

We created multiple scripts to rate the likelihood that 2 addresses were actually the same, and to merge these addresses... Our system was case sensitive, so it stored these as two records:
123 Main St.
123 MAIN ST.

To our scripts they ended up being the same record, and could be merged... We added further logic to consistently map Street to St., Drive to Dr., etc. (there were about 100 street suffixes...)
So we would also recommend merging
123 Main Street
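
For anyone who wants to see the case and suffix mapping in code, here's a minimal Python sketch. It is not our original script: the function name `normalize_address` and the tiny `SUFFIXES` table are made up for illustration (the real table had about 100 entries).

```python
import re

# Hypothetical, partial suffix table; the real one had about 100 entries.
SUFFIXES = {"street": "st", "st": "st", "drive": "dr", "dr": "dr"}


def normalize_address(raw):
    """Lower-case the address, drop periods and commas, and map street suffixes."""
    words = re.sub(r"[.,]", " ", raw.lower()).split()
    return " ".join(SUFFIXES.get(word, word) for word in words)


# All three spellings collapse to the same key, so they become merge candidates.
print(normalize_address("123 Main St."))
print(normalize_address("123 MAIN ST."))
print(normalize_address("123 Main Street"))
```

Two records whose normalized forms are equal would then be recommended for merging.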

Now some cities also had East / West...
123 Main St. East and 123 Main St. West were probably not the same... But 123 Main St. and 123 Main St. East could be the same - if the postal codes / zip codes matched as well... Consider that the CSR might type W, West, or for data in Quebec, Ouest (or even just O) - so we needed mapping rules there too...
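
The direction rule could be sketched roughly like this in Python. The names `DIRECTIONS`, `split_direction` and `could_be_same` are hypothetical, the table only covers east / west (including "est", which I'm assuming as the French counterpart), and the postal codes in the usage lines are invented.

```python
# Hypothetical, partial direction table (north / south left out for brevity).
DIRECTIONS = {
    "e": "E", "east": "E", "est": "E",
    "w": "W", "west": "W", "ouest": "W", "o": "W",
}


def split_direction(address):
    """Return (street part, direction code or None) for one address line."""
    words = address.lower().replace(".", " ").split()
    if len(words) > 1 and words[-1] in DIRECTIONS:
        return " ".join(words[:-1]), DIRECTIONS[words[-1]]
    return " ".join(words), None


def could_be_same(addr_a, postal_a, addr_b, postal_b):
    """Apply the East / West rule: conflicting directions never match,
    a missing direction only matches when the postal codes agree."""
    street_a, dir_a = split_direction(addr_a)
    street_b, dir_b = split_direction(addr_b)
    if street_a != street_b:
        return False
    if dir_a == dir_b:
        return True  # same direction, or no direction on either side
    if dir_a and dir_b:
        return False  # e.g. East vs. West
    # One side has no direction: fall back to comparing postal codes.
    clean_a = postal_a.replace(" ", "").upper()
    clean_b = postal_b.replace(" ", "").upper()
    return clean_a == clean_b


print(could_be_same("123 Main St. East", "M5V 1A1", "123 Main St. West", "M5V 1A1"))
print(could_be_same("123 Main St.", "M5V 1A1", "123 Main St. E", "m5v1a1"))
```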

Then there were the typos... Young St. vs. Younge St., vs. Youngge St... We tweaked the script to modify the score if the names matched after removing vowels or duplicate letters (two g's in a row)... Such matches increased the likelihood that much more...
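
Here's a rough Python sketch of that typo check. The `skeleton` and `typo_score` names and the 1.0 / 0.8 / 0.0 weights are placeholders I've picked for illustration, not the numbers we actually used. I'm also assuming the first letter is kept even if it's a vowel, so that short names don't collapse to nothing.

```python
import re


def skeleton(word):
    """Drop vowels (after the first letter) and collapse repeated letters."""
    word = word.lower()
    if not word:
        return ""
    rest = re.sub(r"[aeiou]", "", word[1:])
    return re.sub(r"(.)\1+", r"\1", word[0] + rest)


def typo_score(name_a, name_b):
    """Hypothetical scoring: exact match beats a skeleton match beats nothing."""
    if name_a.lower() == name_b.lower():
        return 1.0
    if skeleton(name_a) == skeleton(name_b):
        return 0.8
    return 0.0


print(skeleton("Young"), skeleton("Younge"), skeleton("Youngge"))
print(typo_score("Young", "Youngge"))
```

The score would be combined with the other signals (suffix, direction, postal code) rather than used on its own.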

This same kind of logic could be applied to spam... Map \/ to V, map all of the accented characters to the plain letters they stand in for (we did something similar for accented characters in our Quebec addresses as well)... 1 becomes i, strip spaces, remove asterisks, etc, etc...

Pretty soon you end up with "buyviagranow" and you can then pick out the evil phrases and you're laughing...
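
A minimal Python sketch of that squashing step, assuming a made-up `SUBSTITUTIONS` table and a made-up word list. Only the \/ and 1 mappings come from the rules above; the 0 and @ entries are extra guesses of mine, and a real filter would need a much bigger table.

```python
import unicodedata

# Look-alike substitutions. "\/" and "1" are from the post; "0" and "@" are assumed extras.
SUBSTITUTIONS = {"\\/": "v", "1": "i", "0": "o", "@": "a"}


def squash(text):
    """Reduce a message to lower-case letters and digits with look-alikes mapped back."""
    text = text.lower()
    for fake, real in SUBSTITUTIONS.items():
        text = text.replace(fake, real)
    # Decompose accented characters, then drop the combining accent marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Strip spaces, asterisks and any other punctuation.
    return "".join(ch for ch in text if ch.isalnum())


def find_spam_words(text, spam_words):
    """Return the words from spam_words that appear in the squashed text."""
    squashed = squash(text)
    return [word for word in spam_words if word in squashed]


print(squash("Buy v1*A*g*ra now"))
print(find_spam_words("Buy \\/1agra now", ["viagra", "cialis"]))
```

Mapping every 1 to i will obviously mangle real numbers, so this kind of squashing is only useful for spotting words, not for reading the message.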

Our technique for finding matches was pretty strong - unfortunately it turns out that the CSRs were taking advantage of the way addresses got stored separately to work around a system / design flaw... Weeks of migration work had to be backed out... But the mapping exercise was still an interesting / challenging exercise in handling / validating / mapping relatively complex bad data...

Thursday, September 09, 2004

My spam fighting idea...

I've had the same email address for about 10 years, I use it publicly, and I get a lot of spam... over 1000 messages / day... I use several layers of spam cleaning (my mail routes through 2 ISPs, with 2 mail filters at one and 1 at the other), plus I use Mozilla with its built-in junk mail filter (which I love because it has learned what I consider spam or not...)...

BTW: Mozilla folks... After marking about 200,000 pieces of spam, the junk mail filter is getting to be pretty slow now...

Anyway... Stuff still gets through... The main things that come through these days are messages that are just images and a link to a website... The problem is that there isn't a lot of content to filter on in terms of determining whether the message is spam or not... I might have a friend send a message like... "Check out this picture - it's hilarious"... and a picture of whatever is funny this week... And I'll have spammers send the same thing, but the picture is a big pile of Viagra, and I can go buy it from their website...

My idea is this... I glanced through the patent directory and didn't see anything, so I'm establishing this entry as prior art for anyone that implements it, and giving the idea up to the public domain...

My [insert spam filter here - on the mail server or on the client] should actually follow the links on these short little messages, and filter on the content that the link leads to... Here's a message that came today... it's actually got a plaintext part and an html part, and the plaintext part has nothing to do with the html (the text is designed to look ordinary and defeat the various spam filters), but the email reader displays the html:

CJ: Message headers here...

This is a multi-part message in MIME format.

Content-Type: multipart/alternative;

CJ: This plain text throws off content based filters... Most mail
readers won't even display this...

Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 7bit


Her dark, pretty face xvhfglittered there in front of me. I stood with my
mouth open, trying to ljeu think of some way to answer her. We were locked
together this way for dcsqmaybe a couple of seconds; then the sound of the
mill jumped a hitch, and something kmdncommenced to draw her back away from
me. A string somewhere I didn't see hooked on oufb that flowered red skirt and
was tugging her back. Her fingernails peeled khjx.
Her dark, pretty face bpryglittered there in front of me. I stood with my
mouth open, trying to mhtx think of some way to answer her. We were locked
together this way for cpegmaybe a couple of seconds; then the sound of the
mill jumped a hitch, and something plbacommenced to draw her back away from
me. A string somewhere I didn't see hooked on cbes that flowered red skirt and
was tugging her back. Her fingernails peeled euns.


CJ: This is the part that I end up seeing... low content, an image, and
a whole bunch of links to drug websites...

Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 7bit

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<a href="">
<img src="cid:mexjheto_mlsjprse_mzwuzxlq" border="0"></a>

Mirror sites:   
<a href="">#1</a>   
<a href="">#2</a>   
<a href="">#3</a>

Starbuck Davis

Starbuck Davis

SO... My filters have been fooled... Following the first link, however, I'm
taken to a pretty slick looking website... The content on the website includes:

Super Viagra, or CIALIS is used to treat erectile dysfunction, also known as impotence. This is when a man cannot get, or keep, a hard erect penis suitable for sexual activity.

The active ingredient in CIALIS tablets, tadalafil, belongs to a group of medicines called phosphodiesterase type 5 inhibitors.
Following sexual stimulation CIALIS helps the blood vessels in your penis to relax which allows the blood flow into your penis.
The result is the improved erectile function.
CIALIS will not help you if you do not have erectile dysfunction.

So I'm pretty sure that there's some good content there that would have my
filters freaking out...
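
To make the idea a bit more concrete, here's a rough Python sketch using only the standard library. Nothing here is an existing filter's API: `html_links`, `fetch_page` and `linked_content_is_spam` are names I've made up, the keyword check is a stand-in for whatever scoring a real filter uses, and the 10,000 byte read limit is just borrowed from the 10K figure I use further down.

```python
import email
from email import policy
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the http(s) href targets of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def html_links(raw_message):
    """Return the links found in the html part of a raw email message."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("html",))
    if body is None:
        return []
    collector = LinkCollector()
    collector.feed(body.get_content())
    return collector.links


def fetch_page(url, limit=10_000):
    """Download at most `limit` bytes of the linked page."""
    with urlopen(url, timeout=10) as response:
        return response.read(limit).decode("utf-8", errors="replace")


def linked_content_is_spam(raw_message, spam_words, fetch=fetch_page):
    """Follow each link and check the page text instead of the message text."""
    for url in html_links(raw_message):
        try:
            page = fetch(url).lower()
        except OSError:
            continue  # unreachable site: nothing to score
        if any(word in page for word in spam_words):
            return True
    return False
```

The `fetch` parameter is there so the page download can be swapped out (for a cache, a proxy, or a fake when testing); in a real filter the fetched text would be handed to the same scoring the filter already applies to message bodies.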

This technique would have a couple of advantages:

1) If _every single email_ that a spammer sent resulted in their website delivering 10K of data, there would be some pretty serious (i.e. expensive) bandwidth used up... (And it might result in spammers effectively DoS'ing their own sites)...

2) The website(s) for these spammers are pretty professional... they spell "Viagra", not "V1agr!" - so the link is more representative of the content of the message than the actual message content...

3) Legitimate emailers that send 50-100 messages wouldn't have bandwidth or DOS issues... legitimate emailers would have links to sites that are of interest to me... the content on their sites would match my preferences, and wouldn't be filtered...
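
To put rough numbers on points 1 and 3, here's a back-of-the-envelope Python sketch. The 10K per fetch and the 100 message sender come from the list above; the one million message spam run is purely an assumed figure for illustration, and `bandwidth_bytes` is a made-up helper name.

```python
def bandwidth_bytes(messages_sent, bytes_per_fetch=10_000):
    """Total bytes served if every delivered message triggers one page fetch."""
    return messages_sent * bytes_per_fetch


# Assumed spam run of 1,000,000 messages vs. a legitimate sender of 100.
spammer = bandwidth_bytes(1_000_000)
legit = bandwidth_bytes(100)
print(spammer / 1e9, "GB for the spammer")
print(legit / 1e6, "MB for the legitimate sender")
```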

So there it is folks... Mail / spam developers, go build this into server side and client side mail so that just a bit more spam will be auto-filtered before I see it... :)