Stopping the Spammers
The Spam Trap
This is a work in progress. I think the idea is sound, but I haven't yet written the code to implement it.
Like everyone else I get a lot of junk email. In fact Scientific American, in their April 2005 issue, claimed that for about one third of email users, 80% or more of the email they receive is junk. This third will undoubtedly include a high proportion of the most technologically savvy, for reasons I'll explain in a minute. The sheer volume of spam means that legitimate internet users are paying for the infrastructure that is being leeched by the low-life spammers.
The Scientific American article discusses all sorts of technological techniques that might be used to help prevent spam - making email senders verify their identity, for example. I quite liked the 'reverse Turing test' idea: small puzzles to make an email sender verify that they are a human rather than a robot. That wouldn't work of course - not only is a lot of automated email not junk (newsletters, system warnings, etc.), but most humans are lazy and just wouldn't bother. Nice idea though.
There is an alternative, and it involves tackling the problem from the other end.
Spammers fire off thousands of emails per second - from zombie computers, via open relay proxies, from lax ISPs, wherever - the volume is staggering. They do this by trading email lists. Once your address is on one list, it won't be long before it's on them all. So where do these email lists come from? The primary source of email addresses is spiders... not our furry arachnid friends, but the same web technology used by search engines. These 'robotic' programs crawl the web, reading every page they find. Each page contains links to other pages, which will also be crawled. In this way the number of pages a spider has to visit increases exponentially, until it has covered most of the internet. A search engine spider does this to index each page for keywords and content. Spam spiders also follow links - but instead of indexing the contents they harvest any email addresses on the page. Guestbooks, blogs, bulletin boards, newsgroups, home pages: these all make rich pickings for the spambots.
Most people can keep their inboxes free from junk mail by not revealing their email address on the internet. If, like me, you run a website or administer an open source project, you want your email address to be a matter of public record - and bam, in pours the junk mail. This is why those who are the most technologically aware - the programmers and the webmasters - are likely to be those who suffer most.
A lot of the email addresses on the internet will be old, their owners having long ago been forced to abandon the mailbox due to the constant deluge of bilge. A certain proportion of the addresses will be real, and a few idiots will give the spammers money - ensuring the cycle continues. There are a few other ways of getting email addresses: attacking databases, emailing random addresses in a domain, tricking a server into verifying whether a made-up address is real, and so on. However, most email addresses come from the spam spiders. To verify this for yourself, sign up for a new email address and post it to a couple of bulletin boards. See how long it is before the junk starts flowing into it.
There is a certain cost trade-off in sending junk mail: it's cheap, but not free. The Scientific American article calculated that if each junk email costs one hundredth of a cent to send, and spammers average $11 profit per sale, they need to make a minimum of around one sale per 100,000 emails for it to remain profitable.
So what if the number of email addresses available on the internet suddenly increased tenfold, or one hundredfold? And what if only a tiny proportion of these were genuine? If the number of 'junk addresses' on their lists increased tenfold, then the number of emails they need to send per sale also increases tenfold. This makes the economics that much more unfriendly for the spammers. Instead of one in twenty of the addresses on their lists being genuine, it would be one in two hundred, or even one in two thousand. If a spammer currently makes one sale per fifty thousand emails, they would need to send five hundred thousand, or five million. Spamming, on the current model, becomes uneconomical.
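The break-even figure and the effect of diluting the lists can be checked with a few lines of Python. All the numbers below are the article's estimates, not measured data; the arithmetic is done in hundredths of a cent to keep it exact:

```python
# All figures are the article's estimates, not measured data.
# Work in hundredths of a cent so the arithmetic stays exact.
cost_per_email = 1                  # one hundredth of a cent per email sent
profit_per_sale = 11 * 100 * 100    # $11 profit, in hundredths of a cent

# Break-even: the most emails a spammer can send per sale and still profit.
break_even = profit_per_sale // cost_per_email
print(break_even)                   # 110000 - roughly one sale per 100,000

# Diluting the address lists tenfold or one hundredfold multiplies the
# emails needed per sale by the same factor.
emails_per_sale = 50_000            # 'one sale per fifty thousand emails'
print(emails_per_sale * 10)         # 500000
print(emails_per_sale * 100)        # 5000000
```

The dilution factor multiplies straight through, which is why even a tenfold flood of junk addresses shifts the economics so sharply.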
Any webpage linked to will be crawled by the spambots, and any email addresses on it (real or not) will be harvested. What if a large proportion of website owners and bloggers dedicated a few pages of their site to containing junk email addresses? There are literally millions of private websites and blogs. A lot of these are run by the sort of people who are likely to be most affected by spam.
Let's look at the numbers. Let's estimate that the average spammer has access to one hundred million email addresses. That means we want to seed the internet with at least one billion fake email addresses. Suppose a fake web page can contain an average of twenty-five email addresses before it becomes suspicious, and a site can host twenty or more pages like this. Then with two million webmasters on board, the system will work. If webmasters dedicate more pages, then fewer individual sites need to participate.
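Under those (admittedly wild) estimates, the arithmetic works out like this:

```python
# All figures are the article's guesses, not measured data.
target_addresses = 1_000_000_000    # one billion fake addresses wanted
addresses_per_page = 25             # average before a page looks suspicious
pages_per_site = 20                 # junk pages each participating site hosts

addresses_per_site = addresses_per_page * pages_per_site    # 500 per site
webmasters_needed = target_addresses // addresses_per_site
print(webmasters_needed)            # 2000000 - two million webmasters
```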
So how difficult is this, and what other issues are there?
Knocking together a script that autogenerates the sites is in fact very easy. It needs to create (pseudo)random email addresses, insert them into randomly generated web pages, and make sure these pages cross-link to each other.
- invisible links from the 'real' website to the junk pages
- should the links use rel="nofollow"?
- cross-linked junk pages, so the harvesters find them all
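As a sketch of what such a generator might look like - a minimal Python script, with word lists, domains and page layout invented purely for illustration:

```python
import random

# Invented building blocks - any plausible-looking names and domains will do.
NAMES = ["contact", "info", "sales", "admin", "support", "news", "mail"]
DOMAINS = ["example-widgets.com", "fake-mail.net", "junktrap.org"]

def fake_address(rng):
    """Build a plausible but entirely invented email address."""
    name = rng.choice(NAMES) + str(rng.randint(1, 999))
    return "%s@%s" % (name, rng.choice(DOMAINS))

def junk_page(page_num, total_pages, rng):
    """One HTML page of fake addresses, cross-linked to its siblings."""
    # Nought to fifty addresses per page - randomness is the key.
    addresses = [fake_address(rng) for _ in range(rng.randint(0, 50))]
    body = "\n".join('<a href="mailto:%s">%s</a><br>' % (a, a)
                     for a in addresses)
    # Cross-link to a few sibling pages so the harvesters find them all.
    links = "\n".join('<a href="page%d.html">more contacts</a>' % n
                      for n in rng.sample(range(total_pages), 3)
                      if n != page_num)
    return "<html><body>\n%s\n%s\n</body></html>" % (body, links)

rng = random.Random(2005)   # seeded so each run is reproducible
pages = [junk_page(i, 20, rng) for i in range(20)]
```

Writing each page out as `page0.html`, `page1.html` and so on (matching the cross-links) is all that remains; a real deployment would also vary the page templates so the junk pages don't all share one fingerprint.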
The battle against spam has become a bit like hand-to-hand combat: every measure introduced to combat the spammers, they seem to overcome. This technique plays them at their own game. Even if it just forces spammers to find more ingenious ways of getting to people, it would make harvesting a less effective means of gathering email addresses. This would make the web a safer place for email addresses.
- Will this cause websites to be blacklisted by Google for including pages with lots of junk words? (This is a typical tactic used by spammers to manipulate search rankings.) Hopefully not: because the junk pages don't cross-link to other sites, they won't be manipulating search rankings (which rely on cross-links from keyword-rich pages). See also the question below about search bots - 'robots.txt' can be used to eliminate this problem, by instructing search engines not to index these pages.
- Don't we risk (accidentally) listing real email addresses, causing them to receive spam? Hmm... yes, possibly, but only for a very tiny minority of our randomly generated addresses. In the long run this is going to make email addresses harvested from the web virtually useless to spammers, so it ought to be only a short-term nuisance. We could add a few random characters to the addresses; this would make it less likely that we hit real email addresses, but also possibly easier to spot that they're fake.
- Not only spam harvesters but also search bots will crawl the pages. This will increase traffic and skew search engine results. True, but not such a problem. If you use 'robots.txt' wisely, then genuine search engines will respect it and ignore the pages. Spam harvesters are unlikely to respect 'robots.txt'. (Maybe we can teach them this too...)
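For example, if the junk pages all live under one directory of their own (the path here is just a placeholder), a couple of lines in 'robots.txt' at the site root keeps honest crawlers out:

```
User-agent: *
Disallow: /spamtrap/
```

Genuine search engines fetch this file before crawling and will skip everything under the disallowed path; harvesters that ignore the convention walk straight into the junk pages.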
This project relies on the fact that a computer can't tell the difference between a real page and a junk page. Without using complicated and unreliable heuristics on the whole page, it is very difficult for the spambots to tell them apart. For this reason the project has a high chance of success in the short term. At that point email harvesters will become an ineffective means of gathering email addresses, and we ought to see the tide of junk email reduced - diluted by the majority of spam being sent to invented addresses. The spammers will inevitably not take this lying down, and will turn to other techniques. This will probably include trying to manually discover the fake websites and exclude them from the areas crawled. It might be harder to maintain enthusiasm for creating new areas with junk addresses when there is a perception that the battle has already been won.
If the project does show signs of being successful then it must be actively maintained. Junk email gets through our filters because it is hard to tell what is junk and what isn't; that is only true while it contains randomness and unpredictability. The project succeeds because the spammers are unable to tell which email addresses are junk and which are real. My guess is that they will develop heuristics to work out which pages are junk. We will need to continue to develop and change the way we host junk addresses to stay ahead of the spammers. But that's the difference - this time we're ahead of the spammers, and they're on the back foot trying to catch up.
|||Increasingly the spam is becoming more insidious. Not only do we have the financial and identity theft attempts of phishing - but hijacked, so-called zombie, computers can actually earn the spammers money. They can send the junk mail, or host the sites through which they sell their wares. More and more junk email is aimed at propagating the trojans that allow spammers to take over people's computers.|
|||An utterly wild guess.|
|||Another utterly wild guess.|
|||Yet another astoundingly random guess.|
|||It ought to be a range of between nought and fifty addresses. Randomness is the key.|
|||Volunteers could also set up fake sites using free web hosts, and post links to them in public places. Ten thousand volunteers, each setting up the equivalent of ten sites with one hundred pages, would get a quarter of a billion junk email addresses online. If the sites are autogenerated then this wouldn't be a lot of work for each volunteer.|
|||Modern dynamic languages like Python and Perl are perfectly suited to tasks like this.|
Last edited Tue Aug 2 00:51:34 2011.