Spam Trap
Stopping the Spammers
The Spam Trap

Note: This is a work in progress. I think the idea is sound, but I haven't yet written the code to implement it.

Junk Email

Like everyone else I get a lot of junk email. In fact Scientific American, in their April 2005 issue, claims that for about one third of email users 80% or more of the email they receive is junk. This third will undoubtedly include a high proportion of the most technologically savvy, for reasons I'll explain in a minute. The sheer volume of spam means that legitimate internet users are paying for the infrastructure that is being leeched by the low life spammers [1].

The Scientific American article discusses all sorts of technological techniques that might be used to help prevent spam - making email senders verify their identity, for example. I quite liked the 'reverse Turing test' idea: small puzzles to make an email sender verify that they are a human rather than a robot. That wouldn't work of course; not only is a lot of automated email not junk (newsletters, system warnings, etc), but most humans are lazy and just wouldn't bother. Nice idea though.

There is an alternative, and it involves tackling the problem from the other end.

Harvesting Addresses

Spammers fire off thousands of emails per second, from zombie computers, via open relay proxies, from lax ISPs, wherever - the volume is staggering. The addresses they send to come from traded email lists. Once your address is on one list, it won't be long before it's on them all.

So where do these email lists come from? The primary source of email addresses is spiders... not our furry arachnid friends, but the same web technology used by search engines. These 'robotic' programs crawl the web, reading every page they find. Each page contains links to other pages, which will also be crawled. In this way, the number of links a spider has to follow increases exponentially until it has covered most of the internet. A search engine spider does this to index each page for keywords and content. Spam spiders also follow links - but instead of indexing the contents they harvest the email addresses on the page.
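
To make the harvesting idea concrete, here is a minimal sketch of the crawl described above. It is not taken from any real spambot: the `crawl` function, the two regular expressions and the tiny in-memory `PAGES` site are all invented for illustration, and a real spider would fetch pages over the network and cope with far messier HTML.

```python
import re
from collections import deque

# Deliberately loose patterns; both are assumptions made for this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(pages, start):
    """Breadth-first walk over `pages` (a dict of page name -> HTML text).

    Follows every link that points at another known page and returns the
    set of email addresses found on the pages that were reached.
    """
    seen = {start}
    queue = deque([start])
    harvested = set()
    while queue:
        name = queue.popleft()
        html = pages.get(name, "")
        # Harvest every address-shaped string on this page.
        harvested.update(EMAIL_RE.findall(html))
        # Queue each linked page that has not been visited yet.
        for target in LINK_RE.findall(html):
            if target in pages and target not in seen:
                seen.add(target)
                queue.append(target)
    return harvested


# A hypothetical three page site; nothing links to orphan.html.
PAGES = {
    "index.html": '<a href="guestbook.html">Guestbook</a> Contact: alice@example.com',
    "guestbook.html": '<a href="index.html">Home</a> bob@example.org wrote: hello',
    "orphan.html": "carol@example.net",
}

if __name__ == "__main__":
    print(sorted(crawl(PAGES, "index.html")))
```

The address on `orphan.html` is never found, because nothing links to it. The flip side is that any page that is linked to gets harvested, whether the addresses on it are real or not.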

Guestbooks, blogs, bulletin boards, newsgroups, home pages: these all make rich pickings for the spambots. Most people can keep their inboxes free from junk mail by not revealing their email address on the internet. If, like me, you run a website or administer an open source project, you want your email address to be a matter of public record - and bam, in pours the junk mail. This is why those who are the most technologically aware - the programmers and the webmasters - are likely to be those who suffer most.

A lot of the email addresses on the internet will be old, their owners long ago having been forced to abandon the mailbox due to the constant deluge of bilge. A certain proportion of the addresses will be real, and a few idiots will give the spammers money - ensuring the cycle continues.

There are a few other ways of getting email addresses: attacking databases, emailing random addresses in a domain, tricking a server into verifying whether a made up address is real or not, etc. However, most email addresses come from the spam spiders. To verify this for yourself, sign up for a new email address and post the address to a couple of bulletin boards. See how long it is before the junk starts flowing in to it.

There is a certain cost trade off in sending junk mail: it's cheap but not free. The Scientific American article calculated that if each junk email costs one hundredth of a cent to send, and spammers average $11 profit per sale, they need to make a minimum of around one sale per 100 000 emails for it to remain profitable.

Increasing the Cost

So what if the number of email addresses available on the internet suddenly increased tenfold or one hundred fold? And what if only a tiny proportion of these were genuine? If the number of 'junk addresses' on their lists increased tenfold, then the number of emails they need to send per sale also increases tenfold. This makes the economics that much more unfriendly for the spammers.
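
The arithmetic above can be checked with a few lines of Python. This is only a rough sketch: the cost and profit figures are the ones quoted from the Scientific American article, while the function names and the starting point of one sale per 50 000 emails are assumptions made for illustration. Money is held in hundredths of a cent so the sums stay exact.

```python
# All amounts are in hundredths of a cent, to avoid floating point rounding.
COST_PER_EMAIL = 1                  # one hundredth of a cent per email
PROFIT_PER_SALE = 11 * 100 * 100    # $11 per sale


def break_even_emails_per_sale(cost_per_email=COST_PER_EMAIL,
                               profit_per_sale=PROFIT_PER_SALE):
    """Most emails a spammer can afford to send for each sale made."""
    return profit_per_sale // cost_per_email


def diluted_emails_per_sale(emails_per_sale, dilution_factor):
    """Emails needed per sale once the lists hold `dilution_factor` times
    as many addresses with no extra genuine ones among them."""
    return emails_per_sale * dilution_factor


def is_profitable(emails_per_sale, cost_per_email=COST_PER_EMAIL,
                  profit_per_sale=PROFIT_PER_SALE):
    """True if the sending cost per sale does not exceed the profit per sale."""
    return emails_per_sale * cost_per_email <= profit_per_sale


if __name__ == "__main__":
    print("break even:", break_even_emails_per_sale(), "emails per sale")
    for factor in (1, 10, 100):
        needed = diluted_emails_per_sale(50_000, factor)
        print(f"dilution x{factor}: {needed} emails per sale,",
              "profitable" if is_profitable(needed) else "loss making")
```

With these figures the break even point comes out at 110 000 emails per sale, which matches the article's "around one sale per 100 000". A spammer making one sale per 50 000 emails is comfortably in profit, but a tenfold dilution pushes the sending cost per sale to $50, well above the $11 earned.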

Instead of one in twenty [2] of the addresses on their lists being genuine, it would be one in two hundred, or even one in two thousand. If a spammer currently makes one sale per fifty thousand emails [3], they would need to send five hundred thousand or five million. Spamming, on the current model, becomes uneconomical.

So how can this happen? Any webpage that is linked to will be crawled by the spambots, and any email addresses on it (real or not) will be harvested. What if a large proportion of website owners and bloggers dedicated a few pages of their site to containing junk email addresses? There are literally millions of private websites and blogs. A lot of these are run by the sort of people who are likely to be most affected by spam.

The Numbers

Let's look at the numbers. Let's estimate [4] that the average spammer has access to one hundred million email addresses. That means we want to seed the internet with at least one billion fake email addresses. Suppose a fake web page can contain an average of twenty five email addresses [5] before it becomes suspicious, and a site can have twenty or more pages like this. Each site then hosts at least five hundred fake addresses, so when you have two million webmasters on board the system will work. If webmasters will dedicate more pages then you need fewer individual sites to participate [6].

Actually Doing It

So how difficult is this, and what other issues are there? Knocking together a script [7] that autogenerates the sites is in fact very easy. It needs to create (pseudo)random email addresses, insert them into randomly generated web pages, and make sure these pages cross link to each other.
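
As a rough illustration of how little code this takes, here is a minimal sketch of such a generator. It is not the script itself (that hasn't been written yet); the function names, the page layout and the use of the reserved `example.com` domain are all assumptions made for the example, and the pages contain no randomly generated prose, only addresses and links. The `sites_needed` helper simply repeats the sizing estimate from The Numbers.

```python
import random
import string


def fake_address(rng):
    """Return a pseudo-random address at example.com (a placeholder domain)."""
    length = rng.randint(5, 10)
    local = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{local}@example.com"


def build_site(num_pages=20, addresses_per_page=25, seed=None):
    """Return a dict mapping page file names to HTML text.

    Each page lists fake addresses and links to the next page in a ring,
    plus one other randomly chosen page, so a crawler that reaches any
    page can reach them all.
    """
    rng = random.Random(seed)
    names = [f"page{i}.html" for i in range(num_pages)]
    site = {}
    for i, name in enumerate(names):
        addresses = [fake_address(rng) for _ in range(addresses_per_page)]
        links = [names[(i + 1) % num_pages]]          # ring link
        others = [n for n in names if n != name and n not in links]
        if others:
            links.append(rng.choice(others))          # one random cross link
        items = "".join(
            f'<li><a href="mailto:{a}">{a}</a></li>\n' for a in addresses
        )
        nav = " ".join(f'<a href="{target}">{target}</a>' for target in links)
        site[name] = (
            f"<html><body>\n<ul>\n{items}</ul>\n<p>{nav}</p>\n</body></html>\n"
        )
    return site


def sites_needed(target_addresses, addresses_per_page=25, pages_per_site=20):
    """Number of participating sites needed to host `target_addresses`."""
    per_site = addresses_per_page * pages_per_site
    return -(-target_addresses // per_site)           # ceiling division


if __name__ == "__main__":
    site = build_site(seed=1)
    print(len(site), "pages generated")
    print(sites_needed(1_000_000_000), "sites needed for a billion addresses")
```

A real version would want addresses and pages that look far less uniform than these, since a single obvious domain and an identical layout on every page are exactly the sort of thing a simple heuristic could filter out.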

The battle against spam has become a bit like hand to hand combat. Every measure that is introduced to combat spam, the spammers seem to overcome. This technique plays them at their own game. Even if it just forces spammers to find more ingenious ways of getting to people, it would make address harvesting a less effective means of getting email addresses. This would make the web a safer place for email addresses.

Problems

This project relies on the fact that a computer can't tell the difference between a real page and a junk page. Without using complicated and unreliable heuristics on the whole page, it is very difficult for the spambots to tell the difference between a real page and a junk one. For this reason the project has a high chance of success in the short term. At that point the email harvesters will become an ineffective means of gathering email addresses and we ought to see the tide of junk email reduced - diluted by the majority of spam being sent to invented addresses.

The spammers will inevitably not take this lying down, and will turn to other techniques. These will probably include trying to manually discover the fake websites and remove them from the areas crawled. It might be harder to maintain enthusiasm for creating new areas with the junk addresses when there is a perception that the battle has already been won. If the project does show signs of being successful then it must be actively maintained.

Junk email gets through our filters because it is hard to tell what is junk and what isn't. This is only true if it contains randomness and unpredictability. In the same way, this project succeeds only while the spammers are unable to tell which email addresses are junk and which are real. My guess is that they will develop heuristics to work out which pages are junk. We will need to continue to develop and change the way we host junk addresses to stay ahead of the spammers. But that's the difference - this time we're ahead of the spammers and they're on the back foot trying to catch up.

Footnotes
Last edited Fri Feb 15 13:42:08 2008.