Grand Text Auto » Reversing the Spam Cannon

May 17, 2004

Reversing the Spam Cannon

by Nick Montfort · 2:07 pm

Traditional methods for combating spam on blogs – for instance, obfuscating links and thus decreasing the PageRank and usefulness of blogs, using censorship methods known as blacklists – are a disservice to public communication, albeit often in ways that are minor at first. If these are used exclusively, they will eventually lead to the ruin of the Internet as a public space and a public conversation.

Instead, we should encourage technical and legal measures that actively counterattack spammers and assailants of blogs. Spambots – here I refer to the sorts of programs that communicate on IRC to coordinate the defacement and destruction of blogs – attempt to turn channels of public communication and conversation against themselves. Spambots should themselves be sabotaged so that they are made to perform useful tasks, at the very least, notifying end users and network administrators that their computers have been compromised, but perhaps also implementing DDOS (distributed denial of service) attacks on rogue, spamming machines. Additionally, spammers should certainly be publicly identified and then ostracized, bankrupted, and in some cases physically incarcerated, but there are powerful technical methods that could be available to us, too, and it’s worthwhile to spur on the development of these.

The problem with comment spam is not that blogs link to things or that blogs allow unconstrained communication by commenters online; the problem is the abuse of blogs as a channel of communication and the attempts of spammers to destroy the blog as a popular forum and to render the Internet a wasteland of speech. The appropriate response is not to cripple blogs, but to target abusers and the abuse and attacks they visit on our new communication systems and conversational spaces.

A classic, useful definition of email spam is “UBE,” unsolicited bulk email. By this definition, spam does not need to be commercial; something that is noncommercial or even meaningless as communication (e.g., a flood of empty or nonsense messages) counts. However, a single unsolicited message sent by a person, even if it is commercial in nature, is not spam, so people can initiate conversations with each other by email without being cast as spammers. Any message sent to a group of recipients without their consent falls into this category, however. Although it is not explicit in the definition, we would expect that varying the message by exchanging nonsense words for one another should not keep us from considering a batch of emails as UBEs. The definition is not perfect, since a flood of nonsense email sent to a single person does not involve bulk messages, but it is a start, and perhaps that sort of attack is best characterized differently, anyway.

We might similarly adopt the term or definition “UBC” (unsolicited bulk comments) to refer to comments, whether entered automatically or manually, that are entered into multiple blogs without any relevance to messages on those blogs or in violation of comment policies on those blogs. Actually, spam on blogs is probably better defined in terms of USENET spam. The term “spam” in this usage seems to have originated on USENET, along with the standard that messages are the same if they are “substantively identical.” Blogs do not generally encourage cross-posting – a legitimate activity on USENET – so identifying spam is even easier on blogs. We should also distinguish a single bulk comment posted across blogs, however inappropriate, from a flood attack, the purpose of which is to destroy systems and, overall, to discourage the existence of blogs as channels of communication so that the Internet is eventually turned into an enormous direct mail campaign and hosts no communication between individuals.
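A rough way to see what "substantively identical" could mean in practice is to normalize each comment and count how many distinct blogs the same normalized text appears on. The sketch below is only an illustration: the function names, the `min_blogs` threshold, and the sample data are assumptions, and this crude normalization (which also discards non-ASCII letters) would not catch spam that swaps nonsense words in and out.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())

def find_bulk_comments(comments, min_blogs=3):
    """Return normalized texts that appear on at least `min_blogs` distinct blogs.

    `comments` is an iterable of (blog_name, comment_text) pairs.
    `min_blogs` is an arbitrary, hypothetical threshold.
    """
    blogs_by_text = defaultdict(set)
    for blog, text in comments:
        blogs_by_text[normalize(text)].add(blog)
    return {
        text: sorted(blogs)
        for text, blogs in blogs_by_text.items()
        if len(blogs) >= min_blogs
    }

# Made-up sample data: one message posted to three blogs, one ordinary comment.
sample = [
    ("blog-a", "Cheap pills!!! Visit our site"),
    ("blog-b", "cheap pills -- visit OUR site"),
    ("blog-c", "Cheap pills. Visit our site."),
    ("blog-a", "Nice post, I disagree about blacklists though."),
]
bulk = find_bulk_comments(sample)
```

Here `bulk` contains a single entry for the cross-posted message, listing the three blogs it hit. The ordinary comment is left alone because it appears on only one blog.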

The conventional response is typified by two changes that have been made to many blogs, including Grand Text Auto: [*]

Obfuscate links so that comments no longer link directly to Web sites and spammers are “denied PageRank.” Of course, the numerous creative sites, authored by individuals, that are linked to in legitimate comments are also denied PageRank, which is exactly what spammers want. They would prefer that their link farms and paid advertisements be the exclusive way that PageRank is assigned. Even better for them, blogs are crippled by this mechanism, since it prevents users from mousing over a link and reading the URL. Degrading the workings of blogs makes pay sites seem better off in comparison.
Implement censorship mechanisms known as blacklists. As we know from AOL’s prohibition of the word “breast” in chat rooms and bulletin board posts (quite a problem for those who wanted to discuss breast cancer and breast feeding), these blanket methods stifle legitimate conversation, making it particularly difficult to discuss uncomfortable subjects. Will a legitimate URL that has the words “product” and “rape” in it, such as this one, make it through filters meant to deny flood attacks of commercial, sexually deviant URLs?
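The blacklist problem described above is easy to reproduce with a naive substring filter. This is a minimal sketch, not the code of any particular blacklist tool: the one-word `BLACKLIST` and both sample comments are invented for illustration.

```python
# Hypothetical one-word blacklist, echoing the AOL example above.
BLACKLIST = {"breast"}

def is_blocked(comment, blacklist=BLACKLIST):
    """Block a comment if any blacklisted term appears anywhere in it."""
    lowered = comment.lower()
    return any(term in lowered for term in blacklist)

legitimate = "Has anyone here found a good breast cancer support group?"
spam = "BREAST enlargement pills, click here"

blocked_legitimate = is_blocked(legitimate)
blocked_spam = is_blocked(spam)
```

Both comments are blocked. The filter cannot tell the person asking about a support group from the spammer, which is exactly the collateral damage to legitimate conversation at issue here.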

These sorts of measures also simply pass the comment spam problem along to bloggers on the margins, people who may have no or limited access to technical support and exactly the sort of people who may benefit the most from having blogs as a channel of communication to discuss serious issues that are hard to converse about in other “real world” forums.

On the other hand, legal and technical counterattacks on spammers and blog assailants benefit the whole blogging community and do nothing to restrict legitimate conversations on blogs.

Already, some webmasters have designed systems that feed spammers’ own email addresses back to the spammers’ email harvesters, or otherwise fight back against harvesters, so that the spammers’ automatic bulk-emailing software sends email to the spammers’ own servers or to the “abuse” addresses of their ISPs. Email harvesters are often referred to as “spambots,” but in this document, I’m using “spambot” to refer to a program that is used to coordinate the spamming of or attacks on blogs, often by communicating with numerous compromised computers via IRC to avoid IP-banning schemes. These are enabled by Trojan horse programs such as IRC/Fyle.

Why not extend these sorts of anti-email-harvester tactics to comment-posting spambots that operate on IRC channels? Instead of just kicking such spambots off of IRC channels – the typical response – the spammer’s system can be sabotaged, by the legitimate operators of the public channel, so that spam and attacks are redirected to spammers.

Please notice the essential difference between this proposed tactic – in which the administrator of an IRC channel takes measures to prevent the channel itself from being turned against legitimate Internet users and tries to divert an already-organized attack by compromised computers – and the tactics of the bounty-hunter, anti-art organization RIAA, which has sought legal sanction for clearly criminal incursions on the communications systems of private users, attacks that resemble spammer flood assaults much more than they do the countermeasures I am proposing here. Again, I’m not proposing to initiate DDOS attacks against spammers, but simply to divert their own attacks on blogs so that spammers and attackers, rather than bloggers, suffer.

Perhaps the most serious criticism of this invective of mine: Why just mention this idea, rather than implement it? I have been known to do some computer programming at times, but I certainly couldn’t put together such a system in half a day, so I felt that I could probably make a better immediate contribution at the level of concept and rhetoric than implementation. I’d be glad to work with others to make it happen, though.

[*] Both of these changes were made with my consent or urging, I should add, as they seemed to be the only tenable, immediate options for us to recover Grand Text Auto as a space for conversation. These can’t be the only defenses that bloggers use, however, and measures that are restrictive to legitimate users should be reversed when more suitable spam-dismantling techniques become available.

6 Responses to “Reversing the Spam Cannon”

nick Says:
May 17th, 2004 at 3:06 pm
I should have mentioned that intrepid souls are taking action against IRC-based spambots.
Matthew the Mad Devil-sticker Says:
May 18th, 2004 at 10:02 am
One of my current project ideas with email spam, though it’s hard to see how it might work with blog spam, is to keep a global database of how much we trust a host. For example, evidence of a windows worm in my apache logs or on a port that I listen on to trap worms should force whatever trust-rating I currently have assigned to be much lower. Similarly, the score from spamassassin and the information from SAUCE about syntax errors and other problems with the mail transaction should adjust this value. This is something I’ve mentioned to Hanna before now, as I need to think about how the statistics are going to work.
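As a very rough illustration of the trust-rating idea, here is a sketch in which each piece of evidence scales a host's trust downward. Everything in it is an assumption made for the example: the `HostTrust` class, the in-memory dictionary standing in for a global database, the default of 0.5, and the severity weights. How the statistics should really work is left open above.

```python
class HostTrust:
    """In-memory stand-in for a shared database of per-host trust ratings."""

    def __init__(self, default=0.5):
        self.default = default  # trust given to hosts never seen before
        self.scores = {}        # host -> trust in [0.0, 1.0]

    def trust(self, host):
        """Return the current trust rating for a host."""
        return self.scores.get(host, self.default)

    def penalize(self, host, severity):
        """Scale trust down: severity 0.0 changes nothing, 1.0 removes all trust."""
        if not 0.0 <= severity <= 1.0:
            raise ValueError("severity must be between 0.0 and 1.0")
        self.scores[host] = self.trust(host) * (1.0 - severity)
        return self.scores[host]

# Hypothetical weights: a worm probe counts far more than a sloppy mail transaction.
SEVERITY = {"worm_probe": 0.9, "smtp_syntax_error": 0.2}

db = HostTrust()
db.penalize("192.0.2.7", SEVERITY["worm_probe"])         # seen in the web server logs
db.penalize("192.0.2.7", SEVERITY["smtp_syntax_error"])  # reported by the mail checker
```

After both reports the host's trust is roughly 0.04, while a host with no evidence against it stays at the default.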

It seems to me that in the coming years, this kind of approach is going to be the only one that can work in any sensible kind of a way.

The problem is that this works well for services where you expect a certain set of things to be the case, but in the comments, you will expect a set of workstation hosts, and a very small number of server machines (eg. proxies).

It is perhaps worth applying something like bayesian spam analysis or markov chain analysis to decide on the validity of the comment for you, and set a threshold. This would, of course, mean a corpus of spam comments and a corpus of real comments is needed, and that you need to disable your temporary measures, and just delete the comments.
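To make the Bayesian suggestion concrete, here is a minimal naive Bayes sketch with a threshold. It is an illustration under stated assumptions rather than a description of any existing filter: the class name, the whitespace tokenizer, the add-one smoothing, the 0.9 threshold, and the tiny training corpora are all invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Split on whitespace after lowercasing (deliberately crude)."""
    return text.lower().split()

class NaiveBayesFilter:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one comment to the 'spam' or 'ham' corpus."""
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) using add-one (Laplace) smoothing."""
        if not self.docs["spam"] or not self.docs["ham"]:
            raise ValueError("train on both spam and ham first")
        vocab_size = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total_docs = self.docs["spam"] + self.docs["ham"]
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)
            for word in tokenize(text):
                score += math.log(
                    (self.counts[label][word] + 1) / (total_words + vocab_size)
                )
            log_score[label] = score
        # Clamp the difference so math.exp cannot overflow on long inputs.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))

    def is_spam(self, text, threshold=0.9):
        """Apply a (hypothetical) threshold to the estimated probability."""
        return self.spam_probability(text) >= threshold

f = NaiveBayesFilter()
f.train("cheap pills buy now", "spam")
f.train("buy cheap pills online", "spam")
f.train("great post about interactive fiction", "ham")
f.train("I disagree about blacklists", "ham")
```

With this toy corpus, `f.is_spam("buy cheap pills")` is true and `f.is_spam("interesting post about fiction")` is false. A real deployment would need the two corpora of actual comments mentioned above, and the threshold would have to be tuned against them.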

Either that, or, for the moment screen everything before it appears, and file it for your training. The whole thing is a problem, and distrusting people until proven otherwise, however unsavoury this may be in many ways, is probably the only way forward.
Jill Says:
May 20th, 2004 at 7:41 am
Nick, this is an excellent discussion of spam, and in principle, I agree that penalising the spammers is the way to go, rather than crippling public discussion in an attempt to keep spammers out. I guess as a not-programmer I have no idea how hard something like that would be to implement though.

I have seen a couple of other ideas to limit spamming, though I suspect they all to some extent fall into the cripple-ourselves category:
1. Greylisting, which my university uses for email spam. Except that relies on “real” mailservers resending messages after ten minutes, which human posters don’t do but spambots could easily do, so really, it’s out.
2. Things like TypeKey or Blogger, where commenters can only leave their URL when they’re already registered in the Blogger or TypeKey databases. Obviously it’s a disadvantage to have to register like that – unless you happen to already have an account, as I found I did when I first tried to comment to a Blogger blog. Blogger blogs let non-registered commenters post as anonymous which is OK. But then again the only URL they seem to allow from comments is the one from the comment to the commenter’s Blogger profile, if it exists, and from there there’s a link to their homepage, so it’s not good enough.
3. PGP signatures required for comments (suggested in Feb, more comments more recently, and there’s a Movable Type plugin for it, it seems) – I like the idea of PGP but have never bothered to get myself a code or whatever it is you get, and assume most people are like me. So it would be likely to limit comments like TypeKey, Blogger and other authentication systems, which would suck.
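The greylisting mechanism in the first item can be sketched in a few lines: refuse the first delivery attempt from an unknown sender and accept a retry once enough time has passed. The class name, the (IP, sender, recipient) triplet used as the key, and passing the clock in as a plain number are assumptions made for this example. The ten-minute delay is the figure given above.

```python
class Greylist:
    """Temporarily refuse unknown senders; accept them if they retry later."""

    def __init__(self, delay_seconds=600):
        self.delay = delay_seconds  # ten minutes, as in the description above
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, triplet, now):
        """Return True to accept, False to ask the sender to try again later.

        `triplet` is a hypothetical (client_ip, sender, recipient) key and
        `now` is a timestamp in seconds.
        """
        first = self.first_seen.setdefault(triplet, now)
        return now - first >= self.delay

g = Greylist()
key = ("192.0.2.7", "someone@example.org", "jill@example.net")
first_try = g.check(key, now=0)     # first attempt: refused
early_retry = g.check(key, now=300) # retried too soon: still refused
late_retry = g.check(key, now=600)  # retried after ten minutes: accepted
```

The sketch also shows why the approach transfers poorly to comments: a person posting a comment will not resubmit after ten minutes, while a spambot can be programmed to retry on schedule.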
I sure know I’m sick to death of spam. Currently, every time I look at my blog it’s full of spam that has to be deleted. Sure it only takes five minutes with MT-Blacklist but that’s five minutes too much.
nick Says:
May 20th, 2004 at 11:27 am
Matthew and Jill, thanks for the comments. I wish I could have developed a more specific proposal. I know the little I do about the operation of spambots from folklore and scattered places on the Web; I unfortunately couldn’t find any good resource about how they operate, either for my benefit or to provide a link. It’s certainly easier said than done, but I hope there are ways to turn spammers’ and attackers’ systems against them.

In the meantime, I hope it helps for us to at least be aware that blacklisting and registration schemes of various sorts, even if they seem the only tenable, immediate option, are chipping away at an originally uncensored and free forum for communication – they aren’t just an inconvenience for blog admins. I know that I’ve missed legitimate emails and have restricted other people’s access to my email address because of a deluge of email spam; I really hope comment spam doesn’t end up explicitly or implicitly shutting down blogs as a channel for speech and discussion.
scott Says:
May 20th, 2004 at 4:04 pm
It’s a shame we can’t just find the culprits and put them in stocks. I often wonder how effective their spam really is. Wouldn’t their money and effort be better spent just taking out a yellow pages ad so that people in the market for “penis enlargement” or “rape sex videos” could just let their fingers do the walking?
jill/txt Says:
May 20th, 2004 at 7:48 am penalise the spammers, not the community
In a discussion of anti-spam remedies over at Grandtextauto, Nick argues that attacking spam by crippling blogs and other arenas for public discussion is not solving anything: instead we should devise anti-spam tactics that penalise the spammers. Would…
