Stopping Email Address Leakage | Paul G. Allen School of Computer Science & Engineering

Hint: you can cut to the chase by skipping over the first two sections and going right to the meat contained in Local Support for Spam Avoidance.

The CSE defense against spam has two components:

Preventing email addresses from becoming known to senders of spam.

Identifying incoming email that is spam and marking or quarantining it so that it presents less of an interruption to our work.

This document deals with the first of those components: how to prevent email addresses from "leaking" from our web.

About the Problem

Leakage Prevention Techniques

Local Support for Spam Avoidance

About the Problem

When an email address is exposed on a web page in a form that can be identified by a machine, it is certain to be collected by a "spambot" and later used as the destination of spam.

There is therefore a tension between the goals of (1) making it easy for legitimate users to contact us and (2) making it hard for the malevolent ones to do so. To prevail, we must leverage the differences between a real human user and a robotic spambot.

Here is a brief summary of what we think we know about the behaviour of spambots, based in part upon published research:

harvested addresses go stale rapidly: While there are exceptions, even extreme exceptions, experience shows that the volume of spam received at an address is a strong function of when it was most recently available for harvest. That means that removing an address from the web even after it leaks is a worthwhile endeavour.
spambots are stupid: Spambots are armed with a few heuristic rules for gathering addresses. While their human masters doubtless work continuously to improve them, they do not, for example, learn from their mistakes.

Leakage Prevention Techniques

Policy

The well-known robots.txt convention is useful for directing the behaviour of well-behaved robotic web browsing software. Compliance is entirely voluntary, so spambots are unlikely to respect it.
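To illustrate why the convention is only advisory, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt rules, the /people/ path, and the crawler name are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all robots to stay out of /people/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /people/
"""

def polite_crawler_may_fetch(url, agent="ExampleBot"):
    """Return True if a robot that honours robots.txt would fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

# Hypothetical URLs: one under the disallowed path, one outside it.
print(polite_crawler_may_fetch("http://www.cs.washington.edu/people/faculty.html"))
print(polite_crawler_may_fetch("http://www.cs.washington.edu/index.html"))
```

The check runs entirely on the crawler's side. A spambot simply never performs it, and nothing on the server stops it from fetching the page anyway.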

Restricted Access

If a spambot can't read a page, it can't harvest the email addresses it contains, so restricting access to a page is a very tempting and effective technique. Of course, it doesn't work for all content— much of the content at an academic site is intended for the public.

One technique that's only a little bit unfriendly to the user but stops spambots dead is to use "basic auth" with well-publicized credentials. The advantage of basic auth is that (in contrast to, for example, pubcookie and csecookie) it doesn't require that the user have an actual local computer account. Users only need to know a username and a password, and you can tell them both.

For example, you might offer a link to a page that's restricted to user "no" with password "spam" on a page like this:

Please use the username "no" with password "spam" to access this page.

No general-purpose spambot will be able to step past this restriction, but any human user can do so in only a few seconds, and most browsers can be directed to memorize credentials to allow future visits to the same document tree to be fully seamless.
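As a rough illustration of what the browser does with those credentials, the sketch below builds and checks the Authorization header used by HTTP Basic authentication. The function names are hypothetical, and a real web server would perform this check itself through its own access-control configuration.

```python
import base64
import hmac

# The well-publicized credentials from the example above.
USERNAME = "no"
PASSWORD = "spam"

def basic_auth_header(username, password):
    """Build the Authorization header value a browser sends for basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def is_authorized(header_value):
    """Server-side check: does the header carry the published credentials?"""
    expected = basic_auth_header(USERNAME, PASSWORD)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(header_value.encode("utf-8"), expected.encode("utf-8"))

print(basic_auth_header("no", "spam"))
print(is_authorized(basic_auth_header("no", "spam")))
```

A spambot that requests the page without this header is refused and never sees the addresses on it.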

HTML Encoding

A common obfuscation technique is to encode email addresses in a text format understood and properly displayed by a web browser, but not by trivial spambot software. For example, the @ character can be encoded as &#64;, or the string rose can be encoded as &#114;&#111;&#115;&#101;. The idea is to replace each original ASCII character with an HTML encoding of its decimal value, a scheme that's intended to allow "tricky" characters to be displayed, but unnecessary for common ones.
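The encoding described above can be sketched in a few lines of Python. The function name is hypothetical. html.unescape from the standard library stands in for the decoding a browser performs.

```python
import html

def encode_entities(text):
    """Replace every character with its decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)

encoded = encode_entities("rose@cs.washington.edu")
print(encoded)

# A browser decodes the references back to the original text.
assert html.unescape(encoded) == "rose@cs.washington.edu"
```

The encoded form contains no literal @ character, which is what defeats a naive pattern-matching harvester.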

Does it work? Researchers who study spam have reported that it does. A problem is that what works today may not work tomorrow, when new versions of spambots are released that are less easily fooled. The bottom line is that it's probably better than doing nothing, but we don't expect it to remain much better.

English Language Encoding

Everybody has seen them: email addresses "recoded" in English, such as rose at cs dot washington dot edu. It probably does succeed in confusing at least today's spambots.
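If pages are generated by a program, the recoding is a simple text substitution. This is a minimal sketch with a hypothetical function name:

```python
def english_encode(address):
    """Spell out the punctuation in an email address."""
    return address.replace("@", " at ").replace(".", " dot ")

print(english_encode("rose@cs.washington.edu"))
# rose at cs dot washington dot edu
```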

Graphics

More aggressive than the "English language encoding" is to create a graphic representing the text of an email address. Downsides are that such graphics are a little tedious for the content provider to create and deploy, and that the addresses cannot be read by the text-to-speech applications employed by blind users.

CSE offers a local tool for creating such graphics; details are in An Email-to-GIF Tool, below.

CGI Scripts

The most convenient way to offer an email address on a web page is to use a mailto: link, which allows a user to click on the link and use their own email client to send an email to the address listed with a minimum amount of typing. It's also the most convenient for a spambot, though, so we try to avoid using it.

Slightly less convenient for the user is to offer a link to an email-sending CGI script that takes its input from an HTML form. The disadvantage for the user is being forced to use a somewhat clunky interface with strictly limited options. Advantages to the content provider abound, however:

such an interface encourages brevity
a well-constructed script can include key context (such as the referring URL) that a human correspondent may neglect
unlike a mailto: link, arguments to an email-sending CGI script needn't include a fully-qualified email address
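The last two advantages can be sketched as follows. This is not the CSE script. The function name, the sender address, and the fixed domain are assumptions made for illustration. The form supplies only a short username, and the full address is assembled on the server where a spambot cannot see it.

```python
import re
from email.message import EmailMessage

DOMAIN = "cs.washington.edu"  # assumed fixed recipient domain
USERNAME_RE = re.compile(r"[A-Za-z0-9._-]+")  # bare username: no "@", no spaces

def build_message(to_user, subject, body, referer=""):
    """Turn form fields into an email without exposing an address on the page."""
    if not USERNAME_RE.fullmatch(to_user):
        raise ValueError("recipient must be a bare username")
    msg = EmailMessage()
    msg["To"] = f"{to_user}@{DOMAIN}"
    msg["From"] = f"webform@{DOMAIN}"  # hypothetical sender address
    msg["Subject"] = subject
    if referer:
        # Add context the correspondent may have neglected to mention.
        body += f"\n\nSent from: {referer}"
    msg.set_content(body)
    return msg

msg = build_message("support", "fix my stuff", "The printer is down.",
                    referer="http://www.cs.washington.edu/")
print(msg["To"])
```

Handing the finished message to a mail transport is left out of the sketch.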

CSE offers a local email-sending CGI script on all lab-managed web servers. Details are in An Email-sending CGI, below.

JavaScript

Some users have deployed simple JavaScript code that writes an email address to the web page at browse time, working from an embedded email address that is somehow encoded to obfuscate it. Since spambots aren't expected to run the JavaScript code in a web page, it's effective. On the downside, it won't work for those users who run with JavaScript (wisely?) disabled.
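One way to produce such markup is sketched below in Python, as a page generator might do it. The encoding chosen here (a list of character codes passed to JavaScript's String.fromCharCode) is just one possibility, and the function name and fallback text are hypothetical.

```python
def javascript_obfuscated(address, fallback="(enable JavaScript to see this address)"):
    """Return an HTML fragment that writes `address` into the page at browse time."""
    # Embed the address as decimal character codes rather than as literal text.
    codes = ",".join(str(ord(ch)) for ch in address)
    return (
        '<script type="text/javascript">'
        f"document.write(String.fromCharCode({codes}));"
        "</script>"
        f"<noscript>{fallback}</noscript>"
    )

fragment = javascript_obfuscated("rose@cs.washington.edu")
print(fragment)
```

The address never appears literally in the page source. Visitors with JavaScript disabled see only the fallback text, which is the downside noted above.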

My best estimate is that 95% of our visitors run with JavaScript enabled, as measured by analyzing about 10,000 visits to the CSE home page in March 2005. (That's likely an underestimate, however, since it includes robots of all stripes, which are expected to run without JavaScript enabled or, perhaps worse for my statistics, without interpreting <SCRIPT> or <NOSCRIPT> tags at all.)

Disposable Email Addresses

A disposable email address is an email address that you can feel free to share with questionable correspondents because you can "turn it off" at any time. As long as the address has only been shared with correspondents with whom you don't need to maintain an ongoing relationship, the cost of disposing of it (after, for example, it becomes a target of spam) is low.

Disposable email addresses take various forms. There are cost-free services such as SpamGourmet and Sneakemail that will forward email sent to a family of email addresses to a registered user, and there are webmail services such as Mailinator that allow you to avoid sharing your email address with anybody. And there is support built into the sendmail software that forms the backbone of the email service here at CSE.

The way it works with sendmail is this: any mail sent to <user>+<suffix>@<domain> for any value of <suffix> is sent to <user>@<domain> with the suffix intact. That suffix can be used to control email filtering rules.

For example, I might use the username rose+rules instead of rose to register for the rebate on my new flash memory card because I have an irrational bias against marketing companies based in Miami. Later, when spam directed at rose+rules starts rolling in (months before the rebate check, of course), I can configure my email client to discard such email. It's safe because I know that I only gave that email address to marketing companies based in Miami.
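The filtering step might look something like the sketch below. The function names and the discard list are hypothetical, and a real mail client would express the same rule in its own filter language.

```python
def split_plus_address(address):
    """Split user+suffix@domain into (user, suffix, domain)."""
    local, _, domain = address.rpartition("@")
    user, _, suffix = local.partition("+")
    return user, suffix, domain

# Suffixes that have started to attract spam (hypothetical filter list).
DISCARD_SUFFIXES = {"rules"}

def should_discard(recipient):
    """Return True if mail addressed to this recipient should be thrown away."""
    _, suffix, _ = split_plus_address(recipient)
    return suffix in DISCARD_SUFFIXES

print(split_plus_address("rose+rules@cs.washington.edu"))
print(should_discard("rose+rules@cs.washington.edu"))
print(should_discard("rose@cs.washington.edu"))
```

Mail to the plain address has an empty suffix, so it is never caught by the discard rule.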

Does the sendmail scheme work? Reports are that it's effective today, but it might not be effective in the future as spammers figure out that stripping everything after a + sign sells more penis pills.

Local Support for Spam Avoidance

There is no good reason to expose email addresses in the clear on unrestricted portions of the CSE web. In particular, it's unkind to expose the email addresses of other users, regardless of whether you choose to throw caution to the wind with your own address.

Below is information on two tools that we offer locally to help with meeting that goal.

An Email-sending CGI

The standard CSE "orthodox" web design offers a comments-to field in the footer. Formerly, that comments-to field offered a link to an email-sending CGI script that required a fully-qualified email address, but the newest version— http://www.cs.washington.edu/htbin-post/unrestricted/mailto2.pl— wants only a username (or the name of a mailing list).

mailto2.pl only supports sending email to cs.washington.edu addresses.

To call the script, provide a to parameter to specify the recipient and a sub parameter to specify the subject. For example, to send mail to "support" with a subject of "fix my stuff", you could use a link that looks like this:

  <a href="/htbin-post/unrestricted/mailto2.pl?to=support&sub=fix+my+stuff">mail support</a>
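If links like this are generated by a program, the query string should be URL-encoded rather than assembled by hand. The sketch below uses Python's standard urllib.parse.urlencode. The helper name is hypothetical, while the script path and the to and sub parameters are those described above.

```python
from urllib.parse import urlencode

SCRIPT_PATH = "/htbin-post/unrestricted/mailto2.pl"

def mailto2_link(to, subject):
    """Build a relative URL for the script with its to and sub parameters."""
    # urlencode escapes reserved characters and turns spaces into "+".
    return SCRIPT_PATH + "?" + urlencode({"to": to, "sub": subject})

print(mailto2_link("support", "fix my stuff"))
# /htbin-post/unrestricted/mailto2.pl?to=support&sub=fix+my+stuff
```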

You can update legacy web content to use this script by using the content tool— just open the original file with the tool, click through the screens to regenerate it, then copy the new version over the original.

An Email-to-GIF Tool

If you wish to create a simple one-line graphic representation of an email address (or any other text), you can use http://www.cs.washington.edu/htbin-post/unrestricted/Email2GIF.cgi. There is support for a few fonts, a few colors, and a few background colors (including transparent).

Disposable Email Addresses

As mentioned above, the sendmail mail transport software we use here at CSE supports a particular flavor of "disposable" email addresses: those in which a username is followed by a + and any suffix. That could be used to manage spam sent to a version of your email address that is allowed to leak from our web.

For example, imagine that I am teaching CSE 666 in Spring 2005. I could rashly expose an email address such as rose+cse666sp05 in a mailto: link on my web. After Spring 2005 is a memory but the herbal supplement offers are very much of the present, I can configure my email client to discard email to rose+cse666sp05 without loss.

The SpamBuster Team is standing by to hear your comments and suggestions.