University of Washington Computer Science & Engineering
 Web Crawler Policy DRAFT (deprecated)

This draft version was deprecated on Nov 22, 2006.
See published version.

If you plan to create a web crawler, whether for a research project, a class project, or a personal educational project, you should tread carefully and act thoughtfully. Your crawler could disrupt the services of another web site, force its operators to spend extra effort dealing with the intrusion, or even create a legal liability for the university. This document describes the UW CSE policy for running web crawlers, contains a few rules you must follow, and offers links to other resources that you may find helpful in constructing and operating your crawler.

An excellent set of guidelines was developed by Martijn Koster at the dawn of the web, and they are still relevant today. The highlights in Koster's Guidelines for Robot Writers are:

Please take a few minutes to read Koster's guidelines in their entirety.


Minimum Requirements for Operating a Web Crawler at UW CSE

If you operate a web crawler on or from the UW CSE network, you must abide by all of the following rules. Failure to follow these rules is considered a misuse of your CSE computer account, and may result in loss of department computing privileges.

  1. Ensure that your crawler obeys robots.txt conventions.
  2. Ensure that your crawler does not engage in any of these specific gaffes.
  3. Provide contact info, so that crawled sites can contact somebody who can expeditiously deal with problems (e.g., a persistent crawler that is bothering their site). You must include a working, monitored email address in the "User-Agent" request header field. Alternatively, you may provide a URL in the header field, as long as the resulting page provides a good email address.
  4. Monitor the email address provided in the user-agent field, to ensure that appropriate action can be taken promptly.
  5. Notify the CS Lab Staff of your crawler. We need to know:
    1. When the crawl is or will be active.
    2. Which machines are participating.
    3. Who is responsible for each, including CSE account name(s) and after-hours phone number(s).
    4. How to shut down the activity if necessary.
  6. Follow the dictum that "the scannee is always right." For example, if a site asks us to stop altogether or to otherwise mitigate, we need to do so expeditiously and graciously, regardless of whether their robots.txt forbids the scanning, and regardless of whether their request lacks grace or is incorrect in some technical detail. We also need to make sure that the information about who has asked not to be scanned is shared among groups, because the scannee doesn't care a whit whether it was team A or team B that hosed their site.
  7. Post any "stop-scanning" requests to the crawlers@cs mailing list (see previous dictum). Also post any other mitigation requests made by site operators. You must likewise honor any such requests that are posted by others.

    [It would be nice if we could do something a little better than a Mailman archive as the repository for which sites must be avoided. Sort of a master robots.txt (or portion thereof) that crawler authors can update; and the mailing list would be automagically notified of all updates. Or even better... some sort of "include" directive that crawler authors could put into their robots.txt code, along with a way for any registered author to add to it. (For all I know, this idea is already part of the robots.txt stuff - I've never dug into it, so have no idea.)]
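
As a rough illustration of rules 1, 3, and 6, the following Python sketch checks a URL against both a shared avoid list and a site's robots.txt before fetching, and builds a User-Agent header that carries contact info. It is only a sketch under stated assumptions, not an official CSE tool: the crawler name, the email address, the AVOID_HOSTS set, and the helper function names are all hypothetical placeholders. The robots.txt handling uses the standard library's urllib.robotparser.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical values: substitute your crawler's name and a monitored address.
USER_AGENT = "ExampleCrawler/0.1 (+mailto:crawler-owner@example.edu)"

# Hypothetical shared avoid list: hosts whose operators asked not to be crawled.
AVOID_HOSTS = {"no-crawl.example.org"}


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(url: str, robots: RobotFileParser) -> bool:
    """Allow a fetch only if the host is not on the avoid list
    and the site's robots.txt permits this user agent to fetch the URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host in AVOID_HOSTS:
        return False
    return robots.can_fetch(USER_AGENT, url)


def request_headers() -> dict:
    """Headers to send with every request, so site operators can reach you."""
    return {"User-Agent": USER_AGENT}


if __name__ == "__main__":
    robots = parse_robots("User-agent: *\nDisallow: /private/\n")
    for candidate in (
        "http://example.com/index.html",
        "http://example.com/private/data.html",
        "http://no-crawl.example.org/",
    ):
        print(candidate, may_fetch(candidate, robots))
```

The avoid list is checked first so that a "stop-scanning" request wins even when the site's robots.txt would allow the fetch. In practice the list would be loaded from whatever shared record the groups maintain rather than hard-coded.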

Other Resources


Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to CS Lab Director]