University of Washington Computer Science & Engineering
 Web Crawler Policy

If you plan to create a web crawler, whether for a research project, a class project, or as a personal educational project, you should tread carefully and act thoughtfully. Your crawler could disrupt the services of another web site, force its operators to spend extra effort dealing with the intrusion, or even create a legal liability for the university. This document describes the UW CSE policy for running web crawlers, contains a few rules you must follow, and offers links to other resources that you may find helpful in constructing and operating your crawler.

An excellent set of guidelines, Guidelines for Robot Writers, was developed by Martijn Koster at the dawn of the web, and they are still relevant today.

Please take a few minutes to read Koster's guidelines in their entirety.


Minimum Requirements for Operating a Web Crawler at UW CSE

If you operate a web crawler on or from the UW CSE network, you must abide by all of the following rules. Failure to follow these rules is considered a misuse of your CSE computer account, and may result in loss of department computing privileges.

  1. Ensure that your crawler obeys robots.txt conventions.
  2. Ensure that your crawler does not engage in any of these specific gaffes.
  3. Line up a CSE faculty sponsor for your crawl project, and let us know who that is.
  4. Provide contact info, so that crawled sites can contact somebody who can expeditiously deal with problems (e.g., a persistent crawler that is bothering their site). You must include a working, monitored email address in the "user agent" request header field. Alternatively, you could provide a URL in the header field, as long as the resulting page provides a good email address.
  5. Monitor the email address provided in the user-agent field, to ensure that appropriate action can be taken promptly.
  6. Operate a webserver on all hosts involved in the scan as an alternate information forum. Provide a "top-level" URL on that server, with a description of your project including sponsors and contact information. This allows someone who has determined the source of a robotic scan of their site to simply check an obvious place for more information (e.g., http://SCANHOST.cs.washington.edu).
  7. Notify the CS Lab Staff of your crawler. Do this by registering your scan in the CSE Crawler Activity Wiki. We need to know:
    1. When the crawl is or will be active.
    2. Which machines are participating.
    3. Who is responsible for each, including CSE account name(s) and after-hours phone number(s).
    4. How to shut down the activity if necessary.
    5. Plus other information included in the Crawler Activity Template.
  8. Follow the dictum that "the scannee is always right." For example, if they ask us to stop altogether or to otherwise mitigate, we need to do it expeditiously and graciously, regardless of whether their robots.txt forbids the scanning, and regardless of whether their request lacks grace or is not correct in every technical detail. We also need to make sure that the information about who has expressed a wish to avoid being scanned is shared among groups, because the scannee doesn't give a whit whether it was team A or team B that hosed their site.
  9. Post any "stop-scanning" requests to the Do Not Scan list (see previous dictum). Also post any other mitigation requests made by site operators.
  10. Honor all Do Not Scan requests and other mitigation requests that have been made or posted by others (see previous two items).
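
As a rough illustration of rules 1, 4, and 10, the following Python sketch shows one way a crawler might check a URL before fetching it. It is only a minimal sketch under stated assumptions, not a required or official implementation: the crawler name, the email address, the `DO_NOT_SCAN` set, and the helper functions `parse_robots`, `may_fetch`, and `request_headers` are all hypothetical names invented for this example. The Do Not Scan list is represented here as a plain set of hostnames; how you actually load it is up to you.

```python
import urllib.robotparser
from urllib.parse import urlsplit

# Hypothetical User-Agent value: substitute your own crawler name, the URL of
# the project page on your scan host, and a working, monitored email address.
USER_AGENT = (
    "example-cse-crawler/0.1 "
    "(+http://SCANHOST.cs.washington.edu; mailto:crawler-owner@example.org)"
)

# Hypothetical local copy of the Do Not Scan list, one lowercase hostname each.
DO_NOT_SCAN = {"blocked.example.com"}


def parse_robots(robots_txt):
    """Build a robots.txt parser from the text of a site's robots.txt file."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(url, parser, do_not_scan=DO_NOT_SCAN, user_agent=USER_AGENT):
    """Return True only if the host is not on the Do Not Scan list and
    the site's robots.txt allows this user agent to fetch the URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host in do_not_scan:
        # A stop-scanning request wins even if robots.txt would allow the fetch.
        return False
    return parser.can_fetch(user_agent, url)


def request_headers(user_agent=USER_AGENT):
    """Headers to send with every request, so site operators can reach you."""
    return {"User-Agent": user_agent}
```

In use, a crawler would fetch each site's robots.txt once, pass its text to `parse_robots`, call `may_fetch` before every request, and send `request_headers()` with each request it does make. Checking the Do Not Scan list first reflects the dictum above: a site's request to stop is honored whatever its robots.txt says.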

Other Resources


Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to CS Lab Director]