If you plan to create a web crawler, whether for a research project, a class project, or a personal educational project, you should tread carefully and act thoughtfully. Your crawler has the potential to disrupt another web site's services, to create extra work for the people who must deal with its intrusion, or even to create legal liability for the university. This document describes the UW CSE policy for running web crawlers, lists a few rules you must follow, and offers links to other resources that may help you construct and operate your crawler.
An excellent set of guidelines was developed by Martijn Koster at the dawn of the web, and they remain relevant today. The highlights of Koster's Guidelines for Robot Writers are:
- Be Accountable - If your actions cause problems, be available to take prompt action in response.
- Test Locally - ...and expand the scope gradually before you unleash your crawler on others.
- Don't hog resources.
- Stay with it - "It's vital to know what your robot is doing, and that it remains under control."
Minimum Requirements for Operating a Web Crawler at UW CSE
If you operate a web crawler on or from the UW CSE network, you must abide by all of the following rules. Failure to follow these rules is considered a misuse of your CSE computer account, and may result in loss of department computing privileges.
- Ensure that your crawler obeys robots.txt conventions.
- Ensure that your crawler does not engage in any of these specific gaffes:
- Failure to respect robots.txt
- This is a common one. Invariably, the crawler was designed to process robots.txt but wasn't properly tested. Victims of a crawl gone wrong can make a good case for collecting monetary damages if they can demonstrate that the crawler failed to respect their robots.txt. Respecting it is standard industry practice, and failing to do so is easy to characterize as rogue behavior. A minimal robots.txt check is sketched after this list.
- Unthrottled crawling
- Site administrators become vexed when a crawler accounts for a large percentage of the traffic on their site. The solution is to ensure that the rate of your crawl is limited by your own conservative policy, not by the thickness of your pipe, the capabilities of your crawler, or the urgency of your paper deadline; a simple client-side throttle is sketched after this list. In one recent instance, a poorly coded (but fully deployed) web crawler ignorantly used the HTTP/1.1 "keepalive" facility on each crawler connection, in an apparent attempt to consume as many resources at the crawled site as possible. CSE itself was once victimized by an unthrottled crawler that found a loop in our document tree, a loop it blithely followed deeper and deeper to no good purpose. Unfortunately, the cost of each request to our server turned out to be quite a strong function of the depth. It was necessary, and effective, to filter the requests at the IP level.
- Ignored cease-and-desist orders
- CSE recently had an incident where a site administrator made it abundantly clear that CSE crawlers, having caused problems on one server, were not welcome at any of her organization's sites. She included a very simple and unambiguous definition of the network block where the organization's servers were to be found. This was energetically conveyed to the research team, who abandoned the crawl of the first site they had hit but then proceeded to crawl other servers within that same network block. The researchers had been told, but they weren't listening. Even so, we managed to stay out of court.
- Missing or faulty contact information
- One CSE research team followed the lab's request to make their contact information readily available, then shortly thereafter disabled the email address because it attracted spam. Experience shows that the darkness of the mood of offended site administrators is a direct function of the time and effort required to contact a responsible party.
- Line up a CSE faculty sponsor for your crawl project, and let us know who that is.
- Provide contact info, so that crawled sites can contact somebody who can expeditiously deal with problems (e.g., a persistent crawler that is bothering their site).
- You must include a working, monitored email address in the "user agent" request header field. Alternatively, you may provide a URL in the header field, as long as the resulting page provides a working email address. An example user-agent string is sketched after this list.
- Monitor the email address provided in the user-agent field, to ensure that appropriate action can be taken promptly.
- Operate a webserver on all hosts involved in the scan as an alternate information forum. Provide a "top-level" URL on that server with a description of your project, including sponsors and contact information. This allows someone who determines the source of a robotic scan of their site to simply check an obvious place for more information (e.g., http://SCANHOST.cs.washington.edu).
- Notify the CS Lab Staff of your crawler. Do this by registering your scan in the CSE Crawler Activity Wiki. We need to know:
- When the crawl is or will be active.
- Which machines are participating.
- Who is responsible for each, including CSE account name(s) and after-hours phone number(s).
- How to shut down the activity if necessary.
- Plus other information included in the Crawler Activity Template.
- Follow the dictum that "the scannee is always right." For example, if they ask us to stop altogether or to otherwise mitigate, we need to do it expeditiously and graciously, regardless of whether their robots.txt forbids the scanning, and regardless of whether their request lacks grace or is not correct in every technical detail. And we need to make sure that information about who has asked not to be scanned is shared among groups, because the scannee doesn't give a whit whether it was team A or team B that hosed their site.
- Post any "stop-scanning" requests to the Do Not Scan list (see previous dictum). Also post any other mitigation requests made by site operators.
- Honor all Do Not Scan requests and other mitigation requests posted by others (see previous two items).
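
The sketches below illustrate a few of the points above in code. They are minimal examples under assumed names and values, not a vetted crawler. First, the robots.txt check: a sketch using Python's standard-library urllib.robotparser, with a placeholder bot name, mailbox, and target site. A real crawler would also handle fetch failures and re-check robots.txt periodically.

```python
# Minimal robots.txt check (sketch). The user-agent string and URLs below are
# illustrative placeholders, not real CSE resources.
from urllib import robotparser

USER_AGENT = "ExampleResearchCrawler/0.1 (mailto:crawler-contact@example.edu)"

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

candidate = "https://example.com/some/page.html"
if rp.can_fetch(USER_AGENT, candidate):
    print("robots.txt permits fetching", candidate)
else:
    print("robots.txt disallows", candidate, "- skip it")

# Some sites also publish a Crawl-delay; honor it when present.
delay = rp.crawl_delay(USER_AGENT)
if delay is not None:
    print("site requests at least", delay, "seconds between requests")
```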
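Second, one simple way to keep a crawl limited by policy rather than by bandwidth, as the "unthrottled crawling" item demands: a fixed, conservative per-host delay enforced on the client side. The delay value, helper name, and timeout are assumptions for illustration; choose values appropriate to the sites you crawl.

```python
# Client-side throttling sketch: space out requests to each host by a fixed,
# conservative delay, regardless of how fast the network or server could go.
import time
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

REQUEST_DELAY_SECONDS = 5.0   # deliberately conservative; an assumed value
_last_fetch = {}              # hostname -> time of the previous request

def polite_fetch(url, user_agent):
    """Fetch one URL, sleeping first if its host was contacted too recently."""
    host = urlsplit(url).hostname
    elapsed = time.monotonic() - _last_fetch.get(host, float("-inf"))
    if elapsed < REQUEST_DELAY_SECONDS:
        time.sleep(REQUEST_DELAY_SECONDS - elapsed)
    _last_fetch[host] = time.monotonic()
    req = Request(url, headers={"User-Agent": user_agent})
    # The connection is closed when this block exits, so the crawler does not
    # hold keep-alive connections open against the crawled site.
    with urlopen(req, timeout=30) as resp:
        return resp.read()
```

Bounding crawl depth per site, or tracking URLs already visited, is also prudent; it guards against the kind of document-tree loop described in the "unthrottled crawling" item.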
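Finally, a sketch of a user-agent string that carries the required contact information: a project page on the scan host plus a monitored mailbox. The bot name and email address are placeholders; the URL pattern follows the SCANHOST example given above.

```python
# User-agent sketch with working contact information. The bot name and mailbox
# are placeholders; point the URL at the project page on your scan host.
USER_AGENT = (
    "ExampleResearchCrawler/0.2 "
    "(+http://SCANHOST.cs.washington.edu/; mailto:crawler-contact@example.edu)"
)

# Send this value in the User-Agent request header on every request, e.g. as
# the user_agent argument to polite_fetch in the previous sketch.
print(USER_AGENT)
```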