University of Washington Computer Science & Engineering
 Web Crawler Gaffes

Here is a brief list of web-crawling gaffes to avoid, drawn from actual occurrences here at UW CSE.

Failure to respect robots.txt
This is a common one. Invariably, the crawler was designed to process robots.txt but wasn't properly tested.
Victims of a crawl gone wrong can make a good case for collecting monetary damages if they can demonstrate that the crawler failed to respect their robots.txt. It's standard industry practice to respect it, and not to do so is easy to characterize as rogue behaviour.
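Since the usual failure is an untested robots.txt check, it helps to keep that check in a small function that can be exercised without touching the network. The following is a minimal sketch using Python's standard urllib.robotparser module; the function names, the "example-crawler" user agent, and the sample rules are illustrative assumptions, not part of any CSE crawler.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Build a parser from the lines of an already-fetched robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def allowed(parser, user_agent, url):
    """Return True only if robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only to exercise the check.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

sample_parser = build_parser(SAMPLE_ROBOTS)
```

Calling allowed() before every request, and testing it against a few known rule sets like the sample above, is one way to catch the "designed but not tested" case before the crawler is deployed.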
Unthrottled crawling
Site administrators become vexed when a crawler constitutes a large percentage of the traffic on their site. The solution is to ensure that the rate of your crawl is limited by your own conservative policy, not by the thickness of your pipe, the capabilities of the crawler, or the urgency of your paper deadline.
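One way to make the crawl rate a matter of policy rather than bandwidth is to enforce a minimum delay between requests to the same host. The sketch below assumes a per-host delay is an acceptable policy; the class name HostThrottle and the 10-second default are hypothetical choices, not a recommendation from the lab.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock      # injectable so the policy can be tested
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the host of url may be contacted; return seconds slept."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # The next request to this host must wait a further min_delay.
        self._next_allowed[host] = now + delay + self.min_delay
        return delay
```

A crawler would call wait(url) immediately before each fetch. Because the clock and sleep functions are passed in, the throttle can be verified with a fake clock instead of being trusted on faith.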
In one recent instance, a poorly coded (but fully deployed) web crawler ignorantly used the HTTP/1.1 "keep-alive" facility on each crawler connection, in an apparent attempt to consume as many resources at the crawled site as possible.
CSE itself was once victimized by an unthrottled crawler that found a loop in our document tree, a loop that it blithely followed deeper and deeper to no good purpose. Unfortunately, the cost of each request to our server turns out to be quite a strong function of the depth. It was necessary and effective to filter the requests at the IP level.
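A loop in a document tree tends to produce ever-longer URLs that are all distinct, so remembering visited URLs is not enough on its own; a cap on depth and on total pages is also needed. The following is a minimal sketch of such a bounded traversal. The get_links callable is a hypothetical stand-in for whatever fetches a page and extracts its links, and the default limits are arbitrary assumptions.

```python
from collections import deque
from urllib.parse import urldefrag


def bounded_crawl(start_url, get_links, max_depth=5, max_pages=1000):
    """Breadth-first traversal that cannot follow a loop indefinitely.

    get_links(url) is assumed to return an iterable of absolute URLs.
    Returns the list of URLs visited, in visiting order.
    """
    start = urldefrag(start_url).url  # drop any #fragment
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # visit this page, but do not go deeper
        for link in get_links(url):
            link = urldefrag(link).url
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With a get_links that always returns a deeper copy of the current URL, the traversal stops at max_depth instead of descending forever, which is the behaviour the offending crawler lacked.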
Unrespected cease-and-desist orders
CSE recently had an incident where a site administrator made it abundantly clear that CSE crawlers, having caused problems on one server, were not welcome at any of her organization's sites. She included a very simple and unambiguous definition of the network block where the organization's servers were to be found. This was energetically conveyed to the research team, who, while they abandoned the crawl of the first site they had hit, then proceeded to crawl other servers within that same network block. The researchers had been told, but they weren't listening. Even so, we managed to stay out of court.
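When an administrator supplies a network block, the safest response is to encode that block in the crawler itself rather than rely on anyone remembering it. The sketch below uses Python's standard ipaddress module; the class name NetworkBlocklist is hypothetical, and the address ranges are reserved documentation ranges standing in for a real organization's block.

```python
import ipaddress


class NetworkBlocklist:
    """Refuse to contact any address inside a listed network block."""

    def __init__(self, cidr_blocks):
        # Each entry is a CIDR string such as "192.0.2.0/24".
        self._networks = [ipaddress.ip_network(block) for block in cidr_blocks]

    def is_blocked(self, ip_string):
        """Return True if ip_string falls inside any listed block."""
        address = ipaddress.ip_address(ip_string)
        return any(address in network for network in self._networks)


# Placeholder block (a reserved documentation range), not a real site.
blocklist = NetworkBlocklist(["192.0.2.0/24"])
```

The check works on IP addresses, so a crawler would need to resolve each hostname first and test the resulting address before connecting; that step is left out here to keep the sketch free of network calls.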
Missing or faulty contact information
One CSE research team followed the lab's request to make their contact information readily available, then shortly thereafter disabled the email address because it attracted spam. Experience shows that the darkness of the mood of offended site administrators is a direct function of the time and effort required to contact a responsible party.
To be effective and consistent with minimum standards, contact information must be both valid and actively monitored; otherwise, it is worse than useless.

Last updated: 8-apr-05


Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX