If you plan to create a web crawler, whether for a research
project, a class project, or as a personal educational project, you
should tread carefully and act thoughtfully. Your crawler could
disrupt the services of another web site, force its operators to
spend extra effort dealing with the intrusion, or even
create a legal liability for the university. This document
describes the UW CSE policy for running web crawlers, contains
a few rules you must follow, and offers links to other
resources that you may find helpful in constructing and operating your
crawler.
An excellent set of guidelines was developed by Martijn Koster at
the dawn of the web, and they are still relevant today. The
highlights in Koster's Guidelines for Robot Writers are:
- Be Accountable - if your actions cause problems, be
available to take prompt action in response;
- Test Locally - and expand the scope gradually before you
unleash your crawler on others;
- Don't hog resources;
- Stay with it - "it's vital to know what your robot is
doing, and that it remains under control".
Please take a few minutes to read
Koster's
guidelines in their entirety.
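One way to act on the "Don't hog resources" guideline is to enforce a minimum gap between successive requests to the same host. The sketch below is only an illustration, not part of Koster's guidelines or of department policy: the class name `HostThrottle` and the default delay of one second are assumptions chosen for the example, and a real crawl may well need a longer delay.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum gap between successive requests to the same host.

    Hypothetical helper for illustration; the delay value is an assumption.
    """

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._last_request = {}      # hostname -> time of the last request

    def wait(self, url):
        """Block until the host of `url` may be contacted again.

        Returns the number of seconds that were slept.
        """
        host = urlsplit(url).hostname or ""
        last = self._last_request.get(host)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.min_delay - (self._clock() - last))
            if pause > 0:
                self._sleep(pause)
        # Record the time at which this request is allowed to go out.
        self._last_request[host] = self._clock()
        return pause
```

A crawler loop would call `throttle.wait(url)` immediately before each fetch. Requests to different hosts do not delay one another, while repeated requests to one host are spaced out.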
Minimum Requirements for Operating a Web Crawler at UW CSE
If you operate a web crawler on or from the UW CSE network, you must
abide by all of the following rules. Failure to follow these
rules is considered a misuse of your CSE computer account, and may
result in loss of department computing privileges.
- Ensure that your crawler obeys robots.txt
conventions.
- Ensure that your crawler does not engage in
any of these specific gaffes.
- Line up a CSE faculty sponsor for your crawl project, and
let us know who that is.
- Provide contact info, so that crawled sites can contact
somebody who can expeditiously deal with problems (e.g., a
persistent crawler that is bothering their site). You must
include a working, monitored email address in the "user agent"
request header field. Alternatively, you could provide a URL in
the header field, as long as the resulting page provides a good
email address.
- Monitor the email address provided in the user-agent
field, to ensure that appropriate action can be taken
promptly.
- Operate a webserver on all hosts involved in the scan as
an alternate information forum. Provide a "top-level" URL on
that server, with a description of your project including
sponsors and contact information. This lets someone who has
traced a robotic scan of their site back to its source simply
check an obvious place for more information (e.g.,
http://SCANHOST.cs.washington.edu).
- Notify the CS Lab Staff of your crawler. Do this by
registering your scan in the
CSE Crawler Activity Wiki.
We need to know:
- When the crawl is or will be active.
- Which machines are participating.
- Who is responsible for each, including CSE account name(s)
and after-hours phone number(s).
- How to shut down the activity if necessary.
- Plus other information included in the
Crawler Activity Template.
- Follow the dictum that "the scannee is always
right." For example, if they ask us to stop altogether or to
otherwise mitigate, we need to do it expeditiously and graciously,
regardless of whether their robots.txt forbids the scanning, and
regardless of whether their request lacks grace or is not correct
in every technical detail. We also need to make sure that the
information about who has expressed a wish to avoid being scanned
is shared among groups, because the scannee doesn't care whether
it was team A or team B that hosed their site.
- Post any "stop-scanning" requests to the
Do Not Scan list
(see previous dictum). Also post any other mitigation requests
made by site operators.
- Honor all Do Not Scan requests and other mitigation
requests that have been made or posted by others (see
previous two items).
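Several of the rules above (obeying robots.txt, identifying yourself in the user-agent header, and honoring Do Not Scan requests) can be checked in code before each request is sent. The following is a minimal sketch using only the Python standard library, not an official department tool. The crawler name, the contact address, and the idea of representing the Do Not Scan list as a set of hostnames are all assumptions made for illustration; the real list lives on the wiki and may be structured differently.

```python
from typing import Optional
from urllib import request, robotparser
from urllib.parse import urlsplit

# Placeholder identity: substitute your project's name, your scan host's
# top-level URL, and an email address that someone actually monitors.
USER_AGENT = (
    "example-cse-crawler/0.1 "
    "(+http://SCANHOST.cs.washington.edu; crawler-owner@example.org)"
)


def build_robots_parser(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_request(
    url: str,
    parser: robotparser.RobotFileParser,
    do_not_scan: frozenset = frozenset(),
) -> Optional[request.Request]:
    """Return a Request carrying the contact header, or None if the URL is off limits.

    `do_not_scan` is a hypothetical in-memory copy of the Do Not Scan list,
    assumed here to be a set of lowercase hostnames.
    """
    host = urlsplit(url).hostname or ""
    if host in do_not_scan:
        return None  # a site operator asked not to be scanned
    if not parser.can_fetch(USER_AGENT, url):
        return None  # robots.txt disallows this path for our agent
    return request.Request(url, headers={"User-Agent": USER_AGENT})
```

A crawler would fetch each site's robots.txt once, build the parser, and pass every candidate URL through `build_request`, skipping any URL for which it returns `None`. Note that the Do Not Scan check comes first, so a site that has asked to be left alone is skipped even when its robots.txt would permit the crawl.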
Other Resources