If you plan to create a web crawler, whether for a research
project, a class project, or a personal educational project, you
should tread carefully and act thoughtfully. Your crawler has
the potential to disrupt the services of another web site, to cause
its operators extra effort in dealing with the intrusion, or even
to create a legal liability for the university. This document
describes the UW CSE policy for running web crawlers, contains
a few rules you must follow, and offers links to other
resources that you may find helpful in constructing and operating your
crawler.
An excellent set of guidelines was developed by Martijn Koster at
the dawn of the web, and they are still relevant today. The
highlights in Koster's Guidelines for Robot Writers are:
- Be Accountable - if your actions cause problems, be
available to take prompt action in response;
- Test Locally - and expand the scope gradually before you
unleash your crawler on others;
- Don't hog resources;
- Stay with it - "it's vital to know what your robot is
doing, and that it remains under control".
Please take a few minutes to read Koster's guidelines in their
entirety.
Minimum Requirements for Operating a Web Crawler at UW CSE
If you operate a web crawler on or from the UW CSE network, you must
abide by all of the following rules. Failure to follow these
rules is considered a misuse of your CSE computer account, and may
result in loss of department computing privileges.
- Ensure that your crawler obeys robots.txt
conventions.
- Ensure that your crawler does not engage in any of these specific gaffes.
- Provide contact info, so that crawled sites can contact
somebody who can expeditiously deal with problems (e.g., a
persistent crawler that is bothering their site). You must
include a working, monitored email address in the "user agent"
request header field. Alternatively, you could provide a URL in
the header field, as long as the resulting page provides a working
email address.
- Monitor the email address provided in the user-agent
field, to ensure that appropriate action can be taken
promptly.
- Notify the CS Lab Staff of your crawler. We need to know:
  - When the crawl is or will be active.
  - Which machines are participating.
  - Who is responsible for each, including CSE account name(s)
    and after-hours phone number(s).
  - How to shut down the activity if necessary.
- Follow the dictum that "the scannee is always
right." For example, if they ask us to stop altogether or to
otherwise mitigate, we need to do it expeditiously and graciously,
regardless of whether their robots.txt forbids the scanning, and
regardless of whether their request lacks grace or is not correct in
every technical detail. And we need to make sure that the information
about who has expressed a wish to avoid being scanned is shared among
groups, because the scannee doesn't care a whit whether it was team A
or team B that hosed their site.
- Post any "stop-scanning" requests to the
crawlers@cs mailing list (see previous dictum). Also
post any other mitigation requests made by site operators. And
you must incorporate any such requests that are posted by
others.
[It would be nice if we
could do something a little better than a Mailman archive as the
repository for which sites must be avoided. Sort of a master
robots.txt (or portion thereof) that crawler authors can update;
and the mailing list would be automagically notified of all
updates. Or even better... some sort of "include" directive that
crawler authors could put into their robots.txt code, along with
a way for any registered author to add to it. (For all I know,
this idea is already part of the robots.txt stuff - I've never
dug into it, so have no idea.)]
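As an illustration of how the robots.txt, contact-info, and stop-scanning rules above might fit together in a crawler, here is a minimal Python sketch. It is not an official CSE tool. The crawler name, the email address, and the function names are all hypothetical placeholders, and the avoid list is assumed to be a plain set of host names that you maintain by hand from requests posted to the mailing list. It relies only on the standard library's urllib.robotparser and urllib.request.

```python
from urllib import robotparser
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical values: substitute your own crawler name and an email
# address that somebody actually monitors.
USER_AGENT = "example-crawler/0.1 (+mailto:crawler-owner@example.edu)"


def host_is_avoided(url, avoid_hosts):
    """Return True if the URL's host, or a parent domain, is on the avoid list."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in avoid_hosts)


def allowed_by_robots(url, robots_txt, user_agent=USER_AGENT):
    """Check a URL against the text of an already-fetched robots.txt file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def may_crawl(url, robots_txt, avoid_hosts):
    """The avoid list wins over robots.txt: a site that asked us to stop is
    skipped even if its robots.txt would permit the fetch."""
    if host_is_avoided(url, avoid_hosts):
        return False
    return allowed_by_robots(url, robots_txt)


def build_request(url):
    """Build a request that carries the contact info in the User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

A crawler would call may_crawl before every fetch and send only requests made by build_request. Checking the avoid list first reflects the "scannee is always right" rule: a stop request applies whatever the site's robots.txt says.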
Other Resources