The following materials are very closely related and may be useful.
- Crawlers: If your project requires you to build a crawler, please
be sure you know your responsibilities. Also, please read the Mercator
paper. There are three recommended strategies for incorporating crawling
capabilities into your project, listed below in increasing order of
sophistication.
- Wget is a free utility for downloading files from the Web. It's pretty
basic, but could be the perfect choice if you primarily need to fetch files
from one or a small number of sites.
- Heritrix is an open-source,
extensible, Web-scale crawler distributed by the Internet Archive. It's not
quite as flexible as Nutch but even easier to get running. Particularly
convenient is its Web-based
UI, which lets you create and configure crawling jobs. Pretty snazzy!
- Nutch is a
full-featured, industrial-strength, open-source Web search software
package. If all you need is a crawler, you can throw away the Lucene
information-retrieval part (which does TF/IDF and other types of ranking of
documents based on an optimized, inverted-index data store).
You get complete control, but through reasonably easy programming. It's
really not that bad, though the others might be easier if you have limited
time.
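Whichever strategy you pick, the one non-negotiable courtesy is honoring
robots.txt. Here is a minimal sketch, in Python, of the breadth-first crawl
loop at the heart of any crawler; the fetch, extract_links, and robots_for
callables (and the toy URLs) are hypothetical stand-ins for real HTTP and
HTML-parsing code, not part of any of the tools above.

```python
# Sketch of a polite breadth-first crawl loop. The fetch, extract_links,
# and robots_for functions (and the toy URLs below) are hypothetical
# stand-ins for real HTTP fetching and link extraction.
from collections import deque
import urllib.robotparser

def allowed(robots_txt, user_agent, url):
    """True iff the given robots.txt text permits user_agent to fetch url."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

def crawl(seeds, fetch, extract_links, robots_for, max_pages=100):
    """Breadth-first crawl from seeds, skipping URLs robots.txt disallows."""
    frontier, seen, pages = deque(seeds), set(seeds), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed(robots_for(url), "demo-bot", url):
            continue                      # robots.txt forbids this URL
        pages[url] = fetch(url)
        for link in extract_links(pages[url]):
            if link not in seen:          # enqueue each URL at most once
                seen.add(link)
                frontier.append(link)
    return pages

# Toy in-memory "web" so the loop can run without network access.
web = {"http://x/a": ["http://x/b", "http://x/private/c"],
       "http://x/b": [],
       "http://x/private/c": []}
pages = crawl(["http://x/a"],
              fetch=lambda u: u,                 # a "page" is just its URL here
              extract_links=lambda page: web[page],
              robots_for=lambda u: "User-agent: *\nDisallow: /private/\n")
# /private/c is discovered but never fetched, since robots.txt disallows it.
```

A real crawler also needs per-host rate limiting, retries, and URL
normalization, which is exactly the machinery Wget, Heritrix, and Nutch
already provide.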
- Machine Learning & IE Packages:
- Weka is a
well-developed and simple-to-use machine learning package which is quite
popular. A companion book provides excellent documentation, but there is
good material online as well.
- Mallet is aimed at
statistical natural language processing, but has quite a bit of machine
learning code built in. Specifically, it provides learning and decoding
functions for conditional random fields (CRFs), which are similar to, but
better than, HMMs. Documentation on this functionality ("sequence tagging")
is available on the Mallet site.
You may also want to read a Guide for Using Mallet written by Fei Xia.
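To make "decoding" concrete: a sequence model assigns a score to every
possible tag sequence for a sentence, and decoding finds the best one by
dynamic programming. The sketch below is the Viterbi algorithm for a toy
HMM with invented probabilities; CRF decoding runs the same dynamic
program, just with learned feature weights in place of these hand-built
tables.

```python
# Viterbi decoding for a toy 2-tag HMM (the probabilities are invented
# for illustration). CRFs are decoded with the same dynamic program.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log-probability, best tag sequence) for the observations."""
    # Column 0: score of starting in each state and emitting obs[0].
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
          for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            # Best way to reach state s: maximize over predecessor states.
            score, path = max(
                (V[-2][prev][0] + math.log(trans_p[prev][s])
                 + math.log(emit_p[s][o]), V[-2][prev][1])
                for prev in states)
            V[-1][s] = (score, path + [s])
    return max(V[-1].values())

states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"they": 0.6, "fish": 0.4},
          "V": {"they": 0.1, "fish": 0.9}}
score, tags = viterbi(["they", "fish"], states, start_p, trans_p, emit_p)
# tags is ["N", "V"]: "fish" is best explained as a verb after a noun.
```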
- CRF++ is "a simple,
customizable, and open source implementation of Conditional Random Fields
(CRFs) for segmenting/labeling sequential data. CRF++ is designed for
generic purpose and will be applied to a variety of NLP tasks, such as
Named Entity Recognition, Information Extraction and Text Chunking."
Chloe has provided a mini-tutorial to supplement the information on the CRF++ page here.
- Reuters' Web service for named entity identification, covering about 20
classes of entities.
- Natural Language Processing Libraries and Tools
- Stanford maintains a great list of downloadable tools for
statistical NLP tasks, written in different languages.
- LingPipe is a suite of Java
libraries for the linguistic analysis of human language, including
information extraction and data mining tools. E.g., track mentions of
entities (e.g. people or proteins); link entity mentions to database
entries; uncover relations between entities and actions; classify text
passages by language, character encoding, genre, topic, or sentiment;
correct spelling with respect to a text collection; cluster documents by
implicit topic and discover significant trends over time; and provide
part-of-speech tagging and phrase chunking. A friend has used it and likes
it, but I haven't played with it personally.
- OpenNLP contains a library of Java code for all sorts of NLP tasks, such as sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference resolution. Good documentation of the libraries is given on the site.
- If you just need a part-of-speech tagger you can check out Stanford's tagger. This is also in Java. The site links to a tutorial using the tagger on XML data. Stanford also provides a parser, a named entity recognizer, and a classifier that are all available separately from the Stanford NLP Group's software page.
- GATE, the General Architecture for Text
Engineering, is an NLP toolkit which includes support for information
extraction and offers a Weka interface. It appears robust and well-used, but
I have no direct experience with it.
- Evaluating Your Projects: There are three main ways you can
evaluate your system: two pertain to the system as a whole (including the
UI) and the third looks at the performance of one or more submodules. In
all cases, try to present your material graphically (instead of in a big
table). When creating such a graph, beware of the Microsoft Office default
templates, which include gratuitous chart junk. Instead, keep your graphs
clean and simple.
- Informal User Study of your System.
This is the most important type of user study and the one that is most
appropriate for people in this class. The basic idea is to watch a small
number of people using your system in order to understand what they are
trying to do, how well it works for them, what confuses them and what could
be improved. It is usually followed by improvements to the UI and perhaps
another evaluation in a process of iterative design improvement. Your
report should describe the users' comments and your subsequent design
changes. An excellent thing to read before doing such a study is Some
Techniques for Observing Users by Kathleen Gomoll. An example of a good
paper which uses this technique is Summarizing
Personal Web Browsing Sessions by Mira Dontcheva et al., UIST.
- Formal User Study of your System.
Once you have a polished UI design, it is common to do a more detailed study, with a
larger group of subjects, looking for statistically significant results.
It is unlikely that any 454 groups will have time to do this, but here is
an example of one paper which (in my biased view) does such a study nicely:
Improving the Performance of Motor-Impaired Users with
Automatically-Generated, Ability-Based Interfaces, by
Gajos, K. and Wobbrock, J. and Weld, D., CHI 2008.
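As a concrete illustration of what "statistically significant" means here,
this sketch computes a paired t statistic over hypothetical task-completion
times (all numbers invented) for the same eight subjects using two UI
variants, with only the Python standard library.

```python
# Paired t-test sketch on invented data: completion times (seconds) for
# the same 8 subjects under a baseline UI and a redesigned UI.
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic: mean of the differences over its standard error."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

baseline = [41.0, 38.5, 45.2, 39.9, 50.1, 44.0, 47.3, 42.8]
redesign = [36.2, 35.0, 40.1, 37.5, 44.0, 41.2, 42.6, 39.0]
t = paired_t(baseline, redesign)
# With n - 1 = 7 degrees of freedom, |t| above about 2.365 is significant
# at the p < .05 level (two-sided); here t comes out around 9.4.
```

In practice you would report an exact p-value from a stats package; the
point of the sketch is that the comparison is within-subject, which is what
"paired" buys you.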
- Module performance study.
Most (if not all) groups should include at least one experiment of this
form. Fortunately, with advance planning, these don't take very
long. Indeed, you did something of this form with HW1 and your evaluation
of the naive Bayes classifier. The
trick is to plan what you will measure before you write your code.
Pick a performance measure that is relevant to the system you have built:
precision? recall? speed? accuracy? throughput? latency? In the simplest
case, just measure this aspect of your system. Ideally, however, you will
measure two versions of your system and compare the two. For example,
classifier accuracy using a bag-of-words representation vs. bag of words
augmented with part-of-speech tags. Or throughput with and without your
snazzy caching scheme. This is why it is important to plan such an
experiment before you have implemented the caching mechanism - so you can
easily turn it on and off. Here's one example of a paper which includes
results of this form:
Information Extraction from Wikipedia: Moving Down the Long Tail by Fei Wu,
Raphael Hoffmann, and Daniel S. Weld, KDD 2008.
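The two-version comparison described above can be reported with standard
precision/recall/F1 numbers. Here is a minimal sketch of computing them;
the gold labels and the two classifiers' outputs are hypothetical.

```python
# Precision/recall/F1 for a binary classifier; the gold labels and the
# two classifiers' predictions below are invented for illustration.
def prf(gold, predicted, positive=1):
    """Return (precision, recall, F1) for the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold         = [1, 1, 1, 1, 0, 0, 0, 0]
bag_of_words = [1, 1, 0, 0, 1, 0, 0, 0]   # hypothetical baseline predictions
bow_plus_pos = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical augmented predictions
baseline_scores  = prf(gold, bag_of_words)
augmented_scores = prf(gold, bow_plus_pos)
```

Two rows like these - or, better, a bar chart of the F1 scores - is exactly
the kind of module result this section asks for.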
- Related Courses and Materials