University of Washington Department of Computer Science & Engineering
 CSE454 Project Description

Using CRF++

CRF++ is a simple, flexible, open-source implementation of Conditional Random Fields (CRFs) that you might want to use in your projects. If all you need is a CRF for extraction, it may be a better choice than a larger multi-purpose machine learning library such as Mallet (which may be the better fit if you need more advanced tools).



Downloading CRF++


You can get the software from http://crfpp.sourceforge.net/ either as a Windows binary or as source code. The source should compile on any machine with a C++ compiler; it took me no time at all to get it up and running on my MacBook.

The website has usage information and explains how to set up your training, test and model files. It is pretty straightforward. Most of the information on this page is adapted from that site. Hopefully, between this page and the CRF++ website, you will be all right!




Compiling CRF++


On your own machine

To configure and compile the C++ source code, make sure you have a C++ compiler on your machine, then type:

./configure

After that finishes, build the source code by running:

make

To install the program, you need root privileges. On my Mac I get them by running:

sudo make install

and entering my password. The CRF++ website says to run:

su
make install

I'm not sure which will be best for your machine.

On cubist (or another UW machine)

Since you don't have root access on this machine, you won't be able to run su/sudo.

To get around this, change the configure command so that it installs into a directory you can write to:

./configure --prefix=/your_project_directory/

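Then you can run make install without needing su/sudo. The programs are installed under the bin subdirectory of that prefix (and the library under lib), so either add that bin directory to your PATH or call crf_learn and crf_test by their full path.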



Configuring your training data file and template file


Your training data file should consist of sentences (or sequences of tokens) that you want to label. Each token/word in a sentence gets its own line and there is a blank line between sentences:

Test
sentence
one
.

Test
sentence
two
.

Each token line also holds all the features of that token. These could be the token's part of speech, the type of entity it is, the first two characters of the word, etc. Features are separated by whitespace (spaces or tabs). The true label for the token is written at the end of the line.

Test adjective T WORD
sentence noun s WORD
one noun o NUMBER
. . . PUNCT

Note that every line must have the same number of features, and the features must be written in the same order on each line. In the training example above, every token line has the same format:

Word Part_of_speech First_character LABEL
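Since a single line with a missing column is easy to overlook, it can help to check the file before training. The following is a minimal Python sketch, not part of CRF++; the function name check_columns is made up for illustration, and it only assumes the format described above (whitespace-separated columns, blank lines between sentences).

```python
def check_columns(text):
    """Return the column count shared by all token lines, or raise ValueError."""
    expected = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # A blank line is a sentence boundary, not a token line.
            continue
        ncols = len(line.split())
        if expected is None:
            expected = ncols
        elif ncols != expected:
            raise ValueError(
                f"line {lineno}: expected {expected} columns, found {ncols}"
            )
    return expected


# The training example from above: word, part of speech, first character, label.
sample = """Test adjective T WORD
sentence noun s WORD
one noun o NUMBER
. . . PUNCT
"""

print(check_columns(sample))  # 4
```

To check a real file, pass it the result of open("your_train_file").read(), where the file name is whatever you chose.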

Notice also that the features on each line are specific to that token only (I will explain how to add window features and more complex features in a moment). If you want global features, you need to add them to every line. So if a sentence can come from a type A page or a type B page, you would add that feature to each of its tokens:

Test adjective T TYPEB WORD
sentence noun s TYPEB WORD
one noun o TYPEB NUMBER
. . . TYPEB PUNCT

Test adjective T TYPEA WORD
sentence noun s TYPEA WORD
two noun t TYPEA NUMBER
. . . TYPEA PUNCT
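If your data already exists without the global feature, adding it is a mechanical step. Here is a small Python sketch under the assumption that the label is the last column and the new feature should go just before it; add_global_feature is a hypothetical helper name, not something CRF++ provides.

```python
def add_global_feature(sentence_lines, value):
    """Insert `value` as a new feature column just before the label column."""
    out = []
    for line in sentence_lines:
        *features, label = line.split()
        out.append(" ".join(features + [value, label]))
    return out


sentence = [
    "Test adjective T WORD",
    "sentence noun s WORD",
    "one noun o NUMBER",
    ". . . PUNCT",
]

for row in add_global_feature(sentence, "TYPEB"):
    print(row)
# Test adjective T TYPEB WORD
# sentence noun s TYPEB WORD
# one noun o TYPEB NUMBER
# . . . TYPEB PUNCT
```

Keeping the label as the final column matters, because that is where the training file format expects it.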


Now that you know what your training data file looks like, you can specify exactly which features you want in your CRF model.


The template file

(Note: this is explained MUCH better on the CRF++ website. I am just giving an overview. I suggest you go to the CRF++ site to read up on the template file syntax. It isn't too terrible.)

The template file sets up which features to use during a run of the CRF. Think of the training file as a table: each macro %x[row,col] picks out one cell, where row is counted relative to the current token and col is the column index, starting at 0. If you are currently at the line for the word "one" in the previous example, %x[0,0] is the current word. Likewise, %x[0,1] is the current part of speech, and %x[0,2] is the current word's first character. Then %x[-1,0] is the previous word, %x[-1,1] is the previous word's part of speech, %x[1,0] is the next word, and so on. For example, if you have a template file that says:

U0:%x[0,0]
U1:%x[0,1]
U2:%x[0,2]
U3:%x[-1,0]
U4:%x[1,0]
then at every token the CRF examines, it takes into account that word, the word's POS, the word's first character, and the previous and next words when deciding which label to give the token. The "U0", "U1", ... prefixes are template identifiers, not labels. They need to have "U" as the first character for unigram features. The numbers are arbitrary, but give each template its own number so that features produced by different templates are kept apart. You can also create unigram features that combine several columns, such as:

U5:%x[0,0]/%x[0,1]

which creates features that are word/pos such as "two/noun".
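To make the row and column lookup concrete, here is a rough Python sketch of how a template line turns into a feature string at one token. It only illustrates the idea and is not CRF++'s actual code. The expand function and the _PAD_ placeholder are assumptions of this sketch; CRF++ uses its own markers for positions before the start or after the end of a sentence.

```python
import re

# Matches a %x[row,col] macro as written in the template lines above.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

# Hypothetical placeholder for positions outside the sentence.
PAD = "_PAD_"


def expand(template, rows, i):
    """Expand every %x[row,col] in `template` for the token at index `i`."""
    def lookup(match):
        r = i + int(match.group(1))   # row is relative to the current token
        c = int(match.group(2))       # column is absolute, starting at 0
        return rows[r][c] if 0 <= r < len(rows) else PAD
    return MACRO.sub(lookup, template)


rows = [line.split() for line in [
    "Test adjective T WORD",
    "sentence noun s WORD",
    "two noun t NUMBER",
    ". . . PUNCT",
]]

templates = [
    "U0:%x[0,0]", "U1:%x[0,1]", "U2:%x[0,2]",
    "U3:%x[-1,0]", "U4:%x[1,0]", "U5:%x[0,0]/%x[0,1]",
]

# Features generated at the token "two" (index 2).
for t in templates:
    print(expand(t, rows, 2))
# U0:two
# U1:noun
# U2:t
# U3:sentence
# U4:.
# U5:two/noun
```

Note that the identifier (U0, U5, ...) stays in the generated string, which is why two templates with the same identifier would produce features that cannot be told apart.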

To add bigram features, put a line containing just B in the template file. This automatically generates features over the combination of the previous token's output label and the current token's output label.



Training and Testing


To train a CRF on a training data set using a specific template file, just run:

crf_learn template_file train_file model_file

This writes the trained CRF model to model_file. Then, to run the model on test data, you can run:

crf_test -m model_file test_files ...

crf_test prints each token in the test file with its features (and its true label, if given), followed by the label the CRF model assigned to the token.

A test file looks just like a training file, except that you can omit the labels at the ends of the lines. If you keep the labels, you can compare them with the predicted labels to compute the precision and recall of your model. However, since this is test data and could be real data from websites, you may not have these labels, which is fine.
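When the test file does keep its labels, each output line of crf_test ends with the true label followed by the predicted label, so per-label precision and recall can be computed with a few lines of Python. This is a sketch that assumes the default (non-verbose) output format; the function name is hypothetical and the sample output below is made up for illustration, not real CRF++ output.

```python
from collections import Counter


def per_label_scores(output_text):
    """Return {label: (precision, recall)} from labeled crf_test output."""
    correct, predicted, actual = Counter(), Counter(), Counter()
    for line in output_text.splitlines():
        cols = line.split()
        if len(cols) < 2:
            continue  # blank line between sentences
        true_label, pred_label = cols[-2], cols[-1]
        actual[true_label] += 1
        predicted[pred_label] += 1
        if true_label == pred_label:
            correct[true_label] += 1
    scores = {}
    for label in set(actual) | set(predicted):
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / actual[label] if actual[label] else 0.0
        scores[label] = (p, r)
    return scores


# Made-up example: the model mislabels "one" as WORD instead of NUMBER.
sample_output = """Test adjective T WORD WORD
sentence noun s WORD WORD
one noun o NUMBER WORD
. . . PUNCT PUNCT
"""

for label, (p, r) in sorted(per_label_scores(sample_output).items()):
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
# NUMBER: precision=0.00 recall=0.00
# PUNCT: precision=1.00 recall=1.00
# WORD: precision=0.67 recall=1.00
```

These are token-level scores. If your labels mark multi-token entities, you may prefer to score whole entities instead.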

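You can adjust learning settings such as the stopping threshold and the maximum number of iterations, as well as the -c regularization hyper-parameter and the -f feature frequency cutoff of crf_learn. You can also change how crf_test formats its output, for example with -v for more verbose output and -n for N-best results. All the options for crf_learn and crf_test are documented on the CRF++ website.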



Examples


The CRF++ download has a few examples for you to play with. I also wrote a toy example for information extraction assuming the line features:

word part_of_speech entity_type page_type(assuming A or B) LABEL

You can look at my files here:


Information Extraction


Department of Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX