Comparing Learners

After implementing a new learner or gathering a data set, you will probably want to start exploring its properties. One common way to measure an algorithm's performance on a data set is cross-validation: randomly split the data set into N collections (folds), perform N runs, each training on N - 1 of the collections and testing on the remaining one, and average the performance over all the runs. For example, with N = 10, each run trains on 90% of the data and tests on the remaining 10%.

VFML provides three tools to help do this. The highest-level one is batchtest: with one command you can compare a collection of learners on a collection of datasets. Next is xvalidate, which evaluates a single learner on a single dataset. Finally, folddata can be used to create the training/testing sets used by the two higher-level tools. All three tools can use seeded random number generators to exactly reproduce every experiment they perform.

This collection of tools was designed to be convenient while still allowing low-level access for debugging. For example, you might be running batchtest to compare a large number of learners and datasets. You notice that the learner you are working on has unusually bad performance on a specific dataset, so you use folddata and the seed from batchtest to recreate the exact dataset that was giving you trouble and do whatever debugging you need. Then you can use xvalidate to run the updated learner on the dataset to make sure you've corrected the problem before spending the time to re-run the complete batchtest.
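
For instance, the commands below sketch that workflow. The flags shown for folddata and xvalidate are assumptions modeled on batchtest's conventions, and mylearner is a hypothetical learner name; check each tool's documentation for its exact syntax before using them.

folddata -f <filestem> -folds 3 -seed <seed>                              # assumed flags: recreate the folds batchtest built with that seed
xvalidate -learn "mylearner -u -f" -f <filestem> -folds 3 -seed <seed>    # assumed flags and hypothetical learner: re-run just the updated learner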

We have converted a collection of real-world datasets for use with VFML.

Using batchtest

Example for: using batchtest to perform cross-validation

Requires: the <VFML-root>/bin directory be in your path. 

This example demonstrates running the batchtest tool to perform cross-validation. For it to work, you'll need to make sure the <VFML-root>/bin directory is in your path.

Change to the <VFML-root>/examples/using-batchtest/ directory.   This directory contains a fake dataset and a number of input files for batchtest.   Use your favorite text editor to open the learners file:

mostcommonclass -u -f
naivebayes -u -f

#c50wrapper -f
#c45wrapper -args "-g" -f
#c45wrapper -f
#vfdt -u -batch -prune -f

This file is set up to run the mostcommonclass and naivebayes learners (the lines that begin with # are comments, but they show how you could use some of the other VFML learners with batchtest). When you develop a learner you can use this file as a starting point; comparing against these learners will give you a sense of how well you are doing.
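
If your learner follows the same command-line conventions, you can add it to the file with one more line; for example (mylearner is a hypothetical name, assumed to accept the same -u and -f flags described below):

mylearner -u -f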

The -u flag tells mostcommonclass to test itself on the data in <filestem>.test and output its error rate and size (which is always 0 for mostcommonclass). The -f flag tells mostcommonclass that the next argument it sees will be the filestem it should use for the run. Batchtest calls the learners by appending the filestem to the end of the lines in the learners file, so the executed command lines will be something like:

mostcommonclass -u -f <filestem>

Now look at the fake-datasets file:

banana :: banana

This line describes a dataset.  The section of the line before the '::' is the path to the directory that contains the dataset. The section after the '::' is the file stem for the dataset.
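
A datasets file can describe any number of datasets, one per line in the same path :: filestem format, for example (the directory names and stems here are hypothetical):

datasets/mushroom :: mushroom
datasets/voting :: voting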

Now run:

batchtest -data fake-datasets -learn learners -folds 3

to do 3-fold cross-validation of the mostcommonclass and naivebayes learners on the banana dataset. You will see output something like this:

Running with seed: 5315
'mostcommonclass -u -f' :
24.524 (2.030) 0.000 (0.000) 0.003 (0.003) 0.000 (0.000)
'naivebayes -u -f' :
24.524 (2.030) 0.000 (0.000) 0.003 (0.003) 0.003 (0.003)

The first line of the output tells you that batchtest used the seed 5315 to create the datasets for cross-validation. If you later run the command:

batchtest -data fake-datasets -learn learners -folds 3 -seed 5315

you will reproduce exactly the same datasets. The next part of the output says that, on average on the banana dataset, the mostcommonclass learner had a 24.524% error rate with a standard deviation of 2.03%, produced a model of size 0, and ran in about 0.003 seconds of user time and 0.000 seconds of system time. Naivebayes's performance is similar on this dataset, but you can expect naivebayes to do substantially better than mostcommonclass on almost any real-world dataset.
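
In other words, each result line reports fold-averaged values, each followed by its standard deviation in parentheses, in this order:

error rate % (stdev)   model size (stdev)   user time seconds (stdev)   system time seconds (stdev)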

The <VFML-root>/examples/using-batchtest/ directory contains three other files which you might find very useful: uci-all, uci-allknown and uci-discrete. These files contain descriptions of the UCI datasets which we distribute with VFML. If you've downloaded those datasets, you can use these files to easily test your learners on them. The uci-discrete file lists every dataset that contains only discrete attributes, the uci-all file lists every dataset in our distribution, and the uci-allknown file lists every dataset that has no unknown attribute values. To test your learner on the largest possible collection of datasets, update the learners file and run:

batchtest -data uci-all -learn learners

In this directory you will also find a file called sig.awk. It is a very simple script that summarizes the relative performance of two learners; execute it by running:

batchtest -data uci-discrete -learn learners > test.out
awk -f sig.awk test.out

The output will be something like:

second won big on audiology:
second won big on breast-cancer-wisconsin:
second won big on car:
second won big on house:
second won big on monks:
second won big on mushroom:
second won big on nursery:
second won big on promoter:
second won big on splice-jxn:
second won big on tic-tac-toe:
second won big on voting:
second won big on zoo:
First won 0 -- 0 by the sum of the stdevs
Second won 12 -- 12 by the sum of the stdevs

This means that the second learner in the learners file, naivebayes in this case, won on all 12 datasets, and that it won by more than the sum of the standard deviations on every one of them.
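
If you later want to repeat the comparison on exactly the same folds, for example after improving your learner, reuse the seed that batchtest printed at the top of test.out:

batchtest -data uci-discrete -learn learners -seed <seed from test.out> > test2.out
awk -f sig.awk test2.out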