Using batchtest

Example for: using batchtest to perform cross-validation

Requires: the <VFML-root>/bin directory be in your path. 

This example demonstrates running the batchtest tool to perform cross-validation.  For it to work, make sure the <VFML-root>/bin directory is in your path.

Change to the <VFML-root>/examples/using-batchtest/ directory.   This directory contains a fake dataset and a number of input files for batchtest.   Use your favorite text editor to open the learners file:

mostcommonclass -u -f
naivebayes -u -f

#c50wrapper -f
#c45wrapper -args "-g" -f
#c45wrapper -f
#vfdt -u -batch -prune -f

This file is set up to run the mostcommonclass and naivebayes learners (the lines that begin with # are comments, but they show how you could invoke some of the other VFML learners with batchtest).  When you develop a learner of your own, you can use this file as a starting point; comparing against these learners will give you a sense of how well you are doing.

The -u flag tells mostcommonclass to test itself on the data in <filestem>.test and output its error rate and model size (which is always 0 for mostcommonclass).  The -f flag tells mostcommonclass that the next argument it sees will be the filestem it should use for the run.  Batchtest calls the learners by appending the filestem to the end of each line in the learners file, so the executed command lines will be something like:

mostcommonclass -u -f <filestem>
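
To add a learner of your own to the file, follow the same conventions.  As a hypothetical sketch (mylearner is a made-up executable name, not part of VFML), the line you would add might look like:

# mylearner is hypothetical -- replace with the name of your executable
mylearner -u -f

Batchtest would then invoke it as 'mylearner -u -f <filestem>' for each run, just like the learners above.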

Now look at the fake-datasets file:

banana :: banana

This line describes a dataset.  The section of the line before the '::' is the path to the directory that contains the dataset.  The section after the '::' is the file stem for the dataset.
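
A datasets file can contain any number of such lines, one per dataset.  For example, a hypothetical datasets file pointing batchtest at two datasets stored under a datasets/ directory (these paths and file stems are made up) might look like:

datasets/mushroom :: mushroom
datasets/voting :: voting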

Now run:

batchtest -data fake-datasets -learn learners -folds 3

to do 3-fold cross-validation of the mostcommonclass and naivebayes learners on the banana dataset (each learner is trained and tested 3 times, each time holding out a different third of the data for testing).   You will see output something like this:

Running with seed: 5315
'mostcommonclass -u -f' :
24.524 (2.030) 0.000 (0.000) 0.003 (0.003) 0.000 (0.000)
'naivebayes -u -f' :
24.524 (2.030) 0.000 (0.000) 0.003 (0.003) 0.003 (0.003)

The first line of the output tells you that batchtest used the seed 5315 to create the datasets for cross-validation.  If you later run the command:

batchtest -data fake-datasets -learn learners -folds 3 -seed 5315

you would reproduce the exact same datasets.  The next part of the output says that, on average on the banana dataset, the mostcommonclass learner had a 24.524% error rate with a standard deviation of 2.030%, produced a model of size 0, and ran in about 0.003 seconds of user time and 0.000 seconds of system time.  Naivebayes's performance is similar on this dataset, but you can expect naivebayes to do substantially better than mostcommonclass on almost any real-world dataset.
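
Based on that description, you can read each result line column by column like this (this is just an annotation of the mostcommonclass line above):

error %  (stdev)   size   (stdev)   user secs (stdev)   sys secs (stdev)
24.524   (2.030)   0.000  (0.000)   0.003     (0.003)   0.000    (0.000)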

The <VFML-root>/examples/using-batchtest/ directory contains three other files which you might find very useful: uci-all, uci-allknown and uci-discrete.  These files contain descriptions of the UCI datasets which we distribute with VFML.  If you've downloaded those datasets, you can use these files to easily test your learners on them.  The uci-all file lists every dataset in our distribution, the uci-allknown file lists every dataset that has no unknown values, and the uci-discrete file lists every dataset that contains only discrete attributes.  To test your learner on the largest possible collection of datasets, update the learners file and run:

batchtest -data uci-all -learn learners

In this directory you will also find a file called sig.awk.  It is a very simple script that summarizes the relative performance of the first two learners in the learners file; execute it by running:

batchtest -data uci-discrete -learn learners > test.out
awk -f sig.awk test.out

The output will be something like:

second won big on audiology:
second won big on breast-cancer-wisconsin:
second won big on car:
second won big on house:
second won big on monks:
second won big on mushroom:
second won big on nursery:
second won big on promoter:
second won big on splice-jxn:
second won big on tic-tac-toe:
second won big on voting:
second won big on zoo:
First won 0 -- 0 by the sum of the stdevs
Second won 12 -- 12 by the sum of the stdevs

This means that the second learner in the learners file, naivebayes in this case, won on all 12 datasets, and that it won by more than the sum of the standard deviations on each of those datasets.
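
As a concrete illustration of that criterion (these numbers are made up): if on some dataset the first learner had a 24.5% error rate with a standard deviation of 2.0 and the second had 18.0% with a standard deviation of 1.5, then the difference in error rates exceeds the sum of the standard deviations:

24.5 - 18.0 = 6.5 > 2.0 + 1.5 = 3.5

so sig.awk would count that dataset as a big win for the second learner.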