xvalidate File Reference

Detailed Description

Performs cross validation of a learner on a data set.

You will probably want to use the batchtest tool for large experiments; xvalidate will help you to quickly test things, perhaps as a debugging aid.

You can use xvalidate with large datasets, but you will need enough disk space to hold 'folds' copies of the data. The learner you use with xvalidate must also be able to work with large datasets.

Xvalidate takes input in C4.5 format and uses folddata, which must be in your path. You use the -c option to tell xvalidate how to run the learner. Xvalidate will append the names of the folds of the datasets to the end of the -c string, the learner must accept the name and read input appropriately.

Xvalidate expects the learner to output results in the following format:

error-rate size

The learner's error rate on the test set, followed by some whitespace, followed by the size of the learned model (in whatever unit you want), followed by a newline.

Xvalidate will collect the output of the runs of the learner, average them, and report:

mean-error-rate (standard deviation of error rate) mean-size (standard deviation of size) average-utime (standard deviation of utime) average-stime (standard deviation of stime)

for example:

26.111 (5.500) 0.000 (0.000) 0.013 (0.005) 0.010 (0.008)

The times are very accurate on UNIX. Under CYGNUS (windows) utime will be slightly overestimated and stime will be zero.

Arguments

-f <filestem>
- Set the name of the dataset (default DF)
-source <dir>
- Set the source data directory (default '.')
-c <command>
- Set the learner command. The name of the dataset will be appended to the end of this string and used to invoke the learner (This is a required argument)
-folds <n>
- Sets the number of train/test sets to create (default 10)
-seed <n>
- Sets the random seed, multiple runs with the same seed will produce the same datasets (defaults to a random seed). If you use a random seed, the value of the randomly selected seed will be printed at the start of the run. You can later use that seed to repeat the experiment. You can also pass the same seed to folddata to recreate the exact test/training sets for closer inspection.
-v
- Can be used multiple times to increase the debugging output

Example

xvalidate -source datasets/mushroom -f mushroom -folds 15 -seed 100 -c "mostcommonclass -u -f"

Does 15-fold cross-validation of the 'mostcommonclass' learner on the dataset called 'mushroom' in the 'datasets/mushroom' directory. The mostcommonclass learner will be invoked as: mostcommonclass -u -f <constructed-dataset-name> for each of the 15 constructed datasets. It will use a seeded random number generator so the exact experiment could be reproduced.

Generated for VFML by

hosted by