folddata File Reference

Detailed Description

Randomly splits a data set into a collection of train/test pairs.

The folddata tool

This tool is used by the xvalidate tool, but you may also find it useful on its own. For example, if your algorithm performs very poorly on a specific cross-validation run, you could use this tool to reproduce the datasets and track down the problem.

Folddata splits a dataset into a number of training and testing sets, as needed for cross-validation. It takes each example in the original dataset and randomly assigns it to one of the folds it is creating (because the assignment is random, the folds will generally not be exactly the same size). Folddata then outputs one dataset for each fold, with the examples from that fold as the test set and the examples in all the other folds as the training set.
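
The splitting scheme described above can be illustrated with a short Python sketch. This is not the folddata source code, and the function names assign_folds and train_test_pairs are hypothetical. It uses Python's own random generator, so the same seed will not reproduce the splits that folddata itself produces; it only shows why the folds come out unevenly sized and how each fold becomes a test set.

```python
import random


def assign_folds(examples, folds=10, seed=None):
    """Assign each example to one fold, independently and at random.

    Hypothetical helper: illustrates the scheme, not folddata's actual code.
    """
    rng = random.Random(seed)  # the same seed gives the same assignment
    buckets = [[] for _ in range(folds)]
    for example in examples:
        # Each example picks its fold on its own, so fold sizes vary.
        buckets[rng.randrange(folds)].append(example)
    return buckets


def train_test_pairs(buckets):
    """Yield one (train, test) pair per fold.

    The fold itself is the test set; all other folds form the training set.
    """
    for i, test in enumerate(buckets):
        train = [ex for j, bucket in enumerate(buckets) if j != i for ex in bucket]
        yield train, test


if __name__ == "__main__":
    buckets = assign_folds(range(1000), folds=10, seed=10)
    print([len(b) for b in buckets])  # sizes are close to 100 but rarely equal
```

Every example lands in exactly one test set and in the training set of every other fold, which is why the output needs roughly as many copies of the dataset as there are folds.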

Folddata works efficiently on large datasets, but it requires enough disk space to hold as many copies of the dataset as there are folds.

Folddata reads its input and writes its output in C4.5 format. It expects to find the files <stem>.names and <stem>.data, and it outputs <stem>[0 - (n-1)].[names, data, test], where n is the number of folds.

Arguments

-f <filestem>
- Set the stem name (default DF)
-target <dir>
- Set the output directory (default '.')
-source <dir>
- Set the directory that contains the dataset (default '.')
-folds <n>
- Sets the number of train/test sets to create (default 10)
-seed <n>
- Sets the random seed; multiple runs with the same seed will produce the same datasets (defaults to a random seed)
-h
- Display usage information and exit
-v
- Can be used multiple times to increase the debugging output

Example

folddata -f test -target output -folds 15 -seed 10

This creates 15 folds from test.names and test.data and puts them in the directory named output as test[0-14].names, test[0-14].data, and test[0-14].test. Because the random generator is seeded, running the command again with the same seed reproduces exactly the same datasets.
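
To check that a run like the one above produced everything, it can help to list the file names it should have written. The following Python sketch does that under the naming scheme described in this page; the helper expected_outputs is hypothetical and is not part of VFML.

```python
import os


def expected_outputs(stem, folds, target="."):
    """List the files folddata is described as writing for a given run.

    Hypothetical helper: folds are numbered from 0 to folds - 1, and each
    fold has a .names, a .data (training), and a .test file.
    """
    paths = []
    for i in range(folds):
        for ext in ("names", "data", "test"):
            paths.append(os.path.join(target, f"{stem}{i}.{ext}"))
    return paths


if __name__ == "__main__":
    # Matches the example: -f test -target output -folds 15
    missing = [p for p in expected_outputs("test", 15, "output")
               if not os.path.exists(p)]
    print("missing files:", missing)
```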
