
folddata File Reference

Detailed Description

Randomly splits a data set into a collection of train/test pairs.

The folddata tool

This tool is used by the xvalidate tool, but you may also find it useful on its own. For example, if your algorithm performs very poorly on a specific cross-validation run, you could use this tool to reproduce the datasets and track down the problem.

Folddata splits a dataset into a number of testing and training sets, as needed for cross-validation. It takes each example in the original dataset and randomly assigns it to one of the 'folds' it is creating (note that this randomness means the folds won't be exactly evenly sized). Folddata then outputs one dataset for each fold, with the examples from that fold as the test set and the examples from all the other folds as the training set.
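
The splitting procedure described above can be sketched as follows. This is a minimal Python illustration, not folddata's actual implementation: the function names assign_folds and train_test_pairs are hypothetical, and Python's random module stands in for whatever generator the tool uses, so the same seed will not reproduce the tool's own splits.

```python
import random


def assign_folds(num_examples, num_folds, seed):
    """Independently assign each example index to a random fold.

    Hypothetical helper: every example picks a fold uniformly at random,
    so fold sizes are only approximately equal.
    """
    rng = random.Random(seed)  # seeded, so the assignment is reproducible
    return [rng.randrange(num_folds) for _ in range(num_examples)]


def train_test_pairs(examples, num_folds, seed):
    """Yield (fold, train, test) for each fold.

    The test set holds the examples assigned to that fold; the training
    set holds the examples assigned to every other fold.
    """
    assignment = assign_folds(len(examples), num_folds, seed)
    for fold in range(num_folds):
        test = [ex for ex, f in zip(examples, assignment) if f == fold]
        train = [ex for ex, f in zip(examples, assignment) if f != fold]
        yield fold, train, test


if __name__ == "__main__":
    data = list(range(100))  # stand-in for the rows of a .data file
    for fold, train, test in train_test_pairs(data, num_folds=5, seed=10):
        print(fold, len(train), len(test))
```

Each example lands in exactly one test set, and every train/test pair together covers the whole dataset once.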

Folddata works efficiently on large datasets, but requires enough disk space to hold 'folds' copies of the dataset, because each fold's training and test sets together contain every example once.

Folddata reads its input and writes its output in C4.5 format. It expects to find the files <stem>.names and <stem>.data, and outputs <stem>[0 - n].[names, data, test], where the fold index starts at 0.



folddata -f test -target output -folds 15 -seed 10

Will create 15 folds from test.names and test.data and put them in the directory named output as test[0-14].names, test[0-14].data, and test[0-14].test. It uses a seeded random generator, so the exact same datasets can be reproduced.
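
To check that a run produced everything it should, the expected output file names can be listed with a short script. This is a sketch based only on the naming pattern described above; the helper expected_outputs is hypothetical and is not part of VFML.

```python
import os


def expected_outputs(stem, target, folds):
    """List the files a run is expected to write, per the naming pattern
    <stem><fold>.names, <stem><fold>.data, and <stem><fold>.test.

    Hypothetical helper for illustration only.
    """
    paths = []
    for fold in range(folds):  # fold indices run from 0 to folds - 1
        for ext in ("names", "data", "test"):
            paths.append(os.path.join(target, f"{stem}{fold}.{ext}"))
    return paths


if __name__ == "__main__":
    # Mirrors: folddata -f test -target output -folds 15 -seed 10
    missing = [p for p in expected_outputs("test", "output", 15)
               if not os.path.exists(p)]
    print("missing files:", missing)
```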

Generated for VFML by doxygen