treedata File Reference

Detailed Description

Creates a synthetic data set by sampling from a randomly generated DecisionTree.

This program creates a synthetic binary tree and then uses it to label data, which can then be used to evaluate learning algorithms. It has been used to evaluate the vfdt system.

The synthetic tree is generated starting from a single node as follows. A leaf is selected and is either split on one of the active attributes (each discrete attribute is used at most once on any path from the root to a leaf) or pruned. The probability of pruning is set by the prunePercent parameter, except that it is 0 if the leaf is not below firstPruneLevel and 1 if the leaf is below maxPruneLevel. If the attribute selected for a split is continuous, a threshold is generated uniformly in the range 0 to 1, except that treedata ensures the chosen threshold is not redundant with an earlier split.
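The pruning rule above can be sketched as follows. This is an illustrative sketch, not the treedata source: the function names are hypothetical, "below" is read as "strictly deeper than", and the threshold helper assumes that redundancy is avoided by drawing only inside the interval that earlier splits on the same path have left open.

```python
import random


def prune_probability(level, prune_percent=25.0, first_prune_level=3, max_prune_level=18):
    """Chance (0.0 to 1.0) that a leaf at depth `level` is pruned rather than split."""
    if level <= first_prune_level:
        return 0.0  # not below firstPruneLevel: never pruned
    if level > max_prune_level:
        return 1.0  # below maxPruneLevel: always pruned
    return prune_percent / 100.0  # prunePercent is given as a percentage


def draw_threshold(rng, low=0.0, high=1.0):
    """Uniform threshold inside the interval still open on this path (assumed approach)."""
    return rng.uniform(low, high)


rng = random.Random(0)
for level in (1, 3, 4, 18, 19):
    print(level, prune_probability(level))
print(draw_threshold(rng, 0.2, 0.4))
```

With the default parameters, leaves at levels 1 to 3 are always split, leaves at levels 4 to 18 are pruned with probability 0.25, and any leaf deeper than level 18 is always pruned.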

Once the tree's structure is complete (all leaves have been pruned), a class label is randomly assigned to each leaf, and then redundant subtrees (those in which every leaf has the same classification) are pruned.
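A minimal sketch of the labeling and redundancy-pruning step is shown below. The Node class and both function names are hypothetical and exist only for illustration; the actual program uses its own tree structures.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: Optional[int] = None      # class label, set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None


def assign_labels(node, rng, num_classes):
    """Give every leaf a class label drawn uniformly at random."""
    if node.is_leaf():
        node.label = rng.randrange(num_classes)
    else:
        assign_labels(node.left, rng, num_classes)
        assign_labels(node.right, rng, num_classes)


def collapse_redundant(node):
    """Replace any subtree whose leaves all share one label with a single leaf."""
    if node.is_leaf():
        return node
    # Collapse the children first so redundancy propagates upward.
    node.left = collapse_redundant(node.left)
    node.right = collapse_redundant(node.right)
    if node.left.is_leaf() and node.right.is_leaf() and node.left.label == node.right.label:
        return Node(label=node.left.label)
    return node


tree = Node(left=Node(left=Node(label=1), right=Node(label=1)), right=Node(label=0))
tree = collapse_redundant(tree)
print(tree.left.is_leaf(), tree.left.label)
```

Working bottom-up means a subtree of any depth collapses to one leaf as soon as all of its leaves agree, while a split that still separates two classes is kept.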

Data (training, testing, and pruning) is generated by creating an example and setting its attributes with uniform probability. The tree is used to label the data. Noise is then applied as specified by the noise parameter: the class and discrete attribute values are resampled with uniform probability (without replacement), and continuous attributes are changed by sampling from a Gaussian whose mean is their current value and whose standard deviation is given by noise.
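The noise step might look roughly like the sketch below. Several details are assumptions rather than documented behavior: the function name and argument layout are hypothetical, "without replacement" is read as "the new value differs from the current one", each value is resampled with probability noise/100, and the Gaussian standard deviation is taken to be noise/100. No clamping to the 0 to 1 range is applied, since the description does not mention any.

```python
import random


def add_noise(example, label, discrete_arity, num_classes, noise_percent, rng):
    """Return a noisy copy of (example, label).

    discrete_arity[i] is the number of values of attribute i,
    or None when attribute i is continuous.
    """
    p = noise_percent / 100.0  # assumed scaling of the noise parameter

    def resample(current, num_values):
        # With probability p, swap in a different value chosen uniformly.
        if num_values > 1 and rng.random() < p:
            return rng.choice([v for v in range(num_values) if v != current])
        return current

    noisy = []
    for value, arity in zip(example, discrete_arity):
        if arity is None:
            # Continuous: Gaussian centered on the current value.
            noisy.append(rng.gauss(value, p))
        else:
            noisy.append(resample(value, arity))
    return noisy, resample(label, num_classes)


rng = random.Random(1234)
print(add_noise([1, 0.5], 0, [3, None], 2, 15.0, rng))
```

With a noise of 0 the example and its label come back unchanged, which matches the default behavior of the -noise argument.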

Using the same conceptSeed (along with the same values for the other parameters) results in the same concept being created. Using the same seed results in the same data being generated, so experiments are easily repeatable.
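The idea of keeping the concept and the data on separate seeds can be illustrated with two independent generators. This is only a sketch: the helper names are hypothetical, and the values are passed directly as seeds, whereas treedata describes conceptSeed as a multiplier for the concept seed.

```python
import random


def make_generators(concept_seed, data_seed):
    """Two independent generators: one shapes the concept, one draws the examples."""
    return random.Random(concept_seed), random.Random(data_seed)


def draw_examples(rng, n, num_attributes):
    """Draw n examples with attribute values uniform in the range 0 to 1."""
    return [[rng.random() for _ in range(num_attributes)] for _ in range(n)]


_, data_rng_a = make_generators(21, 1234)
_, data_rng_b = make_generators(21, 1234)
_, data_rng_c = make_generators(21, 99)

run_a = draw_examples(data_rng_a, 3, 2)
run_b = draw_examples(data_rng_b, 3, 2)
run_c = draw_examples(data_rng_c, 3, 2)

print(run_a == run_b)  # same data seed, same examples
print(run_a == run_c)  # different data seed, same concept seed
```

Separating the two generators makes it possible to draw fresh data from an unchanged concept by varying only the data seed.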

This program also writes some additional statistics to the 'stem'.stats file.

Arguments

-f 'stem name'
- (default DF)
-discrete 'number of discrete attributes'
- (default 100)
-continuous 'number of continuous attributes'
- (default 10)
-classes 'number of classes'
- (default 2)
-train 'size of training set'
- (default 50000)
-test 'size of testing set'
- (default 50000)
-prune 'size of prune set'
- (default 50000)
-stdout
- Output the trainset to stdout (default to 'stem'.data)
-noise 'percentage noise (as float, e.g. 10.2 is 10.2%)'
- (default 0)
-prunePercent '% of nodes to prune at each level'
- (default is 25, that's 25%)
-firstPruneLevel 'don't prune nodes before this level'
- (default is 3)
-maxPruneLevel 'prune every node after this level'
- (default is 18)
-conceptSeed 'the multiplier for the concept seed'
- (default 100)
-seed 'random seed'
- (default to random)
-v
- Increase the message level
-h
- Run with this argument to get a list of arguments and their meanings.

Example

treedata -discrete 10 -continuous 0 -noise 15 -conceptSeed 21 -seed 1234 -prunePercent 15 -train 100 -test 100 -prune 100

Creates 100 training, 100 testing, and 100 pruning examples from a concept tree built with a 15% chance of pruning each node past level 3 and a 100% chance of pruning past level 18 (the defaults for firstPruneLevel and maxPruneLevel). 15% noise is added to the data. Because both seed arguments are fixed, repeated runs of the command produce the same data set.

Generated for VFML.