This program creates a synthetic binary tree and uses it to label data, which can then be used to evaluate learning algorithms. It has been used to evaluate the VFDT system.
The synthetic tree is generated starting from a single node as follows. A leaf is selected and is either split on one of the active attributes (each discrete attribute is used at most once on a path from the root to a leaf) or pruned. The probability of pruning is set by the prunePercent parameter, but it is 0 if the leaf is not below firstPruneLevel and 1 if the leaf is below maxPruneLevel. If the attribute selected for a split is continuous, a threshold is generated uniformly in the range 0 to 1, except that treedata ensures that the chosen threshold is not redundant with an earlier split on the same path.
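The pruning rule above can be sketched as a small function. This is only an illustration, not the program's actual code: the function name and argument names are hypothetical, prunePercent is assumed to be a percentage (as in the example invocation), and "below level L" is assumed to mean a depth strictly greater than L.

```python
def prune_probability(level, prune_percent, first_prune_level, max_prune_level):
    """Return the chance (0.0 to 1.0) that a leaf at depth `level` is pruned.

    Assumption: "below level L" means level > L.
    """
    if level > max_prune_level:
        return 1.0  # always prune below maxPruneLevel
    if level <= first_prune_level:
        return 0.0  # never prune at or above firstPruneLevel
    return prune_percent / 100.0  # otherwise use prunePercent


if __name__ == "__main__":
    for depth in (2, 4, 19):
        print(depth, prune_probability(depth, 15, 3, 18))
```

With prunePercent 15, firstPruneLevel 3, and maxPruneLevel 18, a leaf at depth 2 is never pruned, a leaf at depth 4 is pruned with probability 0.15, and a leaf at depth 19 is always pruned.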
Once the tree's structure is created (that is, once all leaves have been pruned), a class label is randomly assigned to each leaf, and then redundant subtrees (those in which every leaf has the same classification) are pruned.
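One way to picture the removal of redundant subtrees is a bottom-up pass that replaces any split whose two children are leaves with the same class by a single leaf. The sketch below is an assumption about how this could be done, not the program's implementation. The tree representation (a dict with hypothetical keys "attr", "left", and "right" for an internal node, and a bare class label for a leaf) is invented for the example.

```python
def collapse_redundant(node):
    """Return a copy of the tree with single-class subtrees replaced by a leaf."""
    if not isinstance(node, dict):
        return node  # a leaf is just its class label
    left = collapse_redundant(node["left"])
    right = collapse_redundant(node["right"])
    both_leaves = not isinstance(left, dict) and not isinstance(right, dict)
    if both_leaves and left == right:
        return left  # every leaf below this node has the same class
    return {"attr": node["attr"], "left": left, "right": right}


if __name__ == "__main__":
    tree = {"attr": 0, "left": {"attr": 1, "left": 1, "right": 1}, "right": 0}
    print(collapse_redundant(tree))
```

Because children are collapsed before their parent is examined, a deep subtree whose leaves all share one class shrinks to a single leaf in one pass.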
Data (training, testing, and pruning) is generated by creating an example and setting its attributes with uniform probability. The tree is used to label the data. Noise is then added: the class label and the discrete attribute values are resampled with uniform probability (without replacement) as specified by the noise parameter, and each continuous attribute is changed by sampling from a Gaussian whose mean is its current value and whose standard deviation is the noise level.
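The noise step described above might look roughly like the following. This is a sketch under several assumptions that the text does not confirm: the function names are hypothetical, the noise parameter is treated as a percentage, "without replacement" is read as "the new value is never the current value", the Gaussian's standard deviation is taken to be noise / 100, and no clamping to the 0 to 1 range is applied.

```python
import random


def resample_discrete(value, num_values, noise_percent, rng):
    """With probability noise_percent/100, replace value by a different one."""
    if rng.random() * 100 < noise_percent:
        # Assumed reading of "without replacement": exclude the current value.
        others = [v for v in range(num_values) if v != value]
        return rng.choice(others)
    return value


def perturb_continuous(value, noise_percent, rng):
    """Draw from a Gaussian centred on value (assumed std: noise_percent/100)."""
    return rng.gauss(value, noise_percent / 100.0)


if __name__ == "__main__":
    rng = random.Random(1234)
    print(resample_discrete(1, 3, 15, rng))
    print(perturb_continuous(0.5, 15, rng))
```

Both helpers take the random generator as an argument, so a fixed seed reproduces the same noisy data.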
Using the same conceptSeed (along with the same other parameters) results in the same concept being created. Likewise, using the same seed results in the same data being generated, so experiments are easily repeatable.
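The two seeds can be thought of as driving two independent random streams, one for building the concept and one for generating data. The sketch below illustrates that idea with Python's standard generator. It is an assumption about the design, not a description of the program's internals, and make_generators is a hypothetical helper.

```python
import random


def make_generators(concept_seed, data_seed):
    """Return separate generators for concept creation and data generation."""
    return random.Random(concept_seed), random.Random(data_seed)


if __name__ == "__main__":
    concept_a, data_a = make_generators(21, 1234)
    concept_b, data_b = make_generators(21, 9999)
    # Same conceptSeed: the concept stream is identical in both runs.
    print(concept_a.random() == concept_b.random())
    # Different seed: the data streams are seeded independently.
    print(data_a.random(), data_b.random())
```

Keeping the streams separate means the data seed can be varied to draw fresh samples from a fixed concept.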
This program also outputs some additional statistics into the stem.stats file.
treedata -discrete 10 -continuous 0 -noise 15 -conceptSeed 21 -seed 1234 -prunePercent 15 -train 100 -test 100 -prune 100
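Creates 100 training, 100 testing, and 100 pruning examples from a concept tree with 10 discrete attributes and no continuous attributes. The tree is built with a 15% chance of pruning each node past level 3 and a 100% chance of pruning past level 18; these levels are the firstPruneLevel and maxPruneLevel values in effect, which this command line does not set. 15% noise is added to the data. Because the conceptSeed and seed arguments are fixed, running the command again produces the same data set.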