
vfdt File Reference

Detailed Description

Learns a decision tree from a high-speed data stream or very large data set.

vfdt is described in this paper. This version of VFDT has many extensions added since that paper was written, including the ability to learn from domains with continuous attributes. We hope to have a more up-to-date paper to cite here soon.

VFDT learns a decision tree as follows. It starts with a single leaf and begins collecting training examples from a data stream (with the -stdin argument) or from a file. When it has enough data to know with high confidence (see the -sc parameter below) which attribute is the best one to partition the data with, it turns the leaf into an internal node that splits on the selected attribute and starts learning at the new leaves recursively.

VFDT is an incremental online algorithm: it has a model available at any time during its run and refines the model over time as it is presented with additional training data (see the vfdt-engine.h interface for an API to the learning engine if you want to incrementally learn decision trees in your own programs). VFDT will cache training examples in RAM if it has enough memory available (see the -growMegs parameter below); otherwise it just uses them to update the statistics in the leaf where they belong and then frees them. VFDT will also disable growing at unpromising leaves (and free the associated sufficient statistics) to save additional RAM when needed.
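The split decision described above can be sketched with the Hoeffding bound, the statistical test used in the published VFDT algorithm. This is a minimal illustration under stated assumptions, not VFML's code: the function names are hypothetical, delta stands for the allowed chance of error in a decision (assumed here to be the quantity the -sc parameter controls), and the gain values are made-up inputs.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability at least 1 - delta, the true
    mean of a variable with the given range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, num_classes, delta, n):
    """Hypothetical helper: decide whether a leaf has seen enough examples
    to split on its best attribute."""
    # Information gain lies in [0, log2(num_classes)], which gives the range.
    value_range = math.log2(num_classes)
    epsilon = hoeffding_bound(value_range, delta, n)
    # Split only when the best attribute beats the runner-up by more than
    # the uncertainty left after n examples.
    return (best_gain - second_gain) > epsilon

# Made-up gains for a two-class problem, checked at two sample sizes.
for n in (100, 1000):
    print(n, should_split(0.3, 0.1, num_classes=2, delta=1e-6, n=n))
```

With these inputs the leaf keeps waiting after 100 examples (epsilon is about 0.26, larger than the 0.2 gap between the two gains) and splits after 1000 examples (epsilon is about 0.08). This is why the approach pays off on large streams: the bound shrinks as more examples arrive.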

VFDT is most effective when learning from very large data sets, with millions or billions of examples, where there is plenty of data for it to make good statistical decisions. For smaller data sets, consider the -batch parameter, which makes it load all data into RAM and learn as a traditional decision tree learner. If you suspect that there is concept drift in the data set, you may prefer to use the cvfdt tool instead.

For a more detailed example of vfdt in action see the mining data streams walkthrough.

vfdt reads its input and writes its output in C4.5 format. It expects to find the files <stem>.names and <stem>.data.
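To make the two-file layout concrete, here is a minimal sketch of reading C4.5-style content. It assumes the common conventions of the format: the first entry of the .names file lists the classes, each later entry is "attribute: values." or "attribute: continuous.", a '|' starts a comment, and each .data row is comma-separated with the class last. The weather data and the function names are hypothetical, and the parser is simplified to one entry per line.

```python
def parse_names(text):
    """Parse simplified C4.5 .names content into (classes, attributes)."""
    entries = []
    for line in text.splitlines():
        line = line.split("|", 1)[0].strip()  # drop comments after '|'
        if line:
            entries.append(line.rstrip("."))  # entries end with a period
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = []
    for entry in entries[1:]:
        name, spec = entry.split(":", 1)
        spec = spec.strip()
        # Continuous attributes keep the marker string; discrete ones get a list.
        values = spec if spec == "continuous" else [v.strip() for v in spec.split(",")]
        attributes.append((name.strip(), values))
    return classes, attributes

def parse_data(text):
    """Parse .data content into rows of stripped string fields, class last."""
    return [[f.strip() for f in line.split(",")]
            for line in text.splitlines() if line.strip()]

# Hypothetical contents of weather.names and weather.data.
NAMES = """\
| weather example
yes, no.
outlook: sunny, overcast, rain.
temperature: continuous.
"""
DATA = "sunny, 85, no\nrain, 70, yes\n"

classes, attributes = parse_names(NAMES)
rows = parse_data(DATA)
print(classes, attributes, rows)
```

Here the stem would be "weather", so the two files sit side by side as weather.names and weather.data.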


Generated for VFML by doxygen