Mining Data Streams

This document describes the methods available for interfacing the learners in VFML with data streams. Some of the learners in VFML are not scalable and cannot easily be used with data streams. To determine which learners are scalable, check the documentation of the individual learners; in general, any learner whose name begins with 'VF' is scalable.

The simplest, and most widely available, interface method is to pipe the training examples to the learner via standard in. The general procedure is as follows. Write a program that connects to the data stream, performs any aggregation needed, produces a .names file describing the data format, optionally produces a testing set in the format that the .names file describes, and prints examples in that same format to standard out indefinitely. Then pipe this output to one of the scalable learners with the -stdin argument. For example, if you output the .names and test files as stream.names and stream.test respectively:

MakeData | vfdt -f stream -stdin -u

would read (and learn from) examples from standard in until there were no more, test the accuracy of the resulting model on the data in stream.test, and report some results.
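As a rough illustration of the procedure above, here is a minimal sketch of what a MakeData-style program might look like in Python. Everything specific in it is an assumption made for the example: the two-attribute domain, the function names, and the random generator standing in for a real stream connection. The .names and example layouts follow the C4.5-style convention, which you should verify against the VFML format documentation before relying on it.

```python
import itertools
import os
import random
import sys

# Hypothetical domain, used only for illustration.
CLASSES = ["no", "yes"]
ATTRIBUTES = {"color": ["red", "green", "blue"], "size": ["small", "large"]}


def names_text():
    """Domain description in a C4.5-style layout (assumed; check the VFML docs)."""
    lines = [", ".join(CLASSES) + "."]
    for name, values in ATTRIBUTES.items():
        lines.append(f"{name}: {', '.join(values)}.")
    return "\n".join(lines) + "\n"


def read_stream(rng):
    """Stand-in for a real stream connection: yields (values, label) forever."""
    while True:
        values = [rng.choice(v) for v in ATTRIBUTES.values()]
        label = "yes" if values[0] == "red" else "no"
        yield values, label


def format_example(values, label):
    """One example per line: attribute values, then the class (assumed layout)."""
    return ",".join(values + [label])


def main(stem="stream", n_test=1000, limit=None, seed=0):
    stream = read_stream(random.Random(seed))
    # Write the domain description and a held-out test set first.
    with open(stem + ".names", "w") as f:
        f.write(names_text())
    with open(stem + ".test", "w") as f:
        for values, label in itertools.islice(stream, n_test):
            f.write(format_example(values, label) + "\n")
    # Then print training examples; limit=None means "indefinitely".
    try:
        for values, label in itertools.islice(stream, limit):
            print(format_example(values, label))
        sys.stdout.flush()
    except BrokenPipeError:
        # The learner closed the pipe; silence the flush at interpreter exit.
        os.dup2(os.open(os.devnull, os.O_WRONLY), sys.stdout.fileno())
```

Saved as a hypothetical makedata.py with a final call to main(), this could take the place of MakeData in the command above. The limit parameter exists only so the sketch can be tried on a finite number of examples.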

We will now demonstrate in more detail how you can do this with the treedata synthetic data generation program and the vfdt learner. Make sure you have VFML installed and that its tools are in your path. You may want to change to a temporary directory. Now run the command:

treedata -f test -discrete 20 -continuous 0 -noise 5 -test 50000 -train 500000 -stdout | vfdt -f test -stdin -initialPause -schedule 1.44

Treedata generates a random decision tree and uses it to create a synthetic data set. The arguments we gave it here mean that the domain will have 20 discrete attributes and 0 continuous ones, that all examples will be corrupted by having 5% of their values (and classes) flipped, that 50,000 examples will be reserved for testing, and that a stream of 500,000 training examples will be written to standard out. Notice the '-f test' argument. This tells treedata to output a description of the domain in a file called 'test.names' and to put the testing data in a file called 'test.test'. The output of this command is piped into the vfdt learner, which will try to learn the tree generated by treedata by looking only at the training data. The '-f' argument to vfdt tells it where to find the domain description and testing data. The '-stdin' flag tells it to read training examples from standard in. '-initialPause' makes vfdt sleep for a couple of seconds before trying to read 'test.names', in case treedata takes a little while to write it out. The '-schedule 1.44' argument makes vfdt periodically test its learned model on the data in 'test.test' (vfdt is able to learn a model incrementally by incorporating data over time). You might also like to add a '-v' argument to vfdt, which will make it print a great deal of debugging output to stdout.
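The exact meaning of the schedule value is not spelled out here, but the sample report lines in this document (10000, 14400, 20736 examples) are consistent with test points that grow geometrically by a factor of 1.44. The sketch below reproduces that pattern. The function name is hypothetical, and the starting point of 10000 is read off the sample output rather than taken from the vfdt documentation.

```python
def schedule_points(first, factor, count):
    """Return `count` test points, each `factor` times the previous one."""
    points = []
    n = float(first)
    for _ in range(count):
        points.append(round(n))
        n *= factor
    return points


print(schedule_points(10000, 1.44, 3))  # [10000, 14400, 20736]
```

Treat this as a reading of the sample output only; check the vfdt documentation for the authoritative definition of '-schedule'.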

During the course of the run vfdt will output a series of numbers that will look like:

>> 10000        30.6580 7       4       0.41    5.78    716
>> 14400        28.0080 17      9       0.69    8.34    1061
>> 20736        28.0080 17      9       0.86    11.99   1432
...

These are reports on its learning. The first number is the number of training examples seen, the second is the error rate on the test set, the third is the number of nodes in the tree it has learned, the fourth is the number of nodes that are currently growing, the fifth is the amount of time vfdt has spent learning, the sixth is the amount of memory it is using, and the seventh is the number of statistical tests that it has made. You might also be interested in examining the 'test.names' and 'test.test' files just to get a sense for what they contain.
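If you want to track these reports programmatically, the seven columns described above can be pulled apart with a small parser. The following is a minimal sketch in Python; the Report type and its field names are hypothetical labels chosen for this example, and it assumes report lines start with '>>' and carry exactly seven whitespace-separated fields, as in the sample output. The units of the time and memory columns are not stated here, so the sketch leaves them as plain numbers.

```python
from typing import NamedTuple, Optional


class Report(NamedTuple):
    examples: int        # training examples seen
    error_rate: float    # error rate on the test set
    nodes: int           # nodes in the learned tree
    growing_nodes: int   # nodes currently growing
    learn_time: float    # time spent learning
    memory: float        # memory in use
    tests: int           # statistical tests made


def parse_report(line: str) -> Optional[Report]:
    """Parse one '>>' report line; return None for any other line."""
    if not line.startswith(">>"):
        return None
    fields = line[2:].split()
    if len(fields) != 7:
        return None
    return Report(
        int(fields[0]), float(fields[1]), int(fields[2]), int(fields[3]),
        float(fields[4]), float(fields[5]), int(fields[6]),
    )


sample = ">> 10000        30.6580 7       4       0.41    5.78    716"
print(parse_report(sample))
```

Returning None for unrecognized lines lets the same function skip over any other output, for instance when '-v' is enabled.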

Another interface method, which is currently available only with the vfdt learner, is to write a program that calls the learning engine directly. This gives you much more fine-grained control of the learning process, lets you access the partially learned model for performance tasks, and lets you continue refining the model indefinitely. See the vfdt-engine.h documentation for more details on this process.