Loading Data

One of the first steps in any machine learning project is getting data into your program in a format it can work with. VFML is set up to read data from files in C4.5 format. There are two abstract data types (ADTs) you will need to learn about to do input: ExampleSpec and Example.

Instances of the ExampleSpec ADT contain a blueprint for the attributes and classes that make up the examples in your learning problem. You can create an ExampleSpec programmatically or by loading a C4.5 names file. Once it is created, you can query it for the information you need to work with the examples in your problem.

Instances of the Example ADT contain the information you will learn from. You create instances programmatically or by loading a C4.5 data or test file. Once an example is created, you can query it for the values of its class and attributes.

Here is a code snippet that shows how easy it is to use VFML to load and work with examples.

#include "uwml.h"
#include <stdio.h>

ExampleSpecPtr es = ExampleSpecRead("test.names");
ExamplePtr e;
FILE *exampleIn = fopen("test.data", "r");

e = ExampleRead(exampleIn, es);
while(e != 0) { /* ExampleRead returns 0 at end of file */
   /* process example */
   ExampleFree(e); /* release the example before reading the next one */
   e = ExampleRead(exampleIn, es);
}

This code loads an ExampleSpec from test.names, then iterates over each example in the file test.data, 'processing' each in turn. A construct similar to this will appear in almost every program you write using VFML (although in some programs it may be hidden in library code).

For a complete listing of the things you can do with Examples and ExampleSpecs, see their interfaces in the reference documentation.

Scan Dataset Example Program

Example for: basic functions of Example and ExampleSpec.

This is a more detailed example that introduces everything you'll need to load a dataset and extract the information required to run a learning algorithm. It includes a made-up data set, a sample makefile, and a program that loads the data set, collects statistics from it, and frees it. The example's files are in the <VFML-root>/examples/scan-dataset/ directory. This document presents the code with a detailed commentary and some suggestions for modifications.

You might like to go to the <VFML-root>/examples/scan-dataset/ directory and get your favorite code/text editor ready.

The Dataset

The dataset used for the scan-dataset example is made up. Each example represents a banana sitting on a kitchen counter. The attributes tell how long each banana has been sitting on the counter and how many black spots it has. From this information, you would like to predict whether the banana is edible or spoiled. Unfortunately, people are always forgetting how long ago they got their bananas, so the attribute containing that information is sometimes unknown.

Look at the test.names file for the C4.5 names description of the dataset. The test.data file contains the examples; notice the '?'s indicating that some of the attribute values are unknown. The first line of the .data file represents a banana that has been on the counter for 1 day, has a few spots, and is still edible.
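To make the '?' convention concrete, here is a minimal, self-contained sketch of how a reader of C4.5 data might tell a known continuous value from an unknown one. ParseContinuousField is a hypothetical helper written for this illustration; it is not part of VFML, and it is not necessarily how ExampleRead is implemented. In a VFML program you would let ExampleRead do the parsing and use the Example interface to test for unknown values.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of VFML: parse one continuous field
 * taken from a line of a C4.5 data file.  Returns 1 and stores the
 * number in *value when the field is known; returns 0 when the field
 * is the unknown marker "?" or is not a valid number. */
int ParseContinuousField(const char *field, double *value) {
   char *end;
   double parsed;

   if(strcmp(field, "?") == 0) {
      return 0; /* unknown value */
   }

   parsed = strtod(field, &end);
   if(end == field || *end != '\0') {
      return 0; /* empty or malformed field */
   }

   *value = parsed;
   return 1;
}
```

The caller checks the return value before touching *value, which mirrors the check-before-use habit that the scan-dataset code follows with the real VFML interface.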

The Makefile

Glance at the makefile; the top couple of lines contain the settings you would need to update to use the file with another project.

The makefile is set up to work as is for the scan-dataset example.  Make sure you've properly installed the VFML library (see the Getting Started section if you haven't done this yet), and changed to the <VFML-root>/examples/scan-dataset/ directory.  Type 'make' to build the example program.  Run it by typing scan-dataset, and look at the output.

The Code

Now let's take a look at the code. Load scan-dataset.c into your editor.

Setup

#include "uwml.h"
#include <stdio.h>

These two include files will appear in just about every project built with VFML. The first includes all the VFML interfaces; the second is needed to work with files, something you will do in most of your VFML projects.

The next couple of lines declare some global variables which we'll use to keep statistics about the data. We used globals to separate this less interesting code from the code that does the real work of the example.

int main(void) {
   ExampleSpecPtr es = ExampleSpecRead("test.names");
   ExamplePtr e;
   FILE *exampleIn = fopen("test.data", "r");

These lines load the example spec, declare an example pointer, and open the example data file. The example spec is very important: it contains a complete description of the dataset, including the attributes, their types and values, and the classes. Your program will query the example spec to determine how to go about working with a particular dataset, what values to expect, and how to iterate over them. You will also need to pass the spec to various VFML interfaces, so it may be convenient to make it a global in your projects.

exampleIn is initialized to contain a file handle to the data which is configured for reading.  The program will read examples from this file, one at a time, until there are no more left to read.

Note that the file names are hard-coded as test.names and test.data. The common prefix, test, is called the filestem. Your own programs will need to accept a command line argument that allows the filestem to be set at runtime.
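One way to take the filestem from the command line is sketched below. Both helpers, FileStemFromArgs and BuildFileName, are hypothetical names introduced for this illustration and are not part of VFML. The sketch assumes the filestem is the first command line argument and falls back to "test" when none is given.

```c
#include <stdio.h>

/* Hypothetical helper, not part of VFML: use the first command line
 * argument as the filestem, falling back to "test" when none is given. */
const char *FileStemFromArgs(int argc, char *argv[]) {
   return (argc > 1) ? argv[1] : "test";
}

/* Hypothetical helper, not part of VFML: write "<stem>.<extension>"
 * into buffer.  Returns 0 on success, or -1 if the name does not fit
 * in size bytes (including the terminating '\0'). */
int BuildFileName(char *buffer, size_t size,
                  const char *stem, const char *extension) {
   int written = snprintf(buffer, size, "%s.%s", stem, extension);

   if(written < 0 || (size_t)written >= size) {
      return -1;
   }
   return 0;
}
```

In main you would call BuildFileName once with "names" and once with "data", then pass the first result to ExampleSpecRead and the second to fopen in place of the hard-coded strings. Checking the return value guards against a filestem that is too long for the buffer.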

Query the Example Spec

The next couple of lines call the ExampleSpec interface to find out some properties of the test dataset. First we find out how many attributes and classes there are.

   printf("There are %d attributes.\n",
               ExampleSpecGetNumAttributes(es));
   printf("There are %d classes.\n",
               ExampleSpecGetNumClasses(es));

Then we figure out some more information about the attributes.  In the example, we hard code the attribute indexes; a real learner would have to be more sophisticated.   Notice that attribute indexing (and all other indexing in VFML) is zero-based, just like C arrays.


   if(ExampleSpecIsAttributeContinuous(es, 0)) {
      printf(" Attribute with index 0 is continuous.\n");
   }

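   if(ExampleSpecIsAttributeDiscrete(es, 1)) {
      /* adjacent string literals are joined by the compiler, so the
         source indentation does not end up in the output */
      printf(" Attribute with index 1 is discrete "
             "and has %d values.\n",
                ExampleSpecGetAttributeValueCount(es, 1));
   }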

Scan the Data

The Scan-Dataset program loads, examines, and frees the examples from the data set one at a time. Most learners will instead need to load the entire dataset into RAM and do some significant processing. Choosing the right data structure for this takes some care. Arrays have quick random access but are inconvenient when you don't know the size of the data set ahead of time. Linked lists are easy to build but are slow to access.
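A growable array is one common compromise between the two: it keeps fast random access and does not need the size of the data set in advance. The sketch below is a hypothetical container, not part of VFML. It stores items as void * so that it compiles without the library; the assumption is that ExamplePtr is a pointer type (the comparison of e with 0 in the code suggests this), so each item could be an ExamplePtr.

```c
#include <stdlib.h>

/* Hypothetical container, not part of VFML.  Items are stored as
 * void * so the sketch compiles on its own; in a VFML program each
 * item would be an ExamplePtr (assumed to be a pointer type). */
typedef struct {
   void **items;     /* the stored pointers */
   size_t count;     /* number of items in use */
   size_t capacity;  /* number of slots allocated */
} ExampleArray;

void ExampleArrayInit(ExampleArray *a) {
   a->items = NULL;
   a->count = 0;
   a->capacity = 0;
}

/* Append one item, doubling the capacity when the array is full.
 * Returns 0 on success, or -1 if memory runs out (array unchanged). */
int ExampleArrayAppend(ExampleArray *a, void *item) {
   if(a->count == a->capacity) {
      size_t newCapacity = (a->capacity == 0) ? 16 : a->capacity * 2;
      void **grown = realloc(a->items, newCapacity * sizeof(*grown));

      if(grown == NULL) {
         return -1;
      }
      a->items = grown;
      a->capacity = newCapacity;
   }
   a->items[a->count++] = item;
   return 0;
}

/* Release the array's own storage.  The caller is responsible for
 * freeing the items themselves first (e.g. with ExampleFree). */
void ExampleArrayFree(ExampleArray *a) {
   free(a->items);
   ExampleArrayInit(a);
}
```

Doubling the capacity keeps the number of reallocations logarithmic in the size of the data set, so appending stays cheap on average even though an individual append occasionally copies the whole array.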

After reading an example, the program tests the values of its attributes and records some statistics. As above, the example program hard-codes the indexes of attributes and values.

   e = ExampleRead(exampleIn, es);
   while(e != 0) { /* ExampleRead returns 0 when EOF */
      /* keep a count of the examples */
      gNumExamples++;

      /* keep a count of how many of them are spoiled */
      if(!ExampleIsClassUnknown(e)) {
         if(ExampleGetClass(e) == 1) {
            gNumSpoiled++;
         }
      }

Scan-dataset always checks each value to make sure it isn't 'unknown' before attempting to use it.  The result of accessing an unknown value is undefined.


      /* keep a sum of the number of days */
      if(!ExampleIsAttributeUnknown(e, 0)) {
         gSumDays +=
              ExampleGetContinuousAttributeValue(e, 0);
      } else {
         gNumDaysUnknown++;
      }
      /* keep a total of the number of bananas
                      that have a few spots */
      if(!ExampleIsAttributeUnknown(e, 1)) {
         if(ExampleGetDiscreteAttributeValue(e, 1) == 1) {
            gNumFewSpots++;
         }
      }

VFML allows you to access any attribute as either continuous or discrete, but accessing with the wrong type will return a garbage value at best.  You should always use the ExampleSpec interface to check an attribute's type before you access it.


      /* now move on to the next example */
      ExampleFree(e);
      e = ExampleRead(exampleIn, es);
   }

Scan-dataset prints out some statistics when it's done scanning the data.
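The reporting code is not reproduced here, but one detail is worth sketching: an average number of days should be taken over the examples whose value was known, not over all examples. MeanOfKnown below is a hypothetical helper, not part of VFML and not necessarily what scan-dataset.c does. The parameter types are assumptions, since the declarations of the global counters are not shown in this document.

```c
/* Hypothetical helper, not part of VFML: average a sum over the
 * known values only.  Returns 0.0 when every value was unknown,
 * which avoids a division by zero. */
double MeanOfKnown(double sum, long numExamples, long numUnknown) {
   long numKnown = numExamples - numUnknown;

   if(numKnown <= 0) {
      return 0.0;
   }
   return sum / (double)numKnown;
}
```

With the counters kept by the scanning loop, a call such as MeanOfKnown(gSumDays, gNumExamples, gNumDaysUnknown) would give the mean time on the counter for the bananas whose age was recorded.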