
ExampleGroupStats.h File Reference


Detailed Description

Sufficient statistics for Entropy and Gini.

Tracks and maintains the sufficient statistics needed to calculate Entropy and Gini of discrete and continuous attributes, as well as make some queries about the probability of events in the data.


Data Structures

struct  _ExampleGroupStats_
 Sufficient statistics for Entropy and Gini.


Typedefs

typedef _ExampleGroupStats_ ExampleGroupStats
 Sufficient statistics for Entropy and Gini.

typedef _ExampleGroupStats_ * ExampleGroupStatsPtr
 Sufficient statistics for Entropy and Gini.


Functions

ExampleGroupStatsPtr ExampleGroupStatsNew (ExampleSpecPtr es, AttributeTrackerPtr at)
 Creates a structure to track sufficient statistics.

void ExampleGroupStatsFree (ExampleGroupStatsPtr egs)
 Frees all the memory that was being used by the structure.

void ExampleGroupStatsDeactivate (ExampleGroupStatsPtr egs)
 Temporarily frees the memory being used to hold statistics.

void ExampleGroupStatsReactivate (ExampleGroupStatsPtr egs)
 Reallocates the memory that is freed by a call to ExampleGroupStatsDeactivate.

void ExampleGroupStatsAddExample (ExampleGroupStatsPtr egs, ExamplePtr e)
 Adds the information from the example to the statistics structure.

void ExampleGroupStatsWrite (ExampleGroupStatsPtr egs, FILE *out)
 A debugging function that prints a representation of the stats structure to specified file.

long ExampleGroupStatsNumExamplesSeen (ExampleGroupStatsPtr egs)
 Number of examples being tracked by the structure.

AttributeTrackerPtr ExampleGroupStatsGetAttributeTracker (ExampleGroupStatsPtr egs)
 Returns the attribute tracker associated with the structure.

int ExampleGroupStatsIsAttributeActive (ExampleGroupStatsPtr egs, int num)
 Tests if the attribute is active.

void ExampleGroupStatsIgnoreAttribute (ExampleGroupStatsPtr egs, int num)
 Frees the memory being used by the attribute and stops tracking it.

int ExampleGroupStatsGetMostCommonClassLaplace (ExampleGroupStatsPtr egs, int addClass, int addCount)
 Returns the index of the most common class, but adds addCount samples to addClass.

int ExampleGroupStatsGetMostCommonClass (ExampleGroupStatsPtr egs)
 Returns the index of the most common class.

long ExampleGroupStatsGetMostCommonClassCount (ExampleGroupStatsPtr egs)
 Returns the number of examples with the most common class that were seen by the structure.

int ExampleGroupStatsGetMostCommonClassForAttVal (ExampleGroupStatsPtr egs, int att, int val)
 Returns the most common class among examples where the specified attribute has the specified value.

int ExampleGroupStatsIsPure (ExampleGroupStatsPtr egs)
 Returns 1 if all the examples shown to the structure have the same class.

float ExampleGroupStatsGetValuePercent (ExampleGroupStatsPtr egs, int attNum, int valNum)
 Returns the fraction of examples that have the specified value for the specified attribute.

double ExampleGroupStatsGetValueGivenClassMEstimate (ExampleGroupStatsPtr egs, int attNum, int valNum, int classNum)
 Returns P(att = value | class).

float ExampleGroupStatsGetClassPercent (ExampleGroupStatsPtr egs, int classNum)
 Returns P(class).

float ExampleGroupStatsGetPercentBelowThreshold (ExampleGroupStatsPtr egs, int attNum, float thresh)
 Returns the fraction of examples with a value below the specified threshold.

double ExampleGroupStatsGetValueGivenClassMEstimateLogP (ExampleGroupStatsPtr egs, int attNum, int valNum, int classNum)
 Returns the log of a smoothed P(att = value | class).

double ExampleGroupStatsGetClassLogP (ExampleGroupStatsPtr egs, int classNum)
 Returns the log of the fraction of examples that have the specified class.

float ExampleGroupStatsEntropyTotal (ExampleGroupStatsPtr egs)
 Returns the entropy of the class attribute of all examples seen so far.

float ExampleGroupStatsEntropyDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum)
 Returns the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute.

float ExampleGroupStatsEntropyPlusDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float delta)
 Returns an upper bound on the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute.

float ExampleGroupStatsEntropyMinusDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float delta)
 Returns a lower bound on the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute.

void ExampleGroupStatsEntropyContinuousAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float *firstIndex, float *firstThresh, float *secondIndex, float *secondThresh)
 Finds the entropy of the best split thresholds.

float ExampleGroupStatsGiniTotal (ExampleGroupStatsPtr egs)
 Returns the gini index of the class attribute of all examples seen so far.

float ExampleGroupStatsGiniDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum)
 Returns the weighted gini of the class attribute after partitioning the data by the values of the specified attribute.

void ExampleGroupStatsGiniContinuousAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float *firstIndex, float *firstThresh, float *secondIndex, float *secondThresh)
 Finds the Gini index of the best split thresholds.

void ExampleGroupStatsIgnoreSplitsWorseThanEntropy (ExampleGroupStatsPtr egs, int attNum, float entropyThresh)
 Stop monitoring some thresholds.

void ExampleGroupStatsIgnoreSplitsWorseThanGini (ExampleGroupStatsPtr egs, int attNum, float giniThresh)
 Stop monitoring some thresholds.

int ExampleGroupStatsLimitSplitsEntropy (ExampleGroupStatsPtr egs, int attNum, int maxSplits, int pruneDownTo)
 Reduce the number of thresholds being considered if above the max.

void ExampleGroupStatsStopAddingSplits (ExampleGroupStatsPtr egs, int attNum)
 Stop adding new split thresholds, but continue to use future examples to evaluate the existing ones.

int ExampleGroupStatsNumSplitThresholds (ExampleGroupStatsPtr egs, int attNum)
 Returns the number of thresholds that are being monitored for the specified attribute.

int ExampleGroupStatsGetMostCommonClassAboveThreshold (ExampleGroupStatsPtr egs, int attNum, float threshold)
 Returns the most common class above the specified value.

int ExampleGroupStatsGetMostCommonClassBelowThreshold (ExampleGroupStatsPtr egs, int attNum, float threshold)
 Returns the most common class below the specified value.


Typedef Documentation

typedef struct _ExampleGroupStats_ ExampleGroupStats
 

Sufficient statistics for Entropy and Gini.

typedef struct _ExampleGroupStats_ * ExampleGroupStatsPtr
 

Sufficient statistics for Entropy and Gini.


Function Documentation

void ExampleGroupStatsAddExample (ExampleGroupStatsPtr egs, ExamplePtr e)
 

Adds the information from the example to the statistics structure.

void ExampleGroupStatsDeactivate (ExampleGroupStatsPtr egs)
 

Temporarily frees the memory being used to hold statistics.

Does not free the whole structure. A later call to ExampleGroupStatsReactivate will restore the memory (but not the counts that used to be there). This is a convenient way to focus RAM usage (and learning) on one part of the instance space while keeping the bookkeeping around to quickly resume learning in another. You shouldn't try to add examples to a deactivated structure.

void ExampleGroupStatsEntropyContinuousAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float *firstIndex, float *firstThresh, float *secondIndex, float *secondThresh)
 

Finds the entropy of the best split thresholds.

Calculates the entropy of splitting the specified attribute by every threshold under consideration (values are sorted and then a threshold is considered between each pair of adjacent values that have different class). The remaining arguments return the entropy of the best and second best thresholds, along with the thresholds themselves.

This function adds an MDL penalty similar to the one Quinlan uses in C4.5.

Should only be called for continuous attributes.

float ExampleGroupStatsEntropyDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum)
 

Returns the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute.

Should only be called for discrete attributes.

float ExampleGroupStatsEntropyMinusDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float delta)
 

Returns a lower bound on the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute.

This uses the Hoeffding bound and the empirical probabilities to return a value that is lower than the true entropy with probability 1 - delta.

Should only be called for discrete attributes.

float ExampleGroupStatsEntropyPlusDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float delta)
 

Returns an upper bound on the weighted entropy of the class attribute after partitioning the data by the values of the specified attribute.

This uses the Hoeffding bound and the empirical probabilities to return a value that is higher than the true entropy with probability 1 - delta.

Should only be called for discrete attributes.

float ExampleGroupStatsEntropyTotal (ExampleGroupStatsPtr egs)
 

Returns the entropy of the class attribute of all examples seen so far.

void ExampleGroupStatsFree (ExampleGroupStatsPtr egs)
 

Frees all the memory that was being used by the structure.

AttributeTrackerPtr ExampleGroupStatsGetAttributeTracker (ExampleGroupStatsPtr egs)
 

Returns the attribute tracker associated with the structure.

double ExampleGroupStatsGetClassLogP (ExampleGroupStatsPtr egs, int classNum)
 

Returns the log of the fraction of examples that have the specified class.

float ExampleGroupStatsGetClassPercent (ExampleGroupStatsPtr egs, int classNum)
 

Returns P(class).

int ExampleGroupStatsGetMostCommonClass (ExampleGroupStatsPtr egs)
 

Returns the index of the most common class.

int ExampleGroupStatsGetMostCommonClassAboveThreshold (ExampleGroupStatsPtr egs, int attNum, float threshold)
 

Returns the most common class above the specified value.

Should only be called for continuous attributes.

int ExampleGroupStatsGetMostCommonClassBelowThreshold (ExampleGroupStatsPtr egs, int attNum, float threshold)
 

Returns the most common class below the specified value.

Should only be called for continuous attributes.

long ExampleGroupStatsGetMostCommonClassCount (ExampleGroupStatsPtr egs)
 

Returns the number of examples with the most common class that were seen by the structure.

int ExampleGroupStatsGetMostCommonClassForAttVal (ExampleGroupStatsPtr egs, int att, int val)
 

Returns the most common class among examples where the specified attribute has the specified value.

Should only be called for discrete attributes.

int ExampleGroupStatsGetMostCommonClassLaplace (ExampleGroupStatsPtr egs, int addClass, int addCount)
 

Returns the index of the most common class, but adds addCount samples to addClass.

Use addClass of -1 for no addition (or just call ExampleGroupStatsGetMostCommonClass). This adding allows you to, for example, smooth the class towards the parent class during decision tree induction.
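The effect of addClass and addCount can be shown on a plain array of class counts. This is a sketch only: the function name most_common_class_laplace is hypothetical, and breaking ties in favor of the lowest class index is an assumption, since the tie-breaking rule is not documented here.

```c
#include <assert.h>

/* Returns the index of the largest count after adding addCount to
 * counts[addClass]. An addClass of -1 adds nothing. Ties go to the
 * lowest index (an assumption of this sketch). */
int most_common_class_laplace(const long *counts, int numClasses,
                              int addClass, int addCount) {
    int best = 0;
    long bestCount = -1;
    int c;

    assert(numClasses > 0 && addClass >= -1 && addClass < numClasses);
    for (c = 0; c < numClasses; c++) {
        long n = counts[c];
        if (c == addClass) n += addCount;
        if (n > bestCount) {
            best = c;
            bestCount = n;
        }
    }
    return best;
}
```

With counts of 5, 4, and 3, adding 2 samples to class 1 makes it the most common class, whereas an addClass of -1 leaves class 0 as the answer.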

float ExampleGroupStatsGetPercentBelowThreshold (ExampleGroupStatsPtr egs, int attNum, float thresh)
 

Returns the fraction of examples with a value below the specified threshold.

Should only be called for continuous attributes.

double ExampleGroupStatsGetValueGivenClassMEstimate (ExampleGroupStatsPtr egs, int attNum, int valNum, int classNum)
 

Returns P(att = value | class).

Returns the fraction of examples among those that have the specified class that have the specified value for the specified attribute, but smooths the return value by adding a small amount (that decreases with the number of samples seen) to each class count first.

Should only be called for discrete attributes.
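The exact smoothing amount VFML uses is not given here. As a point of comparison, the textbook m-estimate with a uniform prior over the attribute's values is (n_vc + m / numValues) / (n_c + m), where n_vc counts examples with both the value and the class and n_c counts examples with the class. The sketch below implements that textbook form; the name m_estimate and the parameter m are assumptions, not the library's API.

```c
#include <assert.h>

/* Textbook m-estimate of P(att = value | class) with a uniform prior of
 * 1 / numValues. m is the weight given to the prior, in units of
 * "virtual examples". Hypothetical helper, not part of VFML. */
double m_estimate(long countValueAndClass, long countClass,
                  int numValues, double m) {
    assert(numValues > 0 && m > 0.0);
    assert(countValueAndClass >= 0 && countValueAndClass <= countClass);
    return ((double)countValueAndClass + m / (double)numValues) /
           ((double)countClass + m);
}
```

With no examples the estimate equals the prior, and as the counts grow the smoothing term matters less and the estimate approaches the raw fraction.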

double ExampleGroupStatsGetValueGivenClassMEstimateLogP (ExampleGroupStatsPtr egs, int attNum, int valNum, int classNum)
 

Returns the log of a smoothed P(att = value | class).

Returns the log of the fraction of examples among those that have the specified class that have the specified value for the specified attribute, but smooths the return value by adding a small amount (that decreases with the number of samples seen) to each class count first.

Should only be called for discrete attributes.

float ExampleGroupStatsGetValuePercent (ExampleGroupStatsPtr egs, int attNum, int valNum)
 

Returns the fraction of examples that have the specified value for the specified attribute.

Should only be called for discrete attributes.

void ExampleGroupStatsGiniContinuousAttributeSplit (ExampleGroupStatsPtr egs, int attNum, float *firstIndex, float *firstThresh, float *secondIndex, float *secondThresh)
 

Finds the Gini index of the best split thresholds.

Calculates the gini of splitting the specified attribute by every threshold under consideration (values are sorted and then a threshold is considered between each pair of adjacent values that have different class). The remaining arguments return the gini of the best and second best thresholds, along with the thresholds themselves.

Should only be called for continuous attributes.

float ExampleGroupStatsGiniDiscreteAttributeSplit (ExampleGroupStatsPtr egs, int attNum)
 

Returns the weighted gini of the class attribute after partitioning the data by the values of the specified attribute.

Should only be called for discrete attributes.

float ExampleGroupStatsGiniTotal (ExampleGroupStatsPtr egs)
 

Returns the gini index of the class attribute of all examples seen so far.
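As with entropy, the gini index depends only on the per-class counts. A minimal sketch using the common definition 1 - sum(p_c^2); the helper name class_gini is hypothetical, and whether VFML scales the value differently is not stated here.

```c
#include <assert.h>

/* Gini index of the class distribution described by
 * counts[0..numClasses-1], defined as 1 minus the sum of squared
 * class probabilities. Hypothetical helper, not part of VFML. */
double class_gini(const long *counts, int numClasses) {
    long total = 0;
    double sumSquares = 0.0;
    int i;

    assert(numClasses > 0);
    for (i = 0; i < numClasses; i++) total += counts[i];
    if (total == 0) return 0.0; /* no examples seen yet */

    for (i = 0; i < numClasses; i++) {
        double p = (double)counts[i] / (double)total;
        sumSquares += p * p;
    }
    return 1.0 - sumSquares;
}
```

Two equally frequent classes give 0.5, the maximum for two classes, and a pure set gives 0.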

void ExampleGroupStatsIgnoreAttribute (ExampleGroupStatsPtr egs, int num)
 

Frees the memory being used by the attribute and stops tracking it.

This is useful if you decide that some attribute will not be used (perhaps using some statistical tests) and would like to use the memory elsewhere.

void ExampleGroupStatsIgnoreSplitsWorseThanEntropy (ExampleGroupStatsPtr egs, int attNum, float entropyThresh)
 

Stop monitoring some thresholds.

Stop monitoring every threshold with an entropy worse than the specified value. This frees some memory, but adding future values to the egs may require some interpolation to estimate the position of the new value in the array of all values for the attribute (and so this introduces some error into future calls for the Entropy or Gini of the attribute).

Should only be called for continuous attributes.

void ExampleGroupStatsIgnoreSplitsWorseThanGini (ExampleGroupStatsPtr egs, int attNum, float giniThresh)
 

Stop monitoring some thresholds.

Stop monitoring every threshold with a Gini worse than the specified value. This frees some memory, but adding future values to the egs may require some interpolation to estimate the position of the new value in the array of all values for the attribute (and so this introduces some error into future calls for the Entropy or Gini of the attribute).

Should only be called for continuous attributes.

int ExampleGroupStatsIsAttributeActive (ExampleGroupStatsPtr egs, int num)
 

Tests if the attribute is active.

Returns 1 if the attribute was active in the initial attribute tracker and has not been ignored by a call to ExampleGroupStatsIgnoreAttribute since then.

int ExampleGroupStatsIsPure (ExampleGroupStatsPtr egs)
 

Returns 1 if all the examples shown to the structure have the same class.

int ExampleGroupStatsLimitSplitsEntropy (ExampleGroupStatsPtr egs, int attNum, int maxSplits, int pruneDownTo)
 

Reduce the number of thresholds being considered if above the max.

If the attribute is monitoring more than 'maxSplits' split thresholds this function will find the best 'pruneDownTo' based on entropy and start ignoring all the rest. This frees some memory, but adding future values to the egs may require some interpolation to estimate the position of the new value in the array of all values for the attribute (and so this introduces some error into future calls for the Entropy or Gini of the attribute).

Returns the number of thresholds that were pruned.

Should only be called for continuous attributes.
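The relationship between maxSplits, pruneDownTo, and the return value can be sketched on a plain array of candidate thresholds. This illustrates the pruning rule only: the Split type and the limit_splits function are hypothetical, the sketch does not model the interpolation of later values, and VFML's internal representation may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* A candidate threshold and the entropy of splitting there
 * (hypothetical type for this sketch). */
typedef struct {
    float threshold;
    float entropy;
} Split;

static int by_entropy(const void *a, const void *b) {
    float x = ((const Split *)a)->entropy;
    float y = ((const Split *)b)->entropy;
    return (x > y) - (x < y);
}

static int by_threshold(const void *a, const void *b) {
    float x = ((const Split *)a)->threshold;
    float y = ((const Split *)b)->threshold;
    return (x > y) - (x < y);
}

/* If more than maxSplits thresholds are present, keeps the pruneDownTo
 * with the lowest entropy (restored to threshold order) and updates
 * *numSplits. Returns the number of thresholds that were pruned. */
int limit_splits(Split *splits, int *numSplits, int maxSplits,
                 int pruneDownTo) {
    int pruned;

    assert(pruneDownTo >= 0 && pruneDownTo <= maxSplits);
    if (*numSplits <= maxSplits) return 0; /* under the limit: no change */

    qsort(splits, (size_t)*numSplits, sizeof(Split), by_entropy);
    pruned = *numSplits - pruneDownTo;
    *numSplits = pruneDownTo;
    qsort(splits, (size_t)*numSplits, sizeof(Split), by_threshold);
    return pruned;
}
```

Setting pruneDownTo below maxSplits leaves headroom, so the pruning step does not have to run again as soon as one more threshold is added.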

ExampleGroupStatsPtr ExampleGroupStatsNew (ExampleSpecPtr es, AttributeTrackerPtr at)
 

Creates a structure to track sufficient statistics.

Creates a structure to track the statistics needed to calculate several common machine learning metrics for the attributes that are active in the AttributeTracker. This function takes over the memory for the AttributeTracker and will free it when ExampleGroupStatsFree is called. For categorical attributes this uses memory proportional to the number of classes * the number of values of the attribute. For continuous attributes this uses constant memory at first, but as examples are added with ExampleGroupStatsAddExample the memory grows proportionally with the number of unique values of the attribute.

long ExampleGroupStatsNumExamplesSeen (ExampleGroupStatsPtr egs)
 

Number of examples being tracked by the structure.

Outputs the number of examples added to the structure with ExampleGroupStatsAddExample since the last call to ExampleGroupStatsReactivate.

int ExampleGroupStatsNumSplitThresholds (ExampleGroupStatsPtr egs, int attNum)
 

Returns the number of thresholds that are being monitored for the specified attribute.

Should only be called for continuous attributes.

void ExampleGroupStatsReactivate (ExampleGroupStatsPtr egs)
 

Reallocates the memory that is freed by a call to ExampleGroupStatsDeactivate.

void ExampleGroupStatsStopAddingSplits (ExampleGroupStatsPtr egs, int attNum)
 

Stop adding new split thresholds, but continue to use future examples to evaluate the existing ones.

Should only be called for continuous attributes.

void ExampleGroupStatsWrite (ExampleGroupStatsPtr egs, FILE *out)
 

A debugging function that prints a representation of the stats structure to specified file.


Generated for VFML by doxygen.