Data Overfitting
Overfitting, definition
- Given a set of trees T, a tree t ∈ T is said to overfit the training data if there is some alternative tree t', such that t has higher accuracy than t' over the training examples, but t' has higher accuracy than t over the entire set of examples
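The definition can be made concrete with a toy numeric example (hypothetical data, not from the source). Each example is ((a, b), label); the true rule is label = a, feature b is irrelevant, and one training label is corrupted by noise. A deep tree t memorizes the noise; a shorter tree t' ignores it.

```python
def accuracy(h, data):
    """Fraction of examples in `data` that hypothesis `h` labels correctly."""
    return sum(h(x) == y for x, y in data) / len(data)

# Training sample: the last example is mislabeled noise (true label is 1).
train = [((0, 0), 0), ((1, 0), 1), ((1, 1), 0)]
# A clean sample standing in for "the entire set of examples".
full = [((0, 0), 0), ((1, 0), 1), ((1, 1), 1), ((0, 1), 0)]

def t(x):
    """Deep tree: adds an extra split on b to memorize the noisy example."""
    a, b = x
    return 0 if (a == 1 and b == 1) else a

def t_prime(x):
    """Shorter tree: predicts the true rule, label = a."""
    a, b = x
    return a

print(accuracy(t, train), accuracy(t_prime, train))  # 1.0 vs ~0.67: t wins on training
print(accuracy(t, full), accuracy(t_prime, full))    # 0.75 vs 1.0: t' wins overall
```

So t fits the training set perfectly yet overfits: t' is less accurate on the training examples but more accurate over the whole set, exactly matching the definition above.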
The decision not to stop until attributes or examples are exhausted is somewhat arbitrary
- you could always stop early and take the majority label at a node, and the tree would be shorter as a result!
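A minimal sketch of the "stop and take the majority decision" option (the function name is an assumption, not from the source): at an impure node, predict the most common label of the examples that reach it.

```python
from collections import Counter

def majority_label(labels):
    """Most common label among the examples reaching a node.

    Stopping here and predicting this label accepts a few training errors
    in exchange for a shorter tree.
    """
    return Counter(labels).most_common(1)[0][0]

# An impure node whose examples carry labels [1, 1, 0]: stopping early
# predicts 1, misclassifying one training example but avoiding further splits.
print(majority_label([1, 1, 0]))  # 1
```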
The standard stopping rule provides 100% accuracy on the training set, but not necessarily on the test set
- if there is noise in the training data
- if the training data is too small to give good coverage, chance regularities are likely to look like real ones (spurious correlation)
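The spurious-correlation case can be sketched as follows (hypothetical data): feature b is irrelevant to the true rule label = a, but in a tiny training sample it happens to match the label exactly, so a tree that splits on b looks perfect during training.

```python
def accuracy(h, data):
    """Fraction of examples in `data` that hypothesis `h` labels correctly."""
    return sum(h(x) == y for x, y in data) / len(data)

# Tiny training sample: b coincidentally equals the label on both examples.
train = [((0, 0), 0), ((1, 1), 1)]
# A larger clean sample where the coincidence breaks down.
full = [((0, 0), 0), ((1, 1), 1), ((1, 0), 1), ((0, 1), 0)]

split_on_b = lambda x: x[1]  # tree that latched onto the chance correlation
split_on_a = lambda x: x[0]  # tree that found the real rule

print(accuracy(split_on_b, train))  # 1.0: looks perfect on the small sample
print(accuracy(split_on_b, full))   # 0.5: no better than chance overall
print(accuracy(split_on_a, full))   # 1.0
```

With only two training examples, the learner has no way to tell the real feature from the coincidental one; more coverage would expose the difference.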