Overfitting (continued)
How to avoid overfitting
- stop growing the tree before it perfectly classifies the training data
- allow overfitting, but post-prune the tree
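The first option, stopping early, can be sketched as a test applied at each node before splitting it. This is a minimal illustration, not a prescribed criterion: the function names and the `max_depth` and `min_samples` thresholds are hypothetical and would need tuning in practice.

```python
from collections import Counter

def should_stop(labels, depth, max_depth=3, min_samples=5):
    """Return True if the node should become a leaf instead of being split.

    max_depth and min_samples are hypothetical thresholds for illustration.
    """
    if len(set(labels)) <= 1:      # node is already pure (or empty)
        return True
    if depth >= max_depth:         # tree is as deep as we allow
        return True
    if len(labels) < min_samples:  # too few examples to justify a split
        return True
    return False

def majority_label(labels):
    """Label given to a leaf: the most common class among its examples."""
    return Counter(labels).most_common(1)[0][0]
```

A tree builder would call `should_stop` before each split and, when it returns True, create a leaf labelled with `majority_label`. The second option instead grows the full tree and removes subtrees afterwards.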
Training and validation sets
- training set is used to form the learned hypothesis
- validation set is used to estimate the accuracy of that hypothesis over subsequent (unseen) data, and to evaluate the impact of pruning
- justification: the validation set is unlikely to exhibit the same random noise and spurious correlations as the training set
- rule of thumb: 2/3 to the training set, 1/3 to the validation set
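The split and its use in judging a pruning step can be sketched as follows. This is an illustrative outline under stated assumptions: examples are `(features, label)` pairs, a hypothesis is any callable that maps features to a label, and the helper names (`train_validation_split`, `accuracy`, `keep_pruned`) are made up for this sketch.

```python
import random

def train_validation_split(examples, train_fraction=2/3, seed=0):
    """Shuffle a copy of the examples and split it into (training, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, examples):
    """Fraction of (features, label) pairs that predict classifies correctly.

    Assumes examples is non-empty.
    """
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)

def keep_pruned(full_predict, pruned_predict, validation):
    """Accept a pruning step if it does not lower validation accuracy."""
    return accuracy(pruned_predict, validation) >= accuracy(full_predict, validation)
```

With 30 examples, `train_validation_split` returns 20 for training and 10 for validation. Accepting a pruned tree whenever validation accuracy does not drop is one common choice; other acceptance criteria are possible.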