Overfitting (continued)
How to avoid overfitting
- stop growing the tree before it perfectly classifies the training data
- allow overfitting, but post-prune the tree
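The first strategy, stopping growth early, can be sketched as follows. This is a minimal illustration and not a standard algorithm implementation: the names `grow`, `split_errors`, `count_leaves` and the `min_samples` threshold are hypothetical, the toy data is made up, and the split criterion is plain misclassification count, not information gain.

```python
from collections import Counter

def majority(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def split_errors(rows, labels, feature):
    """Training examples misclassified if each branch predicted its majority label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())

def grow(rows, labels, features, min_samples=1):
    """Grow a tree over categorical features; a leaf is a bare label."""
    # Stop when the node is pure, no features remain, or (early stopping)
    # the node holds fewer than min_samples examples.
    if len(set(labels)) == 1 or not features or len(rows) < min_samples:
        return majority(labels)
    best = min(features, key=lambda f: split_errors(rows, labels, f))
    rest = [f for f in features if f != best]
    children = {}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        children[value] = grow([r for r, _ in subset], [y for _, y in subset],
                               rest, min_samples)
    return {"feature": best, "children": children}

def count_leaves(tree):
    if not isinstance(tree, dict):
        return 1
    return sum(count_leaves(child) for child in tree["children"].values())

# Made-up toy data; the third example is deliberately noisy.
rows = [
    {"windy": "no", "day": "mon"}, {"windy": "no", "day": "tue"},
    {"windy": "no", "day": "wed"}, {"windy": "yes", "day": "mon"},
    {"windy": "yes", "day": "tue"}, {"windy": "yes", "day": "wed"},
]
labels = ["play", "play", "stay", "stay", "stay", "stay"]

full_tree = grow(rows, labels, ["windy", "day"])                    # fits the noise
stopped_tree = grow(rows, labels, ["windy", "day"], min_samples=4)  # stops early
print(count_leaves(full_tree), count_leaves(stopped_tree))  # 4 2
```

With `min_samples=1` the tree keeps splitting on `day` until the noisy example is classified perfectly. With `min_samples=4` the three-example branch becomes a single majority leaf.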
Training and validation sets
- training set is used to build the tree
- a separate validation set is used to estimate accuracy on subsequent (unseen) data and to evaluate the impact of pruning
- validation set is unlikely to exhibit the same random noise and spurious correlations as the training set
- rule of thumb: use 2/3 of the available examples for the training set and 1/3 for the validation set
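A minimal sketch of how the split and the validation-based pruning check might look. Everything here is an assumption for illustration: the helper names (`split_examples`, `predict`, `errors`, `prune`), the nested-dict tree layout with a stored `majority` label per node, and the hand-written tree and validation examples. The pruning rule shown is reduced-error pruning: replace a subtree with a leaf whenever that does not increase the error on the validation examples reaching it.

```python
import random

def split_examples(examples, train_fraction=2/3, seed=0):
    """Shuffle, then give train_fraction of the examples to the training set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def predict(tree, row):
    """Follow branches until a leaf (a bare label) is reached."""
    while isinstance(tree, dict):
        # Unseen feature values fall back to the node's majority label.
        tree = tree["children"].get(row[tree["feature"]], tree["majority"])
    return tree

def errors(tree, examples):
    return sum(predict(tree, row) != label for row, label in examples)

def prune(tree, validation):
    """Bottom-up: prune the children first, then try collapsing this node."""
    if not isinstance(tree, dict):
        return tree
    feature = tree["feature"]
    pruned = dict(tree, children={
        value: prune(child, [(r, y) for r, y in validation if r[feature] == value])
        for value, child in tree["children"].items()
    })
    leaf = tree["majority"]
    # Keep the leaf if it does no worse than the subtree on the validation set.
    return leaf if errors(leaf, validation) <= errors(pruned, validation) else pruned

# The 2/3 to 1/3 rule of thumb on 30 placeholder examples.
train, held_out = split_examples(range(30))
print(len(train), len(held_out))  # 20 10

# Hand-written overfit tree: the "day" subtree memorized one noisy example.
overfit = {"feature": "windy", "majority": "stay", "children": {
    "no": {"feature": "day", "majority": "play",
           "children": {"mon": "play", "tue": "play", "wed": "stay"}},
    "yes": "stay",
}}
validation = [
    ({"windy": "no", "day": "wed"}, "play"),
    ({"windy": "no", "day": "mon"}, "play"),
    ({"windy": "yes", "day": "tue"}, "stay"),
]
pruned = prune(overfit, validation)
print(errors(overfit, validation), errors(pruned, validation))  # 1 0
```

In this toy case the `day` subtree is collapsed into a `play` leaf because the leaf makes fewer validation errors, while the root split on `windy` is kept because collapsing it would add errors.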