Why splitting the data into training and test sets is not enough

I know that in order to assess the performance of the classifier I have to split the data into training and test sets. But reading this:

When evaluating different settings (“hyperparameters”) for estimators,
such as the C setting that must be manually set for an SVM, there is
still a risk of overfitting on the test set because the parameters can
be tweaked until the estimator performs optimally. This way, knowledge
about the test set can “leak” into the model and evaluation metrics no
longer report on generalization performance. To solve this problem,
yet another part of the dataset can be held out as a so-called
“validation set”: training proceeds on the training set, after which
evaluation is done on the validation set, and when the experiment
seems to be successful, final evaluation can be done on the test set.

I see that another (third) validation set is introduced, justified by the possibility of overfitting to the test set during hyperparameter tuning.

The problem is that I cannot understand how this overfitting can arise, and therefore cannot understand the justification for the third set.

Answer

Even though you are training models exclusively on the training data, you are optimizing hyperparameters (e.g. C for an SVM) based on the test set. As such, your estimate of performance can be optimistic, because you are essentially reporting best-case results. As some on this site have already mentioned, optimization is the root of all evil in statistics.
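To make the optimism concrete, here is a minimal sketch using scikit-learn. The dataset (pure noise, so true accuracy is about 50%) and the grid of C values are illustrative assumptions, not anything from the original answer. Because the "best" score is the maximum over several test-set evaluations, it tends to exceed the score on genuinely unseen data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Pure-noise data: labels are independent of the features,
# so any classifier's true accuracy is roughly 50%.
X = rng.randn(600, 50)
y = rng.randint(0, 2, 600)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_test, X_fresh, y_test, y_fresh = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# "Tune" C by picking whichever value scores best on the test set.
best_C = max([0.01, 0.1, 1, 10, 100],
             key=lambda C: SVC(C=C).fit(X_train, y_train).score(X_test, y_test))
model = SVC(C=best_C).fit(X_train, y_train)

print("score on the test set used for tuning:", model.score(X_test, y_test))
print("score on genuinely unseen data:       ", model.score(X_fresh, y_fresh))
# The first number is a maximum over several test-set evaluations,
# so on average it is an optimistic, best-case estimate.
```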

Performance estimates should always be done on completely independent data. If you are optimizing some aspect based on test data, then your test data is no longer independent and you would need a validation set.
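A common way to arrange this is a three-way split: tune on a validation set, then touch the test set exactly once at the end. The sketch below assumes a synthetic dataset and the same illustrative C grid as above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Carve off the final test set first, then split the remainder into
# training and validation sets (roughly 60/20/20 overall).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune C against the validation set only; the test set stays untouched.
best_C = max([0.01, 0.1, 1, 10, 100],
             key=lambda C: SVC(C=C).fit(X_train, y_train).score(X_val, y_val))

# Retrain on train+validation, then evaluate on the test set exactly once.
final_model = SVC(C=best_C).fit(X_trainval, y_trainval)
print("independent test estimate:", final_model.score(X_test, y_test))
```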

Another way to deal with this is via nested cross-validation, which consists of two cross-validation procedures wrapped around each other. The inner cross-validation is used in tuning (to estimate the performance of a given set of hyperparameters, which is optimized) and the outer cross-validation estimates the generalization performance of the entire machine learning pipeline (i.e., optimizing hyperparameters + training the final model).
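In scikit-learn this can be expressed by wrapping a GridSearchCV estimator (the inner loop) inside cross_val_score (the outer loop); the dataset and parameter grid below are again illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop: GridSearchCV tunes C with 5-fold CV on whatever data it is given.
inner = GridSearchCV(SVC(), param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)

# Outer loop: each fold re-runs the entire tuning procedure on the other
# folds and scores the resulting model on data the tuner never saw.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV estimate of generalization:", outer_scores.mean())
```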

Attribution
Source: Link, Question Author: Salvador Dali, Answer Author: Community
