# Why splitting the data into the training and testing set is not enough

I know that in order to assess the performance of the classifier I have to split the data into a training and a test set. But then I read this:

When evaluating different settings (“hyperparameters”) for estimators,
such as the C setting that must be manually set for an SVM, there is
still a risk of overfitting on the test set because the parameters can
be tweaked until the estimator performs optimally. This way, knowledge
about the test set can “leak” into the model and evaluation metrics no
longer report on generalization performance. To solve this problem,
yet another part of the dataset can be held out as a so-called
“validation set”: training proceeds on the training set, after which
evaluation is done on the validation set, and when the experiment
seems to be successful, final evaluation can be done on the test set.
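The workflow described in the quote can be sketched as follows. This is an illustrative example, not taken from the scikit-learn docs: the dataset is synthetic and the grid of `C` values is arbitrary. The point is that only the validation set is used for tuning, and the test set is touched exactly once at the end.

```python
# Sketch of a train/validation/test workflow for tuning C of an SVM.
# Dataset and C grid are illustrative assumptions, not from the source.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = (X[:, 0] + 0.5 * rng.randn(300) > 0).astype(int)

# First hold out a final test set, then split the rest into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Tune C using the validation set only.
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:
    score = SVC(C=C).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

# Final evaluation: the test set is used once, with the chosen C.
final_score = SVC(C=best_C).fit(X_train, y_train).score(X_test, y_test)
print(best_C, final_score)
```

In practice you would usually replace the single validation split with cross-validation (e.g. `GridSearchCV`), but the principle is the same: the data used to pick hyperparameters must be separate from the data used for the final performance report.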

I see that a third set, the validation set, is introduced, justified by the risk of overfitting the test set during hyperparameter tuning.

The problem is that I cannot understand how this overfitting can appear, and therefore cannot see the justification for the third set.

Even though you are training models exclusively on the training data, you are optimizing hyperparameters (e.g. $C$ for an SVM) based on the test set. Each hyperparameter setting yields only a noisy estimate of true performance, and by picking the setting with the best test score you are selecting for favorable noise as well as genuine skill. As such, your estimate of performance can be optimistic, because you are essentially reporting best-case results. As some on this site have already mentioned, optimization is the root of all evil in statistics.
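You can see this selection effect with a toy simulation in plain Python. Suppose every "hyperparameter setting" produces a model with the same true accuracy of 0.5 (a coin flip), and you evaluate each one on the same test set of 100 points. The numbers below (test-set size, number of settings) are made up for the demonstration:

```python
# Toy illustration: taking the maximum test score over many equally
# useless hyperparameter settings yields an optimistically biased score.
import random

random.seed(0)
n_test = 100       # size of the test set
true_acc = 0.5     # every "model" is really a coin flip
n_settings = 50    # number of hyperparameter settings tried

# Each setting's measured test accuracy is a noisy estimate of 0.5.
scores = [
    sum(random.random() < true_acc for _ in range(n_test)) / n_test
    for _ in range(n_settings)
]

best = max(scores)  # the score you would report after "tuning" on the test set
print(f"true accuracy: {true_acc}, best observed: {best}")
```

The best observed score lands well above 0.5 purely because a maximum was taken over noisy estimates: nothing was learned, yet the reported number looks like skill. That gap is exactly the "leak" the scikit-learn quote warns about, and it is why the final test set must play no role in choosing hyperparameters.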