# In statistical learning theory, isn’t there a problem of overfitting on a test set?

Let’s consider the problem about classifying the MNIST dataset.

According to Yann LeCun’s MNIST Webpage, ‘Ciresan et al.’ got 0.23% error rate on MNIST test set using Convolutional Neural Network.

Let’s denote MNIST training set as $D_{train}$, MNIST test set as $D_{test}$, the final hypothesis they obtained using $D_{train}$ as $h_{1}$,and their error rate on MNIST Test set using $h_{1}$ as $E_{test}(h_{1}) = 0.0023$.

In their viewpoint, since $D_{test}$ is randomly sampled test set from input space regardless of $h_{1}$, they can insist that out-of-sample error performance of their final hypothesis $E_{out}(h_{1})$ is bounded as following from Hoeffding’s Inequality
$$P[|E_{out}(h_{1}) – E_{test}(h_{1})| < \epsilon|] \geq 1 – 2e^{2\epsilon^{2}N_{test}}$$
where $N_{test}=|D_{test}|$.

In other words, at least probability $1-\delta$,
$$E_{out}(h_1) \leq E_{test}(h_1) + \sqrt{{1 \over 2N_{test}}ln{2\over\delta}}$$

Let’s consider another viewpoint.
Suppose that some person wants to classify MNIST test set well. So he first looked at Yann LeCun’s MNIST Webpage, and found following results obtained by other people using 8 different models,

and picked his model $g$ which performed best on the MNIST test set among 8 models.

For him, the learning process was picking a hypothesis $g$ which performed best on test set $D_{test}$ from a hypothesis set $H_{trained}=\{h_1, h_2, .. ,h_8\}$.

Thus, the error on test set $E_{test}(g)$ is ‘in-sample’ error for this learning process, so he can apply the VC bound for finite hypothesis sets as following inequality.
$$P[|E_{out}(g)-E_{in}(g)|<\epsilon] \geq 1 – 2|H_{trained}|e^{2\epsilon^{2}N_{test}}$$

In other words, at least probability $1-\delta$,
$$E_{out}(g) \leq E_{test}(g) + \sqrt{{1 \over 2N_{test}}ln{2|H_{trained}|\over\delta}}$$

This result implies that there could be overfitting on test set if we pick the model performs best among several models.

In this case, the person might pick $h_{1}$, which has the lowest error rate $E_{test}(h_{1}) = 0.0023$. Since $h_{1}$ is the best hypothesis among 8 models on this particular test set $D_{test}$, there could be some possibility that $h_{1}$ is a hypothesis overfitted on the MNIST test set.

Thus, this person can insist the following inequality.
$$E_{out}(h_1) \leq E_{test}(h_1) + \sqrt{{1 \over 2N_{test}}ln{2|H_{trained}|\over\delta}}$$

Consequently, we got two inequalities
$$P[\;E_{out}(h_1) \leq E_{test}(h_1) + \sqrt{{1 \over 2N_{test}}ln{2\over\delta}}\;] \geq 1-\delta$$ and
$$P[\;E_{out}(h_1) \leq E_{test}(h_1) + \sqrt{{1 \over 2N_{test}}ln{2|H_{trained}|\over\delta}}\;] \geq 1-\delta$$.

Howerver, it is obvious that these two inequalities are incompatible.

Where am I doing wrong? Which one is right and which one is wrong?

If the latter is wrong, what is the right way to apply the VC bound for finite hypothesis sets in this case?

Among those two inequalities, I think the later is wrong. In brief, what’s wrong here is the identity $g=h_1$ given that $g$ is a function of the test data while $h_1$ is a model that is independent of test data.

In fact, $g$ is one of the 8 models in $H_{trained} = \{ h_1, h_2,…, h_8 \}$ that best predicts test set $D_{test}$.

Therefore, $g$ is a function of $D_{test}$. For a specific test set, $D^*_{test}$ (like the one you mentioned), it could happens that $g(D^*_{test}) = h_1$, but in general, depending on the test set, $g(D_{test})$ could take any value in $H_{trained}$. On the other hand $h_1$ is just one value in $H_{trained}$.

For the other question:

If the latter is wrong, what is the right way to apply the VC bound for finite hypothesis sets in this case?

Just don’t replace $g$ by $h_1$, you will get the correct bound (for $g$, of course) and it will have no conflict with the other bound (which is for $h_1$).