Consistency of the learning process

I have two questions related to the concept of “learning consistency” for those who are familiar with statistical learning theory à la Vapnik.

Question 1.
The learning process is called consistent (for the class of functions $\mathcal{F}$ and probability distribution $P$) if

$$R_{emp}(f^*_l) \buildrel P \over \to \inf_{f \in \mathcal{F}} R(f), \quad l \to \infty,$$
$$R(f^*_l) \buildrel P \over \to \inf_{f \in \mathcal{F}} R(f), \quad l \to \infty,$$

where $f^*_l$ denotes a function minimizing the empirical risk $R_{emp}$ over $\mathcal{F}$ on a sample of size $l$.

These two conditions are independent. On p. 83 of Vapnik’s “Statistical Learning Theory” there is an example of a set of classifiers $\mathcal{F}$ such that the second convergence takes place but the first one doesn’t. I was thinking about an example of a set of classifiers such that the first convergence takes place but the second one doesn’t, and couldn’t come up with anything. Can anyone help me here?
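As a sanity check on the definition, here is a hypothetical toy simulation (the setup, the threshold class, and the noise level are my own choices, not from Vapnik): for a finite class of threshold classifiers, both convergences in the definition can be watched numerically as $l$ grows.

```python
import random

# Toy illustration of the two convergences in the definition (my own
# hypothetical setup): X = {0,...,9}, Y = {0,1}, labels follow the rule
# 1[x >= 5] with 10% label noise, and F is the finite class of
# threshold classifiers f_t(x) = 1[x >= t], t = 0..10.

def draw(rng):
    x = rng.randrange(10)
    y = int(x >= 5)
    if rng.random() < 0.1:          # 10% label noise
        y = 1 - y
    return x, y

def emp_risk(t, data):
    # Empirical 0-1 risk of the threshold classifier f_t on a sample.
    return sum(int(int(x >= t) != y) for x, y in data) / len(data)

def true_risk(t):
    # Exact risk: the error probability is 0.1 where f_t agrees with
    # the clean rule 1[x >= 5] and 0.9 where it disagrees; x uniform.
    return sum(0.1 if int(x >= t) == int(x >= 5) else 0.9
               for x in range(10)) / 10

thresholds = range(11)
best_true = min(true_risk(t) for t in thresholds)  # = 0.1, at t = 5

rng = random.Random(0)
for l in (10, 100, 10_000):
    data = [draw(rng) for _ in range(l)]
    t_star = min(thresholds, key=lambda t: emp_risk(t, data))
    print(l, round(emp_risk(t_star, data), 3), true_risk(t_star))
# As l grows, both R_emp(f*_l) and R(f*_l) approach inf_F R(f) = 0.1.
```

Because $\mathcal{F}$ is finite here, $R_{emp}$ converges to $R$ uniformly over the class, so both limits in the definition hold; the interesting cases are infinite classes where the two conditions can come apart.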

Question 2.
The learning process is called nontrivially consistent (or strictly consistent) (for the class of functions $\mathcal{F}$ and probability distribution $P$) if for any real number $c \in \mathbb{R}$ such that the set $\Lambda(c) = \{ f \in \mathcal{F} \mid R(f) \geq c \}$ is nonempty we have:

$$\inf_{f \in \Lambda(c)} R_{emp}(f) \buildrel P \over \to \inf_{f \in \Lambda(c)} R(f), \quad l \to \infty.$$

P. 81 of Vapnik’s “Statistical Learning Theory” provides an illustration of why we want to consider strict consistency instead of the consistency defined in Question 1, i.e., why we want to introduce $\Lambda(c)$ and consider $\inf_{f \in \Lambda(c)}$ for every $c$. All other texts that consider strict consistency essentially duplicate Vapnik’s illustration when explaining the rationale behind the concept. However, I’m not really happy with Vapnik’s illustration for two reasons: first, it is done in terms of loss functions $Q(z, \alpha)$ rather than classifiers, and, second, Fig. 3.2 from the book doesn’t really make sense for the common classification loss, i.e., the function that equals 0 when the predicted class label matches the true class label and 1 otherwise.

So, is it possible to give another, more sensible, illustration of the rationale behind the concept of strict consistency? Essentially, we need an example of a set of classifiers that is not consistent (in terms of the definition from Question 1) together with some new classifier performing better than any classifier from the set, so that when we add this classifier to the set we end up with the case of “trivial consistency”. Any ideas?


For your Question 1, I have an example, but it requires the loss function
to take the value $\infty$. I am pretty sure we can give an example
that only requires an unbounded loss function, but that would be a bit
more work to construct. An open question is whether there’s an example
with a bounded loss function.

Consider the classification setting, where the probability distribution
$P$ is on a space $\mathcal{Z}=\mathcal{X}\times\{0,1\}$. We’ll
denote an example by $z=(x,y)$, with $x\in\mathcal{X}$ and $y\in\{0,1\}$.
Let $\mathcal{F}=\{0,1\}^{\mathcal{X}}$ be the space of all classification
functions on $\mathcal{X}$. Define the loss function
$$Q(z, f) = \begin{cases} 0 & \text{for } f(x)=y,\\ \infty & \text{otherwise,}\end{cases}$$

for any $f\in\mathcal{F}$. In other words, whether you get one example
wrong or all of them wrong, your risk is $\infty$.

Now, suppose $\mathcal{X}=\left\{ x_{1},x_{2},\ldots\right\} $ is
some countably infinite set, and let $P$ be any probability distribution
for which $P(\{x_{i}\})>0$ for all $i=1,2,\ldots$. Also, let’s assume
that the labels are deterministic, i.e., there exists $c\in\mathcal{F}$ for which $y_{i}=c(x_{i})$ for $i=1,2,\ldots$. This implies
that $\inf_{f\in\mathcal{F}}R(f)=0$, since $R(c)=0$.

Then for each $l$, $R_{emp}(f_{l}^{*})=0$, but $R(f_{l}^{*})=\infty$
(unless there is an extremely lucky choice of $f_{l}^{*}$ among all
those $f\in\mathcal{F}$ that have $0$ empirical error). Thus $R_{emp}(f_{l}^{*})\to\inf_{f\in\mathcal{F}}R(f)$,
but $R(f_{l}^{*})$ does not converge to that value.
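This failure mode can be simulated directly. A hypothetical sketch (the geometric distribution on $\mathcal{X}$ and the mod-2 labeling rule are my own illustrative choices, not part of the answer): an empirical risk minimizer that memorizes the sample and deliberately answers wrongly everywhere else has zero empirical risk but infinite true risk.

```python
import math
import random

# Sketch of the construction above (concrete choices are hypothetical):
# X = {1, 2, 3, ...} with geometric distribution P(x = i) = 2^{-i},
# and deterministic labels y = c(x) = x mod 2.

def true_label(x):
    return x % 2

def sample_x(rng):
    # Geometric sampling: P(x = i) = 2^{-i}, i = 1, 2, ...
    x = 1
    while rng.random() < 0.5:
        x += 1
    return x

def erm(sample):
    """One legal empirical risk minimizer: memorize the sample and
    guess the wrong label everywhere else (adversarial but admissible,
    since its empirical risk is still 0)."""
    memo = {x: y for x, y in sample}
    return lambda x: memo.get(x, 1 - true_label(x))

rng = random.Random(0)
l = 100
sample = [(x, true_label(x)) for x in (sample_x(rng) for _ in range(l))]
f = erm(sample)

# Empirical risk: the 0/inf loss averaged over the sample is 0.
emp_losses = [0.0 if f(x) == y else math.inf for x, y in sample]
R_emp = sum(emp_losses) / l
print(R_emp)  # 0.0

# True risk: f is wrong on every x outside the sample, and every such
# x has positive probability under P, so R(f) = inf.
unseen = next(i for i in range(1, 10**6) if i not in dict(sample))
print(f(unseen) == true_label(unseen))  # False
```

Since each unseen point carries positive probability and contributes an $\infty$ loss, $R(f^*_l)=\infty$ for every $l$ even though $R_{emp}(f^*_l)=0$, matching the argument above.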

For Question 2, I agree that Vapnik’s example does not seem to apply to the classification case, and I don’t see an obvious way to construct such an example.

Source: Link, Question Author: Leo, Answer Author: DavidR
