What does interpolating the training set actually mean?

I just read this article: Understanding Deep Learning (Still) Requires Rethinking Generalization

In section 6.1 I stumbled upon the following sentence

Specifically, in the overparameterized regime where the model capacity
greatly exceeds the training set size, fitting all the training
examples (i.e., interpolating the training set), including noisy ones,
is not necessarily at odds with generalization.

I do not fully understand the term “interpolating” in the context of fitting training data. Why do we speak of “interpolation” in this context? What does the term exactly mean here? Is there any other term that can be used instead?

In my understanding interpolation means the prediction within the training domain for some novel input that was not part of the training set.


Your question already got two nice answers, but I feel that some more context is needed.

First, we are talking here about overparametrized models and the double descent phenomenon. By overparametrized models we mean such that have way more parameters than datapoints. For example, Neal (2019), and Neal et al (2018) trained a network with hundreds of thousands of parameters for a sample of 100 MNIST images. The discussed models are so large, that they would be unreasonable for any practical applications. Because they are so large, they are able to fully memorize the training data. Before the double descent phenomenon attracted more attention in the machine learning community, memorizing the training data was assumed to lead to overfitting and poor generalization in general.

As already mentioned by @jcken, if a model has a huge number of parameters, it can easily fit a function to the data such that it “connects all the dots” and at prediction time just interpolates between the points. I’ll repeat myself, but until recently we would assume that this would lead to overfitting and poor performance. With the insanely huge models, this doesn’t have to be the case. The models would still interpolate, but the function would be so flexible that it won’t hurt the test set performance.

To understand it better, consider the lottery ticket hypothesis. Loosely speaking, it says that if you randomly initialize and train a big machine learning model (deep network), this network would contain a smaller sub-network, the “lottery ticket”, such that you could prune the big network while keeping the performance guarantees. The image below (taken from the linked post), illustrates such pruning. Having a huge number of parameters is like buying piles of lottery tickets, the more you have, the higher your chance of winning. In such a case, you can find a lottery ticket model that interpolates between the datapoints but also generalizes.

Animated illustration of an algorithm pruning a neural network to a smaller sub-network.

Another way to think about it is to consider a neural network as a kind of ensemble model. Every neural network has a pre-ultimate layer (image below, adapted from this), that you can think of as a collection of intermediate representations of your problem. The outputs of this layer are then aggregated (usually using a dense layer) for making the final prediction. This is like ensembling many smaller models. Again, if the smaller models memorized the data, even if each would overfit, by aggregating them, the effects would hopefully cancel out.

Fully connected neural network diagram. The last hidden layer is circled in red.

All the machine learning algorithms kind of interpolate between the datapoints, but if you more parameters than data, you would literally memorize the data and interpolate between them.

Source : Link , Question Author : Gilfoyle , Answer Author : Tim

Leave a Comment