Why does a function being smoother make it more likely?

I am currently studying the textbook Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams. Chapter 1 Introduction says the following:

Given this training data we wish to make predictions for new inputs $\mathbf{x}_*$ that we have not seen in the training set. Thus it is clear that the problem at hand is inductive; we need to move from the finite training data $\mathcal{D}$ to a function $f$ that makes predictions for all possible input values. To do this we must make assumptions about the characteristics of the underlying function, as otherwise any function which is consistent with the training data would be equally valid. A wide variety of methods have been proposed to deal with the supervised learning problem; here we describe two common approaches. The first is to restrict the class of functions that we consider, for example by only considering linear functions of the input. The second approach is (speaking rather loosely) to give a prior probability to every possible function, where higher probabilities are given to functions that we consider to be more likely, for example because they are smoother than other functions.

I am curious about this part:

The second approach is (speaking rather loosely) to give a prior probability to every possible function, where higher probabilities are given to functions that we consider to be more likely, for example because they are smoother than other functions.

Why does a function being smoother make it more likely?

Answer

While the author mentions smoothness only as an “example”, it is true that smoother functions are often preferred when modelling the characteristics of the “true” underlying function, and may therefore be assigned a higher prior probability, as the author says. Why is this? You may learn more by reading this similar question here, but essentially there is no deep justification for it, just the conventional belief that most quantities occurring in nature tend to change gradually rather than discontinuously. Practically, smoother functions are also desirable because they are easier to differentiate and tend to have convenient mathematical properties. More on that discussion here.
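To make the “prior over functions” idea concrete, here is a minimal sketch (plain NumPy; the `rbf_kernel` helper and the particular length-scales are my own illustration, not code from the book) of sampling functions from a Gaussian process prior. With a squared-exponential kernel, every sample is smooth, and the length-scale controls how quickly it is allowed to wiggle, so choosing the kernel is exactly where the smoothness assumption enters:

```python
# Draw sample functions from a zero-mean GP prior with an RBF kernel.
# Larger length-scales yield smoother (more slowly varying) samples.
import numpy as np

def rbf_kernel(x, length_scale=1.0):
    """Squared-exponential covariance matrix for 1-D inputs x."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 200)

for ls in (0.3, 2.0):
    K = rbf_kernel(x, length_scale=ls)
    # Small jitter keeps the covariance numerically positive definite.
    f = rng.multivariate_normal(np.zeros_like(x), K + 1e-8 * np.eye(len(x)))
    mean_abs_slope = np.mean(np.abs(np.diff(f) / np.diff(x)))
    print(f"length-scale {ls}: mean |slope| of sample ≈ {mean_abs_slope:.2f}")
```

Running this, the short length-scale produces a much larger average slope than the long one, i.e. the prior with the longer length-scale concentrates its probability on smoother functions.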

However, although smoothness assumptions are still widely anchored in statistical methods, in my experience over the years we have been working more and more with non-smooth functions. Examples I can think of arise in real-world optimization problems, interpolation problems, and many applications of deep neural networks (an easy one to see is the common ReLU activation function).
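As a small illustration of that last point (my own snippet, not from the answer): ReLU is continuous but not smooth, since its one-sided slopes at the origin disagree, so it has no derivative there.

```python
# ReLU is continuous everywhere but not differentiable at 0:
# the slope jumps from 0 (left of the origin) to 1 (right of it).
import numpy as np

relu = lambda x: np.maximum(0.0, x)

h = 1e-6
left_slope = (relu(0.0) - relu(-h)) / h   # ≈ 0
right_slope = (relu(h) - relu(0.0)) / h   # ≈ 1
print(left_slope, right_slope)
```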

In any case, while this question easily spurs debate, I think opportunities to ponder underlying principles are great!

Attribution
Source: Link, Question Author: The Pointer, Answer Author: Sharon Choong