Do Gaussian process (regression) have the universal approximation property?

Can any continuous function on [a, b], where a and b are real numbers, be approximated or arbitrarily close to the function (in some norm) by Gaussian Processes (Regression)?


As @Dougal notes, there are two different ways in which your question may be interpreted. They are closely related, even if it may not seem so.

The first interpretation is: let X be a compact subset of Rd (compactness is fundamental for all of the following!!!), let k(x,x) be a continuous covariance function (or kernel) defined on X×X, and denote with C(X) the normed space of continuous functions on X, equipped with the maximum norm ||||. For any function fC(X), can f be approximated to a prespecified tolerance ϵ by a function in the RKHS (Reproducing Kernel Hilbert Space) associated to k? You may well wonder what an RKHS is, an what all this has to do with Gaussian Process Regression. An RKHS K(X) is the closure of the vector space formed by all possible finite linear combinations of all possible functions fy(x)=k(x,y) where yX. This is very strictly related to Gaussian process regression, because given a Gaussian process prior GP(0,k(x,x)) on the space C(X), then the (closure of the) space of all possible posterior means which can be generated by Gaussian Process Regression is exactly the RKHS. As a matter of fact, all possible posterior means are of the form


i.e., they are finite linear combinations of functions fxi(x)=k(x,xi). Thus, we’re effectively asking if, given a Gaussian Process prior GP(0,k(x,x)) on C(X), for any function fC(X) there is always a function f in the (closure of the) space of all functions which can be generated by GPR, which is as close as desired to f.

The answer, for some particular kernels (including the classic Squared Exponential kernel, but not including the polynomial kernel), is yes. It can be proved that for such kernels K(X) is dense in C(X), i.e., for any fC(X) and for any tolerance ϵ, there is an f in K(X) such that ||ff||<ϵ. Note the assumptions: X is compact, f is continuous and k is a continuous kernel having the so-called universal approximation property. See here for a full proof in a more general (thus complicated) context.

This result is much less powerful than it looks at first sight. Even if f is in the (closure of the) space of the posterior means which can be generated by GPR, we haven't proved that it is the particular posterior mean returned by GPR, for a training set large enough, where of course the training set consists of noisy observations of f at points x1,,xn. We haven't even proved that the posterior mean returned by GPR converges at all, for n! This is actually the second interpretation suggested by @Dougal. The answer to this question depends on the answer to the first question: if there isn't any function f in the RKHS which is a "good approximation" to f, of course we cannot hope that the posterior mean returned by GPR converges to it. However, it's a different question. If you would like to have an answer to this question too, please ask a new question.

Source : Link , Question Author : Michael D , Answer Author : DeltaIV

Leave a Comment