It is often said that Gaussian process regression (GPR) corresponds to Bayesian linear regression with a (possibly) infinite number of basis functions. I am currently trying to understand this in detail to get an intuition for what kind of models I can express using GPR.
- Do you think that this is a good approach to try to understand GPR?
In the book Gaussian Processes for Machine Learning, Rasmussen and Williams show that the set of Gaussian processes described by the parameterised squared exponential kernel $k(x,x';l)=\sigma_p^2\exp\left(-\frac{(x-x')^2}{2l^2}\right)$ can be equivalently described as Bayesian regression with prior belief $w\sim N(0,\sigma_p^2 I)$ on the weights and an infinite number of basis functions of the form $\phi_c(x;l)=\exp\left(-\frac{(x-c)^2}{2l^2}\right)$.
Thus, the parameterisation of the kernel could be fully translated into a parameterisation of the basis functions.
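As a rough numerical check of this correspondence, here is a minimal sketch assuming NumPy. The grid of centres, the helper names `se_kernel` and `gaussian_features`, and the scaling of each weight's variance by the grid spacing `h` are choices made for this sketch, not taken from the book. With that scaling, the covariance induced by the Gaussian basis functions approaches $\sqrt{\pi}\,l\,\sigma_p^2\exp\left(-\frac{(x-x')^2}{4l^2}\right)$, that is, a squared exponential kernel with length scale $\sqrt{2}\,l$ rather than $l$, so the translation between kernel parameters and basis-function parameters is simple but not the identity.

```python
import numpy as np

def se_kernel(x, y, length, sigma_p=1.0):
    """Squared exponential kernel on all pairs of two 1-D arrays."""
    d = x[:, None] - y[None, :]
    return sigma_p**2 * np.exp(-d**2 / (2.0 * length**2))

def gaussian_features(x, centers, length):
    """Design matrix with entries phi_c(x) = exp(-(x - c)^2 / (2 l^2))."""
    d = x[:, None] - centers[None, :]
    return np.exp(-d**2 / (2.0 * length**2))

length, sigma_p = 0.5, 1.0
x = np.linspace(-1.0, 1.0, 21)

# Dense grid of centres (assumed wide enough that the tails are negligible).
centers = np.linspace(-8.0, 8.0, 1601)
h = centers[1] - centers[0]

Phi = gaussian_features(x, centers, length)

# Weights w_c ~ N(0, sigma_p^2 * h), so Cov[f] = sigma_p^2 * h * Phi Phi^T.
K_features = sigma_p**2 * h * (Phi @ Phi.T)

# Closed-form limit: sqrt(pi) * l * SE kernel with length scale sqrt(2) * l.
K_limit = np.sqrt(np.pi) * length * se_kernel(x, x, np.sqrt(2.0) * length, sigma_p)

max_abs_error = np.max(np.abs(K_features - K_limit))
print("max abs difference:", max_abs_error)
```

The difference should be tiny as long as the centres cover a range much wider than the inputs and the spacing is small relative to `length`.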
- Can the parameterisation of a differentiable kernel always be translated into a parameterisation of the prior and the basis functions, or are there differentiable kernels where e.g. the number of basis functions depends on the configuration?
My understanding so far is that for a fixed kernel function $k(x,x')$, Mercer's theorem tells us that $k(x,x')$ can be expressed as $k(x,x')=\sum_{i=1}^{\infty}\lambda_i\phi_i(x)\phi_i(x')$, where each $\phi_i$ is a function into either the reals or the complex numbers. Thus, for a given kernel the corresponding Bayesian regression model has prior $w\sim N(0,\mathrm{diag}([\lambda_1^2,\ldots]))$ and basis functions $\phi_i$. Thus, every GP can even be formulated as a Bayesian linear regression model with a diagonal prior.
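The Mercer expansion can be approximated numerically by discretising the eigenproblem on a grid. The following is a sketch assuming NumPy, a squared exponential kernel on $[0,1]$ and the uniform measure; the grid size, the length scale and the truncation level `m` are arbitrary choices for illustration. Note that in this sketch the prior variance of weight $i$ is taken to be $\lambda_i$ (standard deviation $\sqrt{\lambda_i}$), since $\mathrm{Cov}[f(x),f(x')]=\sum_i \mathrm{Var}(w_i)\,\phi_i(x)\phi_i(x')$ reproduces the kernel exactly when $\mathrm{Var}(w_i)=\lambda_i$.

```python
import numpy as np

def se_kernel(x, y, length=0.3):
    """Squared exponential kernel on all pairs of two 1-D arrays."""
    d = x[:, None] - y[None, :]
    return np.exp(-d**2 / (2.0 * length**2))

n = 200
x = np.linspace(0.0, 1.0, n)
weight = 1.0 / n  # quadrature weight for the uniform measure on [0, 1]

K = se_kernel(x, x)

# Discretised eigenproblem: (K * weight) u = lambda u.
evals, evecs = np.linalg.eigh(K * weight)
order = np.argsort(evals)[::-1]          # largest eigenvalues first
evals, evecs = evals[order], evecs[:, order]

m = 15                                   # number of terms kept
lam = evals[:m]
# Rescale so that sum_j weight * phi_i(x_j)^2 = 1 (orthonormal under the measure).
phi = evecs[:, :m] / np.sqrt(weight)

# Covariance of f = sum_i w_i phi_i with w_i ~ N(0, lam_i).
K_truncated = (phi * lam) @ phi.T

truncation_error = np.max(np.abs(K - K_truncated))
gram = weight * (phi.T @ phi)            # should be close to the identity
print("max truncation error:", truncation_error)
```

Because the eigenvalues of a smooth kernel decay quickly, a handful of terms already reproduces the Gram matrix closely on this grid.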
However, if we now apply Mercer's theorem to every configuration of a parameterised kernel $k(x,x',\theta)$ that is differentiable at every $\theta$, the corresponding eigenvalues and eigenfunctions might be different for every configuration.
My next question is about the converse of Mercer's theorem.
- Which sets of basis functions lead to valid kernels?
And the extension
- Which sets of parameterised basis functions lead to valid differentiable kernels?
Here are a few remarks. Perhaps someone else can fill in the details.
1) Basis representations are always a good idea. It’s hard to avoid them if you want to actually do something computational with your covariance function. The basis expansion can give you an approximation to the kernel and something to work with. The hope is that you can find a basis that makes sense for the problem you are trying to solve.
2) I’m not quite sure what you mean by a configuration in this question. At least one of the basis functions will need to be a function of θ for the kernel to depend on θ. So yes, the eigenfunctions will vary with the parameter. They would also vary with different parametrizations.
Typically, the number of basis functions will be (countably) infinite, so the number won’t vary with the parameter, unless some values caused the kernel to become degenerate.
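To illustrate the degenerate case, here is a small sketch assuming NumPy and a purely hypothetical kernel, $k(x,x';\theta)=xx'+\theta^2\exp\left(-\frac{(x-x')^2}{2l^2}\right)$, chosen only because it is differentiable in $\theta$ and collapses to a single basis function at $\theta=0$. The numerical rank of the Gram matrix stands in for the number of basis functions that matter on the grid.

```python
import numpy as np

def toy_kernel(x, y, theta, length=0.3):
    """Hypothetical kernel: linear term plus theta^2 times a squared exponential."""
    d = x[:, None] - y[None, :]
    linear = x[:, None] * y[None, :]
    return linear + theta**2 * np.exp(-d**2 / (2.0 * length**2))

x = np.linspace(0.1, 1.0, 40)

def numerical_rank(theta):
    """Numerical rank of the Gram matrix on the grid x."""
    return int(np.linalg.matrix_rank(toy_kernel(x, x, theta)))

rank_degenerate = numerical_rank(0.0)   # only the basis function phi(x) = x survives
rank_generic = numerical_rank(1.0)      # many eigenfunctions contribute

print("rank at theta = 0:", rank_degenerate)
print("rank at theta = 1:", rank_generic)
```

At $\theta=0$ the Gram matrix is an outer product and has rank one, while for $\theta\neq 0$ the rank is limited only by how fast the eigenvalues fall below floating-point precision.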
I also don’t understand what you mean in point 2 about the process having Bayesian prior $w\sim N(0,\mathrm{diag}[\lambda_1^2,\ldots])$, since $w$ has not been mentioned up until this point. Also, $\mathrm{diag}[\lambda_1^2,\ldots]$ seems to be an infinite-dimensional matrix, and I have a problem with that.
3) Which sets of basis functions form valid kernels? If you’re thinking about an eigenbasis, then the functions need to be orthogonal with respect to some measure. There are two problems. 1) The resulting kernel has to be positive definite, and that’s OK if the $\lambda_i$ are positive. And 2) the expansion has to converge. This will depend on the $\lambda_i$, which need to decay fast enough to ensure the convergence of the expression. Convergence will also depend on the domain of the $x$’s.
If the basis functions are not orthogonal then it will be more difficult to show that a covariance defined from them is positive definite. Obviously, in that case you are not dealing with an eigen-expansion, but with some other way of approximating the function of interest.
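One part of this can be checked directly for a finite set of functions. For any real basis functions and nonnegative weights, $v^\top K v=\sum_i \lambda_i\left(\sum_j v_j\phi_i(x_j)\right)^2\ge 0$, so positive semi-definiteness holds by construction, orthogonal or not; what is not guaranteed is strict positive definiteness, since the Gram matrix has rank at most the number of functions. The sketch below assumes NumPy, and the three basis functions and weights are arbitrary, hypothetical choices.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 30)

# Three deliberately non-orthogonal, hypothetical basis functions: 1, x, tanh(x).
Phi = np.column_stack([np.ones_like(x), x, np.tanh(x)])
lam = np.array([1.0, 0.5, 0.25])        # nonnegative weights (prior variances)

# Kernel k(x, x') = sum_i lam_i * phi_i(x) * phi_i(x') evaluated on the grid.
K = (Phi * lam) @ Phi.T

eigenvalues = np.linalg.eigvalsh(K)
min_eigenvalue = eigenvalues.min()
# Count eigenvalues that are clearly above round-off level.
num_positive = int(np.sum(eigenvalues > 1e-10 * eigenvalues.max()))

print("smallest eigenvalue:", min_eigenvalue)
print("eigenvalues above round-off:", num_positive)
```

The smallest eigenvalue is zero up to round-off and only three eigenvalues are clearly positive, so this kernel is a valid covariance but a degenerate one.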
However, I don’t think people typically start from a bunch of functions and then try to build a covariance kernel from them.
RE: Differentiability of the kernel and differentiability of the basis functions. I don’t actually know the answer to this question, but I would offer the following observation.
Functional analysis proceeds by approximating functions (from an infinite-dimensional space) by finite sums of simpler functions. To make this work, everything depends on the type of convergence involved. Typically, if you are working on a compact set with strong convergence properties (uniform convergence or absolute summability) on the functions of interest, then you get the kind of intuitive result that you are looking for: the properties of the simple functions pass over to the limit function. For example, if the kernel is a differentiable function of a parameter, then the expansion functions must be differentiable functions of the same parameter, and vice versa. Under weaker convergence properties or on non-compact domains, this does not happen. In my experience, there’s a counter-example to every “reasonable” idea one comes up with.
Note: To forestall possible confusion from readers of this question, note that the Gaussian expansion of point 1 is not an example of the eigen-expansion of point 2.