**TLDR:** Do thin plate regression splines have a probabilistic/Bayesian interpretation?

Given input-output pairs (x_i, y_i), i = 1,\dots,n, I want to estimate a function f(\cdot) as follows:

\begin{equation}
f(x) \approx u(x) = \phi(x)^T\beta + \sum_{i=1}^{n} \alpha_i \, k(x, x_i),
\end{equation}

where k(\cdot,\cdot) is a kernel function and \phi(x) is a feature vector of size m<n. The coefficients \alpha_i and \beta_j can be found by solving

\begin{equation}
\min_{\alpha,\,\beta} \; (Y - \Phi\beta - K\alpha)^T(Y - \Phi\beta - K\alpha) + \lambda\, \alpha^T K \alpha,
\end{equation}

where the rows of \Phi are given by \phi(x_i)^T and, with some abuse of notation, the i,j'th entry of the kernel matrix K is k(x_i, x_j). This gives

\begin{equation}
\alpha^*=\lambda^{-1}(I+\lambda^{-1}K)^{-1}(Y-\Phi\beta^*),
\end{equation}

\begin{equation}
\beta^*=\{\Phi^T(I+\lambda^{-1}K)^{-1}\Phi\}^{-1}\Phi^T(I+\lambda^{-1}K)^{-1}Y.
\end{equation}
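As a sanity check on these closed-form expressions, \alpha^* and \beta^* should make the gradient of the penalized criterion \|Y - \Phi\beta - K\alpha\|^2 + \lambda\,\alpha^T K \alpha vanish. Here is a minimal NumPy sketch (the Gaussian kernel, synthetic data, and value of \lambda are arbitrary choices, not part of the original question):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 30, 0.5
x = rng.uniform(-2, 2, n)
Y = np.sin(x) + 0.1 * rng.standard_normal(n)

# Feature matrix Phi (rows phi(x_i)^T = (1, x_i)) and a Gaussian kernel matrix K.
Phi = np.column_stack([np.ones(n), x])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

# Closed-form solution from the question, with M = I + K / lambda.
M = np.eye(n) + K / lam
Minv_Phi = np.linalg.solve(M, Phi)
Minv_Y = np.linalg.solve(M, Y)
beta = np.linalg.solve(Phi.T @ Minv_Phi, Phi.T @ Minv_Y)
alpha = np.linalg.solve(M, Y - Phi @ beta) / lam

# Stationarity conditions of
#   min_{alpha, beta} ||Y - Phi beta - K alpha||^2 + lam * alpha^T K alpha
# should hold at (alpha, beta): both gradients vanish up to rounding.
resid = Y - Phi @ beta - K @ alpha
grad_alpha = -2 * K @ resid + 2 * lam * K @ alpha
grad_beta = -2 * Phi.T @ resid
print(np.max(np.abs(grad_alpha)), np.max(np.abs(grad_beta)))
```

Note that the solution automatically satisfies \Phi^T\alpha^* = 0 here, since \alpha^* = (K + \lambda I)^{-1}(Y - \Phi\beta^*); this constraint reappears explicitly in the answer below.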

Assuming that k(\cdot,\cdot) is a positive definite kernel function, this solution can be seen as the Best Linear Unbiased Predictor for the following Bayesian model:

\begin{equation}
y \mid (\beta, h(\cdot)) \sim N(\phi(x)^T\beta + h(x),\, \sigma^2),
\end{equation}

\begin{equation}
h(\cdot) \sim GP(0,\, \tau\, k(\cdot,\cdot)),
\end{equation}

\begin{equation}
\beta \propto 1,
\end{equation}

where \sigma^2/\tau = \lambda and GP denotes a Gaussian process. See for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2665800/

My question is as follows. Suppose that I let k(x,x') := |x-x'|^2 \ln(|x-x'|) and \phi(x)^T = (1, x), i.e. thin plate spline regression. Now k(\cdot,\cdot) is not a positive semidefinite function and the above interpretation does not work. Do the above model and its solution still have a probabilistic interpretation, as in the case where k(\cdot,\cdot) is positive semidefinite?
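To see concretely that this kernel is not positive semidefinite, here is a small numerical check (a sketch; the points and seed are arbitrary). Since k(x,x) = 0, the kernel matrix has zero trace, so it must have a negative eigenvalue whenever it is nonzero:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 5.0, 20)

# Thin-plate kernel k(x, x') = |x - x'|^2 * ln|x - x'|, with k = 0 on the diagonal.
r = np.abs(x[:, None] - x[None, :])
K = np.where(r > 0, r**2 * np.log(np.maximum(r, 1e-300)), 0.0)

# A positive semidefinite matrix would have only nonnegative eigenvalues;
# here trace(K) = 0 forces at least one eigenvalue to be negative.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())  # negative => K is not PSD
```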

**Answer**

Let the model of the question be written as

\begin{equation}
\tag{1}
Y_i = \boldsymbol{\phi}(\mathbf{x}_i)^\top\boldsymbol{\beta} + h(\mathbf{x}_i) + \varepsilon_i
\end{equation}

where h(\mathbf{x}) is an unobserved GP with index \mathbf{x} \in \mathbb{R}^d and \varepsilon_i is a normal noise term with variance \sigma^2. The GP is usually assumed to be centered, stationary and non-deterministic. Note that the term \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\beta} can be regarded as a (deterministic) GP with kernel \boldsymbol{\phi}(\mathbf{x})^\top \mathbf{B}\, \boldsymbol{\phi}(\mathbf{x}') where \mathbf{B} is an ``infinite-valued'' covariance matrix. Indeed, by taking \mathbf{B} := \rho \, \mathbf{I} with \rho \to \infty we get the kriging equations of the question. This is often called the *diffuse prior* for \boldsymbol{\beta}. A proper posterior for \boldsymbol{\beta} results only when the matrix \boldsymbol{\Phi} has full column rank. So the model can equally be written as

\begin{equation}
\tag{2}
Y_i = \zeta(\mathbf{x}_i) + \varepsilon_i
\end{equation}

where \zeta(\mathbf{x}) is a GP. The same Bayesian interpretation can be used, with restrictions, when \zeta(\mathbf{x}) is no longer a GP but rather an *Intrinsic Random Function* (IRF). The derivation can be found in the book of G. Wahba. Readable presentations of the concept of IRF can be found e.g. in the book by N. Cressie and the article by Mardia et al. cited below. IRFs are similar to the well-known integrated processes of the discrete-time context (such as ARIMA): an IRF is turned into a classical GP by a kind of differencing operation.

Here are two examples of IRFs for d=1. Firstly, consider a Wiener process \zeta(x) with its initial condition \zeta(0) = 0 replaced by a *diffuse* initial condition: \zeta(0) is normal with an infinite variance. Once one value \zeta(x) is known, the IRF can be predicted just as the Wiener GP can. Secondly, consider an *integrated Wiener process* given by the equation \text{d}^2 \zeta(x) / \text{d}x^2 = \text{d} W(x)/\text{d}x, where W(x) is a Wiener process. To get a GP we now need two scalar parameters: two values \zeta(x) and \zeta(x') for x \neq x', or the values \zeta(x) and \text{d}\zeta(x) / \text{d}x at some chosen x. We may consider the two extra parameters as jointly Gaussian with an infinite 2 \times 2 covariance matrix. In both examples, as soon as a suitable finite set of observations is available, the IRF can essentially be handled as a GP. Moreover, each example involves a differential operator: L := \text{d}/ \text{d}x and L := \text{d}^2/ \text{d}x^2 respectively. The nullspace of L is a linear space \mathcal{F} of functions \phi(x) such that L \phi = 0: it contains the constant function \phi_1(x)=1 in the first case, and the functions \phi_1(x)=1 and \phi_2(x) = x in the second case. Note that \zeta(x) - \zeta(x + \delta) is a GP in x for any fixed \delta in the first example, and similarly \zeta(x-\delta) - 2 \zeta(x) + \zeta(x + \delta) is a GP in the second example.
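The differencing idea can be illustrated numerically. In the integrated Wiener example, the diffuse part is an affine function a + bx with arbitrarily large coefficients, and second differencing with weights [1, -2, 1] annihilates any affine function exactly. A short NumPy sketch (the grid, seed, and the particular values of a, b are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, dx = 200, 0.01
x = np.arange(n) * dx

# Discrete integrated Wiener path: cumulative sum of a Brownian path.
w = np.cumsum(rng.standard_normal(n) * np.sqrt(dx))   # Brownian path
z = np.cumsum(w) * dx                                  # integrated Wiener

# "Diffuse" affine part a + b*x with large coefficients standing in for
# the two parameters with infinite variance.
a, b = 1e6, -3e5
z_diffuse = z + a + b * x

# Second differencing (weights [1, -2, 1]) annihilates any affine function,
# so it removes the diffuse part and leaves the same (Gaussian) increments.
d2 = z[:-2] - 2 * z[1:-1] + z[2:]
d2_diffuse = z_diffuse[:-2] - 2 * z_diffuse[1:-1] + z_diffuse[2:]
print(np.max(np.abs(d2 - d2_diffuse)))  # ~0 up to rounding
```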

For a general dimension d, consider a linear space \mathcal{F} of functions defined on \mathbb{R}^d. We call an *increment* relative to \mathcal{F} a finite collection of s locations \mathbf{x}_i \in \mathbb{R}^d and s real weights \nu_i such that

\begin{equation}
\sum_{i=1}^s \nu_i \,\phi(\mathbf{x}_i) = 0 \quad \text{for all } \phi \in \mathcal{F}.
\end{equation}

Take \mathcal{F} to be the nullspace from our examples. For the first example we can take e.g. s=2 with x_1 and x_2 arbitrary and \boldsymbol{\nu} = [1,\,-1]. For the second example we can take s = 3 equally spaced x_i's and \boldsymbol{\nu} = [1,\,-2,\,1]. The definition of an IRF involves a space of functions \mathcal{F} and a function g(\mathbf{x}, \, \mathbf{x}') which is *conditionally positive* w.r.t. \mathcal{F}, meaning that

\begin{equation}
\sum_{i=1}^s \sum_{j=1}^s \nu_i \nu_j \, g(\mathbf{x}_i,\,\mathbf{x}_j) \geq 0
\end{equation}

holds as soon as [\nu_i,\,\mathbf{x}_i]_{i=1}^s is an increment w.r.t. \mathcal{F}. From \mathcal{F} and g(\mathbf{x},\,\mathbf{x}') we can build a covariance kernel, hence a GP, as in Mardia et al. We can also start from a linear differential operator L and use its nullspace as \mathcal{F}; the IRF is then connected with the equation L \zeta = \text{Gaussian white noise}.
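Conditional positivity is easy to probe numerically. Anticipating the d=1, m=2 thin-plate case below, take g(x, x') = |x - x'|^3 and \mathcal{F} = \operatorname{span}\{1, x\}: projecting random weight vectors onto the space of increments (i.e. enforcing \boldsymbol{\Phi}^\top\boldsymbol{\nu} = \mathbf{0}) always yields a nonnegative quadratic form. A sketch with arbitrary points and seed:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
x = np.sort(rng.uniform(-1, 1, n))

# g(x, x') = |x - x'|^3, conditionally positive w.r.t. F = span{1, x}.
G = np.abs(x[:, None] - x[None, :]) ** 3
Phi = np.column_stack([np.ones(n), x])

# Orthogonal projector onto the space of increments: Phi^T nu = 0.
P = np.eye(n) - Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)

# The quadratic form nu^T G nu is nonnegative for every increment nu.
quad_forms = []
for _ in range(100):
    nu = P @ rng.standard_normal(n)
    quad_forms.append(nu @ G @ nu)
print(min(quad_forms))  # >= 0 up to rounding
```

Without the projection, random weight vectors easily produce negative values of the same quadratic form, which is exactly the failure of (unconditional) positive semidefiniteness raised in the question.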

The computation of the prediction of the IRF is nearly the same as in the question, with k(\mathbf{x},\,\mathbf{x}') replaced by g(\mathbf{x},\,\mathbf{x}'), but with the \phi_i(\mathbf{x}) now forming a basis of \mathcal{F}. The extra constraint \boldsymbol{\Phi}^\top \boldsymbol{\alpha} = \mathbf{0} must be added to the optimisation problem, which guarantees that \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} \geq 0. We can still add further basis functions which are not in \mathcal{F} if needed; this has the effect of adding a deterministic GP, say \boldsymbol{\psi}(\mathbf{x})^\top\boldsymbol{\gamma}, to the IRF \zeta(\mathbf{x}) in (2).
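The constrained problem reduces to one bordered linear system: (\mathbf{G} + \lambda \mathbf{I})\boldsymbol{\alpha} + \boldsymbol{\Phi}\boldsymbol{\beta} = \mathbf{Y} together with \boldsymbol{\Phi}^\top\boldsymbol{\alpha} = \mathbf{0}. A minimal sketch for d=1 with g(x, x') = |x - x'|^3 (the data, seed, and \lambda are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam = 40, 1e-3
x = np.sort(rng.uniform(0, 1, n))
Y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(n)

# g(x, x') = |x - x'|^3 and a basis of the nullspace F = span{1, x}.
G = np.abs(x[:, None] - x[None, :]) ** 3
Phi = np.column_stack([np.ones(n), x])

# Bordered system: (G + lam I) alpha + Phi beta = Y,  Phi^T alpha = 0.
A = np.block([[G + lam * np.eye(n), Phi],
              [Phi.T, np.zeros((2, 2))]])
sol = np.linalg.solve(A, np.concatenate([Y, np.zeros(2)]))
alpha, beta = sol[:n], sol[n:]

def predict(xnew):
    """Evaluate u(x) = phi(x)^T beta + sum_i alpha_i g(x, x_i)."""
    return beta[0] + beta[1] * xnew + np.abs(xnew[:, None] - x[None, :]) ** 3 @ alpha

print(np.abs(Phi.T @ alpha).max())   # constraint Phi^T alpha = 0 holds
print(np.abs(predict(x) - Y).max())  # residuals shrink as lambda -> 0
```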

The thin-plate spline depends on an integer m such that 2m > d; the space \mathcal{F} contains the polynomials of low degree, with a dimension p(m) depending on m and d. It can be shown that if E(r) is the following function for r \geq 0,

\begin{equation}
E(r) :=
\begin{cases}
(-1)^{m + 1 + d/2} \, r^{2m-d} \log r & d \text{ even},\\
r^{2m-d} & d \text{ odd,}
\end{cases}
\end{equation}

then g(\mathbf{x},\,\mathbf{x}') := E(\|\mathbf{x} - \mathbf{x}'\|) defines a conditionally positive function w.r.t. \mathcal{F}. The construction relates to a differential operator L. It turns out that for d=1 and m=2 the thin plate spline is nothing other than the usual natural cubic spline, which relates to the integrated Wiener example above, with g(x,\,x') = |x - x'|^3. So (2) is then nothing other than the usual smoothing spline model.

When d=2 and m=2 the nullspace has dimension p(m)=3

and is generated by the functions 1, x_1 and x_2.

Cressie, N. *Statistics for Spatial Data*. Wiley, 1993.

Wahba, G. *Spline Models for Observational Data*. SIAM, 1990.

Wang, Y. *Smoothing Splines: Methods and Applications*. Chapman and Hall, 2011.

**Attribution** *Source: Link, Question Author: MthQ, Answer Author: Yves*