Probabilistic interpretation of Thin Plate Smoothing Splines

TLDR: Do thin plate regression splines have a probabilistic/Bayesian interpretation?

Given input-output pairs (x_i, y_i), i = 1, \dots, n, I want to estimate a function f(\cdot) as follows:
\begin{equation}
f(x) = \phi(x)^T\beta + \sum_{i=1}^n \alpha_i k(x, x_i),
\end{equation}
where k(\cdot,\cdot) is a kernel function and \phi(x) is a feature vector of size m < n. The coefficients \alpha_i and \beta can be found by solving
\begin{equation}
\min_{\alpha,\beta} \; \|Y - \Phi\beta - K\alpha\|^2 + \lambda\,\alpha^T K \alpha,
\end{equation}
where the rows of \Phi are given by \phi(x_i)^T and, with some abuse of notation, the (i,j)th entry of the kernel matrix K is k(x_i, x_j). This gives
\begin{equation}
\alpha^*=\lambda^{-1}(I+\lambda^{-1}K)^{-1}(Y-\Phi\beta^*)
\end{equation}

\begin{equation}
\beta^*=\{\Phi^T(I+\lambda^{-1}K)^{-1}\Phi\}^{-1}\Phi^T(I+\lambda^{-1}K)^{-1}Y.
\end{equation}
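As a quick numerical sanity check (my addition, not part of the original question), the closed-form \alpha^*, \beta^* can be compared against an independent solve of the first-order conditions of the penalized problem. The data, the exponential kernel and all constants below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 0.1
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

Phi = np.column_stack([np.ones(n), x])              # phi(x)^T = (1, x)
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)  # a positive-definite kernel

# Closed-form solution from the question.
A = np.linalg.inv(np.eye(n) + K / lam)              # (I + K/lambda)^{-1}
beta = np.linalg.solve(Phi.T @ A @ Phi, Phi.T @ A @ y)
alpha = A @ (y - Phi @ beta) / lam

# Independent route: solve the stacked first-order conditions
#   (K + lam I) alpha + Phi beta       = y
#   Phi^T K alpha     + Phi^T Phi beta = Phi^T y
top = np.hstack([K + lam * np.eye(n), Phi])
bot = np.hstack([Phi.T @ K, Phi.T @ Phi])
sol = np.linalg.solve(np.vstack([top, bot]), np.concatenate([y, Phi.T @ y]))

print(np.allclose(sol[:n], alpha), np.allclose(sol[n:], beta))  # True True
```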

Assuming that k(\cdot,\cdot) is a positive definite kernel function, this solution can be seen as the Best Linear Unbiased Predictor for the following Bayesian model:
\begin{equation}
y \mid \beta, h(\cdot) \sim N(\phi(x)^T\beta + h(x), \sigma^2),
\end{equation}

\begin{equation}
h(\cdot) \sim GP(0, \tau k(\cdot,\cdot)),
\end{equation}

\begin{equation}
p(\beta) \propto 1,
\end{equation}

where \sigma^2/\tau=\lambda and GP denotes a Gaussian process. See for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2665800/
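To see where \sigma^2/\tau = \lambda enters, here is a one-step sketch of the correspondence (my addition). Writing \mathbf{h} = (h(x_1), \dots, h(x_n))^T \sim N(0, \tau K) and reparametrising \mathbf{h} = K\alpha, the negative log posterior is, up to constants,
\begin{equation}
\frac{1}{2\sigma^2}\|Y - \Phi\beta - K\alpha\|^2 + \frac{1}{2\tau}\alpha^T K \alpha
= \frac{1}{2\sigma^2}\left(\|Y - \Phi\beta - K\alpha\|^2 + \lambda\,\alpha^T K\alpha\right),
\end{equation}
since \mathbf{h}^T(\tau K)^{-1}\mathbf{h} = \alpha^T K\alpha / \tau and the flat prior on \beta contributes nothing; the MAP estimate thus solves exactly the minimisation problem above.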

My question is as follows. Suppose that I let k(x,x') := |x-x'|^2 \ln(|x-x'|) and \phi(x)^T = (1, x), i.e. thin-plate spline regression. Now k(\cdot,\cdot) is not a positive semidefinite function, so the above interpretation does not work. Do the above model and its solution still have a probabilistic interpretation, as in the case where k(\cdot,\cdot) is positive semidefinite?

Answer

Let the model of the question be written as
\begin{equation}
\tag{1}
Y_i = \boldsymbol{\phi}(\mathbf{x}_i)^\top\boldsymbol{\beta} +
h(\mathbf{x}_i) + \varepsilon_i
\end{equation}

where h(\mathbf{x}) is an unobserved GP with index
\mathbf{x} \in \mathbb{R}^d and \varepsilon_i is a normal noise
term with variance \sigma^2. The GP is usually assumed to be
centered, stationary and non-deterministic. Note that the term
\boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\beta} can be
regarded as a (deterministic) GP with kernel
\boldsymbol{\phi}(\mathbf{x})^\top \mathbf{B}\,
\boldsymbol{\phi}(\mathbf{x}'),
where \mathbf{B} is an "infinite-valued"
covariance matrix. Indeed, by taking \mathbf{B} := \rho \, \mathbf{I}
with \rho \to \infty we get the kriging equations of the
question. This is often called the diffuse prior for
\boldsymbol{\beta}. A proper posterior for \boldsymbol{\beta}
results only when the matrix \boldsymbol{\Phi} has full rank. So the
model can equally be written as
\begin{equation}
\tag{2}
Y_i = \zeta(\mathbf{x}_i) + \varepsilon_i
\end{equation}
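The diffuse-prior limit can be checked numerically. The sketch below (my illustration, with \tau = 1 so that the noise variance equals \lambda) shows that the GP posterior mean under \mathbf{B} = \rho\,\mathbf{I} approaches the kriging fit \Phi\beta^* + K\alpha^* of the question as \rho grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, rho = 40, 0.1, 1e6          # lam = sigma^2 / tau, with tau = 1
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

Phi = np.column_stack([np.ones(n), x])
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)   # PD kernel

# Kriging fit from the closed-form solution of the question.
A = np.linalg.inv(np.eye(n) + K / lam)
beta = np.linalg.solve(Phi.T @ A @ Phi, Phi.T @ A @ y)
alpha = A @ (y - Phi @ beta) / lam
fit_krig = Phi @ beta + K @ alpha

# GP posterior mean under beta ~ N(0, rho I): the prior covariance of
# zeta is rho * Phi Phi^T + K, and the noise variance is lam.
C = rho * Phi @ Phi.T + K
fit_gp = C @ np.linalg.solve(C + lam * np.eye(n), y)

print(np.max(np.abs(fit_gp - fit_krig)))   # small; shrinks as rho grows
```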

where \zeta(\mathbf{x}) is a GP. The same Bayesian interpretation can
be used, with restrictions, when \zeta(\mathbf{x}) is no longer a GP
but rather an Intrinsic Random Function (IRF). The derivation can
be found in the book by G. Wahba. Readable presentations
of the concept of IRF are given e.g. in the book by N. Cressie and the
article by Mardia et al. cited below. IRFs are similar to the
well-known integrated processes of the discrete-time context (such as
ARIMA): an IRF is turned into a classical GP by a kind of
differencing operation.

Here are two examples of IRFs for d = 1. Firstly, consider a Wiener
process \zeta(x) with its initial condition \zeta(0) = 0 replaced
by a diffuse initial condition: \zeta(0) is normal with an
infinite variance. Once one value \zeta(x) is known, the IRF can be
predicted just as the Wiener GP can. Secondly, consider an integrated
Wiener process given by the equation
\text{d}^2 \zeta(x) / \text{d}x^2 = \text{d} W(x)/\text{d}x,
where W(x) is a Wiener
process. To get a GP we now need two scalar parameters: two values
\zeta(x) and \zeta(x') for x \neq x', or the values \zeta(x)
and \text{d}\zeta(x) / \text{d}x at some chosen x. We may
consider that the two extra parameters are jointly Gaussian with an
infinite 2 \times 2 covariance matrix. In both examples, as soon as
a suitable finite set of observations is available, the IRF can
essentially be handled as a GP. Moreover, each example involves a
differential operator: L := \text{d}/\text{d}x
and L := \text{d}^2/\text{d}x^2 respectively.
The nullspace of L is a linear space \mathcal{F} of functions \phi(x)
such that L \phi = 0. It contains the constant function
\phi_1(x)=1 in the first case, and the functions \phi_1(x)=1 and
\phi_2(x) = x in the second case. Note that
\zeta(x) - \zeta(x + \delta) is a GP for any fixed \delta in the
first example, and similarly \zeta(x-\delta) - 2 \zeta(x) + \zeta(x + \delta)
is a GP in the second case.
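The second example is easy to simulate. The sketch below (my illustration; the grid, spacing and replication count are arbitrary) builds a discretised integrated Wiener process and checks that its second differences have a variance that does not depend on x, whereas \zeta(x) itself is nonstationary:

```python
import numpy as np

rng = np.random.default_rng(2)
n_rep, n_grid, dt = 5000, 400, 0.01
dW = rng.standard_normal((n_rep, n_grid)) * np.sqrt(dt)
W = np.cumsum(dW, axis=1)                    # Wiener process on a grid
zeta = np.cumsum(W, axis=1) * dt             # integrated Wiener process

d = 10                                       # spacing delta = d * dt
diff2 = zeta[:, 2*d:] - 2 * zeta[:, d:-d] + zeta[:, :-2*d]

print(np.var(zeta, axis=0)[[50, 200, 350]])  # grows with x: nonstationary
print(np.var(diff2, axis=0)[[50, 200, 350]]) # roughly constant in x
```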

For a general dimension d, consider a linear space \mathcal{F} of functions
defined on \mathbb{R}^d. We call an increment
relative to \mathcal{F} a finite collection of s locations
\mathbf{x}_i \in \mathbb{R}^d and s real weights \nu_i such that
\begin{equation}
\sum_{i=1}^s \, \nu_i \,\phi(\mathbf{x}_i) = 0
\text{ for all } \phi \in \mathcal{F}.
\end{equation}

Consider \mathcal{F} as being the nullspace in our examples. For the
first example we can take e.g. s=2 with x_1 and x_2
arbitrary and \boldsymbol{\nu} = [1, \, -1]. For the second example we can take s = 3
equally spaced x_i's and \boldsymbol{\nu} = [1,\,-2,\,1]. The
definition of an IRF involves a space of functions \mathcal{F} and a
function g(\mathbf{x}, \, \mathbf{x}') which is conditionally
positive w.r.t. \mathcal{F}, which means that
\begin{equation}
\sum_{i=1}^s \sum_{j=1}^s \nu_i \nu_j \, g(\mathbf{x}_i,
\, \mathbf{x}_j) \geq 0
\end{equation}
holds as soon as [\nu_i,\,\mathbf{x}_i]_{i=1}^s is an increment
w.r.t. \mathcal{F}. From \mathcal{F} and g(\mathbf{x},\,\mathbf{x}')
we can build a covariance kernel, hence a GP, as in Mardia et al. We can
also start from a linear differential operator L and use its nullspace as \mathcal{F}; the IRF is then connected with the equation L \zeta = \text{Gaussian noise}.
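Conditional positivity is easy to verify numerically. The sketch below (my addition) uses g(x, x') = |x - x'|^3, the d = 1, m = 2 thin-plate kernel discussed further down, and checks that \nu^\top \mathbf{G} \nu \geq 0 for increments \nu annihilating \mathcal{F} = \text{span}\{1, x\}, even though \mathbf{G} itself is not positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(3)
s = 30
x = np.sort(rng.uniform(0, 1, s))
G = np.abs(x[:, None] - x[None, :])**3           # g(x, x') = |x - x'|^3
Phi = np.column_stack([np.ones(s), x])           # basis of F = span{1, x}

# Orthogonal projector onto increments, i.e. onto the nullspace of Phi^T.
P = np.eye(s) - Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)

print(np.min(np.linalg.eigvalsh(G)) < 0)         # True: G is not PSD
V = rng.standard_normal((1000, s)) @ P           # random increments nu
print(np.min(np.einsum('ij,jk,ik->i', V, G, V)) >= -1e-9)  # True
```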

The computation of the prediction of the IRF is nearly the same as in
the question, with k(\mathbf{x},\,\mathbf{x}') replaced by
g(\mathbf{x},\,\mathbf{x}'), but with the \phi_i(\mathbf{x}) now
forming a basis of \mathcal{F}. The extra constraint
\boldsymbol{\Phi}^\top \boldsymbol{\alpha} = \mathbf{0} must be
added to the optimisation problem, which guarantees that
\boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} \geq 0.
We can still add more basis functions which are not in
\mathcal{F} if needed; this has the effect of adding a
deterministic GP, say
\boldsymbol{\psi}(\mathbf{x})^\top\boldsymbol{\gamma}, to the IRF
\zeta(\mathbf{x}) in (2).
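Concretely, the constrained problem can be solved through a bordered linear system; the sketch below (my illustration, in the d = 1, m = 2 case, following the standard KKT form that can be found e.g. in Wahba) imposes \boldsymbol{\Phi}^\top\boldsymbol{\alpha} = \mathbf{0} directly:

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam = 50, 1e-3
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

G = np.abs(x[:, None] - x[None, :])**3
Phi = np.column_stack([np.ones(n), x])

# Bordered (KKT) system:  (G + lam I) alpha + Phi beta = y,  Phi^T alpha = 0.
top = np.hstack([G + lam * np.eye(n), Phi])
bot = np.hstack([Phi.T, np.zeros((2, 2))])
sol = np.linalg.solve(np.vstack([top, bot]), np.concatenate([y, np.zeros(2)]))
alpha, beta = sol[:n], sol[n:]

print(np.allclose(Phi.T @ alpha, 0))   # the constraint holds
print(alpha @ G @ alpha)               # nonnegative, as granted above
```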

The thin-plate spline depends on an integer m such that 2m > d; the
space \mathcal{F} contains the polynomials of low degree, and its
dimension p(m) depends on m and d. It can be shown that if
E(r) is the following function for r \geq 0,
\begin{equation}
E(r) :=
\begin{cases}
(-1)^{m + 1 + d/2} \, r^{2m-d} \log r & d \text{ even},\\
r^{2m-d} & d \text{ odd,}
\end{cases}
\end{equation}
then
g(\mathbf{x},\,\mathbf{x}') := E(\|\mathbf{x} - \mathbf{x}'\|)
defines a conditionally positive function w.r.t. \mathcal{F}. The
construction relates to a differential operator L. It turns out
that for d=1 and m=2 the thin plate spline is nothing other than the
usual natural cubic spline, which relates to the integrated Wiener
example above, with g(x,\,x') = |x - x'|^3. So
(2) is then nothing other than the usual smoothing spline model.
When d=2 and m=2 the nullspace has dimension p(m)=3
and is generated by the functions 1, x_1 and x_2.
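This special case can be verified directly. The sketch below (my addition, assuming SciPy is available) interpolates data with the radial form \beta_1 + \beta_2 x + \sum_i \alpha_i |x - x_i|^3 under the increment constraints and recovers the natural cubic spline interpolant:

```python
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(5)
n = 12
x = np.sort(rng.uniform(0, 1, n))
y = rng.standard_normal(n)

G = np.abs(x[:, None] - x[None, :])**3
Phi = np.column_stack([np.ones(n), x])

# Interpolation (lam = 0) through the bordered system.
top = np.hstack([G, Phi])
bot = np.hstack([Phi.T, np.zeros((2, 2))])
sol = np.linalg.solve(np.vstack([top, bot]), np.concatenate([y, np.zeros(2)]))
alpha, beta = sol[:n], sol[n:]

xt = np.linspace(x[0], x[-1], 200)      # evaluate inside the data range
f_tps = beta[0] + beta[1] * xt + np.abs(xt[:, None] - x[None, :])**3 @ alpha
f_ncs = CubicSpline(x, y, bc_type='natural')(xt)

print(np.max(np.abs(f_tps - f_ncs)))    # ~1e-12: the same function
```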

Cressie, N. Statistics for Spatial Data. Wiley, 1993.

Mardia, K.V., Kent, J.T., Goodall, C.R. and Little, J.A. Kriging and splines with derivative information. Biometrika (1996), 83(1), pp. 207-221.

Wahba, G. Spline Models for Observational Data. SIAM, 1990.

Wang, Y. Smoothing Splines: Methods and Applications. Chapman and Hall, 2011.

Attribution
Source: Link, Question Author: MthQ, Answer Author: Yves
