# Nystroem Method for Kernel Approximation

I have been reading about the Nyström method for low-rank kernel approximation. This method is implemented in scikit-learn [1] as a way to project data samples onto a low-rank approximation of the kernel feature mapping.

To the best of my knowledge, given a training set $\{x_i\}_{i=1}^n$ and a kernel function, it generates a low-rank approximation of the $n \times n$ kernel matrix $K$ by applying SVD to $W$ and $C$.

$$K = \begin{bmatrix} W & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix}, \qquad C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}, \qquad W \in \mathbb{R}^{l \times l}$$

However, I do not understand how the low-rank approximation of the kernel matrix can be used to project new samples into the approximated kernel feature space. The papers I have found (e.g. [2]) are not of great help, as they are not very didactic.

Also, I am curious about the computational complexity of this method, in both the training and test phases.

Let’s derive the Nyström approximation in a way that should make the answers to your questions clearer.

The key assumption in Nyström is that the kernel function is of rank $m$. (Really we assume that it’s approximately of rank $m$, but for simplicity let’s just pretend it’s exactly rank $m$ for now.) That means that any kernel matrix is going to have rank at most $m$,
and in particular the matrix of all pairwise kernel evaluations on the training set,

$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix},$$

is rank $m$.
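For a concrete instance of this assumption, here is a toy NumPy check; the linear kernel on $m$-dimensional data is my stand-in for an exactly rank-$m$ kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 3

# A linear kernel k(x, y) = x . y on m-dimensional data yields a
# kernel matrix of rank (at most) m.
X = rng.normal(size=(n, m))
K = X @ X.T

print(np.linalg.matrix_rank(K))  # 3
```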
Therefore there are $m$ nonzero eigenvalues, and we can write the eigendecomposition of $K$ as

$$K = U \Lambda U^T,$$

with eigenvectors stored in $U$, of shape $n \times m$, and eigenvalues arranged in $\Lambda$, an $m \times m$ diagonal matrix.

So, let’s pick $m$ points, usually uniformly at random but possibly according to other schemes – all that matters in this simplified version is that $K_{11}$ be of full rank. Once we do, just relabel the points so that we end up with the kernel matrix in blocks:

$$K = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix},$$
where we evaluate each entry in $K_{11}$ (which is $m \times m$) and $K_{21}$ ($(n-m) \times m$), but don’t want to evaluate any entries in $K_{22}$.

Now, we can split up the eigendecomposition according to this block structure too:

$$K = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \Lambda \begin{bmatrix} U_1^T & U_2^T \end{bmatrix} = \begin{bmatrix} U_1 \Lambda U_1^T & U_1 \Lambda U_2^T \\ U_2 \Lambda U_1^T & U_2 \Lambda U_2^T \end{bmatrix},$$
where $U_1$ is $m \times m$ and $U_2$ is $(n-m) \times m$.
But note that now we have $K_{11} = U_1 \Lambda U_1^T$. So we can find $U_1$ and $\Lambda$ by eigendecomposing the known matrix $K_{11}$.

We also know that $K_{21} = U_2 \Lambda U_1^T$. Here, we know everything in this equation except $U_2$, so we can solve for it: right-multiply both sides by $(\Lambda U_1^T)^{-1} = U_1 \Lambda^{-1}$ to get

$$U_2 = K_{21} U_1 \Lambda^{-1}.$$
Now we have everything we need to evaluate $K_{22}$:

$$\begin{align}
K_{22} = U_2 \Lambda U_2^T &= \left( K_{21} U_1 \Lambda^{-1} \right) \Lambda \left( K_{21} U_1 \Lambda^{-1} \right)^T = K_{21} U_1 \Lambda^{-1} U_1^T K_{21}^T = K_{21} K_{11}^{-1} K_{21}^T \tag{*} \\
&= \left( K_{21} K_{11}^{-\frac12} \right) \left( K_{21} K_{11}^{-\frac12} \right)^T. \tag{**}
\end{align}$$
In (*), we’ve found a version of the Nyström embedding you might have seen simply as the definition. This tells us the effective kernel values that we’re imputing for the block $K_{22}$.
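Here is a small NumPy sketch of the imputation in (*), using a linear kernel on $m$-dimensional data so that the rank-$m$ assumption holds exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5

X = rng.normal(size=(n, m))
K = X @ X.T                    # linear kernel: rank(K) = m exactly

K11 = K[:m, :m]                # m x m block for the chosen points
K21 = K[m:, :m]                # (n - m) x m cross block
K22 = K[m:, m:]                # block we pretend is too expensive to evaluate

# (*): impute K22 as K21 K11^{-1} K21^T (solve instead of forming the inverse)
K22_hat = K21 @ np.linalg.solve(K11, K21.T)
print(np.allclose(K22, K22_hat))  # True: the reconstruction is exact at rank m
```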

In (**), we see that the feature matrix $K_{21} K_{11}^{-\frac12}$, which is shape $(n-m) \times m$, corresponds to these imputed kernel values. If we use $K_{11}^{\frac12}$ for the $m$ points, we have a set of $m$-dimensional features

$$\Phi = \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix}.$$
We can just quickly verify that $\Phi$ corresponds to the correct kernel matrix:

$$\Phi \Phi^T = \begin{bmatrix} K_{11}^{\frac12} \\ K_{21} K_{11}^{-\frac12} \end{bmatrix} \begin{bmatrix} K_{11}^{\frac12} & K_{11}^{-\frac12} K_{21}^T \end{bmatrix} = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{21} K_{11}^{-1} K_{21}^T \end{bmatrix} = K.$$
So, all we need to do is train our regular learning model with the $m$-dimensional features $\Phi$. This will be exactly the same (under the assumptions we’ve made) as the kernelized version of the learning problem with $K$.
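That verification can also be checked numerically; the following is a minimal NumPy sketch (again with a linear kernel on $m$-dimensional data, so the rank-$m$ assumption holds exactly and $\Phi \Phi^T = K$ up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 40, 4

X = rng.normal(size=(n, m))
K = X @ X.T                          # linear kernel: exactly rank m

K11, K21 = K[:m, :m], K[m:, :m]

# K11^{-1/2} from the eigendecomposition of K11
lam, U1 = np.linalg.eigh(K11)
K11_inv_sqrt = (U1 / np.sqrt(lam)) @ U1.T

Phi = np.vstack([
    K11 @ K11_inv_sqrt,              # = K11^{1/2}: features for the m chosen points
    K21 @ K11_inv_sqrt,              # features for the remaining n - m points
])
print(np.allclose(Phi @ Phi.T, K))   # True under the rank-m assumption
```

Any standard learning model (linear SVM, ridge regression, ...) can then be trained on `Phi` directly.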

Now, for an individual data point $x$, the features in $\Phi$ correspond to

$$\phi(x) = \begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix} K_{11}^{-\frac12}.$$
For a point $x$ in partition 2, the vector $\begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix}$ is just the relevant row of $K_{21}$, so stacking these up gives us $K_{21} K_{11}^{-\frac12}$ – so $\phi(x)$ agrees for points in partition 2. It also works in partition 1: there, the vector is a row of $K_{11}$, so stacking them up gives $K_{11} K_{11}^{-\frac12} = K_{11}^{\frac12}$, again agreeing with $\Phi$. So… it’s still true for an unseen-at-training-time test point $x_\text{new}$: you just do the same thing,

$$\phi(x_\text{new}) = \begin{bmatrix} k(x_\text{new}, x_1) & \dots & k(x_\text{new}, x_m) \end{bmatrix} K_{11}^{-\frac12}.$$
Because we assumed the kernel is rank $m$, the matrix $\begin{bmatrix}K_{\text{train}} & K_{\text{train,test}} \\ K_{\text{test,train}} & K_{\text{test}} \end{bmatrix}$ is also of rank $m$, and the reconstruction of $K_\text{test}$ is still exact by exactly the same logic as for $K_{22}$.
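A toy NumPy sketch of that test-time mapping (the linear kernel and the helper `phi` are illustrative assumptions, not scikit-learn's API; exactness again relies on the rank-$m$ setup):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 30, 3

X = rng.normal(size=(n, m))
kernel = lambda A, B: A @ B.T        # linear kernel: exactly rank m

X1 = X[:m]                           # the m points chosen for K11
lam, U1 = np.linalg.eigh(kernel(X1, X1))
K11_inv_sqrt = (U1 / np.sqrt(lam)) @ U1.T

def phi(x):
    # phi(x) = [k(x, x_1), ..., k(x, x_m)] K11^{-1/2}
    return kernel(x[None, :], X1) @ K11_inv_sqrt

# An unseen test point: its embedding reproduces kernel values exactly here
x_new = rng.normal(size=m)
lhs = phi(x_new) @ phi(X[7]).T       # inner product of Nystroem features
rhs = kernel(x_new[None, :], X[7][None, :])
print(np.allclose(lhs, rhs))         # True
```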

Above, we assumed that the kernel matrix $K$ was exactly rank $m$. This is not usually going to be the case; for a Gaussian kernel, for example, $K$ is always of rank $n$ (for distinct inputs), but the later eigenvalues typically drop off pretty quickly, so it’s going to be close to a matrix of rank $m$, and our reconstructions of $K_{22}$ or $K_{\text{test}}$ are going to be close to the true values but not exactly the same. They’ll be better reconstructions the closer the eigenspace of $K_{11}$ gets to that of $K$ overall, which is why choosing the right $m$ points is important in practice.

Note also that if $K_{11}$ has any zero eigenvalues, you can replace inverses with pseudoinverses and everything still works; you just replace $K_{21}$ in the reconstruction with $K_{21} K_{11}^\dagger K_{11}$.

You can use the SVD instead of the eigendecomposition if you’d like; since $K$ is psd, they’re the same thing, but the SVD can be a little more robust to slight numerical error in the kernel matrix, so that’s what scikit-learn does – though it uses $\max(\lambda_i, 10^{-12})$ in the inverse instead of the pseudoinverse.
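For completeness, a quick sketch with scikit-learn's actual `Nystroem` transformer (the data and parameter values here are arbitrary illustrations), comparing $\Phi \Phi^T$ to the exact RBF kernel matrix. As for the cost question above: fitting is dominated by the SVD of the $m \times m$ sub-kernel, roughly $O(m^3)$ plus $O(m^2)$ kernel evaluations, and transforming $n$ points costs $O(nm)$ kernel evaluations plus an $O(nm^2)$ matrix product.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))

gamma = 0.1
ny = Nystroem(kernel="rbf", gamma=gamma, n_components=50, random_state=0)
Phi = ny.fit_transform(X)            # shape (200, 50)

# Compare the implicit kernel Phi Phi^T against the exact RBF kernel matrix
K_exact = rbf_kernel(X, gamma=gamma)
rel_err = np.linalg.norm(Phi @ Phi.T - K_exact) / np.linalg.norm(K_exact)
print(Phi.shape, rel_err)            # small but nonzero: the RBF kernel matrix is full rank
```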