Nystroem Method for Kernel Approximation

I have been reading about the Nyström method for low-rank kernel approximation. This method is implemented in scikit-learn [1] as a way to project data samples onto a low-rank approximation of the kernel feature mapping.

To the best of my knowledge, given a training set $\{x_i\}_{i=1}^n$ and a kernel function, it generates a low-rank approximation of the $n \times n$ kernel matrix $K$ by applying SVD to $W$ and $C$:

$$K = \begin{bmatrix} W & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix}, \qquad C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}, \qquad W \in \mathbb{R}^{l \times l}.$$

However, I do not understand how the low-rank approximation of the kernel matrix can be used to project new samples into the approximated kernel feature space. The papers I have found (e.g. [2]) are not of great help, since they are not very didactic.

Also, I am curious about the computational complexity of this method, both in training and test phases.

[1] http://scikit-learn.org/stable/modules/kernel_approximation.html#nystroem-kernel-approx

[2] http://www.jmlr.org/papers/volume13/kumar12a/kumar12a.pdf

Answer

Let’s derive the Nyström approximation in a way that should make the answers to your questions clearer.

The key assumption in Nyström is that the kernel function is of rank m. (Really we assume that it’s approximately of rank m, but for simplicity let’s just pretend it’s exactly rank m for now.) That means that any kernel matrix is going to have rank at most m,
and in particular
$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix},$$
is rank m.
Therefore there are $m$ nonzero eigenvalues, and we can write the eigendecomposition of $K$ as
$$K = U \Lambda U^T$$
with eigenvectors stored in $U$, of shape $n \times m$, and eigenvalues arranged in $\Lambda$, an $m \times m$ diagonal matrix.
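
To make the rank-$m$ assumption and the thin eigendecomposition concrete, here is a minimal numpy sketch (my own toy example, not from the original answer; variable names are just illustrative). A linear kernel on $m$-dimensional data gives a kernel matrix that really is rank $m$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5

# A linear kernel on m-dimensional data gives an n x n kernel matrix of rank (at most) m.
X = rng.standard_normal((n, m))
K = X @ X.T

eigvals, eigvecs = np.linalg.eigh(K)                    # eigenvalues in ascending order
print("numerical rank:", int(np.sum(eigvals > 1e-8)))   # -> 5

# Thin eigendecomposition K = U Lambda U^T, with U of shape n x m.
U = eigvecs[:, -m:]
Lam = np.diag(eigvals[-m:])
print(np.allclose(U @ Lam @ U.T, K))                    # -> True
```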

So, let’s pick $m$ elements, usually uniformly at random but possibly according to other schemes – all that matters in this simplified version is that $K_{11}$ be of full rank. Once we do, just relabel the points so that we end up with the kernel matrix in blocks:
$$K = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix},$$
where we evaluate each entry in $K_{11}$ (which is $m \times m$) and $K_{21}$ (which is $(n - m) \times m$), but don’t want to evaluate any entries in $K_{22}$.
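
Continuing the toy sketch above (this assumes the variables from the previous snippet), picking the landmark points and evaluating only the $K_{11}$ and $K_{21}$ blocks looks like:

```python
# Continuing the sketch above: choose m landmark points at random and relabel
# so that they come first, then evaluate only the blocks that involve them.
perm = rng.permutation(n)
X = X[perm]
K = K[np.ix_(perm, perm)]

K11 = K[:m, :m]      # m x m block, evaluated exactly
K21 = K[m:, :m]      # (n - m) x m block, evaluated exactly
K22 = K[m:, m:]      # kept here only so we can check the reconstruction later
```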

Now, we can split up the eigendecomposition according to this block structure too:
$$K = U \Lambda U^T = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \Lambda \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}^T = \begin{bmatrix} U_1 \Lambda U_1^T & U_1 \Lambda U_2^T \\ U_2 \Lambda U_1^T & U_2 \Lambda U_2^T \end{bmatrix},$$
where $U_1$ is $m \times m$ and $U_2$ is $(n - m) \times m$. But note that now we have $K_{11} = U_1 \Lambda U_1^T$. So we can find $U_1$ and $\Lambda$ by eigendecomposing the known matrix $K_{11}$.
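
Continuing that sketch, recovering $U_1$ and $\Lambda$ from the known $m \times m$ block is a single eigendecomposition:

```python
# Continuing: eigendecompose the known m x m block, K11 = U1 Lam1 U1^T.
lam1, U1 = np.linalg.eigh(K11)
Lam1 = np.diag(lam1)
print(np.allclose(U1 @ Lam1 @ U1.T, K11))    # -> True
```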

We also know that $K_{21} = U_2 \Lambda U_1^T$. Here, we know everything in this equation except $U_2$, so we can solve for it: right-multiply both sides by $(\Lambda U_1^T)^{-1} = U_1 \Lambda^{-1}$ to get
$$U_2 = K_{21} U_1 \Lambda^{-1}.$$
Now we have everything we need to evaluate $K_{22}$:
$$\begin{align}
K_{22} &= U_2 \Lambda U_2^T \\
&= (K_{21} U_1 \Lambda^{-1}) \, \Lambda \, (K_{21} U_1 \Lambda^{-1})^T \\
&= K_{21} U_1 (\Lambda^{-1} \Lambda) \Lambda^{-1} U_1^T K_{21}^T \\
&= K_{21} U_1 \Lambda^{-1} U_1^T K_{21}^T \\
&= K_{21} K_{11}^{-1} K_{21}^T \tag{$*$} \\
&= \left( K_{21} K_{11}^{-1/2} \right) \left( K_{21} K_{11}^{-1/2} \right)^T. \tag{$**$}
\end{align}$$
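
In the same toy sketch, solving for $U_2$ and imputing $K_{22}$ reproduces the held-out block exactly (because here the kernel really is rank $m$):

```python
# Continuing: solve for U2 and compare the imputed K22 block with the real one.
U2 = K21 @ U1 @ np.diag(1.0 / lam1)              # U2 = K21 U1 Lam^{-1}
K22_nystrom = U2 @ Lam1 @ U2.T                   # = K21 K11^{-1} K21^T
print(np.allclose(K22_nystrom, K22))             # -> True

# The same thing written as in (*), without forming U2 explicitly:
print(np.allclose(K21 @ np.linalg.solve(K11, K21.T), K22))   # -> True
```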

In ($*$), we’ve found a version of the Nyström embedding you might have seen simply as the definition. This tells us the effective kernel values that we’re imputing for the block $K_{22}$.

In ($**$), we see that the feature matrix $K_{21} K_{11}^{-1/2}$, which is of shape $(n - m) \times m$, corresponds to these imputed kernel values. If we use $K_{11}^{1/2}$ for the $m$ points, we have a set of $m$-dimensional features
$$\Phi = \begin{bmatrix} K_{11}^{1/2} \\ K_{21} K_{11}^{-1/2} \end{bmatrix}.$$
We can just quickly verify that $\Phi$ corresponds to the correct kernel matrix:
$$\begin{align}
\Phi \Phi^T &= \begin{bmatrix} K_{11}^{1/2} \\ K_{21} K_{11}^{-1/2} \end{bmatrix} \begin{bmatrix} K_{11}^{1/2} \\ K_{21} K_{11}^{-1/2} \end{bmatrix}^T \\
&= \begin{bmatrix} K_{11}^{1/2} K_{11}^{1/2} & K_{11}^{1/2} K_{11}^{-1/2} K_{21}^T \\ K_{21} K_{11}^{-1/2} K_{11}^{1/2} & K_{21} K_{11}^{-1/2} K_{11}^{-1/2} K_{21}^T \end{bmatrix} \\
&= \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{21} K_{11}^{-1} K_{21}^T \end{bmatrix} = K.
\end{align}$$
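
Here is what $\Phi$ looks like in the running numpy sketch, with $K_{11}^{\pm 1/2}$ built from the eigendecomposition of $K_{11}$:

```python
# Continuing: build the m-dimensional Nystrom features for all n training points.
K11_sqrt     = U1 @ np.diag(np.sqrt(lam1)) @ U1.T        # K11^(1/2)
K11_inv_sqrt = U1 @ np.diag(1.0 / np.sqrt(lam1)) @ U1.T  # K11^(-1/2)

Phi = np.vstack([K11_sqrt,                # features of the m landmark points
                 K21 @ K11_inv_sqrt])     # features of the remaining n - m points
print(Phi.shape)                          # -> (200, 5)
print(np.allclose(Phi @ Phi.T, K))        # -> True
```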

So, all we need to do is train our regular learning model with the m-dimensional features Φ. This will be exactly the same (under the assumptions we’ve made) as the kernelized version of the learning problem with K.
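
For example (still the same toy setup; this comparison is my own illustration, not part of the original answer), ridge regression on the $m$-dimensional features $\Phi$ gives exactly the same predictions as kernel ridge regression on the full $K$, since here $K = \Phi \Phi^T$ holds exactly:

```python
# Continuing: a linear model on the Nystrom features matches the kernelized model.
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge

y = rng.standard_normal(n)    # toy regression targets

linear_on_phi = Ridge(alpha=1.0, fit_intercept=False).fit(Phi, y)
kernelized    = KernelRidge(alpha=1.0, kernel="precomputed").fit(K, y)

print(np.allclose(linear_on_phi.predict(Phi), kernelized.predict(K)))   # -> True
```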

Now, for an individual data point $x$, the features in $\Phi$ correspond to
$$\phi(x) = \begin{bmatrix} k(x, x_1) & \cdots & k(x, x_m) \end{bmatrix} K_{11}^{-1/2}.$$
For a point $x$ in partition 2, the vector $\begin{bmatrix} k(x, x_1) & \cdots & k(x, x_m) \end{bmatrix}$ is just the relevant row of $K_{21}$, so that stacking these up gives us $K_{21} K_{11}^{-1/2}$ – so $\phi(x)$ agrees for points in partition 2. It also works in partition 1: there, the vector is a row of $K_{11}$, so stacking them up gives $K_{11} K_{11}^{-1/2} = K_{11}^{1/2}$, again agreeing with $\Phi$. So… it’s still true for an unseen-at-training-time test point $x_{\text{new}}$. You just do the same thing:
$$\Phi_{\text{test}} = K_{\text{test},1} \, K_{11}^{-1/2}.$$
Because we assumed the kernel is rank $m$, the matrix
$$\begin{bmatrix} K_{\text{train}} & K_{\text{train,test}} \\ K_{\text{test,train}} & K_{\text{test}} \end{bmatrix}$$
is also of rank $m$, and the reconstruction of $K_{\text{test}}$ is still exact by exactly the same logic as for $K_{22}$.
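
And in the running sketch, embedding genuinely new points with $\Phi_{\text{test}} = K_{\text{test},1} K_{11}^{-1/2}$ reproduces their kernel values as well:

```python
# Continuing: embed brand-new test points and check that their kernel values
# against the training set (and each other) are reproduced.
X_test = rng.standard_normal((50, m))
K_test_1 = X_test @ X[:m].T               # kernel between test points and the m landmarks
Phi_test = K_test_1 @ K11_inv_sqrt        # Phi_test = K_test,1 K11^(-1/2)

print(np.allclose(Phi_test @ Phi.T, X_test @ X.T))              # test-vs-train kernel
print(np.allclose(Phi_test @ Phi_test.T, X_test @ X_test.T))    # test-vs-test kernel
```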


Above, we assumed that the kernel matrix $K$ was exactly rank $m$. This is not usually going to be the case; for a Gaussian kernel, for example, $K$ is always of rank $n$, but its trailing eigenvalues typically drop off pretty quickly, so it’s going to be close to a matrix of rank $m$, and our reconstructions of $K_{22}$ or $K_{\text{test}}$ are going to be close to the true values but not exactly the same. They’ll be better reconstructions the closer the eigenspace of $K_{11}$ gets to that of $K$ overall, which is why choosing the right $m$ points is important in practice.
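
As a quick numerical illustration of that (again my own add-on, with an arbitrary bandwidth), here is the relative error of the imputed $K_{22}$ block for an RBF kernel as $m$ grows:

```python
# An RBF kernel matrix is full rank, so the Nystrom reconstruction of K22 is only
# approximate -- but the error shrinks quickly as the number of landmarks m grows.
from sklearn.metrics.pairwise import rbf_kernel

Xg = rng.standard_normal((500, 10))
Kg = rbf_kernel(Xg, gamma=0.1)

for m_land in (10, 50, 200):
    K11g, K21g, K22g = Kg[:m_land, :m_land], Kg[m_land:, :m_land], Kg[m_land:, m_land:]
    K22_hat = K21g @ np.linalg.pinv(K11g) @ K21g.T
    print(m_land, np.linalg.norm(K22g - K22_hat) / np.linalg.norm(K22g))
```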

Note also that if $K_{11}$ has any zero eigenvalues, you can replace the inverses with pseudoinverses and everything still works; you just replace $K_{21}$ in the reconstruction with $K_{21} K_{11}^{\dagger} K_{11}$.
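
A minimal sketch of that, assuming numpy as in the snippets above (the helper name is mine): build a pseudoinverse square root of $K_{11}$ by dropping near-zero eigenvalues, and use it wherever $K_{11}^{-1/2}$ appeared before:

```python
def pinv_sqrt(K11, rtol=1e-10):
    """Pseudoinverse square root of a PSD matrix: drop (near-)zero eigenvalues."""
    lam, U = np.linalg.eigh(K11)
    keep = lam > rtol * lam.max()
    return U[:, keep] @ np.diag(1.0 / np.sqrt(lam[keep])) @ U[:, keep].T

# e.g. Phi_test = K_test_1 @ pinv_sqrt(K11) plays the role of K_test,1 K11^(-1/2).
```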

You can use the SVD instead of the eigendecomposition if you’d like; since $K$ is psd, they’re the same thing, but the SVD can be a little more robust to slight numerical error in the kernel matrix, so that’s what scikit-learn does – though its implementation uses a max on the singular values in the inverse rather than a true pseudoinverse.
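
In practice you can let scikit-learn's Nystroem transformer do all of this. A rough usage sketch (parameter values are arbitrary), which also speaks to the complexity question: fitting costs on the order of $O(m^2 d + m^3)$ for evaluating and SVD-ing the landmark kernel block, and transforming $n$ points costs roughly $O(nmd + nm^2)$.

```python
# Using scikit-learn's Nystroem transformer end to end.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

X_tr, y_tr = make_classification(n_samples=1000, n_features=20, random_state=0)

# fit: evaluate the m x m landmark kernel and SVD it (~O(m^2 d + m^3));
# transform: kernel against the m landmarks plus a matrix product (~O(n m d + n m^2)).
feature_map = Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0)
Z_tr = feature_map.fit_transform(X_tr)

clf = LinearSVC().fit(Z_tr, y_tr)          # any linear model on the m-dimensional features
print(clf.score(feature_map.transform(X_tr), y_tr))
```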

Attribution
Source: Link, Question Author: Daniel López, Answer Author: Danica
