I have been reading about the Nyström method for low-rank kernel approximation. This method is implemented in scikit-learn [1] as a way to project data samples onto a low-rank approximation of the kernel feature mapping.

To the best of my knowledge, given a training set $\{x_i\}_{i=1}^n$ and a kernel function, it generates a low-rank approximation of the $n \times n$ kernel matrix $K$ by applying SVD to $W$ and $C$, where

$$K = \begin{bmatrix} W & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix}, \qquad C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}, \qquad W \in \mathbb{R}^{l \times l}.$$

However,

I do not understand how the low-rank approximation of the kernel matrix can be used to project new samples into the approximated kernel feature space. The papers I have found (e.g. [2]) are not of great help, as they are not very didactic. Also, I am curious about the computational complexity of this method, in both the training and test phases.

[1] http://scikit-learn.org/stable/modules/kernel_approximation.html#nystroem-kernel-approx

[2] http://www.jmlr.org/papers/volume13/kumar12a/kumar12a.pdf

**Answer**

Let’s derive the Nyström approximation in a way that should make the answers to your questions clearer.

The key assumption in Nyström is that the kernel function is of rank m. (Really we assume that it’s *approximately* of rank m, but for simplicity let’s just pretend it’s *exactly* rank m for now.) That means that *any* kernel matrix is going to have rank at most m,

and in particular

$$K = \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \dots & k(x_n, x_n) \end{bmatrix},$$

is rank m.

Therefore there are m nonzero eigenvalues, and we can write the eigendecomposition of $K$ as

$$K = U \Lambda U^T,$$

with eigenvectors stored in $U$, of shape $n \times m$, and eigenvalues arranged in $\Lambda$, an $m \times m$ diagonal matrix.

So, let’s pick $m$ of the data points, usually uniformly at random but possibly according to other schemes – all that matters in this simplified version is that $K_{11}$ be of full rank. Once we do, just relabel the points so that we end up with the kernel matrix in blocks:

$$K = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix},$$

where we evaluate each entry in $K_{11}$ (which is $m \times m$) and $K_{21}$ (which is $(n-m) \times m$), but don’t want to evaluate any entries in $K_{22}$.

Now, we can split up the eigendecomposition according to this block structure too:

$$K = U \Lambda U^T = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \Lambda \begin{bmatrix} U_1 \\ U_2 \end{bmatrix}^T = \begin{bmatrix} U_1 \Lambda U_1^T & U_1 \Lambda U_2^T \\ U_2 \Lambda U_1^T & U_2 \Lambda U_2^T \end{bmatrix},$$

where $U_1$ is $m \times m$ and $U_2$ is $(n-m) \times m$.

But note that now we have $K_{11} = U_1 \Lambda U_1^T$. So we can find $U_1$ and $\Lambda$ by eigendecomposing the known matrix $K_{11}$.

We also know that $K_{21} = U_2 \Lambda U_1^T$. Here, we know everything in this equation except $U_2$, so we can solve for the eigenvectors it implies: right-multiply both sides by $(\Lambda U_1^T)^{-1} = U_1 \Lambda^{-1}$ to get

$$U_2 = K_{21} U_1 \Lambda^{-1}.$$
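As a quick numerical sanity check of this step, here is a small sketch of my own (not from the original post): a linear kernel on $m$-dimensional data gives an exactly rank-$m$ kernel matrix, and solving for $U_2$ this way reproduces the known block $K_{21}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 50
X = rng.standard_normal((n, m))       # n points in m dimensions
K = X @ X.T                           # linear kernel, so K is exactly rank m
K11, K21 = K[:m, :m], K[m:, :m]

lam, U1 = np.linalg.eigh(K11)         # eigendecompose the known m x m block
U2 = K21 @ U1 @ np.diag(1.0 / lam)    # U2 = K21 U1 Lambda^{-1}

# The recovered eigenvector rows reproduce the known block K21
print(np.allclose(U2 @ np.diag(lam) @ U1.T, K21))  # True
```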

Now we have everything we need to evaluate $K_{22}$:

$$K_{22} = U_2 \Lambda U_2^T = (K_{21} U_1 \Lambda^{-1}) \Lambda (K_{21} U_1 \Lambda^{-1})^T = K_{21} U_1 (\Lambda^{-1} \Lambda) \Lambda^{-1} U_1^T K_{21}^T = K_{21} U_1 \Lambda^{-1} U_1^T K_{21}^T \overset{(*)}{=} K_{21} K_{11}^{-1} K_{21}^T \overset{(**)}{=} \left( K_{21} K_{11}^{-1/2} \right) \left( K_{21} K_{11}^{-1/2} \right)^T.$$

In (*), we’ve found a version of the Nyström embedding you might have seen simply as the definition. This tells us the effective kernel values that we’re imputing for the block $K_{22}$.

In (**), we see that the *feature matrix* $K_{21} K_{11}^{-1/2}$, which is of shape $(n-m) \times m$, corresponds to these imputed kernel values. If we use $K_{11}^{1/2}$ for the $m$ points, we have a set of m-dimensional features

$$\Phi = \begin{bmatrix} K_{11}^{1/2} \\ K_{21} K_{11}^{-1/2} \end{bmatrix}.$$

We can quickly verify that $\Phi$ corresponds to the correct kernel matrix:

$$\Phi \Phi^T = \begin{bmatrix} K_{11}^{1/2} \\ K_{21} K_{11}^{-1/2} \end{bmatrix} \begin{bmatrix} K_{11}^{1/2} \\ K_{21} K_{11}^{-1/2} \end{bmatrix}^T = \begin{bmatrix} K_{11}^{1/2} K_{11}^{1/2} & K_{11}^{1/2} K_{11}^{-1/2} K_{21}^T \\ K_{21} K_{11}^{-1/2} K_{11}^{1/2} & K_{21} K_{11}^{-1/2} K_{11}^{-1/2} K_{21}^T \end{bmatrix} = \begin{bmatrix} K_{11} & K_{21}^T \\ K_{21} & K_{21} K_{11}^{-1} K_{21}^T \end{bmatrix} = K.$$

So, all we need to do is train our regular learning model with the m-dimensional features $\Phi$. This will be *exactly* the same (under the assumptions we’ve made) as the kernelized version of the learning problem with $K$.
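Under the exact-rank assumption, the construction of $\Phi$ is easy to check numerically. A minimal sketch, again using a toy linear-kernel example of my own so that $K$ really is rank $m$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 30
X = rng.standard_normal((n, m))       # linear kernel on m-dim data: K is exactly rank m
K = X @ X.T
K11, K21 = K[:m, :m], K[m:, :m]

# Matrix square roots of K11 from its eigendecomposition
lam, U1 = np.linalg.eigh(K11)
K11_half = U1 @ np.diag(lam ** 0.5) @ U1.T
K11_neg_half = U1 @ np.diag(lam ** -0.5) @ U1.T

# Stack the features: K11^{1/2} for the m points, K21 K11^{-1/2} for the rest
Phi = np.vstack([K11_half, K21 @ K11_neg_half])
err = np.abs(Phi @ Phi.T - K).max()   # reconstruction error of the full kernel matrix
print(err)
```

The printed error is at numerical-precision level, confirming that $\Phi \Phi^T$ reproduces $K$.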

Now, for an individual data point $x$, the features in $\Phi$ correspond to

$$\phi(x) = \begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix} K_{11}^{-1/2}.$$

For a point $x$ in partition 2, the vector $\begin{bmatrix} k(x, x_1) & \dots & k(x, x_m) \end{bmatrix}$ is just the relevant row of $K_{21}$, so that stacking these up gives us $K_{21} K_{11}^{-1/2}$ – so $\phi(x)$ agrees for points in partition 2. It also works in partition 1: there, the vector is a row of $K_{11}$, so stacking them up gets $K_{11} K_{11}^{-1/2} = K_{11}^{1/2}$, again agreeing with $\Phi$. So… it’s still true for an unseen-at-training-time test point $x_{\text{new}}$. You just do the same thing:

$$\Phi_{\text{test}} = K_{\text{test},1} K_{11}^{-1/2}.$$

Because we assumed the kernel is rank m, the matrix

$$\begin{bmatrix} K_{\text{train}} & K_{\text{train,test}} \\ K_{\text{test,train}} & K_{\text{test}} \end{bmatrix}$$

is also of rank m, and the reconstruction of $K_{\text{test}}$ is still exact, by exactly the same logic as for $K_{22}$.
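Continuing the toy linear-kernel setup from before (names here are illustrative, not from the original post), the test-time projection can be sketched like this: embed unseen points with the same map and check that their feature inner products recover the true test kernel.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
X_train = rng.standard_normal((20, m))   # linear kernel on m-dim data: exactly rank m
X_test = rng.standard_normal((7, m))
landmarks = X_train[:m]                  # the m points chosen for K11

lam, U1 = np.linalg.eigh(landmarks @ landmarks.T)
K11_neg_half = U1 @ np.diag(lam ** -0.5) @ U1.T

K_test_1 = X_test @ landmarks.T          # kernel between test points and the m points
Phi_test = K_test_1 @ K11_neg_half       # Phi_test = K_{test,1} K11^{-1/2}

# Inner products of the embedded test points recover the true test kernel
err = np.abs(Phi_test @ Phi_test.T - X_test @ X_test.T).max()
print(err)
```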

Above, we assumed that the kernel matrix $K$ was *exactly* rank m. This is not usually going to be the case; for a Gaussian kernel, for example, $K$ is *always* rank n, but its trailing eigenvalues typically drop off quite quickly, so it is going to be *close to* a matrix of rank m, and our reconstructions of $K_{21}$ or $K_{\text{test},1}$ are going to be *close to* the true values but not exactly the same. They will be better reconstructions the closer the eigenspace of $K_{11}$ gets to that of $K$ overall, which is why choosing the right m points is important in practice.

Note also that if $K_{11}$ has any zero eigenvalues, you can replace inverses with pseudoinverses and everything still works; you just replace $K_{21}$ in the reconstruction with $K_{21} K_{11}^\dagger K_{11}$.
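A quick numerical illustration of the pseudoinverse variant, using a deliberately rank-deficient $K_{11}$ (again a toy setup of my own): dropping the zero eigenvalues of $K_{11}$ and using the pseudoinverse square root still yields features whose inner products match the true kernel.

```python
import numpy as np

rng = np.random.default_rng(4)
r, m = 3, 6
X1 = rng.standard_normal((m, r))         # m landmarks in r < m dims: K11 is singular, rank r
X2 = rng.standard_normal((10, r))
K11 = X1 @ X1.T
K21 = X2 @ X1.T

# Pseudoinverse square root: keep only the (numerically) nonzero eigenvalues
lam, U1 = np.linalg.eigh(K11)
keep = lam > 1e-10 * lam.max()
K11_pinv_half = U1[:, keep] @ np.diag(lam[keep] ** -0.5) @ U1[:, keep].T

Phi2 = K21 @ K11_pinv_half               # features are now effectively r-dimensional
err = np.abs(Phi2 @ Phi2.T - X2 @ X2.T).max()
print(err)
```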

You can use the SVD instead of the eigendecomposition if you’d like; since $K$ is psd, they’re the same thing, but the SVD can be a little more robust to slight numerical error in the kernel matrix and such, so that’s what scikit-learn does – though its implementation clips tiny singular values with a `max` rather than using a strict pseudoinverse.
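On the complexity question: roughly, training costs $O(nm)$ kernel evaluations plus $O(m^3)$ for the SVD of the $m \times m$ block and $O(nm^2)$ to map all points, and each test point needs $m$ kernel evaluations plus a multiplication by an $m \times m$ matrix. A short sketch with scikit-learn’s `Nystroem` transformer (the `gamma` and `n_components` values here are arbitrary choices of mine):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))

# Training: pick m landmarks, SVD the m x m block, map all n points
ny = Nystroem(kernel="rbf", gamma=0.1, n_components=50, random_state=0)
Phi = ny.fit_transform(X)

# Test time: m kernel evaluations per point, then an m x m matrix product
Phi_new = ny.transform(rng.standard_normal((10, 5)))

# Feature inner products approximate the true kernel matrix
K_true = rbf_kernel(X, gamma=0.1)
rel_err = np.linalg.norm(Phi @ Phi.T - K_true) / np.linalg.norm(K_true)
print(rel_err)   # shrinks as n_components grows
```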

**Attribution**
*Source : Link , Question Author : Daniel López , Answer Author : Danica*