What are the ways to choose what kernel would result in good data separation in the final data output by kernel PCA (principal component analysis), and what are the ways to optimize parameters of the kernel?
Layman’s terms if possible would be greatly appreciated, and links to papers that explain such methods would also be nice.
The general approach to select an optimal kernel (either the type of kernel, or kernel parameters) in any kernel-based method is cross-validation. See here for the discussion of kernel selection for support vector machines: How to select kernel for SVM?
The idea behind cross-validation is that we leave out some “test” data, run our algorithm to fit the model on the remaining “training” data, and then check how well the resulting model describes the test data (and how big the error is). This is repeated for different left-out data, errors are averaged to form an average cross-validated error, and then different algorithms can be compared in order to choose one yielding the lowest error. In SVM one can use e.g. classification accuracy (or related measures) as the measure of model performance. Then one would select a kernel that yields the best classification of the test data.
The question then becomes: what measure of model performance can one use in kPCA? If you want to achieve “good data separation” (presumably good class separation), then you can somehow measure it on the training data and use that to find the best kernel. Note, however, that PCA/kPCA are not designed to yield good data separation (they do not take class labels into account at all). So generally speaking, one would want another, class-unrelated, measure of model performance.
In standard PCA one can use reconstruction error as the performance measure on the test set. In kernel PCA one can also compute reconstruction error, but the problem is that it is not comparable between different kernels: reconstruction error is the distance measured in the target feature space; and different kernels correspond to different target spaces… So we have a problem.
One way to tackle this problem is to somehow compute the reconstruction error in the original space, not in the target space. Obviously the left-out test data point lives in the original space. But its kPCA reconstruction lives in the [low-dimensional subspace of] the target space. What one can do, though, is to find a point (“pre-image”) in the original space that would be mapped as close as possible to this reconstruction point, and then measure the distance between the test point and this pre-image as reconstruction error.
I will not give all the formulas here, but instead refer you to some papers and only insert here several figures.
The idea of “pre-image” in kPCA was apparently introduced in this paper:
- Mika, S., Schölkopf, B., Smola, A. J., Müller, K. R., Scholz, M., & Rätsch, G. (1998). Kernel PCA and De-Noising in Feature Spaces. In NIPS (Vol. 11, pp. 536-542).
Mika et al. are not doing cross-validation, but they need pre-images for de-noising purposes, see this figure:
Denoised (thick) points are pre-images of kPCA projections (there is no test and training here). It is not a trivial task to find these pre-images: one needs to use gradient descent, and the loss function will depend on the kernel.
And here is a very recent paper that used pre-images for cross-validation purposes and kernel/hyperparameter selection:
- Alam, M. A., & Fukumizu, K. (2014). Hyperparameter selection in kernel principal component analysis. Journal of Computer Science, 10(7), 1139-1150.
This is their algorithm:
And here are some results (that I think are pretty much self-explanatory):