I am trying to understand the intuition behind kernel SVMs. I understand how linear SVMs work: a decision line is chosen that splits the data as well as it can. I also understand the principle behind mapping data to a higher-dimensional space, and how this can make it easier to find a linear decision boundary in that new space. What I do not understand is how a kernel is used to project data points to this new space.

What I know about a kernel is that it effectively represents the “similarity” between two data points. But how does this relate to the projection?

**Answer**

Let $h(x)$ be the map from the input space into a high-dimensional space $\mathcal{F}$. The kernel function is $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$, the inner product of the two mapped points in $\mathcal{F}$. So the kernel is not used to project the data points; rather, it is an outcome of the projection. It can be considered a measure of similarity, but in an SVM it’s more than that.
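To make the identity $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$ concrete, here is a minimal sketch for one particular case: the degree-2 polynomial kernel $K(x,z)=(x\cdot z)^2$ on 2-D inputs, whose feature map can be written as $h(x)=(x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)$. The choice of kernel, the helper names `h`, `dot` and `K`, and the sample points are assumptions for illustration only.

```python
import math

def h(x):
    # Explicit feature map for the degree-2 polynomial kernel on 2-D inputs
    # (an illustrative choice; other kernels have other feature maps).
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    # Plain inner product of two equal-length tuples.
    return sum(a * b for a, b in zip(u, v))

def K(x, z):
    # Kernel evaluated directly in the input space: no call to h().
    return dot(x, z) ** 2

x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(h(x), h(z))  # map to feature space first, then take the inner product
implicit = K(x, z)          # same quantity computed without ever building h(x)

print(explicit, implicit)
```

The two printed values agree up to floating-point rounding, even though `K` never constructs the 3-D vectors.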

The optimization for finding the best separating hyperplane in $\mathcal{F}$ involves $h(x)$ only through inner products of the form $\langle h(x_i),h(x_j)\rangle$. That is to say, if you know $K(\cdot,\cdot)$, you don’t need to know the exact form of $h(x)$, which makes the optimization easier.
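For reference, this is how it looks in the standard soft-margin dual formulation. The symbols here are not from the text above and are assumed for illustration: $n$ training points $x_i$ with labels $y_i\in\{-1,+1\}$, dual variables $\alpha_i$, a regularization constant $C$, and an intercept $b$.

$$
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(x_i,x_j)
\quad\text{subject to}\quad 0\le\alpha_i\le C,\qquad \sum_{i=1}^{n}\alpha_i y_i=0
$$

$$
f(x)=\sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i,x)+b
$$

Both the objective and the resulting decision function $f(x)$ touch the data only through $K(\cdot,\cdot)$; $h(x)$ itself never appears.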

Conversely, each valid (positive semi-definite) kernel $K(\cdot,\cdot)$ has a corresponding $h(x)$. So if you’re using an SVM with that kernel, you’re implicitly finding a linear decision boundary (a hyperplane) in the space that $h(x)$ maps into.
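The "implicit" part can be seen in a toy example. The sketch below is not an SVM: it uses a kernel perceptron, a simpler stand-in that shares the same kernelized decision function $\sum_i \alpha_i y_i K(x_i,x)$ but does not maximize the margin. The XOR-style data, the degree-2 polynomial kernel $K(x,z)=(x\cdot z)^2$, and all function names are assumptions for illustration.

```python
def poly2_kernel(x, z):
    # Degree-2 polynomial kernel on 2-D inputs (illustrative choice).
    return (x[0] * z[0] + x[1] * z[1]) ** 2

# XOR-style data: not linearly separable in the original 2-D space.
X = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
y = [1, 1, -1, -1]

def decision(alpha, x):
    # Kernelized decision function: a linear function in feature space,
    # written purely in terms of kernel evaluations.
    return sum(a * yi * poly2_kernel(xi, x) for a, yi, xi in zip(alpha, y, X))

def train(epochs=10):
    # Kernel perceptron: bump alpha[i] whenever point i is misclassified
    # (or lies exactly on the boundary).
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi * decision(alpha, xi) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

alpha = train()
predictions = [1 if decision(alpha, x) > 0 else -1 for x in X]
print(alpha, predictions)
```

The learned function classifies all four points correctly. In the feature space of this kernel it is a hyperplane, while in the original coordinates it reduces to a multiple of $x_1x_2$, which is nonlinear. The code never computes $h(x)$.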

Chapter 12 of *Elements of Statistical Learning* gives a brief introduction to SVMs. It goes into more detail about the connection between kernels and feature mappings:

http://statweb.stanford.edu/~tibs/ElemStatLearn/

**Attribution** *Source: Link, Question Author: Karnivaurus, Answer Author: Sycorax*