I am rather new to the field of Gaussian processes and how they are being applied in machine learning. I keep reading and hearing about the covariance functions being the main attraction of these methods. So could anyone explain in an intuitive manner what is happening in these covariance functions?
Otherwise, could you point me to a specific tutorial or document that explains them?
In loose terms, a kernel or covariance function k(x, x′) specifies the statistical relationship between two points x, x′ in your input space; that is, how markedly a change in the value of the Gaussian process (GP) at x correlates with a change in the GP at x′. In some sense, you can think of k(⋅,⋅) as defining a similarity between inputs (*).
Typical kernels might simply depend on the Euclidean distance (or linear transformations thereof) between points, but the fun starts when you realize that you can do much, much more.
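To make the distance-based case concrete, here is a minimal sketch using the squared-exponential (RBF) kernel, which is only one common choice. The function name `rbf_kernel`, its `lengthscale` and `variance` parameters, the input grid, and the jitter value are my own assumptions for illustration, not something taken from a particular library.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs:
    k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)  # column of inputs
    xb = np.asarray(xb, dtype=float).reshape(1, -1)  # row of inputs
    sq_dist = (xa - xb) ** 2                         # pairwise squared distances
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

# Covariance between GP values at nearby vs. distant inputs
k_near = rbf_kernel([0.0], [0.1])[0, 0]
k_far = rbf_kernel([0.0], [3.0])[0, 0]
print(f"k(0, 0.1) = {k_near:.3f}, k(0, 3) = {k_far:.3f}")

# The kernel evaluated on a grid gives the covariance matrix of the GP there
x = np.linspace(0.0, 5.0, 50)
K = rbf_kernel(x, x)

# Draw three functions from the zero-mean GP prior via a Cholesky factor.
# The small jitter on the diagonal is an assumed value for numerical stability.
rng = np.random.default_rng(0)
chol = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
samples = chol @ rng.standard_normal((len(x), 3))    # shape (50, 3)
```

Inputs that are close get a covariance near the prior variance, so the sampled functions take similar values there; inputs far apart are almost uncorrelated. Shrinking `lengthscale` makes the samples wigglier, which is the sense in which the kernel encodes your assumptions about the function.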
As David Duvenaud puts it:
Kernels can be defined over all types of data structures: Text,
images, matrices, and even kernels. Coming up with a kernel on a new
type of data used to be an easy way to get a NIPS paper.
For an accessible overview of kernels for GPs, I warmly recommend his Kernel Cookbook and the references therein.
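(*) As @Dikran Marsupial notes, beware that the converse is not true; not all similarity measures are valid kernels (see his answer). A valid kernel must be positive semi-definite, so that every covariance matrix built from it is a legitimate covariance matrix.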
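As a small illustration of that caveat, the sketch below compares two similarity functions on the same three points. The "boxcar" similarity (1 if the inputs are within distance 1, else 0) is a hypothetical example I picked because it looks like a sensible similarity yet is not positive semi-definite; the helper names `gram`, `boxcar`, and `rbf` are likewise my own.

```python
import numpy as np

def gram(similarity, points):
    """Matrix of pairwise similarities between all points."""
    return np.array([[similarity(a, b) for b in points] for a in points])

def boxcar(a, b):
    """Hypothetical similarity: 1 if the inputs are within distance 1, else 0."""
    return 1.0 if abs(a - b) <= 1.0 else 0.0

def rbf(a, b):
    """Squared-exponential kernel with unit lengthscale."""
    return np.exp(-0.5 * (a - b) ** 2)

points = [0.0, 1.0, 2.0]

# A valid kernel must give a Gram matrix with no negative eigenvalues
min_eig_boxcar = np.linalg.eigvalsh(gram(boxcar, points)).min()
min_eig_rbf = np.linalg.eigvalsh(gram(rbf, points)).min()

print(f"boxcar: smallest eigenvalue = {min_eig_boxcar:.4f}")  # negative
print(f"rbf:    smallest eigenvalue = {min_eig_rbf:.4f}")     # positive
```

The boxcar Gram matrix has a negative eigenvalue (1 − √2), so it cannot be the covariance matrix of any random vector, whereas the RBF one can. Note that a check like this on a few points can only rule a candidate out; passing it on one set of points does not prove that a function is a valid kernel.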