Hyperparameter tuning in Gaussian Process Regression

I am trying to tune the hyperparameters of the gaussian process regression algorithm I’ve implemented. I simply want to maximize the log marginal likelihood given by the formula log(y|X,θ)=12yTK1yy12log(det
where K is the covariance matrix with the elements K_{ij}=k(x_i,x_j)=b^{-1}\exp(-\frac{1}{2}(x_i-x_j)^TM(x_i-x_j))+a^{-1}\delta_{ij} where M=lI and a,b and l are hyperparameters.

partial derivative of the log marginal likelihood w.r.t parameters is given by the following \frac{\log(\mathbf{y}|X,\mathbf{\theta})}{d\theta}=\frac{1}{2}\mathrm{trace}(K^{-1}\frac{dK}{d\theta})+\frac{1}{2}(\mathbf{y}\frac{dK}{d\theta}K^{-1}\frac{dK}{d\theta}\mathbf{y})

As the entries of K depend on the parameters, so do derivatives and inverse of K. This means, when a gradient-based optimizer is employed, evaluating the gradient at a given point (parameter value) will require recomputation of the covariance matrix. In my application, this is not feasible because computing the covariance matrix from scratch and computing its inverse in every iteration of gradient ascent is too expensive. My question is what my options are to find a fairly good combination of these three parameters? and I also don’t know which parameter to optimize first and I would appreciate any pointers on this issue as well.

Answer

You are right that you need a new covariance matrix computation on every iteration of gradient ascent. So if the matrix computation is not feasible for your setting, then, I think, you cannot use gradient-based marginal likelihood optimization.

My suggestion is to use gradient-free methods for hyperparameter tuning, such as grid search, random search, or Bayesian optimization-based search. These methods are widely used for optimization hyperparameters of other machine learning algorithms e.g. SVMs.

I suggest the grid search for your first try. You basically form a table (grid) of possible hyperparameters, try every one, and look for the best validation performance (or best marginal likelihood).

Grid search would yield a suboptimal set of hyperparameters, and you have to specify grid by yourself.(tip: make grid in a log scale) but far less computation is needed. (and you don’t need gradient!)

If you are not familiar with grid search, you can look up Wikipedia:Hyperparameter Optimization – Grid Search

Attribution
Source : Link , Question Author : bfaskiplar , Answer Author : Sangwoong Yoon

Leave a Comment