I have recently started to delve into Gaussian processes. During my review, I have found a book which states that one can interpret the mean of a Gaussian process as a combination of basis functions, i.e.:

\bar{f}(x^*)=\sum_{n=1}^N \alpha_i k(x_i,x^*) \tag{1}

where N is the number of training points for the Gaussian process, k is a RBF kernel, and a_i is the i-th entry of a vector

\alpha=[\alpha_1,…,\alpha_N]^T=(K+\sigma_n^{2}I)^{-1}y\tag{2}

where K is the Gram matrix (the N-by-N matrix of kernel evaluations at the training points, where entry K_{n,m}=k(x_n,x_m)) and y is a vector of length N containing the predicted values at thetraining points x_i,i=1,…,N. These equations are taken from Rasmussen & Williams (page 11, equation 2.27). In my case, we can assume that \sigma_n=0, so

\alpha=[\alpha_1,…,\alpha_N]^T=K^{-1}y\tag{3}

Now here is the problem: if I follow this form, my Gaussian process does not correctly fit the training data. If I try other implementations, the Gaussian process fits the data correctly. Unfortunately, I require the Gaussian process in the form of Equation (1) because I want to take the derivative of (1) wrt x.Could you please check whether I have made an error somewhere in the code example below? My solution according to (1) is plotted as a green dotted line, the alternative approach I used is plotted as a red dotted line.

`import numpy as np import matplotlib.pyplot as plt np.random.seed(1) def evaluate_kernel(x1,x2,hs): """ This function takes two arrays of shape (N x D) and (M x D) as well as a vector of bandwidths hs (M) and returns a (N x M) matrix of RBF kernel evaluations. D is the dimensionality of the parameters; here D = 1 """ # Pre-allocate empty matrix matrix = np.zeros((x1.shape[0],x2.shape[0])) for n in range(x2.shape[0]): dist = np.linalg.norm(x1-x2[n,:],axis=1) matrix[:,n] = np.exp(-(dist**2)/(2*hs[n])) return matrix # Create training samples N = 20 x_train = np.random.uniform(0,1,size=(N,1)) y_train = np.cos(x_train*2*np.pi) # Set the bandwidths to 1 for now hs = np.ones(N)/100 # Get the Gaussian Process parameters K = evaluate_kernel(x_train,x_train,hs) params = np.dot(np.linalg.inv(K.copy()),y_train) # Get the evaluation points M = 101 x_test = np.linspace(0,1,M).reshape((M,1)) K_star = evaluate_kernel(x_test,x_train,hs) # Evaluate the posterior mean mu = np.dot(K_star,params) # Plot the results plt.scatter(x_train,y_train) plt.plot(x_test,mu,'g:') # Alternative approach: works ------------------------------------------------- # Alternative approach # Apply the kernel function to our training points L = np.linalg.cholesky(K) # Compute the mean at our test points. Lk = np.linalg.solve(L, K_star.T) mu_alt = np.dot(Lk.T, np.linalg.solve(L, y_train)).reshape((101,)) plt.plot(x_test,mu_alt,'r:')`

**Answer**

Covariance matrix of Gaussian process K is defined in terms of evaluations of the kernel function k over the pairs of datapoints, i.e. K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j). For train X and test X_* datasets, we have submatrices K = K(X, X) and K_* = K(X, X_*). In such case, predictive mean of the Gaussian process is

\mu = K_* K^\top y

Eyeballing the code, I don’t see any obvious bug. You need to do standard debugging, so for every step check if the outputs match what you would expect from processing the inputs (values, shapes, etc). Also, I’d recommend starting with simple, unoptimized code, as *premature optimization is a root of all evil*. For example: for evaluating the kernel use old-fashioned for-loops rather than vectorized code, moreover, you seem to use K_* = K(X_*, X) to avoid transposing, instead write it exactly as in the equation, and only if it works as expected, optimize the code. Finally, write unit tests.

**Attribution***Source : Link , Question Author : J.Galt , Answer Author : Tim*