Estimating the parameters is often done through maximum likelihood estimation. In this vignette, derivatives of parameters are calculated to give the parameter gradient. This can then be used in numerical optimization, such as with BFGS, and should be much faster than calculating without having the gradient.

In this vignette, I follow the lead of the GPfit paper. Instead of maximizing the likelihood, we find parameter estimates by minimizing the deviance. For correlation parameters \(\theta\), the deviance is shown below.

\[ -2\log L_{\theta} \propto \log |R| + n\log[(Y-1_n \hat{\mu}(\theta))^T R^{-1} (Y-1_n \hat{\mu}(\theta))] = \mathcal{D}\]

For now I will assume that \(\hat{\mu}\) does not depend on \(\theta\), and will replace \(Y-\hat{\mu}(\theta)\) with \(Z\), as shown below.

\[ \mathcal{D} = \log |R| + n\log[Z^T R^{-1} Z]\]

Thus the only dependence on the correlation parameters is through \(R\).

Via the GPML book section A.3.1, we need the following equations:

\[ \frac{\partial }{\partial \theta} R^{-1} = -R^{-1} \frac{\partial R}{\partial \theta} R^{-1}\]

\[ \frac{\partial }{\partial \theta} \log |R| = \text{tr}(R ^ {-1}\frac{\partial R}{\partial \theta} ) \]

Now we can calculate the derivative of the deviance in terms of \(\frac{\partial R}{\partial \theta}\).

\[ \frac{\partial \mathcal{D}}{\partial \theta} = \text{tr}(R ^ {-1}\frac{\partial R}{\partial \theta} ) - \frac{n}{Z^T R^{-1} Z} Z^T R^{-1} \frac{\partial R}{\partial \theta} R^{-1} Z \]

Now we just need to find \(\frac{\partial R}{\partial \theta}\), which depends on the specific correlation function \(R(\theta)\).

The correlation function is usually augmented by adding a nugget term \(\delta\) to the diagonal:

\[ R = R^* + \delta I \]

The nugget accounts for noise in the responses, smoothing the predicted response function. It also helps with numerical stability, which is a serious problem when there is a lot of data. Often a small value is used even the function is noiseless.

For the nugget,

\[\frac{\partial R}{\partial \delta} = I\]

Thus, the derivative for the deviance is very simple. \[ \frac{\partial \mathcal{D}}{\partial \delta} = \text{tr}(R ^ {-1} ) - \frac{n}{Z^T R^{-1} Z} Z^T R^{-1} R^{-1} Z \]

This equation gives the derivative of the deviance with respect to the nugget regardless of the correlation function used.

The Gaussian correlation function has parameter vector \(\theta = (\theta_1, ..., \theta_d)\).

Ignoring the nugget term, the \(i\),\(j\) entry of the correlation matrix is

\[ R_{ij}(x, y) = \exp\bigg[-\sum_{k=1}^{d} \theta_k (x_{ik} - x_{jk})^2 \bigg] \]

\[ \frac{\partial}{\partial \theta_l} R_{ij}(x, y) = -\exp\bigg[-\sum_{k=1}^{d} \theta_k (x_{ik} - x_{jk})^2 \bigg] (x_{il} - x_{jl})^2 \]

This will give the matrix \(\frac{\partial R}{\partial \theta_l}\), which can be used with the previous equations to calculate \(\frac{\partial \mathcal{D}}{\partial \theta_l}\).

The lifted brownian covariance function is \[ c(x, x') = \psi(x) + \psi(x') + \psi(x-x') - 1 \] where \[ \psi(h) = (1 + ||h||_a^{2\gamma} ) ^{\beta} \]

and \[ ||h||_a^2 = \boldsymbol{h}^T \begin{bmatrix} a_1 & 0 & \dots & 0 \\ 0 & a_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & a_d \end{bmatrix} \boldsymbol{h} = \sum_{i=1}^d a_i h_i^2\]

This is different from the others because it is not a correlation function, which ranges from 0 to 1, but is the covariance function itself. Thus we will have to use first deviance before inserting \(\hat{\sigma}^2\).

\[ \frac{\partial}{\partial \beta} \psi(h) = (1 + ||h||_a^{2\gamma} ) ^{\beta} \log (1 + ||h||_a^{2\gamma} ) = \psi(h) \log (1 + ||h||_a^{2\gamma} ) \]

\[ \frac{\partial}{\partial \gamma} \psi(h) = \beta (1 + ||h||_a^{2\gamma} ) ^{\beta-1} ||h||_a^{2\gamma} \log(||h||_a^{2}) \] \[ \frac{\partial}{\partial a_i} ||h||_a^{2} = h_i^2 \]

\[ \frac{\partial}{\partial a_i} ||h||_a^{2\gamma} = \gamma ||h||_a^{2(\gamma-1)} h_i^2 \]

\[ \frac{\partial}{\partial a_i} \psi(h) = \beta (1 + ||h||_a^{2\gamma} ) ^{\beta-1} \gamma ||h||_a^{2(\gamma-1)} h_i^2 \]

The likelihood for data from a Gaussian process follows the standard multivariate normal probability distribution function (pdf). \[ L = (2 \pi)^{-k/2} |\Sigma|^{-1/2} \exp[\frac{-1}{2}(Y - \mu) \Sigma^{-1} (Y - \mu)] \] The log likelihood is generally easier to work with.

\[ \log L = \frac{-k}{2} \log(2 \pi) + \frac{-1}{2}\log|\Sigma| + \frac{-1}{2}(Y - \mu) \Sigma^{-1} (Y - \mu) \] To simplify, we can multiply it by -2, and call this the deviance, denoted here as \(\mathcal{D}\).

\[ \mathcal{D} = -2\log L = k \log(2 \pi) + \log|\Sigma| + (Y - \mu) \Sigma^{-1} (Y - \mu) \] \(k\) can be ignored since it usually constant while optimizing parameters. \[ \mathcal{D} = -2\log L \propto \log|\Sigma| + (Y - \mu) \Sigma^{-1} (Y - \mu) \]

\[ \frac{\partial \mathcal{D}}{\partial \theta} = \text{tr}(\Sigma ^ {-1}\frac{\partial \Sigma}{\partial \theta} ) - Z^T \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta} \Sigma^{-1} Z \]