When an analytical Jacobian is available, is it better to approximate the Hessian by JTJJ^TJ, or by finite differences of the Jacobian?

Let’s say I’m computing some model parameters my minimizing the sum squared residuals, and I’m assuming my errors are Gaussian. My model produces analytical derivatives, so the optimizer does not need to use finite differences. Once the fit is complete, I want to calculate standard errors of the fitted parameters.

Generally, in this situation, the Hessian of the error function is taken to be related to the covariance matrix by:

where $\sigma^2$ is the variance of the residuals.

When no analytical derivatives of the error are available, it is typically impractical to compute the Hessian, so $J^TJ$ is taken as a good approximation.

However, in my case, I’ve got an analytical J, so it’s relatively cheap for me to compute H by finite differencing J.

So, my question is this: Would it be more accurate to approximate H using my exact J and applying the above approximation, or to approximate H by finite differencing J?

GOOD question. First, recall where this approximation $H \approx J^T J$ comes from. Let $(x_i, y_i)$ be your data points, $f(\cdot)$ be your model and $\beta$ be the parameters of your model. Then the objective function of the non-linear least squares problem is $\frac{1}{2} r^T r$ where $r$ is the vector of the residuals, $r_i = y_i - f(x_i, \beta)$. The exact Hessian of the objective function is $H = J^T J + \sum r_i \nabla^2 r_i$. So the error in this approximation is $H - J^T J = \sum r_i \nabla^2 r_i$. It’s a good approximation when the residuals, themselves, are small; or when the 2nd derivative of the residuals is small. Linear least squares can be considered a special case where the 2nd derivative of the residuals is zero.
As for finite difference approximation, it is relatively cheap. To compute a central difference, you’ll need to evaluate the Jacobian an additional $2n$ times (a forward difference will cost you $n$ additional evaluations, so I wouldn’t bother). The error of the central difference approximation is proportional to $\nabla^4 r$ and $h^2$, where $h$ is the step size. The optimal step size is $h \sim \epsilon^\frac{1}{3}$, where $\epsilon$ is machine precision. So unless the derivatives of the residuals are blowing up, it’s pretty clear that the finite difference approximation should be a LOT better. I should point out that, while the computation is minimal, the bookkeeping is nontrivial. Each finite difference on the Jacobian will give you one row of the Hessian for each residual. You’ll then have to reassemble the Hessian using the formula above.