In the standard maximum likelihood setting (an iid sample Y1,…,Yn from some distribution with density fy(y|θ0)) and with a correctly specified model, the Fisher information is given by

$$I(\theta_0) = -E_{\theta_0}\left[\frac{\partial^2}{\partial\theta^2}\log f_y(Y\mid\theta)\Big|_{\theta=\theta_0}\right],$$

where the expectation is taken with respect to the true density that generated the data. I have read that the observed Fisher information

$$J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta^2}\log f_y(Y_i\mid\theta)$$

is used primarily because the integral involved in calculating the (expected) Fisher information might not be feasible in some cases. What confuses me is that even if the integral is doable, the expectation has to be taken with respect to the true model, that is, it involves the unknown parameter value θ0. If that is the case, it appears that without knowing θ0 it is impossible to compute I. Is this true?
You’ve got four quantities here: the true parameter θ0, a consistent estimate θ̂, the expected information I(θ) at θ, and the observed information J(θ) at θ.
The expected and observed information are only equivalent asymptotically, but that is typically how they are used.
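The observed information

$$J(\theta_0) = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2}{\partial\theta^2}\log f(y_i\mid\theta)\Big|_{\theta=\theta_0}$$

converges in probability to the expected information

$$I(\theta_0) = -E_{\theta_0}\left[\frac{\partial^2}{\partial\theta^2}\log f(y\mid\theta)\Big|_{\theta=\theta_0}\right]$$

when Y is an iid sample from f(θ0). Here Eθ0(x) denotes the expectation with respect to the distribution indexed by θ0: ∫xf(x|θ0)dx.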
This convergence holds because of the law of large numbers, so the assumption that Y∼f(θ0) is crucial here.
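To make the convergence concrete, here is a minimal simulation sketch. The model is an assumption chosen for illustration and does not come from the discussion above: an Exponential distribution parameterized by its mean μ, for which the averaged observed information is J(μ) = −1/μ² + 2ȳ/μ³ and the per-observation expected information is I(μ) = 1/μ². The function names are hypothetical.

```python
import random

# Assumed illustration model: Exponential with mean mu,
# log f(y | mu) = -log(mu) - y / mu.

def observed_information(sample, mu):
    """Averaged observed information J(mu).

    Uses -d^2/dmu^2 log f(y | mu) = -1/mu**2 + 2*y/mu**3, averaged over the sample.
    """
    n = len(sample)
    return sum(-1.0 / mu**2 + 2.0 * y / mu**3 for y in sample) / n

def expected_information(mu):
    """Per-observation expected information I(mu) = 1/mu**2 for this model."""
    return 1.0 / mu**2

def simulate(mu0=2.0, sizes=(10, 1_000, 100_000), seed=0):
    """Return (n, J(mu0), I(mu0)) for samples of growing size drawn at mu0."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        # expovariate takes the rate, which is 1 / mean
        sample = [rng.expovariate(1.0 / mu0) for _ in range(n)]
        rows.append((n, observed_information(sample, mu0), expected_information(mu0)))
    return rows

if __name__ == "__main__":
    for n, j, i in simulate():
        print(f"n={n:>6}  J(mu0)={j:.4f}  I(mu0)={i:.4f}")
```

Because the data are drawn at μ0 and J is evaluated at μ0, the average inside J is an average of iid terms whose mean is I(μ0), which is exactly the situation the law of large numbers covers.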
When you’ve got an estimate θ̂ that converges in probability to the true parameter θ0 (i.e., is consistent), you can substitute it anywhere you see a θ0 above, essentially due to the continuous mapping theorem∗, and all of the convergences continue to hold.
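∗ Actually, this appears to be a bit subtle: J is itself a random function of θ, so the substitution needs convergence that is uniform in a neighborhood of θ0, which the continuous mapping theorem alone does not supply.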
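The substitution is what makes the expected information computable without knowing θ0: you evaluate the formula for I at the estimate. A rough sketch, again assuming an Exponential model with mean μ (an illustrative choice, not something stated above, with hypothetical function names), where the MLE is the sample mean and I(μ) = 1/μ²:

```python
import math
import random

# Assumed illustration model: Exponential with mean mu, so the MLE is the
# sample mean and the per-observation expected information is 1/mu**2.

def mle_mean(sample):
    """Maximum likelihood estimate of mu for the assumed model."""
    return sum(sample) / len(sample)

def expected_information(mu):
    """Per-observation expected information I(mu) = 1/mu**2."""
    return 1.0 / mu**2

def observed_information(sample, mu):
    """Averaged observed information J(mu) = -1/mu**2 + 2*ybar/mu**3."""
    ybar = sum(sample) / len(sample)
    return -1.0 / mu**2 + 2.0 * ybar / mu**3

def plug_in_summary(sample):
    """Return (mu_hat, I(mu_hat), J(mu_hat), standard error of mu_hat)."""
    n = len(sample)
    mu_hat = mle_mean(sample)
    i_hat = expected_information(mu_hat)       # no knowledge of mu0 needed
    j_hat = observed_information(sample, mu_hat)
    se = 1.0 / math.sqrt(n * i_hat)            # asymptotic standard error
    return mu_hat, i_hat, j_hat, se

if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.expovariate(0.5) for _ in range(1_000)]  # true mean 2.0
    print(plug_in_summary(data))
```

In this particular model J(μ̂) and I(μ̂) coincide at the MLE, since ȳ = μ̂; that is a property of this example and not something to expect in general.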
As you surmised, observed information is typically easier to work with because differentiation is easier than integration, and you might have already evaluated it in the course of some numeric optimization. In some circumstances (for example, the mean of a Normal distribution with known variance) the two will be the same.
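As a small illustration of both points, the sketch below approximates the observed information by a central finite difference of the average log-likelihood, roughly the kind of quantity a numeric optimizer may already have on hand. It assumes a Normal model with unknown mean and known standard deviation σ, where the observed information equals the expected information 1/σ² for every sample and every μ. The function names and the step size h are hypothetical choices.

```python
import math
import random

# Assumed illustration model: Normal(mu, sigma^2) with sigma known.

def avg_log_lik(sample, mu, sigma=1.0):
    """Average log-likelihood of the sample at mean mu."""
    n = len(sample)
    const = -0.5 * math.log(2.0 * math.pi * sigma**2)
    return sum(const - (y - mu) ** 2 / (2.0 * sigma**2) for y in sample) / n

def numeric_observed_information(sample, mu, sigma=1.0, h=1e-3):
    """Negative second derivative of the average log-likelihood at mu,
    approximated by a central finite difference with step h."""
    f_plus = avg_log_lik(sample, mu + h, sigma)
    f_mid = avg_log_lik(sample, mu, sigma)
    f_minus = avg_log_lik(sample, mu - h, sigma)
    return -(f_plus - 2.0 * f_mid + f_minus) / h**2

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(1.0, 2.0) for _ in range(500)]
    # Should be close to 1 / sigma**2 = 0.25 whatever mu is used.
    print(numeric_observed_information(data, mu=0.3, sigma=2.0))
```

The log-likelihood is quadratic in μ here, so the second derivative does not depend on the data at all, which is why the observed and expected information agree exactly in this case.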
The article “Assessing the Accuracy of the Maximum Likelihood Estimator: Observed Versus Expected Fisher Information” by Efron and Hinkley (1978) makes an argument in favor of the observed information for finite samples.