Observed information matrix is a consistent estimator of the expected information matrix?

I am trying to prove that the observed information matrix, evaluated at the weakly consistent maximum likelihood estimator (MLE), is a weakly consistent estimator of the expected information matrix. This is a widely quoted result, but nobody gives a reference or a proof (I think I have exhausted the first 20 pages of Google results and my stats textbooks)!

Using a weakly consistent sequence of MLEs, the obvious route is to combine the weak law of large numbers (WLLN) with the continuous mapping theorem (CMT) to get the result I want. However, I believe the continuous mapping theorem cannot actually be applied here; instead I think the uniform law of large numbers (ULLN) is needed. Does anybody know of a reference that has a proof of this? I have an attempt at a ULLN argument but omit it for now for brevity.

I apologise for the length of this question, but notation has to be introduced. The notation is as follows (my proof is at the end).

Assume we have an iid sample of random variables $\{Y_1, \ldots, Y_N\}$ with common density $f(\tilde{Y}|\theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^k$ (here $\tilde{Y}$ is just a generic random variable with the same density as any member of the sample). The vector $Y = (Y_1, \ldots, Y_N)^T$ is the vector of all the sample vectors, where $Y_i \in \mathbb{R}^n$ for all $i = 1, \ldots, N$. The true parameter value of the densities is $\theta_0$, and $\hat{\theta}_N(Y)$ is the weakly consistent maximum likelihood estimator (MLE) of $\theta_0$. Subject to regularity conditions, the Fisher information matrix can be written as

$$I(\theta) = -E_\theta\left[H_\theta\left(\log f(\tilde{Y}|\theta)\right)\right]$$

where $H_\theta$ is the Hessian matrix with respect to $\theta$. The sample equivalent is

$$I_N(\theta) = \sum_{i=1}^N I_{Y_i}(\theta),$$

where $I_{Y_i}(\theta) = -E_\theta\left[H_\theta\left(\log f(Y_i|\theta)\right)\right]$. The observed information matrix is

$$J(\theta) = -H_\theta\left(\log f(y|\theta)\right),$$

(some people demand that the matrix be evaluated at $\hat{\theta}$, but some don’t). The sample observed information matrix is

$$J_N(\theta) = \sum_{i=1}^N J_{y_i}(\theta),$$

where $J_{y_i}(\theta) = -H_\theta\left(\log f(y_i|\theta)\right)$.
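(To fix ideas, here is a simple concrete instance, my own illustration rather than part of the general setup: take $k = n = 1$ and a Poisson($\theta$) model, so that

$$\log f(y|\theta) = y\log\theta - \theta - \log(y!), \qquad J_{y_i}(\theta) = \frac{y_i}{\theta^2}, \qquad I(\theta) = \frac{E_\theta[Y_1]}{\theta^2} = \frac{1}{\theta}.$$

The MLE is $\hat{\theta}_N = \bar{y}$, and hence

$$N^{-1}J_N(\hat{\theta}_N) = \frac{\bar{y}}{\bar{y}^2} = \frac{1}{\bar{y}} \xrightarrow{P} \frac{1}{\theta_0} = I(\theta_0).$$

In this special case the convergence follows from the WLLN and the CMT applied to the fixed continuous map $x \mapsto 1/x$; the difficulty below is that, in general, $N^{-1}J_N(\hat{\theta}_N)$ is not a fixed continuous function of a single converging statistic.)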

I can prove convergence in probability of the estimator $N^{-1} J_N(\theta)$ to $I(\theta)$, but not of $N^{-1} J_N(\hat{\theta}_N(Y))$ to $I(\theta_0)$. Here is my proof so far:

Now $(J_N(\theta))_{rs} = -\sum_{i=1}^N \left(H_\theta\left(\log f(Y_i|\theta)\right)\right)_{rs}$ is element $(r,s)$ of $J_N(\theta)$, for any $r,s = 1, \ldots, k$. Since the sample is iid, by the weak law of large numbers (WLLN) the average of these summands converges in probability to $E_\theta\left[-\left(H_\theta\left(\log f(Y_1|\theta)\right)\right)_{rs}\right] = (I_{Y_1}(\theta))_{rs} = (I(\theta))_{rs}$. Thus $N^{-1}(J_N(\theta))_{rs} \xrightarrow{P} (I(\theta))_{rs}$ for all $r,s = 1, \ldots, k$, and so $N^{-1} J_N(\theta) \xrightarrow{P} I(\theta)$. Unfortunately we cannot simply conclude $N^{-1} J_N(\hat{\theta}_N(Y)) \xrightarrow{P} I(\theta_0)$ by using the continuous mapping theorem, since $N^{-1} J_N(\cdot)$ is not the same function as $I(\cdot)$.
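Purely as a sanity check (not a proof), here is a small simulation sketch in Python for the Poisson example above: it finds the MLE numerically and approximates $N^{-1}J_N(\hat{\theta}_N)$ by a finite-difference second derivative of the average negative log-likelihood. The particular model, seed and step size are just my illustrative choices.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta0 = 2.0                   # true parameter theta_0
expected_info = 1.0 / theta0   # I(theta_0) = 1/theta_0 for the Poisson(theta) model

def avg_neg_loglik(theta, y):
    # Average negative log-likelihood; the additive log(y!) term is dropped
    # because it does not depend on theta.
    return theta - np.mean(y) * np.log(theta)

def observed_info_at(theta, y, h=1e-4):
    # N^{-1} J_N(theta), approximated by a central finite-difference second
    # derivative of the average negative log-likelihood.
    f = lambda t: avg_neg_loglik(t, y)
    return (f(theta + h) - 2.0 * f(theta) + f(theta - h)) / h**2

for N in (10, 100, 1000, 10000, 100000):
    y = rng.poisson(theta0, size=N)
    # The MLE is found numerically here (analytically it is just the sample mean).
    theta_mle = minimize_scalar(avg_neg_loglik, args=(y,),
                                bounds=(1e-6, 50.0), method="bounded").x
    print(N, observed_info_at(theta_mle, y), expected_info)

As N grows, the first printed value settles down towards the second, which is exactly the behaviour I would like to prove in general.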

Any help on this would be greatly appreciated.

Answer

\newcommand{\convp}{\stackrel{P}{\longrightarrow}}

I guess directly establishing some sort of uniform law of large numbers
is one possible approach.

Here is another.

We want to show that \frac{J^N(\theta_{MLE})}{N} \convp I(\theta^*), where \theta_{MLE} denotes your \hat{\theta}_N(Y) and \theta^* denotes the true value \theta_0.

(As you said, we have by the WLLN that \frac{J^N(\theta)}{N} \convp I(\theta). But this doesn’t directly help us.)

One possible strategy is to show that

|I(\theta^*) - \frac{J^N(\theta^*)}{N}| \convp 0

and

|\frac{J^N(\theta_{MLE})}{N} - \frac{J^N(\theta^*)}{N}| \convp 0.

If both of these results hold, then we can combine them to get

|I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}| \convp 0,

which is exactly what we want to show.
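Explicitly, the combination step is just the triangle inequality, together with the fact that a sum of two terms that each converge to zero in probability itself converges to zero in probability:
\begin{align*}
|I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}|
\le |I(\theta^*) - \frac{J^N(\theta^*)}{N}|
 + |\frac{J^N(\theta^*)}{N} - \frac{J^N(\theta_{MLE})}{N}| \convp 0.
\end{align*}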

The first convergence follows from the weak law of large numbers.

The second almost follows from the continuous mapping theorem, but unfortunately our function g() that we want to apply the CMT to changes with N:
our g is really g_N(\theta) := \frac{J^N(\theta)}{N}. So we
cannot use the CMT.

(Comment: If you examine the proof of the CMT on Wikipedia, notice that the set B_\delta defined in that proof would, in our setting, also depend on N. We essentially need some sort of equicontinuity at \theta^* over our functions g_N(\theta).)

Fortunately, if you assume that the family \mathcal{G} = \{g_N \mid N=1,2,\ldots\}
is stochastically equicontinuous at \theta^*, then it immediately
follows from \theta_{MLE} \convp \theta^* that
\begin{align*}
|g_N(\theta_{MLE}) - g_N(\theta^*)| \convp 0.
\end{align*}

(See here: http://www.cs.berkeley.edu/~jordan/courses/210B-spring07/lectures/stat210b_lecture_12.pdf for a definition of stochastic equicontinuity at \theta^*, and a proof of the above fact.)

Therefore, assuming that \mathcal{G} is SE at \theta^*, your desired result holds
true and the empirical Fisher information converges to the population Fisher information.

Now, the key question of course is, what sort of conditions do you need
to impose on \mathcal{G} to get SE?
It looks like one way to do this is to establish a Lipschitz condition
on the entire class of functions \mathcal{G} (see here: http://econ.duke.edu/uploads/media_items/uniform-convergence-and-stochastic-equicontinuity.original.pdf ).
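To sketch why such a Lipschitz condition does the job (under the extra assumption, which I am adding here, that the random Lipschitz constants are bounded in probability): suppose that for all \theta in a neighbourhood of \theta^* we have |g_N(\theta) - g_N(\theta^*)| \le B_N \|\theta - \theta^*\| with B_N = O_P(1). Given \epsilon, \eta > 0, pick M and N_0 such that P(B_N > M) \le \eta for all N \ge N_0 (this is what B_N = O_P(1) gives), and set \delta = \epsilon / M. Then, for all N \ge N_0,
\begin{align*}
P\Big(\sup_{\|\theta - \theta^*\| < \delta} |g_N(\theta) - g_N(\theta^*)| > \epsilon\Big)
\le P(B_N \delta > \epsilon) = P(B_N > M) \le \eta,
\end{align*}
so \limsup_N of the left-hand side is at most \eta, which is stochastic equicontinuity at \theta^*.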

Attribution
Source: Link, Question Author: dandar, Answer Author: Dapz
