I am trying to prove that the observed information matrix, evaluated at the weakly consistent maximum likelihood estimator (MLE), is a weakly consistent estimator of the expected information matrix. This is a widely quoted result, but nobody gives a reference or a proof (I have exhausted, I think, the first 20 pages of Google results and my stats textbooks)!

Using a weakly consistent sequence of MLEs, I can use the weak law of large numbers (WLLN) and the continuous mapping theorem to get the result I want. However, I believe the continuous mapping theorem cannot be used here; instead, I think the uniform law of large numbers (ULLN) is needed. Does anybody know of a reference with a proof of this? I have an attempt using the ULLN but omit it for brevity.

I apologise for the length of this question, but notation has to be introduced. The notation is as follows (my proof is at the end).

Assume we have an iid sample of random variables $\{Y_1, \ldots, Y_N\}$ with densities $f(\tilde{Y}|\theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^k$ (here $\tilde{Y}$ is just a general random variable with the same density as any member of the sample). The vector $Y = (Y_1, \ldots, Y_N)^T$ is the vector of all the sample vectors, where $Y_i \in \mathbb{R}^n$ for all $i = 1, \ldots, N$. The true parameter value of the densities is $\theta_0$, and $\hat{\theta}_N(Y)$ is the weakly consistent maximum likelihood estimator (MLE) of $\theta_0$. Subject to regularity conditions, the Fisher information matrix can be written as

$$I(\theta) = -\mathbb{E}_\theta\left[H_\theta(\log f(\tilde{Y}|\theta))\right]$$

where $H_\theta$ is the Hessian matrix. The sample equivalent is

$$I_N(\theta) = \sum_{i=1}^N I_{Y_i}(\theta),$$

where $I_{Y_i}(\theta) = -\mathbb{E}_\theta\left[H_\theta(\log f(Y_i|\theta))\right]$. The observed information matrix is

$$J(\theta) = -H_\theta(\log f(y|\theta)),$$

(some people demand the matrix be evaluated at $\hat{\theta}$, but some don't). The sample observed information matrix is

$$J_N(\theta) = \sum_{i=1}^N J_{y_i}(\theta),$$

where $J_{y_i}(\theta) = -H_\theta(\log f(y_i|\theta))$.

I can prove convergence in probability of the estimator $N^{-1} J_N(\theta)$ to $I(\theta)$, but not of $N^{-1} J_N(\hat{\theta}_N(Y))$ to $I(\theta_0)$. Here is my proof so far:

Now $(J_N(\theta))_{rs} = -\sum_{i=1}^N \left(H_\theta(\log f(Y_i|\theta))\right)_{rs}$ is element $(r,s)$ of $J_N(\theta)$, for any $r, s = 1, \ldots, k$. If the sample is iid, then by the weak law of large numbers (WLLN) the average of these summands converges in probability to $-\mathbb{E}_\theta\left[(H_\theta(\log f(Y_1|\theta)))_{rs}\right] = (I_{Y_1}(\theta))_{rs} = (I(\theta))_{rs}$. Thus $N^{-1}(J_N(\theta))_{rs} \stackrel{P}{\to} (I(\theta))_{rs}$ for all $r, s = 1, \ldots, k$, and so $N^{-1} J_N(\theta) \stackrel{P}{\to} I(\theta)$. Unfortunately we cannot simply conclude $N^{-1} J_N(\hat{\theta}_N(Y)) \stackrel{P}{\to} I(\theta_0)$ by using the continuous mapping theorem, since $N^{-1} J_N(\cdot)$ is not the same function as $I(\cdot)$.
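As a numerical sanity check (not a proof), here is a quick simulation in a toy Poisson model, where the Hessian actually depends on the data; the model, sample sizes, and seed are my own illustrative choices. For $Y \sim \text{Poisson}(\theta)$ we have $N^{-1} J_N(\theta) = \bar{y}/\theta^2$, $I(\theta) = 1/\theta$, and $\hat{\theta}_N = \bar{y}$, so $N^{-1} J_N(\hat{\theta}_N) = 1/\bar{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 3.0  # true Poisson mean (illustrative choice)

# For Y ~ Poisson(theta): log f(y|theta) = y*log(theta) - theta - log(y!),
# so the Hessian is -y/theta^2, and therefore
#   N^{-1} J_N(theta) = ybar / theta^2,   I(theta) = 1/theta,   MLE = ybar.
for N in [100, 10_000, 1_000_000]:
    ybar = rng.poisson(lam=theta0, size=N).mean()
    obs_info = ybar / ybar**2    # N^{-1} J_N(theta_hat) = 1/ybar
    exp_info = 1 / theta0        # I(theta0)
    print(f"N={N:>9}: observed {obs_info:.4f}  vs  expected {exp_info:.4f}")
```

As $N$ grows the observed values should settle around $I(\theta_0) = 1/3$, consistent with the claim being proved.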

Any help on this would be greatly appreciated.

**Answer**

\newcommand{\convp}{\stackrel{P}{\longrightarrow}}

I guess directly establishing some sort of uniform law of large numbers is one possible approach. Here is another.

We want to show that $\frac{J^N(\theta_{MLE})}{N} \convp I(\theta^*)$, where $\theta^*$ is the true parameter (your $\theta_0$).

(As you said, we have by the WLLN that $\frac{J^N(\theta)}{N} \convp I(\theta)$. But this doesn't directly help us.)

One possible strategy is to show that
$$\left|I(\theta^*) - \frac{J^N(\theta^*)}{N}\right| \convp 0$$
and
$$\left|\frac{J^N(\theta_{MLE})}{N} - \frac{J^N(\theta^*)}{N}\right| \convp 0.$$

If both of these results are true, then we can combine them to get
$$\left|I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}\right| \convp 0,$$
which is exactly what we want to show.
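Explicitly, the combination step is just the triangle inequality, together with the fact that a sum of two terms each converging in probability to zero also converges in probability to zero:
\begin{align*}
\left|I(\theta^*) - \frac{J^N(\theta_{MLE})}{N}\right|
\le \left|I(\theta^*) - \frac{J^N(\theta^*)}{N}\right|
+ \left|\frac{J^N(\theta^*)}{N} - \frac{J^N(\theta_{MLE})}{N}\right|
\convp 0.
\end{align*}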

The first equation follows from the weak law of large numbers.

The second *almost* follows from the continuous mapping theorem, but unfortunately the function $g(\cdot)$ that we want to apply the CMT to changes with $N$: our $g$ is really $g_N(\theta) := \frac{J^N(\theta)}{N}$. So we cannot use the CMT.

(Comment: If you examine the proof of the CMT on Wikipedia, notice that the set $B_\delta$ they define in their proof now also depends on $n$. We essentially need some sort of equicontinuity at $\theta^*$ over our functions $g_N(\theta)$.)

Fortunately, if you assume that the family $\mathcal{G} = \{g_N \mid N = 1, 2, \ldots\}$ is stochastically equicontinuous at $\theta^*$, then for $\theta_{MLE} \convp \theta^*$ it immediately follows that
\begin{align*}
|g_N(\theta_{MLE}) - g_N(\theta^*)| \convp 0.
\end{align*}

(See here: http://www.cs.berkeley.edu/~jordan/courses/210B-spring07/lectures/stat210b_lecture_12.pdf for a definition of stochastic equicontinuity at $\theta^*$ and a proof of the above fact.)

Therefore, assuming that $\mathcal{G}$ is SE at $\theta^*$, your desired result holds, and the empirical Fisher information converges to the population Fisher information.
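To make the two-term decomposition concrete, here is a small simulation in a toy Poisson model (my own illustrative choice of model, sample sizes, and seed), tracking each term separately: the WLLN term $|I(\theta^*) - g_N(\theta^*)|$ and the equicontinuity term $|g_N(\theta_{MLE}) - g_N(\theta^*)|$.

```python
import numpy as np

rng = np.random.default_rng(42)
theta0 = 3.0  # true Poisson mean (illustrative choice)

# In a Poisson(theta) model: g_N(theta) := N^{-1} J_N(theta) = ybar / theta^2,
# I(theta) = 1/theta, and the MLE is theta_hat = ybar.
for N in [100, 10_000, 1_000_000]:
    ybar = rng.poisson(lam=theta0, size=N).mean()
    term1 = abs(1 / theta0 - ybar / theta0**2)  # |I(theta*) - g_N(theta*)|  (WLLN)
    term2 = abs(1 / ybar - ybar / theta0**2)    # |g_N(theta_hat) - g_N(theta*)|  (SE)
    print(f"N={N:>9}: WLLN term {term1:.5f}, SE term {term2:.5f}")
```

Both terms shrink toward zero as $N$ grows, mirroring the two convergence claims above.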

Now, the key question of course is: what sort of conditions do you need to impose on $\mathcal{G}$ to get SE?

It looks like one way to do this is to establish a Lipschitz condition on the entire class of functions $\mathcal{G}$ (see here: http://econ.duke.edu/uploads/media_items/uniform-convergence-and-stochastic-equicontinuity.original.pdf ).
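Sketched informally (see the linked notes for a precise statement), the condition is a stochastic Lipschitz bound in a neighbourhood of $\theta^*$:
$$|g_N(\theta_1) - g_N(\theta_2)| \le B_N \, \|\theta_1 - \theta_2\|, \qquad B_N = O_P(1),$$
which implies that $\mathcal{G}$ is stochastically equicontinuous at $\theta^*$. In the observed-information setting this is typically delivered by the classical regularity condition that the third derivatives of $\log f$ are dominated by an integrable envelope $M$, since then one can take $B_N = N^{-1}\sum_{i=1}^N M(Y_i) \convp \mathbb{E}[M(\tilde{Y})] < \infty$ by the WLLN.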

**Attribution**

*Source: Link, Question Author: dandar, Answer Author: Dapz*