# Statistical Inference Under Misspecification

The classical treatment of statistical inference relies on the assumption that a correctly specified statistical model is used. That is, the distribution $\mathbb{P}^*(Y)$ that generated the observed data $y$ is part of the statistical model $\mathcal{M}$:

$$\mathbb{P}^*(Y) \in \mathcal{M}.$$

However, in most situations we cannot assume that this is really true. I wonder what happens to statistical inference procedures if we drop the assumption of correct specification.

I have found some work by White (1982) on ML estimates under misspecification. It argues that the maximum likelihood estimator is a consistent estimator for the distribution within the statistical model that is closest, in KL divergence, to the true distribution $\mathbb{P}^*$.
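A small simulation can make this concrete. The sketch below is only an illustration under assumptions that are not part of White's paper: the data come from a Gamma(shape 2, scale 1) distribution while the fitted model is exponential, and the function names are hypothetical. For this pair the KL-minimizing rate works out to $1/\mathbb{E}[Y] = 0.5$, and the maximum likelihood estimate of the rate is one over the sample mean.

```python
import random

def exponential_mle(sample):
    """MLE of the rate of an Exponential(rate) model: 1 / sample mean."""
    return len(sample) / sum(sample)

def simulate_rate_estimate(n, seed=0):
    """Fit the (misspecified) exponential model to Gamma(shape=2, scale=1) data."""
    rng = random.Random(seed)
    sample = [rng.gammavariate(2.0, 1.0) for _ in range(n)]
    return exponential_mle(sample)

# KL(p_e || Exp(rate)) = const - log(rate) + rate * E[Y], minimized at rate = 1 / E[Y].
# Here E[Y] = shape * scale = 2, so the KL-minimizing ("pseudo-true") rate is 0.5.
PSEUDO_TRUE_RATE = 0.5

# The estimates should settle near PSEUDO_TRUE_RATE as n grows,
# even though no exponential distribution generated the data.
for n in (100, 10_000, 200_000):
    print(n, round(simulate_rate_estimate(n), 4))
```

No exponential distribution equals the Gamma(2, 1) distribution, so there is no true rate here; the estimator still has a well-defined limit.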

What happens to confidence set estimators? Let's recapitulate confidence set estimators. Let $\delta:\Omega_Y \rightarrow 2^\Theta$ be a set estimator, where $\Omega_Y$ is the sample space and $2^\Theta$ the power set over the parameter space $\Theta$. Write $\mathbb{P}_\theta$ for the member of $\mathcal{M}$ indexed by $\theta \in \Theta$. What we would like to know is the probability of the event that the sets produced by $\delta$ include the true distribution $\mathbb{P}^*$, that is

$$\mathbb{P}^*\big(\mathbb{P}^* \in \{\mathbb{P}_\theta : \theta \in \delta(Y)\}\big) =: A.$$

However, we of course don’t know the true distribution $\mathbb{P}^*$. The assumption of correct specification tells us that $\mathbb{P}^* \in \mathcal{M}$, but we still don’t know which distribution of the model it is. But

$$\inf_{\theta \in \Theta} \mathbb{P}_\theta(\theta \in \delta(Y)) =: B$$

is a lower bound for the probability $A$. Equation $B$ is the classical definition of the confidence level for a confidence set estimator.

If we drop the assumption of correct specification, $B$ is no longer necessarily a lower bound for $A$, the term that we are actually interested in. Indeed, if we assume that the model is misspecified, which is arguably the case in most realistic situations, $A$ is 0, because the true distribution $\mathbb{P}^*$ is not contained in the statistical model $\mathcal{M}$.

From another perspective, one could ask what $B$ relates to when the model is misspecified. This is a more specific question: does $B$ still have a meaning if the model is misspecified? If not, why are we even bothering with parametric statistics?

I guess White (1982) contains some results on these issues. Unfortunately, my lack of mathematical background keeps me from understanding much of what is written there.

Let $y_1, \ldots, y_n$ be the observed data, which is presumed to be a realization of a sequence of i.i.d. random variables $Y_1, \ldots, Y_n$ with common probability density function $p_e$ defined with respect to a sigma-finite measure $\nu$. The density $p_e$ is called the Data Generating Process (DGP) density.

The researcher’s probability model
$\mathcal{M} \equiv \{ p(y ; \theta) : \theta \in \Theta \}$ is a collection
of probability density functions which are indexed by a parameter vector
$\theta$. Assume each density in $\mathcal{M}$ is defined with respect to
a common sigma-finite measure $\nu$ (e.g., each density could be a probability
mass function with the same sample space $S$).

It is important to keep the density $p_e$ which actually generated the
data conceptually distinct from the probability model of the data. In
classical statistical treatments a careful separation of these concepts
is either ignored, not made, or it is assumed right from the beginning
that the probability model is correctly specified.

A correctly specified model $\mathcal{M}$ with respect to $p_e$ is defined
as a model that contains a density equal to $p_e$ $\nu$-almost everywhere. The model
$\mathcal{M}$ is misspecified with respect to $p_e$ when it is not
correctly specified, that is, when no density in $\mathcal{M}$ equals $p_e$ $\nu$-almost everywhere.

If the probability model is correctly specified, then there exists
a $\theta^*$ in the parameter space $\Theta$ such that
$p_e(y) = p(y ; \theta^*)$ $\nu$-almost everywhere. Such a parameter
vector is called the “true parameter vector”. If the probability model
is misspecified, then the true parameter vector does not exist.

Within White’s model misspecification framework the goal is to find the parameter estimate $\hat{\theta}_n$ that minimizes the average negative log-likelihood
$\hat{\ell}_n(\theta) \equiv -(1/n) \sum_{i=1}^n \log p(y_i ; \theta)$ over some compact parameter space $\Theta$. It is assumed that
a unique strict global minimizer, $\theta^*$, of the
expected value of $\hat{\ell}_n$ on $\Theta$ is located in the interior of $\Theta$. In the lucky case where the probability model is correctly specified, $\theta^*$ may be interpreted as the “true parameter value”.
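To see why this minimizer is the member of the model closest to $p_e$ in KL divergence, a short derivation may help. It is a sketch that assumes the integrals below are finite; the symbols $\ell(\theta)$ for the expected negative log-likelihood and $\mathrm{KL}$ for the divergence are notation introduced here for illustration.

$$
\ell(\theta) \equiv -\int p_e(y) \log p(y ; \theta)\, d\nu(y)
= \underbrace{\int p_e(y) \log \frac{p_e(y)}{p(y ; \theta)}\, d\nu(y)}_{\mathrm{KL}(p_e \,\|\, p(\cdot\,; \theta))}
\; \underbrace{-\int p_e(y) \log p_e(y)\, d\nu(y)}_{\text{entropy of } p_e,\ \text{free of } \theta}
$$

The second term does not depend on $\theta$, so minimizing $\ell(\theta)$ over $\Theta$ is the same as minimizing $\mathrm{KL}(p_e \,\|\, p(\cdot\,; \theta))$. When the model is correctly specified the minimum KL divergence is zero and is attained at the true parameter vector.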

If we do not know for certain that the probability model
is correctly specified, then $\hat{\theta}_n$ is called a quasi-maximum
likelihood estimate and the goal is to estimate $\theta^*$.
If we get lucky and the probability model is
correctly specified, then the quasi-maximum likelihood estimate reduces, as
a special case, to the familiar maximum likelihood estimate and
$\theta^*$ becomes the true parameter value.

Consistency within White’s (1982) framework corresponds to convergence
to $\theta^*$ without requiring that $\theta^*$ is necessarily the true
parameter vector. Within White’s framework, we would never estimate
the probability of the event that the sets produced by $\delta$ include the TRUE distribution $\mathbb{P}^*$. Instead, we would always estimate the probability $\mathbb{P}^{**}$ of the event that the sets
produced by $\delta$ include the distribution specified by the density
$p(y ; \theta^*)$.

Finally, a few comments about model misspecification. It is easy to find
examples where a misspecified model is extremely useful and very predictive.
For example, consider a nonlinear (or even a linear) regression model
with a Gaussian residual error term whose variance is extremely small
yet the actual residual error in the environment is not Gaussian.
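As a rough illustration of this point, the sketch below fits a straight line by ordinary least squares (which coincides with maximum likelihood under Gaussian errors) to data whose noise is actually uniform. The data generating process, the coefficients 1 and 2, and the helper name `fit_line` are all hypothetical choices made for this example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (the Gaussian-error MLE of a and b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(1000)]
# Hypothetical DGP: an exact line plus small *uniform* (non-Gaussian) noise,
# so the Gaussian-error model is misspecified.
ys = [1.0 + 2.0 * x + rng.uniform(-0.05, 0.05) for x in xs]

intercept, slope = fit_line(xs, ys)
# Largest absolute in-sample prediction error of the fitted line.
worst_error = max(abs(intercept + slope * x - y) for x, y in zip(xs, ys))
print(intercept, slope, worst_error)
```

The fitted line predicts well even though the assumed error distribution is wrong, because the noise is small relative to the signal.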

It is also easy to find examples where a correctly specified model
is not useful and not predictive. For example, consider a random walk
model for predicting stock prices which predicts tomorrow’s closing
price is a weighted sum of today’s closing price and some Gaussian
noise with an extremely large variance.

The purpose of the model misspecification framework is not to ensure model
validity but rather to ensure reliability. That is, it ensures that the sampling error associated with your parameter estimates, confidence intervals, hypothesis tests, and so on is correctly estimated despite the presence of either a small or large amount of model misspecification. The quasi-maximum likelihood
estimates are asymptotically normal, centered at $\theta^*$, with a covariance matrix estimator which depends upon both the first and second derivatives of the negative log-likelihood function. In the special case where you get lucky and the model is correct, all of the formulas reduce to the familiar classical statistical framework where the goal is to estimate the “true” parameter values.
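The covariance in question is usually written in "sandwich" form, $H^{-1} J H^{-1} / n$, where $H$ is the expected second derivative and $J$ the expected squared first derivative of the per-observation negative log-likelihood, both evaluated at $\theta^*$. The simulation below is a minimal sketch of what this buys in practice, under assumptions chosen only for illustration: an exponential model fitted to Gamma(shape 0.5, scale 2) data, so that the pseudo-true rate is 1 and the model-implied variance of $Y$ is half the actual one. The names `hess`, `score_var`, `intervals`, and `coverage` are hypothetical.

```python
import math
import random

Z95 = 1.959964  # two-sided 95% standard normal quantile

def intervals(sample):
    """Naive and sandwich 95% Wald intervals for the rate of an exponential model."""
    n = len(sample)
    mean = sum(sample) / n
    rate = 1.0 / mean  # quasi-MLE of the rate
    # Per-observation negative log-likelihood: -log(rate) + rate * y.
    hess = 1.0 / rate ** 2  # second derivative, evaluated at the estimate
    # First derivative at the estimate is y - 1/rate = y - mean; average its square.
    score_var = sum((y - mean) ** 2 for y in sample) / n
    naive_se = math.sqrt(1.0 / hess / n)                 # classical: H^-1 / n
    sandwich_se = math.sqrt(score_var / hess ** 2 / n)   # robust: H^-1 J H^-1 / n
    return ((rate - Z95 * naive_se, rate + Z95 * naive_se),
            (rate - Z95 * sandwich_se, rate + Z95 * sandwich_se))

def coverage(reps=1000, n=500, seed=0):
    """Fraction of simulated intervals that contain the pseudo-true rate."""
    rng = random.Random(seed)
    target = 1.0  # pseudo-true rate: 1 / E[Y], with E[Y] = 0.5 * 2.0
    hits_naive = hits_sandwich = 0
    for _ in range(reps):
        sample = [rng.gammavariate(0.5, 2.0) for _ in range(n)]
        (lo_n, hi_n), (lo_s, hi_s) = intervals(sample)
        hits_naive += lo_n <= target <= hi_n
        hits_sandwich += lo_s <= target <= hi_s
    return hits_naive / reps, hits_sandwich / reps

cov_naive, cov_sandwich = coverage()
print(cov_naive, cov_sandwich)
```

In this setup the classical interval should cover $\theta^*$ noticeably less often than its nominal 95%, while the sandwich interval should come much closer. If the data were truly exponential, $J$ would equal $H$ and the two intervals would agree asymptotically.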