The classical treatment of statistical inference relies on the assumption that a correctly specified statistical model is used. That is, the distribution \mathbb{P}^*(Y) that generated the observed data y is part of the statistical model \mathcal{M}:

\mathbb{P}^*(Y) \in \mathcal{M} = \{P_\theta(Y) : \theta \in \Theta\}

However, in most situations we cannot assume that this is really true. I wonder what happens to statistical inference procedures if we drop the correct-specification assumption. I have found some work by White (1982) on ML estimates under misspecification. There it is argued that the maximum likelihood estimator is a consistent estimator for

P_{\theta_1} = \arg\min_{P_\theta \in \mathcal{M}} KL(\mathbb{P}^* \| P_\theta),

the distribution within the statistical model that minimizes the KL divergence to the true distribution \mathbb{P}^*.
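As a concrete illustration of this KL-minimizing ("pseudo-true") distribution, here is a minimal Python sketch (the Exponential DGP, the Gaussian working model, and the sample size are my own illustrative choices, not from White's paper): when a Normal(\mu, \sigma^2) model is fit to Exponential(1) data, the Gaussian MLE converges to \mu = E[Y] = 1 and \sigma^2 = Var(Y) = 1, the parameters of the KL-closest Gaussian, even though no Gaussian generated the data.

```python
# Sketch, assuming: true DGP is Exponential(1), working model is Normal(mu, sigma^2).
# The Gaussian MLE (sample mean and variance) should converge to the
# KL-minimizing pseudo-true parameters mu = E[Y] = 1 and sigma^2 = Var(Y) = 1.
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=200_000)  # draws from the true (non-Gaussian) DGP

mu_hat = y.mean()      # Gaussian MLE for mu
sigma2_hat = y.var()   # Gaussian MLE for sigma^2

print(mu_hat, sigma2_hat)  # both should be close to the pseudo-true values (1, 1)
```

The model is misspecified, yet the estimator still converges to something well defined: the best Gaussian approximation of the exponential DGP in KL divergence.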

What happens to confidence set estimators? Let's recapitulate confidence set estimators. Let \delta:\Omega_Y \rightarrow 2^\Theta be a set estimator, where \Omega_Y is the sample space and 2^\Theta the power set over the parameter space \Theta. What we would like to know is the probability of the event that the sets produced by \delta include the true distribution \mathbb{P}^*, that is \mathbb{P}^*(\mathbb{P}^* \in \{P_\theta : \theta \in \delta(Y)\}):=A.

However, we of course don’t know the true distribution \mathbb{P}^*. The correct-specification assumption tells us that \mathbb{P}^* \in \mathcal{M}. However, we still don’t know which distribution of the model it is. But \inf_{\theta \in \Theta} \mathbb{P}_\theta(\theta \in \delta(Y)):=B is a lower bound for the probability A. Expression B is the classical definition of the confidence level for a confidence set estimator.

If we drop the correct-specification assumption, B is not necessarily a lower bound for A, the term that we are actually interested in, anymore. Indeed, if we assume that the model is misspecified, which is arguably the case in most realistic situations, A is 0, because the true distribution \mathbb{P}^* is not contained within the statistical model \mathcal{M}.

From another perspective, one could ask what B relates to when the model is misspecified. This is a more specific question.

Does B still have a meaning if the model is misspecified? If not, why are we even bothering with parametric statistics? I guess White (1982) contains some results on these issues. Unfortunately, my lack of mathematical background keeps me from understanding much of what is written there.

**Answer**

Let y_1, \ldots, y_n be the observed data, which is presumed to be a realization of a sequence of i.i.d. random variables Y_1, \ldots, Y_n with common probability density function p_e defined with respect to a sigma-finite measure \nu. The density p_e is called the Data Generating Process (DGP) density.

The researcher’s probability model {\cal M} \equiv \{ p(y ; \theta) : \theta \in \Theta \} is a collection of probability density functions indexed by a parameter vector \theta. Assume each density in {\cal M} is defined with respect to a common sigma-finite measure \nu (e.g., each density could be a probability mass function with the same sample space S).

It is important to keep the density p_e which actually generated the data conceptually distinct from the probability model of the data. In classical statistical treatments a careful separation of these concepts is either not made, or it is assumed right from the beginning that the probability model is correctly specified.

A model {\cal M} is said to be correctly specified with respect to p_e if {\cal M} contains a density equal to p_e \nu-almost everywhere. When no such density exists in {\cal M}, the model {\cal M} is misspecified with respect to p_e.

If the probability model is correctly specified, then there exists a \theta^* in the parameter space \Theta such that p_e(y) = p(y ; \theta^*) \nu-almost everywhere. Such a parameter vector is called the “true parameter vector”. If the probability model is misspecified, then the true parameter vector does not exist.

Within White’s model misspecification framework the goal is to find the parameter estimate \hat{\theta}_n that minimizes the average negative log-likelihood

\hat{\ell}_n(\theta) \equiv -(1/n) \sum_{i=1}^n \log p(y_i ; \theta)

over some compact parameter space \Theta. It is assumed that a unique strict global minimizer, \theta^*, of the expected value of \hat{\ell}_n exists and is located in the interior of \Theta. In the lucky case where the probability model is correctly specified, \theta^* may be interpreted as the “true parameter value”.
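This minimization can be sketched in a few lines of Python (the lognormal DGP, the exponential working model, and the crude grid optimizer are illustrative choices of mine, not from White's paper): for an Exponential(\theta) working model the average negative log-likelihood is -\log\theta + \theta\bar{y}, whose minimizer is 1/\bar{y}, so the numerical minimizer over a compact grid should agree with the analytic one.

```python
# Sketch, assuming: true DGP is LogNormal(0, 0.5), working model is Exponential(theta)
# with density p(y; theta) = theta * exp(-theta * y). The quasi-MLE minimizes the
# average negative log-likelihood over a compact parameter space.
import numpy as np

rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)  # data from the (non-exponential) DGP

def nll(theta):
    # average negative log-likelihood of the Exponential(theta) working model
    return -np.log(theta) + theta * y.mean()

# crude grid minimization over the compact parameter space [0.01, 5]
grid = np.linspace(0.01, 5.0, 50_000)
theta_hat = grid[np.argmin(nll(grid))]

print(theta_hat, 1.0 / y.mean())  # numerical and analytic minimizers should agree
```

Here \theta^* = 1/E[Y] is the pseudo-true value: the exponential distribution closest in KL divergence to the lognormal DGP.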

In the special case where the probability model is correctly specified, \hat{\theta}_n is the familiar maximum likelihood estimate. If we do not have absolute knowledge that the probability model is correctly specified, then \hat{\theta}_n is called a quasi-maximum likelihood estimate and the goal is to estimate \theta^*.

If we get lucky and the probability model is correctly specified, then the quasi-maximum likelihood estimate reduces as a special case to the familiar maximum likelihood estimate and \theta^* becomes the true parameter value.

Consistency within White’s (1982) framework corresponds to convergence to \theta^*, without requiring that \theta^* is necessarily the true parameter vector. Within White’s framework, we would never estimate the probability of the event that the sets produced by \delta include the TRUE distribution \mathbb{P}^*. Instead, we would estimate the probability of the event that the sets produced by \delta include the distribution specified by the density p(y ; \theta^*).
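To see what this means for the coverage quantity B, here is a small Monte Carlo sketch (all distributional choices, sample sizes, and the interval construction are my own illustrative assumptions, not from the source): Wald intervals for the exponential-rate quasi-MLE, built with White's sandwich standard error, should cover the pseudo-true value \theta^* = 1/E[Y] at roughly the nominal rate, even though the working model never contains the lognormal DGP.

```python
# Sketch, assuming: true DGP is LogNormal(0, 0.5), working model is Exponential(theta).
# For this DGP, E[Y] = exp(0.125), so the pseudo-true value is theta* = exp(-0.125).
# We check that sandwich-based 95% Wald intervals cover theta* (not any "true" theta).
import numpy as np

rng = np.random.default_rng(3)
n, reps = 2000, 2000
theta_star = np.exp(-0.125)  # pseudo-true parameter 1 / E[Y]

hits = 0
for _ in range(reps):
    y = rng.lognormal(0.0, 0.5, size=n)
    theta_hat = 1.0 / y.mean()          # quasi-MLE for the exponential rate
    score = y - 1.0 / theta_hat         # d/dtheta of -log p(y; theta) at theta_hat
    A = 1.0 / theta_hat**2              # mean second derivative of -log p
    B = np.mean(score**2)               # mean squared score
    se = np.sqrt(B / (A**2 * n))        # White's sandwich standard error
    if abs(theta_hat - theta_star) < 1.96 * se:
        hits += 1

coverage = hits / reps
print(coverage)  # empirical coverage of theta*; asymptotically near the nominal 0.95
```

The interval covers the pseudo-true \theta^*, not the true distribution; coverage of \mathbb{P}^* itself is 0 here, exactly as argued in the question.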

Finally, a few comments about model misspecification. It is easy to find examples where a misspecified model is extremely useful and very predictive. For example, consider a nonlinear (or even a linear) regression model with a Gaussian residual error term whose variance is extremely small, yet the actual residual error in the environment is not Gaussian.

It is also easy to find examples where a correctly specified model is not useful and not predictive. For example, consider a random walk model for predicting stock prices which predicts that tomorrow’s closing price is a weighted sum of today’s closing price and some Gaussian noise with an extremely large variance.

The purpose of the model misspecification framework is not to ensure model validity but rather to ensure reliability. That is, to ensure that the sampling error associated with your parameter estimates, confidence intervals, hypothesis tests, and so on is correctly estimated despite the presence of either a small or large amount of model misspecification. The quasi-maximum likelihood estimates are asymptotically normal, centered at \theta^*, with a covariance matrix estimator that depends upon both the first and second derivatives of the negative log-likelihood function. In the special case where you get lucky and the model is correct, all of the formulas reduce to the familiar classical statistical framework where the goal is to estimate the “true” parameter values.
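A minimal sketch of that covariance estimator for a one-parameter model (the lognormal DGP and exponential working model are my own illustrative choices): the sandwich variance A^{-1} B A^{-1} / n combines A, the mean second derivative of the negative log-likelihood, with B, the mean squared score; under correct specification A = B and it collapses to the classical inverse-Fisher variance 1/(An).

```python
# Sketch, assuming: true DGP is LogNormal(0, 0.5), working model is Exponential(theta).
# Compare the classical (model-trusting) variance 1/(A*n) with White's robust
# sandwich variance B/(A^2 * n); under misspecification they differ because A != B.
import numpy as np

rng = np.random.default_rng(2)
y = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)
n = y.size

theta_hat = 1.0 / y.mean()        # quasi-MLE for the exponential rate
score = y - 1.0 / theta_hat       # first derivative of -log p(y; theta) at theta_hat
A = 1.0 / theta_hat**2            # mean second derivative of -log p at theta_hat
B = np.mean(score**2)             # mean squared score at theta_hat

var_naive = 1.0 / (A * n)         # classical variance, valid only if the model is correct
var_sandwich = B / (A**2 * n)     # White's sandwich variance, valid under misspecification
print(var_naive, var_sandwich)
```

The gap between the two variance estimates is itself diagnostic: the information matrix equality A = B holds under correct specification, which is the basis of White's information matrix test.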

**Attribution**
*Source: Link, Question Author: Julian Karch, Answer Author: RMG*