# How to interpret the Delta Method?

I’m reading through https://www.statlect.com/asymptotic-theory/delta-method, which defines the Delta Method as:

The delta method is a method that allows us to derive, under
appropriate conditions, the asymptotic distribution of
$g(\hat{\theta}_n)$ from the asymptotic distribution of
$\hat{\theta}_n$.

and one example says, in short:

A sequence of estimators $\hat{\theta}_n$ is asymptotically normal with mean $1$
and variance $1$. We want to derive the asymptotic distribution of the
sequence $\hat{\theta}_n^2$.

And the solution is:

$$\sqrt{n}(\hat{\theta}_n^2-1) \xrightarrow{D} N(0,4)$$

1. How do I interpret this result? This doesn’t tell me the distribution of $\hat{\theta}_n^2$, instead it tells me the distribution of a shifted and scaled version of it.
2. The steps to arrive at the solution suggest the variance of $\hat{\theta}_n^2$ is 4, and they just plugged it into the $N(0,4)$ above. If this is true, how come the variance of $\hat{\theta}_n^2$ is the variance of $\sqrt{n}(\hat{\theta}_n^2-1)$ ?

## Some intuition behind the delta method:

The Delta method can be seen as combining two ideas:

1. Continuous, differentiable functions can be approximated locally by an affine transformation.
2. An affine transformation of a multivariate normal random variable is multivariate normal.

The 1st idea is from calculus, the 2nd is from probability. The loose intuition / argument goes:

• The input random variable $$\tilde{\boldsymbol{\theta}}_n$$ is asymptotically normal (by assumption or by application of a central limit theorem in the case where $$\tilde{\boldsymbol{\theta}}_n$$ is a sample mean).

• The smaller the neighborhood, the more $$\mathbf{g}(\mathbf{x})$$ looks like an affine transformation, that is, the more the function looks like a hyperplane (or a line in the 1 variable case).

• Where that linear approximation applies (and some regularity conditions hold), the multivariate normality of $$\tilde{\boldsymbol{\theta}}_n$$ is preserved when function $$\mathbf{g}$$ is applied to $$\tilde{\boldsymbol{\theta}}_n$$.

• Note that function $$\mathbf{g}$$ has to satisfy certain conditions for this to be true. Normality isn’t preserved in the neighborhood around $$x=0$$ for $$g(x) = x^2$$ because you’ll basically get both halves of the bell curve mapped to the same side: both $$x=-2$$ and $$x=2$$ get mapped to $$y=4$$. You need $$g$$ strictly increasing or decreasing in the neighborhood so that this doesn’t happen.

### Idea 1: Locally, any continuous, differentiable function looks affine

An idea of calculus is if you zoom in enough on a continuous, differentiable function, it will look like a line (or hyperplane in the multivariate case). If we have some vector valued function $$\mathbf{g}(\mathbf{x})$$, in a small enough neighborhood around $$\mathbf{c}$$ you can approximate $$\mathbf{g}(\mathbf{c} + \boldsymbol{\epsilon})$$ with the below affine function of $$\boldsymbol{\epsilon}$$:

$$\mathbf{g}(\mathbf{c} + \boldsymbol{\epsilon}) \approx \mathbf{g}(\mathbf{c}) + \frac{\partial \mathbf{g}(\mathbf{c})}{\partial \mathbf{x}'} \;\boldsymbol{\epsilon}$$
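A quick numerical sketch of this approximation (the function $$\mathbf{g}$$, the point $$\mathbf{c}$$, and the perturbation below are made up for illustration):

```python
import numpy as np

def g(x):
    """Hypothetical vector-valued function g: R^2 -> R^2."""
    return np.array([x[0] ** 2 + x[1], np.sin(x[0]) * x[1]])

def jacobian_g(x):
    """Jacobian dg/dx' of g, computed analytically."""
    return np.array([
        [2 * x[0], 1.0],
        [np.cos(x[0]) * x[1], np.sin(x[0])],
    ])

c = np.array([1.0, 2.0])
eps = np.array([1e-3, -2e-3])

exact = g(c + eps)
affine = g(c) + jacobian_g(c) @ eps  # g(c) + [dg/dx'] eps

# The error is second order in eps, so it is far smaller than eps itself
err = np.max(np.abs(exact - affine))
print(err)
```

Shrinking `eps` by a factor of 10 shrinks the error by roughly a factor of 100, which is exactly what "locally affine" means.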

### Idea 2: An affine transformation of a multivariate normal random variable is multivariate normal

Let’s say we have $$\tilde{\boldsymbol{\theta}}$$ distributed multivariate normal with mean $$\boldsymbol{\mu}$$ and variance $$V$$. That is:
$$\tilde{\boldsymbol{\theta}} \sim \mathcal{N}\left( \boldsymbol{\mu}, V\right)$$

Consider a linear transformation $$A$$ and the multivariate normal random variable it defines, $$A\tilde{\boldsymbol{\theta}}$$. It’s easy to show:
$$A\tilde{\boldsymbol{\theta}} - A\boldsymbol{\mu} \sim \mathcal{N}\left(\mathbf{0}, AVA'\right)$$
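This is easy to check by simulation; a minimal sketch with made-up $$\boldsymbol{\mu}$$, $$V$$, and $$A$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mean, covariance, and linear transformation for illustration
mu = np.array([1.0, -1.0])
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

# Draw theta ~ N(mu, V) and form A theta - A mu for each draw
theta = rng.multivariate_normal(mu, V, size=200_000)
z = theta @ A.T - A @ mu

print(np.round(z.mean(axis=0), 2))  # approx [0, 0]
print(np.round(np.cov(z.T), 1))     # approx A V A'
print(A @ V @ A.T)                  # the theoretical covariance A V A'
```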

### Putting it together:

If we know that $$\tilde{\boldsymbol{\theta}} \sim \mathcal{N}\left( \boldsymbol{\mu}, V\right)$$ and that function $$\mathbf{g}(\mathbf{x})$$ can be approximated around $$\boldsymbol{\mu}$$ by $$\mathbf{g}(\boldsymbol{\mu}) + \frac{\partial \mathbf{g}(\boldsymbol{\mu})}{\partial \mathbf{x}'} \;\boldsymbol{\epsilon}$$ with $$\boldsymbol{\epsilon} = \tilde{\boldsymbol{\theta}} - \boldsymbol{\mu}$$, then putting ideas (1) and (2) together:

$$\mathbf{g}\left( \tilde{\boldsymbol{\theta}} \right) - \mathbf{g}(\boldsymbol{\mu}) \sim \mathcal{N} \left( \mathbf{0}, \frac{\partial \mathbf{g}(\boldsymbol{\mu})}{\partial \mathbf{x}'} \, V \, \left( \frac{\partial \mathbf{g}(\boldsymbol{\mu})}{\partial \mathbf{x}'} \right)' \right)$$
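The combined claim can also be sketched numerically. Here $$\mathbf{g}$$, $$\boldsymbol{\mu}$$, and $$V$$ are made up for illustration, and the variance is shrunk by $$1/n$$ so that the draws concentrate where the affine approximation is accurate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up mean and covariance; theta ~ N(mu, V/n) concentrates near mu,
# where the affine approximation of g is good.
mu = np.array([1.0, 2.0])
V = np.array([[1.0, 0.3],
              [0.3, 0.5]])
n = 10_000
theta = rng.multivariate_normal(mu, V / n, size=100_000)

def g(x):
    """Illustrative g: R^2 -> R^2, g(x) = (x0 * x1, x0^2)."""
    return np.stack([x[..., 0] * x[..., 1], x[..., 0] ** 2], axis=-1)

# Jacobian dg/dx' evaluated at mu
J = np.array([[mu[1], mu[0]],
              [2 * mu[0], 0.0]])

# Scale by sqrt(n) as in the delta method; the sample covariance of z
# should approach J V J'
z = np.sqrt(n) * (g(theta) - g(mu))
print(np.round(np.cov(z.T), 1))
print(J @ V @ J.T)
```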

### What can go wrong?

We have a problem doing this if any component of $$\frac{\partial \mathbf{g}(\mathbf{c})}{\partial \mathbf{x}'}$$ is zero (e.g. $$g(x) = x^2$$ at $$x=0$$). We need $$g$$ strictly increasing or decreasing in the region where $$\tilde{\boldsymbol{\theta}}_n$$ has probability mass.

This is also going to be a bad approximation if $$g$$ doesn’t look like an affine function in the region where $$\tilde{\boldsymbol{\theta}}_n$$ has probability mass.

It may also be a bad approximation if $$\tilde{\boldsymbol{\theta}}_n$$ isn’t normal.
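The zero-derivative failure in the $$g(x)=x^2$$, $$\mu=0$$ case is easy to see by simulation (a sketch; the sample-mean setup here is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of the zero-derivative failure: if mu = 0 and g(x) = x^2, then
# g'(mu) = 0 and the delta method's normal limit degenerates. In fact
# n * xbar^2 = (sqrt(n) * xbar)^2 converges to a chi-square(1), not a normal.
n, reps = 1_000, 200_000
# The mean of n iid N(0, 1) draws is exactly N(0, 1/n), so draw it directly
xbar = rng.normal(0.0, 1.0 / np.sqrt(n), size=reps)
z = n * xbar ** 2

print(round(z.mean(), 1))  # approx 1, the chi-square(1) mean
print(round(z.var(), 1))   # approx 2, the chi-square(1) variance
print((z < 0).mean())      # 0.0: the limit is nonnegative, so it cannot be normal
```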

# This problem:

$$g(x) = x^2 \quad \quad g'(x) = 2 x$$

If $$\sqrt{n}\left( \tilde{\theta}_n - \mu \right) \xrightarrow{d} \mathcal{N}(0, 1)$$, then applying the delta method you get:

$$\sqrt{n}\left( \tilde{\theta}_n^2 - \mu^2 \right) \xrightarrow{d} \mathcal{N}\left(0, \left( g'(\mu) \right)^2 \cdot 1 \right) = \mathcal{N}\left(0, 4\mu^2\right)$$

With $$\mu = 1$$ this is $$\mathcal{N}(0, 4)$$, matching the result quoted in the question.
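A Monte Carlo check of the $$\mathcal{N}(0,4)$$ limit from the question (a sketch; taking $$\tilde{\theta}_n$$ to be the mean of $$n$$ iid $$\mathcal{N}(1,1)$$ draws is an assumption made for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of sqrt(n) * (theta_hat_n^2 - 1) -> N(0, 4).
# The mean of n iid N(1, 1) draws is exactly N(1, 1/n), so draw it directly;
# then sqrt(n) * (theta_hat_n - 1) ~ N(0, 1) as assumed.
n, reps = 2_000, 200_000
theta_hat = rng.normal(1.0, 1.0 / np.sqrt(n), size=reps)
z = np.sqrt(n) * (theta_hat ** 2 - 1.0)

print(round(z.mean(), 1))  # approx 0
print(round(z.var(), 1))   # approx 4
```

This also speaks to question 2 in the post: the $4$ is the variance of the scaled-and-centered quantity $$\sqrt{n}(\hat{\theta}_n^2 - 1)$$, not of $$\hat{\theta}_n^2$$ itself — the variance of $$\hat{\theta}_n^2$$ is approximately $$4/n$$ and shrinks to zero as $$n$$ grows.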