Background:

I’m trying to follow Princeton’s review of MLE estimation for GLM.

I understand the basics of MLE estimation: the `likelihood`, the `score`, the observed and expected `Fisher information`, and the `Fisher scoring` technique. And I know how to justify simple linear regression with MLE estimation.

The question:

I can’t understand even the first line of this method 🙁

What’s the intuition behind the $z_i$ working variables defined as:

$$ z_i = \hat\eta_i + (y_i -\hat\mu_i)\frac{d\eta_i}{d\mu_i}$$

Why are they used instead of $y_i$ to estimate $\beta$?

And what’s their relation with the `response/link function`, which is the connection between $\eta$ and $\mu$? If anyone has a simple explanation, or can direct me to a more basic-level text about this, I would be grateful.

**Answer**

Some years ago I wrote a paper about this for my students (in Spanish), so I can try to rewrite those explanations here. I will look at IRLS (iteratively reweighted least squares) through a series of examples of increasing complexity. For the first example we need the concept of a location-scale family. Let $f_0$ be a density function centered at zero in some sense. We can construct a family of densities by defining

$$

f(x)= f(x;\mu,\sigma)= \frac{1}{\sigma} f_0\left(\frac{x-\mu}{\sigma}\right)

$$

where $\sigma > 0$ is a scale parameter and $\mu$ is a location parameter. In the measurement error model, where the error term is usually modeled as a normal distribution, we can use a location-scale family as constructed above in place of that normal distribution. When $f_0$ is the standard normal density, the construction above gives the $\text{N}(\mu, \sigma^2)$ family.
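For instance (an illustration of the construction, not from the original text), taking $f_0$ to be the standard Cauchy density gives the Cauchy location-scale family:

$$
f_0(\epsilon) = \frac{1}{\pi}\,\frac{1}{1+\epsilon^2}
\quad\Longrightarrow\quad
f(x;\mu,\sigma) = \frac{1}{\pi\sigma}\,\frac{1}{1+\left(\frac{x-\mu}{\sigma}\right)^2},
$$

which is the family used in the examples below.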

Now we will use IRLS on some simple examples. First we will find the ML (maximum likelihood) estimators in the model

$$

Y_1,Y_2,\ldots,Y_n \hspace{1em} \text{i.i.d.}

$$

with the density

$$

f(y)= \frac{1}{\pi} \frac{1}{1+(y-\mu)^2},\hspace{1em} y\in{\mathbb R},

$$

which is the Cauchy distribution with location parameter $\mu$ (so this is a location family). But first some notation. The weighted least squares estimator of $\mu$ is given by

$$

\mu^{\ast} = \frac{\sum_{i=1}^n w_i y_i}

{\sum_{i=1}^n w_i}.

$$

where the $w_i$ are some weights. We will see that the ML estimator of $\mu$ can be expressed in the same form, with $w_i$ some function of the residuals

$$

\epsilon_i = y_i-\hat{\mu}.

$$

The likelihood function is given by

$$

L(y;\mu)= \left(\frac{1}{\pi}\right)^n \prod_{i=1}^n \frac{1}{1+(y_i-\mu)^2}

$$

and the loglikelihood function is given by

$$

l(y)= -n \log(\pi) - \sum_{i=1}^n \log\left(1+(y_i-\mu)^2\right).

$$

Its derivative with respect to $\mu$ is

$$

\begin{eqnarray}

\frac{\partial l(y)}{\partial \mu}&=&

0-\sum \frac{\partial}{\partial \mu} \log\left(1+(y_i-\mu)^2\right) \nonumber \\

&=& -\sum \frac{2(y_i-\mu)}{1+(y_i-\mu)^2}\cdot (-1) \nonumber \\

&=& \sum \frac{2 \epsilon_i}{1+\epsilon_i^2} \nonumber

\end{eqnarray}

$$

where $\epsilon_i=y_i-\mu$. Writing $f_0(\epsilon)= \frac{1}{\pi} \frac{1}{1+\epsilon^2}$, so that $f_0'(\epsilon)=\frac{1}{\pi} \frac{-2 \epsilon}{(1+\epsilon^2)^2}$, we get

$$

\frac{f_0'(\epsilon)}{f_0(\epsilon)} =

\frac{\frac{-2\epsilon}{(1+\epsilon^2)^2}}

{\frac{1}{1+\epsilon^2}} = -\frac{2\epsilon}{1+\epsilon^2}.

$$

We find

$$

\begin{eqnarray}

\frac {\partial l(y)} {\partial \mu}

& =& -\sum \frac {f_0′(\epsilon_i)} {f_0(\epsilon_i)} \nonumber \\

&=& -\sum \frac {f_0′(\epsilon_i)} {f_0(\epsilon_i)} \cdot

\left(-\frac{1}{\epsilon_i}\right)

\cdot (-\epsilon_i) \nonumber \\

&=& \sum w_i \epsilon_i \nonumber

\end{eqnarray}

$$

where we used the definition

$$

w_i= \frac{f_0′(\epsilon_i)}

{f_0(\epsilon_i)} \cdot \left(-\frac{1}{\epsilon_i}\right)

= \frac{-2 \epsilon_i}

{1+\epsilon_i^2} \cdot \left(-\frac{1}{\epsilon_i}\right)

= \frac{2}{1+\epsilon_i^2}.

$$

Remembering that

$\epsilon_i=y_i-\mu$ we obtain the equation

$$

\sum w_i y_i = \mu \sum w_i,

$$

which is the estimating equation of IRLS. Note that

- The weights $w_i$ are always positive.
- If the residual is large, we give less weight to the corresponding observation.

To calculate the ML estimator in practice we need a starting value $\hat{\mu}^{(0)}$; we could use the median, for example. Using this value we calculate the residuals

$$

\epsilon_i^{(0)} = y_i - \hat{\mu}^{(0)}

$$

and weights

$$

w_i^{(0)} = \frac{2}{1+\left(\epsilon_i^{(0)}\right)^2 }.

$$

The new value of $\hat{\mu}$ is given by

$$

\hat{\mu}^{(1)} = \frac{\sum w_i^{(0)} y_i}

{\sum w_i^{(0)} }.

$$

Continuing in this way we define

$$

\epsilon_i^{(j)} = y_i- \hat{\mu}^{(j)}

$$ and

$$

w_i^{(j)} = \frac{2}{1+\left(\epsilon_i^{(j)}\right)^2 }.

$$

The estimated value at the pass $j+1$ of the algorithm becomes

$$

\hat{\mu}^{(j+1)} = \frac{\sum w_i^{(j)} y_i}

{\sum w_i^{(j)} }.

$$

We continue in this way until the sequence

$$

\hat{\mu}^{(0)}, \hat{\mu}^{(1)}, \ldots, \hat{\mu}^{(j)}, \ldots

$$

converges.
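To make this concrete, here is a minimal R sketch of the Cauchy iteration (my own code, not from the original post; the function name `iterest_cauchy` is made up for illustration):

```r
# One IRLS pass for the Cauchy location model, using the
# weights w_i = 2 / (1 + eps_i^2) derived above.
iterest_cauchy <- function(y, mu) {
  eps <- y - mu
  w   <- 2 / (1 + eps^2)
  weighted.mean(y, w)
}

# Iterate to convergence from the starting value 0.5.
y  <- c(-5, -1, 0, 1, 5)   # example data
mu <- 0.5
repeat {
  mu_new <- iterest_cauchy(y, mu)
  if (abs(mu_new - mu) < 1e-8) break
  mu <- mu_new
}
mu
```

For these symmetric data the iterates shrink toward the ML estimate $\hat{\mu} = 0$.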

Now we study this process with a more general location-scale family, $f(y)= \frac{1}{\sigma} f_0\left(\frac{y-\mu}{\sigma}\right)$, in less detail.

Let $Y_1,Y_2,\ldots,Y_n$ be independent with the density above. Define also $ \epsilon_i=\frac{y_i-\mu}{\sigma}$. The loglikelihood function is

$$

l(y)= -\frac{n}{2}\log(\sigma^2) + \sum \log(f_0\left(\frac{y_i-\mu}{\sigma}\right)).

$$

Writing $\nu=\sigma^2$, note that

$$

\frac{\partial \epsilon_i}{\partial \mu} =

-\frac{1}{\sigma}

$$

and

$$

\frac{\partial \epsilon_i}{\partial \nu} =

(y_i-\mu)\cdot\frac{\partial}{\partial\nu}\left(\frac{1}{\sqrt{\nu}}\right) =

(y_i-\mu)\cdot \frac{-1}{2 \sigma^3}.

$$

Calculating the loglikelihood derivative

$$

\frac{\partial l(y)}{\partial \mu} =

\sum \frac{f_0′(\epsilon_i)}{f_0(\epsilon_i)}\cdot

\frac{\partial \epsilon_i}{\partial \mu} =

\sum\frac{f_0′(\epsilon_i)}{f_0(\epsilon_i)}\cdot\left(-\frac{1}{\sigma}\right)=

-\frac{1}{\sigma}\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot

\left(-\frac{1}{\epsilon_i}\right)(-\epsilon_i) =

\frac{1}{\sigma}\sum w_i \epsilon_i

$$

and setting this equal to zero gives the same estimating equation as in the first example. Next we look for an estimator of $\sigma^2$:

$$

\begin{eqnarray}

\frac{\partial l(y)}{\partial \nu} &=& -\frac{n}{2}\frac{1}{\nu} +

\sum\frac{f_0′(\epsilon_i)}{f_0(\epsilon_i)}\cdot

\frac{\partial \epsilon_i}{\partial\nu} \nonumber \\

&=& -\frac{n}{2}\frac{1}{\nu}+\sum\frac{f_0′(\epsilon_i)}{f_0(\epsilon_i)}

\cdot \left(-\frac{(y_i-\mu)}{2\sigma^3}\right) \nonumber \\

&=& -\frac{n}{2}\frac{1}{\nu} - \frac{1}{2}\frac{1}{\sigma^2}

\sum\frac{f_0′(\epsilon_i)}{f_0(\epsilon_i)}\cdot \epsilon_i\nonumber \\

&=& -\frac{n}{2}\frac{1}{\nu}-\frac{1}{2}\frac{1}{\nu}

\sum\frac{f_0′(\epsilon_i)}{f_0(\epsilon_i)}\cdot

\left(-\frac{1}{\epsilon_i}\right)

(-\epsilon_i)\cdot\epsilon_i\nonumber \\

&=& -\frac{n}{2}\frac{1}{\nu}+\frac{1}{2}\frac{1}{\nu}\sum w_i \epsilon_i^2

\stackrel{!}{=} 0. \nonumber

\end{eqnarray}

$$

leading to the estimator

$$

\hat{\sigma}^2 = \frac{1}{n}\sum w_i (y_i-\hat{\mu})^2.

$$

The iterative algorithm above can be used in this case as well.
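As a sketch of that joint iteration (my own illustration, reusing the Cauchy weights $w_i = 2/(1+\epsilon_i^2)$ from the first example with $\epsilon_i = (y_i-\mu)/\sigma$; the function name is made up), the alternating updates for $\mu$ and $\sigma^2$ can be written in R as:

```r
# IRLS for a Cauchy location-scale family: alternate the weighted
# mean update for mu with the weighted variance update for sigma^2.
irls_cauchy_ls <- function(y, mu = median(y), sigma = 1,
                           tol = 1e-8, maxit = 1000) {
  for (j in seq_len(maxit)) {
    eps       <- (y - mu) / sigma
    w         <- 2 / (1 + eps^2)                 # Cauchy weights
    mu_new    <- sum(w * y) / sum(w)             # weighted LS step for mu
    sigma_new <- sqrt(mean(w * (y - mu_new)^2))  # scale update
    done  <- abs(mu_new - mu) < tol && abs(sigma_new - sigma) < tol
    mu    <- mu_new
    sigma <- sigma_new
    if (done) break
  }
  c(mu = mu, sigma = sigma)
}

est <- irls_cauchy_ls(c(-5, -1, 0, 1, 5))
```

For these symmetric data the location estimate stays at $\hat{\mu}=0$, while the scale update converges to a fixed point of the estimating equation $\sum w_i \epsilon_i^2 = n$.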

In the following we give a numerical example using R, for the double exponential model (with known scale) and with data `y <- c(-5, -1, 0, 1, 5)`. For these data the true value of the ML estimator is 0 (the sample median).
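As a quick check (this step is implicit in the original), the weight used in the code below follows from the general recipe: the double exponential density is $f_0(\epsilon)=\tfrac{1}{2}e^{-|\epsilon|}$, so $f_0'(\epsilon)/f_0(\epsilon)=-\operatorname{sign}(\epsilon)$ and

$$
w(\epsilon)=\frac{f_0'(\epsilon)}{f_0(\epsilon)}\cdot\left(-\frac{1}{\epsilon}\right)
=\frac{\operatorname{sign}(\epsilon)}{\epsilon}=\frac{1}{|\epsilon|},
$$

which is exactly the `w <- 1/abs(y - mu)` in the function below.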

The initial value will be `mu <- 0.5`. One pass of the algorithm is

```
iterest <- function(y, mu) {
  w <- 1 / abs(y - mu)   # double exponential weights: w_i = 1/|y_i - mu|
  weighted.mean(y, w)    # weighted least squares step for mu
}
```

With this function you can experiment with doing the iterations “by hand”.

Then the iterative algorithm can be done by

```
mu_0 <- 0.5
repeat {
  mu <- iterest(y, mu_0)
  if (abs(mu_0 - mu) < 1e-6) break
  mu_0 <- mu
}
```

**Exercise**: If the model is a $t_k$ distribution with scale parameter $\sigma$, show that the iterations are given by the weights

$$

w_i = \frac{k + 1}{k + \epsilon_i^2}.

$$
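(A numerical sanity check of this weight, my own code and not part of the exercise: IRLS with $w_i = (k+1)/(k+\epsilon_i^2)$ should agree with direct maximization of the $t_k$ loglikelihood, here with $\sigma = 1$ known.)

```r
# IRLS for the t_k location model (sigma = 1 known).
k  <- 3
y  <- c(-2, -1, 0, 1, 3)   # made-up example data
mu <- median(y)
repeat {
  w      <- (k + 1) / (k + (y - mu)^2)   # the claimed weights
  mu_new <- weighted.mean(y, w)
  if (abs(mu_new - mu) < 1e-10) break
  mu <- mu_new
}

# Compare with a direct numerical maximizer of the loglikelihood.
negll  <- function(m) -sum(dt(y - m, df = k, log = TRUE))
mu_opt <- optimize(negll, interval = range(y))$minimum
```

The two estimates `mu` and `mu_opt` should coincide up to the optimizer's tolerance.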

**Exercise**: If the density is logistic, show that the weights are given by

$$

w(\epsilon) = \frac{1-e^{\epsilon}}{1+e^{\epsilon}} \cdot \left(-\frac{1}{\epsilon}\right).

$$

For the moment I will leave it here; I will continue this post later.

**Attribution**
*Source: Link, Question Author: ihadanny, Answer Author: kjetil b halvorsen*