Can you give a simple intuitive explanation of IRLS method to find the MLE of a GLM?

Background:

I’m trying to follow Princeton’s review of MLE estimation for GLM.

I understand the basics of MLE estimation: likelihood, score, observed and expected Fisher information and the Fisher scoring technique. And I know how to justify simple linear regression with MLE estimation.


The question:

I can’t understand even the first line of this method 🙁

What’s the intuition behind the $z_i$ working variables defined as:

$$ z_i = \hat\eta_i + (y_i -\hat\mu_i)\frac{d\eta_i}{d\mu_i}$$

Why are they used instead of $y_i$ to estimate $\beta$?

And what’s their relation to the response/link function, which is the connection between $\eta$ and $\mu$?

If anyone has a simple explanation or can direct me to a more basic-level text about this I would be grateful.

Answer

Some years ago I wrote a paper about this for my students (in Spanish), so I can try to rewrite those explanations here. I will look at IRLS (iteratively reweighted least squares) through a series of examples of increasing complexity. For the first example we need the concept of a location-scale family. Let $f_0$ be a density function centered at zero in some sense. We can construct a family of densities by defining
$$
f(x)= f(x;\mu,\sigma)= \frac{1}{\sigma} f_0\left(\frac{x-\mu}{\sigma}\right)
$$

where $\sigma > 0$ is a scale parameter and $\mu$ is a location parameter. In the measurement error model, where the error term is usually modeled as a normal distribution, we can use a location-scale family constructed as above in place of that normal distribution. When $f_0$ is the standard normal density, the construction above gives the $\text{N}(\mu, \sigma)$ family.
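
As a small illustration in R (the names `f_ls` and `f0` are mine, not from any text): the construction is a one-liner, and with `f0 = dnorm` it reproduces the normal density.

    # Build a location-scale density f(x; mu, sigma) = (1/sigma) * f0((x - mu)/sigma)
    f_ls <- function(x, mu, sigma, f0 = dnorm) (1 / sigma) * f0((x - mu) / sigma)

    # With the standard normal as f0 this is just dnorm(x, mean = mu, sd = sigma)
    all.equal(f_ls(1.3, mu = 2, sigma = 0.5), dnorm(1.3, mean = 2, sd = 0.5))  # TRUE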

Now we will use IRLS on some simple examples. First we will find the ML (maximum likelihood) estimators in the model
$$
Y_1,Y_2,\ldots,Y_n \hspace{1em} \text{i.i.d.}
$$

with the density
$$
f(y)= \frac{1}{\pi} \frac{1}{1+(y-\mu)^2},\hspace{1em} y\in{\mathbb R},
$$

the Cauchy distribution with location parameter $\mu$ (so this is a location family). But first some notation. The weighted least squares estimator of $\mu$ is given by
$$
\mu^{\ast} = \frac{\sum_{i=1}^n w_i y_i}
{\sum_{i=1}^n w_i}.
$$

where the $w_i$ are some weights. We will see that the ML estimator of $\mu$ can be expressed in the same form, with $w_i$ some function of the residuals
$$
\epsilon_i = y_i-\hat{\mu}.
$$
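
As a small aside, this weighted average is exactly what R's `weighted.mean` computes, which is why that function appears in the numerical example further down (the toy numbers here are my own):

    # Weighted least squares estimator of mu: sum(w * y) / sum(w)
    y0 <- c(1, 2, 10)
    w0 <- c(1, 1, 0.1)
    weighted.mean(y0, w0)     # 1.904762
    sum(w0 * y0) / sum(w0)    # the same value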

The likelihood function is given by
$$
L(y;\mu)= \left(\frac{1}{\pi}\right)^n \prod_{i=1}^n \frac{1}{1+(y_i-\mu)^2}
$$

and the loglikelihood function is given by
$$
l(y)= -n \log(\pi) - \sum_{i=1}^n \log\left(1+(y_i-\mu)^2\right).
$$
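
If you want to check the algebra numerically, this loglikelihood is a one-liner in R (a sketch; the name `loglik_cauchy` is mine), and it agrees with the built-in `dcauchy`:

    # Cauchy location loglikelihood, transcribed from the formula above
    loglik_cauchy <- function(mu, y) -length(y) * log(pi) - sum(log(1 + (y - mu)^2))

    # It matches the sum of log densities from dcauchy (using the data of the later example)
    y0 <- c(-5, -1, 0, 1, 5)
    all.equal(loglik_cauchy(0.3, y0), sum(dcauchy(y0, location = 0.3, log = TRUE)))  # TRUE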

Its derivative with respect to $\mu$ is
$$
\begin{eqnarray}
\frac{\partial l(y)}{\partial \mu}&=&
0-\sum \frac{\partial}{\partial \mu} \log\left(1+(y_i-\mu)^2\right) \nonumber \\
&=& -\sum \frac{2(y_i-\mu)}{1+(y_i-\mu)^2}\cdot (-1) \nonumber \\
&=& \sum \frac{2 \epsilon_i}{1+\epsilon_i^2} \nonumber
\end{eqnarray}
$$

where $\epsilon_i=y_i-\mu$. Writing $f_0(\epsilon)= \frac{1}{\pi} \frac{1}{1+\epsilon^2}$, so that $f_0'(\epsilon)=\frac{1}{\pi} \frac{-2 \epsilon}{(1+\epsilon^2)^2}$, we get
$$
\frac{f_0'(\epsilon)}{f_0(\epsilon)} =
\frac{\frac{-2\epsilon}{(1+\epsilon^2)^2}}
{\frac{1}{1+\epsilon^2}} = -\frac{2\epsilon}{1+\epsilon^2}.
$$

We find
$$
\begin{eqnarray}
\frac {\partial l(y)} {\partial \mu}
&=& -\sum \frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)} \nonumber \\
&=& -\sum \frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)} \cdot
\left(-\frac{1}{\epsilon_i}\right)
\cdot (-\epsilon_i) \nonumber \\
&=& \sum w_i \epsilon_i \nonumber
\end{eqnarray}
$$

where we used the definition
$$
w_i= \frac{f_0'(\epsilon_i)}
{f_0(\epsilon_i)} \cdot \left(-\frac{1}{\epsilon_i}\right)
= \frac{-2 \epsilon_i}
{1+\epsilon_i^2} \cdot \left(-\frac{1}{\epsilon_i}\right)
= \frac{2}{1+\epsilon_i^2}.
$$

Remembering that
$\epsilon_i=y_i-\mu$ we obtain the equation
$$
\sum w_i y_i = \mu \sum w_i,
$$

which is the estimating equation of IRLS. Note that

  1. The weights $w_i$ are always positive.
  2. If the residual is large, we give less weight to the corresponding observation (see the small numerical check right after this list).
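
Both points are easy to verify in R (a tiny sketch; `w_cauchy` is my own name for the weight function):

    # Weight from the derivation above: w(eps) = 2 / (1 + eps^2)
    w_cauchy <- function(eps) 2 / (1 + eps^2)
    w_cauchy(c(0, 1, 5, 10))   # 2.000 1.000 0.077 0.020: always positive, small for large residuals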

To calculate the ML estimator in practice we need a starting value $\hat{\mu}^{(0)}$; we could use the median, for example. Using this value we calculate the residuals
$$
\epsilon_i^{(0)} = y_i - \hat{\mu}^{(0)}
$$

and weights
$$
w_i^{(0)} = \frac{2}{1+\left(\epsilon_i^{(0)}\right)^2 }.
$$

The new value of $\hat{\mu}$ is given by
$$
\hat{\mu}^{(1)} = \frac{\sum w_i^{(0)} y_i}
{\sum w_i^{(0)} }.
$$

Continuing in this way we define
$$
\epsilon_i^{(j)} = y_i- \hat{\mu}^{(j)}
$$
and
$$
w_i^{(j)} = \frac{2}{1+\left(\epsilon_i^{(j)}\right)^2 }.
$$

The estimate at pass $j+1$ of the algorithm is then
$$
\hat{\mu}^{(j+1)} = \frac{\sum w_i^{(j)} y_i}
{\sum w_i^{(j)} }.
$$

We continue until the sequence
$$
\hat{\mu}^{(0)}, \hat{\mu}^{(1)}, \ldots, \hat{\mu}^{(j)}, \ldots
$$

converges.
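
Written out in R, one version of this scheme could look as follows (a sketch only; the function name and the convergence tolerance are my choices, in the same spirit as the double exponential code further down):

    # IRLS for the Cauchy location model, with weights w_i = 2 / (1 + eps_i^2)
    irls_cauchy <- function(y, mu0 = median(y), tol = 1e-6) {
        mu <- mu0
        repeat {
            eps    <- y - mu                  # current residuals
            w      <- 2 / (1 + eps^2)         # current weights
            mu_new <- sum(w * y) / sum(w)     # weighted least squares update
            if (abs(mu_new - mu) < tol) break
            mu <- mu_new
        }
        mu_new
    }

    irls_cauchy(c(-5, -1, 0, 1, 5))   # 0 for these symmetric data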

Now we study this process with a more general location-scale family, $f(y)= \frac{1}{\sigma} f_0\left(\frac{y-\mu}{\sigma}\right)$, in less detail.
Let $Y_1,Y_2,\ldots,Y_n$ be independent with the density above. Define also $ \epsilon_i=\frac{y_i-\mu}{\sigma}$. The loglikelihood function is
$$
l(y)= -\frac{n}{2}\log(\sigma^2) + \sum \log(f_0\left(\frac{y_i-\mu}{\sigma}\right)).
$$

Writing $\nu=\sigma^2$, note that
$$
\frac{\partial \epsilon_i}{\partial \mu} =
-\frac{1}{\sigma}
$$

and
$$
\frac{\partial \epsilon_i}{\partial \nu} =
(y_i-\mu)\left(\frac{1}{\sqrt{\nu}}\right)' =
(y_i-\mu)\cdot \frac{-1}{2 \sigma^3}.
$$

Calculating the loglikelihood derivative
$$
\frac{\partial l(y)}{\partial \mu} =
\sum \frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\frac{\partial \epsilon_i}{\partial \mu} =
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot\left(-\frac{1}{\sigma}\right)=
-\frac{1}{\sigma}\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\left(-\frac{1}{\epsilon_i}\right)(-\epsilon_i) =
\frac{1}{\sigma}\sum w_i \epsilon_i
$$

and setting this equal to zero gives the same estimating equation as in the first example. Next we look for an estimator of $\sigma^2$:
$$
\begin{eqnarray}
\frac{\partial l(y)}{\partial \nu} &=& -\frac{n}{2}\frac{1}{\nu} +
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\frac{\partial \epsilon_i}{\partial\nu} \nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}+\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}
\cdot \left(-\frac{(y_i-\mu)}{2\sigma^3}\right) \nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu} - \frac{1}{2}\frac{1}{\sigma^2}
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \epsilon_i\nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}-\frac{1}{2}\frac{1}{\nu}
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\left(-\frac{1}{\epsilon_i}\right)
(-\epsilon_i)\cdot\epsilon_i\nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}+\frac{1}{2}\frac{1}{\nu}\sum w_i \epsilon_i^2
\stackrel{!}{=} 0. \nonumber
\end{eqnarray}
$$

leading to the estimator
$$
\hat{\sigma}^2 = \frac{1}{n}\sum w_i (y_i-\hat{\mu})^2.
$$

The iterative algorithm above can be used in this case as well.
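
For concreteness, here is a sketch of how the two updates can be alternated in R, using the Cauchy family with an added scale parameter as the example, so that again $w_i = 2/(1+\epsilon_i^2)$ but now with $\epsilon_i=(y_i-\hat\mu)/\hat\sigma$ (the function name and starting values are my choices):

    # IRLS for a Cauchy location-scale model: alternate the mu and sigma^2 updates
    irls_cauchy_ls <- function(y, mu0 = median(y), s2_0 = mad(y)^2, tol = 1e-6) {
        mu <- mu0
        s2 <- s2_0
        repeat {
            eps    <- (y - mu) / sqrt(s2)        # standardized residuals
            w      <- 2 / (1 + eps^2)            # same weight function as before
            mu_new <- sum(w * y) / sum(w)        # weighted mean for mu
            s2_new <- mean(w * (y - mu_new)^2)   # weighted variance, the estimator above
            if (abs(mu_new - mu) < tol && abs(s2_new - s2) < tol) break
            mu <- mu_new
            s2 <- s2_new
        }
        c(mu = mu_new, sigma2 = s2_new)
    }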

In the following we give a numerical example in R, for the double exponential (Laplace) model with known scale and with data y <- c(-5,-1,0,1,5). For these data the ML estimator of the location is the sample median, which is 0.
The initial value will be mu <- 0.5. One pass of the algorithm is

    iterest <- function(y, mu) {
        w <- 1 / abs(y - mu)   # double exponential weights: w_i = 1 / |y_i - mu|
        weighted.mean(y, w)    # weighted least squares update for mu
    }

With this function you can experiment with doing the iterations “by hand”. Then the full iteration can be run with

    mu_0 <- 0.5
    repeat {
        mu <- iterest(y, mu_0)                 # one IRLS pass
        if (abs(mu_0 - mu) < 0.000001) break   # stop when the change is negligible
        mu_0 <- mu
    }
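
For example, the first pass from the starting value already moves towards the median, and after the `repeat` loop has finished, `mu` is numerically indistinguishable from 0:

    y <- c(-5, -1, 0, 1, 5)
    iterest(y, 0.5)   # about 0.30, closer to the median 0 than the starting value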

Exercise: If the model is a $t_k$ distribution with scale parameter $\sigma$, show that the iterations use the weights
$$
w_i = \frac{k + 1}{k + \epsilon_i^2}.
$$

Exercise: If the density is logistic, show the weights are given by
$$
w(\epsilon) = \frac{ 1-e^\epsilon}{1+e^\epsilon} \cdot \left(- \frac{1}{\epsilon}\right).
$$

For the moment I will leave it here; I will continue this post later.

Attribution
Source: Link, Question Author: ihadanny, Answer Author: kjetil b halvorsen
