Can you give a simple intuitive explanation of IRLS method to find the MLE of a GLM?

Background:

I’m trying to follow Princeton’s review of MLE estimation for GLM.

I understand the basics of MLE estimation: likelihood, score, observed and expected Fisher information and the Fisher scoring technique. And I know how to justify simple linear regression with MLE estimation.

The question:

I can’t understand even the first line of this method 🙁

What’s the intuition behind the $z_i$ working variables defined as:

$$z_i = \hat\eta_i + (y_i -\hat\mu_i)\frac{d\eta_i}{d\mu_i}$$

Why are they used instead of $y_i$ to estimate $\beta$?

And what’s their relation to the response/link function, which is the connection between $\eta$ and $\mu$?

If anyone has a simple explanation or can direct me to a more basic-level text about this I would be grateful.

Some years ago I wrote a paper about this for my students (in Spanish), so I can try to rewrite those explanations here. I will look at IRLS (iteratively reweighted least squares) through a series of examples of increasing complexity. For the first example we need the concept of a location-scale family. Let $$f_0$$ be a density function centered at zero in some sense. We can construct a family of densities by defining
$$f(x)= f(x;\mu,\sigma)= \frac{1}{\sigma} f_0\left(\frac{x-\mu}{\sigma}\right)$$
where $$\sigma > 0$$ is a scale parameter and $$\mu$$ is a location parameter. In the measurement error model, where the error term is usually modeled as a normal distribution, we can use, in place of that normal distribution, a location-scale family as constructed above. When $$f_0$$ is the standard normal density, the construction above gives the $$\text{N}(\mu, \sigma)$$ family.
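The construction is easy to check in code. Here is a small illustrative Python sketch (the function names are mine, not from any library) that builds a location-scale density from a standard Cauchy $$f_0$$:

```python
import math

def f0(x):
    """Standard Cauchy density, centered at zero."""
    return 1.0 / (math.pi * (1.0 + x * x))

def location_scale(f0, mu, sigma):
    """Build f(x; mu, sigma) = (1/sigma) * f0((x - mu) / sigma)."""
    return lambda x: f0((x - mu) / sigma) / sigma

# a Cauchy density with location 2 and scale 3
f = location_scale(f0, mu=2.0, sigma=3.0)
```

The resulting density peaks at the location parameter and is symmetric about it, just as the formula above says.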

Now we will use IRLS on some simple examples. First we will find the ML (maximum likelihood) estimators in the model
$$Y_1,Y_2,\ldots,Y_n \hspace{1em} \text{i.i.d}$$
with the density
$$f(y)= \frac{1}{\pi} \frac{1}{1+(y-\mu)^2},\hspace{1em} y\in{\mathbb R},$$
the Cauchy distribution with location parameter $$\mu$$ (so this is a location family). But first some notation. The weighted least squares estimator of $$\mu$$ is given by
$$\mu^{\ast} = \frac{\sum_{i=1}^n w_i y_i} {\sum_{i=1}^n w_i}.$$
where the $$w_i$$ are some weights. We will see that the ML estimator of $$\mu$$ can be expressed in the same form, with $$w_i$$ some function of the residuals
$$\epsilon_i = y_i-\hat{\mu}.$$
The likelihood function is given by
$$L(y;\mu)= \left(\frac{1}{\pi}\right)^n \prod_{i=1}^n \frac{1}{1+(y_i-\mu)^2}$$
and the loglikelihood function is given by
$$l(y)= -n \log(\pi) - \sum_{i=1}^n \log\left(1+(y_i-\mu)^2\right).$$
Its derivative with respect to $$\mu$$ is
$$\begin{eqnarray} \frac{\partial l(y)}{\partial \mu}&=& 0-\sum \frac{\partial}{\partial \mu} \log\left(1+(y_i-\mu)^2\right) \nonumber \\ &=& -\sum \frac{2(y_i-\mu)}{1+(y_i-\mu)^2}\cdot (-1) \nonumber \\ &=& \sum \frac{2 \epsilon_i}{1+\epsilon_i^2} \nonumber \end{eqnarray}$$
where $$\epsilon_i=y_i-\mu$$. Writing $$f_0(\epsilon)= \frac{1}{\pi} \frac{1}{1+\epsilon^2}$$, so that $$f_0'(\epsilon)=\frac{1}{\pi} \frac{-2 \epsilon}{(1+\epsilon^2)^2}$$, we get
$$\frac{f_0'(\epsilon)}{f_0(\epsilon)} = \frac{\frac{-2\epsilon}{(1+\epsilon^2)^2}} {\frac{1}{1+\epsilon^2}} = -\frac{2\epsilon}{1+\epsilon^2}.$$
We find
$$\begin{eqnarray} \frac {\partial l(y)} {\partial \mu} & =& -\sum \frac {f_0'(\epsilon_i)} {f_0(\epsilon_i)} \nonumber \\ &=& -\sum \frac {f_0'(\epsilon_i)} {f_0(\epsilon_i)} \cdot \left(-\frac{1}{\epsilon_i}\right) \cdot (-\epsilon_i) \nonumber \\ &=& \sum w_i \epsilon_i \nonumber \end{eqnarray}$$
where we used the definition
$$w_i= \frac{f_0'(\epsilon_i)} {f_0(\epsilon_i)} \cdot \left(-\frac{1}{\epsilon_i}\right) = \frac{-2 \epsilon_i} {1+\epsilon_i^2} \cdot \left(-\frac{1}{\epsilon_i}\right) = \frac{2}{1+\epsilon_i^2}.$$
Remembering that
$$\epsilon_i=y_i-\mu$$ we obtain the equation
$$\sum w_i y_i = \mu \sum w_i,$$
which is the estimating equation of IRLS. Note that

1. The weights $$w_i$$ are always positive.
2. If the residual is large, we give less weight to the corresponding observation.

To calculate the ML estimator in practice we need a starting value $$\hat{\mu}^{(0)}$$; we could use the median, for example. Using this value we calculate residuals
$$\epsilon_i^{(0)} = y_i - \hat{\mu}^{(0)}$$
and weights
$$w_i^{(0)} = \frac{2}{1+\left(\epsilon_i^{(0)}\right)^2 }.$$
The new value of $$\hat{\mu}$$ is given by
$$\hat{\mu}^{(1)} = \frac{\sum w_i^{(0)} y_i} {\sum w_i^{(0)} }.$$
Continuing in this way we define
$$\epsilon_i^{(j)} = y_i- \hat{\mu}^{(j)}$$ and
$$w_i^{(j)} = \frac{2}{1+\left(\epsilon_i^{(j)}\right)^2 }.$$
The estimated value at pass $$j+1$$ of the algorithm becomes
$$\hat{\mu}^{(j+1)} = \frac{\sum w_i^{(j)} y_i} {\sum w_i^{(j)} }.$$
We continue until the sequence
$$\hat{\mu}^{(0)}, \hat{\mu}^{(1)}, \ldots, \hat{\mu}^{(j)}, \ldots$$
converges.
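The whole iteration fits in a few lines of code. Here is a minimal Python sketch of the Cauchy case (the function name, tolerance, and iteration cap are my own choices, not from the derivation above):

```python
import statistics

def irls_cauchy_location(y, tol=1e-9, max_iter=200):
    """IRLS for the location parameter of a Cauchy sample."""
    mu = statistics.median(y)  # starting value: the median
    for _ in range(max_iter):
        # Cauchy weights derived above: w_i = 2 / (1 + eps_i^2)
        w = [2.0 / (1.0 + (yi - mu) ** 2) for yi in y]
        # weighted-least-squares update: mu = sum(w_i y_i) / sum(w_i)
        mu_new = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu
```

Note that each pass is just a weighted mean; only the weights change between passes.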

Now we study this process with a more general location-scale family, $$f(y)= \frac{1}{\sigma} f_0\left(\frac{y-\mu}{\sigma}\right)$$, in less detail.
Let $$Y_1,Y_2,\ldots,Y_n$$ be independent with the density above. Define also $$\epsilon_i=\frac{y_i-\mu}{\sigma}$$. The loglikelihood function is
$$l(y)= -\frac{n}{2}\log(\sigma^2) + \sum \log(f_0\left(\frac{y_i-\mu}{\sigma}\right)).$$
Writing $$\nu=\sigma^2$$, note that
$$\frac{\partial \epsilon_i}{\partial \mu} = -\frac{1}{\sigma}$$
and
$$\frac{\partial \epsilon_i}{\partial \nu} = (y_i-\mu)\left(\frac{1}{\sqrt{\nu}}\right)' = (y_i-\mu)\cdot \frac{-1}{2 \sigma^3}.$$
Calculating the loglikelihood derivative
$$\frac{\partial l(y)}{\partial \mu} = \sum \frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \frac{\partial \epsilon_i}{\partial \mu} = \sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot\left(-\frac{1}{\sigma}\right)= -\frac{1}{\sigma}\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \left(-\frac{1}{\epsilon_i}\right)(-\epsilon_i) = \frac{1}{\sigma}\sum w_i \epsilon_i$$
and setting this equal to zero gives the same estimating equation as in the first example. Then we search for an estimator of $$\sigma^2$$:
$$\begin{eqnarray} \frac{\partial l(y)}{\partial \nu} &=& -\frac{n}{2}\frac{1}{\nu} + \sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \frac{\partial \epsilon_i}{\partial\nu} \nonumber \\ &=& -\frac{n}{2}\frac{1}{\nu}+\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)} \cdot \left(-\frac{(y_i-\mu)}{2\sigma^3}\right) \nonumber \\ &=& -\frac{n}{2}\frac{1}{\nu} - \frac{1}{2}\frac{1}{\sigma^2} \sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \epsilon_i\nonumber \\ &=& -\frac{n}{2}\frac{1}{\nu}-\frac{1}{2}\frac{1}{\nu} \sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \left(-\frac{1}{\epsilon_i}\right) (-\epsilon_i)\cdot\epsilon_i\nonumber \\ &=& -\frac{n}{2}\frac{1}{\nu}+\frac{1}{2}\frac{1}{\nu}\sum w_i \epsilon_i^2 \stackrel{!}{=} 0. \nonumber \end{eqnarray}$$
Solving for $$\nu=\sigma^2$$ gives
$$\hat{\sigma}^2 = \frac{1}{n}\sum w_i (y_i-\hat{\mu})^2.$$
The iterative algorithm above can be used in this case as well.
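As an illustration, here is a sketch of that joint iteration for the Cauchy location-scale family in Python (again my own sketch, not code from the original post; the starting values, the median and the squared median absolute residual, are my choices):

```python
import statistics

def irls_location_scale(y, tol=1e-9, max_iter=1000):
    """IRLS for the Cauchy location-scale family, w(eps) = 2 / (1 + eps^2)."""
    mu = statistics.median(y)  # robust starting value for mu
    mad = statistics.median([abs(yi - mu) for yi in y])
    sigma2 = mad ** 2 if mad > 0 else 1.0  # rough starting value for sigma^2
    for _ in range(max_iter):
        sigma = sigma2 ** 0.5
        # weights from the current standardized residuals
        w = [2.0 / (1.0 + ((yi - mu) / sigma) ** 2) for yi in y]
        # weighted mean for mu, then the sigma^2 update derived above
        mu_new = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        sigma2_new = sum(wi * (yi - mu_new) ** 2 for wi, yi in zip(w, y)) / len(y)
        if abs(mu_new - mu) < tol and abs(sigma2_new - sigma2) < tol:
            return mu_new, sigma2_new
        mu, sigma2 = mu_new, sigma2_new
    return mu, sigma2
```

At the fixed point, both estimating equations (the weighted mean for $$\mu$$ and $$\hat{\sigma}^2 = \frac{1}{n}\sum w_i (y_i-\hat{\mu})^2$$) are satisfied simultaneously.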

In the following we give a numerical example using R, for the double exponential (Laplace) model with known scale and with data y <- c(-5, -1, 0, 1, 5). For these data the ML estimator is the median, which is 0.
The initial value will be mu <- 0.5. One pass of the algorithm is

    iterest <- function(y, mu) {
        # double exponential weights: w_i = 1 / |y_i - mu|
        w <- 1/abs(y - mu)
        weighted.mean(y, w)
    }


With this function you can experiment with doing the iterations “by hand”.
Then the iterative algorithm can be done by

    mu_0 <- 0.5
    repeat {
        mu <- iterest(y, mu_0)
        if (abs(mu_0 - mu) < 0.000001) break
        mu_0 <- mu
    }


Exercise: If the model is a $$t_k$$ distribution with scale parameter $$\sigma$$, show that the iterations are given by the weights
$$w_i = \frac{k + 1}{k + \epsilon_i^2}.$$
Exercise: If the density is logistic, show the weights are given by
$$w(\epsilon) = -\frac{1}{\epsilon} \cdot \frac{1-e^\epsilon}{1+e^\epsilon}.$$

I will leave it here for the moment; I will continue this post later.