I’m confused about the iteratively reweighted least squares (IRLS) algorithm used to solve for the logistic regression coefficients, as described on page 121 of *The Elements of Statistical Learning*, 2nd Edition (Hastie, Tibshirani, Friedman 2009). The final step of the process, after fitting a Taylor approximation to the log-likelihood of the $N$ observations, is to solve the following weighted least squares problem:

$$\beta^{new} \leftarrow \arg\min_{\beta} \, (z - X\beta)^T W (z - X\beta) \tag{1}$$

by computing $\frac{\partial}{\partial \beta_j}\left[(z - X\beta)^T W (z - X\beta)\right]$, setting it equal to $0$, and then solving for $\beta_j^{new}$,

where:

$z = X\beta^{old} + W^{-1}(y - p)$,

$W$ = the $N \times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i; \beta^{old})\,(1 - p(x_i; \beta^{old}))$,

$p$ = the vector of fitted probabilities with $i$th element $p(x_i; \beta^{old})$,

$y$ = the vector of $y_i$ values,

$X$ = the matrix of $x_i$ values,

$\beta$ = the vector of coefficients $\beta_0, \beta_1, \ldots, \beta_p$.

In the right-hand part of expression (1), the $\beta$s are missing any superscript. Is $\beta$ presumed to be equal to $\beta^{old}$? That is, in order to solve for $\beta_j^{new}$ in (1), do we plug in the most current update of $\beta$ for all values of $\beta_{l \neq j}$ calculated in prior steps?

**Answer**

In an expression like

$$\beta^{new} \leftarrow \arg\min_{b} \, (z - Xb)^T W (z - Xb)$$

the point is that the output, $\beta^{new}$, is the result of considering all possible $b \in \mathbb{R}^p$ (or whatever other space you are optimizing over). That’s why there’s no superscript: in the optimization problem the argument is a dummy variable, just like the variable of integration in an integral (and I’m deliberately writing $b$, not $\beta$, to reflect that $b$ is a dummy variable, not the target parameter).

The overall procedure involves taking a $\beta^{(t)}$, computing the “response” $z$ for the WLS problem, and then solving that WLS problem for $\beta^{(t+1)}$; as you know, we can use derivatives to get a nice closed-form solution for the optimal $\hat\beta$ of this problem. Thus $\beta^{old}$, which is fixed, appears in the vector $z$ of the WLS computation and then leads to $\beta^{new}$. That’s the “iteration” part: we use our current solution to create a new response vector, and the WLS part then solves for the new $\hat\beta$ vector. We keep doing this until there is no “significant” change.

Remember that the WLS procedure doesn’t know that it is being used iteratively; as far as it is concerned, it is presented with an $X$, $y$, and $W$ and then outputs

$$\hat\beta = (X^T W X)^{-1} X^T W y$$

like it would in any other instance. We are being clever with our choice of $y$ and $W$, and iterating.
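To make the iteration concrete, here is a minimal numpy sketch of this procedure (my own illustration, not code from the book; the function name `irls_logistic` and the toy data are invented, and it assumes `X` already contains an intercept column and that no weights underflow to zero):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit logistic regression by IRLS: repeatedly build W and z from the
    current beta, then hand the resulting WLS problem to a generic solver."""
    beta = np.zeros(X.shape[1])                  # beta^old starts at 0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities p(x_i; beta^old)
        w = p * (1.0 - p)                        # diagonal of W
        z = X @ beta + (y - p) / w               # adjusted response z = X beta^old + W^{-1}(y - p)
        XtW = X.T * w                            # X^T W without forming the N x N matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # the WLS closed form
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# toy data: intercept plus one feature
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-0.5, 2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
beta_hat = irls_logistic(X, y)
```

Note that the WLS solve in the middle is completely generic; only the construction of `w` and `z` from the current `beta` carries the logistic-regression-specific cleverness.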

**Update:**

We can derive the solution to the WLS problem without using any component-wise derivatives. Note that if $Y \sim \mathcal N(X\beta, W^{-1})$ then $W^{1/2} Y \sim \mathcal N(W^{1/2} X \beta, I)$, so the weighted problem becomes an ordinary least squares problem in the transformed data, from which we have that

$$\frac{d}{d\beta} \left\| W^{1/2} Y - W^{1/2} X \beta \right\|^2 = -2 X^T W (Y - X\beta).$$

Setting the derivative equal to $0$ and solving, we obtain

$$\hat{\beta} = (X^T W X)^{-1} X^T W Y.$$

Thus for any inputs $W$, $X$, and $Y$ (provided $W$ is positive definite and $X$ has full column rank) we get our optimal $\hat\beta$. It doesn’t matter what these inputs are. So what we do is use our $\beta^{old}$ to create our $Y$ vector and then plug *that* into this formula, which outputs the optimal $\hat\beta$ for the given inputs. The whole point of the WLS procedure is to solve for $\hat\beta$; it in and of itself doesn’t require plugging in a $\hat\beta$.
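As a quick numerical sanity check of this formula (again my own illustration, not from the answer), the closed-form WLS solution should agree with ordinary least squares run on the transformed data $W^{1/2}X$ and $W^{1/2}Y$, which is exactly the reduction the derivation above exploits:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))              # full column rank with probability 1
Y = rng.normal(size=n)
w = rng.uniform(0.1, 2.0, size=n)        # positive diagonal of W

# closed-form WLS solution (X^T W X)^{-1} X^T W Y
XtW = X.T * w
beta_wls = np.linalg.solve(XtW @ X, XtW @ Y)

# same answer from OLS on the transformed data W^{1/2} X, W^{1/2} Y
sw = np.sqrt(w)
beta_ols, *_ = np.linalg.lstsq(sw[:, None] * X, sw * Y, rcond=None)
```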

**Attribution**
*Source: Link, Question Author: RobertF, Answer Author: jld*