Optimizing OLS with Newton’s Method

Can ordinary least squares regression be solved with Newton’s method? If so, how many steps would be required to achieve convergence?

I know that Newton’s method requires a twice-differentiable function; I’m just not sure how this applies to OLS.


When applied to OLS regression, Newton’s method converges in a single step, and that step is equivalent to evaluating the standard, closed-form solution for the coefficients.

On each iteration, Newton’s method constructs a quadratic approximation of the loss function around the current parameters, based on the gradient and Hessian. The parameters are then updated by minimizing this approximation. For a quadratic loss function (as we have with OLS regression), the approximation coincides exactly with the loss function itself, so minimizing it lands on the true minimizer and convergence occurs in a single step.

This assumes we’re using the ‘vanilla’ version of Newton’s method. Some variants use a restricted step size (e.g. damped Newton or line-search variants), in which case multiple steps would be needed. It also assumes the design matrix has full column rank. If that fails, the Hessian is singular, so Newton’s method can’t be applied without modifying the problem and/or the update rule (and there’s no unique OLS solution in that case anyway).
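The one-step behavior is easy to check numerically. Below is a minimal NumPy sketch (the data here are synthetic and the variable names follow the derivation given afterward): starting from an arbitrary $w_0$, a single Newton update reproduces the closed-form least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic full-column-rank problem: X is n-by-d, y is length n
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def newton_step(w, X, y):
    """One Newton update for L(w) = 0.5 * ||y - X w||^2."""
    grad = X.T @ X @ w - X.T @ y      # gradient at w
    hess = X.T @ X                    # Hessian (constant in w)
    return w - np.linalg.solve(hess, grad)

w0 = np.zeros(d)                      # arbitrary initial guess
w1 = newton_step(w0, X, y)

# Closed-form OLS solution for comparison
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w1, w_ols))         # True: one step reaches the minimizer
```

Using `np.linalg.solve` rather than explicitly inverting the Hessian is the standard numerically stable way to apply the update.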


Assume the design matrix $X \in \mathbb{R}^{n \times d}$ has full column rank (so $X^T X$ is invertible). Let $y \in \mathbb{R}^n$ be the responses, and $w \in \mathbb{R}^d$ be the coefficients. The loss function is:

$$L(w) = \frac{1}{2} \|y - X w\|_2^2$$

The gradient and Hessian are:

$$\nabla L(w) = X^T X w - X^T y \quad \quad
H_L(w) = X^T X$$
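As a sanity check on the gradient expression, we can compare it against central finite differences of the loss (a generic numerical check, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

loss = lambda w: 0.5 * np.sum((y - X @ w) ** 2)

# Analytic gradient from the formula above
grad = X.T @ X @ w - X.T @ y

# Central finite differences as an independent check
eps = 1e-6
fd = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(grad, fd, atol=1e-4))  # True
```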

Newton’s method sets the parameters to an initial guess $w_0$, then iteratively updates them. Let $w_t$ be the current parameters on iteration $t$. The updated parameters $w_{t+1}$ are obtained by subtracting the product of the inverse Hessian and the gradient:

$$w_{t+1} = w_t - H_L(w_t)^{-1} \nabla L(w_t)$$

Plug in the expressions for the gradient and Hessian:

$$w_{t+1} = w_t - (X^T X)^{-1} (X^T X w_t - X^T y)$$

$$= (X^T X)^{-1} X^T y$$

This is the standard, closed-form expression for the OLS coefficients. Therefore, no matter what we choose for the initial guess $w_0$, we’ll have the correct solution $w_1$ after a single iteration.

Furthermore, this is a fixed point of the iteration: the expression for $w_{t+1}$ doesn’t depend on $w_t$, so the solution won’t change if we continue beyond one iteration. This confirms that Newton’s method converges in exactly one step.
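The fixed-point property can also be verified directly: applying the update a second time leaves the parameters unchanged (again a synthetic sketch, with names matching the derivation above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def newton_step(w):
    """Newton update for the OLS loss, solving rather than inverting."""
    return w - np.linalg.solve(X.T @ X, X.T @ X @ w - X.T @ y)

w1 = newton_step(rng.normal(size=d))  # one step from a random start
w2 = newton_step(w1)                  # a second step changes nothing
print(np.allclose(w1, w2))            # True: w1 is a fixed point
```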

Source: Link, Question Author: MLNewbie, Answer Author: user20160