Why not use the “normal equations” to find simple least squares coefficients?

I saw this list here and couldn’t believe there were so many ways to solve least squares. The “normal equations” on Wikipedia seemed like a fairly straightforward way:
$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \qquad \hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

So why not just use them? I assumed there must be a computational or precision issue, given that in the first link above Mark L. Stone mentions that SVD or QR are popular methods in statistical software and that the normal equations are “TERRIBLE from reliability and numerical accuracy standpoint”. However, in the following code, the normal equations give me accuracy to ~12 decimal places compared with three popular Python functions: numpy’s polyfit, scipy’s linregress, and scikit-learn’s LinearRegression.

What’s more interesting is that the normal-equation method is the fastest when n = 100000000. Computation times for me are: 2.5 s for linregress, 12.9 s for polyfit, 4.2 s for LinearRegression, and 1.8 s for the normal equation.

Code:

import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import linregress
import timeit

# generate data from the model y = b0 + b1*x + e with standard normal noise
b0 = 0
b1 = 1
n = 100000000
x = np.linspace(-5, 5, n)
np.random.seed(42)
e = np.random.randn(n)
y = b0 + b1*x + e

# scipy                                                                                                                                     
start = timeit.default_timer()
print(str.format('{0:.30f}', linregress(x, y)[0]))
stop = timeit.default_timer()
print(stop - start)

# numpy                                                                                                                                      
start = timeit.default_timer()
print(str.format('{0:.30f}', np.polyfit(x, y, 1)[0]))
stop = timeit.default_timer()
print(stop - start)

# sklearn                                                                                                                                    
clf = LinearRegression()
start = timeit.default_timer()
clf.fit(x.reshape(-1, 1), y.reshape(-1, 1))
stop = timeit.default_timer()
print(str.format('{0:.30f}', clf.coef_[0, 0]))
print(stop - start)

# normal equation (closed-form slope using centred x and y)
start = timeit.default_timer()
slope = np.sum((x-x.mean())*(y-y.mean()))/np.sum((x-x.mean())**2)
stop = timeit.default_timer()
print(str.format('{0:.30f}', slope))
print(stop - start) 

Answer

For the problem $Ax \approx b$, forming the Normal equations squares the condition number of $A$ by forming $A^TA$. Roughly speaking, $\log_{10}(\mathrm{cond})$ is the number of digits you lose in your calculation if everything is done well. And this doesn’t really have anything to do with forming the inverse of $A^TA$. No matter how $A^TAx = A^Tb$ is solved, you’ve already lost $\log_{10}(\mathrm{cond}(A^TA)) = 2\log_{10}(\mathrm{cond}(A))$ digits of accuracy. I.e., forming the Normal equations has doubled the number of digits of accuracy lost, right off the bat.
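For instance, here is a minimal numpy sketch (not from the original answer; the nearly collinear design matrix is made up purely for illustration) showing that cond($A^TA$) is roughly cond($A$) squared:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = rng.uniform(0, 1, n)
# made-up design matrix with two nearly collinear columns, so cond(A) is large
A = np.column_stack([np.ones(n), t, t + 1e-6 * rng.standard_normal(n)])

print(np.linalg.cond(A))        # cond(A), already large
print(np.linalg.cond(A.T @ A))  # roughly cond(A)**2, i.e. twice as many digits lost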

If the condition number is small (one is the best possible), it doesn’t matter much. If the condition number is $10^8$ and you use a stable method such as QR or SVD, you may have about 8 digits of accuracy in double precision. If you form the Normal equations, you’ve squared the condition number to $10^{16}$, and you have essentially no accuracy in your answer.
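To see this concretely, here is a small sketch of mine (not part of the original answer) that uses a degree-11 monomial design, a standard example with cond($A$) on the order of $10^8$, and compares the normal equations against numpy’s SVD-based lstsq:

import numpy as np

n = 500
t = np.linspace(0, 1, n)
# Vandermonde (monomial) design of degree 11: a classic ill-conditioned example,
# with cond(A) on the order of 1e8
A = np.vander(t, 12, increasing=True)
coef_true = np.ones(A.shape[1])
b = A @ coef_true

# normal equations: solve A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# SVD-based least squares, as used by np.linalg.lstsq
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.max(np.abs(x_lstsq - coef_true)))   # typically around 1e-8: about 8 digits survive
print(np.max(np.abs(x_normal - coef_true)))  # typically order 1: essentially no accuracy left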

Sometimes you get away with the Normal equations, and sometimes you don’t.
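In the question’s example you do get away with them: after centring, the design matrix has a constant column and the centred x, which are orthogonal, so the condition number is tiny and squaring it costs nothing. A quick check (my sketch, with a smaller n than the question so it runs instantly):

import numpy as np

n = 10000
x = np.linspace(-5, 5, n)
# centred design matrix corresponding to the slope formula in the question
A = np.column_stack([np.ones(n), x - x.mean()])

print(np.linalg.cond(A))        # roughly 2.9: almost perfectly conditioned
print(np.linalg.cond(A.T @ A))  # roughly 8.3: squaring such a small number is harmless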

Attribution
Source: Link, Question Author: Oliver Angelil, Answer Author: Mark L. Stone
