# Why not use the “normal equations” to find simple least squares coefficients?

I saw this list here and couldn’t believe there were so many ways to solve least squares. The “normal equations” on Wikipedia seemed to be a fairly straight forward way:

So why not just use them? I assumed there must be a computational or precision issue given that in the first link above Mark L. Stone mentions that SVD or QR are popular methods in statistical software and that the normal equations are “TERRIBLE from reliability and numerical accuracy standpoint”. However, in the following code, the normal equations are giving me accuracy to ~12 decimal places when compared to three popular python functions: numpy’s polyfit; scipy’s linregress; and scikit-learn’s LinearRegression.

What’s more interesting is that the normal equation method is the fastest when n = 100000000. Computational times for me are: 2.5s for linregress; 12.9s for polyfit; 4.2s for LinearRegression; and 1.8s for the normal equation.

Code:

import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import linregress
import timeit

b0 = 0
b1 = 1
n = 100000000
x = np.linspace(-5, 5, n)
np.random.seed(42)
e = np.random.randn(n)
y = b0 + b1*x + e

# scipy
start = timeit.default_timer()
print(str.format('{0:.30f}', linregress(x, y)[0]))
stop = timeit.default_timer()
print(stop - start)

# numpy
start = timeit.default_timer()
print(str.format('{0:.30f}', np.polyfit(x, y, 1)[0]))
stop = timeit.default_timer()
print(stop - start)

# sklearn
clf = LinearRegression()
start = timeit.default_timer()
clf.fit(x.reshape(-1, 1), y.reshape(-1, 1))
stop = timeit.default_timer()
print(str.format('{0:.30f}', clf.coef_[0, 0]))
print(stop - start)

# normal equation
start = timeit.default_timer()
slope = np.sum((x-x.mean())*(y-y.mean()))/np.sum((x-x.mean())**2)
stop = timeit.default_timer()
print(str.format('{0:.30f}', slope))
print(stop - start)


For the problem $Ax \approx b$, forming the Normal equations squares the condition number of $A$ by forming $A^TA$. Roughly speaking $log_{10}(cond)$ is the number of digits you lose in your calculation if everything is done well. And this doesn’t really have anything to do with forming the inverse of $A^TA$. No matter how $A^TAx = A^Tb$ is solved, you’ve already lost $log_{10}(cond(A^TA)) = 2*log_{10}(cond(A))$ digits of accuracy. I.e., forming the Normal equations has doubled the number of digits of accuracy lost, right off the bat.
If the condition number is small (one is the best possible), it doesn’t matter much. If the condition number = $10^8$ and you use a stable method such as QR or SVD, you may have about 8 digits of accuracy in double precision. If you form the Normal equations, you’ve squared the condition number to $10^{16}$, and you have essentially no accuracy in your answer.