# Explicit solution for linear regression with two predictors

I have some samples of data of the form $x,y$ and $z=f(x,y)$. I wish to fit a plane to the data with the smallest mean square errors.

I have found an “answer” in section 3 of this document, but it is left in the form of some equations left to be solved. I just about have the ability to solve these equations, but the process gets so messy that the likelihood of me making a mistake is quite high. Surely someone has written out the full solution longhand somewhere (it might be called “closed form”) anyway, in the form

EDIT: Maybe “closed form” is the wrong phrase. So let me be clear. I want an explicit solution for $A$, $B$, and $C$, and not a solution that ends with “So now if you can solve these equations then you can discover the values of $A$, $B$ and $C$“.

Elsewhere on this site, explicit solutions to the ordinary least squares regression

are available in matrix form as

where $X$ is the “model matrix”

and $z$ is the response vector

That’s a perfectly fine, explicit, computable answer. But maybe there is some additional understanding that can be wrung out of it by inspecting the coefficients. This can be achieved by choosing appropriate units in which to express the variables.

The best units for this purpose center each variable at its mean and use its standard deviation as the unit of measurement. Explicitly, let the three means be $m_x, m_y,$ and $m_z$ and the three standard deviations be $s_x, s_y,$ and $s_z$. (It turns out not to matter whether you divide by $n$ or $n-1$ in computing the standard deviations. Just make sure you use a consistent convention when you compute any second moment of the data.) The values of the variables in these new units of measurement are

This process is known as standardizing the data. The variables $\xi$, $\eta$, and $\zeta$ are the standardized versions of the original variables $x$, $y$, and $z$.

These relationships are invertible:

Plugging these into the defining relationship

and simplifying yields

Solving for the expectation of the dependent variable $\zeta_i$ yields

If we write these coefficients as $\beta_0, \beta_1, \beta_2$ respectively, then we can recover $A, B, C$ by comparing and solving. For the record this gives

The point of this becomes apparent when we consider the new model matrix

and the new response matrix $\zeta = (\zeta_1, \zeta_2, \ldots, \zeta_n)$, because now

and

where $\rho$ is the correlation coefficient $\frac{1}{n}\sum_{i=1}^n \xi_i \eta_i$, $\tau$ is the correlation coefficient $\frac{1}{n}\sum_{i=1}^n \xi_i \zeta_i$, and $\upsilon$ is the correlation coefficient $\frac{1}{n}\sum_{i=1}^n \eta_i \zeta_i$.

To solve the normal equations $(1)$ we may divide both sides by $n$, giving

What originally looked like a formidable matrix formula has been reduced to a truly elementary set of three simultaneous equations. Provided $|\rho| \lt 1$, its solution is easily found to be

Plugging these into the coefficients in $(2)$ produces the estimates $\hat A, \hat B,$ and $\hat C$.

In fact, even more has been achieved:

• It is now evident why the cases $|\rho|=1$ are problematic: they introduce a divide-by-zero condition in the solution.

• It is equally evident how to determine whether a solution exists when $|\rho=1|$ and how to obtain it. It will exist when the second and third normal equations in $\Xi$ are redundant and it will be obtained simply by ignoring one of the variables $x$ and $y$ in the first place.

• We can derive some insight into the solution generally. For instance, from $\hat\beta_0=0$ in all cases, we may conclude that the fitted plane must pass through the point of averages $(m_x, m_y, m_z)$.

• It is now evident that the solution can be found in terms of the first two moments of the trivariate dataset $(x, y, z)$. This sheds further light on the fact that coefficient estimates can be found from means and covariance matrices alone.

• Furthermore, equation $(2)$ shows that the means are needed only to estimate the intercept term $C$. Estimates of the two slopes $A$ and $B$ require only the second moments.

• When the regressors are uncorrelated, $\rho=0$ and the solution is that the intercept is zero and the slopes are the correlation coefficients between the response $z$ and the regressors $x$ and $y$ when we standardize the data. This is both easy to remember and provides insight into how regression coefficients are related to correlation coefficients.

Putting this all together, we find that (except in the degenerate cases $|\rho|=1$) the estimates can be written

In these formulae, the $m_{*}$ are the sample means, the $s_{*}$ are the sample standard deviations, and the greek letters $\rho, \tau,$ and $\upsilon$ represent the three correlation coefficients (between $x$ and $y$, $x$ and $z$, and $y$ and $z$, respectively).

Please note that these formulas are not the best way to carry out the calculations. They all involve subtracting quantities that might be of comparable size, such as $\tau-\rho\upsilon$, $\upsilon-\rho\tau$, and $m_z - (-m_x \hat A - m_y \hat B)$. Such subtraction involves loss of precision. The matrix formulation allows numerical analysts to obtain more stable solutions that preserve as much precision as possible. This is why people rarely have any interest in term-by-term formulas. The other reason there is little interest is that as the number of regressors increases, the complexity of the formulas grows exponentially, quickly becoming too unwieldy.

As further evidence of the correctness of these formulas, we may compare their answers to those of a standard least-squares solver, the lm function in R.

#
# Generate trivariate data.
#
library(MASS)
set.seed(17)
n <- 20
mu <- 1:3
Sigma <- matrix(1, 3, 3)
Sigma[lower.tri(Sigma)] <- Sigma[upper.tri(Sigma)] <- c(.8, .5, .6)
xyz <- data.frame(mvrnorm(n, mu, Sigma))
names(xyz) <- c("x", "y", "z")
#
# Obtain the least squares coefficients.
#
beta.hat <- coef(lm(z ~ x + y, xyz))
#
# Compute the first two moments via colMeans and cov.
#
m <- colMeans(xyz)
sigma <- cov(xyz)
s <- sqrt(diag(sigma))
rho <- t(t(sigma/s)/s); rho <- as.vector(rho[lower.tri(rho)])
#
# Here are the least squares coefficient estimates in terms of the moments.
#
A.hat <- (rho - rho*rho) / (1 - rho^2) * s / s
B.hat <- (rho - rho*rho) / (1 - rho^2) * s / s
C.hat <- m - m*A.hat - m*B.hat
#
# Compare the two solutions.
#
rbind(beta.hat, formulae=c(C.hat, A.hat, B.hat))


The output exhibits two identical rows of estimates, as expected:

         (Intercept)         x        y
beta.hat    1.522571 0.3013662 0.403636
formulae    1.522571 0.3013662 0.403636