Explicit solution for linear regression with two predictors

I have some samples of data of the form $(x, y, z)$, where $z = f(x, y)$. I wish to fit a plane $z = Ax + By + C$ to the data with the smallest mean square error.

I have found an “answer” in section 3 of this document, but it is left in the form of some equations still to be solved. I can just about solve these equations myself, but the process gets so messy that the likelihood of my making a mistake is quite high. Surely someone has written out the full solution longhand somewhere (it might be called “closed form”), in the form $A = \dots$, $B = \dots$, $C = \dots$.

EDIT: Maybe “closed form” is the wrong phrase. So let me be clear: I want an explicit solution for $A$, $B$, and $C$, and not a solution that ends with “So now if you can solve these equations then you can discover the values of $A$, $B$, and $C$.”

Answer

Elsewhere on this site, explicit solutions to the ordinary least squares regression

$$E(z_i) = Ax_i + By_i + C$$

are available in matrix form as

$$(C, A, B)' = (X'X)^{-1}X'z \tag{1}$$

where $X$ is the “model matrix”

$$X = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{pmatrix}$$

and $z$ is the response vector

$$z = (z_1, z_2, \ldots, z_n)'.$$

That’s a perfectly fine, explicit, computable answer. But maybe there is some additional understanding that can be wrung out of it by inspecting the coefficients. This can be achieved by choosing appropriate units in which to express the variables.
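As a quick illustration, here is a minimal R sketch of formula (1), assuming numeric vectors `x`, `y`, and `z` of equal length hold the data (these names are placeholders, not part of the derivation):

X <- cbind(1, x, y)                    # the model matrix
coefs <- solve(t(X) %*% X, t(X) %*% z) # (C, A, B)' = (X'X)^{-1} X'z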

The best units for this purpose center each variable at its mean and use its standard deviation as the unit of measurement. Explicitly, let the three means be $m_x$, $m_y$, and $m_z$ and the three standard deviations be $s_x$, $s_y$, and $s_z$. (It turns out not to matter whether you divide by $n$ or $n-1$ in computing the standard deviations. Just make sure you use a consistent convention when you compute any second moment of the data.) The values of the variables in these new units of measurement are

$$\xi_i = \frac{x_i - m_x}{s_x}, \quad \eta_i = \frac{y_i - m_y}{s_y}, \quad \zeta_i = \frac{z_i - m_z}{s_z}.$$

This process is known as standardizing the data. The variables $\xi$, $\eta$, and $\zeta$ are the standardized versions of the original variables $x$, $y$, and $z$.

These relationships are invertible:

$$x_i = s_x\xi_i + m_x, \quad y_i = s_y\eta_i + m_y, \quad z_i = s_z\zeta_i + m_z.$$
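In R this can be done with the built-in `scale` function; a small sketch, again assuming vectors `x`, `y`, and `z` (note that `scale` uses the $n-1$ convention for the standard deviation):

xi   <- as.vector(scale(x))   # (x - mean(x)) / sd(x)
eta  <- as.vector(scale(y))
zeta <- as.vector(scale(z))
all.equal(x, sd(x) * xi + mean(x))   # TRUE: the relationship is invertible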

Plugging these into the defining relationship

$$E(z_i) = C + Ax_i + By_i$$

and simplifying yields

$$E(s_z\zeta_i + m_z) = C + A(s_x\xi_i + m_x) + B(s_y\eta_i + m_y).$$

Solving for the expectation of the dependent variable $\zeta_i$ yields

$$E(\zeta_i) = \left(\frac{C + Am_x + Bm_y - m_z}{s_z}\right) + \left(\frac{As_x}{s_z}\right)\xi_i + \left(\frac{Bs_y}{s_z}\right)\eta_i.$$

If we write these coefficients as $\beta_0$, $\beta_1$, and $\beta_2$ respectively, then we can recover $A$, $B$, and $C$ by comparing coefficients and solving. For the record, this gives

$$A = \frac{s_z\beta_1}{s_x}, \quad B = \frac{s_z\beta_2}{s_y}, \quad \text{and} \quad C = s_z\beta_0 + m_z - Am_x - Bm_y. \tag{2}$$
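In code, the back-transformation (2) is a one-liner per coefficient; a sketch, assuming `beta0`, `beta1`, and `beta2` hold the coefficients of a fit to the standardized data:

A <- sd(z) * beta1 / sd(x)
B <- sd(z) * beta2 / sd(y)
C <- sd(z) * beta0 + mean(z) - A * mean(x) - B * mean(y)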

The point of this becomes apparent when we consider the new model matrix

$$\Xi = \begin{pmatrix} 1 & \xi_1 & \eta_1 \\ 1 & \xi_2 & \eta_2 \\ \vdots & \vdots & \vdots \\ 1 & \xi_n & \eta_n \end{pmatrix}$$

and the new response vector $\zeta = (\zeta_1, \zeta_2, \ldots, \zeta_n)'$, because now

$$\Xi'\Xi = \begin{pmatrix} n & 0 & 0 \\ 0 & n & n\rho \\ 0 & n\rho & n \end{pmatrix}$$

and

$$\Xi'\zeta = (0, n\tau, n\upsilon)'$$

where $\rho = \frac{1}{n}\sum_{i=1}^n \xi_i\eta_i$ is the correlation coefficient of $x$ and $y$, $\tau = \frac{1}{n}\sum_{i=1}^n \xi_i\zeta_i$ is the correlation coefficient of $x$ and $z$, and $\upsilon = \frac{1}{n}\sum_{i=1}^n \eta_i\zeta_i$ is the correlation coefficient of $y$ and $z$.
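This structure is easy to verify numerically. A sketch, standardizing with the divide-by-$n$ convention so that the pattern holds exactly (the helper `std` is mine, not part of the original answer):

std  <- function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))  # divide-by-n sd
Xi   <- cbind(1, std(x), std(y))
zeta <- std(z)
crossprod(Xi) / length(x)        # 1s on the diagonal, 0s and rho off it
crossprod(Xi, zeta) / length(x)  # (0, tau, upsilon)'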

To solve the normal equations (1) we may divide both sides by $n$, giving

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} 0 \\ \tau \\ \upsilon \end{pmatrix}.$$

What originally looked like a formidable matrix formula has been reduced to a truly elementary set of three simultaneous equations. Provided $|\rho| < 1$, its solution is easily found to be

$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \frac{1}{1-\rho^2}\begin{pmatrix} 0 \\ \tau - \rho\upsilon \\ \upsilon - \rho\tau \end{pmatrix}.$$
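Equivalently, one can hand the reduced system straight to a linear solver; a sketch, assuming `rho`, `tau`, and `upsilon` have already been computed from the data:

M <- matrix(c(1,   0,   0,
              0,   1,   rho,
              0,   rho, 1), nrow = 3, byrow = TRUE)
beta.std <- solve(M, c(0, tau, upsilon))
# same as c(0, tau - rho*upsilon, upsilon - rho*tau) / (1 - rho^2)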

Plugging these into the coefficients in (2) produces the estimates $\hat A$, $\hat B$, and $\hat C$.

In fact, even more has been achieved:

  • It is now evident why the cases $|\rho| = 1$ are problematic: they introduce a divide-by-zero condition in the solution.

  • It is equally evident how to determine whether a solution exists when $|\rho| = 1$ and how to obtain it. A solution will exist when the second and third normal equations for $\Xi$ are redundant, and it can be obtained simply by omitting one of the variables $x$ and $y$ from the model in the first place.

  • We can derive some insight into the solution generally. For instance, because $\hat\beta_0 = 0$ in all cases, we may conclude that the fitted plane must pass through the point of averages $(m_x, m_y, m_z)$.

  • It is now evident that the solution can be found in terms of the first two moments of the trivariate dataset $(x, y, z)$. This sheds further light on the fact that coefficient estimates can be found from means and covariance matrices alone.

  • Furthermore, equation (2) shows that the means are needed only to estimate the intercept term $C$. Estimates of the two slopes $A$ and $B$ require only the second moments.

  • When the regressors are uncorrelated, $\rho = 0$, and the solution is especially simple: after standardizing the data, the intercept is zero and each slope is the correlation coefficient between the response $z$ and the corresponding regressor $x$ or $y$. This is easy to remember and provides insight into how regression coefficients are related to correlation coefficients.


Putting this all together, we find that (except in the degenerate cases $|\rho| = 1$) the estimates can be written

$$\hat A = \frac{\tau - \rho\upsilon}{1-\rho^2}\cdot\frac{s_z}{s_x}, \qquad \hat B = \frac{\upsilon - \rho\tau}{1-\rho^2}\cdot\frac{s_z}{s_y}, \qquad \hat C = m_z - m_x\hat A - m_y\hat B.$$

In these formulae, the $m$ are the sample means, the $s$ are the sample standard deviations, and the Greek letters $\rho$, $\tau$, and $\upsilon$ represent the three correlation coefficients (between $x$ and $y$, $x$ and $z$, and $y$ and $z$, respectively).
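For example, substituting $\rho = 0$ (the uncorrelated case noted in the list above) collapses the slopes to

$$\hat A = \tau\,\frac{s_z}{s_x}, \qquad \hat B = \upsilon\,\frac{s_z}{s_y},$$

so each slope is just the relevant correlation coefficient rescaled by a ratio of standard deviations.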

Please note that these formulas are not the best way to carry out the calculations. They all involve subtracting quantities that might be of comparable size, such as $\tau - \rho\upsilon$, $\upsilon - \rho\tau$, and $m_z - (m_x\hat A + m_y\hat B)$. Such subtraction involves loss of precision. The matrix formulation allows numerical analysts to obtain more stable solutions that preserve as much precision as possible, which is why people rarely have any interest in term-by-term formulas. The other reason there is little interest is that as the number of regressors increases, the formulas rapidly grow in complexity, quickly becoming too unwieldy.
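As a sketch of what the more stable route looks like in R (both `lm` and `qr.solve` factor the model matrix with a QR decomposition rather than forming $X'X$ explicitly):

X <- cbind(1, x, y)
qr.solve(X, z)   # least squares coefficients via QR, avoiding X'X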


As further evidence of the correctness of these formulas, we may compare their answers to those of a standard least-squares solver, the `lm` function in R.

#
# Generate trivariate data.
#
library(MASS)
set.seed(17)
n <- 20
mu <- 1:3
Sigma <- matrix(1, 3, 3)
Sigma[lower.tri(Sigma)] <- Sigma[upper.tri(Sigma)] <- c(.8, .5, .6)
xyz <- data.frame(mvrnorm(n, mu, Sigma))
names(xyz) <- c("x", "y", "z")
#
# Obtain the least squares coefficients.
#
beta.hat <- coef(lm(z ~ x + y, xyz))
#
# Compute the first two moments via `colMeans` and `cov`.
#
m <- colMeans(xyz)
sigma <- cov(xyz)
s <- sqrt(diag(sigma))
rho <- t(t(sigma/s)/s)                # covariance matrix -> correlation matrix
rho <- as.vector(rho[lower.tri(rho)]) # (rho, tau, upsilon) = cor(x,y), cor(x,z), cor(y,z)
#
# Here are the least squares coefficient estimates in terms of the moments.
#
A.hat <- (rho[2] - rho[1]*rho[3]) / (1 - rho[1]^2) * s[3] / s[1]
B.hat <- (rho[3] - rho[1]*rho[2]) / (1 - rho[1]^2) * s[3] / s[2]
C.hat <- m[3] - m[1]*A.hat - m[2]*B.hat
#
# Compare the two solutions.
#
rbind(beta.hat, formulae=c(C.hat, A.hat, B.hat))

The output exhibits two identical rows of estimates, as expected:

         (Intercept)         x        y
beta.hat    1.522571 0.3013662 0.403636
formulae    1.522571 0.3013662 0.403636

Attribution
Source: Link, Question Author: Mick, Answer Author: Community
