Perform linear regression, but force solution to go through some particular data points

I know how to perform a linear regression on a set of points; that is, I know how to fit a polynomial of my choice to a given data set (in the least-squares sense). However, what I do not know is how to force my solution to go through some particular points of my choosing. I have seen this done before, but I cannot remember what the procedure was called, let alone how it was done.

As a very simple and concrete example, let us say I have 100 points scattered on the x-y plane, and I choose to fit a polynomial of whatever order through them. I know how to perform this linear regression very well. However, let us say I want to ‘force’ my solution to go through, say, three of my data points at x-coordinates $x=3$, $x=19$, and $x=89$ (and their corresponding y-coordinates, of course).

What is this general procedure called, how is it done, and are there any particular pitfalls I need to be aware of?

Edit:

I would like to add that I am looking for a concrete way to do this. I have written a program that actually does the linear regression in one of two ways: by inverting the covariance matrix directly, or through gradient descent. What I am asking is how, exactly, step by step, I should modify what I did so that the polynomial solution is forced through specific points.

Thanks!

Answer

The model in question can be written

$$y = p(x) + (x-x_1)\cdots(x-x_d)\left(\beta_0 + \beta_1 x + \cdots + \beta_p x^p \right) + \varepsilon$$

where $p$ is a polynomial of degree $d-1$ with $p(x_i) = y_i$, i.e. passing through the predetermined points $(x_1,y_1), \ldots, (x_d,y_d)$, and $\varepsilon$ is random. (Use the Lagrange interpolating polynomial to construct $p$.) Writing $(x-x_1)\cdots(x-x_d) = r(x)$ allows us to rewrite this model as

$$y - p(x) = \beta_0 r(x) + \beta_1 r(x)x + \beta_2 r(x)x^2 + \cdots + \beta_p r(x)x^p + \varepsilon,$$

which is a standard OLS multiple regression problem with the same error structure as the original where the independent variables are the $p+1$ quantities $r(x)x^i,$ $i=0, 1, \ldots, p$. Simply compute these variables and run your familiar regression software, making sure to prevent it from including a constant term. The usual caveats about regressions without a constant term apply; in particular, the $R^2$ can be artificially high; the usual interpretations do not apply.
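As a generic sketch of that recipe in R (the helper lagrange_poly and the names xk, yk, P are illustrative choices, with x and y the data and (xk, yk) the fixed points):

# Interpolating polynomial: returns a function p with p(xk[i]) = yk[i],
# built from the Lagrange basis (assumes the xk are distinct).
lagrange_poly <- function(xk, yk) {
  function(x) sapply(x, function(x0)
    sum(sapply(seq_along(xk), function(i)
      yk[i] * prod((x0 - xk[-i]) / (xk[i] - xk[-i])))))
}

# Subtract p, then regress on r(x) x^i with no constant term.
p <- lagrange_poly(xk, yk)                   # degree d-1, passes through the fixed points
r <- sapply(x, function(x0) prod(x0 - xk))   # r(x) = (x - x1)...(x - xd)
Z <- sapply(0:P, function(i) r * x^i)        # columns r(x) x^i, i = 0, ..., P
fit <- lm(y - p(x) ~ Z - 1)                  # "- 1" suppresses the intercept
yhat <- p(x) + fitted(fit)                   # fitted curve on the original scale

Because $r$ vanishes at every fixed point, the fitted curve passes through those points exactly, whatever coefficients the regression returns.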

(In fact, regression through the origin is a special case of this construction where $d=1$, $(x_1,y_1) = (0,0)$, and $p(x)=0$, so that the model is $y = \beta_0 x + \cdots + \beta_p x^{p+1} + \varepsilon.$)
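As a quick illustration of that special case (again a sketch; fit_origin is an illustrative name and x, y are assumed data vectors), forcing the fit through the origin is just the usual no-intercept regression:

# Fit through (0, 0): here p(x) = 0 and r(x) = x, so regress y on x, x^2, ... with no constant.
fit_origin <- lm(y ~ x + I(x^2) - 1)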


Here is a worked example (in R):

# Generate some data that *do* pass through three points (up to random error).
x <- 1:24
f <- function(x) ( (x-2)*(x-12) + (x-2)*(x-23) + (x-12)*(x-23) )  / 100
y0 <- (x-2) * (x-12) * (x-23) * (1 + x - (x/24)^2) / 10^4  + f(x)
set.seed(17)
eps <- rnorm(length(y0), mean=0, sd=1/2)
y <- y0 + eps
data <- data.frame(x,y)

# Plot the data and the three special points.
plot(data)
points(cbind(c(2,12,23), f(c(2,12,23))), pch=19, col="Red", cex=1.5)

# For comparison, conduct unconstrained polynomial regression
data$x2 <- x^2
data$x3 <- x^3
data$x4 <- x^4

fit0 <- lm(y ~ x + x2 + x3 + x4, data=data)
lines(x, predict(fit0), lty=2, lwd=2)

# Conduct the constrained regressions
data$y1 <- y - f(x)
data$r <- (x-2)*(x-12)*(x-23)
data$z0 <- data$r
data$z1 <- data$r * x
data$z2 <- data$r * x^2

fit <- lm(y1 ~ z0 + z1 + z2 - 1, data=data)
lines(x, predict(fit) + f(x), col="Red", lwd=2)

[Plot: the data with the three fixed points, the unconstrained fit, and the constrained fit]

The three fixed points are shown as solid red dots; they are not part of the data. The unconstrained fourth-order polynomial least-squares fit is shown with a black dotted line (it has five parameters); the constrained fit (of degree five, but with only three free parameters) is shown with the red line.

Inspecting the least-squares output (summary(fit0) and summary(fit)) can be instructive; I leave this to the interested reader.
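A further sanity check one might add, reusing the objects defined in the example above: because r(x) vanishes at the three fixed x-values, the constrained fit reproduces f there exactly, up to floating-point error.

# Evaluate the constrained fit at the fixed points; this returns f(x.fix) exactly,
# since z0 = z1 = z2 = 0 there.
x.fix <- c(2, 12, 23)
r.fix <- (x.fix-2) * (x.fix-12) * (x.fix-23)   # all zeros
newdata <- data.frame(z0 = r.fix, z1 = r.fix * x.fix, z2 = r.fix * x.fix^2)
predict(fit, newdata) + f(x.fix)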

Attribution
Source: Link, Question Author: Spacey, Answer Author: whuber
