I know how to perform a linear regression on a set of points. That is, I know how to fit a polynomial of my choice to a given data set in the least-squares (LSE) sense. However, what I do not know is how to force my solution to go through some particular points of my choosing. I have seen this done before, but I cannot remember what the procedure was called, let alone how it was done.
As a very simple and concrete example, let us say that I have 100 points scattered on the x-y plane, and I choose to fit a polynomial of whatever order through them. I know how to perform this linear regression very well. However, let us say that I want to 'force' my solution to go through, say, three of my data points at the x-coordinates $x=3$, $x=19$, and $x=89$ (and their corresponding y-coordinates, of course).
What is this general procedure called, how is it done, and are there any particular pitfalls I need to be aware of?
Edit:
I would like to add that I am looking for a concrete way to do this. I have written a program that actually does the linear regression in one of two ways: inverting the covariance matrix directly, or through gradient descent. What I am asking is: how, exactly, step by step, do I modify what I did so that the polynomial solution is forced to go through specific points?
Thanks!
Answer
The model in question can be written
$$y = p(x) + (x-x_1)\cdots(x-x_d)\left(\beta_0 + \beta_1 x + \cdots + \beta_p x^p \right) + \varepsilon$$
where $p$ is a polynomial of degree $d-1$ with $p(x_i) = y_i$, i.e., passing through the predetermined points $(x_1,y_1), \ldots, (x_d,y_d)$, and $\varepsilon$ is random. (Use the Lagrange interpolating polynomial to construct $p$.) Writing $(x-x_1)\cdots(x-x_d) = r(x)$ allows us to rewrite this model as
$$y - p(x) = \beta_0 r(x) + \beta_1 r(x)x + \beta_2 r(x)x^2 + \cdots + \beta_p r(x)x^p + \varepsilon,$$
which is a standard OLS multiple regression problem with the same error structure as the original, where the independent variables are the $p+1$ quantities $r(x)x^i,$ $i=0, 1, \ldots, p$. Simply compute these variables and run your familiar regression software, making sure to prevent it from including a constant term. The usual caveats about regressions without a constant term apply; in particular, the $R^2$ can be artificially high and does not carry its usual interpretation.
(In fact, regression through the origin is a special case of this construction where $d=1$, $(x_1,y_1) = (0,0)$, and $p(x)=0$, so that the model is $y = \beta_0 x + \cdots + \beta_p x^{p+1} + \varepsilon.$)
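For concreteness, one way to construct $p(x)$ is direct Lagrange interpolation. Here is a minimal sketch in R; the helper name lagrange.poly and the sample fixed points below are purely illustrative, not part of the example that follows.
# Evaluate the Lagrange interpolating polynomial through (px[i], py[i]) at x.
lagrange.poly <- function(x, px, py) {
  sapply(x, function(x0) {
    sum(sapply(seq_along(px), function(i) py[i] * prod((x0 - px[-i]) / (px[i] - px[-i]))))
  })
}
# Illustration: the quadratic p with p(2)=1, p(12)=-3, p(23)=5, and the matching r(x).
p <- function(x) lagrange.poly(x, px=c(2,12,23), py=c(1,-3,5))
r <- function(x) (x-2)*(x-12)*(x-23)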
Here is a worked example (in R).
# Generate some data that *do* pass through three points (up to random error).
x <- 1:24
f <- function(x) ( (x-2)*(x-12) + (x-2)*(x-23) + (x-12)*(x-23) ) / 100  # Plays the role of p(x): a quadratic whose values at x = 2, 12, 23 give the fixed points
y0 <- (x-2) * (x-12) * (x-23) * (1 + x - (x/24)^2) / 10^4 + f(x)
set.seed(17)
eps <- rnorm(length(y0), mean=0, sd=1/2)
y <- y0 + eps
data <- data.frame(x,y)
# Plot the data and the three special points.
plot(data)
points(cbind(c(2,12,23), f(c(2,12,23))), pch=19, col="Red", cex=1.5)
# For comparison, conduct unconstrained polynomial regression
data$x2 <- x^2
data$x3 <- x^3
data$x4 <- x^4
fit0 <- lm(y ~ x + x2 + x3 + x4, data=data)
lines(predict(fit0), lty=2, lwd=2)
# Conduct the constrained regressions
data$y1 <- y - f(x)
data$r <- (x-2)*(x-12)*(x-23)   # r(x): vanishes at the three fixed x-coordinates
data$z0 <- data$r
data$z1 <- data$r * x
data$z2 <- data$r * x^2
fit <- lm(y1 ~ z0 + z1 + z2 - 1, data=data)   # '- 1' suppresses the constant term
lines(predict(fit) + f(x), col="Red", lwd=2)  # add p(x) = f(x) back to get the fitted curve
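Because every regressor contains the factor $r(x)$, which vanishes at $x = 2, 12, 23$, the constrained fit reproduces the prescribed values exactly at those points. A quick check (the names fixed.x, fitted, and target below are just for display):
# Sanity check: the fitted curve equals f at the fixed points, since r(x) = 0 there.
fixed.x <- c(2, 12, 23)
cbind(x=fixed.x, fitted=(predict(fit) + f(x))[fixed.x], target=f(fixed.x))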
The three fixed points are shown as solid red dots; they are not part of the data. The unconstrained fourth-order polynomial least-squares fit is shown as a black dashed line (it has five parameters); the constrained fit (of degree five, but with only three free parameters) is shown as the red line.
Inspecting the least squares output (summary(fit0) and summary(fit)) can be instructive; I leave this to the interested reader.
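To connect this back to the question's edit: if you solve the least-squares problem yourself rather than calling lm, the only modification is to regress $y - p(x)$ on the columns $r(x)x^i$ with no intercept column; the same transformed variables work equally well with gradient descent on the usual least-squares objective. A minimal sketch of the direct normal-equations solution, reusing the variables defined in the example above:
# The same constrained fit by hand: regress y - p(x) on the columns
# r(x), r(x)*x, r(x)*x^2 (no intercept column) via the normal equations.
Z <- with(data, cbind(z0, z1, z2))
beta.hat <- solve(t(Z) %*% Z, t(Z) %*% data$y1)   # agrees with coef(fit) up to rounding
y.hat <- as.vector(Z %*% beta.hat) + f(x)          # add p(x) = f(x) back to get the curve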
Attribution
Source: Link, Question Author: Spacey, Answer Author: whuber