Regression when each point has its own uncertainty in both x and y

I made $n$ measurements of two variables $x$ and $y$. Both have known uncertainties $\sigma_x$ and $\sigma_y$ associated with them. I want to find the relation between $x$ and $y$. How can I do it?

EDIT: each $x_i$ has a different $\sigma_{x,i}$ associated with it, and likewise for the $y_i$.


Reproducible R example:

## pick some real x and y values 
true_x <- 1:100
true_y <- 2*true_x+1

## pick the (per-point) uncertainty on them
sigma_x <- runif(length(true_x), 1, 10) # known x uncertainties, one per point
sigma_y <- runif(length(true_y), 1, 15) # known y uncertainties, one per point

## perturb both x and y with noise 
noisy_x <- rnorm(length(true_x), true_x, sigma_x)
noisy_y <- rnorm(length(true_y), true_y, sigma_y)

## make a plot 
plot(NA, xlab="x", ylab="y",
    xlim=range(noisy_x-sigma_x, noisy_x+sigma_x), 
    ylim=range(noisy_y-sigma_y, noisy_y+sigma_y))
arrows(noisy_x, noisy_y-sigma_y, 
       noisy_x, noisy_y+sigma_y, 
       length=0, angle=90, code=3, col="darkgray")
arrows(noisy_x-sigma_x, noisy_y,
       noisy_x+sigma_x, noisy_y,
       length=0, angle=90, code=3, col="darkgray")
points(noisy_y ~ noisy_x)

## fit a line 
mdl <- lm(noisy_y ~ noisy_x)
abline(mdl)

## show confidence interval around line 
newXs <- seq(-100, 200, 1)
prd <- predict(mdl, newdata=data.frame(noisy_x=newXs), 
    interval=c('confidence'), level=0.99, type='response')
lines(newXs, prd[,2], col='black', lty=3)
lines(newXs, prd[,3], col='black', lty=3)

[Figure: linear regression fit that ignores the errors in the variables]

I think the problem with this example is that it assumes there are no uncertainties in x. How can I fix this?

Answer

Let the true line $L$, given by an angle $\theta$ and a value $\gamma$, be the set

$$\{(x,y) : \cos(\theta)\,x + \sin(\theta)\,y = \gamma\}.$$

The signed distance between any point $(x,y)$ and this line is

$$d(x,y;L) = \cos(\theta)\,x + \sin(\theta)\,y - \gamma.$$

Letting the variance of $x_i$ be $\sigma_i^2$ and that of $y_i$ be $\tau_i^2$, the independence of $x_i$ and $y_i$ implies the variance of this distance is

$$\operatorname{Var}(d(x_i,y_i;L)) = \cos^2(\theta)\,\sigma_i^2 + \sin^2(\theta)\,\tau_i^2.$$

Let us therefore find the $\theta$ and $\gamma$ for which the inverse-variance-weighted sum of squared distances is as small as possible: this is the maximum likelihood solution if we assume the errors have bivariate normal distributions. It requires a numerical search, but that is straightforward with a few Newton-Raphson steps beginning from the values suggested by an ordinary least-squares fit.

Simulations suggest this solution is good even with small amounts of data and relatively large values of $\sigma_i$ and $\tau_i$. You can, of course, obtain standard errors for the parameters in the usual ways. If you're interested in the standard error of the position of the line as well as the slope, then you might wish first to center both variables at $0$: that should eliminate almost all the correlation between the estimates of the two parameters.
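As a concrete sketch of "the usual ways" (this is an addition, not part of the original answer): ask the optimizer for the Hessian of the negative log likelihood at the optimum and invert it to get the asymptotic covariance of $(\hat\theta, \hat\gamma)$. The objective here is the same weighted sum of squared distances used in the code further below; the simulated data set is made up for illustration.

```r
## Sketch: standard errors from the observed information matrix.
## The data below are illustrative, not from the question.
set.seed(1)
n <- 50
x0 <- seq(0, 10, length.out = n)
y0 <- 1 + 0.5 * x0                      # true line
sigma <- rexp(n, 2); tau <- rexp(n, 2)  # known per-point sds
x <- rnorm(n, x0, sigma); y <- rnorm(n, y0, tau)

## Negative log likelihood, up to an additive constant,
## as in the answer's objective.
nll <- function(phi) {
  a <- phi[1]; gamma <- phi[2]
  sum((x * cos(a) + y * sin(a) - gamma)^2 /
        ((sigma * cos(a))^2 + (tau * sin(a))^2)) / 2
}

## Start from an ordinary least-squares fit.
b <- coef(lm(y ~ x))[2]
start <- c(atan2(1, -b), coef(lm(y ~ x))[1] / sqrt(1 + b^2))
sol <- nlm(nll, start, hessian = TRUE)

## Asymptotic covariance of (theta, gamma) is the inverse Hessian
## of the negative log likelihood at the optimum.
se <- sqrt(diag(solve(sol$hessian)))
se  # standard errors of theta-hat and gamma-hat
```

Centering $x$ and $y$ first, as suggested above, would make the off-diagonal entry of the inverse Hessian nearly zero.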


The method works so well with the example of the question that the fitted line would be almost indistinguishable from the true line in the plot: they are within one unit or so of each other everywhere. Instead, the example shown here draws the $\tau_i$ iid from an exponential distribution and the $\sigma_i$ iid from an exponential distribution with twice the scale (so that most of the error tends to occur in the $x$ coordinate). There are only $n = 8$ points, a small number. The true points are equally spaced along the line with unit spacing. This is a fairly severe test, because the potential errors are noticeable compared to the range of the points.

[Figure: simulated points, true line, and fitted line]

The true line is shown in dotted blue. Along it the original points are plotted as hollow circles. Gray arrows connect them to the observed points, plotted as solid black disks. The solution is drawn as a solid red line. Despite the presence of large deviations between observed and actual values, the solution is remarkably close to the correct line within this region.

#
# Generate data.
#
theta <- c(1, -2, 3) # The line is theta %*% c(x,y,-1) == 0
theta[-3] <- theta[-3]/sqrt(crossprod(theta[-3]))
n <- 8
set.seed(17)
sigma <- rexp(n, 1/2)
tau <- rexp(n, 1)
u <- 1:n
xy.0 <- t(outer(c(-theta[2], theta[1]), 0:(n-1)) + c(theta[3]/theta[1], 0))
xy <- xy.0 + cbind(rnorm(n, sd=sigma), rnorm(n, sd=tau))
#
# Fit a line.
#
x <- xy[, 1]
y <- xy[, 2]
f <- function(phi) { # Negative log likelihood, up to an additive constant
  a <- phi[1]
  gamma <- phi[2]
  sum((x*cos(a) + y*sin(a) - gamma)^2 / ((sigma*cos(a))^2 + (tau*sin(a))^2))/2
}
fit <- lm(y ~ x) # Yields starting estimates
slope <- coef(fit)[2]
theta.0 <- atan2(1, -slope)
gamma.0 <- coef(fit)[1] / sqrt(1 + slope^2)
sol <- nlm(f,c(theta.0, gamma.0))
#
# Plot the data and the fit.
#
theta.hat <- sol$estimate[1] %% (2*pi)
gamma.hat <- sol$estimate[2]
plot(rbind(xy.0, xy), type="n", xlab="x", ylab="y")
invisible(sapply(1:n, function(i) 
  arrows(xy.0[i,1], xy.0[i,2], xy[i,1], xy[i,2], 
         length=0.15, angle=20, col="Gray")))
points(xy.0)
points(xy, pch=16)
abline(c(theta[3] / theta[2], -theta[1]/theta[2]), col="Blue", lwd=2, lty=3)
abline(c(gamma.hat / sin(theta.hat), -1/tan(theta.hat)), col="Red", lwd=2)
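To connect this back to the question's own simulation, here is a sketch (an addition, not part of the original answer) that regenerates the question's data and fits it with the same weighted likelihood, then converts $(\hat\theta, \hat\gamma)$ back to a slope and intercept. The seed and variable names are arbitrary choices.

```r
## Sketch: apply the errors-in-variables fit to data simulated
## exactly as in the question.
set.seed(42)
true_x <- 1:100
true_y <- 2 * true_x + 1
sigma_x <- runif(100, 1, 10)   # known per-point x uncertainties
sigma_y <- runif(100, 1, 15)   # known per-point y uncertainties
noisy_x <- rnorm(100, true_x, sigma_x)
noisy_y <- rnorm(100, true_y, sigma_y)

## The answer's negative log likelihood, with (sigma, tau)
## replaced by (sigma_x, sigma_y).
nll <- function(phi) {
  a <- phi[1]; gamma <- phi[2]
  sum((noisy_x * cos(a) + noisy_y * sin(a) - gamma)^2 /
        ((sigma_x * cos(a))^2 + (sigma_y * sin(a))^2)) / 2
}

## Ordinary least squares supplies the starting values.
ols <- lm(noisy_y ~ noisy_x)
b <- coef(ols)[2]
start <- c(atan2(1, -b), coef(ols)[1] / sqrt(1 + b^2))
sol <- nlm(nll, start)
theta.hat <- sol$estimate[1]
gamma.hat <- sol$estimate[2]

## cos(theta) x + sin(theta) y = gamma
##   =>  y = gamma/sin(theta) - x/tan(theta)
slope.hat <- -1 / tan(theta.hat)
intercept.hat <- gamma.hat / sin(theta.hat)
c(slope = slope.hat, intercept = intercept.hat)
```

The recovered slope should be close to the true value of 2 (the OLS slope, by contrast, is attenuated toward zero by the noise in $x$).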

Attribution
Source: Link, Question Author: rhombidodecahedron, Answer Author: whuber
