Should confidence intervals for linear regression coefficients be based on the normal or $t$ distribution?

Let’s take a linear model, for example a simple one-way ANOVA:

# data generation
set.seed(1.234)
Ng <- c(41, 37, 42)                                           # group sizes
data <- rnorm(sum(Ng), mean = rep(c(-1, 0, 1), Ng), sd = 1)   # group means -1, 0, 1
fact <- as.factor(rep(LETTERS[1:3], Ng))                      # group labels A, B, C

m1 <- lm(data ~ 0 + fact)   # one-way ANOVA, no intercept
summary(m1)

The result is as follows:

Call:
lm(formula = data ~ 0 + fact)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.30047 -0.60414 -0.04078  0.54316  2.25323 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
factA  -0.9142     0.1388  -6.588 1.34e-09 ***
factB   0.1484     0.1461   1.016    0.312    
factC   1.0990     0.1371   8.015 9.25e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.8886 on 117 degrees of freedom
Multiple R-squared: 0.4816,     Adjusted R-squared: 0.4683 
F-statistic: 36.23 on 3 and 117 DF,  p-value: < 2.2e-16 

Now I try two different methods to estimate confidence intervals for these parameters:

cf <- coef(summary(m1))   # coefficient table: Estimate, Std. Error, t value, Pr(>|t|)

# 1st method: CI limits from the SE, assuming a normal distribution
cbind(low  = cf[, 1] - qnorm(p = 0.975) * cf[, 2],
      high = cf[, 1] + qnorm(p = 0.975) * cf[, 2])

# 2nd method
confint(m1)

Questions:

  1. What is the distribution of estimated linear regression coefficients? Normal or $t$?
  2. Why do the two methods yield different results? Assuming a normal distribution and a correct SE, I’d expect both methods to give the same result.

Thank you very much!


EDIT after an answer:

The answer is correct: the following gives exactly the same result as confint(m1)!

# 3rd method: t distribution with residual degrees of freedom (n - p = sum(Ng) - 3)
cbind(low  = cf[, 1] - qt(p = 0.975, df = sum(Ng) - 3) * cf[, 2],
      high = cf[, 1] + qt(p = 0.975, df = sum(Ng) - 3) * cf[, 2])
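
As a quick check (a minimal sketch using the objects defined above; the name ci_t is mine), the two matrices can be compared directly:

# hypothetical check: the 3rd method and confint(m1) should agree numerically
ci_t <- cbind(low  = cf[, 1] - qt(p = 0.975, df = sum(Ng) - 3) * cf[, 2],
              high = cf[, 1] + qt(p = 0.975, df = sum(Ng) - 3) * cf[, 2])
all.equal(unname(ci_t), unname(confint(m1)))   # TRUE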

Answer

(1) When the errors are normally distributed and their variance is not known, then $$\frac{\hat{\beta} - \beta_0}{{\rm se}(\hat{\beta})}$$ has a $t$-distribution under the null hypothesis that $\beta_0$ is the true regression coefficient. The default in R is to test $\beta_0 = 0$, so the $t$-statistics reported there are just $$\frac{\hat{\beta}}{{\rm se}(\hat{\beta})}$$
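
This is easy to verify from the coefficient table (a small sketch, reusing the cf object from the question):

# reproduce the reported t values and two-sided p values by hand
cf[, "Estimate"] / cf[, "Std. Error"]                 # matches the "t value" column
2 * pt(-abs(cf[, "t value"]), df = df.residual(m1))   # matches Pr(>|t|)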

Note that, under some regularity conditions, the statistic above is always asymptotically normally distributed, regardless of whether the errors are normal or whether the error variance is known.
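
As a tiny illustration (the statistic value of 2 is arbitrary): with the 117 residual degrees of freedom from the example, the $t$ and normal tail probabilities are already very close.

# two-sided p-values from the normal and from t with 117 df are nearly equal
t_stat <- 2
c(normal = 2 * pnorm(-abs(t_stat)),
  t_117  = 2 * pt(-abs(t_stat), df = 117))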

(2) The reason you’re getting different results is that the percentiles of the normal distribution differ from those of the $t$-distribution. The multiplier in front of the standard error is therefore different, which in turn gives different confidence intervals.

Specifically, recall that the confidence interval using the normal distribution is

$$ \hat{\beta} \pm z_{\alpha/2} \cdot {\rm se}(\hat{\beta}) $$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile (that is, the $1-\alpha/2$ quantile) of the standard normal distribution. In the standard case of a $95\%$ confidence interval, $\alpha = .05$ and $z_{\alpha/2} \approx 1.96$. The confidence interval based on the $t$-distribution is

$$ \hat{\beta} \pm t_{\alpha/2,n-p} \cdot {\rm se}(\hat{\beta}) $$

where the multiplier $t_{\alpha/2,n-p}$ is the corresponding upper quantile of the $t$-distribution with $n-p$ degrees of freedom, where $n$ is the sample size and $p$ is the number of estimated coefficients. When $n$ is large, $t_{\alpha/2,n-p}$ and $z_{\alpha/2}$ are nearly the same.
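
A quick numerical comparison makes this concrete (a sketch; the sample sizes are arbitrary and I take $p = 1$):

# t multipliers shrink toward the z multiplier as the sample size grows
n <- c(5, 10, 30, 100, 300)
rbind(t_mult = qt(0.975, df = n - 1),
      z_mult = qnorm(0.975))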

Below is a plot of the $t$ multipliers for sample sizes ranging from $5$ to $300$ (I’ve assumed $p=1$ for this plot, but that changes nothing qualitatively). The $t$ multipliers are larger, but, as you can see, they converge to the $z$ multiplier (solid black line) as the sample size increases.

[Figure: $t$ multipliers for sample sizes $5$ to $300$, converging to the $z$ multiplier (solid black line).]
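
For reference, a minimal sketch that reproduces such a plot (assuming $p = 1$, as in the original):

# dashed curve: t multiplier as a function of sample size; solid line: z multiplier
n <- 5:300
plot(n, qt(0.975, df = n - 1), type = "l", lty = 2,
     xlab = "sample size", ylab = "97.5% multiplier")
abline(h = qnorm(0.975))
legend("topright", legend = c("t", "z"), lty = c(2, 1))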

Attribution
Source: Link; Question Author: Tomas; Answer Author: Macro
