What exactly is model instability due to multicollinearity?

I’ve read that model parameters become unstable in the presence of multicollinearity. Can someone give an example of this behavior, and explain why it happens?

Please use the following multiple linear regression for illustration:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

What Is It?

Here is an example of this behavior. I’m going to write a function to simulate regressions and output their coefficients. We’ll look at the coordinate pair of coefficients (a1,a2) in the case of no collinearity and high collinearity. Here is some code:

    library(tidyverse)

    sim <- function(rho){
      # Number of samples to draw
      N = 50
      # Covariance matrix for the two predictors
      covar = matrix(c(1, rho, rho, 1), byrow = T, nrow = 2)

      # Append a column of 1s to N draws from a 2-dimensional
      # Gaussian with covariance matrix covar
      X = cbind(rep(1, N), MASS::mvrnorm(N, mu = c(0, 0),
                  Sigma = covar))

      # True betas for our regression
      betas = c(1, 2, 4)

      # Make the outcome
      y = X %*% betas + rnorm(N, 0, 1)

      # Fit a linear model
      model = lm(y ~ X[,2] + X[,3])
      # Return a dataframe of the coefficients
      return(tibble(a1 = coef(model)[2], a2 = coef(model)[3]))
    }

    # Run the function 1000 times and stack the results
    zero_covar = rerun(1000, sim(0)) %>% bind_rows

    # Same as above, but the covariance between the two
    # predictors is now non-zero
    high_covar = rerun(1000, sim(0.95)) %>% bind_rows

    # Plot both sampling distributions on the same axes
    zero_covar %>%
      ggplot(aes(a1, a2)) +
      geom_point(data = high_covar, color = 'red') +
      geom_point()

Run that and you get something like

Plot of simulation results

This simulation is supposed to simulate the sampling distribution of the coefficients. As we can see, in the case of no collinearity (black dots) the sampling distribution for the coefficients is very tight around the true value of (2,4). The blob is symmetric about this point.

In the case of high collinearity (red dots), the coefficients of the linear model can vary quite a lot! Instability in this case manifests as wildly different coefficient values given the same data generating process.
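You can see this instability without the full simulation machinery. The sketch below (a minimal, self-contained variant of the simulation above, using only base R and MASS; `fit_once` is a name I've introduced here, not from the original answer) refits the same model on fresh draws from the same high-collinearity data generating process. Each call can return noticeably different coefficients, even though the true slopes are always (2, 4).

```r
library(MASS)

set.seed(42)
N <- 50
# Two predictors with correlation 0.95
covar <- matrix(c(1, 0.95, 0.95, 1), nrow = 2)

fit_once <- function() {
  # Fresh draw from the same data generating process
  X <- mvrnorm(N, mu = c(0, 0), Sigma = covar)
  y <- 1 + 2 * X[, 1] + 4 * X[, 2] + rnorm(N)
  # Return just the two slope estimates
  coef(lm(y ~ X))[-1]
}

fit_once()  # two slope estimates for one draw
fit_once()  # a second draw can land noticeably far from the first
```

With `rho = 0` instead of `0.95`, repeated calls cluster much more tightly around (2, 4).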

Why Is This Happening?

Let’s take a statistical perspective. The sampling distribution for the coefficients of a linear regression (with enough data) looks like

$$\hat{\beta} \sim \mathcal{N}\left(\beta, \sigma^2 (X^\top X)^{-1}\right)$$

The covariance matrix for the above is

$$\sigma^2 (X^\top X)^{-1}$$

Let’s focus for a minute on $X^\top X$. If $X$ has full rank, then $X^\top X$ is a Gram matrix, which has some special properties. One of those properties is that it has positive eigenvalues, which means we can decompose this matrix product according to the eigenvalue decomposition

$$X^\top X = Q \Lambda Q^\top$$

Suppose now one of the columns of $X$ is highly correlated with another column. Then one of the eigenvalues of $X^\top X$ should be close to 0 (I think). Inverting this product gives us

$$(X^\top X)^{-1} = Q \Lambda^{-1} Q^\top$$

Since $\Lambda$ is a diagonal matrix, $\Lambda^{-1}_{jj} = 1/\Lambda_{jj}$. If one of the eigenvalues is really small, then one of the elements of $\Lambda^{-1}$ is really big, and so too is the covariance, leading to this instability in the coefficients.
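You can check the eigenvalue claim numerically. This sketch (my own addition, not part of the original answer; `xtx_eigen` is a name introduced here) builds the same design matrix as the simulation and inspects the eigenvalues of $X^\top X$ at low and high correlation:

```r
library(MASS)

set.seed(1)
N <- 50

# Eigenvalues of X'X for a design with predictor correlation rho
xtx_eigen <- function(rho) {
  covar <- matrix(c(1, rho, rho, 1), byrow = TRUE, nrow = 2)
  X <- cbind(rep(1, N), mvrnorm(N, mu = c(0, 0), Sigma = covar))
  eigen(crossprod(X))$values
}

xtx_eigen(0)     # all eigenvalues comfortably far from 0
xtx_eigen(0.95)  # the smallest eigenvalue is now close to 0
```

A near-zero eigenvalue means one diagonal entry of $\Lambda^{-1}$, and hence the variance of some direction in coefficient space, blows up.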

I think I got that right; it has been a long time since I’ve done linear algebra.

Source: Link, Question Author: Eric Kim, Answer Author: kjetil b halvorsen
