Why does an insignificant regressor become significant if I add some significant dummy variables? [duplicate]

I’m running a linear regression with cluster-robust standard errors and I have the following conceptual problem:
I have five regressors, four of which are statistically significant, while the remaining one is not.

When I put $K$ dummy variables in the model in order to control for effects not captured by the $5$ initial explanatory variables, I saw that:

  1. Some dummy variables were statistically significant.
  2. The regressor that initially was not significant became significant.

What is the reason for the second result? What does it mean?

Answer

What you have described is a classic example of confounding. For the sake of argument, suppose you want to know what factors affect the price of a car, and the original model you fitted was:

$Price_i = \beta_0 + \beta_1 MPG_i + \beta_2 Weight_i + \beta_3 Length_i + \beta_4 GearRatio_i + \varepsilon_i$

where $MPG$ is how many miles per gallon the car gets.

The regression results are as follows:

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  4,    69) =   10.93
       Model |   246385405     4  61596351.2           Prob > F      =  0.0000
    Residual |   388679991    69  5633043.35           R-squared     =  0.3880
-------------+------------------------------           Adj R-squared =  0.3525
       Total |   635065396    73  8699525.97           Root MSE      =  2373.4

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   -90.8697   82.54167    -1.10   0.275    -255.5358    73.79643
      weight |   5.330082   1.259779     4.23   0.000     2.816892    7.843272
      length |  -112.6501   39.26864    -2.87   0.005    -190.9889   -34.31134
  gear_ratio |   1747.338   940.8806     1.86   0.068    -129.6674    3624.343
       _cons |   7909.196   6803.245     1.16   0.249    -5662.907     21481.3
------------------------------------------------------------------------------

$Weight$ and $Length$ are significantly associated with price at the 5% level, whereas $GearRatio$ is significant only at the 10% level and $MPG$ is not significant (p=0.275). In this example, I will use 10% as the significance level, as is often done in econometrics, instead of the 5% customary in statistics/biostatistics.

Now suppose you realize that the country of origin of the car might have something to do with the price, so you enter “Country of origin” ($Country$), a variable with 4 categories (1. USA, 2. Japan, 3. Germany, and 4. France/Italy), into your model as a set of dummy variables with “USA” as the reference/omitted category. The resulting model is as follows:

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  7,    66) =    7.05
       Model |   271664993     7  38809284.6           Prob > F      =  0.0000
    Residual |   363400404    66  5506066.72           R-squared     =  0.4278
-------------+------------------------------           Adj R-squared =  0.3671
       Total |   635065396    73  8699525.97           Root MSE      =  2346.5

---------------------------------------------------------------------------------
          price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
            mpg |  -43.63664   88.87729    -0.49   0.625    -221.0859    133.8126
         weight |   5.627906   1.277128     4.41   0.000     3.078037    8.177775
         length |  -108.6306   40.96925    -2.65   0.010    -190.4283   -26.83285
     gear_ratio |   1036.988   1011.416     1.03   0.309     -982.369    3056.344
                |
        country |
        Germany |   1474.478   786.7092     1.87   0.065    -96.23774    3045.193
          Japan |   1508.771   931.8605     1.62   0.110    -351.7485    3369.291
   France/Italy |   1513.169   1660.423     0.91   0.365    -1801.972    4828.311
                |
          _cons |   6825.621   6936.845     0.98   0.329    -7024.236    20675.48
---------------------------------------------------------------------------------

When we added $Country$ to the model, $GearRatio$ was no longer significant at the 10% level (p rose from 0.068 to 0.309), and $MPG$ moved even further from significance (p was 0.28 in the original model and became 0.63 after adding $Country$). We also note that $Germany$ was the only category of $Country$ that was significant (p=0.065).

How do we interpret these results?

  1. Recall that a categorical variable with $N$ categories enters the model as a set of $(N-1)$ dummy variables, and that each dummy is interpreted relative to the excluded (reference) category. It is therefore normal for some dummies not to be significant: that only means the difference between that category and the reference category is not significant. In our example, German cars are on average USD 1,474.48 more expensive than American cars, holding the other regressors constant, whereas Japanese and French/Italian cars are not significantly different from American cars in terms of $Price$. The p-value on an individual dummy only tells you whether that category differs from the reference, not whether $Country$ as a whole is significantly associated with $Price$. To answer the latter, do an F-test of the joint significance of the dummies:

test Germany Japan FranceItaly

( 1)  Germany     = 0
( 2)  Japan       = 0
( 3)  FranceItaly = 0

          F(  3,    66) =    1.53
               Prob > F =    0.2148

It turns out $Country$ as a whole is not a significant predictor of price (p=0.21), although German cars are significantly more expensive than American cars in this model.
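As a cross-check, the joint F statistic can be reproduced by hand from the residual sums of squares in the two regression tables. This is only a minimal sketch of the standard nested-model F formula; the variable names are my own, and it assumes the ordinary (non-robust) variance estimator that the tables appear to use.

```python
# Residual sums of squares and degrees of freedom copied from the two tables.
rss_restricted = 388679991.0    # model without the country dummies
rss_unrestricted = 363400404.0  # model with the country dummies
q = 3                           # number of dummy coefficients tested jointly
df_resid = 66                   # residual df of the unrestricted model

# F = [(RSS_r - RSS_u) / q] / [RSS_u / df_resid]
f_stat = ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df_resid)
print(f"F({q}, {df_resid}) = {f_stat:.2f}")  # prints "F(3, 66) = 1.53"
```

The result agrees with the `test` output, which is reassuring: the dummies jointly reduce the residual sum of squares by only about 6.5%.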

  2. We also noted that a variable that was significant ($GearRatio$) became non-significant after adding $Country$: its coefficient fell from 1,747.34 to 1,036.99 while its standard error rose from 940.88 to 1,011.42. This means that in the model that omitted $Country$, the parameter estimate for $GearRatio$ “absorbed” part of the effect of $Country$. That is, $Country$ is associated with both $GearRatio$ and $Price$, and failing to control for $Country$ biased the parameter estimate of $GearRatio$, making it look more significant than it really is. Much of the “significant” effect of $GearRatio$ on $Price$ in the original model was actually reflecting the effect of $Country$ on $Price$. Once $Country$ is controlled for, there is no longer evidence at the 10% level that $GearRatio$ is associated with the $Price$ of a car.
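The “absorbing” can be written out with the textbook omitted-variable-bias formula. As a simplified sketch, assume a single regressor $x$ and a single omitted dummy $d$ (the symbols $\gamma$, $\delta_1$ and $\tilde{\beta}_1$ are introduced here for illustration only and do not correspond to anything in the Stata output):

$$y_i = \beta_0 + \beta_1 x_i + \gamma d_i + \varepsilon_i, \qquad d_i = \delta_0 + \delta_1 x_i + v_i$$

$$E[\tilde{\beta}_1] = \beta_1 + \gamma\,\delta_1$$

Here $\tilde{\beta}_1$ is the coefficient on $x$ from the regression that leaves $d$ out. The bias term $\gamma\,\delta_1$ is non-zero only if $d$ affects the outcome ($\gamma \neq 0$) and is correlated with $x$ ($\delta_1 \neq 0$). If it has the same sign as $\beta_1$, the short regression overstates the effect of $x$; if it has the opposite sign, it can cancel a real effect.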

Of course, the reverse can be true too: something that was not significant can become significant after you add variables to the model, which is what happened in your case. The logic behind it is the same. The originally non-significant variable was associated with the omitted variable, so its coefficient reflected part of the effect of the omitted variable in addition to its own effect (plus some other unobservables, which we will ignore for the sake of argument). When you add the omitted variable (the dummies) to the model, that coefficient no longer captures the partial effect of the omitted variable and instead reflects the variable's own effect, which, it turns out, is significantly associated with the outcome. Adding relevant dummies can also shrink the residual variance, and with it the standard errors, which by itself can push a borderline coefficient over the threshold.
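A small simulation may make this reverse case concrete. The sketch below is not based on the automobile data: the data-generating process, the coefficients (1 for `x`, -5 for the dummy `d`) and the helper `ols` are all made up for illustration, and it uses classical rather than cluster-robust standard errors. The dummy's effect is chosen so that it roughly cancels the effect of `x` when `d` is left out.

```python
import random

def ols(columns, y):
    """OLS with an intercept; returns (coefficients, classical t-statistics)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    # Gauss-Jordan elimination on [X'X | I] gives (X'X)^{-1}
    aug = [xtx[a] + [1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    t = [beta[a] / (sigma2 * inv[a][a]) ** 0.5 for a in range(k)]
    return beta, t

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 500
d = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]      # omitted dummy
x = [d[i] + rng.gauss(0, 1) for i in range(n)]                  # regressor correlated with d
y = [1.0 * x[i] - 5.0 * d[i] + rng.gauss(0, 1) for i in range(n)]

beta_short, t_short = ols([x], y)     # dummy omitted
beta_long, t_long = ols([x, d], y)    # dummy included
print(f"without dummy: b_x = {beta_short[1]:.3f}, t = {t_short[1]:.2f}")
print(f"with dummy:    b_x = {beta_long[1]:.3f}, t = {t_long[1]:.2f}")
```

Under these assumptions the coefficient on `x` should sit near zero with a small t-statistic when `d` is omitted, and near its true value of 1 with a large t-statistic once `d` is included.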

(Data: Stata built-in dataset “1978 Automobile Data” from http://www.stata-press.com/data/r13/auto.dta)

Attribution
Source: Link, Question Author: Luca Dibo, Answer Author: Marquis de Carabas
