Either a quadratic or an interaction term is significant in isolation, but neither is when both are included

As part of an assignment, I had to fit a model with two predictor variables. I then had to draw a plot of the model's residuals against one of the included predictors and make changes to the model based on that plot. The plot showed a curvilinear trend, so I included a quadratic term for that predictor. The new model showed the quadratic term to be significant. All good so far.

However, the data suggest that an interaction makes sense, too. Adding an interaction term to the original model (without the quadratic term) also 'fixed' the curvilinear trend, and the interaction was significant. The problem is that when both the quadratic and the interaction terms are added to the model, one of them is not significant.

Which term (the quadratic or the interaction) should I include in the model and why?

Answer

Synopsis

When the predictors are correlated, a quadratic term and an interaction term carry similar information. Either one alone can therefore appear significant, but when both are included, neither may be, because the two terms are nearly confounded. Standard diagnostics for multicollinearity, such as VIFs, may fail to detect any of this. Even a diagnostic plot specifically designed to detect the effect of using a quadratic term in place of an interaction may fail to determine which model is better.


Analysis

The thrust of this analysis, and its main strength, is to characterize situations like that described in the question. With such a characterization available it’s then an easy task to simulate data that behave accordingly.

Consider two predictors $X_1$ and $X_2$ (which we standardize so that each has unit variance in the dataset) and suppose the random response $Y$ is determined by these predictors and their interaction plus independent random error:

$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_{1,2} X_1 X_2 + \varepsilon.$$

In many cases predictors are correlated. The dataset might look like this:

[Figure: scatterplot matrix of the simulated data]

These sample data were generated with $\beta_1=\beta_2=1$ and $\beta_{1,2}=0.1$. The correlation between $X_1$ and $X_2$ is $0.85$.

This doesn’t necessarily mean we are thinking of X1 and X2 as realizations of random variables: it can include the situation where both X1 and X2 are settings in a designed experiment, but for some reason these settings are not orthogonal.
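For concreteness, here is a minimal R sketch of how data with these properties might be simulated. The particular mechanism (a shared component plus small individual deviations), the seed, and the noise levels are illustrative choices of mine, not necessarily those behind the figures shown here.

set.seed(17)                 # arbitrary seed, for reproducibility
n <- 150                     # sample size used in the Example below
# Correlated predictors: a common component plus small individual deviations,
# then standardized to unit variance
x0 <- rnorm(n)
x1 <- scale(x0 + rnorm(n, sd = 0.4))[, 1]
x2 <- scale(x0 + rnorm(n, sd = 0.4))[, 1]
cor(x1, x2)                  # roughly 0.85 with these noise levels
# Response: main effects, a small interaction, and independent error
y  <- 1*x1 + 1*x2 + 0.1*x1*x2 + rnorm(n, sd = 0.25)
df <- data.frame(x1, x2, y)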

Regardless of how the correlation arises, one good way to describe it is in terms of how much the predictors differ from their average, $X_0=(X_1+X_2)/2$. These differences will be fairly small (in the sense that their variance is less than $1$); the greater the correlation between $X_1$ and $X_2$, the smaller these differences will be. Writing, then, $X_1=X_0+\delta_1$ and $X_2=X_0+\delta_2$, we can re-express (say) $X_2$ in terms of $X_1$ as $X_2=X_1+(\delta_2-\delta_1)$. Plugging this into the interaction term only, the model is

$$Y=\beta_1 X_1+\beta_2 X_2+\beta_{1,2}X_1\left(X_1+[\delta_2-\delta_1]\right)+\varepsilon
=\left(\beta_1+\beta_{1,2}[\delta_2-\delta_1]\right)X_1+\beta_2 X_2+\beta_{1,2}X_1^2+\varepsilon.$$

Provided the values of $\beta_{1,2}[\delta_2-\delta_1]$ vary only a little bit compared to $\beta_1$, we can gather this variation with the true random terms, writing

$$Y=\beta_1 X_1+\beta_2 X_2+\beta_{1,2}X_1^2+\left(\varepsilon+\beta_{1,2}[\delta_2-\delta_1]X_1\right).$$

Thus, if we regress $Y$ against $X_1$, $X_2$, and $X_1^2$, we will be making an error: the variation in the residuals will depend on $X_1$ (that is, it will be heteroscedastic). This can be seen with a simple variance calculation:

$$\operatorname{var}\left(\varepsilon+\beta_{1,2}[\delta_2-\delta_1]X_1\right)=\operatorname{var}(\varepsilon)+\left[\beta_{1,2}^2\operatorname{var}(\delta_2-\delta_1)\right]X_1^2.$$

However, if the typical variation in $\varepsilon$ substantially exceeds the typical variation in $\beta_{1,2}[\delta_2-\delta_1]X_1$, that heteroscedasticity will be so low as to be undetectable (and should yield a fine model). (As shown below, one way to look for this violation of the regression assumptions is to plot the absolute values of the residuals against the absolute values of $X_1$, remembering first to standardize $X_1$ if necessary.) This is the characterization we were seeking.
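As a sketch of that check in R (assuming the simulated data frame df from above and a quadratic fit fit.quad like the one in the Example section below):

r <- abs(residuals(fit.quad))                       # fit.quad: the quadratic fit below
plot(abs(df$x1), r, xlab = "|x1|", ylab = "|residual|")
lines(lowess(abs(df$x1), r), col = "red")           # lowess smooth to reveal any trend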

Remembering that $X_1$ and $X_2$ were assumed to be standardized to unit variance, this implies the variance of $\delta_2-\delta_1$ will be relatively small. To reproduce the observed behavior, then, it should suffice to pick a small absolute value for $\beta_{1,2}$, but make it large enough (or use a large enough dataset) for it to be significant.

In short, when the predictors are correlated and the interaction is small but not too small, a quadratic term (in either predictor alone) and an interaction term will be individually significant but confounded with each other. Statistical methods alone are unlikely to help us decide which is better to use.


Example

Let’s check this out with the sample data by fitting several models. Recall that $\beta_{1,2}$ was set to $0.1$ when simulating these data. Although that is small (the quadratic behavior is not even visible in the previous scatterplot matrix), with $150$ data points we have a chance of detecting it.

First, the quadratic model:
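In R this fit could look like the following (using the hypothetical data frame df from the simulation sketch above; the object name fit.quad is mine):

fit.quad <- lm(y ~ x1 + x2 + I(x1^2), data = df)   # quadratic term in x1, no interaction
summary(fit.quad)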

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.03363    0.03046   1.104  0.27130    
x1           0.92188    0.04081  22.592  < 2e-16 ***
x2           1.05208    0.04085  25.756  < 2e-16 ***
I(x1^2)      0.06776    0.02157   3.141  0.00204 ** 

Residual standard error: 0.2651 on 146 degrees of freedom
Multiple R-squared: 0.9812, Adjusted R-squared: 0.9808 

The quadratic term is significant. Its coefficient, $0.068$, underestimates $\beta_{1,2}=0.1$, but it is of the right size and the right sign. As a check for multicollinearity (correlation among the predictors) we compute the variance inflation factors (VIF):
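One way to obtain them, assuming the car package is available, is:

library(car)          # provides vif()
vif(fit.quad)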

      x1       x2  I(x1^2) 
3.531167 3.538512 1.009199 

Any value less than 5 is usually considered just fine. These are not alarming.

Next, the model with an interaction but no quadratic term:
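A corresponding sketch of this fit, under the same assumptions as before:

fit.ix <- lm(y ~ x1 + x2 + x1:x2, data = df)   # interaction, no quadratic term
summary(fit.ix)
vif(fit.ix)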

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02887    0.02975    0.97 0.333420    
x1           0.93157    0.04036   23.08  < 2e-16 ***
x2           1.04580    0.04039   25.89  < 2e-16 ***
x1:x2        0.08581    0.02451    3.50 0.000617 ***

Residual standard error: 0.2631 on 146 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9811

      x1       x2    x1:x2 
3.506569 3.512599 1.004566 

All the results are similar to the previous ones. The two models are about equally good, with a very slight advantage to the interaction model.

Finally, let’s include both the interaction and quadratic terms:
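Again as a sketch:

fit.both <- lm(y ~ x1 + x2 + I(x1^2) + x1:x2, data = df)   # both terms at once
summary(fit.both)
vif(fit.both)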

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02572    0.03074   0.837    0.404    
x1           0.92911    0.04088  22.729   <2e-16 ***
x2           1.04771    0.04075  25.710   <2e-16 ***
I(x1^2)      0.01677    0.03926   0.427    0.670    
x1:x2        0.06973    0.04495   1.551    0.123    

Residual standard error: 0.2638 on 145 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.981 

      x1       x2  I(x1^2)    x1:x2 
3.577700 3.555465 3.374533 3.359040

Now neither the quadratic term nor the interaction term is significant, because each is trying to estimate a part of the interaction in the model. Another way to see this is that nothing was gained (in terms of reducing the residual standard error) by adding the quadratic term to the interaction model or by adding the interaction term to the quadratic model. It is noteworthy that the VIFs do not detect this situation: although the fundamental explanation for what we have seen is the slight collinearity between $X_1$ and $X_2$, which induces a collinearity between $X_1^2$ and $X_1 X_2$, neither is large enough to raise flags.
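One way to make that comparison explicit is with nested-model F tests on the hypothetical fit objects above; neither added term should show a significant improvement:

anova(fit.ix,   fit.both)    # does the quadratic term add anything to the interaction model?
anova(fit.quad, fit.both)    # does the interaction term add anything to the quadratic model?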

If we had tried to detect the heteroscedasticity in the quadratic model (the first one), we would be disappointed:

[Figure: absolute residuals of the quadratic model plotted against $|X_1|$, with a loess smooth]

In the loess smooth of this scatterplot there is ever so faint a hint that the sizes of the residuals increase with $|X_1|$, but nobody would take this hint seriously.

Attribution
Source: Link, Question Author: Tal Bashan, Answer Author: whuber
