I am currently running a multiple regression model using imputed data and have a few questions.
I am using SPSS 18. My data appear to be MAR. Listwise deletion leaves me with only 92 cases; multiple imputation leaves 153 cases for analysis. All assumptions are met, with one variable log transformed. I have 9 IVs: 5 categorical, 3 scale, 1 interval. The DV is scale. I am using the enter method of standard multiple regression.
- My DV is the difference between a pre-score and a post-score measure, and both of these variables are missing a number of cases. Should I impute missing values for each of them and then compute the difference to obtain my DV (and how do I go about doing this), or can I just impute data for my DV? Which is the more appropriate approach?
- Should I run imputations on transformed data or skewed untransformed data?
- Should I enter all variables into the imputation process, even if they are not missing data, or should I just impute data for the variables missing more than 10% of cases?
I have run the regression on the listwise-deleted cases, and my IVs account for very little of the variance in my DV. I then ran the regression on a complete file following multiple imputation. The results are very similar, in that my 9 IVs still predict only approx. 12% of the variance in my DV; however, one of my IVs now makes a significant contribution (this happens to be a log-transformed variable).
- Should I report the original data if there is little difference between my conclusions (i.e., my IVs poorly predict the DV), or report the complete data?
- Whether you should impute both the pre- and post-scores, or the difference score, depends on how you analyze the pre-post difference. You should be aware that there are legitimate limitations to analyses of difference scores (see Edwards, 1994, for a nice review), and a regression approach in which you analyze the residual for post-scores after controlling for pre-scores might be better. In that case, you would want to impute pre- and post-scores, since those are the variables that will be in your analytic model. However, if you’re intent on analyzing difference scores, impute the difference scores, since it’s unlikely you will want to manually compute difference scores across all your imputed data sets. In other words, whatever variables you use in your actual analytic model are the variables you should use in your imputation model.
- Again, I would impute with the transformed variable, since that is what is used in your analytic model.
- Adding variables to the imputation model will increase the computational demands of the imputation process, BUT, if you have the time, more information is generally better. Variables with complete data could potentially be very useful auxiliary variables for explaining MAR missingness. If using all your variables makes the imputation model too demanding in time or computation (i.e., if you have a big data set), create a dummy variable for each incomplete variable that flags whether each case is missing on it, and see which complete variables predict those missingness indicators in logistic models. Then include those particular complete variables in your imputation model.
- I wouldn’t report the original (i.e., list-wise deleted) analyses. If your missingness mechanism is MAR, then MI is not only going to give you increased power, but it will also give you more accurate estimates (Enders, 2010). Thus, the significant effect with MI might be non-significant with list-wise deletion because that analysis is underpowered, biased, or both.
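The auxiliary-variable screening described above (a missingness dummy per incomplete variable, then a logistic model of that dummy on each complete variable) can be sketched roughly as follows. This is only an illustration in plain Python rather than SPSS syntax. The data are synthetic, the names `aux_a`, `aux_b`, and `post` are made up for the example, and the small Newton-Raphson fitter stands in for whatever logistic regression procedure your software provides.

```python
import math
import random


def missing_indicator(values):
    """Return 1 where a value is missing (None) and 0 where it is observed."""
    return [1 if v is None else 0 for v in values]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson.

    Returns (b0, b1, se_b1). The standard error comes from the information
    matrix evaluated in the final iteration.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + inverse(information) * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Synthetic data (assumption): two complete variables and one incomplete one.
rng = random.Random(0)
n = 300
aux_a = [rng.gauss(0, 1) for _ in range(n)]  # drives missingness on `post`
aux_b = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to missingness

post = []
for a in aux_a:
    p_miss = sigmoid(-1.0 + 1.5 * a)  # MAR: missingness depends on observed aux_a
    post.append(None if rng.random() < p_miss else rng.gauss(0, 1))

r_post = missing_indicator(post)

# Screen each complete variable as a predictor of the missingness dummy.
results = {}
for name, var in [("aux_a", aux_a), ("aux_b", aux_b)]:
    _, slope, se = fit_logistic(var, r_post)
    results[name] = (slope, se)
    print(f"{name}: slope = {slope:.2f}, SE = {se:.2f}, z = {slope / se:.2f}")
```

Complete variables whose slope is clearly different from zero (a large absolute z) are the ones worth carrying into the imputation model. With several incomplete variables, the same loop would be repeated for each missingness dummy.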
Edwards, J. R. (1994). Regression analysis as an alternative to difference scores. Journal of Management, 20, 683-689.
Enders, C. K. (2010). Applied Missing Data Analysis. New York, NY: Guilford Press.