Currently I am working on a large data set with well over 200 variables (238 to be exact) and, in theory, 290 observations for each variable. This data set is missing quite a lot of values, with variables ranging from 0–100% ‘missingness’. I will eventually be performing logistic regression on this data, so of my 238 columns I will be using at most ten or so.

However, as almost all of my columns are missing some data, I am turning to multiple imputation to fill in the blanks (using the MICE package).

My question is: given that I have a large amount of variation in the missing data, at what percentage missing should I start to exclude variables from the mice() function?

Can mice function well with variables that are missing 50% of their values? What about 60%, 70%, 80%, 90%?
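For context, this is roughly how I am checking per-column missingness before deciding on a cutoff. A minimal sketch in Python/pandas on toy data (the column names, sizes, and the 80% cutoff here are hypothetical, not from my real data set):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real 290 x 238 data set (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(290, 4)), columns=["a", "b", "c", "d"])

# Punch holes: roughly 10%, 50%, and 90% missing in three of the columns.
for col, frac in [("b", 0.10), ("c", 0.50), ("d", 0.90)]:
    df.loc[rng.random(290) < frac, col] = np.nan

missingness = df.isna().mean()           # fraction missing per column
usable = missingness[missingness < 0.8]  # e.g. drop columns over an 80% cutoff
print(missingness.round(2))
print(list(usable.index))
```

The question, then, is whether a hard cutoff like the 0.8 above is even the right way to think about it.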

**Answer**

In principle, MICE can handle large amounts of missing data. Variables with many missing values will end up with larger standard errors than those with few, so your ability to detect significant relations involving those variables will be limited accordingly. Capturing that extra uncertainty is precisely the advantage of creating multiple imputations and pooling the analyses across all of them.
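To make the pooling idea concrete: mice itself is an R package, but scikit-learn's `IterativeImputer` with `sample_posterior=True`, run with several random seeds, gives a comparable multiple-imputation sketch in Python. The data, the 40% missingness rate, and m = 5 imputations below are all illustrative assumptions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 290
# Hypothetical data: two correlated predictors and a binary outcome.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = (x1 + x2 + rng.normal(size=n) > 0).astype(int)
X = np.column_stack([x1, x2])
X[rng.random(n) < 0.4, 1] = np.nan  # ~40% of x2 missing at random

coefs = []
for seed in range(5):  # m = 5 imputed data sets
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    X_imp = imp.fit_transform(X)
    coefs.append(LogisticRegression().fit(X_imp, y).coef_[0])

coefs = np.array(coefs)
# Rubin's rules pool the per-imputation estimates; the spread across
# imputations is the between-imputation component of the uncertainty.
print("pooled coefficients:", coefs.mean(axis=0))
print("between-imputation sd:", coefs.std(axis=0, ddof=1))
```

The between-imputation spread grows with the amount of missing data, which is exactly how the larger uncertainty for heavily missing variables shows up in the pooled results.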

More important than a “cutoff” for missing data is to consider carefully (1) the intended use of your model and (2) whether the “missing-at-random” assumption needed for multiple imputation holds in your case.

In terms of (1): if, say, you intend to use the model for prediction but some variables are inherently hard to obtain, then there is no sense including them in the model. You should also use your knowledge of the subject matter when choosing variables for inclusion; if you suspect on that basis that only 10 or so will be important, maybe you should just use those 10.

In terms of (2): if the probability that a value is missing depends on the actual (unobserved) value of the variable itself, the data are “missing not at random,” and standard multiple imputation is inappropriate.
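A tiny simulation illustrates why that distinction matters. The distribution and missingness mechanisms below are invented for illustration: when values go missing independently of their size (MCAR), the observed data remain representative; when large values are preferentially missing (MNAR), the observed data are biased, and no imputation model built from them alone can recover the truth:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=100_000)  # the "complete" data

# MCAR: each value missing with the same probability, regardless of x.
mcar_observed = x[rng.random(x.size) >= 0.5]

# MNAR: whether a value is missing depends on the value itself;
# here the largest half of the values are the ones that go missing.
mnar_observed = x[x < np.quantile(x, 0.5)]

print("true mean:", x.mean())
print("MCAR observed mean:", mcar_observed.mean())  # close to the true mean
print("MNAR observed mean:", mnar_observed.mean())  # systematically too low
```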

**Attribution**
*Source: Link, Question Author: purplesocks, Answer Author: EdM*