Is the method of mean substitution for replacing missing data out of date? Are there more sophisticated models that should be used? If so, what are they?

**Answer**

Setting aside the cases where there is no need to shoot mosquitoes with a cannon (if you have one missing value in a million data points, just drop it), using the mean is suboptimal, to say the least: the results can be biased, and at the very least you should correct them for the extra uncertainty that the missing data introduce.
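
To make the point about uncertainty concrete, here is a minimal sketch in Python (NumPy) on simulated data. The distribution, sample size, missingness rate, and all variable names are assumptions chosen purely for illustration. It fills the gaps with the observed mean and compares the spread before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 draws from N(50, 10^2), about 30% removed completely at random.
x = rng.normal(loc=50.0, scale=10.0, size=1000)
missing = rng.random(x.size) < 0.3
observed = x[~missing]

# Mean substitution: every missing entry gets the mean of the observed values.
filled = x.copy()
filled[missing] = observed.mean()

sd_observed = observed.std(ddof=1)
sd_filled = filled.std(ddof=1)

# Naive standard error of the mean, treating the filled-in values as if they were real data.
se_observed = sd_observed / np.sqrt(observed.size)
se_filled = sd_filled / np.sqrt(filled.size)

print(f"SD, observed values only: {sd_observed:.2f}; after mean fill: {sd_filled:.2f}")
print(f"SE, observed values only: {se_observed:.3f}; after mean fill: {se_filled:.3f}")
```

The mean is unchanged by construction, but the standard deviation shrinks, and the naive standard error shrinks further because the sample looks larger than it really is. When values are not missing completely at random, the mean itself can be off as well.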

There are other options, but the easiest one to explain is multiple imputation. The concept is simple: based on a model for the data themselves (e.g. one obtained from the complete cases, though other options such as MICE are available), draw values from the associated distribution to ‘complete’ your dataset. The completed dataset no longer has any missing data, so you can run your analysis of interest on it.

If you did this only once (in fact, replacing the missing values with the mean is a very contorted form of this), it would be called single imputation, and there is no reason why it would perform better than mean replacement.

The trick, however, is to do this repeatedly (hence multiple imputation) and to run your analysis on each completed (i.e. imputed) dataset. The result is typically a set of parameter estimates, or similar, for each completed dataset. Under relatively loose conditions, it is fine to average the parameter estimates over all the imputed datasets.
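
The procedure described above can be sketched in a few lines of Python (NumPy). Everything specific here is an assumption made for illustration: the simulated data, the linear imputation model fitted on the complete cases, the bootstrap step used to let the model parameters vary between imputations, the choice of 20 imputations, and the hypothetical helper `impute_once`. It is a toy version of the idea, not a substitute for a dedicated package.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y depends linearly on x, and y is missing more often when x is large.
n = 500
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-x))
y_obs = np.where(missing, np.nan, y)


def impute_once(x, y_obs, rng):
    """Return one completed copy of y_obs, using a linear model fitted on complete cases."""
    complete = np.flatnonzero(~np.isnan(y_obs))
    # Resample the complete cases so the fitted parameters differ between imputations.
    boot = rng.choice(complete, size=complete.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y_obs[boot], deg=1)
    resid_sd = np.std(y_obs[boot] - (intercept + slope * x[boot]), ddof=2)
    completed = y_obs.copy()
    mis = np.isnan(y_obs)
    # Draw from the predictive distribution instead of plugging in the fitted value.
    noise = rng.normal(scale=resid_sd, size=int(mis.sum()))
    completed[mis] = intercept + slope * x[mis] + noise
    return completed


m = 20  # number of imputed datasets (an arbitrary choice for this sketch)

# The "analysis of interest" here is simply the mean of y, run once per completed dataset.
estimates = np.array([impute_once(x, y_obs, rng).mean() for _ in range(m)])
pooled_estimate = estimates.mean()

print("complete-case mean:", np.nanmean(y_obs))
print("pooled MI estimate:", pooled_estimate)
print("full-data mean:    ", y.mean())
```

Because the missingness in this simulation depends on `x`, the complete-case mean is pulled downwards, while the pooled estimate uses the relationship between `x` and `y` to correct for that.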

A further advantage is that there is also a simple formula for adjusting the standard error to reflect the uncertainty caused by the missing data.
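
The formula in question is presumably the set of combining rules usually attributed to Rubin. The notation below is my own choice, not the answer's: $m$ is the number of imputed datasets, $\hat{Q}_i$ is the estimate from the $i$-th completed dataset, and $U_i$ is its estimated variance (squared standard error).

$$
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i - \bar{Q}\bigr)^2,
$$

$$
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr)B, \qquad
\operatorname{SE}(\bar{Q}) = \sqrt{T}.
$$

Here $\bar{U}$ is the average within-imputation variance and $B$ is the between-imputation variance. The second term in $T$ is what accounts for the missing data: the more the estimates disagree across imputed datasets, the larger the pooled standard error.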

If you want to know more, you probably want to read Little and Rubin’s ‘Statistical Analysis with Missing Data’. It also covers other methods (EM, …) and explains in more detail how, why, and when they work.

**Attribution** *Source: Link, Question Author: Melissa Duncombe, Answer Author: mpiktas*