Methods for handling missing data in machine learning

Virtually any dataset we want to use for prediction with machine learning algorithms will have missing values for some of the features.

There are several approaches to this problem, ranging from excluding the rows that have missing values to filling them in with the mean value of each feature.

I would like to use a somewhat more robust approach: basically, run a regression (or another method) in which the dependent variable (Y) is each of the columns with missing values, fitted only on the rows of the table that contain complete data; predict the missing values with this model, fill in the table, move on to the next column with missing values, and repeat the method until everything is filled.
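The procedure described above can be sketched in a few lines of Python. This is an illustrative implementation under my own assumptions (the function name `regression_impute` is hypothetical, linear regression stands in for "a regression or another method", and rows whose predictors are themselves missing are simply skipped):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regression_impute(X):
    """Fill NaNs column by column: regress each incomplete column on the
    others using fully observed rows, starting with the column that has
    the fewest missing values."""
    X = X.copy()
    n_missing = np.isnan(X).sum(axis=0)
    for j in np.argsort(n_missing):          # fewest-missing column first
        if n_missing[j] == 0:
            continue
        miss = np.isnan(X[:, j])
        others = [k for k in range(X.shape[1]) if k != j]
        # fit only on rows where target and all predictors are observed
        obs = ~miss & ~np.isnan(X[:, others]).any(axis=1)
        model = LinearRegression().fit(X[np.ix_(obs, others)], X[obs, j])
        # predict only where the predictors are fully observed
        pred = miss & ~np.isnan(X[:, others]).any(axis=1)
        X[np.ix_(pred, [j])] = model.predict(
            X[np.ix_(pred, others)]).reshape(-1, 1)
    return X
```

Note that this single pass imputes each column only once; as the answer below discusses, the established version of this idea re-visits the columns iteratively.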

But this approach raises some doubts for me.

Which column should I start with? I believe it should be the one with the fewest missing values, proceeding to the one with the most.

Is there a threshold of missing values beyond which it is not worth trying to fill in a column? (For example, if a feature has only 10% of its values filled, would it not be better to exclude it?)

Is there any implementation of this in the standard packages, or other methods that are robust to missing data?


The technique you describe is called imputation by sequential regressions, or multiple imputation by chained equations. The technique was pioneered by Raghunathan (2001) and implemented in a well-maintained R package called mice (van Buuren, 2012).

A paper by Schafer and Graham (2002) explains well why mean imputation and listwise deletion (what you call excluding rows) are usually poor alternatives to the techniques mentioned above. Principally, mean imputation is not conditional and thus can bias the imputed distribution towards the observed mean; it also shrinks the variance, among other undesirable effects on the imputed distribution. Listwise deletion, furthermore, is valid only if the data are missing completely at random, as if by the flip of a coin, and it increases the sampling error because the sample size is reduced.

The authors cited above usually recommend starting with the variable that has the fewest missing values. Also, the technique is usually applied in a Bayesian way (i.e., an extension of your suggestion), and variables are visited more than once in the imputation procedure, not only once. In particular, each variable is completed by draws from its conditional posterior predictive distribution, starting with the variable featuring the fewest missing values. Once all variables in the data set have been completed, the algorithm starts again at the first variable and re-iterates until convergence. The authors have shown that this algorithm is a Gibbs sampler, so it usually converges to the correct multivariate distribution of the variables.

Usually, because some untestable assumptions are involved, in particular that the data are missing at random (i.e., whether a value is observed depends only on the observed data, not on the unobserved values). Also, the procedures can be partially incompatible, which is why they have been called PIGS (partially incompatible Gibbs sampler).

In practice, Bayesian multiple imputation is still a good way to deal with multivariate non-monotone missing-data problems. Non-parametric extensions such as predictive mean matching also help to relax the regression modelling assumptions.
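The idea behind predictive mean matching can be sketched for a single column: instead of filling in the regression prediction itself, each missing entry borrows the actually observed value whose prediction is closest. This is an illustrative simplification under my own assumptions (the function name `pmm_impute_column` is hypothetical, and real PMM implementations typically draw at random from the k nearest donors rather than always taking the single nearest):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute_column(x_pred, y):
    """Single-column predictive mean matching sketch.
    x_pred: fully observed predictors, shape (n, p); y: target with NaNs."""
    y = y.copy()
    miss = np.isnan(y)
    model = LinearRegression().fit(x_pred[~miss], y[~miss])
    yhat_obs = model.predict(x_pred[~miss])  # predictions for donors
    yhat_mis = model.predict(x_pred[miss])   # predictions for recipients
    donors = y[~miss]
    # each missing entry takes the donor whose prediction is nearest
    idx = np.abs(yhat_mis[:, None] - yhat_obs[None, :]).argmin(axis=1)
    y[miss] = donors[idx]
    return y
```

Because imputed values are always real observed values, PMM cannot produce impossible entries (e.g. negative counts) even when the regression model is misspecified.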

Raghunathan, T. E., Lepkowski, J., van Hoewyk, J., & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27(1), 85–95.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.

van Buuren, S. (2012). Flexible Imputation of Missing Data. Boca Raton: CRC Press.

Source: Link, Question Author: sn3fru, Answer Author: tomka
