How do the number of imputations & the maximum iterations affect accuracy in multiple imputation?

The help page for MICE defines the function as:

mice(data, m = 5, method = vector("character", length = ncol(data)),
  predictorMatrix = (1 - diag(1, ncol(data))),
  visitSequence = (1:ncol(data))[apply(is.na(data), 2, any)],
  form = vector("character", length = ncol(data)),
  post = vector("character", length = ncol(data)), defaultMethod = c("pmm",
  "logreg", "polyreg", "polr"), maxit = 5, diagnostics = TRUE,
  printFlag = TRUE, seed = NA, imputationMethod = NULL,
  defaultImputationMethod = NULL, data.init = NULL, ...)

That is a lot of parameters. How does one decide which to specify and which to leave at their defaults?

I’m especially interested in the number of imputations, m, and the maximum number of iterations, maxit. How do these parameters affect accuracy?

In other words, when (how?) – whilst using these parameters – can I really say that a sort of convergence has been reached?


Let’s just go through the parameters one by one:

  • data doesn’t require explanation
  • m is the number of imputations. Generally speaking, the more the better. Originally (following Rubin, 1987), 5 was considered to be enough, hence the default. That recommendation was based on an efficiency argument only, so 5 may be sufficient if accurate point estimates are all that matter. To obtain better estimates of standard errors, more imputations are needed. A current rule of thumb is to set m to the average percentage of missing data, so if 30% of the data are missing on average, use 30 imputations; see Bodner (2008) and White et al. (2011) for further details.
  • method specifies which imputation method is to be used for each variable. This is only necessary when the default method is to be overridden. For example, continuous data are imputed by predictive mean matching by default, which usually works very well, but Bayesian linear regression and several other methods, including a multilevel model for nested/clustered data, may be specified instead. Hence, expert/clinical/statistical knowledge may be of use in specifying alternatives to the default method(s).
  • predictorMatrix is a matrix which tells the algorithm which variables are used as predictors when imputing each of the other variables. As the signature above shows, the default uses every other variable as a predictor; the helper quickpred() can instead build a sparser matrix based on correlations between variables and the proportion of usable cases. Expert/clinical knowledge may be very useful in specifying the predictor matrix, so the default should be used with care.
  • visitSequence specifies the order in which variables are imputed. It is not usually needed.
  • form is used primarily to aid the specification of interaction terms to be used in imputation, and isn’t normally needed.
  • post is for post-imputation processing, for example to ensure that positive values are imputed. This isn’t normally needed.
  • defaultMethod changes the default imputation methods, and is not normally needed.
  • maxit is the number of iterations for each imputation. mice uses an iterative algorithm, and it is important that the imputations for all variables reach convergence; otherwise they will be inaccurate. Convergence can be assessed visually by inspecting the trace plots generated by plot(). Compared with other Gibbs sampling methods, far fewer iterations are needed: as a rule of thumb, 20 to 30 or fewer. When the trace lines reach a value and fluctuate slightly around it, with no remaining trend, convergence has been achieved. The following is an example showing healthy convergence, taken from here:

[Image: trace plots showing healthy convergence]

Here, 3 variables are being imputed with 5 imputations (coloured lines) for 20 iterations (the x-axis of each plot); the y-axis of each plot shows the imputed values for each imputation.
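As a rough illustration of how such trace plots can be produced, here is a minimal sketch. It assumes the mice package is installed and uses nhanes, the small example dataset shipped with it, rather than any data from the question; the seed value and the number of extra iterations are arbitrary choices made for illustration.

```r
library(mice)

# nhanes is a small example dataset bundled with mice; it contains missing values.
# m = 5 imputations, each run for maxit = 20 iterations; seed fixes the random draws.
imp <- mice(nhanes, m = 5, maxit = 20, seed = 123, printFlag = FALSE)

# Trace plots: one panel per imputed variable, one coloured line per imputation,
# with the iteration number on the x-axis.
plot(imp)

# If the lines still show a trend, the same run can be continued
# for further iterations instead of starting again from scratch.
imp_more <- mice.mids(imp, maxit = 10, printFlag = FALSE)
plot(imp_more)
```

If the lines in the second set of plots look much like those at the end of the first, the original maxit was probably sufficient.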

  • diagnostics produces useful diagnostic information by default.

  • printFlag outputs the algorithm progress by default which is useful because the estimated time to completion can easily be ascertained.

  • seed is a random seed parameter which is useful for reproducibility.

  • imputationMethod and defaultImputationMethod are for backwards compatibility only.
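The efficiency argument behind the m = 5 default can be made concrete with Rubin's (1987) relative efficiency formula. The symbol γ (the fraction of missing information) is notation introduced here for illustration; it is not the same thing as the raw percentage of missing values, although the two are often of similar size.

```latex
% Efficiency (on the variance scale) of an estimate based on m imputations,
% relative to one based on infinitely many imputations:
\[
  \mathrm{RE}(m) = \left(1 + \frac{\gamma}{m}\right)^{-1}
\]
% Worked example with \gamma = 0.3:
\[
  \mathrm{RE}(5) = \frac{1}{1 + 0.3/5} \approx 0.943,
  \qquad
  \mathrm{RE}(30) = \frac{1}{1 + 0.3/30} \approx 0.990
\]
```

So with m = 5 the point estimate is already about 94% efficient, which is why 5 was long considered enough. The case for larger m rests on standard errors and p-values, which this formula does not address.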

Bodner, Todd E. (2008) “What improves with increased missing data imputations?” Structural Equation Modeling: A Multidisciplinary Journal 15: 651-675.

Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

White, Ian R., Patrick Royston and Angela M. Wood (2011) “Multiple imputation using chained equations: Issues and guidance for practice.” Statistics in Medicine 30: 377-399.

Source : Link , Question Author : 119631 , Answer Author : Robert Long
