The help page for MICE defines the function as:
mice(data, m = 5, method = vector("character", length = ncol(data)), predictorMatrix = (1 - diag(1, ncol(data))), visitSequence = (1:ncol(data))[apply(is.na(data), 2, any)], form = vector("character", length = ncol(data)), post = vector("character", length = ncol(data)), defaultMethod = c("pmm", "logreg", "polyreg", "polr"), maxit = 5, diagnostics = TRUE, printFlag = TRUE, seed = NA, imputationMethod = NULL, defaultImputationMethod = NULL, data.init = NULL, ...)
Those are a lot of parameters. How does one decide which parameters to specify and which ones to leave as default?
I’m especially interested in the number of multiple imputations, m, and the maximum iterations, maxit. How do these parameters affect accuracy?
In other words, when (how?) – whilst using these parameters – can I really say that a sort of convergence has been reached?
Let’s just go through the parameters one by one:
data doesn’t require explanation.
m is the number of imputations; generally speaking, the more the better. Originally (following Rubin, 1987) 5 was considered to be enough (hence the default), so from an accuracy point of view 5 may be sufficient. However, this was based on an efficiency argument only. In order to achieve better estimates of standard errors, more imputations are needed. These days a common rule of thumb is to use as many imputations as the average percentage of missing data: if 30% of the data are missing on average, use 30 imputations. See Bodner (2008) and White et al. (2011) for further details.
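To make the efficiency argument concrete, here is a minimal sketch of Rubin's relative-efficiency formula, (1 + gamma/m)^-1, where gamma is the fraction of missing information. The function name and the value gamma = 0.3 are illustrative assumptions; the fraction of missing information is generally not the same as the raw proportion of missing values.

```r
# Relative efficiency of m imputations versus infinitely many (Rubin, 1987).
# `gamma` is the fraction of missing information (an illustrative input here).
relative_efficiency <- function(m, gamma) {
  1 / (1 + gamma / m)
}

m_values <- c(3, 5, 10, 20, 30)
eff <- relative_efficiency(m_values, gamma = 0.3)
print(data.frame(m = m_values, efficiency = round(eff, 3)))
```

Under this assumption, five imputations already reach roughly 94% efficiency, which is why a small m looked adequate on efficiency grounds alone; the case for larger m rests on the stability of standard errors and p-values instead.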
method specifies which imputation method is to be used; this is only necessary when the default method is to be overridden. For example, continuous data are imputed by predictive mean matching by default, and this usually works very well, but Bayesian linear regression and several others, including a multilevel model for nested/clustered data, may be specified instead. Hence, expert/clinical/statistical knowledge may be of use in specifying alternatives to the default method(s).
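As a hedged illustration of overriding a default method, the sketch below uses the nhanes example data that ships with mice. The choice of bmi and of "norm" (Bayesian linear regression) is purely illustrative, not a recommendation.

```r
library(mice)

# Dry run (maxit = 0) to obtain the default settings without imputing.
ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
meth <- ini$method
print(meth)  # the method mice would use for each column

# Override one variable: Bayesian linear regression instead of pmm for bmi.
meth["bmi"] <- "norm"
imp <- mice(nhanes, method = meth, m = 5, maxit = 5, seed = 1,
            printFlag = FALSE)
print(imp$method)
```

Starting from the dry-run vector keeps the defaults for every other column, so only the deliberate change differs from what mice would have done anyway.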
predictorMatrix is a matrix which tells the algorithm which variables are used as predictors when imputing each of the other variables. With the default shown in the signature above, every other variable is used as a predictor; mice also provides quickpred(), which builds a predictor matrix based on correlations between variables and the proportion of usable cases. Expert/clinical knowledge may be very useful in specifying the predictor matrix, so the default should be used with care.
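A minimal sketch of editing the predictor matrix by hand, again on the bundled nhanes data. Dropping hyp as a predictor of bmi and the mincor threshold of 0.3 are arbitrary choices made only to show the mechanics.

```r
library(mice)

ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix  # rows: variable being imputed; columns: predictors
pred["bmi", "hyp"] <- 0      # do not use hyp when imputing bmi (illustrative)

imp <- mice(nhanes, predictorMatrix = pred, m = 5, seed = 1,
            printFlag = FALSE)

# Data-driven alternative: select predictors by correlation with quickpred().
pred_quick <- quickpred(nhanes, mincor = 0.3)
print(pred_quick)
```

A 1 in row r, column c means that column c is used to impute row r, so setting a single cell to 0 removes exactly one predictor from one imputation model.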
visitSequence specifies the order in which variables are imputed. It is not usually needed.
form is used primarily to aid the specification of interaction terms to be used in imputation, and isn’t normally needed.
post is for post-imputation processing, for example to ensure that positive values are imputed. This isn’t normally needed.
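For the rare case where post-processing is wanted, the following sketch restricts imputed values to a range using squeeze() from mice. The bounds of 150 and 250 for chl are made up for illustration; inside the post string, imp[[j]][, i] refers to the values just imputed for the current variable and imputation.

```r
library(mice)

ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
post <- ini$post
# After each imputation of chl, clamp the imputed values to an illustrative range.
post["chl"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(150, 250))"

imp <- mice(nhanes, post = post, m = 5, seed = 1, printFlag = FALSE)
print(range(unlist(imp$imp$chl)))  # all imputed chl values lie within the bounds
```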
defaultMethod changes the default imputation methods, and is not normally needed.
maxit is the number of iterations for each imputation. mice uses an iterative algorithm. It is important that the imputations for all variables reach convergence, otherwise they will be inaccurate. This can be determined visually by inspecting the trace plots generated by plot(). Compared with other Gibbs sampling methods, far fewer iterations are needed: generally in the region of 20-30 or fewer as a rule of thumb. When the trace lines reach a value and fluctuate slightly around it, convergence has been achieved. The following is an example showing healthy convergence, taken from here:
Here, 3 variables are being imputed with 5 imputations (coloured lines) for 20 iterations (x-axis on the plots); the y-axis on the plots shows the imputed values for each imputation.
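A comparable set of trace plots can be produced with a sketch like the one below, which uses the nhanes data bundled with mice rather than the data behind the example described above. The seed and iteration counts are arbitrary. mice.mids() continues an existing run, which is handy if the lines still appear to be trending.

```r
library(mice)

# 5 imputations, 20 iterations each, on the bundled nhanes data.
imp <- mice(nhanes, m = 5, maxit = 20, seed = 123, printFlag = FALSE)
plot(imp)  # one trace line per imputation, for each imputed variable

# If the lines still show a trend, add iterations to the same object.
imp_more <- mice.mids(imp, maxit = 10, printFlag = FALSE)
plot(imp_more)
print(imp_more$iteration)  # total number of iterations run so far
```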
diagnostics produces useful diagnostic information by default.
printFlag outputs the algorithm progress by default, which is useful because the estimated time to completion can easily be ascertained.
seed is a random seed parameter, which is useful for reproducibility.
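As a quick check of the reproducibility point, this sketch runs the same imputation twice with an arbitrary seed on the bundled nhanes data and compares the first completed data set from each run.

```r
library(mice)

imp_a <- mice(nhanes, m = 5, seed = 2024, printFlag = FALSE)
imp_b <- mice(nhanes, m = 5, seed = 2024, printFlag = FALSE)

# The same seed and settings should give the same imputed values.
same <- identical(complete(imp_a, 1), complete(imp_b, 1))
print(same)
```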
imputationMethod and defaultImputationMethod are for backwards compatibility only.
Bodner, Todd E. (2008) “What improves with increased missing data imputations?” Structural Equation Modeling: A Multidisciplinary Journal 15: 651-675.
Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
White, Ian R., Patrick Royston and Angela M. Wood (2011) “Multiple imputation using chained equations: Issues and guidance for practice.” Statistics in Medicine 30: 377-399.