The help page for MICE defines the function as:

    mice(data, m = 5, method = vector("character", length = ncol(data)),
         predictorMatrix = (1 - diag(1, ncol(data))),
         visitSequence = (1:ncol(data))[apply(is.na(data), 2, any)],
         form = vector("character", length = ncol(data)),
         post = vector("character", length = ncol(data)),
         defaultMethod = c("pmm", "logreg", "polyreg", "polr"),
         maxit = 5, diagnostics = TRUE, printFlag = TRUE, seed = NA,
         imputationMethod = NULL, defaultImputationMethod = NULL,
         data.init = NULL, ...)

Those are a lot of parameters. How does one decide which parameters to specify and which ones to leave as default?
I’m especially interested in the number of multiple imputations, m, and the maximum iterations, maxit. How do these parameters affect accuracy? In other words, when (and how), whilst using these parameters, can I really say that a sort of convergence has been reached?
Answer
Let’s just go through the parameters one by one:
- data doesn’t require explanation.
- m is the number of imputations. Generally speaking, the more the better. Originally (following Rubin, 1987), 5 was considered to be enough (hence the default), so from an accuracy point of view 5 may be sufficient. However, this was based on an efficiency argument only. To achieve better estimates of standard errors, more imputations are needed. These days there is a rule of thumb to use as many imputations as the average percentage rate of missingness: if there is 30% missing data on average in a dataset, use 30 imputations. See Bodner (2008) and White et al. (2011) for further details.
- method specifies which imputation method is to be used. This is only necessary when the default method is to be overridden. For example, continuous data are imputed by predictive mean matching by default, and this usually works very well, but Bayesian linear regression and several others, including a multilevel model for nested/clustered data, may be specified instead. Hence, expert/clinical/statistical knowledge may be of use in specifying alternatives to the default method(s).
- predictorMatrix is a matrix which tells the algorithm which variables are used as predictors when imputing each of the other variables. As the signature above shows, the default uses every variable to predict every other one; mice can also build a matrix based on correlations between variables and the proportion of usable cases (see quickpred()). Expert/clinical knowledge may be very useful in specifying the predictor matrix, so the default should be used with care.
- visitSequence specifies the order in which variables are imputed. It is not usually needed.
- form is used primarily to aid the specification of interaction terms to be used in imputation, and isn’t normally needed.
- post is for post-imputation processing, for example to ensure that positive values are imputed. This isn’t normally needed.
- defaultMethod changes the default imputation methods, and is not normally needed.
- maxit is the number of iterations for each imputation. mice uses an iterative algorithm. It is important that the imputations for all variables reach convergence, otherwise they will be inaccurate. This can be determined visually by inspecting the trace plots generated by plot(). Unlike other Gibbs sampling methods, far fewer iterations are needed: generally in the region of 20-30 or fewer as a rule of thumb. When the trace lines reach a value and fluctuate slightly around it, convergence has been achieved. The following is an example showing healthy convergence, taken from here:
  Here, 3 variables are being imputed with 5 imputations (coloured lines) for 20 iterations (x-axis on the plots); the y-axis on the plots shows the imputed values for each imputation.
- diagnostics produces useful diagnostic information by default.
- printFlag outputs the algorithm progress by default, which is useful because the estimated time to completion can easily be ascertained.
- seed is a random seed parameter, which is useful for reproducibility.
- imputationMethod and defaultImputationMethod are for backwards compatibility only.
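
As a rough sketch of how m, maxit and seed fit together in practice, something like the following could be used. It assumes the mice package is installed and uses its bundled nhanes example data, which is not part of the question. The values maxit = 20 and seed = 123 are arbitrary illustrative choices, and the floor of 5 imputations is my own assumption rather than part of the rule of thumb. mice.mids() is shown only as one way to add iterations if the trace lines have not yet settled.

```r
library(mice)  # provides mice(), mice.mids(), plot() for mids objects, and nhanes

# Rule of thumb from the answer: choose m close to the average
# percentage of missing values in the data.
pct_missing <- 100 * mean(is.na(nhanes))
m_rule <- max(5, round(pct_missing))  # floor of 5 is an assumption (the default m)

# maxit = 20 is illustrative; seed makes the imputations reproducible.
imp <- mice(nhanes, m = m_rule, maxit = 20, seed = 123, printFlag = FALSE)

# Trace plots: one line per imputation, tracking the mean and SD of the
# imputed values across iterations. Lines that mix freely and fluctuate
# around a stable level, with no trend, suggest convergence.
plot(imp)

# If the lines are still trending, continue the same chains for more
# iterations instead of starting again from scratch.
imp_more <- mice.mids(imp, maxit = 10, printFlag = FALSE)
plot(imp_more)
```

If the trace plots still show a trend after the extra iterations, increasing maxit further (or revisiting method and predictorMatrix) would be the next thing to try.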
Bodner, Todd E. (2008) “What improves with increased missing data imputations?” Structural Equation Modeling: A Multidisciplinary Journal 15: 651-675.
https://dx.doi.org/10.1080/10705510802339072
Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
White, Ian R., Patrick Royston and Angela M. Wood (2011) “Multiple imputation using chained equations: Issues and guidance for practice.” Statistics in Medicine 30: 377-399.
https://dx.doi.org/10.1002/sim.4067
Attribution
Source: Link, Question Author: 119631, Answer Author: Robert Long