# How to get pooled p-values on tests done in multiple imputed datasets?

Using Amelia in R, I obtained multiple imputed datasets. After that, I performed a repeated measures test in SPSS. Now, I want to pool test results. I know that I can use Rubin’s rules (implemented through any multiple imputation package in R) to pool means and standard errors, but how do I pool p-values? Is it possible? Is there a function in R to do so?

Yes, it is possible and, yes, there are R functions that do it. Instead of computing the p-values of the repeated analyses by hand, you can use the package Zelig, which is also referred to in the vignette of the Amelia-package (for a more informative method see my update below). I’ll use an example from the Amelia-vignette to demonstrate this:

library("Amelia")
amelia.out <- amelia(freetrade, m = 15, ts = "year", cs = "country")

library("Zelig")
zelig.fit <- zelig(tariff ~ pop + gdp.pc + year + polity, data = amelia.out$imputations, model = "ls", cite = FALSE) summary(zelig.fit)  This is the corresponding output including $p$-values:  Model: ls Number of multiply imputed data sets: 15 Combined results: Call: lm(formula = formula, weights = weights, model = F, data = data) Coefficients: Value Std. Error t-stat p-value (Intercept) 3.18e+03 7.22e+02 4.41 6.20e-05 pop 3.13e-08 5.59e-09 5.59 4.21e-08 gdp.pc -2.11e-03 5.53e-04 -3.81 1.64e-04 year -1.58e+00 3.63e-01 -4.37 7.11e-05 polity 5.52e-01 3.16e-01 1.75 8.41e-02 For combined results from datasets i to j, use summary(x, subset = i:j). For separate results, use print(summary(x), subset = i:j).  zelig can fit a host of models other than least squares. To get confidence intervals and degrees of freedom for your estimates you can use mitools: library("mitools") imp.data <- imputationList(amelia.out$imputations)
mitools.fit <- MIcombine(with(imp.data, lm(tariff ~ polity + pop + gdp.pc + year)))
mitools.res <- summary(mitools.fit)
mitools.res <- cbind(mitools.res, df = mitools.fit$df) mitools.res  This will give you confidence intervals and proportion of the total variance that is attributable to the missing data:  results se (lower upper) missInfo df (Intercept) 3.18e+03 7.22e+02 1.73e+03 4.63e+03 57 % 45.9 pop 3.13e-08 5.59e-09 2.03e-08 4.23e-08 19 % 392.1 gdp.pc -2.11e-03 5.53e-04 -3.20e-03 -1.02e-03 21 % 329.4 year -1.58e+00 3.63e-01 -2.31e+00 -8.54e-01 57 % 45.9 polity 5.52e-01 3.16e-01 -7.58e-02 1.18e+00 41 % 90.8  Of course you can just combine the interesting results into one object: combined.results <- merge(mitools.res, zelig.res$coefficients[, c("t-stat", "p-value")], by = "row.names", all.x = TRUE)


# Update

After some playing around, I have found a more flexible way to get all necessary information using the mice-package. For this to work, you’ll need to modify the package’s as.mids()-function. Use Gerko’s version posted in my follow-up question:

as.mids2 <- function(data2, .imp=1, .id=2){
ini <- mice(data2[data2[, .imp] == 0, -c(.imp, .id)], m = max(as.numeric(data2[, .imp])), maxit=0)
names  <- names(ini$imp) if (!is.null(.id)){ rownames(ini$data) <- data2[data2[, .imp] == 0, .id]
}
for (i in 1:length(names)){
for(m in 1:(max(as.numeric(data2[, .imp])))){
if(!is.null(ini$imp[[i]])){ indic <- data2[, .imp] == m & is.na(data2[data2[, .imp]==0, names[i]]) ini$imp[[names[i]]][m] <- data2[indic, names[i]]
}
}
}
return(ini)
}


With this defined, you can go on to analyze the imputed data sets:

library("mice")
imp.data <- do.call("rbind", amelia.out$imputations) imp.data <- rbind(freetrade, imp.data) imp.data$.imp <- as.numeric(rep(c(0:15), each = nrow(freetrade)))
mice.data <- as.mids2(imp.data, .imp = ncol(imp.data), .id = NULL)

mice.fit <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc + year))
mice.res <- summary(pool(mice.fit, method = "rubin1987"))


This will give you all results you get using Zelig and mitools and more:

                  est       se     t    df Pr(>|t|)     lo 95     hi 95 nmis   fmi lambda
(Intercept)  3.18e+03 7.22e+02  4.41  45.9 6.20e-05  1.73e+03  4.63e+03   NA 0.571  0.552
pop          3.13e-08 5.59e-09  5.59 392.1 4.21e-08  2.03e-08  4.23e-08    0 0.193  0.189
gdp.pc      -2.11e-03 5.53e-04 -3.81 329.4 1.64e-04 -3.20e-03 -1.02e-03    0 0.211  0.206
year        -1.58e+00 3.63e-01 -4.37  45.9 7.11e-05 -2.31e+00 -8.54e-01    0 0.570  0.552
polity       5.52e-01 3.16e-01  1.75  90.8 8.41e-02 -7.58e-02  1.18e+00    2 0.406  0.393


Note, using pool() you can also calculate $p$-values with $df$ adjusted for small samples by omitting the method-parameter. What is even better, you can now also calculate $R^2$ and compare nested models:

pool.r.squared(mice.fit)

mice.fit2 <- with(mice.data, lm(tariff ~ polity + pop + gdp.pc))
pool.compare(mice.fit, mice.fit2, method = "Wald")\$pvalue