Suppose I have an experiment with two or more factors. An overall ANOVA is constructed, and then we follow up with two or more sets of post hoc tests, say multiple comparisons. My question is about how big, and how many, families should be used as the basis for multiplicity adjustments of these post hoc tests.
An example is the warp-breaks dataset from Tukey’s book on EDA. There are two factors: wool (at two levels) and tension (at three levels). The ANOVA table is:
| Source | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| wool | 1 | 450.7 | 450.67 | 3.7653 | 0.0582130 |
| tension | 2 | 2034.3 | 1017.13 | 8.4980 | 0.0006926 |
| wool:tension | 2 | 1002.8 | 501.39 | 4.1891 | 0.0210442 |
| Residuals | 48 | 5745.1 | 119.69 | | |
Clearly, the interaction is needed in the model. So we decide to do comparisons of the levels of each factor, holding the other factor fixed. The results are below, with some annotations to be referred to later:
Pairwise comparisons of tension for each wool (all combined: Family T)

wool = A (Family T|A):

| contrast | estimate | SE | df | t.ratio |
|---|---|---|---|---|
| L - M | 20.5555556 | 5.157299 | 48 | 3.986 |
| L - H | 20.0000000 | 5.157299 | 48 | 3.878 |
| M - H | -0.5555556 | 5.157299 | 48 | -0.108 |

wool = B (Family T|B):

| contrast | estimate | SE | df | t.ratio |
|---|---|---|---|---|
| L - M | -0.5555556 | 5.157299 | 48 | -0.108 |
| L - H | 9.4444444 | 5.157299 | 48 | 1.831 |
| M - H | 10.0000000 | 5.157299 | 48 | 1.939 |

Comparison of wool for each tension (all combined: Family W)

tension = L (Family W|L):

| contrast | estimate | SE | df | t.ratio |
|---|---|---|---|---|
| A - B | 16.333333 | 5.157299 | 48 | 3.167 |

tension = M (Family W|M):

| contrast | estimate | SE | df | t.ratio |
|---|---|---|---|---|
| A - B | -4.777778 | 5.157299 | 48 | -0.926 |

tension = H (Family W|H):

| contrast | estimate | SE | df | t.ratio |
|---|---|---|---|---|
| A - B | 5.777778 | 5.157299 | 48 | 1.120 |
I think there are different practices out there, and I wonder which are most common, and what arguments people would make for or against each approach. In computing adjusted P values, should we do multiplicity adjustments for…
1. each of the five smallest families (T|A, T|B, …, W|H) separately? (Note: the last 3 families have only one test each, so there would be no multiplicity adjustment for those.)
2. each of the larger families (T, with 6 tests, and W, with 3 tests) separately?
3. all 6+3=9 tests considered as one big family?
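To make the three options concrete, here is a minimal Python sketch (not the software that produced the output above). It converts the nine t ratios into two-sided P values on 48 df and applies a Bonferroni adjustment under each family definition. Bonferroni is an assumption chosen for simplicity; a Tukey or Sidak adjustment would give somewhat different numbers. The names `tests`, `t_two_sided_p`, and `adjust` are made up for illustration.

```python
import math

DF = 48  # residual degrees of freedom from the ANOVA table

def t_two_sided_p(t, df, steps=2000):
    """Two-sided P value for a t ratio, via Simpson integration of the t density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)

# (smallest family, larger family, label, t ratio), copied from the output above
tests = [
    ("T|A", "T", "L - M | A", 3.986),
    ("T|A", "T", "L - H | A", 3.878),
    ("T|A", "T", "M - H | A", -0.108),
    ("T|B", "T", "L - M | B", -0.108),
    ("T|B", "T", "L - H | B", 1.831),
    ("T|B", "T", "M - H | B", 1.939),
    ("W|L", "W", "A - B | L", 3.167),
    ("W|M", "W", "A - B | M", -0.926),
    ("W|H", "W", "A - B | H", 1.120),
]

def adjust(family_of):
    """Bonferroni-adjust each P value by the size of the family it belongs to."""
    sizes = {}
    for row in tests:
        sizes[family_of(row)] = sizes.get(family_of(row), 0) + 1
    return {row[2]: min(1.0, sizes[family_of(row)] * t_two_sided_p(row[3], DF))
            for row in tests}

option1 = adjust(lambda r: r[0])   # five smallest families
option2 = adjust(lambda r: r[1])   # families T and W
option3 = adjust(lambda r: "all")  # one family of nine

print(f"{'contrast':<12}{'opt 1':>10}{'opt 2':>10}{'opt 3':>10}")
for _, _, label, _ in tests:
    print(f"{label:<12}{option1[label]:>10.4f}{option2[label]:>10.4f}{option3[label]:>10.4f}")
```

For every contrast the adjusted P value can only grow as the family gets larger, and under option 1 the three wool comparisons are left unadjusted.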
I’m interested both in what people usually do (even if they haven’t thought much about it) and why (if they have). A couple of things I might mention are:
- There are 3 F tests in the ANOVA table. I don’t recall seeing anyone consider a multiplicity adjustment on ANOVA tests. If that’s the case, and you recommend option (3), are you being inconsistent?
- If we had done a somewhat smaller experiment where all the tests are less powerful, it’s possible the interaction would not have been significant, leading to a much smaller number of post hoc comparisons of marginal means only. Moreover, the marginal means could well have smaller SEs than the cell means do in the larger experiment. If, in addition, the multiplicity adjustment is less conservative, we could have more “significant” results with less data than we’d have with more data.
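A rough sketch of the standard-error part of that last point, in Python. It assumes a balanced design with 9 runs per wool-by-tension cell (inferred from 48 residual df and six cells, not stated above), and it holds the residual mean square fixed at 119.69, which would not be true in a different experiment or under an additive model. The helper `se_diff` is hypothetical.

```python
import math

MSE = 119.69     # residual mean square from the ANOVA table
N_PER_CELL = 9   # ASSUMPTION: balanced design, 54 runs over 6 cells

def se_diff(n_per_mean, mse=MSE):
    """SE of a difference between two means, each averaging n_per_mean runs."""
    return math.sqrt(mse * 2.0 / n_per_mean)

se_cell = se_diff(N_PER_CELL)         # cell means, as in the output above
se_tension = se_diff(2 * N_PER_CELL)  # marginal tension means pool 2 wools
se_wool = se_diff(3 * N_PER_CELL)     # marginal wool means pool 3 tensions

print(f"cell-mean comparison SE:        {se_cell:.4f}")
print(f"marginal tension comparison SE: {se_tension:.4f}")
print(f"marginal wool comparison SE:    {se_wool:.4f}")
```

The cell-mean value reproduces the 5.157 in the output above, and the marginal comparisons come out smaller by factors of sqrt(2) and sqrt(3), simply because more runs go into each mean.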
Interested in seeing what people have to say…
No one’s answered yet, so I’ll take a crack at this.
It’s my opinion (and I would love to hear others’ thoughts) that you should be adjusting for the full 9 tests in this case. Assuming we’re using family-wise error rate (FWER) correction, we are drawing conclusions from all 9 tests at once, i.e. scanning down the list looking for anything significant.
To be able to do this, we need an overall family-wise error rate of 5%. The alternative would be to correct each group individually to a 5% FWER. Then we could not interpret the tests together: we would have to look at the first 6 tests knowing there is up to a 5% chance of at least one false positive among them, then examine each of the further groups in turn knowing there is up to a 5% chance of a false positive within that group too. IMO the utility of multiple testing correction is that it lets us draw inference from multiple tests at once. It seems more logical to look at all 9 tests and know there’s at most a 5% chance of any false positive than to have to examine the groups separately, which is akin to not correcting at all.
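Some quick arithmetic on what per-group correction buys you overall, as a Python sketch. It shows two reference numbers for the chance of at least one false positive anywhere when each of k families is held at 5%: the union bound (worst case) and the value if the families were independent. Independence is an assumption made only for illustration; these tests share one error term, so they are not independent. The function name is made up.

```python
ALPHA = 0.05

def overall_fwer_bounds(n_families, alpha=ALPHA):
    """Overall error rate when each of n_families is controlled at alpha."""
    union_bound = min(1.0, n_families * alpha)            # worst case
    if_independent = 1.0 - (1.0 - alpha) ** n_families    # illustrative only
    return union_bound, if_independent

for label, k in [("one family of 9 tests", 1),
                 ("families T and W", 2),
                 ("five smallest families", 5)]:
    bound, indep = overall_fwer_bounds(k)
    print(f"{label:<24} union bound {bound:.4f}   if independent {indep:.4f}")
```

With the five smallest families the overall rate can approach a quarter, which is the sense in which per-group correction stops supporting statements about the whole list.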
The issue of adjusting for the three F-tests in the ANOVA is interesting, but in my opinion only relevant if you plan to do some model selection in which you only accept significant predictors. This might be a good read; the conclusion in particular is succinct and excellent. I stole that link from this question.
Your point about the inclusion of interaction effects is interesting, and I think you could define that as model selection. Would you have included the interaction effects if they had not been significant? If not, then perhaps the F tests in the original ANOVA should have been adjusted in order to facilitate selection of significant predictors.
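For what it’s worth, here is what adjusting the three ANOVA P values would look like, as a small Python sketch. Bonferroni and Holm are my choice of methods here, not something from the question, and the function names are made up.

```python
# P values copied from the ANOVA table in the question
anova_p = {"wool": 0.0582130, "tension": 0.0006926, "wool:tension": 0.0210442}

def bonferroni(pvals):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(pvals)
    return {k: min(1.0, m * p) for k, p in pvals.items()}

def holm(pvals):
    """Holm step-down: multiply the i-th smallest P by (m - i), keep monotone."""
    m = len(pvals)
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    adjusted, running_max = {}, 0.0
    for rank, (k, p) in enumerate(ordered):
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[k] = running_max
    return adjusted

bonf, hol = bonferroni(anova_p), holm(anova_p)
for term in anova_p:
    print(f"{term:<14} raw {anova_p[term]:.4f}  "
          f"Bonferroni {bonf[term]:.4f}  Holm {hol[term]:.4f}")
```

The interaction goes to about 0.063 under Bonferroni but about 0.042 under Holm, so whether it survives at 5% depends on the method, which rather supports treating this as a model selection decision.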
Overall I think that if you are drawing simultaneous inference from a group, you must consider each test in that group for correction. Otherwise the standard understanding of controlled group error rate doesn’t hold up, and it’s quite difficult to conceptually keep track of what has been adjusted and what hasn’t. Much better, in my opinion, to hold all tests accountable and hold the family-wise error rate at a given threshold.
If you have any rebuttals, I would love to hear them, and I’m sure some people will disagree with some things in here. Very interested to hear others’ thoughts.