I’ve done an experiment where I’ve collected measurements from a number of participants. Each relevant data point has two variables, both categorical: in fact, each variable has two possible values (answers to two yes/no questions). I would like a statistical hypothesis test to check whether there appears to be a correlation between these two variables.
If I had one data point per participant, I could use Fisher’s exact test on the resulting 2×2 contingency table. However, I have multiple data points per participant. Consequently, Fisher’s exact test does not seem applicable, because the data points from a single participant are not independent. For instance, if I have 10 data points from Alice, those probably aren’t independent, because they all came from the same person. Fisher’s exact test assumes that all data points were independently sampled, so the assumptions of Fisher’s exact test are not satisfied and it would be inappropriate to use in this setting (it might give unjustified reports of statistical significance).
Are there techniques to handle this situation?
Approaches I’ve considered:
One plausible alternative is to aggregate all the data from each participant into a single number, and then use some other test of independence. For instance, for each participant, I could count the fraction of Yes answers to the first question and the fraction of Yes answers to the second question, giving me two real numbers per participant, and then use Pearson’s product-moment test to test for correlation between these two numbers. However, I’m not sure whether this is a good approach. (For example, I worry that averaging/counting throws out data and may lose power, or that signs of dependence might disappear after aggregation.)
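To make the aggregation idea concrete, here is a minimal sketch in base R. The data are simulated and the column names (`participant`, `q1`, `q2`) are assumptions for illustration, not part of my actual experiment.

```r
# Hypothetical long-format data: one row per data point, answers coded 0/1.
# Column names and sizes are assumptions for illustration only.
set.seed(1)
d <- data.frame(
  participant = rep(1:20, each = 10),
  q1 = rbinom(200, 1, 0.5),
  q2 = rbinom(200, 1, 0.5)
)

# One row per participant: fraction of Yes answers to each question.
agg <- aggregate(cbind(q1, q2) ~ participant, data = d, FUN = mean)

# Pearson product-moment test on the per-participant fractions.
res <- cor.test(agg$q1, agg$q2, method = "pearson")
print(res$estimate)
print(res$p.value)
```

Note that this asks whether participants who say Yes often to one question also say Yes often to the other, which is not quite the same as asking whether the two answers go together within a single data point.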
I’ve read about multi-level models, which sound like they are intended to handle this situation when the underlying variables are continuous (e.g., real numbers) and when a linear model is appropriate. However, here I have two categorical variables (answers to Yes/No questions), so they don’t seem to apply here. Is there some equivalent technique that is applicable to categorical data?
I’ve also read a tiny bit about repeated measures design on Wikipedia, but the Wikipedia article focuses on longitudinal studies. That doesn’t seem applicable here: if I understand it correctly, repeated measures seems to focus on effects due to the passage of time (where the progression of time influences the variables). However, in my case, the passage of time shouldn’t have any relevant effect. Do tell me if I’ve misunderstood.
On further reflection, another approach that occurs to me is to use a permutation test. For each participant, we could randomly permute their answers to question 1 and (independently) randomly permute their answers to question 2, using a different permutation for each participant. However, it’s not clear to me what test statistic would be appropriate here, to measure which outcomes are “at least as extreme” as the observed outcome.
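As a sketch of what such a within-participant permutation test might look like, here is one version in base R. The test statistic used here (the total number of data points with Yes to both questions, summed over participants) is only one candidate and is my own assumption; I believe it is the same quantity that a Cochran-Mantel-Haenszel test stratified by participant is built on. The data are simulated and the column names are hypothetical.

```r
# Simulated, hypothetical data; column names are assumptions for illustration.
set.seed(42)
d <- data.frame(
  participant = rep(1:15, each = 10),
  q1 = rbinom(150, 1, 0.5),
  q2 = rbinom(150, 1, 0.4)
)

# Candidate statistic: number of data points with Yes (1) to both questions,
# summed over all participants.
both_yes <- function(q1, q2) sum(q1 == 1 & q2 == 1)

# Shuffle answers within each participant. Shuffling one column is enough:
# also permuting the other gives the same distribution of pairings.
shuffle_within <- function(x, g) {
  ave(x, g, FUN = function(v) v[sample.int(length(v))])
}

observed <- both_yes(d$q1, d$q2)
n_perm <- 2000
perm <- replicate(n_perm, both_yes(d$q1, shuffle_within(d$q2, d$participant)))

# Two-sided Monte Carlo p-value based on distance from the permutation mean,
# with the +1 correction so the p-value is never exactly zero.
center <- mean(perm)
p_value <- (1 + sum(abs(perm - center) >= abs(observed - center))) / (n_perm + 1)
print(c(observed = observed, p_value = p_value))
```

Because the shuffling happens inside each participant, every permuted data set keeps each participant's own Yes rates for both questions, so differences between participants cannot by themselves produce a small p-value.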
Related: How to correctly treat multiple data points per each subject (but that also focuses on linear models for continuous variables, not categorical data), Are Measurements made on the same patient independent? (same)
Context of my answer
I self-studied this question yesterday (specifically, whether mixed models can be used here). I am shamelessly dumping my fresh understanding of this approach for 2×2 tables, and I will wait for more advanced peers to correct my imprecisions or misunderstandings. My answer will therefore be lengthy and overly didactic (or at least trying to be), both to help and to expose my own flaws. First of all, I must say that I shared the confusion you stated here:
I’ve read about multi-level models, which sound like they are intended to handle this situation when the underlying variables are continuous (e.g., real numbers) and when a linear model is appropriate
I studied all the examples from the paper “Random-effects modelling of categorical response data”. The title itself contradicts this thought. For our problem of 2×2 tables with repeated measurements, the example in section 3.6 is the relevant one. I mention it for reference only, as my goal is to explain it. I may edit out this section in the future if this context is no longer necessary.
The first thing to understand is that the random effect is modelled in much the same way as in regression on a continuous variable. Indeed, a regression on a categorical variable is nothing other than a linear regression on the logit (or another link function, such as the probit) of the probability associated with the different levels of this categorical variable. If $\pi_i$ is the probability of answering yes to question $i$, then $\operatorname{logit}(\pi_i) = \text{FixedEffects}_i + \text{RandomEffect}_i$. This model is linear, and the random effects can be expressed in the classical numerical way, for example $\text{RandomEffect}_i \sim N(0, \sigma)$. In this problem, the random effect represents the subject-related variation for the same answer.
For our problem, we want to model $\pi_{ijv}$, the probability that subject $i$ answers “yes” for variable $v$ at interview time $j$. The logit of this probability is modelled as a combination of fixed effects and subject-related random effects.
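Written out, and assuming the notation used in the rest of this answer (fixed effects $\beta_{jv}$ and a random effect $u_{ij}$, normal as discussed below), the model would look something like this:

```latex
\operatorname{logit}(\pi_{ijv})
  = \log\frac{\pi_{ijv}}{1 - \pi_{ijv}}
  = \beta_{jv} + u_{ij},
\qquad u_{ij} \sim N(0, \sigma)
```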
About the fixed effects
The fixed effects are then related to the probability of answering “yes” to question $v$ at time $j$. Depending on your scientific goal, you can use a likelihood ratio test to check whether the equality of certain fixed effects must be rejected. For example, the model where $\beta_{1v} = \beta_{2v} = \beta_{3v} = \dots$ means that there is no tendency for the answers to change from one time to the next. If you assume that this global tendency does not exist, which seems to be the case for your study, you can drop the index $j$ straightaway: in your model, $\beta_{jv}$ becomes $\beta_v$. By analogy, you can use a likelihood ratio test to check whether the equality $\beta_1 = \beta_2$ must be rejected.
About random effects
I know it’s possible to model random effects with something other than normal errors, but I prefer to answer on the basis of normal random effects for the sake of simplicity.
The random effects can be modelled in different ways. With the notation $u_{ij}$, I assumed that a random effect is drawn from its distribution each time a subject answers a question. This is the most specific degree of variation possible. If I used $u_i$ instead, it would mean that a random effect is drawn for each subject $i$ and is the same for each question $v$ they have to answer (some subjects would then have a tendency to answer yes more often). You have to make a choice. If I understood correctly, you can also have both random effects: $u_i \sim N(0, \sigma_1)$, which is drawn per subject, and $u_{ij} \sim N(0, \sigma_2)$, which is drawn per subject and answer. I think your choice depends on the details of your case. But if I understood correctly, the risk of overfitting by adding random effects is not large, so when in doubt, one can include several levels.
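To put the three options side by side, here is how I would write them, assuming the time index has already been dropped from the fixed effects (so $\beta_v$ instead of $\beta_{jv}$):

```latex
% (a) subject-level effect only
\operatorname{logit}(\pi_{ijv}) = \beta_v + u_i, \qquad u_i \sim N(0, \sigma_1)

% (b) one effect per subject and answer only
\operatorname{logit}(\pi_{ijv}) = \beta_v + u_{ij}, \qquad u_{ij} \sim N(0, \sigma_2)

% (c) both levels together
\operatorname{logit}(\pi_{ijv}) = \beta_v + u_i + u_{ij}, \qquad
u_i \sim N(0, \sigma_1), \quad u_{ij} \sim N(0, \sigma_2)
```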
I realize how weird my answer is; this is just embarrassing rambling, certainly more helpful to me than to others. Maybe I’ll edit out 90% of it.
I am not more confident, but more disposed to get to the point.
I would suggest comparing the model with nested random effects ($u_i + u_{iv}$) against the model with only the combined random effect ($u_{iv}$). The idea is that the $u_i$ term is solely responsible for the dependency between answers. Rejecting independence amounts to rejecting the model without $u_i$. Using glmer to test this would give something like:
    library(lme4)  # provides glmer

    model1 <- glmer(yes ~ Question + (1 | Subject/Question), data = df, family = binomial)
    model2 <- glmer(yes ~ Question + (1 | Subject:Question), data = df, family = binomial)
    anova(model1, model2)
Question is a dummy variable indicating whether question 1 or question 2 is asked.
If I understood correctly, (1 | Subject/Question) corresponds to the nested structure $u_i + u_{iv}$, and (1 | Subject:Question) is just the combination $u_{iv}$.
anova computes a likelihood ratio test between the two models.
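To try the comparison end to end, here is a small sketch on simulated data. Everything about the data is hypothetical (30 subjects, 8 answers per subject and question, the effect sizes and standard deviations); only the two glmer formulas come from the suggestion above. It assumes the lme4 package is installed.

```r
library(lme4)  # assumed to be installed

# Simulated, hypothetical data: 30 subjects x 2 questions x 8 answers per cell.
set.seed(7)
n_subj <- 30
n_rep <- 8
df <- expand.grid(rep = 1:n_rep, Question = factor(1:2), Subject = factor(1:n_subj))

# Random effects used to generate the data (values are arbitrary assumptions).
u_i  <- rnorm(n_subj, sd = 1)        # subject-level effect, shared by both questions
u_iv <- rnorm(n_subj * 2, sd = 0.5)  # subject-by-question effect
cell <- (as.integer(df$Subject) - 1) * 2 + as.integer(df$Question)

# Linear predictor on the logit scale, then Bernoulli draws.
eta <- -0.2 + 0.4 * (df$Question == "2") + u_i[as.integer(df$Subject)] + u_iv[cell]
df$yes <- rbinom(nrow(df), 1, plogis(eta))

# Nested structure (u_i + u_iv) versus combined effect only (u_iv).
model1 <- glmer(yes ~ Question + (1 | Subject/Question), data = df, family = binomial)
model2 <- glmer(yes ~ Question + (1 | Subject:Question), data = df, family = binomial)

print(VarCorr(model1))          # estimated standard deviations of u_i and u_iv
lrt <- anova(model1, model2)    # likelihood ratio test
print(lrt)
```

One caveat I am fairly sure of: the null hypothesis here puts a variance at the boundary of its range (zero), so the chi-square p-value reported by anova tends to be conservative.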