I have looked through a lot of R datasets, postings in DASL, and elsewhere, and I am not finding many good examples of interesting datasets that illustrate analysis of covariance for experimental data. There are numerous “toy” datasets with contrived data in statistics textbooks.
I’d like to have an example where:
- The data are real, with an interesting story
- There is at least one treatment factor and two covariates
- At least one covariate is affected by one or more of the treatment factors, and one is not affected by treatments.
- Experimental rather than observational, preferably
My real goal is to find a good example to put in the vignette for my R package. But a larger goal is that people need to see good examples to illustrate some important concerns in covariance analysis. Consider the following made-up scenario (and please understand that my knowledge of agriculture is superficial at best).
- We do an experiment where fertilizers are randomized to plots, and a crop is planted. After a suitable growing period, we harvest the crop and measure some quality characteristic – that’s the response variable. But we also record total rainfall during the growing period, and soil acidity at time of harvest — and, of course, which fertilizer was used. Thus we have two covariates and a treatment.
The usual way to analyze the resulting data would be to fit a linear model with the treatment as a factor and additive effects for the covariates. Then, to summarize the results, one computes “adjusted means” (AKA least-squares means), which are predictions from the model for each fertilizer at the average rainfall and the average soil acidity. This puts everything on an equal footing, because when we compare these results, we are holding rainfall and acidity constant.
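To make the computation concrete, here is a minimal sketch in Python with NumPy. The data are simulated and entirely made up, and the names (`fert`, `rain`, `acid`, `design`) are hypothetical; this only illustrates the idea of predicting at the covariate averages, not how any particular package implements it.

```python
import numpy as np

# Made-up data for illustration only: 3 fertilizers, 10 plots each.
rng = np.random.default_rng(0)
fert = np.repeat([0, 1, 2], 10)            # treatment labels
rain = rng.normal(50.0, 5.0, size=30)      # covariate 1: rainfall
acid = rng.normal(6.0, 0.3, size=30)       # covariate 2: soil acidity
y = (10 + np.array([0.0, 1.0, 2.0])[fert] + 0.2 * rain - 1.5 * acid
     + rng.normal(0, 0.5, size=30))

def design(fert, rain, acid):
    """Intercept, dummy columns for fertilizers 1 and 2, then the covariates."""
    fert = np.asarray(fert)
    return np.column_stack([
        np.ones(len(fert)),
        (fert == 1).astype(float),
        (fert == 2).astype(float),
        rain,
        acid,
    ])

# Additive linear model: treatment factor plus both covariates.
beta, *_ = np.linalg.lstsq(design(fert, rain, acid), y, rcond=None)

# Reference grid: every fertilizer, at the average of each covariate.
levels = np.array([0, 1, 2])
grid = design(levels, np.full(3, rain.mean()), np.full(3, acid.mean()))
adjusted_means = grid @ beta
print(adjusted_means)
```

Because both covariates are held at the same values for every fertilizer, the differences between these adjusted means are exactly the fitted treatment coefficients.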
But this is probably the wrong thing to do, because the fertilizer probably affects the soil acidity as well as the response. Holding acidity constant then makes the adjusted means misleading, because part of the treatment effect operates through its effect on acidity. One way to handle this would be to take acidity out of the model; then the rainfall-adjusted means would provide a fair comparison. But if acidity is important, this fairness comes at great cost: an increase in residual variation.
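The trade-off can be seen in a small simulation. This is a sketch under stated assumptions: the data are invented, the fertilizer is assumed to shift acidity, and the true effect of fertilizer 2 versus fertilizer 0 is set to 2 directly plus 1.5 through acidity (3.5 in total).

```python
import numpy as np

# Simulated data, invented purely for illustration (not from a real experiment).
rng = np.random.default_rng(1)
fert = np.repeat([0, 1, 2], 100)
rain = rng.normal(50.0, 5.0, size=fert.size)
# Assumption: the fertilizer shifts soil acidity, so acidity is post-treatment.
acid = 6.0 + np.array([0.0, -0.5, -1.0])[fert] + rng.normal(0, 0.3, size=fert.size)
y = (10 + np.array([0.0, 1.0, 2.0])[fert] + 0.2 * rain - 1.5 * acid
     + rng.normal(0, 0.5, size=fert.size))

# Intercept plus dummy columns for fertilizers 1 and 2.
dummies = np.column_stack([np.ones(fert.size), fert == 1, fert == 2]).astype(float)

def fit(X, y):
    """Ordinary least squares; returns coefficients and residual SD."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sd = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
    return beta, sd

beta_with, sd_with = fit(np.column_stack([dummies, rain, acid]), y)
beta_without, sd_without = fit(np.column_stack([dummies, rain]), y)

# Fertilizer 2 vs. fertilizer 0 under each model.
print("holding acidity fixed:", beta_with[2], "residual SD:", sd_with)
print("acidity left out:     ", beta_without[2], "residual SD:", sd_without)
```

With acidity in the model, the comparison recovers only the direct part of the effect; with acidity left out, it recovers the total effect but with a noticeably larger residual standard deviation.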
There are ways to work around this by using an adjusted version of acidity in the model instead of its original values. The upcoming update to my R package lsmeans will make this downright easy. But I want to have a good example to illustrate it. I will be very thankful to, and will duly acknowledge, anyone who can point me to some good illustrative datasets.
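One possible form of such an adjustment, sketched here in Python with NumPy on invented data, is to keep acidity in the model but predict each fertilizer at its own average acidity rather than at the overall average. This is only an assumption about what an "adjusted version of acidity" could look like; it is not taken from the lsmeans source, and all names are hypothetical.

```python
import numpy as np

# Simulated data, invented purely for illustration (not from a real experiment).
rng = np.random.default_rng(2)
fert = np.repeat([0, 1, 2], 100)
rain = rng.normal(50.0, 5.0, size=fert.size)
# Assumption: the fertilizer shifts soil acidity.
acid = 6.0 + np.array([0.0, -0.5, -1.0])[fert] + rng.normal(0, 0.3, size=fert.size)
y = (10 + np.array([0.0, 1.0, 2.0])[fert] + 0.2 * rain - 1.5 * acid
     + rng.normal(0, 0.5, size=fert.size))

# Full model: treatment dummies, rainfall, and acidity.
X = np.column_stack([np.ones(fert.size), fert == 1, fert == 2, rain, acid]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

levels = np.array([0, 1, 2])
# Average acidity observed under each fertilizer.
acid_by_fert = np.array([acid[fert == k].mean() for k in levels])

def predict(acid_values):
    """Predictions for each fertilizer at average rainfall and the given acidity."""
    grid = np.column_stack([np.ones(3), levels == 1, levels == 2,
                            np.full(3, rain.mean()), acid_values]).astype(float)
    return grid @ beta

at_overall_acid = predict(np.full(3, acid.mean()))  # conventional adjusted means
at_own_acid = predict(acid_by_fert)                 # acidity follows the treatment

print("fert 2 vs 0, acidity held at overall mean:", at_overall_acid[2] - at_overall_acid[0])
print("fert 2 vs 0, acidity at each group's mean:", at_own_acid[2] - at_own_acid[0])
```

The second comparison lets the treatment's effect on acidity flow into the predictions, while acidity still stays in the model to absorb residual variation.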
You may want to check out the mediation R package. It does include experimental data, such as framing, where the treatment variable affects both a response variable and covariates (i.e., mediators of the treatment effect), along with covariates not affected by the treatment.
I looked into the mediation literature because I thought you described exactly a mediation study: the fertilizer effect on the crop quality is mediated through its effect on soil acidity. Even if the datasets in the mediation package do not satisfy you, you may find one if you look into the mediation literature.