I am working on my master's thesis at the moment and planned on running the statistics with SigmaPlot. However, after spending some time with my data, I came to the conclusion that SigmaPlot might not be fit for my problem (I may be mistaken), so I started my first attempts in R, which did not exactly make it easier.
The plan was to run a simple two-way ANOVA on my data, which come from 3 different proteins and 8 different treatments applied to them, so my two factors are protein and treatment. I tested for normality using both
> ks.test(time, "norm", mean=mean(time), sd=sqrt(var(time)))
In both cases (maybe not surprising) I ended up with a non-normal distribution.
That left me with the first question: which test to use for equality of variances. I came up with
and the result was that I don't have equality of variance in my data either.
I tried different data transformations (log, centering, standardization), none of which solved my problems with the variances.
Now I am at a loss as to how to conduct the ANOVA to test which proteins and which treatments differ significantly from each other. I found something about a Kruskal-Wallis test, but only for one factor (?). I also found things about ranking or randomization, but not yet how to implement those techniques in R.
Does anyone have a suggestion what I should do?
Edit: thank you for your answers. I am a little overwhelmed by the reading (it just seems to be getting more and more instead of less), but I will of course keep going.
Here is an example of my data, as suggested (I am very sorry for the format; I couldn't figure out another solution or place to put a file. I am still new to all this.):
protein  treatment   time
A        con         2329.0
A        HY          1072.0
A        CL1         4435.0
A        CL2         2971.0
A        CL1-HY sim   823.5
A        CL2-HY sim   491.5
A        CL1+HY mix  2510.5
A        CL2+HY mix  2484.5
A        con         2454.0
A        HY          1180.5
A        CL1         3249.7
A        CL2         2106.7
A        CL1-HY sim   993.0
A        CL2-HY sim   817.5
A        CL1+HY mix  1981.0
A        CL2+HY mix  2687.5
B        con         1482.0
B        HY          2084.7
B        CL1         1498.0
B        CL2         1258.5
B        CL1-HY sim  1795.7
B        CL2-HY sim  1804.5
B        CL1+HY mix  1633.0
B        CL2+HY mix  1416.3
B        con         1339.0
B        HY          2119.0
B        CL1         1093.3
B        CL2         1026.5
B        CL1-HY sim  2315.5
B        CL2-HY sim  2048.5
B        CL1+HY mix  1465.0
B        CL2+HY mix  2334.5
C        con         1614.8
C        HY          1525.5
C        CL1          426.3
C        CL2         1192.0
C        CL1-HY sim  1546.0
C        CL2-HY sim   874.5
C        CL1+HY mix  1386.0
C        CL2+HY mix   364.5
C        con         1907.5
C        HY          1152.5
C        CL1          639.7
C        CL2         1306.5
C        CL1-HY sim  1515.0
C        CL2-HY sim  1251.0
C        CL1+HY mix  1350.5
C        CL2+HY mix  1230.5
This may be more of a comment than an answer, but it won’t fit as a comment. We may be able to help you here, but this may take a few iterations; we need more information.
First, what is your response variable?
Second, note that the marginal distribution of your response does not have to be normal; rather, the distribution conditional on the model (i.e., the residuals) should be, and it is not clear that you have examined your residuals. Furthermore, normality is the least important assumption of a linear model (e.g., an ANOVA); the residuals need not be perfectly normal. Tests of normality are not generally worthwhile (see here for a discussion on CV); plots are much better. I would try a qq-plot of your residuals. In R this is done with qqnorm(), or try qqPlot() in the car package. It's also worth considering the manner in which the residuals are non-normal: skewness is more damaging than excess kurtosis, in particular if the skews alternate directions amongst the groups.
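As a rough sketch of the residual check described above, the following fits the two-way model to the data posted in the question and draws a normal qq-plot of the residuals with base R. The data frame name `d`, the vector name `time_vals`, and the reading of the two rows per protein/treatment combination as two replicates are assumptions made for illustration.

```r
# Data copied from the question; row order is protein A, B, C,
# each with two apparent replicates of the 8 treatments (an assumption).
treatments <- c("con", "HY", "CL1", "CL2",
                "CL1-HY sim", "CL2-HY sim", "CL1+HY mix", "CL2+HY mix")
time_vals <- c(
  2329.0, 1072.0, 4435.0, 2971.0,  823.5,  491.5, 2510.5, 2484.5,  # A, rows 1-8
  2454.0, 1180.5, 3249.7, 2106.7,  993.0,  817.5, 1981.0, 2687.5,  # A, rows 9-16
  1482.0, 2084.7, 1498.0, 1258.5, 1795.7, 1804.5, 1633.0, 1416.3,  # B, rows 1-8
  1339.0, 2119.0, 1093.3, 1026.5, 2315.5, 2048.5, 1465.0, 2334.5,  # B, rows 9-16
  1614.8, 1525.5,  426.3, 1192.0, 1546.0,  874.5, 1386.0,  364.5,  # C, rows 1-8
  1907.5, 1152.5,  639.7, 1306.5, 1515.0, 1251.0, 1350.5, 1230.5   # C, rows 9-16
)
d <- data.frame(
  protein   = factor(rep(c("A", "B", "C"), each = 16)),
  treatment = factor(rep(treatments, times = 6), levels = treatments),
  time      = time_vals
)

# Two-way ANOVA with interaction; normality is judged on the residuals,
# not on the raw response.
fit <- aov(time ~ protein * treatment, data = d)
res <- residuals(fit)

# Normal qq-plot of the residuals with a reference line.
qqnorm(res, main = "Normal Q-Q plot of ANOVA residuals")
qqline(res)
```

If the car package is installed, `car::qqPlot(res)` draws a similar plot with a confidence envelope.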
If there really is a problem worth worrying about, a transformation is a good strategy. Taking the log of your raw data is one option, but not the only one. Note that centering and standardizing aren’t really transformations in this sense. You want to look into the Box & Cox family of power transformations. And remember, the result doesn’t have to be perfectly normal, just good enough.
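One way to explore the Box-Cox family in R is `boxcox()` from the MASS package, which profiles the log-likelihood over a grid of powers. The sketch below applies it to the data from the question; the names `d`, `time_vals`, and `lambda_hat`, the grid of lambda values, and the choice of the full interaction model are assumptions for illustration, not a recommendation.

```r
library(MASS)  # provides boxcox()

# Data copied from the question (same assumed layout: 3 proteins x 8 treatments x 2 rows).
treatments <- c("con", "HY", "CL1", "CL2",
                "CL1-HY sim", "CL2-HY sim", "CL1+HY mix", "CL2+HY mix")
time_vals <- c(
  2329.0, 1072.0, 4435.0, 2971.0,  823.5,  491.5, 2510.5, 2484.5,
  2454.0, 1180.5, 3249.7, 2106.7,  993.0,  817.5, 1981.0, 2687.5,
  1482.0, 2084.7, 1498.0, 1258.5, 1795.7, 1804.5, 1633.0, 1416.3,
  1339.0, 2119.0, 1093.3, 1026.5, 2315.5, 2048.5, 1465.0, 2334.5,
  1614.8, 1525.5,  426.3, 1192.0, 1546.0,  874.5, 1386.0,  364.5,
  1907.5, 1152.5,  639.7, 1306.5, 1515.0, 1251.0, 1350.5, 1230.5
)
d <- data.frame(
  protein   = factor(rep(c("A", "B", "C"), each = 16)),
  treatment = factor(rep(treatments, times = 6), levels = treatments),
  time      = time_vals
)

# Fit the linear model, keeping y and the QR decomposition that boxcox() uses.
fit <- lm(time ~ protein * treatment, data = d, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of powers (the response must be positive).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Power with the highest profile log-likelihood on the grid.
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

A value near 0 would point toward the log, near 0.5 toward the square root, and near 1 toward leaving the response untransformed; in practice a convenient nearby power is usually chosen rather than the exact maximizer.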
Next, I don't follow your use of the chi-squared test for homogeneity of variance, although it may be perfectly fine. I would suggest you use Levene's test (available in the car package). Heterogeneity is more damaging than non-normality, but the ANOVA is pretty robust if the heterogeneity is minor. A standard rule of thumb is that the largest group variance can be up to four times the smallest without posing strong problems. A good transformation should also address heterogeneity.
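The rule of thumb and a Levene-type check can be sketched in base R as follows. Because the posted data appear to have only two rows per protein/treatment cell, per-cell variances would be very noisy, so this sketch pools the model residuals by protein; that grouping, and the names `d`, `res`, `group_var`, and `var_ratio`, are assumptions made for illustration.

```r
# Data copied from the question (same assumed layout: 3 proteins x 8 treatments x 2 rows).
treatments <- c("con", "HY", "CL1", "CL2",
                "CL1-HY sim", "CL2-HY sim", "CL1+HY mix", "CL2+HY mix")
time_vals <- c(
  2329.0, 1072.0, 4435.0, 2971.0,  823.5,  491.5, 2510.5, 2484.5,
  2454.0, 1180.5, 3249.7, 2106.7,  993.0,  817.5, 1981.0, 2687.5,
  1482.0, 2084.7, 1498.0, 1258.5, 1795.7, 1804.5, 1633.0, 1416.3,
  1339.0, 2119.0, 1093.3, 1026.5, 2315.5, 2048.5, 1465.0, 2334.5,
  1614.8, 1525.5,  426.3, 1192.0, 1546.0,  874.5, 1386.0,  364.5,
  1907.5, 1152.5,  639.7, 1306.5, 1515.0, 1251.0, 1350.5, 1230.5
)
d <- data.frame(
  protein   = factor(rep(c("A", "B", "C"), each = 16)),
  treatment = factor(rep(treatments, times = 6), levels = treatments),
  time      = time_vals
)

# Residuals from the two-way model with interaction.
fit <- aov(time ~ protein * treatment, data = d)
res <- residuals(fit)

# Rule of thumb: ratio of the largest to the smallest residual variance,
# here computed across the three proteins.
group_var <- tapply(res, d$protein, var)
var_ratio <- max(group_var) / min(group_var)
print(group_var)
print(var_ratio)

# Levene-type check: one-way ANOVA on the absolute residuals.
# With the car package installed, car::leveneTest() offers a packaged Levene's test.
d$abs_res <- abs(res)
lev <- anova(lm(abs_res ~ protein, data = d))
print(lev)
```

Swapping `protein` for `treatment` in the last two steps gives the same check across treatments.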
If these strategies are insufficient, I would probably explore robust regression before trying a non-parametric approach.
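For reference, one readily available robust regression routine in R is `rlm()` in the MASS package (M-estimation, Huber weights by default). The sketch below is only an illustration on the posted data: the main-effects-only formula and the names `d` and `fit_rob` are assumptions, and whether an additive model is adequate for these data is a separate question.

```r
library(MASS)  # provides rlm()

# Data copied from the question (same assumed layout: 3 proteins x 8 treatments x 2 rows).
treatments <- c("con", "HY", "CL1", "CL2",
                "CL1-HY sim", "CL2-HY sim", "CL1+HY mix", "CL2+HY mix")
time_vals <- c(
  2329.0, 1072.0, 4435.0, 2971.0,  823.5,  491.5, 2510.5, 2484.5,
  2454.0, 1180.5, 3249.7, 2106.7,  993.0,  817.5, 1981.0, 2687.5,
  1482.0, 2084.7, 1498.0, 1258.5, 1795.7, 1804.5, 1633.0, 1416.3,
  1339.0, 2119.0, 1093.3, 1026.5, 2315.5, 2048.5, 1465.0, 2334.5,
  1614.8, 1525.5,  426.3, 1192.0, 1546.0,  874.5, 1386.0,  364.5,
  1907.5, 1152.5,  639.7, 1306.5, 1515.0, 1251.0, 1350.5, 1230.5
)
d <- data.frame(
  protein   = factor(rep(c("A", "B", "C"), each = 16)),
  treatment = factor(rep(treatments, times = 6), levels = treatments),
  time      = time_vals
)

# Robust fit of the main effects (illustrative model choice, not a recommendation).
fit_rob <- rlm(time ~ protein + treatment, data = d)
print(summary(fit_rob))

# Observations with weights well below 1 were down-weighted as outlying.
print(round(fit_rob$w, 2))
```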
If you can edit your question and say more about your data, I may be able to update this to provide more specific information.