Independent samples t-test: Do data really need to be normally distributed for large sample sizes?

Let’s say I want to test if two independent samples have different means. I know the underlying distribution is not normal.

If I understand correctly, my test statistic is the mean, and for large enough sample sizes, the mean should become normally distributed even if the samples are not. So a parametric significance test should be valid in this case, right? I have read conflicting and confusing information about this so I would appreciate some confirmation (or explanation why I’m wrong).

Also, I’ve read that for large sample sizes, I should use the z-statistic instead of the t-statistic. But in practice, the t-distribution will just converge to the normal distribution and the two statistics should be the same, no?

Edit: Below are some sources describing the z-test. They both state that the populations must be normally distributed:

Here, it says “Irrespective of the type of Z-test used it is assumed that the populations from which the samples are drawn are normal.”
And here, the requirements for the z-test are listed as “Two normally distributed but independent populations, σ is known”.

Answer

I think this is a common misunderstanding of the CLT. Not only does the CLT have nothing to do with preserving type II error (which no one has mentioned here), but it is often not applicable when you must estimate the population variance. The sample variance can be very far from a scaled chi-squared distribution when the data are non-Gaussian, so the CLT-based approximation may fail even when the sample size is in the tens of thousands. For many distributions the SD is not even a good measure of dispersion.
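A small simulation can make the point about the sample variance concrete. This is only a sketch under stated assumptions: the sample size (n = 100), the lognormal shape parameter (1.5), the number of replications, and the helper names are illustrative choices, not anything taken from a particular analysis. It estimates how often the sample variance comes out below half of the true variance, which should almost never happen for Gaussian data at this sample size.

```python
import math
import random


def sample_variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def frac_below_half(draw, true_var, n=100, reps=2000, seed=1):
    """Fraction of simulated samples whose variance is under half the true variance.

    `draw` is a hypothetical helper: it takes a random.Random instance and
    returns one observation from the distribution under study.
    """
    rng = random.Random(seed)
    low = 0
    for _ in range(reps):
        xs = [draw(rng) for _ in range(n)]
        if sample_variance(xs) < 0.5 * true_var:
            low += 1
    return low / reps


SIGMA = 1.5  # assumed lognormal shape parameter (illustrative)
# Variance of a lognormal with mu = 0: (exp(sigma^2) - 1) * exp(sigma^2)
lognormal_var = (math.exp(SIGMA**2) - 1.0) * math.exp(SIGMA**2)

frac_normal = frac_below_half(lambda r: r.gauss(0.0, 1.0), 1.0)
frac_lognormal = frac_below_half(
    lambda r: r.lognormvariate(0.0, SIGMA), lognormal_var
)

print(f"P(s^2 < half of true variance), normal:    {frac_normal:.3f}")
print(f"P(s^2 < half of true variance), lognormal: {frac_lognormal:.3f}")
```

If the lognormal fraction comes out large while the Gaussian one stays near zero, the denominator of the t-statistic is routinely too small for the skewed data, which is one way the usual approximation breaks down.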

To really use the CLT, one of two things must be true: (1) the sample standard deviation works as a measure of dispersion for the true unknown distribution, or (2) the true population standard deviation is known. That is very often not the case. An example of n = 20,000 being far too small for the CLT to “work” comes from drawing samples from the lognormal distribution, as discussed elsewhere on this site.
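One way to check whether the CLT is “working” for a given distribution is to simulate the coverage of the usual 95% confidence interval for the mean. The sketch below is a scaled-down illustration, not a reproduction of the n = 20,000 example mentioned above: it assumes n = 100, a lognormal shape parameter of 2, 2,000 replications, and a normal critical value (at n = 100 the t critical value is only slightly larger). All function and variable names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def ci_performance(draw, true_mean, n=100, reps=2000, level=0.95, seed=1):
    """Simulate mean +/- crit * s / sqrt(n) intervals.

    Returns (coverage, frac_below, frac_above), where frac_below is the share
    of intervals lying entirely below the true mean and frac_above the share
    lying entirely above it.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    hits = below = above = 0
    for _ in range(reps):
        xs = [draw(rng) for _ in range(n)]
        m = sum(xs) / n
        s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
        half_width = crit * s / math.sqrt(n)
        if m + half_width < true_mean:
            below += 1
        elif m - half_width > true_mean:
            above += 1
        else:
            hits += 1
    return hits / reps, below / reps, above / reps


SHAPE = 2.0  # assumed lognormal shape parameter (illustrative)
true_lognormal_mean = math.exp(SHAPE**2 / 2)  # mean of lognormal with mu = 0

cov_norm, below_norm, above_norm = ci_performance(
    lambda r: r.gauss(0.0, 1.0), 0.0
)
cov_ln, below_ln, above_ln = ci_performance(
    lambda r: r.lognormvariate(0.0, SHAPE), true_lognormal_mean
)

print(f"normal:    coverage={cov_norm:.3f} below={below_norm:.3f} above={above_norm:.3f}")
print(f"lognormal: coverage={cov_ln:.3f} below={below_ln:.3f} above={above_ln:.3f}")
```

Besides the overall coverage, it is worth looking at the two miss rates separately: for a skewed distribution the misses tend to pile up on one side, so even an acceptable-looking overall coverage can hide badly unbalanced tail errors.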

The sample standard deviation “works” as a dispersion measure if, for example, the distribution is symmetric and its tails are no heavier than those of the Gaussian distribution.

I do not want to rely on the CLT for any of my analyses.

Attribution
Source: Link, Question Author: Lisa, Answer Author: Frank Harrell
