I’ve been in a debate with my graduate-level statistics professor about “normal distributions”. I contend that to truly have a normal distribution, one must have mean = median = mode, all of the data must be contained under the bell curve, and the data must be perfectly symmetrical around the mean. Therefore, technically, there are virtually NO normal distributions in real studies, and we should call them something else, perhaps “near-normal”.
She says I’m too picky: if the skew and kurtosis are each less than 1.0, it is a normal distribution. She took points off on an exam over this. The dataset is the total number of falls per year in a random sample of 52 nursing homes, itself a random sample of a larger population. Any insight?
QUESTION: 3. Compute measures of skewness and kurtosis for this data. Include a histogram with a normal curve. Discuss your findings. Is the data normally distributed?
Statistics: Number of falls

    N (Valid)                  52
    N (Missing)                 0
    Mean                    11.23
    Median                  11.50
    Mode                        4 (a)
    Skewness                 .114
    Std. Error of Skewness   .330
    Kurtosis                -.961
    Std. Error of Kurtosis   .650

a. Multiple modes exist; the smallest value is shown.
The data is platykurtic and has only slight positive skewing, and it is NOT a normal distribution because the mean and median and mode are not equal and the data is not evenly distributed around the mean. In reality virtually no data is ever a perfect normal distribution, although we can discuss “approximately normal distributions” such as height, weight, temperature, or length of adult ring finger in large population groups.
You are correct that there is no perfectly normal distribution. But, we are not looking for perfection. We need to look at data in addition to the histogram and the measures of central tendency. What do the skewness and kurtosis statistics tell you about the distribution? Because they are both between the critical values of -1 and +1, this data is considered to be normally distributed.
A problem with your discussion with the professor is one of terminology: a misunderstanding is getting in the way of conveying a potentially useful idea. In different places, you each make errors.
So the first thing to address: it’s important to be pretty clear about what a distribution is.
A normal distribution is a specific mathematical object, which you could consider as a model for an infinite population of values. (No finite population can actually have a continuous distribution.)
Loosely, what this distribution does (once you specify the parameters) is define (via an algebraic expression) the proportion of the population values that lies within any given interval on the real line. Slightly less loosely, it defines the probability that a single value from that population will lie in any given interval.
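To make this concrete, here is a minimal sketch of how a normal distribution assigns probability to an interval. The parameter values are invented for illustration (the mean echoes the falls data, the standard deviation is simply made up); nothing here is an estimate of a real population value.

```python
from scipy.stats import norm

# Hypothetical parameters -- illustrative only, not estimated from any data
mu, sigma = 11.23, 6.0

# The model's entire content: the proportion of the population lying
# in any interval, here P(10 < X < 15), via the normal cdf
p = norm.cdf(15, mu, sigma) - norm.cdf(10, mu, sigma)
```

Every interval on the real line gets a proportion this way; nothing is said about the exact configuration of any one finite sample.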
An observed sample doesn’t really have a normal distribution; a sample might (potentially) be drawn from a normal distribution, if one were to exist. If you look at the empirical cdf of the sample, it’s discrete. If you bin it (as in a histogram) the sample has a “frequency distribution”, but those aren’t normal distributions. The distribution can tell us some things (in a probabilistic sense) about a random sample from the population, and a sample may also tell us some things about the population.
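A small illustration of the contrast (the sample here is simulated, purely for illustration): even a sample genuinely drawn from a normal population has a discrete, stair-step empirical cdf that only approximates the smooth population curve.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=52))   # a sample drawn FROM a standard normal

# The empirical cdf jumps by 1/n at each observation: it is a step
# function, discrete, even though the population cdf is a smooth curve
ecdf = np.arange(1, len(x) + 1) / len(x)

# Largest vertical gap between the sample's step function and the
# population cdf evaluated at the observations
gap = float(np.max(np.abs(ecdf - norm.cdf(x))))
```

The gap is never zero for a finite sample; it only shrinks (in probability) as n grows.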
A reasonable interpretation of a phrase like “normally distributed sample”* is “a random sample from a normally distributed population”.
*(I generally try to avoid saying it myself, for reasons that are hopefully made clear enough here; usually I manage to confine myself to the second kind of expression.)
Having defined terms (if still a little loosely), let us now look at the question in detail. I’ll be addressing specific pieces of the question.
normal distribution one must have mean=median=mode
This is certainly a condition on the normal probability distribution, though not a requirement on a sample drawn from a normal distribution; samples may be asymmetric, may have mean differ from median and so on. [We can, however, get an idea how far apart we might reasonably expect them to be if the sample really came from a normal population.]
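The bracketed point can be sketched by simulation (parameter values are invented for illustration; they are not estimates of anything):

```python
import numpy as np

# How far apart might mean and median sit in samples that really DO
# come from a normal population, at n = 52?
rng = np.random.default_rng(1)
n, reps = 52, 10_000
samples = rng.normal(loc=11.23, scale=6.0, size=(reps, n))

diff = samples.mean(axis=1) - np.median(samples, axis=1)
# 95% of genuinely-normal samples show a mean-median gap below this:
typical_gap = float(np.percentile(np.abs(diff), 95))
```

A nonzero gap of this general size is entirely unremarkable for normal samples, so observing one is no evidence against normality.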
all the data must be contained under the bell curve
I am not sure what “contained under” means in this sense.
and perfectly symmetrical around the mean.
No; you’re talking about the data here, and a sample from a (definitely symmetrical) normal population would not itself be perfectly symmetric.
Therefore, technically, there are virtually NO normal distributions in real studies,
I agree with your conclusion but the reasoning is not correct; it’s not a consequence of the fact that data are not perfectly symmetric (etc); it’s the fact that populations are themselves not perfectly normal.
if the skew/kurtosis are less than 1.0 it is a normal distribution
If she said this in just that way, she’s definitely wrong.
A sample skewness may be much closer to 0 than that (taking “less than” to mean in absolute magnitude rather than actual value), and the sample excess kurtosis may also be much closer to 0 than that; they might even, whether by chance or by construction, be almost exactly zero. Yet the distribution from which the sample was drawn can easily be distinctly non-normal.
We can go further — even if we were to magically know the population skewness and kurtosis were exactly that of a normal, it still wouldn’t of itself tell us the population was normal, nor even something close to normal.
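A classic counterexample (a standard textbook construction, not anything specific to this dataset) is a symmetric three-point distribution whose first four moments match the standard normal’s exactly:

```python
import numpy as np

# P(X = -sqrt(3)) = P(X = +sqrt(3)) = 1/6,  P(X = 0) = 2/3:
# a population that takes only three values, yet has the normal's
# mean (0), variance (1), skewness (0) and excess kurtosis (0)
vals = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
probs = np.array([1 / 6, 2 / 3, 1 / 6])

mean = float(np.sum(vals * probs))                    # 0
var = float(np.sum(vals ** 2 * probs))                # 1
skewness = float(np.sum(vals ** 3 * probs))           # 0 (third central moment)
excess_kurt = float(np.sum(vals ** 4 * probs)) - 3.0  # 0
```

Skewness and excess kurtosis both exactly 0, the same as the normal’s, yet the distribution is discrete and could hardly look less normal.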
The dataset is total number of falls/year in a random sampling of 52 nursing homes which is a random sample of a larger population.
The population distribution of counts is never normal: counts are discrete and non-negative, while normal distributions are continuous and supported on the entire real line.
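To see how such counts can still sail under a “|skewness| and |kurtosis| < 1” rule, here is a sketch that assumes, purely for illustration, a Poisson model for falls (an assumption of mine, not a claim about this dataset):

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Simulated counts from a Poisson with mean ~11, sample size 52
rng = np.random.default_rng(2)
counts = rng.poisson(lam=11.23, size=52)

s = float(skew(counts))
k = float(kurtosis(counts))   # excess kurtosis: 0 for a normal
# Both will typically land comfortably inside (-1, 1), yet the
# population is discrete and non-negative: certainly not normal
```

The professor’s rule would wave this sample through, even though we built it from a population that cannot be normal.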
But we’re really focused on the wrong issue here. Probability models are just that, models. Let us not confuse our models with the real thing.
The issue isn’t “are the data themselves normal?” (they can’t be), nor even “is the population from which the data were drawn normal?” (this is almost never going to be the case).
A more useful question to discuss is “how badly would my inference be impacted if I treated the population as normally distributed?”
It’s also a much harder question to answer well, and may require considerably more work than glancing at a few simple diagnostics.
The sample statistics you showed are not particularly inconsistent with normality (you could see statistics like that or “worse” not terribly rarely if you had random samples of that size from normal populations), but that doesn’t of itself mean that the actual population from which the sample was drawn is automatically “close enough” to normal for some particular purpose. It would be important to consider the purpose (what questions you’re answering), and the robustness of the methods employed for it, and even then we may still not be sure that it’s “good enough”; sometimes it may be better to simply not assume what we don’t have good reason to assume a priori (e.g. on the basis of experience with similar data sets).
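A rough way to quantify “not terribly rarely” is simulation (a sketch, not a formal test):

```python
import numpy as np
from scipy.stats import skew, kurtosis

# How unusual are skewness .114 and excess kurtosis -.961 for genuine
# normal samples of n = 52?
rng = np.random.default_rng(3)
reps, n = 5000, 52
x = rng.normal(size=(reps, n))

s = skew(x, axis=1)
k = kurtosis(x, axis=1)
# Fraction of truly-normal samples at least as extreme on either measure
frac = float(np.mean((np.abs(s) >= 0.114) | (k <= -0.961)))
```

The fraction comes out well above half: statistics like the ones reported are unremarkable for normal samples of this size, which is all “not inconsistent with normality” means.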
it is NOT a normal distribution
Data – even data drawn from a normal population – never have exactly the properties of the population; from those numbers alone you don’t have a good basis to conclude that the population is not normal here.
On the other hand neither do we have any reasonably solid basis to say that it’s “sufficiently close” to normal – we haven’t even considered the purpose of assuming normality, so we don’t know what distributional features it might be sensitive to.
For example, if I had two samples of a measurement that I knew was bounded, not heavily discrete (not mostly taking only a few distinct values), and reasonably near to symmetric, I might be relatively happy to use a two-sample t-test at some not-so-small sample size; it’s moderately robust to mild deviations from its assumptions (somewhat level-robust, though not so power-robust). But I would be considerably more cautious about so casually assuming normality when testing equality of spread, for example, because the best test under that assumption is quite sensitive to the assumption.
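That contrast in robustness can be sketched with a small simulation; this uses an exponential population as the stand-in for “non-normal” and is an illustration, not a full robustness study:

```python
import numpy as np
from scipy import stats

# Under a skewed non-normal population with the null true, how often
# does each normal-theory test reject at the nominal 5% level?
rng = np.random.default_rng(4)
reps, n = 2000, 52
rej_t = rej_f = 0
for _ in range(reps):
    a = rng.exponential(size=n)   # both groups from the SAME population
    b = rng.exponential(size=n)
    rej_t += stats.ttest_ind(a, b).pvalue < 0.05
    # Variance-ratio F test, the "best" normal-theory test of spread
    F = np.var(a, ddof=1) / np.var(b, ddof=1)
    p = 2 * min(stats.f.cdf(F, n - 1, n - 1), stats.f.sf(F, n - 1, n - 1))
    rej_f += p < 0.05
level_t, level_f = rej_t / reps, rej_f / reps
```

In runs like this the t-test’s rejection rate stays near the nominal 5%, while the variance-ratio test rejects far more often than its nominal level.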
Because they are both between the critical values of -1 and +1, this data is considered to be normally distributed.
If that’s really the criterion by which one decides to use a normal distributional model, then it will sometimes lead you into quite poor analyses.
The values of those statistics do give us some clues about the population from which the sample was drawn, but that’s not at all the same thing as suggesting that their values are in any way a ‘safe guide’ to choosing an analysis.
Now to address the underlying issue that remains even with a better-phrased version of the question you were given:
The whole process of looking at a sample in order to choose a model is fraught with problems: doing so alters the properties of any subsequent analysis chosen on the basis of what you saw! E.g., for a hypothesis test, your significance level, p-values and power are all no longer what you would choose or calculate them to be, because those calculations are predicated on the analysis not being chosen in response to the data.
See, for example, Gelman and Loken (2014), “The Statistical Crisis in Science”, American Scientist, 102(6), p. 460 (DOI: 10.1511/2014.111.460), which discusses the issues with such data-dependent analysis.
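The effect can be illustrated with a two-stage procedure of the kind people often use: pre-test normality, then pick the test accordingly. This is a sketch (the size of the distortion depends heavily on the population, the sample size and the pre-test chosen):

```python
import numpy as np
from scipy import stats

# Null true throughout: both groups come from the same (exponential)
# population. We pre-test normality, then choose t-test or Mann-Whitney.
rng = np.random.default_rng(5)
reps, n = 2000, 52
rej = 0
for _ in range(reps):
    a = rng.exponential(size=n)
    b = rng.exponential(size=n)
    pooled = np.concatenate([a, b])
    if stats.shapiro(pooled).pvalue > 0.05:   # "looks normal" -> t-test
        p = stats.ttest_ind(a, b).pvalue
    else:                                     # otherwise -> Mann-Whitney
        p = stats.mannwhitneyu(a, b).pvalue
    rej += p < 0.05
two_stage_level = rej / reps
```

Whatever number comes out, it is a property of the two-stage procedure as a whole, not of either test alone, and it is not the nominal level either test would have reported on its own.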