# How to understand degrees of freedom?

From Wikipedia, there are three interpretations of the degrees of freedom of a statistic:

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.

Estimates of statistical parameters can be based upon different amounts of information or data. The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom (df). In general, the degrees of freedom of an estimate of a parameter is equal to the number of independent scores that go into the estimate minus the number of parameters used as intermediate steps in the estimation of the parameter itself (which, in sample variance, is one, since the sample mean is the only intermediate step).

Mathematically, degrees of freedom is the dimension of the domain of a random vector, or essentially the number of ‘free’ components: how many components need to be known before the vector is fully determined.

The bold words are what I don’t quite understand. If possible, some mathematical formulations will help clarify the concept.

Also do the three interpretations agree with each other?

This is a subtle question. It takes a thoughtful person not to understand those quotations! Although they are suggestive, it turns out that none of them is exactly or generally correct. I haven’t the time (and there isn’t the space here) to give a full exposition, but I would like to share one approach and an insight that it suggests.

Where does the concept of degrees of freedom (DF) arise? The contexts in which it’s found in elementary treatments are:

• The Student t-test and its variants such as the Welch or Satterthwaite solutions to the Behrens-Fisher problem (where two populations have different variances).

• The Chi-squared distribution (defined as a sum of squares of independent standard Normals), which is implicated in the sampling distribution of the variance.

• The F-test (of ratios of estimated variances).

• The Chi-squared test, comprising its uses in (a) testing for independence in contingency tables and (b) testing for goodness of fit of distributional estimates.
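As a small illustration of the second context, here is a sketch in R. It relies on the standard result that, for $n$ iid Normal values with variance $\sigma^2$, the scaled sample variance $(n-1)s^2/\sigma^2$ has a $\chi^2(n-1)$ distribution. The sample size, mean, standard deviation, and seed below are arbitrary choices for illustration, not values from this answer.

```r
# Sketch: the scaled sample variance of Normal data versus chi-squared(n - 1).
set.seed(42)                              # arbitrary seed, for reproducibility only
n <- 5; sigma <- 2; n.sim <- 1e4          # illustrative values
x <- matrix(rnorm(n * n.sim, mean = 10, sd = sigma), nrow = n)

# One scaled variance per simulated sample (one sample per column of x).
scaled_var <- (n - 1) * apply(x, 2, var) / sigma^2

# A chi-squared(nu) variable has mean nu; compare a few quantiles as well.
probs <- c(0.25, 0.5, 0.75, 0.95)
rbind(simulated = quantile(scaled_var, probs),
      chisq     = qchisq(probs, df = n - 1))
c(simulated_mean = mean(scaled_var), chisq_mean = n - 1)
```

Here the count "$n$ values minus one estimated mean" does give the right parameter, which is part of why the heuristic is so appealing.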

In spirit, these tests run a gamut from being exact (the Student t-test and F-test for Normal variates) to being good approximations (the Student t-test and the Welch/Satterthwaite tests for not-too-badly-skewed data) to being based on asymptotic approximations (the Chi-squared test). An interesting aspect of some of these is the appearance of non-integral “degrees of freedom” (the Welch/Satterthwaite tests and, as we will see, the Chi-squared test). This is of especial interest because it is the first hint that DF is not any of the things claimed of it.
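To see a non-integral DF concretely, here is a sketch of the usual Welch-Satterthwaite approximation, $\nu \approx (v_1+v_2)^2 / \left(v_1^2/(n_1-1) + v_2^2/(n_2-1)\right)$ with $v_i = s_i^2/n_i$. The helper name `welch_df` and the simulated samples are hypothetical; the result is compared with the value reported by R's `t.test`, which performs the Welch test by default.

```r
# Sketch: Welch-Satterthwaite approximate degrees of freedom for two samples.
welch_df <- function(x, y) {
  vx <- var(x) / length(x)   # estimated variance of mean(x)
  vy <- var(y) / length(y)   # estimated variance of mean(y)
  (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
}

set.seed(17)                       # arbitrary seed, for reproducibility only
x <- rnorm(8,  mean = 0, sd = 1)   # hypothetical small sample
y <- rnorm(15, mean = 0, sd = 3)   # hypothetical sample with a larger spread

nu <- welch_df(x, y)
nu_builtin <- unname(t.test(x, y)$parameter)  # DF used by the Welch t-test
c(by_hand = nu, t.test = nu_builtin)
```

The value is generally not a whole number, so it cannot be a count of anything.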

We can dispose right away of some of the claims in the question. Because “final calculation of a statistic” is not well-defined (it apparently depends on what algorithm one uses for the calculation), it can be no more than a vague suggestion and is worth no further criticism. Similarly, neither “number of independent scores that go into the estimate” nor “the number of parameters used as intermediate steps” is well-defined.

“Independent pieces of information that go into [an] estimate” is difficult to deal with, because there are two different but intimately related senses of “independent” that can be relevant here. One is independence of random variables; the other is functional independence. As an example of the latter, suppose we collect morphometric measurements of subjects–say, for simplicity, the three side lengths $X$, $Y$, $Z$, surface areas $S=2(XY+YZ+ZX)$, and volumes $V=XYZ$ of a set of wooden blocks. The three side lengths can be considered independent random variables, but all five variables are dependent RVs. The five are also functionally dependent because the codomain (not the “domain”!) of the vector-valued random variable $(X,Y,Z,S,V)$ traces out a three-dimensional manifold in $\mathbb{R}^5$. (Thus, locally at any point $\omega\in\mathbb{R}^5$, there are two functions $f_\omega$ and $g_\omega$ for which $f_\omega(X(\psi),\ldots,V(\psi))=0$ and $g_\omega(X(\psi),\ldots,V(\psi))=0$ for points $\psi$ “near” $\omega$ and the derivatives of $f$ and $g$ evaluated at $\omega$ are linearly independent.) However–here’s the kicker–for many probability measures on the blocks, subsets of the variables such as $(X,S,V)$ are dependent as random variables but functionally independent.
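The two senses of independence can be checked numerically. The sketch below assumes, purely for illustration, side lengths that are independent and uniform on $(1, 2)$. It estimates correlations to show statistical dependence, and it uses the rank of the Jacobian with respect to $(x, y, z)$ at one arbitrary point to examine functional dependence.

```r
# Sketch: statistical versus functional (in)dependence for the wooden blocks.
set.seed(1)                         # arbitrary seed
n <- 1e4
# Assumed distribution (not from the text): independent sides, uniform on (1, 2).
X <- runif(n, 1, 2); Y <- runif(n, 1, 2); Z <- runif(n, 1, 2)
S <- 2 * (X*Y + Y*Z + Z*X)          # surface area
V <- X * Y * Z                      # volume

# Statistical dependence: X is clearly correlated with both S and V.
cor_XS <- cor(X, S)
cor_XV <- cor(X, V)

# Jacobian of (X, Y, Z, S, V) with respect to the three side lengths.
jacobian <- function(x, y, z) {
  rbind(X = c(1, 0, 0),
        Y = c(0, 1, 0),
        Z = c(0, 0, 1),
        S = 2 * c(y + z, x + z, x + y),
        V = c(y * z, x * z, x * y))
}
J <- jacobian(1.2, 1.5, 1.9)                  # an arbitrary point with y != z
rank_all <- qr(J)$rank                        # all five variables together
rank_XSV <- qr(J[c("X", "S", "V"), ])$rank    # the subset (X, S, V)
c(cor_XS = cor_XS, cor_XV = cor_XV, rank_all = rank_all, rank_XSV = rank_XSV)
```

The five variables have Jacobian rank three, matching the three-dimensional manifold. The subset $(X,S,V)$ also has rank three at this point (its Jacobian determinant is $2x^2(y-z)$, nonzero whenever $y \ne z$), so it is functionally independent there even though the correlations are far from zero.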

Having been alerted by these potential ambiguities, let’s hold up the Chi-squared goodness of fit test for examination, because (a) it’s simple, (b) it’s one of the common situations where people really do need to know about DF to get the p-value right and (c) it’s often used incorrectly. Here’s a brief synopsis of the least controversial application of this test:

• You have a collection of data values $(x_1, \ldots, x_n)$, considered as a sample of a population.

• You have estimated some parameters $\theta_1, \ldots, \theta_p$ of a distribution. For example, you estimated the mean $\theta_1$ and standard deviation $\theta_2 = \theta_p$ of a Normal distribution, hypothesizing that the population is normally distributed but not knowing (in advance of obtaining the data) what $\theta_1$ or $\theta_2$ might be.

• In advance, you created a set of $k$ “bins” for the data. (It may be problematic when the bins are determined by the data, even though this is often done.) Using these bins, the data are reduced to the set of counts within each bin. Anticipating what the true values of $(\theta)$ might be, you have arranged it so (hopefully) each bin will receive approximately the same count. (Equal-probability binning assures the chi-squared distribution really is a good approximation to the true distribution of the chi-squared statistic about to be described.)

• You have a lot of data–enough to assure that almost all bins ought to have counts of 5 or greater. (This, we hope, will enable the sampling distribution of the $\chi^2$ statistic to be approximated adequately by some $\chi^2$ distribution.)

Using the parameter estimates, you can compute the expected count in each bin. The Chi-squared statistic is the sum of the ratios

$$\frac{(\text{observed}-\text{expected})^2}{\text{expected}}.$$

This, many authorities tell us, should have (to a very close approximation) a Chi-squared distribution. But there’s a whole family of such distributions. They are differentiated by a parameter $\nu$ often referred to as the “degrees of freedom.” The standard reasoning about how to determine $\nu$ goes like this:

I have $k$ counts. That’s $k$ pieces of data. But there are (functional) relationships among them. To start with, I know in advance that the sum of the counts must equal $n$. That’s one relationship. I estimated two (or $p$, generally) parameters from the data. That’s two (or $p$) additional relationships, giving $p+1$ total relationships. Presuming they (the parameters) are all (functionally) independent, that leaves only $k-p-1$ (functionally) independent “degrees of freedom”: that’s the value to use for $\nu$.
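In code, this conventional recipe might look like the following sketch. The bin counts and the parameter estimates are made up for illustration, and the statistic is the usual Pearson sum of (observed − expected)²/expected over the bins.

```r
# Sketch of the conventional recipe: Pearson statistic, nu = k - p - 1, p-value.
observed <- c(3, 7, 6, 4)               # hypothetical counts, n = 20 in k = 4 bins
n <- sum(observed)
bins <- qnorm(seq(0, 1, 1/4))           # equal-probability bins for a standard Normal

m <- 0.1; s <- 0.9                      # made-up estimates of the mean and sd
expected <- n * diff(pnorm(bins, m, s)) # expected counts under Normal(m, s)

chisq_stat <- sum((observed - expected)^2 / expected)

k <- length(observed)                   # number of bins
p <- 2                                  # number of estimated parameters
nu <- k - p - 1                         # the conventional count: 4 - 2 - 1 = 1
p_value <- pchisq(chisq_stat, df = nu, lower.tail = FALSE)
c(chisq = chisq_stat, nu = nu, p_value = p_value)
```

Everything here is mechanical except the choice of `nu`, and that choice is exactly what is in question.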

The problem with this reasoning (which is the sort of calculation the quotations in the question are hinting at) is that it’s wrong except when some special additional conditions hold. Moreover, those conditions have nothing to do with independence (functional or statistical), with numbers of “components” of the data, with the numbers of parameters, nor with anything else referred to in the original question.

Let me show you with an example. (To make it as clear as possible, I’m using a small number of bins, but that’s not essential.) Let’s generate 20 independent and identically distributed (iid) standard Normal variates and estimate their mean and standard deviation with the usual formulas (mean = sum/count, etc.). To test goodness of fit, create four bins with cutpoints at the quartiles of a standard normal: -0.675, 0, +0.675, and use the bin counts to generate a Chi-squared statistic. Repeat as patience allows; I had time to do 10,000 repetitions.

The standard wisdom about DF says we have 4 bins and 1+2 = 3 constraints, implying the distribution of these 10,000 Chi-squared statistics should follow a Chi-squared distribution with 1 DF. Here’s the histogram:

The dark blue line graphs the PDF of a $\chi^2(1)$ distribution–the one we thought would work–while the dark red line graphs that of a $\chi^2(2)$ distribution (which would be a good guess if someone were to tell you that $\nu=1$ is incorrect). Neither fits the data.

You might expect the problem to be due to the small size of the data sets ($n$=20) or perhaps the small size of the number of bins. However, the problem persists even with very large datasets and larger numbers of bins: it is not merely a failure to reach an asymptotic approximation.

Things went wrong because I violated two requirements of the Chi-squared test:

1. You must use the Maximum Likelihood estimate of the parameters. (This requirement can, in practice, be slightly violated.)

2. You must base that estimate on the counts, not on the actual data! (This is crucial.)

The red histogram depicts the chi-squared statistics for 10,000 separate iterations, following these requirements. Sure enough, it visibly follows the $\chi^2(1)$ curve (with an acceptable amount of sampling error), as we had originally hoped.
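One way to follow both requirements is sketched below. It is not necessarily the code behind the figure: the helper names (`bin_probs`, `pearson`, `binned_mle`), the use of `optim` with its default Nelder-Mead method, the floor on the bin probabilities, and the reduced number of repetitions are all assumptions made for illustration. The parameters are fitted by maximizing the multinomial likelihood of the bin counts, and the resulting statistics are set beside those obtained from the raw-data estimates.

```r
# Sketch: chi-squared statistics using estimates from raw data versus from counts.
set.seed(2017)                      # arbitrary seed
n <- 20; n.sim <- 1000              # fewer repetitions than in the text, for speed
bins <- qnorm(seq(0, 1, 1/4))       # quartiles of a standard Normal
k <- length(bins) - 1

# Bin probabilities under Normal(m, s), floored to avoid division by zero.
bin_probs <- function(m, s) pmax(diff(pnorm(bins, m, s)), 1e-12)

# Pearson chi-squared statistic for given counts and parameter values.
pearson <- function(counts, m, s) {
  expected <- sum(counts) * bin_probs(m, s)
  sum((counts - expected)^2 / expected)
}

# Maximum likelihood estimate of (mean, sd) from the bin counts alone:
# minimize the negative multinomial log likelihood over (mean, log(sd)).
binned_mle <- function(counts) {
  negloglik <- function(theta) {
    -sum(counts * log(bin_probs(theta[1], exp(theta[2]))))
  }
  fit <- optim(c(0, 0), negloglik)  # start at mean 0, sd 1
  c(mean = fit$par[1], sd = exp(fit$par[2]))
}

stat_raw <- stat_binned <- numeric(n.sim)
for (i in seq_len(n.sim)) {
  x <- rnorm(n)
  counts <- tabulate(cut(x, bins, labels = FALSE), nbins = k)
  stat_raw[i] <- pearson(counts, mean(x), sd(x))             # raw-data estimates
  est <- binned_mle(counts)
  stat_binned[i] <- pearson(counts, est["mean"], est["sd"])  # count-based MLE
}

# A chi-squared(nu) variable has mean nu, so the averages give a rough check.
c(raw = mean(stat_raw), binned = mean(stat_binned))
```

Comparing `hist(stat_binned, freq=FALSE)` with `curve(dchisq(x, 1), add=TRUE)` gives a visual check of the same kind as the figure. Samples that leave an outer bin empty push the fitted parameters toward a boundary, which this sketch handles only crudely through the probability floor.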

The point of this comparison–which I hope you have seen coming–is that the correct DF to use for computing the p-values depends on many things other than dimensions of manifolds, counts of functional relationships, or the geometry of Normal variates. There is a subtle, delicate interaction between certain functional dependencies, as found in mathematical relationships among quantities, and distributions of the data, their statistics, and the estimators formed from them. Accordingly, it cannot be the case that DF is adequately explainable in terms of the geometry of multivariate normal distributions, or in terms of functional independence, or as counts of parameters, or anything else of this nature.

We are led to see, then, that “degrees of freedom” is merely a heuristic that suggests what the sampling distribution of a (t, Chi-squared, or F) statistic ought to be, but it is not dispositive. Belief that it is dispositive leads to egregious errors. (For instance, the top hit on Google when searching “chi squared goodness of fit” is a Web page from an Ivy League university that gets most of this completely wrong! In particular, a simulation based on its instructions shows that the chi-squared value it recommends as having 7 DF actually has 9 DF.)

With this more nuanced understanding, it’s worthwhile to re-read the Wikipedia article in question: in its details it gets things right, pointing out where the DF heuristic tends to work and where it is either an approximation or does not apply at all.

A good account of the phenomenon illustrated here (unexpectedly high DF in Chi-squared GOF tests) appears in Volume II of Kendall & Stuart, 5th edition. I am grateful for the opportunity afforded by this question to lead me back to this wonderful text, which is full of such useful analyses.

### Edit (Jan 2017)

Here is R code to produce the figure following “The standard wisdom about DF…”

#
# Simulate data, one iteration per column of x.
#
n <- 20
n.sim <- 1e4
bins <- qnorm(seq(0, 1, 1/4))
x <- matrix(rnorm(n*n.sim), nrow=n)
#
# Compute statistics.
#
m <- colMeans(x)
s <- apply(sweep(x, 2, m), 2, sd)
counts <- apply(matrix(as.numeric(cut(x, bins)), nrow=n), 2, tabulate, nbins=4)
expectations <- mapply(function(m,s) n*diff(pnorm(bins, m, s)), m, s)
chisquared <- colSums((counts - expectations)^2 / expectations)
#
# Plot histograms of means, variances, and chi-squared stats.  The first
# two confirm all is working as expected.
#
mfrow <- par("mfrow")
par(mfrow=c(1,3))
red <- "#a04040"  # Intended to show correct distributions
blue <- "#404090" # To show the putative chi-squared distribution
hist(m, freq=FALSE)