# Pearson’s residuals

A beginner’s question about the Pearson’s residual within the context of the chi-square test for goodness of fit:

As well as the test statistic, R’s chisq.test function reports the Pearson’s residual:

(obs - exp) / sqrt(exp)


I understand why looking at the raw difference between observed and expected values isn’t that informative, as a smaller sample will result in a smaller difference. However, I’d like to know more about the effect of the denominator: why divide by the root of the expected value? Is this a ‘standardized’ residual?

The standard statistical model underlying analysis of contingency tables is to assume that (unconditional on the total count) the cell counts are independent Poisson random variables. So if you have an $$n \times m$$ contingency table, the statistical model used as a basis for analysis takes each cell count to have unconditional distribution:

$$X_{i,j} \sim \text{Pois}(\mu_{i,j})$$

Once you impose a total cell count for the contingency table, or a row or column count, the resulting conditional distributions of the cell counts then become multinomial. In any case, for a Poisson distribution we have $$\mathbb{E}(X_{i,j}) = \mathbb{V}(X_{i,j}) = \mu_{i,j}$$, so the standardised cell count is:

$$\text{STD}(X_{i,j}) \equiv \frac{X_{i,j} - \mathbb{E}(X_{i,j})}{\sqrt{\mathbb{V}(X_{i,j})}} = \frac{X_{i,j} - \mu_{i,j}}{\sqrt{\mu_{i,j}}}$$

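So what you’re seeing in the formula you are enquiring about is the standardised cell count, under the assumption that the cell counts have an (unconditional) Poisson distribution, with the unknown mean $$\mu_{i,j}$$ replaced by the expected count under the null hypothesis. The square root of the expected value appears in the denominator because, for a Poisson count, the variance equals the mean, so the standard deviation is the root of the expected value.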
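To make the formula concrete, here is a minimal Python sketch of the Pearson residuals for a goodness-of-fit setting. The helper name `pearson_residuals`, the counts, and the hypothesised probabilities are all made up for illustration; this is not how R's chisq.test is implemented internally, only the same arithmetic written out.

```python
import math

def pearson_residuals(observed, probs):
    """Return (obs - exp) / sqrt(exp) for each cell.

    `observed` holds the cell counts and `probs` the hypothesised cell
    probabilities (assumed to sum to 1). Both are illustrative inputs.
    """
    total = sum(observed)
    # Expected count in each cell under the null hypothesis.
    expected = [total * p for p in probs]
    # Under the Poisson model the variance equals the mean, so
    # sqrt(expected) plays the role of the standard deviation.
    return [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]

observed = [30, 50, 20]      # hypothetical counts, total 100
probs = [0.25, 0.50, 0.25]   # hypothetical null probabilities
residuals = pearson_residuals(observed, probs)
print(residuals)
```

With these inputs the expected counts are 25, 50 and 25, so the first and last cells sit one (approximate) standard deviation above and below expectation, and the middle cell matches it exactly.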

From here it is common to test independence of the row and column variables in the data, and in this case you can use a test statistic that looks at the sum-of-squares of the above values (which is equivalent to the squared-norm of the vector of standardised values). The chi-squared test provides a p-value for this kind of test based on a large-sample approximation to the null distribution of the test statistic. It is usually applied in cases where none of the cell counts are too small.
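The sum-of-squares relationship can be checked numerically. The sketch below is an illustration under stated assumptions: the 2×2 table is invented, the helper `pearson_residuals_2way` is a hypothetical name, and the p-value line uses a shortcut that is only valid for one degree of freedom (a 2×2 table). It applies no continuity correction, so it would correspond to calling R's chisq.test with `correct = FALSE` rather than the default.

```python
import math

def pearson_residuals_2way(table):
    """Pearson residuals and expected counts under row/column independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [
        [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    return residuals, expected

table = [[20, 30], [30, 20]]   # hypothetical 2x2 counts
residuals, expected = pearson_residuals_2way(table)

# The chi-squared statistic is the sum of squared Pearson residuals.
statistic = sum(r ** 2 for row in residuals for r in row)

# Large-sample p-value; this erfc identity holds for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(statistic / 2))

# Smallest expected count, to judge whether the approximation is reasonable.
min_expected = min(e for row in expected for e in row)
print(statistic, p_value, min_expected)
```

For tables with more degrees of freedom the upper tail of the chi-squared distribution would have to come from a statistics library instead of the `erfc` shortcut. A common rule of thumb treats the approximation as reasonable when the expected counts are around 5 or more, though that threshold is a convention rather than a hard limit.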