Properties of logistic regressions

We’re working with some logistic regressions and we have realized that the average estimated probability always equals the proportion of ones in the sample; that is, the average of fitted values equals the average of the sample.

Can anybody explain me the reason or give me a reference where I can find this demonstration?

The behavior you are observing is the “typical” case in logistic regression, but is not always true. It also holds in much more generality (see below). It is the consequence of the confluence of three separate facts.

1. The choice of modeling the log-odds as a linear function of the predictors,
2. The use of maximum likelihood to obtain estimates of the coefficients in the logistic regression model, and
3. The inclusion of an intercept term in the model.

If any one of the above are not present, then the average estimated probabilities will not, in general, match the proportion of ones in the sample.

However, (nearly) all statistical software uses maximum-likelihood estimation for such models, so, in practice, items 1 and 2 are essentially always present, and item 3 is usually present, except in special cases.

Some details

In the typical logistic regression framework, we observe the outcome of independent binomial trials with probability $p_i$. Let $y_i$ be the observed responses. Then the total likelihood is

and so the log-likelihood is

Now, we have a vector of predictors $\newcommand{\x}{\mathbf x}\x_i$ for each observation and from Fact 1 above, the logistic regression model posits that

for some unknown vector of parameters $\beta$. Note: By rearranging this, we get that $p_i = 1/(1+e^{-\beta^T \x_i})$.

Using maximum likelihood to fit the model (Fact 2) yields a set of equations to solve from considering $\partial \ell / \partial \beta = 0$. Observe that

by using the assumed linear relationship between the log-odds and the predictors. This means, that the MLE satisfies

since MLEs are invariant under transformations, hence $\hat{p}_i = (1+\exp(-\hat{\beta}^T \x_i))^{-1}$ in this case.

Using Fact 3, if $\x_i$ has a component $j$ that is always 1 for every $i$, then $\sum_i y_i x_{ij} = \sum_i y_i = \sum_i \hat{p}_i$ and so the empirical proportion of positive responses matches the average of the fitted probabilities.

A simulation

The inclusion of an intercept is important. Here is an example in $R$ to demonstrate that the observed behavior may not occur when no intercept is present in the model.

x <- rnorm(100)
p <- 1/(1+exp(-3*x))
y <- runif(100) <= p
mean(y)
# Should be identical to mean(y)
mean( predict( glm(y~x, family="binomial"), type="response" ) )
# Won't be identical (usually) to mean(y)
mean( predict( glm(y~x+0, family="binomial"), type="response") )


General case: As alluded to above, the property that the mean response is equal to the average predicted mean holds in much greater generality for the class of generalized linear models fit by maximum likelihood, using the canonical link function, and including an intercept in the model.

References

Some good references for the associated theory are the following.

1. A. Agresti (2002), Categorical Data Analysis, 2nd ed., Wiley.
2. P. McCullagh and J. A. Nelder (1989), Generalized Linear Models, 2nd
ed., Chapman & Hall. (Text from original authors of the general methods.)