We’re working with some logistic regressions, and we have noticed that the average estimated probability always equals the proportion of ones in the sample; that is, the average of the fitted values equals the sample mean of the response.

Can anybody explain the reason to me, or point me to a reference where I can find a proof?

**Answer**

The behavior you are observing is the “typical” case in logistic regression, but it is *not* always true. It also holds in much more generality (see below). It is the consequence of the confluence of three separate facts:

- The choice of modeling the log-odds as a linear function of the predictors,
- The use of maximum likelihood to obtain estimates of the coefficients in the logistic regression model, and
- The inclusion of an intercept term in the model.

If any one of the above is not present, then the average estimated probabilities will not, in general, match the proportion of ones in the sample.

However, (nearly) *all* statistical software uses maximum-likelihood estimation for such models, so, in practice, items 1 and 2 are essentially always present, and item 3 is usually present, except in special cases.

**Some details**

In the typical logistic regression framework, we observe the outcomes of independent Bernoulli trials with success probabilities p_i. Let y_i be the observed responses. Then the total likelihood is

L = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i} = \prod_{i=1}^n \exp\big( y_i \log(p_i/(1-p_i)) + \log(1-p_i) \big) \>,

and so the log-likelihood is

\ell = \sum_{i=1}^n y_i \log(p_i/(1-p_i)) + \sum_{i=1}^n \log(1-p_i) \>.

Now, we have a vector of predictors \newcommand{\x}{\mathbf x}\x_i for each observation and from Fact 1 above, the logistic regression model posits that

\log \frac{p_i}{1-p_i} = \beta^T \x_i \>,

for some unknown vector of parameters \beta. **Note**: By rearranging this, we get that p_i = 1/(1+e^{-\beta^T \x_i}).

Using maximum likelihood to fit the model (Fact 2) yields a set of equations to solve from considering \partial \ell / \partial \beta = 0. Observe that

\frac{\partial \ell}{\partial \beta} = \sum_i y_i \x_i - \sum_i \frac{\x_i}{1+\exp(-\beta^T \x_i)} = \sum_i y_i \x_i - \sum_i p_i \x_i \>,

by using the assumed linear relationship between the log-odds and the predictors. This means that the MLE satisfies

\sum_i y_i \x_i = \sum_i \hat{p}_i \x_i \>,

because MLEs are invariant under transformations, so that \hat{p}_i = (1+\exp(-\hat{\beta}^T \x_i))^{-1} in this case.
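For completeness, the first equality in the gradient above can be checked directly; this short derivation just fills in the differentiation step under the same assumptions:

```latex
% Substituting \log(p_i/(1-p_i)) = \beta^T \x_i into the log-likelihood gives
\ell = \sum_i y_i \, \beta^T \x_i + \sum_i \log(1-p_i) \>,
% and since \log(1-p_i) = -\log\big(1+\exp(\beta^T \x_i)\big), we have
\frac{\partial}{\partial \beta} \log(1-p_i)
  = -\frac{\exp(\beta^T \x_i)}{1+\exp(\beta^T \x_i)} \, \x_i
  = -p_i \x_i \>,
% so differentiating term by term recovers
\frac{\partial \ell}{\partial \beta} = \sum_i y_i \x_i - \sum_i p_i \x_i \>.
```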

Using Fact 3, if \x_i has a component j that is always 1 for every i (an intercept), then \sum_i y_i = \sum_i y_i x_{ij} = \sum_i \hat{p}_i x_{ij} = \sum_i \hat{p}_i. Dividing both sides by n shows that the empirical proportion of positive responses matches the average of the fitted probabilities.
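As a quick numeric sanity check (written in Python rather than the answer's R; the simulated data and hand-rolled Newton solver here are illustrative assumptions, not part of the original answer), fitting a logistic regression with an intercept drives the score equation above to zero, and with it the two averages into agreement:

```python
# Hypothetical check: fit logistic regression by Newton-Raphson and verify
# that the score equation forces mean(y) == mean(p_hat) when an intercept
# column is included in the design matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # first column = intercept
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.uniform(size=n) < p_true).astype(float)

beta = np.zeros(2)
for _ in range(25):                            # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)                       # score: sum_i (y_i - p_i) x_i
    hess = X.T @ (X * (p * (1 - p))[:, None])  # Fisher information
    beta += np.linalg.solve(hess, grad)

p_hat = 1 / (1 + np.exp(-X @ beta))
print(abs(y.mean() - p_hat.mean()))            # ~ 0, up to solver tolerance
```

The intercept component of the score equation is exactly \sum_i (y_i - \hat{p}_i) = 0, which is why the two means coincide at the MLE.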

**A simulation**

The inclusion of an intercept is important. Here is an example in R to demonstrate that the observed behavior may not occur when no intercept is present in the model.

```
set.seed(42)                  # for reproducibility
x <- rnorm(100)
p <- 1/(1 + exp(-3*x))        # true probabilities (no intercept)
y <- runif(100) <= p          # simulated 0/1 responses
mean(y)                       # proportion of ones in the sample

# With an intercept: identical to mean(y)
mean( predict( glm(y ~ x, family="binomial"), type="response" ) )

# Without an intercept: (usually) not identical to mean(y)
mean( predict( glm(y ~ x + 0, family="binomial"), type="response" ) )
```

**General case**: As alluded to above, the property that the mean of the responses equals the mean of the fitted values holds in much greater generality for the class of *generalized linear models* fit by maximum likelihood, using the *canonical* link function, and including an intercept in the model.
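As an illustration of the general claim (a sketch under assumed simulated data, again in Python for consistency with the check above), the same cancellation occurs for a Poisson GLM with its canonical log link:

```python
# Hypothetical sketch: Poisson regression with the canonical log link, fit by
# Newton's method. With an intercept, the score equation again has the form
# sum_i (y_i - mu_i) x_i = 0, so mean(y) == mean(mu_hat) at the MLE.
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])       # intercept column included
y = rng.poisson(np.exp(0.3 + 0.7 * x)).astype(float)

beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)                  # canonical link: log(mu) = X beta
    grad = X.T @ (y - mu)                  # score has the same (y - mu) form
    hess = X.T @ (X * mu[:, None])         # Fisher information
    beta += np.linalg.solve(hess, grad)

mu_hat = np.exp(X @ beta)
print(abs(y.mean() - mu_hat.mean()))       # ~ 0, up to solver tolerance
```

The key design fact is that for canonical links the score contribution is always (y_i - \mu_i)\x_i, so the intercept component of the score equation equates the two sums, exactly as in the logistic case.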

**References**

Some good references for the associated theory are the following.

- A. Agresti (2002), *Categorical Data Analysis*, 2nd ed., Wiley.
- P. McCullagh and J. A. Nelder (1989), *Generalized Linear Models*, 2nd ed., Chapman & Hall. (Written by the original authors of the general methods.)

**Attribution**
*Source: Link, Question Author: Gabi Foix, Answer Author: cardinal*