I’m trying to get a handle on the concept of overdispersion in logistic regression. I’ve read that overdispersion is when the observed variance of a response variable is greater than would be expected from the binomial distribution.
But if a binomial variable can only have two values (1/0), how can it have a mean and variance?
I’m fine with calculating the mean and variance of successes from x number of Bernoulli trials. But I cannot wrap my head around the concept of a mean and variance of a variable that can only have two values.
Can anyone provide an intuitive overview of:
- The concept of a mean and variance in a variable that can only have two values
- The concept of overdispersion in a variable that can only have two values
A binomial random variable with $N$ trials and probability of success $p$ can take more than two values. It represents the number of successes in those $N$ trials, so it can take $N+1$ different values ($0,1,2,3,…,N$), with mean $Np$ and variance $Np(1-p)$. If the observed variance is greater than is to be expected under the binomial assumptions (perhaps there are excess zeros, for instance), that is a case of overdispersion.
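To make this concrete, here is a minimal simulation sketch. It compares plain binomial counts with counts whose success probability varies from one replicate to the next (a beta-binomial). The values `N = 20`, `p = 0.3`, and the Beta parameters `a`, `b` are made up for illustration and do not come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, reps = 20, 0.3, 100_000  # illustrative values, not from the question

# Plain binomial: every replicate shares the same success probability p.
plain = rng.binomial(N, p, size=reps)

# Beta-binomial: the success probability itself varies between replicates.
# a and b are hypothetical; they are chosen so the mean a / (a + b) is still 0.3.
a, b = 3.0, 7.0
p_i = rng.beta(a, b, size=reps)
overdispersed = rng.binomial(N, p_i)

binomial_variance = N * p * (1 - p)  # theoretical binomial variance, 4.2

print("binomial variance (theory):", binomial_variance)
print("plain binomial sample variance:", plain.var())
print("beta-binomial sample variance:", overdispersed.var())
```

Both sets of counts have roughly the same mean, but the second should show a variance well above $Np(1-p)$, which is what overdispersion means here.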
Overdispersion does not make sense for a Bernoulli random variable ($N = 1$), because its variance is completely determined by its mean.
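A short derivation shows why a 0/1 variable still has a mean and a variance, and why there is no room for extra variance. Here $Y$ denotes a single 0/1 outcome with $P(Y=1)=p$ (notation introduced for this sketch). Since $Y^2 = Y$ for a 0/1 variable:

$$
\begin{aligned}
E[Y] &= 1 \cdot p + 0 \cdot (1-p) = p, \\
\operatorname{Var}(Y) &= E[Y^2] - (E[Y])^2 = p - p^2 = p(1-p).
\end{aligned}
$$

The mean is just the proportion of ones, and once that is known the variance is fixed. Even if $p$ itself varies randomly with average $\pi$, a single draw is still marginally Bernoulli with $P(Y=1)=\pi$ and variance $\pi(1-\pi)$, so extra variability can only show up once outcomes are grouped.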
In the context of a logistic regression curve, you can consider a “small slice”, or grouping, through a narrow range of predictor values to be a realization of a binomial experiment (maybe we have 10 points in the slice with a certain number of successes and failures). Even though we do not truly have multiple trials at each predictor value, and we are looking at proportions instead of raw counts, we would still expect the observed proportion in each of these “slices” to be close to the curve. If the “slices” tend to fall farther from the curve than binomial sampling alone would explain, there is too much variability in the distribution. So by grouping the observations, you create realizations of binomial random variables rather than looking at the 0/1 data individually.
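The slicing idea can be sketched with simulated data. This is only an illustration under stated assumptions: the logistic curve (intercept 0.5, slope 1.0), the slice sizes, and the helper `dispersion` are all hypothetical, and the known curve is used in place of a fitted one to keep the example short. The helper averages the squared Pearson residuals of the slice counts; values near 1 are consistent with binomial variation, and values well above 1 suggest overdispersion.

```python
import numpy as np

rng = np.random.default_rng(1)
n_slices, n_per_slice = 200, 50  # hypothetical grouping

x = np.linspace(-2, 2, n_slices)  # one predictor value per slice
eta = 0.5 + 1.0 * x               # assumed linear predictor (logit scale)
curve = 1 / (1 + np.exp(-eta))    # expected proportion in each slice


def dispersion(slice_sd):
    """Mean squared Pearson residual of slice counts around the curve.

    slice_sd is the standard deviation of extra slice-level noise added
    on the logit scale; 0 gives purely binomial slices.
    """
    noisy_eta = eta + rng.normal(0.0, slice_sd, n_slices)
    p_slice = 1 / (1 + np.exp(-noisy_eta))
    successes = rng.binomial(n_per_slice, p_slice)
    expected = n_per_slice * curve
    binomial_var = n_per_slice * curve * (1 - curve)
    return np.mean((successes - expected) ** 2 / binomial_var)


d0 = dispersion(0.0)  # slices follow the curve exactly
d1 = dispersion(1.0)  # slices scatter around the curve
print("no extra slice noise:", d0)
print("with slice noise:", d1)
```

With a model fitted to real data, the sum of squared Pearson residuals would usually be divided by the residual degrees of freedom rather than the number of slices; nothing is estimated here, so a plain mean is used.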
The example below is from another question on this site. Let's say the blue line represents the expected proportion over the range of the predictor variable. The blue cells indicate observed instances (in this case, schools). This provides a graphical representation of how overdispersion may look. Note that there are flaws with interpreting the cells of the graph below, but it gives an idea of how overdispersion can manifest itself.