Prediction interval for a future proportion of successes under Binomial setting

Suppose I fit a Binomial regression and obtain the point estimates and variance-covariance matrix of the regression coefficients. That will allow me to get a CI for the expected proportion of successes in a future experiment, p, but I need a CI for the observed proportion. There have been a few related answers posted, including simulation (suppose I don’t want to do that) and a link to Krishnamoorthya et al (which does not quite answer my question).

My reasoning is as follows: if we use just the Binomial model, we are forced to assume that p is sampled from Normal distribution (with the corresponding Wald CI) and therefore it is impossible to get CI for the observed proportion in closed form. If we assume that p is sampled from beta distribution, then things are much easier because the count of successes will follow Beta-Binomial distribution. We will have to assume that there is no uncertainty in the estimated beta parameters, \alpha and \beta.

There are three questions:

1) A theoretical one: is it ok to use just the point estimates of beta parameters? I know that to construct a CI for future observation in multiple linear regression

Y = x’\beta + \epsilon, \epsilon \sim N(0, \sigma^2)

they do that w.r.t. error term variance, \sigma^2. I take it (correct me if I am wrong) that the justification is that in practice \sigma^2 is estimated with a far greater precision than the regression coefficients and we won’t gain much by trying to incorporate the uncertainty of \sigma^2. Is a similar justification applicable to the estimated beta parameters, \alpha and \beta?

2) What package is better (R: gamlss-bb, betareg, aod?; I also have access to SAS).

3) Given the estimated beta parameters, is there an (approximate) shortcut to get the quantiles (2.5%, 97.5%) for the count of future successes or, better yet, for the proportion of future successes under Beta-Binomial distribution.


I will address all 3 parts to the question.

There are two conflated issues, first is the method you use to fit a regression model in this case. The second is how to interval estimates from your estimates to predict a new estimate.

if your response variables are Binomially distributed you would typically use either a logistic regression or a probit regression (glm with normal cdf as a link function).

If you do a logistic regression, take the response to be the ratio of the observed counts divided by the the known upper bound i.e. y_i/n_i. Then take your predictors/covariates and put these into your R call to a glm function. The returned object has everything you need to do the rest of your calculations.

x<- rnorm(100, sd=2)
prob_true <- 1/(1+exp(-(1+5*x)))
counts <- rbinom(100, 50,prob_true)
print(d.AD <- data.frame(counts,x))
glm.D93 <- glm(counts/50 ~ x, family = binomial() )

For a linear regression model the formula for a prediction interval is:

\hat{y}_i \pm t_{n-p}s_y\sqrt{1+\frac{1}{n}+\frac{(x_i-\bar{x})^2}{(n-1)s^2_x}}

You can use the linear regression model as an approximation for the glm. To do this you would linear regression formula for the linear combination of predictors before you do the inverse link transformation to get the probabilities back on the 0-1 scale. The code to do this is baked into the predict.glm() R function. Here is some example code that will also make a nice plot. (EDIT: This code is for confidence interval, not for prediction interval)

y_hat <- predict(glm.D93, type="link",
t_np<- qt(.975, 100-2, ncp=0)

ub <- y_hat$fit + t_np * y_hat$
lb <- y_hat$fit - t_np * y_hat$

point <- y_hat$fit

p_hat <- glm.D93$family$linkinv(point)
p_hat_lb <- glm.D93$family$linkinv(lb)
p_hat_ub <- glm.D93$family$linkinv(ub)

points(x, p_hat_ub, col='red')
points(x, p_hat_lb, col='blue')

You can do the same thing for any glm, e.g. Poisson, inverse Gaussian, gamma, etc. In each case do the prediction interval on the scale of the linear combination of the predictors. After you get the two end points of the prediction interval you convert these end points via the inverse link. For each of the glms I mentioned the inverse link might be different than the logit case I wrote here. Hope this helps.

Source : Link , Question Author : James , Answer Author : Matúš Košík

Leave a Comment