# Bayesian estimation of $N$ of a binomial distribution

This question is a technical follow-up to a previous question.

I have trouble understanding and replicating the model presented in Raftery (1988), *Inference for the binomial $N$ parameter: a hierarchical Bayes approach*, in WinBUGS/OpenBUGS/JAGS. The problem is not only about code, though, so it should be on-topic here.

## Background

Let $x=(x_{1},\ldots,x_{n})$ be a set of success counts from a binomial distribution with unknown $N$ and $\theta$. Further, I assume that $N$ follows a Poisson distribution with parameter $\mu$ (as discussed in the paper). Then, each $x_{i}$ has a Poisson distribution with mean $\lambda = \mu \theta$. I want to specify the priors in terms of $\lambda$ and $\theta$.
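This Poisson-thinning identity (if $N\sim\mathrm{Poisson}(\mu)$ and $x\mid N\sim\mathrm{Binomial}(N,\theta)$, then marginally $x\sim\mathrm{Poisson}(\mu\theta)$) is easy to check by simulation. A minimal sketch, in Python with NumPy; this is my own illustration, not part of the model code in the post:

```python
import numpy as np

# If N ~ Poisson(mu) and x | N ~ Binomial(N, theta),
# then marginally x ~ Poisson(lambda) with lambda = mu * theta.
rng = np.random.default_rng(0)
mu, theta = 200.0, 0.3
lam = mu * theta  # 60.0

N = rng.poisson(mu, size=200_000)
x = rng.binomial(N, theta)

# For a Poisson distribution, the mean and variance are both lambda,
# so both sample moments of x should be close to 60.
print(x.mean(), x.var(), lam)
```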

Assuming that I don’t have any good prior knowledge about $N$ or $\theta$, I want to assign non-informative priors to both $\lambda$ and $\theta$. Say, my priors are $\lambda\sim \mathrm{Gamma}(0.001, 0.001)$ and $\theta\sim \mathrm{Uniform}(0, 1)$.

The author uses an improper prior $p(N,\theta)\propto N^{-1}$, but WinBUGS does not accept improper priors.

## Example

In the paper (page 226), the following success counts of observed waterbucks are provided: $53, 57, 66, 67, 72$. I want to estimate $N$, the size of the population.

Here is how I tried to work out the example in WinBUGS (updated after @Stéphane Laurent’s comment):

    model {
      # Likelihood
      for (i in 1:N) {
        x[i] ~ dbin(theta, n)
      }

      # Priors
      n ~ dpois(mu)
      lambda ~ dgamma(0.001, 0.001)
      theta ~ dunif(0, 1)
      mu <- lambda/theta
    }

    # Data
    list(x = c(53, 57, 66, 67, 72), N = 5)

    # Initial values
    list(n = 100, lambda = 100, theta = 0.5)
    list(n = 1000, lambda = 1000, theta = 0.8)
    list(n = 5000, lambda = 10, theta = 0.2)


The model still does not converge nicely after 500,000 samples with 20,000 burn-in samples. Here is the output of a JAGS run:

    Inference for Bugs model at "jags_model_binomial.txt", fit using jags,
     5 chains, each with 5e+05 iterations (first 20000 discarded), n.thin = 5
     n.sims = 480000 iterations saved
              mu.vect  sd.vect   2.5%     25%     50%     75%    97.5%  Rhat  n.eff
    lambda     63.081    5.222 53.135  59.609  62.938  66.385   73.856 1.001 480000
    mu        542.917 1040.975 91.322 147.231 231.805 462.539 3484.324 1.018    300
    n         542.906 1040.762 95.000 147.000 231.000 462.000 3484.000 1.018    300
    theta       0.292    0.185  0.018   0.136   0.272   0.428    0.668 1.018    300
    deviance   34.907    1.554 33.633  33.859  34.354  35.376   39.213 1.001  43000


## Questions

Clearly, I am missing something, but I can’t see what exactly. I think my formulation of the model is wrong somewhere. So my questions are:

- Why do my model and its implementation not work?
- How could the model given by Raftery (1988) be formulated and implemented correctly?

---

Well, since you got your code to work, it looks like this answer is a bit too late. But I’ve already written the code, so…

For what it’s worth, here is the same* model fit with rstan. It runs in 11 seconds on my consumer laptop and achieves a higher effective sample size for the parameters of interest $(N, \theta)$ in fewer iterations.

    raftery.model <- "
      data{
        int     I;
        int     y[I];
      }
      parameters{
        real<lower=max(y)>  N;
        simplex[2]          theta;
      }
      model{
        vector[I]   Pr_y;

        for(i in 1:I){
          Pr_y[i] <- binomial_coefficient_log(N, y[i])
                     + multiply_log(y[i],       theta[1])
                     + multiply_log((N - y[i]), theta[2]);
        }
        increment_log_prob(sum(Pr_y));
        increment_log_prob(-log(N));
      }
    "
    raft.data <- list(y = c(53, 57, 66, 67, 72), I = 5)
    system.time(fit.test <- stan(model_code = raftery.model, data = raft.data, iter = 10))
    system.time(fit      <- stan(fit = fit.test, data = raft.data, iter = 10000, chains = 5))


Note that I cast theta as a 2-simplex. This is just for numerical stability. The quantity of interest is theta[1]; obviously theta[2] is superfluous information.

*As you can see, the posterior summary is virtually identical, and promoting $N$ to a real quantity does not appear to have a substantive impact on our inferences.

The 97.5% quantile for $N$ is 50% larger for my model, but I think that’s because Stan’s sampler is better at exploring the full range of the posterior than a simple random walk, so it can more easily make it into the tails. I may be wrong, though.

                 mean se_mean       sd   2.5%    25%    50%    75%   97.5% n_eff Rhat
    N         1078.75  256.72 15159.79  94.44 148.28 230.61 461.63 4575.49  3487    1
    theta[1]     0.29    0.00     0.19   0.01   0.14   0.27   0.42    0.67  2519    1
    theta[2]     0.71    0.00     0.19   0.33   0.58   0.73   0.86    0.99  2519    1
    lp__       -19.88    0.02     1.11 -22.89 -20.31 -19.54 -19.09  -18.82  3339    1
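Promoting $N$ to a real parameter works because the binomial coefficient extends to non-integer $N$ through the gamma function, $\log\binom{N}{y} = \log\Gamma(N+1) - \log\Gamma(y+1) - \log\Gamma(N-y+1)$, which is how Stan can evaluate `binomial_coefficient_log` at real arguments. A quick sanity check in Python with SciPy; the helper `log_binom_coef` is my own illustration, not Stan's implementation:

```python
from math import comb, log
from scipy.special import gammaln

def log_binom_coef(n, k):
    """log C(n, k) for real-valued n >= k, via the gamma function."""
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

# Matches the integer binomial coefficient at integer n ...
print(log_binom_coef(100, 53), log(comb(100, 53)))
# ... and interpolates smoothly for non-integer n, which lets HMC
# treat N as a continuous parameter.
print(log_binom_coef(100.5, 53))
```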


Taking the draws of $N$ and $\theta$ generated by stan, I use them to draw posterior predictive values $\tilde{y}$. We should not be surprised that the mean of the posterior predictions $\tilde{y}$ is very near the mean of the sample data!

    N.samples     <- round(extract(fit, "N")[[1]])
    theta.samples <- extract(fit, "theta")[[1]]
    y_pred        <- rbinom(50000, size = N.samples, prob = theta.samples[, 1])
    summary(y_pred)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      32.00   58.00   63.00   63.04   68.00  102.00


To check whether the rstan sampler is a problem or not, I computed the posterior over a grid. We can see that the posterior is banana-shaped; this kind of posterior can be problematic for Euclidean-metric HMC. But let’s check the numerical results. (The severity of the banana shape is actually suppressed here since $N$ is on the log scale.) If you think about the banana shape for a minute, you’ll realize that it must lie on the line $\bar{y} = \theta N$.
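That ridge is easy to verify numerically: for fixed $N$, the binomial likelihood is maximized at $\hat{\theta} = \bar{y}/N$, so the high-probability band tracks $\theta N = \bar{y}$. A small check, in Python with NumPy/SciPy; this sketch is my own addition, separate from the R grid code in the answer:

```python
import numpy as np
from scipy.stats import binom

y = np.array([53, 57, 66, 67, 72])
ybar = y.mean()  # 63.0

# For a few fixed values of N, find the theta that maximizes the
# binomial log-likelihood on a fine grid; it should sit at ybar / N,
# i.e. on the ridge theta * N = ybar.
thetas = np.linspace(1e-6, 1 - 1e-6, 100_001)
results = {}
for N in (100, 250, 1000):
    loglik = binom.logpmf(y[:, None], N, thetas[None, :]).sum(axis=0)
    results[N] = thetas[loglik.argmax()]
    print(N, results[N], ybar / N)
```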

The code below may confirm that our results from stan make sense.

    theta <- seq(0 + 1e-10, 1 - 1e-10, len = 1e2)
    N     <- round(seq(72, 5e5, len = 1e5)); N[2] - N[1]
    grid  <- expand.grid(N, theta)
    y     <- c(53, 57, 66, 67, 72)

    raftery.prob <- function(x, z = y){
      N     <- x[1]
      theta <- x[2]
      exp(sum(dbinom(z, size = N, prob = theta, log = TRUE))) / N
    }

    post <- matrix(apply(grid, 1, raftery.prob),
                   nrow = length(N), ncol = length(theta), byrow = FALSE)
    approx(y = N, x = cumsum(rowSums(post)) / sum(rowSums(post)), xout = 0.975)
    $x
    [1] 0.975

    $y
    [1] 3236.665


Hm. This is not quite what I would have expected. Grid evaluation for the 97.5% quantile is closer to the JAGS results than to the rstan results. At the same time, I don’t believe the grid results should be taken as gospel, because grid evaluation makes several rather coarse simplifications: the grid resolution isn’t very fine on the one hand, while on the other, we are (falsely) asserting that the total probability in the grid must be 1, since we must draw a boundary (and finite grid points) for the problem to be computable (I’m still waiting on infinite RAM). In truth, our model has positive probability over $(0,1)\times\left\{N \mid N\in\mathbb{Z} \land N\ge 72\right\}$. But perhaps something more subtle is at play here.