# 95% confidence intervals on prediction of censored binomial model estimated using mle2 / maximum-likelihood

I am working on a problem in which I have multiple pairs of currently living males i that each have a presumed paternal ancestor ni generations ago (based on genealogical evidence) and where I have info on whether there is a mismatch in their Y chromosomal genotype (exclusively paternally inherited, xi = 1 for mismatch, 0 if there is a match). If there is no mismatch, they indeed have a common paternal ancestor, but if there is there must have been a kink in the chain as a result of one or more extra-marital affairs (I can only detect though if either none or at least one such extra-pair paternity event happened, ie the dependent variable is censored). What I am interested in is obtaining a maximum likelihood estimate (plus 95% confidence limits) not just of the mean extra-pair paternity (EPP) rate (probability that per generation a child would be derived from an extra-pair copulation), but also to try to infer how the extra-pair paternity rate may have changed as a function of time (as the nr of generations that separated the common paternal ancestor should have some info on this – when there is a mismatch I don’t know though when the EPPs would have happened, as it could have been anywhere between the generation of that presumed ancestor and the present, but when there is a match we are sure there were no EPPs in any of the preceding generations). Hence, both my dependent binomial variable and independent covariate generation/time are censored. Based on a somewhat similar problem posted here I already figured out how I could make a maximum-likelihood estimate of the population and time-average extra-pair paternity rate phat plus 95% profile likelihood confidence intervals in R as follows :

# Function to make overall ML estimate of EPP rate p plus 95% profile likelihood confidence intervals,
# taking into account that for pairs with mismatches multiple EPP events could have occured
#
# input is
#     x=vector of booleans or 0 and 1s specifying if there was a mismatch or not (1 or TRUE = mismatch)
#     n=vector with nr of generations that separate common ancestor
# output is mle2 maximum likelihood fit with best-fit param EPP rate p

estimateP = function(x, n) {
if (!is.logical(x[[1]])) x = (x==1)
neglogL = function(p, x, n)  -sum((log(1 - (1-p)^n))[x]) -sum((n*log(1-p))[!x]) # negative log likelihood, see see https://stats.stackexchange.com/questions/152111/censored-binomial-model-log-likelihood
require(bbmle)
fit = mle2(neglogL, start=list(p=0.01), data=list(x=x, n=n))
return(fit)
}

Example with some pilot data (from Larmuseau et al. ProcB 2010):

n = c(7, 7, 7, 7, 7, 8, 9, 9, 9, 10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 13, 13, 13, 13, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 18, 18, 19, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 23, 24, 24, 25, 27, 31) # number of generations/meioses that separate presumed common paternal ancestor
x = c(rep(0,6), 1, rep(0,7), 1, 1, 1, 0, 1, rep(0,20), 1, rep(0,13), 1, 1, rep(0,5)) # whether pair of individuals had non-matching Y chromosomal genotypes

Maximum-likelihood estimate of population and time-average extra-pair paternity rate plus 95% confidence limits :

fit = estimateP(x,n)
c(coef(fit),confint(fit))*100 # estimated p and profile likelihood confidence intervals
#           p     2.5 %    97.5 %
#   0.9415172 0.4306652 1.7458847

i.e. 0.9% [0.43-1.75% 95% C.L.s] of all kids were derived from a father that was different than the supposed one.

I then wanted to go a step further, and also try to estimate a possible temporal trend in the extra-pair paternity rate p as a function of generation ni (for simplicity assuming a linear relationship between the log odds of observing an extra-pair paternity event and generation), taking into account that if a mismatch occurs the EPP events could have taken place anywhere between the generation of the common ancestor ni and the present (generation 0), and that if there was no mismatch that no EPP event could have taken place in any of the previous generations for that particular pair of individuals.

If before we assumed the probability of a child being derived from an extra-pair copulation $p$ to be constant, and if $X$ was a random variable equal to $1$ when a Y chromosome mismatch was observed (corresponding to 1 or more EPP events) and $0$ otherwise, then the probability of observing no mismatch (that is, $X=0$) when the paternal ancestor lived $n$ generations ago ($n = 1, 2, 3, \ldots$) was $(1-p)^n$, whereas the chance of observing an EPP event was

$$\Pr(X=1\,|\, n) = 1 – (1-p)^n.$$

In a dataset of independent observations $\mathbf{x} = (x_1, x_2, \ldots)$ of $X$ with paternal ancestors living $\mathbf{n} = (n_1, n_2, \ldots)$ generations ago, the likelihood therefore was

$$L(p; \mathbf{x}, \mathbf{n}) = \prod_{x_i=1} \left(1 – (1-p)^{n_i}\right)\prod_{x_j=0} (1-p)^{n_j},$$

resulting in a log likelihood of

$$\Lambda(p) = \sum_{x_i=1} \log\left(1 – (1-p)^{n_i}\right) + \sum_{x_j=0} {n_j} \log (1-p).$$

Taking into account that in my more complex model incorporating temporal dynamics I want $p$ to be a function of $n$ now, with
$$log (p/(1-p)) = a + b.n$$
, i.e.
$$p(a,b,n) = \exp(a+b.n) / (1+\exp(a+b.n))$$

I then changed the definition of the likelihood function above accordingly and maximized it using function mle2 from package bbmle :

# ML estimation, assuming that EPP rate p shows a temporal trend
# where log(p/(1-p))=a+b*n
# input is
#     x=vector of booleans or 0 and 1s specifying if there was a mismatch or not (1 or TRUE = mismatch)
#     n=vector with nr of generations that separate common ancestor
# output is mle2 maximum likelihood fit with best-fit params a and b

estimatePtemp = function(x, n) {
if (!is.logical(x[[1]])) x = (x==1)
pfun = function(a, b, n) exp(a+b*n)/(1+exp(a+b*n)) # we now write p as a function of a, b and n
logL = function(a, b, x, n)  sum((log(1 - (1-pfun(a, b, n))^n))[x]) +
sum((n*log(1-pfun(a, b, n)))[!x]) # a and b are params to be estimated, modified from https://stats.stackexchange.com/questions/152111/censored-binomial-model-log-likelihood
neglogL = function(a, b, x, n)  -logL(a, b, x, n) # negative log-likelihood
require(bbmle)
fit = mle2(neglogL, start=list(a=-3, b=-0.1), data=list(x=x, n=n))
return(fit)
}

# fitted coefficients
estfit = estimatePtemp(x, n)
cbind(coef(estfit),confint(estfit)) # parameter estimates and profile likelihood confidence intervals
#                      2.5 %      97.5 %
#   a -3.09054167 -5.3191406 -1.12078519
#   b -0.09870851 -0.2396262  0.02848305
summary(estfit)
# Coefficients:
#      Estimate Std. Error z value    Pr(z)
#   a -3.090542   1.057382 -2.9228 0.003469 **
#   b -0.098709   0.067361 -1.4654 0.142819

This gives me a reasonable looking historical estimate of the evolution of the EPP rate $p$ over time :

pfun = function(a, b, n) exp(a+b*n)/(1+exp(a+b*n))
xvals=1:max(n)
par(mfrow=c(1,1))
plot(xvals,sapply(xvals,function (n) pfun(coef(estfit)[1], coef(estfit)[2], n)),
type="l", xlab="Generations ago (n)", ylab="EPP rate (p)")

However, I am still a little stuck on how to calculate the 95% confidence intervals on the overall prediction of this model. Would anybody know how to do that by any chance? Maybe using population prediction intervals (by resampling parameters according to the fit following a multivariate normal distribution) (or would the delta method also work?)? And could somebody comment on whether my logic above is correct? I was also wondering if this kind of censored binomial model is known under some standard name in the literature, and if anyone knows of any published work on doing these kind of ML calculations under this kind of model? (I have a feeling that the problem should be fairly standard and correspond to something that’s been done already, but can’t seem to find anything…)

[PS Papers with more background on this topic/problem are available here and here]

Based on Ben Bolker’s chapter and comment above, I in the end figured that the 95% population prediction intervals are given by

# predicted EPP rate p as a function of n (nr of generations ago)
# plus 95% population prediction intervals (cf chapter B. Bolker)
pfun = function(a, b, n) exp(a+b*n)/(1+exp(a+b*n)) # model prediction
xvals=1:max(n)
set.seed(1001)
library(MASS)
nresamp=100000
pars.picked = mvrnorm(nresamp, mu = coef(estfit), Sigma = vcov(estfit)) # pick new parameter values by sampling from multivariate normal distribution based on fit
yvals = matrix(0, nrow = nresamp, ncol = length(xvals))
for (i in 1:nresamp) {
yvals[i,] = sapply(xvals,function (n) pfun(pars.picked[i,1], pars.picked[i,2], n))
}
quant = function(col) quantile(col, c(0.025,0.975)) # 95% percentiles
conflims = apply(yvals,2,quant) # 95% confidence intervals

par(mfrow=c(1,1))
plot(xvals, sapply(xvals,function (n) pfun(coef(estfit)[1], coef(estfit)[2], n)),
type="l", xlab="Generations ago (n)", ylab="EPP rate (p)", ylim=c(0,0.05))
lines(xvals, conflims[1,], col="red" , lty=2)
lines(xvals, conflims[2,], col="red" , lty=2)