# Discussing binomial regression and modeling strategies

Today I have a question about binomial/logistic regression. It is based on an analysis that a group in my department has done and was seeking comments on. I made up the example below to protect their anonymity, but they are keen to see the responses.

Firstly, the analysis began with a simple 1/0 binomial response (e.g. survival from one breeding season to the next), and the goal was to model this response as a function of some covariates.

However, multiple measurements of some covariates were available for some individuals but not for others. For example, imagine variable x is a measure of metabolic rate during labour, and individuals vary in the number of offspring they have (e.g. variable x was measured three times for individual A, but only once for individual B). This imbalance is not due to the researchers' sampling strategy per se, but reflects the characteristics of the population they were sampling from: some individuals have more offspring than others.

I should also point out that measuring the binomial 0/1 response between labour events was not possible, because the interval between these events was quite short. Again, imagine the species in question has a short breeding season but can give birth to more than one offspring during the season.

The researchers chose to run a model in which they used the mean of variable x as one covariate and the number of offspring an individual gave birth to as another covariate.
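For concreteness, here is a minimal Python sketch of the design the researchers fitted: one row per individual, with the mean of x and the offspring count as the two covariates. The data and individual labels are entirely made up for illustration.

```python
import numpy as np

# Hypothetical ragged data: each individual's per-birth measurements of x
# (metabolic rate during labour), plus a 0/1 survival response per individual.
x_per_birth = {
    "A": [2.1, 2.4, 2.0],  # three births
    "B": [3.0],            # one birth
    "C": [1.8, 2.2],       # two births
}
survived = {"A": 1, "B": 0, "C": 1}

# The researchers' design: one row per individual, with the mean of x and
# the number of offspring as covariates. This design matrix would then be
# fed to any standard logistic-regression routine.
ids = sorted(x_per_birth)
X = np.array([[np.mean(x_per_birth[i]), len(x_per_birth[i])] for i in ids])
y = np.array([survived[i] for i in ids])

print(X)  # column 0: mean of x; column 1: number of offspring
```

Note how the ragged per-birth structure is collapsed to one row per individual before fitting; this is exactly where the information loss discussed below occurs.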

Now, I was not keen on this approach, for a number of reasons:

1) Taking the average of x means losing information about the within-individual variability of x.

2) The mean is itself a statistic, so by putting it in the model we end up doing statistics on statistics.

3) The number of offspring an individual had is in the model, but it is also used to calculate the mean of variable x, which I think could cause trouble.

So, my question is how would people go about modeling this type of data?

At the moment, I would probably run separate models for individuals that had one offspring, then for individuals that had two offspring, and so on. Also, I would not use the mean of variable x and would instead use the raw data for each birth, but I am not convinced this is much better either.

(PS: I apologize that it's quite a long question, and I hope that the example is clear.)

It does sound like you are in a bit of a quandary, because you only have one response measurement per individual. I was initially going to recommend a multi-level approach, but for that to work you need to observe the response at the lowest level, which you do not: you observe your response at the individual level (which would be level 2 in a multi-level model).

> 1) Taking the average of x means losing information about the within-individual variability of x.

You are losing variability of the covariate x, but this only matters if the extra information contained in x is related to the response. There is also nothing stopping you from putting the variance of x in as a covariate.
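As a sketch of that suggestion (again with made-up data), the within-individual variance can simply be added as a second column of the design matrix. One wrinkle worth noticing: an individual measured only once has no estimable variance, so you must decide how to handle that case explicitly.

```python
import numpy as np

# Hypothetical per-birth measurements of x for three individuals.
x_per_birth = [[2.1, 2.4, 2.0], [3.0], [1.8, 2.2]]

means = np.array([np.mean(v) for v in x_per_birth])
# Within-individual variance as an extra covariate. An individual with a
# single measurement has no estimable variance: ddof=0 yields 0.0 there,
# whereas ddof=1 would yield NaN -- a missing-data problem in itself.
variances = np.array([np.var(v, ddof=0) for v in x_per_birth])

X = np.column_stack([means, variances])
```

Whether 0.0 is a sensible stand-in for "no variance information" is a modelling judgement, not a technical one.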

> 2) The mean is itself a statistic, so by putting it in the model we end up doing statistics on statistics.

A statistic is a function of the observed data, so any covariate is a "statistic", and you are already doing "statistics on statistics" whether you like it or not. However, it does make a difference to how you should interpret the slope coefficient: as the effect of the average value, not of the value at an individual birth. If you don't care about the individual births, this matters little. If you do, then this approach can be misleading.

> 3) The number of offspring an individual had is in the model, but it is also used to calculate the mean of variable x, which I think could cause trouble.

It would only matter if the mean of x were functionally/deterministically related to the number of offspring. One way this can happen is if the value of x is the same for every individual who had the same number of births. Usually this isn't the case.

You could specify a model which includes each value of x as a covariate, but this would probably involve some new methodological research on your part. Your likelihood function would differ between individuals, because of the differing number of measurements per individual. I don't think multi-level modelling applies here conceptually, simply because the births are not a subset or sample within individuals, although the maths may be the same.

One way you could incorporate this structure is to create a model like:

$$Y_{ij} \sim \text{Binomial}(n_{ij},\, p_{ij})$$

where $Y_{ij}$ is the binomial response for individual $i$, $j$ denotes the number of births, $x_{ij}$ is the vector of covariates, and $n_{ij}$ is the number of individuals who share the same covariate values and also had the same number of births. $p_{ij}$ is the probability, which you would normally model as:

$$g(p_{ij}) = x_{ij}^{T}\beta_{j}$$

for some monotonic, invertible link function $g(\cdot)$. The "tricky" part comes in because the dimension of $x_{ij}$ varies with $j$. The log-likelihood in this case is:

$$\ell(\beta) = \sum_{j \in B} \sum_{i} \left[\, y_{ij}\log p_{ij} + (n_{ij} - y_{ij})\log(1 - p_{ij}) \,\right]$$

where $B$ is simply the set of numbers of births available in your data set. Maximising this is likely to be nontrivial, and you probably won't get the usual IRLS equations from a Taylor series expansion about the current estimate. A Taylor series is still the way I would go from here; I just don't have the energy to run through the process at this time. I would also suggest trying to re-arrange the problem so that it looks like an "ordinary" binomial GLM, which would let you take advantage of the standard software available.
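To make the "different likelihood for different individuals" point concrete, here is a minimal Python sketch that maximises this kind of likelihood numerically with scipy rather than via IRLS. Everything here is an assumption for illustration: the data are made up, it uses the Bernoulli special case ($n_{ij} = 1$, one 0/1 response per individual) with a logit link, and a small ridge penalty is added purely to keep the tiny toy data set from separating.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # inverse logit, i.e. g^{-1} for the logit link

# Made-up data: one 0/1 response per individual, and a ragged list of
# per-birth measurements of x (the length varies across individuals).
x_per_birth = [np.array([2.1, 2.4, 2.0]),
               np.array([3.0]),
               np.array([1.8, 2.2]),
               np.array([2.5, 2.9]),
               np.array([1.9])]
y = np.array([1, 0, 1, 1, 0])

J = max(len(v) for v in x_per_birth)  # largest number of births observed
ridge = 1.0  # stabilising penalty for this toy example only

def neg_penalised_loglik(beta):
    """beta[0] is an intercept; beta[k] is the slope for the k-th birth.
    An individual with j births only touches beta[0:1+j], so individuals
    with fewer births contribute nothing to the later slopes -- the
    'terms dropping out' behaviour described in the text."""
    ll = 0.0
    for xi, yi in zip(x_per_birth, y):
        j = len(xi)
        p = expit(beta[0] + xi @ beta[1:1 + j])
        ll += yi * np.log(p) + (1 - yi) * np.log(1 - p)
    return -ll + 0.5 * ridge * beta @ beta

res = minimize(neg_penalised_loglik, x0=np.zeros(1 + J), method="BFGS")
beta_hat = res.x  # [intercept, slope for 1st birth, 2nd birth, 3rd birth]
```

Only the two individuals with three births contribute to the third-birth slope here, which is exactly the sparsity of information the next paragraph describes.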

What I can tell you is that when you differentiate with respect to a beta which depends on $j$ (e.g. the coefficient for metabolic rate at the third birth), some of the terms in this summation drop out. This is basically the likelihood "telling you" that certain observations contribute nothing to estimating certain parameters (e.g. individuals who give birth to two or fewer offspring contribute nothing to the estimated slope for metabolic rate at the third birth).

So in summary, your intuition is spot on when you suggest that something is being lost. However, the price of "purity" could be high, especially if you need to write your own algorithm to get your estimates.