# Gradient descent for negative binomial regression (and multiple-parameter families generally)

Say I’m doing a negative binomial regression and I’m trying to fit my model via gradient descent. Given that the Poisson and Negative Binomial have the same inverse link, how might I account for the dispersion?

P.S. I need to implement this via GD because I’m working with encrypted ML; I can’t easily do inverses or transposes within my problem setup.

I like looking at the Stata docs when I am implementing these models' loss functions in other guises. On pg 11, Stata has a likelihood function for their version of negative binomial regression, where $$\alpha$$ is the dispersion estimate and $$u_j$$ is the mean estimate (same as for Poisson regression). This variant is called the NB2 distribution. I have an example on my blog of using this as a loss function in a PyTorch deep learning model. Here is the code for torch tensors; it should be easy to translate to other languages:

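    import torch

    # PyTorch loss function: mean NB2 negative log-likelihood
    def nb2_loss(actual, log_pred, disp):
        alpha = disp.exp()           # disp is log(alpha), so alpha > 0
        m = 1 / alpha                # NB2 size parameter, 1/alpha
        mu = log_pred.exp()          # predicted mean (log link)
        p = 1 / (1 + alpha * mu)
        # per-observation log-likelihood
        ll = torch.lgamma(m + actual) - torch.lgamma(actual + 1) - torch.lgamma(m)
        ll += m * torch.log(p) + actual * torch.log(1 - p)
        return -ll.mean()            # negate so that minimizing fits the model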
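
Reading the terms off the code above, the quantity it averages and negates can be written as follows, with $$y_j$$ the observed count (`actual`), $$m = 1/\alpha$$, and $$p_j = 1/(1 + \alpha u_j)$$. This is transcribed from the code rather than copied from the Stata manual, so treat it as a sketch and check it against the manual before relying on it:

$$
\ln L = \sum_j \Big[ \ln\Gamma(m + y_j) - \ln\Gamma(y_j + 1) - \ln\Gamma(m) + m \ln p_j + y_j \ln(1 - p_j) \Big]
$$

The loss returned by the function is $$-\ln L$$ divided by the number of observations.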


Just a few notes: I parameterize $$\log(\alpha)$$ as disp in this loss function, which constrains the $$\alpha$$ parameter to always be positive. Like all models that use backpropagation, you need decent starting parameters. I think a starting value somewhere between 0 and 1 works well for the problems I have dealt with.
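
As a minimal sketch of the log parameterization described above: disp can be kept as an unconstrained tensor and exponentiated inside the loss, so any real value maps to a positive $$\alpha$$ and the gradient with respect to disp comes out of backpropagation. The toy counts and the three starting values here are made up purely for illustration.

```python
import torch

def nb2_loss(actual, log_pred, disp):
    # Mean NB2 negative log-likelihood; disp is log(alpha).
    alpha = disp.exp()
    m = 1 / alpha
    mu = log_pred.exp()
    p = 1 / (1 + alpha * mu)
    ll = torch.lgamma(m + actual) - torch.lgamma(actual + 1) - torch.lgamma(m)
    ll = ll + m * torch.log(p) + actual * torch.log(1 - p)
    return -ll.mean()

# Toy counts, with every prediction set to the log of the sample mean
# (illustrative values only).
y = torch.tensor([0.0, 1.0, 3.0, 7.0, 2.0])
log_mu = torch.log(torch.full((5,), y.mean().item()))

# disp is unconstrained: even a negative value gives a valid, positive alpha.
results = []
for start in (-2.0, 0.0, 1.0):
    disp = torch.tensor([start], requires_grad=True)
    loss = nb2_loss(y, log_mu, disp)
    loss.backward()  # fills disp.grad with d(loss)/d(disp)
    results.append((disp.exp().item(), loss.item(), disp.grad.item()))
    print(f"disp={start:+.1f} alpha={results[-1][0]:.3f} "
          f"loss={results[-1][1]:.4f} grad={results[-1][2]:+.4f}")
```

A plain gradient step on disp can never push $$\alpha$$ to zero or below, which is the point of optimizing on the log scale.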

For a second note, I have had a terrible time using PyTorch’s stochastic gradient descent in all my experiments with this (even with Poisson regression and fake data, so I know good starting points). So at this point I always just default to the Adam optimizer (but again, good starting points for all parameters are important).
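
Putting the pieces together, a training loop with Adam might look like the sketch below. Everything outside nb2_loss is an assumption for illustration: the data are simulated as a gamma-Poisson mixture with made-up true values, the model is a single `torch.nn.Linear` layer, and the starting points (intercept at the log of the mean count, slope at 0, disp at 0) and learning rate are just one reasonable choice, not something taken from the blog post.

```python
import torch

def nb2_loss(actual, log_pred, disp):
    # Mean NB2 negative log-likelihood; disp is log(alpha).
    alpha = disp.exp()
    m = 1 / alpha
    mu = log_pred.exp()
    p = 1 / (1 + alpha * mu)
    ll = torch.lgamma(m + actual) - torch.lgamma(actual + 1) - torch.lgamma(m)
    ll = ll + m * torch.log(p) + actual * torch.log(1 - p)
    return -ll.mean()

torch.manual_seed(0)

# Simulate NB2 counts as a gamma-Poisson mixture (hypothetical true values).
n, true_b0, true_b1, true_alpha = 5000, 0.5, 0.3, 0.5
x = torch.randn(n, 1)
mu = torch.exp(true_b0 + true_b1 * x[:, 0])
m = 1.0 / true_alpha
# Gamma with shape m and rate m/mu has mean mu and variance alpha * mu^2.
lam = torch.distributions.Gamma(torch.full((n,), m), m / mu).sample()
y = torch.poisson(lam)

# Starting points: intercept at log(mean count), slope 0, disp 0 (alpha = 1).
linear = torch.nn.Linear(1, 1)
with torch.no_grad():
    linear.weight.zero_()
    linear.bias.fill_(torch.log(y.mean()).item())
disp = torch.nn.Parameter(torch.zeros(1))

# Adam updates the regression weights and the log-dispersion jointly.
opt = torch.optim.Adam(list(linear.parameters()) + [disp], lr=0.05)
losses = []
for step in range(500):
    opt.zero_grad()
    loss = nb2_loss(y, linear(x)[:, 0], disp)
    loss.backward()
    opt.step()
    losses.append(loss.item())

print("slope:", linear.weight.item(), "intercept:", linear.bias.item())
print("alpha:", disp.exp().item(), "loss:", losses[0], "->", losses[-1])
```

Only the regression coefficients and disp are updated by gradient steps, so nothing in the loop needs a matrix inverse or transpose beyond what the linear layer does internally.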