Gradient descent for negative binomial regression (and multiple-parameter families generally)

Say I’m doing a negative binomial regression and I’m trying to fit my model via gradient descent.


Given that the Poisson and Negative Binomial have the same inverse link, how might I account for the dispersion?

P.S. I need to implement this via gradient descent because I'm working with encrypted ML; I can't easily do matrix inverses or transposes within my problem setup.


I like looking at the Stata docs when I am implementing these models' loss functions in other guises. So here, on pg. 11, Stata gives the log-likelihood for their version of negative binomial regression:

$$\ln L_j = \ln\Gamma(m + y_j) - \ln\Gamma(y_j + 1) - \ln\Gamma(m) + m \ln(p_j) + y_j \ln(1 - p_j),$$

with $m = 1/\alpha$ and $p_j = 1/(1 + \alpha \mu_j)$.

Here α is the dispersion estimate and μ_j is the mean estimate (the same exponential inverse link as for Poisson regression). This is the variant called the NB2 distribution. I have an example on my blog of using this as a loss function in a PyTorch deep learning model. Here is the code for torch tensors; it should be easily translatable to other languages:

# PyTorch loss function: negative mean log-likelihood for the NB2 model
import torch

def nb2_loss(actual, log_pred, disp):
    # disp is log(alpha); exponentiating keeps the dispersion alpha positive
    m = 1/disp.exp()                 # m = 1/alpha
    mu = log_pred.exp()              # mean, same inverse link as Poisson
    p = 1/(1 + disp.exp()*mu)        # p_j = 1/(1 + alpha*mu_j)
    ll = torch.lgamma(m + actual) - torch.lgamma(actual + 1) - torch.lgamma(m)
    ll += m*torch.log(p) + actual*torch.log(1 - p)
    return -ll.mean()                # negative mean log-likelihood
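
As a quick sanity check, here is a small usage sketch (toy numbers of my own, not from the original post), continuing with the nb2_loss function defined above:

# toy counts, log-scale mean predictions, and a scalar log-dispersion
actual = torch.tensor([0., 1., 3., 10.])
log_pred = torch.log(torch.tensor([0.5, 1.0, 2.0, 8.0]))
disp = torch.tensor(0.0)  # log(alpha) = 0, so alpha = 1

print(nb2_loss(actual, log_pred, disp))  # scalar negative mean log-likelihood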

Just a few notes: I parameterize log(α) as disp in this loss function, so exponentiating it constrains the α parameter to always be positive. Like all models fit via backpropagation, you need decent starting parameters. I think a starting value somewhere between 0 and 1 works well for the problems I have dealt with.
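
To make that parameterization concrete, here is a small sketch (my own illustration, not code from the original answer) of declaring disp as a learnable log-scale parameter with α starting at 0.5:

# store the dispersion on the log scale; exponentiating keeps alpha strictly positive
disp = torch.nn.Parameter(torch.log(torch.tensor(0.5)))  # alpha starts at 0.5
alpha = disp.exp()  # always > 0, whatever value the optimizer moves disp to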

For a second note, I have had a terrible time with PyTorch's plain stochastic gradient descent in all my experiments with this (even with Poisson regression and fake data, so I knew good starting points). At this point I always just default to the Adam optimizer (but again, good starting points for all parameters are important).
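
Putting both notes together, here is a minimal end-to-end sketch under my own assumptions (a single-predictor linear model for log(μ), fake overdispersed data, and made-up hyperparameters; not the exact setup from the blog post):

import torch

torch.manual_seed(0)

# fake data: one predictor, overdispersed counts via a gamma-Poisson mixture
x = torch.randn(1000)
rate = torch.exp(0.5 + 0.8 * x)
y = torch.poisson(rate * torch.distributions.Gamma(2.0, 2.0).sample(rate.shape))

# parameters: intercept, slope, and log-dispersion (alpha starts at 0.5)
beta = torch.nn.Parameter(torch.zeros(2))
disp = torch.nn.Parameter(torch.log(torch.tensor(0.5)))

opt = torch.optim.Adam([beta, disp], lr=0.01)
for step in range(2000):
    opt.zero_grad()
    loss = nb2_loss(y, beta[0] + beta[1] * x, disp)
    loss.backward()
    opt.step()

print(beta.detach(), disp.exp().item())  # fitted coefficients and alpha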

Source: Link, Question Author: IanQ, Answer Author: Andy W
