# Framing the negative binomial distribution for DNA sequencing

The negative binomial distribution has become a popular model for count data (specifically the expected number of sequencing reads within a given region of the genome from a given experiment) in bioinformatics. Explanations vary:

• Some explain it as something that works like the Poisson
distribution but has an additional parameter, allowing more freedom to
model the true distribution, with a variance not necessarily equal to
the mean
• Some explain it as a weighted mixture of Poisson distributions (with
a gamma mixing distribution on the Poisson parameter)

Is there a way to square these rationales with the traditional
definition of a negative binomial distribution as modeling the number
of successes of Bernoulli trials before seeing a certain number of
failures? Or should I just think of it as a happy coincidence that a
weighted mixture of Poisson distributions with a gamma mixing
distribution has the same probability mass function as the negative
binomial?

IMHO, the negative binomial distribution is used mainly for convenience.

In RNA-Seq there is a common assumption that if you could measure the same gene across an infinite number of replicates, the true distribution of its expression would be lognormal. This distribution is then sampled via a Poisson process (which yields a count), so the true distribution of reads per gene across replicates would be a Poisson-lognormal distribution.

But in the packages that we use, such as edgeR and DESeq, this distribution is modeled as a negative binomial. This is not because the authors didn’t know about the Poisson-lognormal distribution.

It is because the Poisson-lognormal distribution is awkward to work with: its probability mass function has no closed form, so fitting it requires numerical integration, and when you actually try to use it the performance is sometimes really bad.
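To make the integration point concrete, here is a minimal sketch of what evaluating a single Poisson-lognormal probability involves. The function name, the parameters `m` and `s` (mean and standard deviation of the log rate), and the midpoint-rule settings are my own choices for illustration, not anything taken from edgeR or DESeq.

```python
import math

def poisson_lognormal_pmf(k, m, s, steps=2000, width=10.0):
    """P(K = k) when log(rate) ~ Normal(m, s) and K | rate ~ Poisson(rate).

    No closed form is available, so integrate over x = log(rate) with a
    midpoint rule on [m - width*s, m + width*s] (illustrative settings).
    """
    lo = m - width * s
    h = 2.0 * width * s / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h  # midpoint of the i-th slice
        # log of the Poisson pmf at k with rate exp(x)
        log_pois = k * x - math.exp(x) - math.lgamma(k + 1)
        # log of the normal density of x
        log_norm = -0.5 * ((x - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))
        total += math.exp(log_pois + log_norm)
    return total * h

m, s = 2.0, 0.5  # hypothetical parameters
pmf = [poisson_lognormal_pmf(k, m, s) for k in range(200)]
mean = sum(k * p for k, p in enumerate(pmf))
print(sum(pmf))                        # should be very close to 1
print(mean, math.exp(m + s * s / 2))   # numerical mean vs. lognormal mean
```

Every single probability costs a full pass over the integration grid, and a likelihood fit evaluates many of them per gene per iteration, which is where the slowness comes from.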

A negative binomial distribution has a closed form, so it is a lot easier to work with, and the gamma distribution (the underlying mixing distribution) looks a lot like a lognormal: depending on its parameters it can look roughly normal or have a long right tail.
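As a quick sanity check that the gamma-Poisson mixture really is the negative binomial, the sketch below compares the closed-form probability with a brute-force integral of the Poisson probability against a gamma density. The mean/size parameterization (`mu`, `r`) and the integration settings are assumptions made for this example, and the integrator is only meant for `r >= 1`, where the integrand is well behaved near zero.

```python
import math

def nb_pmf(k, mu, r):
    """Closed-form negative binomial pmf with mean mu and size r."""
    log_p = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def gamma_poisson_pmf(k, mu, r, steps=20000):
    """Same probability by integrating Poisson(k | lam) against a
    Gamma(shape=r, scale=mu/r) density for lam (midpoint rule)."""
    scale = mu / r
    upper = mu + 20.0 * mu / math.sqrt(r)  # gamma mean plus 20 standard deviations
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        log_pois = k * math.log(lam) - lam - math.lgamma(k + 1)
        log_gamma = ((r - 1) * math.log(lam) - lam / scale
                     - math.lgamma(r) - r * math.log(scale))
        total += math.exp(log_pois + log_gamma)
    return total * h

mu, r = 10.0, 2.0  # hypothetical mean and size
for k in (0, 5, 20, 50):
    print(k, nb_pmf(k, mu, r), gamma_poisson_pmf(k, mu, r))
```

The two columns should agree to several decimal places, but only the first one is a handful of arithmetic operations; that is the convenience being bought.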

But in this example (if you believe the assumption) the negative binomial can’t be theoretically correct, because the theoretically correct distribution is the Poisson-lognormal. The two distributions are reasonable approximations of one another but are not equivalent.
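One way to see "close but not equivalent" is to compare the mean-variance relationships of the two models. The notation here ($m$, $s$, $\mu$, $r$) is my own and is not taken from either package:

$$
\begin{aligned}
\text{Poisson-lognormal:}\quad & \log\lambda \sim \mathcal{N}(m, s^2), \qquad K \mid \lambda \sim \mathrm{Poisson}(\lambda) \\
& \mathbb{E}[K] = \mu = e^{m + s^2/2}, \qquad \mathrm{Var}(K) = \mu + \left(e^{s^2} - 1\right)\mu^2 \\[4pt]
\text{Negative binomial:}\quad & \lambda \sim \mathrm{Gamma}(\text{shape } r,\ \text{scale } \mu / r), \qquad K \mid \lambda \sim \mathrm{Poisson}(\lambda) \\
& \mathbb{E}[K] = \mu, \qquad \mathrm{Var}(K) = \mu + \frac{\mu^2}{r}
\end{aligned}
$$

Both variances are the Poisson term $\mu$ plus a term quadratic in $\mu$, so choosing $1/r = e^{s^2} - 1$ matches the first two moments exactly. The higher moments still differ, because the lognormal has a heavier right tail than the gamma, which is why the two are approximations of each other rather than the same distribution.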

But I still think the “incorrect” negative binomial distribution is often the better choice: empirically it gives better results, because the numerical integration is slow and the fits can behave badly, especially for distributions with long tails.