# How can I sample from a distribution with incomputable CDF?

Semi-computer science simulation related problem here.

I have a distribution where

P(x) = $$(eb−1)eb(n−x)ebn+b−1\frac{(e^b-1) e^{b (n-x)}}{e^{b n+b}-1}$$

for some constants b and n, and x is an integer such that $$0≤x≤n0\leq x \leq n$$.

Now, I need to sample from this distribution. It has an invertible CDF, so it’s possible to do this directly in theory. The problem is that the numbers involved are LARGE. So large in fact, that they both overflow conventionally formatted variables, and take at least minutes (at some point I gave up…) to compute using arbitrary precision formats. Basically, the inverse CDF still involves a term of $$eb(n+1)e^{b(n+1)}$$, for $$350. Despite this, the output numbers will still be in the range $$0−n0-n$$, so it seems like there should be a way of doing this.

What I'm looking for is a way of approximately sampling from this distribution that is computable. Are there alternative methods of sampling? What are they?

The CDF is readily invertible. A formula for the inversion leads to what has to be one of the simplest and most expedient possible solutions.

Begin by observing that the probability of outcome $k$, $0 \le k \le n$, is proportional to $e^{-b k}$. Thus, if we generate a uniform value $q$ between $0$ and $q_{\max}=\sum_{k=0}^{n} e^{-b k}$ = $(1 - e^{-b(n+1)})/(1 - e^{-b})$, we only need find the largest $k$ for which

Simple algebra gives the solution

Here is an R implementation constructed like all the other random-number generators: its first argument specifies how many iid values to generate and the rest of the arguments name the parameters ($b$ as b and $n$ as n.max):

rgeom.truncated <- function(n=1, b, n.max) {
a <- 1 - exp(-b)
q.max <- (1 - exp(-b*(n.max+1))) / a
q <- runif(n, 0, q.max)
return(-ceiling(log(1 - q*a) / b))
}


As an example of its use, let's generate a million random variates according to this distribution:

b <- 0.001
n.max <- 3500
n.sim <- 10^6
set.seed(17)
system.time(sim <- rgeom.truncated(n.sim, b,n.max))


($0.10$ seconds were needed.)

h <- hist(sim+1, probability=TRUE, breaks=50, xlab="Outcome+1")
pmf <- exp(-b * (0: n.max)); pmf <- pmf / sum(pmf)
lines(0:n.max, pmf, col="Red", lwd=2) ($1$ was added to each value in order to create a better histogram: R's hist procedure has an idiosyncrasy (=bug) in which the first bar is too high when the left endpoint is set at zero.) The red curve is the reference distribution that this simulation attempts to reproduce. Let's evaluate the goodness of fit with a chi-square test:

observed <- table(sim)
expected <- n.sim * pmf
chi.square <- (observed-expected)^2 / expected
pchisq(sum(chi.square), n.max, lower.tail=FALSE)


The p-value is $0.84$: a beautiful fit.