How can I sample from a distribution with incomputable CDF?

Semi-computer science simulation related problem here.

I have a distribution where

P(x) = (eb1)eb(nx)ebn+b1

for some constants b and n, and x is an integer such that 0xn.

Now, I need to sample from this distribution. It has an invertible CDF, so it’s possible to do this directly in theory. The problem is that the numbers involved are LARGE. So large in fact, that they both overflow conventionally formatted variables, and take at least minutes (at some point I gave up…) to compute using arbitrary precision formats. Basically, the inverse CDF still involves a term of eb(n+1), for 350<n<3500. Despite this, the output numbers will still be in the range 0n, so it seems like there should be a way of doing this.

What I'm looking for is a way of approximately sampling from this distribution that is computable. Are there alternative methods of sampling? What are they?

Answer

The CDF is readily invertible. A formula for the inversion leads to what has to be one of the simplest and most expedient possible solutions.

Begin by observing that the probability of outcome k, 0kn, is proportional to ebk. Thus, if we generate a uniform value q between 0 and qmax = (1 - e^{-b(n+1)})/(1 - e^{-b}), we only need find the largest k for which

q \ge \sum_{i=0}^{k} e^{-bi} = \frac{1 - e^{-(k+1)b}}{1 -e^{-b}}.

Simple algebra gives the solution

k = - \text{ceiling}\left(\frac{\log(1 - q (1-e^{-b}))}{b}\right).

Here is an R implementation constructed like all the other random-number generators: its first argument specifies how many iid values to generate and the rest of the arguments name the parameters (b as b and n as n.max):

rgeom.truncated <- function(n=1, b, n.max) {
  a <- 1 - exp(-b)
  q.max <- (1 - exp(-b*(n.max+1))) / a
  q <- runif(n, 0, q.max)
  return(-ceiling(log(1 - q*a) / b))
}

As an example of its use, let's generate a million random variates according to this distribution:

b <- 0.001
n.max <- 3500
n.sim <- 10^6
set.seed(17)
system.time(sim <- rgeom.truncated(n.sim, b,n.max))

(0.10 seconds were needed.)

h <- hist(sim+1, probability=TRUE, breaks=50, xlab="Outcome+1")
pmf <- exp(-b * (0: n.max)); pmf <- pmf / sum(pmf)
lines(0:n.max, pmf, col="Red", lwd=2)

Histogram

(1 was added to each value in order to create a better histogram: R's hist procedure has an idiosyncrasy (=bug) in which the first bar is too high when the left endpoint is set at zero.) The red curve is the reference distribution that this simulation attempts to reproduce. Let's evaluate the goodness of fit with a chi-square test:

observed <- table(sim)
expected <- n.sim * pmf
chi.square <- (observed-expected)^2 / expected
pchisq(sum(chi.square), n.max, lower.tail=FALSE)

The p-value is 0.84: a beautiful fit.

Attribution
Source : Link , Question Author : John Doucette , Answer Author : whuber

Leave a Comment