Semi-computer science simulation related problem here.

I have a distribution where

P(x) = \frac{(e^b - 1)\, e^{b(n-x)}}{e^{b(n+1)} - 1}

for some constants b and n, and x is an integer such that 0≤x≤n.

Now, I need to sample from this distribution. It has an invertible CDF, so in theory this can be done directly. The problem is that the numbers involved are LARGE. So large, in fact, that they overflow conventionally formatted variables and take at least minutes (at some point I gave up…) to compute using arbitrary-precision formats. Basically, the inverse CDF still involves a term of e^{b(n+1)}, for 350 < n < 3500. Despite this, the output numbers will still be in the range 0 to n, so it seems like there should be a way of doing this.
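As a quick illustration of the overflow (a Python sketch of my own, not part of the original question): IEEE double precision overflows once the exponent argument exceeds roughly 709, so e^{b(n+1)} is unrepresentable for, e.g., b = 1 and n = 3500, even though the same quantity is perfectly tame in log space.

```python
import math

# e^{b(n+1)} overflows IEEE doubles once b*(n+1) exceeds ~709.78,
# which covers much of the stated range 350 < n < 3500 (here b = 1):
b, n = 1.0, 3500
try:
    math.exp(b * (n + 1))
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)        # the direct term is unrepresentable

# The log of the same term is just a small number:
log_term = b * (n + 1)   # log of e^{b(n+1)}
print(log_term)
```

This is why any workable sampler has to avoid ever forming e^{b(n+1)} directly.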

What I'm looking for is a way of approximately sampling from this distribution that is computable. Are there alternative methods of sampling? What are they?

**Answer**

**The CDF is readily invertible.** A formula for the inversion leads to what has to be one of the simplest and most expedient possible solutions.

Begin by observing that the probability of outcome k, 0 ≤ k ≤ n, is proportional to e^{-bk}. Thus, if we generate a *uniform* value q between 0 and q_max = (1 - e^{-b(n+1)})/(1 - e^{-b}), we need only find the largest k for which

q \ge \sum_{i=0}^{k-1} e^{-bi} = \frac{1 - e^{-kb}}{1 - e^{-b}}

Simple algebra gives the solution

k = - \text{ceiling}\left(\frac{\log(1 - q (1-e^{-b}))}{b}\right).
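To sanity-check that algebra (a Python sketch of my own; the function names and the grid of q values are my choices, not part of the original answer), we can compare the closed-form k against a brute-force scan of the partial sums:

```python
import math

b, n_max = 0.7, 20
a = 1 - math.exp(-b)
q_max = (1 - math.exp(-b * (n_max + 1))) / a

def k_closed_form(q):
    # k = -ceiling(log(1 - q*(1 - e^{-b})) / b), as in the answer
    return -math.ceil(math.log(1 - q * a) / b)

def k_brute_force(q):
    # largest k with sum_{i=0}^{k-1} e^{-b i} <= q (empty sum = 0)
    k = 0
    while (1 - math.exp(-(k + 1) * b)) / a <= q:
        k += 1
    return k

qs = [j * q_max / 97 for j in range(1, 97)]  # interior grid, avoids endpoints
agree = all(k_closed_form(q) == k_brute_force(q) for q in qs)
print(agree)
```

The two agree on the whole grid, and both always land in {0, …, n}, which is exactly why the sampler never needs the huge term e^{b(n+1)}.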

**Here is an R implementation,** constructed like all the other random-number generators: its first argument specifies how many *iid* values to generate and the rest of the arguments name the parameters (b as `b` and n as `n.max`):

```
rgeom.truncated <- function(n=1, b, n.max) {
  a <- 1 - exp(-b)
  q.max <- (1 - exp(-b*(n.max+1))) / a   # upper limit for the uniform draws
  q <- runif(n, 0, q.max)                # uniform values in [0, q.max)
  return(-ceiling(log(1 - q*a) / b))     # closed-form inverse of the CDF
}
```
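For readers without R, here is a line-for-line Python translation (my translation, not part of the original answer; the `rng` argument and the endpoint clamp are my additions), together with a rough check that the empirical frequencies track the target pmf:

```python
import math
import random

def rgeom_truncated(n, b, n_max, rng=random):
    """Sample n iid values from P(x) proportional to e^{-b x}, 0 <= x <= n_max."""
    a = 1 - math.exp(-b)
    q_max = (1 - math.exp(-b * (n_max + 1))) / a
    out = []
    for _ in range(n):
        q = rng.uniform(0, q_max)
        k = -math.ceil(math.log(1 - q * a) / b)
        out.append(min(k, n_max))  # guard against q landing exactly on q_max
    return out

rng = random.Random(17)
b, n_max, n_sim = 0.5, 10, 100_000
sim = rgeom_truncated(n_sim, b, n_max, rng)

pmf0 = (1 - math.exp(-b)) / (1 - math.exp(-b * (n_max + 1)))  # P(X = 0)
freq0 = sim.count(0) / n_sim
print(abs(freq0 - pmf0) < 0.01)
```

With 10^5 draws the standard error of `freq0` is about 0.0015, so a 0.01 tolerance is a generous but meaningful check.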

As an example of its use, let's generate a million random variates according to this distribution:

```
b <- 0.001
n.max <- 3500
n.sim <- 10^6
set.seed(17)
system.time(sim <- rgeom.truncated(n.sim, b, n.max))
```

(0.10 seconds were needed.)

```
h <- hist(sim+1, probability=TRUE, breaks=50, xlab="Outcome+1")
pmf <- exp(-b * (0:n.max)); pmf <- pmf / sum(pmf)
lines(0:n.max + 1, pmf, col="Red", lwd=2)   # shift by 1 to line up with sim+1
```

(1 was added to each value in order to create a better histogram: `R`'s `hist` procedure has an idiosyncrasy (=bug) in which the first bar is too high when the left endpoint is set at zero.) The red curve is the reference distribution that this simulation attempts to reproduce. **Let's evaluate the goodness of fit** with a chi-square test:

```
observed <- table(sim)
expected <- n.sim * pmf
chi.square <- (observed-expected)^2 / expected
pchisq(sum(chi.square), n.max, lower.tail=FALSE)
```

The p-value is 0.84: a beautiful fit.

**Attribution**
*Source: Link, Question Author: John Doucette, Answer Author: whuber*