# sampling cost of $O(d)$ versus $O(2^d)$

I came across the following simulation problem: given a set $\{\omega_1,\ldots,\omega_d\}$ of known real numbers, a distribution on $\{-1,1\}^d$ is defined by
$$p(x) \propto (\omega \cdot x)_+ = \Big( \sum_{i=1}^d \omega_i x_i \Big)_+,$$
where $(z)_+$ denotes the positive part of $z$. While I can think of a Metropolis-Hastings sampler targeting this distribution, I wonder if there exists an efficient direct sampler, taking advantage of the large number of zero probabilities to decrease the order of the algorithm from $O(2^d)$ to $O(d)$.

Here’s a fairly obvious recursive sampler that’s $O(d)$ in the best case (in terms of the weights $\omega_i$), but exponential in the worst case.

Suppose we’ve already selected $x_1, \dots, x_{i-1}$, and wish to choose $x_{i}$. We need to compute
$$w(x_1, \dots, x_i) := \sum_{x_{i+1}, \dots, x_d \in \{-1,1\}} \Big( \sum_{j=1}^d \omega_j x_j \Big)_+,$$
the total unnormalized probability of all completions of the prefix, and choose $x_i = 1$ with probability
$$\frac{w(x_1, \dots, x_{i-1}, 1)}{w(x_1, \dots, x_{i-1}, 1) + w(x_1, \dots, x_{i-1}, -1)}.$$
The denominator will be nonzero for any valid choice of samples $x_1, \dots, x_{i-1}$.

Now, of course, the question is how to compute $w(x_1, \dots, x_i)$.

If we have that $C := \sum_{j=1}^{i} \omega_j x_j \ge \sum_{j=i+1}^{d} \lvert \omega_j \rvert$, then $\omega \cdot x \ge 0$ for any $x$ with leading entries $x_{1:i}$, so the positive part can be dropped. Since each $x_j$ with $j > i$ is summed over both $+1$ and $-1$, the trailing terms cancel, and $w$ becomes:
$$w(x_1, \dots, x_i) = \sum_{x_{i+1}, \dots, x_d} \sum_{j=1}^d \omega_j x_j = 2^{d-i} \, C.$$
In the opposite case, $C \le - \sum_{j=i+1}^{d} \lvert \omega_j \rvert$,
we have that $\omega \cdot x \le 0$ and so $w(x_1, \dots, x_i) = 0$.
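As a toy check of both cases (numbers mine, chosen for illustration): take $\omega = (3, 1, 1)$. For the prefix $x_1 = 1$ we have $C = 3 \ge \lvert\omega_2\rvert + \lvert\omega_3\rvert = 2$, and indeed
$$w(1) = (3+1+1) + (3+1-1) + (3-1+1) + (3-1-1) = 12 = 2^{3-1} \cdot 3,$$
while for $x_1 = -1$ we have $C = -3 \le -2$, so $w(-1) = 0$.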

Otherwise, we must recurse, using $w(x_1, \dots, x_i) = w(x_1, \dots, x_i, 1) + w(x_1, \dots, x_i, -1)$.

Assume that memory isn’t an issue and that we can cache all sub-computations of $w(1)$ and $w(-1)$ in a tree – up to the point that we hit one of the “nice” cases, after which any calls take constant time. (We’ll need to compute this whole tree anyway to select $x_1$.) Then, once this tree of $w$ computations is built, the sampler will take only $O(d)$ time. The question is how long it takes to build the tree, or equivalently how large it is.

We will of course hit the “nice” cases faster if the $\omega_i$ are sorted by magnitude, $\lvert \omega_1 \rvert \ge \lvert \omega_2 \rvert \ge \dots \ge \lvert \omega_d \rvert$.

In the best case, $\lvert \omega_1 \rvert > \sum_{j=2}^d \lvert \omega_j \rvert$. Then we hit a “nice” case immediately for both $w(1)$ and $w(-1)$: the choice of $x_1$ agreeing with the sign of $\omega_1$ gives $C \ge \sum_{j=2}^d \lvert \omega_j \rvert$, and the other gives $C \le -\sum_{j=2}^d \lvert \omega_j \rvert$. So $w$ tree construction takes constant time, and the whole sampler takes $O(d)$ time.

In the worst (sorted) case, $\omega_1 = \omega_2 = \dots = \omega_d$. Then the question is: how big is the total tree?

Well, the first paths to terminate are of course $(1, 1, \dots, 1)$ and $(-1, -1, \dots, -1)$, of length $\lceil d/2 \rceil$. The tree is therefore complete up to that depth, and so contains at least $2^{d/2}$ nodes. (It has more; you can probably find the exact count with an argument like the ones used in gambler’s ruin problems, but I couldn’t find it in two minutes of Googling and don’t particularly care – $2^{d/2}$ is bad enough….)
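To see the blow-up concretely, here's a small self-contained Python sketch (the helper `tree_nodes` is mine, not from the post) that counts the prefixes at which we must recurse, i.e. the internal nodes of the $w$ tree:

```python
def tree_nodes(omega):
    """Count the prefixes where neither "nice" case applies, i.e. the
    internal nodes of the recursion tree for w of the empty prefix (sketch)."""
    d = len(omega)
    # suffix[i] = |omega[i]| + ... + |omega[d-1]|  (0-indexed)
    suffix = [0.0] * (d + 1)
    for i in reversed(range(d)):
        suffix[i] = suffix[i + 1] + abs(omega[i])

    count = 0

    def go(i, C):  # C = omega · x over the first i chosen entries
        nonlocal count
        if C >= suffix[i] or C <= -suffix[i]:
            return          # "nice" case: this branch terminates
        count += 1          # otherwise we must recurse on both children
        go(i + 1, C + omega[i])
        go(i + 1, C - omega[i])

    go(0, 0.0)
    return count
```

For equal weights the count grows rapidly with $d$ (the $2^{d/2}$ bound above is only a floor), while a single dominant weight collapses the tree to one recursing node.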

If your setting has only a few very large $\omega_i$, this is probably a reasonably practical approach. If the $\omega_i$ are all of similar magnitude, it’s probably still exponential and too expensive for large $d$.