# Expectation of the Sum of K Numbers without replacement

Given $n$ numbers, where the value of each number is different, denoted as $v_1, v_2, ..., v_n$, and the probability of selecting each number is $p_1, p_2, ..., p_n$, respectively.

Now if I select $K$ numbers based on the given probabilities, where $K \leq n$, what is the expectation of the sum of those $K$ numbers? Note that the selection is without replacement, so that the $K$ numbers cannot involve duplicate numbers. I understand that if the selection is with replacement, the expectation of the sum of the $K$ numbers equals $K \times E(V)$, where

Furthermore, what about the expectation of the variance of those $K$ numbers?

I am a CS PhD student who is working on a big data problem, and I don’t have any statistics background. I expect that someone can give me a formula as the answer. However, if the answer is too complicated to be described by a formula or intensive computation has to be involved, an approximate answer is totally acceptable.

You can assume $n$ here is quite large, and the probability can vary a lot. In practice, the values of those probabilities come from a query log, which records a series of aggregation queries. The point is that the frequency of each number involved in the queries can be quite skew, i.e., some are rarely queried, while some are queried very frequently. You can assume the probability distribution is normal distribution, zipf distribution or any other reasonable alternatives.

The value distribution is only a contiguous subset of any possible distribution. In other words, if you have a histogram that represents a certain distribution, the all the numbers involved in this problem are the numbers all within a single bucket.

In terms of the value of K, you may assume it is always less than the number of frequently queried elements.

This is probably in the nature of an answer that, while accurate, is
probably not that useful. Horvitz and Thompson (1952) provide results that cover this situation in general. These results are given in terms of the
combinatorial expressions one might expect.

To keep consistent with their notation, and also to correspond better with more widely used notation, let me redefine some quantities. Let $N$ be the number of elements in the population and $n$ be the sample size.

Let $u_i$, $i=1,...,N$, represent the $N$ elements of the population, with given values $V_i$, $i=1,...,N$ and probabilities of selection $p_1,...,p_N$. For a given sample of size $n$, let the observed values in the sample be $v_1,..., v_n$.

What is desired are the mean and variance of the sample total

As mentioned in the comments, the probability of selecting a particular sample $s = \{u_i, u_j, ..., u_t\}$ drawn in that order is

where the initial probability $p_{i_1}$ of drawing $u_i$ is given by $p_i$, the second probability $p_{j_2}$ of drawing $u_j$ is conditional on having removed $u_i$ from the population, and so forth. So each subsequent unit drawn results in a new probability distribution for the next unit (hence, the choice of different indicial letters, because each represents a different distribution.)

There are

samples of size $n$ that contain $u_i$ out of the entire population. Note that this takes into account the $n!$ permutations of the sample.

Let $s_n^{(i)}$ denote a specific sample of size $n$ which includes $u_i$.
Then, the probability of selecting element $u_i$ is given by

where the summation is over the set of size $S^{(i)}$ of all possible samples $s_n^{(i)}$ of size $n$ that contain $u_i$. (I changed the notation a little from the paper since it seemed confusing to me.)

Similarly, define

as the number of samples containing both $u_i$ and $u_j$. Then we can define the probability of a sample containing both as

where the summation is over the set of size $S^{(ij)}$ of all possible samples $s_n^{(ij)}$ of size $n$ that contain $u_i$ and $u_j$.

The expected value is then derived as

Although the variance is not derived explicitly in the paper, it could be obtained from expections of the $q$th moment

and the cross-products

In other words, it looks as though one would would need to go through all possible subsets to do these calculations. Maybe this could be done for smaller values of $n$, though.

Horvitz, D.G. and Thompson, D.J. (1952) A generalization of
sampling without replacement from a finite universe. Journal of the
American Statistical Association
47(260): 663-685.