Given n numbers, where the value of each number is different, denoted as v1,v2,...,vn, and the probability of selecting each number is p1,p2,...,pn, respectively.

Now if I select K numbers based on the given probabilities, where K≤n, what is the expectation of the sum of those K numbers? Note that the selection is without replacement, so that the K numbers cannot involve duplicate numbers. I understand that if the selection is with replacement, the expectation of the sum of the K numbers equals K×E(V), where E(V)=v1×p1+v2×p2+...+vn×pn.

Furthermore, what about the expectation of the variance of those K numbers?

I am a CS PhD student who is working on a big data problem, and I don’t have any statistics background. I expect that someone can give me a formula as the answer. However, if the answer is too complicated to be described by a formula or intensive computation has to be involved, an approximate answer is totally acceptable.

You can assume n here is quite large, and the probability can vary a lot. In practice, the values of those probabilities come from a query log, which records a series of aggregation queries. The point is that the frequency of each number involved in the queries can be quite skew, i.e., some are rarely queried, while some are queried very frequently. You can assume the probability distribution is normal distribution, zipf distribution or any other reasonable alternatives.

The value distribution is only a contiguous subset of any possible distribution. In other words, if you have a histogram that represents a certain distribution, the all the numbers involved in this problem are the numbers all within a single bucket.

In terms of the value of K, you may assume it is always less than the number of frequently queried elements.

**Answer**

This is probably in the nature of an answer that, while accurate, is

probably not that useful. Horvitz and Thompson (1952) provide results that cover this situation in general. These results are given in terms of the

combinatorial expressions one might expect.

To keep consistent with their notation, and also to correspond better with more widely used notation, let me redefine some quantities. Let N be the number of elements in the population and n be the sample size.

Let u_i, i=1,…,N, represent the N elements of the population, with given values V_i, i=1,…,N and probabilities of selection p_1,…,p_N. For a given sample of size n, let the observed values in the sample be v_1,…, v_n.

What is desired are the mean and variance of the sample total

\sum_{i=1}^n v_i.

As mentioned in the comments, the probability of selecting a particular sample s = \{u_i, u_j, …, u_t\} drawn in that order is

\textrm{Pr}(s) = p_{i_1}p_{j_2}\cdots p_{t_n},

where the initial probability p_{i_1} of drawing u_i is given by p_i, the second probability p_{j_2} of drawing u_j is conditional on having removed u_i from the population, and so forth. So each subsequent unit drawn results in a new probability distribution for the next unit (hence, the choice of different indicial letters, because each represents a different distribution.)

There are

S^{(i)} = n! \binom{N-1}{n-1}

samples of size n that contain u_i out of the entire population. Note that this takes into account the n! permutations of the sample.

Let s_n^{(i)} denote a specific sample of size n which includes u_i.

Then, the probability of selecting element u_i is given by

P(u_i) = \sum \textrm{Pr}(s_n^{(i)}),

where the summation is over the set of size S^{(i)} of all possible samples s_n^{(i)} of size n that contain u_i. (I changed the notation a little from the paper since it seemed confusing to me.)

Similarly, define

S^{(ij)} = n! \binom{N-2}{n-2}

as the number of samples containing both u_i and u_j. Then we can define the probability of a sample containing both as

\textrm{P}(u_i u_j) = \sum \textrm{Pr}(s_n^{(ij)}),

where the summation is over the set of size S^{(ij)} of all possible samples s_n^{(ij)} of size n that contain u_i and u_j.

The expected value is then derived as

E \left( \sum_{i=1}^n v_i \right) =

\sum_{i=1}^N \textrm{P}(u_i) V_i.

Although the variance is not derived explicitly in the paper, it could be obtained from expections of the qth moment

E \left( \sum_{i=1}^n v_i^q \right) =

\sum_{i=1}^N \textrm{P}(u_i) V_i^q

and the cross-products

E \left( \sum_{i \ne j}^n v_iv_j \right) =

\sum_{i \ne j} \textrm{P}(u_i u_j) V_i V_j.

In other words, it looks as though one would would need to go through all possible subsets to do these calculations. Maybe this could be done for smaller values of n, though.

**Attribution***Source : Link , Question Author : SciPioneer , Answer Author : jvbraun*