Let’s say I have a large set of S values which sometimes repeat. I wish to estimate the total number of unique values in the large set.
If I take a random sample of T values, and determine that it contains Tu unique values, can I use this to estimate the number of unique values in the large set?
Here is a whole paper about the problem, with a summary of various approaches. It’s called Distinct Value Estimation in the literature.
If I had to do this myself, without having read fancy papers, I’d do this. In building language models, one often has to estimate the probability of observing a previously unknown word, given a bunch of text. A pretty good approach at solving this problem for language models in particular is to use the number of words that occurred exactly once, divided by the total number of tokens. It’s called the Good Turing Estimate.
Let u1 be the number of values that occurred exactly once in a sample of m items.
P[new item next] ~= u1 / m.
Let u be the number of unique items in your sample of size m.
If you mistakenly assume that the ‘new item next’ rate didn’t decrease as you got more data, then using Good Turing, you’ll have
total uniq set of size s ~= u + u1 / m * (s - m)
This has some nasty behavior as u1 becomes really small, but that might not be a problem for you in practice.