If I want to have a 95% chance that less than 1% of objects are faulty, how many samples do I need?

I need to make sure that my XML sitemap has less than $1\%$ rubbish (broken links).
The list of URLs is in the hundreds of thousands, and even though it might be feasible to test them all one by one, I’d rather not, for many reasons:

1 - Saved bandwidth
2 - Faster traffic for real clients
3 - Less noise in visitor statistics (because my test would count as a visit)
4 - I could go on...

So I think taking a random subset would be sufficient; the problem is that I don’t know how to work out the probabilities.

Is there a simple function I can use?

If it helps, we can assume we have some a priori information on the probability of a link being broken. Let’s say that across runs there is a $0.75\%$ chance of any given link being broken.

Answer

So it depends on the distribution of your prior belief about the breakage rate, but: about 3600.

from scipy import stats, optimize  # scipy's submodules need explicit imports

p = 0.0075          # prior breakage rate per link
threshold = .01     # maximum tolerable fraction of broken links
confidence = .95    # desired confidence level

# After n checks at rate p, the posterior is Beta(n*p, n*(1-p)).
# Solve for the n at which 95% of its mass lies below the threshold.
f = lambda n: stats.beta(a=n * p, b=n * (1 - p)).cdf(threshold) - confidence
print(optimize.fsolve(f, 1000)[0])

>> 3627.45119614
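
As a side note on the solver: fsolve is fine here with the starting guess of 1000, but a bracketing root-finder such as scipy.optimize.brentq gives the same answer without depending on that guess. The sketch below assumes a bracket of [1000, 100000], chosen to straddle the crossing point while staying clear of very small n, where the nearly improper beta piles its mass near zero and the sign of the function flips.

from scipy import stats, optimize

p, threshold, confidence = 0.0075, 0.01, 0.95

# Same function as above; brentq needs a bracket [lo, hi] with a sign
# change, which [1000, 100000] provides for these parameters.
f = lambda n: stats.beta(a=n * p, b=n * (1 - p)).cdf(threshold) - confidence
print(optimize.brentq(f, 1000, 100000))  # ~3627.45, matching the fsolve result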

The idea here is to model each link check as a Bernoulli trial, and to model your beliefs about the breakage rate with a beta distribution. The beta distribution is conjugate to the Bernoulli distribution, and the way to update a beta distribution when you run a trial is pretty simple:

  • if it’s a failure, you add one to the first parameter, $\alpha$
  • if it’s a success, you add one to the second parameter, $\beta$

So if we start with a $\text{Beta}(0, 0)$ distribution and see failures about $0.75\%$ of the time, how many trials will it take before 95% of the distribution’s mass is below 0.01? About 3600.
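
To make that update rule concrete, here is a minimal sketch that simulates checking a sample of roughly that size; the true breakage rate of 0.75%, the sample size of 3628, and the random seed are assumptions for the example (it also assumes at least one failure is observed, which is essentially certain at this sample size).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_true = 0.0075   # assumed true breakage rate
n = 3628          # sample size suggested by the calculation above

# Simulate n link checks (1 = broken) and apply the update rule:
# each failure adds one to alpha, each success adds one to beta.
failures = rng.binomial(n, p_true)
a, b = failures, n - failures

# With this many samples, roughly 95% of the posterior mass should
# sit below the 1% threshold (the exact figure varies run to run).
posterior = stats.beta(a=a, b=b)
print(failures, posterior.cdf(0.01))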

Attribution
Source: Link, Question Author: gurghet, Answer Author: Andy Jones
