I need to make sure that my XML sitemap has less than $1\%$ rubbish (broken links).

The list of URLs is in the hundreds of thousands, and even if it were feasible to test them all one by one, I'd rather not, for several reasons:

1. Saved bandwidth
2. Faster traffic for real clients
3. Less noise in visitor statistics (because my tests would count as visits)
4. I could go on...

So I think testing a random subset would be sufficient; the problem is that I don't know probability theory.

Is there a simple function I can use?

If it helps, we can assume we have a priori information on the probability of a link being broken. Let's say that, across runs, any given link has a $0.75\%$ chance of being broken.

**Answer**

So it depends on the distribution of your prior belief about the breakage rate, but: about 3600.

```python
import scipy.stats
import scipy.optimize

p = 0.0075        # assumed per-link breakage probability
threshold = 0.01  # we want the true rate below this
confidence = 0.95 # with this much posterior mass below the threshold

# After n trials with breakage rate p, the posterior is Beta(n*p, n*(1-p)).
# Find the n at which 95% of that posterior's mass lies below the threshold.
f = lambda n: scipy.stats.beta(a=n*p, b=n*(1-p)).cdf(threshold) - confidence
print(scipy.optimize.fsolve(f, 1000)[0])
# 3627.45119614
```

The idea here is to model link breakages as a Bernoulli trial, and model your beliefs about the breakage rate as the beta distribution. The beta distribution is conjugate to the Bernoulli distribution, and the way to update a beta distribution when you run a trial is pretty simple:

- if it’s a failure, you add one to the first parameter, $\alpha$
- if it’s a success, you add one to the second parameter, $\beta$
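The two update rules amount to simple bookkeeping. Here is a minimal sketch with hypothetical numbers, using the same convention as the code above ($\alpha$ counts failures, $\beta$ counts successes):

```python
# Conjugate Beta update for Bernoulli trials (link checks).
# alpha counts failures (broken links), beta counts successes (working links).

def update_beta(alpha, beta, broken):
    """Return the updated (alpha, beta) after observing one link check."""
    if broken:
        return alpha + 1, beta   # failure: bump the first parameter
    return alpha, beta + 1       # success: bump the second parameter

# Hypothetical run: 3 broken links among 400 checked.
alpha, beta = 0, 0
results = [True] * 3 + [False] * 397
for broken in results:
    alpha, beta = update_beta(alpha, beta, broken)

print(alpha, beta)  # 3 397
```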

So if we start with a $\text{Beta}(0, 0)$ distribution and see failures about .75% of the time, how many trials will it take before 95% of the distribution’s mass is below 0.01? About 3600.
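Once the sample size is known, drawing the random subset itself is straightforward. A standard-library sketch, where `all_urls` and the `is_broken` checker are hypothetical stand-ins (a real checker would issue an HTTP request per URL):

```python
import random

def is_broken(url):
    # Stub checker: a real implementation might do something like
    # requests.head(url, timeout=5).status_code >= 400
    return False

def check_links(all_urls, n=3628, seed=None):
    """Sample n URLs uniformly at random and return the observed breakage fraction."""
    rng = random.Random(seed)
    sample = rng.sample(all_urls, min(n, len(all_urls)))
    broken = sum(1 for url in sample if is_broken(url))
    return broken / len(sample)

# Hypothetical sitemap of 200,000 URLs.
urls = [f"https://example.com/page/{i}" for i in range(200_000)]
print(check_links(urls, seed=0))  # 0.0 with the stub checker
```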

**Attribution**
*Source : Link , Question Author : gurghet , Answer Author : Andy Jones*