Could anyone explain in plain English what the difference is between Scott’s and Silverman’s rules of thumb for bandwidth selection?

Specifically, when is one better than the other? Is it related to the underlying distribution? The number of samples?

P.S. I am referring to the code in SciPy.

**Answer**

The comments in the code seem to end up defining the two essentially identically (aside from a relatively small difference in the constant).

Both are of the form c A n^{-1/5}, both with what looks like the same A (an estimate of scale), and with values of c very close to 1 (close relative to the typical uncertainty in the estimate of the optimum bandwidth).
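For concreteness, here is a small sketch of the two factors as described in the SciPy `gaussian_kde` documentation (my own helper functions for illustration, not the library's code; `d` is the number of dimensions):

```python
def scott_factor(n, d=1):
    # Scott's factor as documented for scipy.stats.gaussian_kde: n^(-1/(d+4))
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d=1):
    # Silverman's factor: (n * (d + 2) / 4)^(-1/(d+4))
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

n = 1000
c_scott = scott_factor(n)        # exactly n^(-1/5) in one dimension
c_silv = silverman_factor(n)     # about 1.059 * n^(-1/5) in one dimension
print(c_silv / c_scott)
```

In one dimension the ratio of the two factors is (4/3)^{1/5} ≈ 1.059, which is the "relatively small difference in the constant" mentioned above.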

[The binwidth estimate more usually associated with Scott is the one from his 1979 paper[1] (3.49 s n^{-1/3}); see, e.g., Wikipedia (scroll down a little) or R's `nclass.scott`.]
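That 1979 histogram rule is simple enough to state as a one-liner (a sketch for illustration, not Scott's or R's code):

```python
def scott_1979_binwidth(s, n):
    # Scott (1979) histogram binwidth: h = 3.49 * s * n^(-1/3),
    # where s is the sample standard deviation and n the sample size
    return 3.49 * s * n ** (-1.0 / 3.0)

h = scott_1979_binwidth(1.0, 1000)  # for n = 1000, s = 1: h = 0.349
```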

The 1.059 in what the code calls the “Scott estimate” is in the (earlier) book by Silverman (see p. 45 of the Silverman reference at your link; Scott’s derivation of it is on pp. 130-131 of the book they refer to). It comes from a normal-theory estimate.

The optimum bandwidth (in integrated mean square error terms) is a function of the integrated squared second derivative of the density, and 1.059\sigma comes out of that calculation for a normal, but in many cases that’s a good deal wider than is optimal for other distributions.
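As a quick check of where 1.059 comes from: at the normal, the IMSE-optimal bandwidth works out to h = (4/(3n))^{1/5} \sigma, so the leading constant is (4/3)^{1/5}:

```python
# normal-reference optimal bandwidth: h = (4 / (3 n))^(1/5) * sigma,
# so the constant multiplying sigma * n^(-1/5) is (4/3)^(1/5)
c = (4.0 / 3.0) ** 0.2
print(round(c, 3))  # 1.059
```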

The A term is an estimate of \sigma (sort of a robustified estimate, in a way that reduces the tendency for it to be too large if there are outliers/skewness/heavy tails). See eq 3.30 on p47, justified on p46-7.
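Silverman's A can be sketched in a few lines (a hypothetical helper using the standard min(std, IQR/1.34) form, not the SciPy source):

```python
import statistics

def robust_scale(x):
    # Silverman's A: min(sample std, IQR / 1.34); the IQR-based term keeps
    # the scale estimate from being inflated by outliers or heavy tails
    s = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return min(s, (q3 - q1) / 1.34)

x = [0.1, 0.2, 0.3, 0.4, 0.5, 10.0]  # one gross outlier inflates the std
A = robust_scale(x)
print(A)  # much smaller than statistics.stdev(x)
```

With the outlier present, the IQR-based term wins and A stays close to the scale of the bulk of the data.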

For similar reasons to those I suggested before, Silverman goes on to suggest reducing 1.059 (in fact he actually uses 1.06 throughout, not 1.059 — as does Scott in his book). He chooses a reduced value that loses no more than 10% efficiency on IMSE at the normal, which is where the 0.9 comes from.

So both those bandwidths are based on the IMSE-optimal bandwidth at the normal, one right at the optimum, the other about 15% smaller (to get within 90% of the efficiency of the optimum at the normal). [I’d call *both* of them “Silverman” estimates. I have no idea why they name the 1.059 one for Scott.]

In my opinion, both are far too large. If obtaining estimates of the density that are optimal in the IMSE sense were what I wanted to do, I wouldn’t want to use histograms for that purpose anyway.

Histograms should err on the noisier side (let the eye do the necessary smoothing). I nearly always double (or more than double) the default number of bins these kinds of rules give. So I wouldn’t use 1.06 or 0.9; I’d tend to use something around 0.5, maybe less at really large sample sizes.
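Following that advice, shrinking the constant in the rule of thumb is a one-line change (a hypothetical helper, not code from SciPy):

```python
def rot_bandwidth(A, n, c=0.9):
    # rule-of-thumb bandwidth h = c * A * n^(-1/5); Silverman's default
    # uses c = 0.9, while something like c = 0.5 gives a noisier estimate
    return c * A * n ** (-0.2)

n, A = 500, 1.0
h_default = rot_bandwidth(A, n)       # Silverman's 0.9 rule
h_noisy = rot_bandwidth(A, n, c=0.5)  # roughly half as much smoothing
```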

There’s really very little to choose between them, since both give far too few bins to be much use for finding what’s going on in the data (on which, at least at small sample sizes, see here).

[1]: Scott, D.W. (1979), “On optimal and data-based histograms,” *Biometrika*, **66**, 605-610.

**Attribution**
*Source: Link, Question Author: xrfang, Answer Author: Glen_b*