I was reading the Wikipedia article on the Jeffreys prior and saw that, after each example, it describes how a variance-stabilizing transformation turns the Jeffreys prior into a uniform prior.
As an example, for the Bernoulli case, it states that for a coin that is heads with probability γ∈[0,1], the Bernoulli trial model yields that the Jeffreys prior for the parameter γ is: p(γ) ∝ 1/√(γ(1−γ)).
It then states that this is a beta distribution with α = β = 1/2. It also states that if γ = sin²(θ), then the Jeffreys prior for θ is uniform on the interval [0, π/2].
I recognize the transformation as that of a variance-stabilizing transformation. What confuses me is:
Why would a variance-stabilizing transformation result in a uniform prior?
Why would we even want a uniform prior? (since it seems it may be more susceptible to being improper)
In general, I’m not quite sure why the squared-sine transformation is given and what role it plays. Would anyone have any ideas?
The Jeffreys prior is invariant under reparametrization. For that reason, many Bayesians consider it to be a “non-informative prior”. (Hartigan showed that there is a whole space of such priors J^α H^β for α + β = 1, where J is Jeffreys’ prior and H is Hartigan’s asymptotically locally invariant prior; see Invariant Prior Distributions.)
It is an often-repeated falsehood that the uniform prior is non-informative. After an arbitrary transformation of your parameters, a uniform prior on the new parameters means something completely different from a uniform prior on the old ones. If an arbitrary change of parametrization affects your prior, then your prior is clearly informative.
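To make the non-invariance concrete, here is a minimal simulation sketch. The reparametrization φ = γ² is my own arbitrary choice for illustration (it is not from the question), and `bin_fractions` is a hypothetical helper. The idea is to draw γ from a uniform prior and look at how the implied draws of φ spread over [0, 1].

```python
import random

def bin_fractions(samples, n_bins=5):
    """Fraction of samples falling in each of n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for x in samples:
        idx = min(int(x * n_bins), n_bins - 1)  # clamp x == 1.0 into the last bin
        counts[idx] += 1
    return [c / len(samples) for c in counts]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Uniform prior on gamma in [0, 1].
gamma = [rng.random() for _ in range(100_000)]

# Assumed (arbitrary) reparametrization: phi = gamma ** 2, also in [0, 1].
phi = [g ** 2 for g in gamma]

gamma_bins = bin_fractions(gamma)
phi_bins = bin_fractions(phi)

print("gamma:", [round(f, 3) for f in gamma_bins])  # roughly equal bins
print("phi:  ", [round(f, 3) for f in phi_bins])    # mass piles up near 0
```

The bins for γ come out roughly equal, while the bins for φ are heavily skewed toward 0, so "uniform on γ" is a definite, non-flat statement about φ.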
Using the Jeffreys prior is, by definition, equivalent to using a flat prior after applying the variance-stabilizing transformation.
From a mathematical standpoint, using the Jeffreys prior and using a flat prior after applying the variance-stabilizing transformation are equivalent. From a human standpoint, the latter is probably nicer because the parameter space becomes “homogeneous”, in the sense that differences mean the same thing in every direction no matter where you are in the parameter space.
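For the Bernoulli case in the question, this equivalence can be checked numerically. The sketch below assumes the transformation γ = sin²(θ) from the question, so θ = arcsin(√γ). It draws γ from Beta(1/2, 1/2), which is the Jeffreys prior, and checks whether θ looks uniform on [0, π/2]. The sample size, bin count, and seed are arbitrary choices of mine.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# Jeffreys prior for the Bernoulli parameter: Beta(1/2, 1/2).
gamma = [rng.betavariate(0.5, 0.5) for _ in range(n)]

# Invert gamma = sin^2(theta) on [0, pi/2].
theta = [math.asin(math.sqrt(g)) for g in gamma]

# Histogram theta into equal-width bins on [0, pi/2].
n_bins = 5
width = (math.pi / 2) / n_bins
counts = [0] * n_bins
for t in theta:
    counts[min(int(t / width), n_bins - 1)] += 1  # clamp the right endpoint
theta_bins = [c / n for c in counts]

print([round(f, 3) for f in theta_bins])  # each bin should be close to 1/5
```

The reason it works is the change-of-variables formula: p(θ) = p(γ)·|dγ/dθ| ∝ (1 / (sin θ cos θ)) · (2 sin θ cos θ) = 2, which is constant in θ.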
Consider your Bernoulli example. Isn’t it a little bit weird that scoring 99% on a test is the same distance from 90% as 59% is from 50%? After your variance-stabilizing transformation the former pair are more separated, as they should be. It matches our intuition about actual distances in the space. (Mathematically, the variance-stabilizing transformation is making the curvature of the log-loss equal to the identity matrix.)
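As a quick sanity check of the test-score example, this sketch maps each score through θ = arcsin(√γ), the inverse of the γ = sin²(θ) transformation from the question, and compares the two gaps. The function name `stabilize` is just a label I picked.

```python
import math

def stabilize(gamma):
    """Map a Bernoulli probability gamma in [0, 1] to theta = arcsin(sqrt(gamma))."""
    return math.asin(math.sqrt(gamma))

# Both pairs are 0.09 apart on the original probability scale.
d_high = stabilize(0.99) - stabilize(0.90)  # gap near the edge of [0, 1]
d_mid = stabilize(0.59) - stabilize(0.50)   # gap near the middle

print(f"99% vs 90%: {d_high:.4f}")
print(f"59% vs 50%: {d_mid:.4f}")
print(f"ratio: {d_high / d_mid:.2f}")
```

On the θ scale the 99% vs 90% gap is roughly 0.22 while the 59% vs 50% gap is roughly 0.09, so the first pair ends up more than twice as far apart.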