I was reading about the Jeffreys prior on Wikipedia (Jeffreys prior) and saw that after each example, it describes how a variance-stabilizing transformation turns the Jeffreys prior into a uniform prior.
As an example, for the Bernoulli case, it states that for a coin that is heads with probability γ∈[0,1], the Bernoulli trial model yields that the Jeffreys prior for the parameter γ is:
p(γ) ∝ 1/√(γ(1−γ))
It then states that this is a beta distribution with α = β = 1/2. It also states that if γ = sin²(θ), then the Jeffreys prior for θ is uniform on the interval [0, π/2].
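The uniformity claim can be checked numerically with the change-of-variables formula: if γ = sin²(θ), then |dγ/dθ| = sin(2θ), and multiplying the (unnormalized) Jeffreys density by this Jacobian should give a constant on (0, π/2). A minimal sketch:

```python
import math

def jeffreys_density(g):
    # Unnormalized Jeffreys prior for the Bernoulli parameter gamma:
    # p(gamma) proportional to 1 / sqrt(gamma * (1 - gamma))
    return 1.0 / math.sqrt(g * (1.0 - g))

# Change of variables gamma = sin^2(theta), so |d gamma / d theta| = sin(2*theta).
# The induced density on theta should be constant over (0, pi/2).
for theta in [0.1, 0.5, 1.0, 1.4]:
    gamma = math.sin(theta) ** 2
    induced = jeffreys_density(gamma) * abs(math.sin(2.0 * theta))
    print(round(induced, 6))  # prints 2.0 every time, i.e. a flat (unnormalized) density
```

Algebraically this is no surprise: 1/√(sin²θ cos²θ) · 2 sinθ cosθ = 2 for every θ in (0, π/2).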
I recognize the transformation as a variance-stabilizing transformation. What confuses me is:
Why would a variance-stabilizing transformation result in a uniform prior?
Why would we even want a uniform prior? (since it seems it may be more susceptible to being improper)
In general, I’m not quite sure why the squared-sine transformation is given and what role it plays. Would anyone have any ideas?
Answer
The Jeffreys prior is invariant under reparametrization. For that reason, many Bayesians consider it to be a “non-informative prior”. (Hartigan showed that there is a whole space of such priors, J^α H^β with α + β = 1, where J is the Jeffreys prior and H is Hartigan’s asymptotically locally invariant prior. — Invariant Prior Distributions)
It is an often-repeated falsehood that the uniform prior is non-informative. After an arbitrary transformation of your parameters, a uniform prior on the new parameters means something completely different. If an arbitrary change of parametrization affects your prior, then your prior is clearly informative.

Using the Jeffreys prior is, by definition, equivalent to using a flat prior after applying the variance-stabilizing transformation.

From a mathematical standpoint, using the Jeffreys prior and using a flat prior after applying the variance-stabilizing transformation are equivalent. From a human standpoint, the latter is probably nicer because the parameter space becomes “homogeneous” in the sense that differences are all the same in every direction no matter where you are in the parameter space.
Consider your Bernoulli example. Isn’t it a little bit weird that scoring 99% on a test is the same distance from 90% as 59% is from 50%? After the variance-stabilizing transformation, the former pair are more separated, as they should be. It matches our intuition about actual distances in the space. (Mathematically, the variance-stabilizing transformation is making the curvature of the log-loss equal to the identity matrix.)
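The distance claim is easy to verify. For the Bernoulli model the variance-stabilizing transformation is θ = arcsin(√γ), and under it the 99%-vs-90% gap comes out more than twice as wide as the 59%-vs-50% gap, even though both pairs differ by 0.09 on the raw scale:

```python
import math

def vst(gamma):
    # Variance-stabilizing transformation for the Bernoulli parameter:
    # theta = arcsin(sqrt(gamma)), mapping [0, 1] onto [0, pi/2]
    return math.asin(math.sqrt(gamma))

# Raw gaps are identical...
print(f"{0.99 - 0.90:.2f} vs {0.59 - 0.50:.2f}")          # 0.09 vs 0.09
# ...but the transformed gaps are not:
print(f"{vst(0.99) - vst(0.90):.3f}")  # 0.222
print(f"{vst(0.59) - vst(0.50):.3f}")  # 0.090
```

So on the stabilized scale, moving from 90% to 99% really is the bigger step, in line with the intuition in the answer.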
Attribution
Source: Link, Question Author: user1398057, Answer Author: Neil G