I notice in statistics/machine learning methods, a distribution is often approximated by a Gaussian, and then that Gaussian is used for sampling. They start by computing the first two moments of the distribution, and use those to estimate μ and σ2. Then they can sample from that Gaussian.
It seems to me the more moments I calculate, the better I ought to be able to approximate the distribution I wish to sample.
What if I calculate 3 moments…how can I use those to sample from the distribution? And can this be extended to N moments?
Three moments don’t determine a distributional form; if you choose a distribution-famiy with three parameters which relate to the first three population moments, you can do moment matching (“method of moments”) to estimate the three parameters and then generate values from such a distribution. There are many such distributions.
Sometimes even having all the moments isn’t sufficient to determine a distribution. If the moment generating function exists (in a neighborhood of 0) then it uniquely identifies a distribution (you could in principle do an inverse Laplace transform to obtain it).
[If some moments are not finite this would mean the mgf doesn’t exist, but there are also cases where all moments are finite but the mgf still doesn’t exist in a neighborhood of 0.]
Given there’s a choice of distributions, one might be tempted to consider a maximum entropy solution with the constraint on the first three moments, but there’s no distribution on the real line that attains it (since the resulting cubic in the exponent will be unbounded).
How the process would work for a specific choice of distribution
We can simplify the process of obtaining a distribution matching three moments by ignoring the mean and variance and working with a scaled third moment — the moment-skewness (γ1=μ3/μ3/22).
We can do this because having selected a distribution with the relevant skewness, we can then back out the desired mean and variance by scaling and shifting.
Let’s consider an example. Yesterday I created a large data set (which still happens to be in my R session) whose distribution I haven’t tried to calculate the functional form of (it’s a large set of values of the log of the sample variance of a Cauchy at n=10). We have the first three raw moments as
1.519, 3.597 and 11.479 respectively, or correspondingly a mean of 1.518, a standard deviation* of 1.136 and a skewness of 1.429 (so these are sample values from a large sample).
Formally, method of moments would attempt to match the raw moments, but the calculation is simpler if we start with the skewness (turning solving three equations in three unknowns into solving for one parameter at a time, a much simpler task).
* I am going to handwave away the distinction between using an n-denominator on the variance – as would correspond to formal method of moments – and an n-1 denominator and simply use sample calculations.
This skewness (~1.43) indicates we seek a distribution which is right-skew. I could choose, for example, a shifted lognormal distribution (three parameter lognormal, shape σ, scale μ and location-shift γ) with the same moments. Let’s begin by matching the skewness. The population skewness of a two parameter lognormal is:
So let’s start by equating that to the desired sample value to obtain an estimate of σ2, ˜σ2, say.
Note that γ21 is (τ+2)2(τ−1) where τ=eσ2. This then yields a simple cubic equation τ3+3τ2−4=γ21. Using the sample skewness in that equation yields ˜τ≈1.1995 or ˜σ2≈0.1819. (The cubic has only one real root so there’s no issue with choosing between roots; nor is there any risk of choosing the wrong sign on γ1 — we can flip the distribution left-for-right if we need negative skewness)
We can then in turn solve for μ by matching the variance (or standard deviation) and then for the location parameter by matching the mean.
But we could as easily have chosen a shifted-gamma or a shifted-Weibull distribution (or a shifted-F or any number of other choices) and run through essentially the same process. Each of them would be different.
[For the sample I was dealing with, a shifted gamma would probably have been a considerably better choice than a shifted lognormal, since the distribution of the logs of the values was left skew and the distribution of their cube root was very close to symmetric; these are consistent with what you will see with (unshifted) gamma densities, but a left-skewed density of the logs cannot be achieved with any shifted lognormal.]
One could even take the skewness-kurtosis diagram in a Pearson plot and draw a line at the desired skewness and thereby obtain a two-point distribution, sequence of beta distributions, a gamma distribution, a sequence of beta-prime distributions, an inverse-gamma disribution and a sequence of Pearson type IV distributions all with the same skewness.
We can see this illustrated in a skewness-kurtosis plot (Pearson plot) below (note that β1=γ21 and β2 is the kurtosis), with the regions for the various Pearson-distributions marked in.
The green horizontal line represents γ21=2.042, and we see it pass through each of the mentioned distribution-families, each point corresponding to a different population kurtosis. (The dashed curve represents the lognormal, which is not a Pearson-family distribution; its intersection with the green line marks the particular lognormal-shape we identified. Note that the dashed curve is purely a function of σ.)
Moments don’t pin distributions down very well, so even if you specify many moments, there will still be a lot of different distributions (particularly in relation to their extreme-tail behavior) that will match them.
You can of course choose some distributional family with at least four parameters and attempt to match more than three moments; for example the Pearson distributions above allow us to match the first four moments, and there are other choices of distributions that would allow similar degree of flexibility.
One can adopt other strategies to choose distributions that can match distributional features – mixture distributions, modelling the log-density using splines, and so forth.
Frequently, however, if one goes back to the initial purpose for which one was trying to find a distribution, it often turns out there’s something better that can be done than the sort of strategy outlined here.