In one of the exercises for my course, we’re using a Kaggle medical dataset.
The exercise says:
> We want to model the distribution of individual charges, and we also really want to be able to capture our uncertainty about that distribution so we can better capture the range of values we might see.

Loading the data and performing an initial view:

> We may suspect from the above that there is some sort of exponential-like distribution at play here. … The insurance claim charges may possibly be multimodal. The gamma distribution may be applicable, and we could test this for the distribution of charges that weren’t insurance claims first.
I looked up “Gamma distribution” and found “a continuous, positive-only, unimodal distribution that encodes the time required for «alpha» events to occur in a Poisson process with mean arrival time of «beta»”
There’s no time involved here, just unrelated charges, either insured or not.
Why would they choose a gamma distribution?
When you’re considering simple parametric models for the conditional distribution of the data (i.e., the distribution within each group, or the expected distribution for each combination of predictor values), and that distribution is positive and continuous, the two common choices are Gamma and log-Normal. Besides matching the domain of the data (real numbers greater than zero), these distributions are computationally convenient and often make mechanistic sense.
- The log-Normal distribution is easily derived by exponentiating a Normal distribution (conversely, log-transforming log-Normal deviates gives Normal deviates). From a mechanistic point of view, the log-Normal arises via the Central Limit Theorem when each observation reflects the product of a large number of iid random variables. Once you’ve log-transformed the data, you have access to a huge variety of computational and analytical tools (e.g., anything assuming Normality or using least-squares methods).
- As your question points out, one way that a Gamma distribution arises is as the distribution of waiting times until n independent events occur, where events happen at a constant rate λ (so the individual waiting times are exponentially distributed). I can’t easily find a reference for a mechanistic model of Gamma distributions of insurance claims, but it also makes sense to use a Gamma distribution from a phenomenological (i.e., data-description/computational-convenience) point of view. The Gamma distribution is part of the exponential family (which includes the Normal but not the log-Normal), which means that all of the machinery of generalized linear models is available; it also has a particularly convenient form for analysis.
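A quick simulation illustrates both mechanisms described above (the sample size, rate, and Normal parameters below are arbitrary choices for illustration, not values taken from the Kaggle data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000

# Gamma as a sum of exponential waiting times: the sum of k iid
# Exponential(rate=lam) variables is Gamma(shape=k, scale=1/lam).
k, lam = 3, 0.5
waits = rng.exponential(scale=1 / lam, size=(n, k)).sum(axis=1)
ks_gamma = stats.kstest(waits, stats.gamma(a=k, scale=1 / lam).cdf)

# log-Normal as an exponentiated Normal: exponentiating Normal(mu, sigma)
# deviates gives log-Normal deviates, and log-transforming recovers Normality.
mu, sigma = 8.0, 0.9
ln = np.exp(rng.normal(mu, sigma, size=n))
ks_lognorm = stats.kstest(np.log(ln), stats.norm(mu, sigma).cdf)

# Small KS statistics indicate the simulated samples match the claimed
# theoretical distributions.
print(ks_gamma.statistic, ks_lognorm.statistic)
```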
There are other reasons one might pick one or the other – for example, the “heaviness” of the tail of the distribution, which might be important in predicting the frequency of extreme events. There are plenty of other positive, continuous distributions (e.g., see this list), but they tend to be used in more specialized applications.
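To make the tail comparison concrete, one can moment-match a Gamma and a log-Normal and compare their survival probabilities far out in the right tail (the mean, variance, and threshold below are hypothetical, not estimated from the data):

```python
import numpy as np
from scipy import stats

# Hypothetical target moments: give both distributions the same mean/variance.
m, v = 13000.0, 12000.0**2

# Gamma: mean = shape * scale, var = shape * scale**2
shape = m**2 / v
scale = v / m
gamma = stats.gamma(a=shape, scale=scale)

# log-Normal: mean = exp(mu + s^2/2), var = mean^2 * (exp(s^2) - 1)
s2 = np.log(1 + v / m**2)
mu = np.log(m) - s2 / 2
lognorm = stats.lognorm(s=np.sqrt(s2), scale=np.exp(mu))

x = 100_000.0  # an "extreme" charge well above the mean
# The moment-matched log-Normal puts noticeably more mass in the far tail.
print(gamma.sf(x), lognorm.sf(x))
```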
Very few of these distributions will capture the multimodality you see in the marginal distributions above, but multimodality may be explained by the data being grouped into categories described by observed categorical predictors. If there are no observable predictors that explain the multimodality, one might choose to fit a finite mixture model based on a mixture of a (small, discrete) number of positive continuous distributions.
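One way to sketch such a mixture fit is to run EM for a two-component Normal mixture on the log scale, which corresponds to a mixture of log-Normals on the original scale. The synthetic “charges” and their group parameters below are invented for illustration, not taken from the Kaggle data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic bimodal "charges": two invented log-Normal groups standing in for
# (say) non-claim vs. claim charges, with hypothetical log-means 8.5 and 10.2.
charges = np.concatenate([
    np.exp(rng.normal(8.5, 0.4, 800)),
    np.exp(rng.normal(10.2, 0.3, 400)),
])
x = np.log(charges)  # a Normal mixture here = a log-Normal mixture for charges

# Plain EM for a two-component 1-D Gaussian mixture.
w = np.array([0.5, 0.5])              # mixture weights
mu = np.array([x.min(), x.max()])     # crude initial means
sd = np.array([x.std(), x.std()])     # initial standard deviations
for _ in range(200):
    # E-step: responsibility of each component for each point
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted parameter updates
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(np.sort(mu))  # fitted means should sit near the two log-scale group means
```

With real data, the same idea applies after replacing the synthetic sample with the observed charges; a library implementation (e.g., a Gaussian mixture on log-charges) would be the practical choice.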