The odds of an event are the ratio of the probability of the event to the probability of its complement:
$$\text{odds}(X) = \frac{P(X)}{1-P(X)}$$
An odds ratio (OR) is the ratio of the odds of an event in one group (say, A) versus the odds of an event in another group (say, B):
$$\mathrm{OR}(X)_{A\text{ vs }B} = \frac{\;\dfrac{P(X|A)}{1-P(X|A)}\;}{\;\dfrac{P(X|B)}{1-P(X|B)}\;}$$
A probability ratio¹ (PR, aka prevalence ratio) is the ratio of the probability of an event in one group (A) versus the probability of an event in another group (B):
$$\mathrm{PR}(X)_{A\text{ vs }B} = \frac{P(X|A)}{P(X|B)}$$
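As a small numerical illustration of the two definitions above, here is a sketch in Python. The counts are made up for illustration, and the helper names (`odds`, `odds_ratio`, `probability_ratio`) are hypothetical, not taken from any library.

```python
def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)


def odds_ratio(p_a, p_b):
    """Ratio of the odds in group A to the odds in group B."""
    return odds(p_a) / odds(p_b)


def probability_ratio(p_a, p_b):
    """Ratio of the probability in group A to the probability in group B."""
    return p_a / p_b


# Hypothetical counts: events out of group size, for groups A and B.
events_a, n_a = 30, 100
events_b, n_b = 10, 100

p_a = events_a / n_a  # estimate of P(X|A)
p_b = events_b / n_b  # estimate of P(X|B)

pr = probability_ratio(p_a, p_b)
or_ = odds_ratio(p_a, p_b)

print(f"PR = {pr:.3f}")
print(f"OR = {or_:.3f}")
```

With these numbers the OR is noticeably further from 1 than the PR; the two are close only when the event is rare in both groups.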
An incidence proportion can be thought of as quite similar to a probability (although technically it is a rate of occurrence over time), and we contrast incidence proportions (and incidence densities, for that matter) using relative risks (aka risk ratios, RR), along with other measures like risk differences:
$$\mathrm{RR}_{A\text{ vs }B} = \frac{\text{incidence proportion}(X|A)}{\text{incidence proportion}(X|B)}$$
Why are relative probability contrasts so often represented using relative odds instead of probability ratios, when risk contrasts are represented using relative risks instead of odds ratios (calculated using incidence proportions instead of probabilities)?
My question is foremost about why prefer ORs to PRs, rather than why not use incidence proportions to calculate a quantity like an OR. Edit: I am aware that risks are sometimes contrasted using a risk odds ratio.
¹ As near as I can tell… I do not actually encounter this term in my discipline other than very rarely.
I think the reason that OR is far more common than PR comes down to the standard ways in which different types of quantity are typically transformed.
When working with normal quantities, like temperature, height, or weight, the standard assumption is that they are approximately Normal. When you take contrasts between these sorts of quantities, a good thing to do is take the difference. Equally, if you fit a regression model to them, you don’t expect a systematic change in the variance.
When you are working with quantities that are “rate-like”, that is, they are bounded below at zero and typically come from calculating things like “number per day”, then taking raw differences is awkward. Since the variance of any sample is proportional to the rate, the residuals of any fit to count or rate data won’t generally have constant variance. However, if we work with the log of the mean, then the variances will be “stabilized”; that is, they add rather than multiply. Thus rates are typically handled on the log scale. Then when you form contrasts you are taking differences of logs, and that is equivalent to taking a ratio.
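To make the last step concrete, here is a minimal sketch showing that a difference of logged rates is just the log of the rate ratio. The event counts and person-days are invented for illustration.

```python
import math

# Hypothetical data: events observed over a total amount of person-days.
events_a, person_days_a = 24, 200
events_b, person_days_b = 9, 300

rate_a = events_a / person_days_a  # events per person-day in group A
rate_b = events_b / person_days_b  # events per person-day in group B

# Contrast on the log scale: a difference of logs ...
log_diff = math.log(rate_a) - math.log(rate_b)

# ... which is the log of the ratio of the raw rates.
rate_ratio = rate_a / rate_b

print(f"difference of logs = {log_diff:.4f}")
print(f"exp(difference)    = {math.exp(log_diff):.4f}")
print(f"rate ratio         = {rate_ratio:.4f}")
```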
When you are working with probability-like quantities, or fractions of a cake, you are now bounded above and below. You also have an arbitrary choice of what you code as 1 and 0 (or more categories in multi-class models). Differences between probabilities are invariant to switching 1 and 0 (only the sign flips), but they share the problem rates have: the variance changes with the mean. Logging them wouldn’t give you invariance to the coding of 1s and 0s, so instead we tend to logit them (log-odds). Working with log-odds you are back on the full real line, the variance is the same all along the line, and differences of log-odds behave a bit like normal quantities.
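The point about coding can be checked numerically. The sketch below uses made-up probabilities and compares three contrasts (raw difference, difference of logs, difference of logits) before and after swapping which outcome is coded as the event.

```python
import math


def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))


# Hypothetical probabilities of the event X in groups A and B.
p_a, p_b = 0.3, 0.1

# Contrasts when X is coded as the event (1).
diff_x = p_a - p_b
log_pr_x = math.log(p_a) - math.log(p_b)
log_or_x = logit(p_a) - logit(p_b)

# Contrasts after recoding, so that "not X" is the event.
q_a, q_b = 1 - p_a, 1 - p_b
diff_not_x = q_a - q_b
log_pr_not_x = math.log(q_a) - math.log(q_b)
log_or_not_x = logit(q_a) - logit(q_b)

print(f"difference: {diff_x:+.4f} vs {diff_not_x:+.4f}")
print(f"log PR:     {log_pr_x:+.4f} vs {log_pr_not_x:+.4f}")
print(f"log OR:     {log_or_x:+.4f} vs {log_or_not_x:+.4f}")
```

The raw difference and the log OR only change sign under recoding, while the log PR changes magnitude as well (about 1.10 versus about -0.25 here).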
Normal-type quantities:
- Variance does not depend on $\mu$
- Canonical link for GLM is the identity, $x$
- Transformation not helpful

Rate-type quantities:
- Variance is proportional to the rate $\lambda$
- Canonical link for GLM is $\ln(x)$
- Logging should result in residuals of constant variance

Probability-type quantities:
- Variance is proportional to $p(1-p)$
- Canonical link for GLM is the logit, $\ln\frac{p}{1-p}$
- Taking logit (log-odds) of data should result in residuals of constant variance
So I think that the reason you see lots of RR, but very little PR is that PR is constructed from probability/Binomial type quantities, while RR is constructed from rate type quantities. In particular note that incidence can exceed 100% if people can catch the disease multiple times per year, but probability can never exceed 100%.
Is odds the only way?
No, the general messages above are just useful rules of thumb, and these “canonical” forms are just mathematically convenient, which is why you tend to see them most. The probit function is used instead for probit regression, so in principle differences of probits would be just as valid as ORs. Similarly, despite best efforts to word it carefully, the text above still sort of suggests that logging or logiting your raw data and then fitting a model to it is a good idea. It’s not a terrible idea, but there are better things that you can do (GLM etc.).
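As a rough illustration of the probit alternative, the sketch below computes both contrasts for the same pair of made-up probabilities, using only the Python standard library (`statistics.NormalDist().inv_cdf` is the inverse of the standard Normal CDF). The helper names are hypothetical.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))


def probit(p):
    """Inverse standard Normal CDF of a probability p, for 0 < p < 1."""
    return NormalDist().inv_cdf(p)


# Hypothetical probabilities of the event in groups A and B.
p_a, p_b = 0.3, 0.1

logit_diff = logit(p_a) - logit(p_b)     # log odds ratio
probit_diff = probit(p_a) - probit(p_b)  # difference of probits

print(f"difference of logits  = {logit_diff:.4f}")
print(f"difference of probits = {probit_diff:.4f}")
```

Both transformations map probabilities onto the whole real line and are symmetric about $p = 0.5$, so either contrast only flips sign when the coding of 1 and 0 is swapped; they simply live on different scales.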