- When dichotomising variables, what information is lost in the process?
- How does a dichotomisation help in the analyses?
What information is lost: It depends on the variable. Generally, by dichotomizing, you’re asserting that the effect of one variable on another is a single step: one level of effect below the cutoff and another above it. For example, consider a continuous measure of exposure to a pollutant in a study on cancer. If you dichotomize it to “High” and “Low”, you assert that those are the only two values that matter. There is one risk of cancer in the high group, and another in the low group. But what if the risk rises steadily for a while, then flattens out, then rises again before finally spiking at high values? All of that is lost.
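To make that concrete, here is a minimal sketch in Python. The dose-response curve, the exposure scale, and the cutoff of 5 are all made up for illustration; nothing here comes from real data. It only shows that once exposure is collapsed to High/Low, every shape the risk takes within each group is replaced by a single number.

```python
import numpy as np

def true_risk(exposure):
    """Hypothetical risk curve: rises, flattens, rises again, then spikes."""
    # Piecewise-linear curve through made-up (exposure, risk) points.
    return np.interp(exposure, [0, 3, 6, 8, 10], [0.02, 0.10, 0.10, 0.18, 0.60])

exposure = np.linspace(0, 10, 101)   # hypothetical exposure scale
risk = true_risk(exposure)

cutoff = 5.0                         # arbitrary High/Low cutoff (an assumption)
high = exposure >= cutoff

# After dichotomizing, each group is summarised by one average risk.
risk_low = risk[~high].mean()
risk_high = risk[high].mean()
dichotomized_risk = np.where(high, risk_high, risk_low)

# How much the real risk varies inside the "High" group,
# which the High/Low variable can no longer express.
spread_within_high = risk[high].max() - risk[high].min()

print("distinct risk values before:", len(np.unique(risk)))
print("distinct risk values after: ", len(np.unique(dichotomized_risk)))
print("range of true risk hidden inside 'High':", spread_within_high)
```

In this toy curve the plateau, the second rise, and the spike all sit inside "High", so the dichotomized variable reports one averaged risk for all of them.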
What you gain: It’s easier. Dichotomous variables are often much easier to deal with statistically. There are legitimate reasons to do it, for instance when a continuous variable falls into two clear groupings anyway, but I tend to avoid dichotomizing unless it’s a natural form of the variable in the first place. It is also often useful to have a dichotomized form of a variable if your field already dichotomizes it. For example, many consider a CD4 cell count of less than 400 to be a critical threshold for HIV. As such, I’d often have a 0/1 variable for Above/Below 400, though I would retain the continuous CD4 count variable as well. This helps keep your study comparable with others.
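In practice, keeping both forms is a one-liner. A small sketch with pandas, where the column names and the CD4 values are invented for the example:

```python
import pandas as pd

# Made-up CD4 counts, purely for illustration.
patients = pd.DataFrame({"cd4": [120, 399, 400, 650, 910]})

THRESHOLD = 400  # the threshold mentioned above

# Add the 0/1 indicator (1 = below threshold) alongside the continuous
# variable rather than overwriting it.
patients["cd4_below_400"] = (patients["cd4"] < THRESHOLD).astype(int)

print(patients)
```

Both columns stay available, so you can report the dichotomized result for comparison with other studies and still model the continuous count.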
I’ll disagree slightly with Peter. While dividing a continuous variable up into categories is often far more sensible than a crude dichotomization, I’m rather opposed to quantile categorization. It is very difficult to give such categories a meaningful interpretation, because the cut points come from the sample rather than from the subject matter. I think your first step should be to see if there are biologically or clinically well-supported categorizations you can use, and only once those options are exhausted should you use quantiles.
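A rough sketch of the difference, again with pandas. The two cohorts are invented, and the cut points of 200, 350 and 500 are only an illustrative choice, not a recommendation:

```python
import pandas as pd

# Two made-up cohorts with different CD4 distributions.
cohort_a = pd.Series([90, 150, 220, 310, 380, 450, 520, 640, 800])
cohort_b = pd.Series([300, 340, 410, 480, 530, 610, 700, 850, 990])

# Illustrative fixed cut points (an assumption for this example).
clinical_bins = [0, 200, 350, 500, float("inf")]
clinical_labels = ["<200", "200-349", "350-499", "500+"]

def clinical_groups(cd4):
    """Assign fixed, pre-specified categories; the same in every sample."""
    return pd.cut(cd4, bins=clinical_bins, labels=clinical_labels, right=False)

def tertile_edges(cd4):
    """Return the tertile cut points, which are derived from the sample itself."""
    _, edges = pd.qcut(cd4, q=3, retbins=True)
    return edges

print(clinical_groups(cohort_a).value_counts(sort=False))
print("tertile edges, cohort A:", tertile_edges(cohort_a))
print("tertile edges, cohort B:", tertile_edges(cohort_b))
```

The fixed categories mean the same thing in both cohorts, while the "middle tertile" covers a different range of CD4 counts in each one, which is what makes quantile groups hard to interpret across studies.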