I’m struggling to find a method for reducing the number of categories in nominal or ordinal data.
For example, let’s say that I want to build a regression model on a dataset that has a number of nominal and ordinal factors. While I have no problems with this step, I often run into situations where a level of a nominal feature has no observations in the training set but does appear in the validation dataset. This naturally leads to an error when the model is presented with these previously unseen cases.
Another situation where I would like to combine categories is simply when there are too many categories with few observations.
So my questions are:
- While I realize it might be best to combine many nominal (and ordinal) categories based on the prior real-world background information they represent, are there systematic methods (preferably R packages) available?
- What guidelines and suggestions would you make regarding cut-off thresholds and so on?
- What are the most popular solutions in the literature?
- Are there strategies other than combining small nominal categories into a new “OTHERS” category?
Please feel free to chime in if you have other suggestions also.
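For concreteness, here is a minimal sketch of the baseline “OTHERS” strategy from the last bullet. It is written in plain Python purely for illustration (the same logic ports directly to R). The function name `fit_lumping`, the label `"OTHER"`, the toy data, and the cut-off of 5 observations are all assumptions made up for this example, not recommendations. The key point is that the set of retained levels is learned from the training data only, so a level that first shows up in the validation data falls into the catch-all category instead of raising an error.

```python
from collections import Counter


def fit_lumping(train_values, min_count=5, other_label="OTHER"):
    """Learn which levels to keep from the training data only.

    min_count is an arbitrary, illustrative threshold (an assumption).
    Returns the set of kept levels and a function that recodes any
    sequence of values, sending rare or unseen levels to other_label.
    """
    counts = Counter(train_values)
    kept = {level for level, n in counts.items() if n >= min_count}

    def transform(values):
        return [v if v in kept else other_label for v in values]

    return kept, transform


# Toy data (made up): "c" and "d" are rare, "e" never occurs in training.
train = ["a"] * 10 + ["b"] * 6 + ["c"] * 2 + ["d"]
valid = ["a", "c", "e"]

kept, lump = fit_lumping(train, min_count=5)
train_lumped = lump(train)
valid_lumped = lump(valid)

print(sorted(kept))
print(Counter(train_lumped))
print(valid_lumped)
```

Because the same `lump` function is applied to both datasets, the model only ever sees levels that were present (and sufficiently frequent) when it was fitted.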
This is a response to your second question.
I suspect the correct approach to these kinds of decisions will be determined largely by disciplinary norms and the expectations of the intended audience of your work. As a social scientist, I often work with survey (or survey-like) data and I always try to balance substantive and data-driven logics when I collapse ordinal scales or categorical variables. In other words, I’ll do my best to consider what combinations of items “hang together” in terms of their substance as well as the distribution of responses before I collapse the items.
Here’s a recent example of a specific (ordinal) survey question that involved a five-point frequency scale:
How often do you attend the meetings of a club or organization in your community?
- Never
- A few times a year
- Once a month
- A few times a month
- Once a week or more
I don’t have the data available to me at the moment, but the results were strongly skewed towards the “never” end of the scale. As a result, my co-author and I chose to pool responses into two groups: “Once a month or more” and “Less than once a month.” The resulting (binary) variable was more evenly distributed and reflected a meaningful distinction in practical terms: since many clubs and organizations don’t meet more than once a month, there are good reasons to believe that people who attend meetings at least that often are “active” members of such groups whereas those who attend less frequently (or never) are “inactive.”
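The recoding described above can be sketched as follows. This is only an illustration in plain Python: the response data are made up (the real data are not available), the lowest level is assumed to be labelled “Never” based on the description of the scale, and the names `SCALE`, `CUT`, and `collapse` are hypothetical.

```python
from collections import Counter

# Ordered levels of the scale, least frequent attendance first.
# The "Never" label is an assumption inferred from the description.
SCALE = [
    "Never",
    "A few times a year",
    "Once a month",
    "A few times a month",
    "Once a week or more",
]

# Substantively motivated cut point: attending at least monthly.
CUT = "Once a month"


def collapse(response, scale=SCALE, cut=CUT):
    """Map a five-point response onto the two pooled groups."""
    if scale.index(response) >= scale.index(cut):
        return "Once a month or more"
    return "Less than once a month"


# Made-up responses, skewed towards the low end of the scale.
responses = [
    "Never", "Never", "A few times a year",
    "Once a month", "Once a week or more", "Never",
]
binary = [collapse(r) for r in responses]

# Compare the distributions before and after pooling.
print(Counter(responses))
print(Counter(binary))
```

Keeping the ordered scale and the cut point as explicit constants documents the substantive decision and makes it easy to justify, or revisit, before any model is fitted.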
So in my experience, these decisions are at least as much art as science. That said, I usually try to make them before fitting any models, since I work in a discipline where anything else is viewed (negatively) as data mining and highly unscientific (fun times!).
With that in mind, it might help if you could say a little bit more about what sort of audience you have in mind for this work. It would also be in your best interests to review a few prominent methodology textbooks in your field as they can often clarify what passes for “normal” behavior among a given research community.