I am trying to group together different datasets using unsupervised algorithms (clustering). The problem is that I have many features (~500) and only a small number of cases (200-300).
So far I have only worked on classification problems, for which I always had labeled data as training sets. There I used some criterion (e.g. random.forest.importance or information.gain) to preselect features, and then a sequential forward selection over different learners to find the relevant ones.
Now I see that in the case of unsupervised learning I have neither a criterion for preselection nor the option of sequential forward selection (at least not in the mlr package).
I was wondering if I could run a principal component analysis beforehand to find a small number of features to feed to my clustering algorithm. Or do you have any other idea?
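For concreteness, the kind of pipeline I have in mind would look like this. This is a Python/scikit-learn sketch (my actual work is in R's mlr), with synthetic data whose shape matches my problem (300 cases, 500 features); the choice of 80% retained variance and 4 clusters is just for illustration:

```python
# Sketch: PCA for dimensionality reduction, then clustering on the components.
# Data is synthetic; shapes (300 x 500) mirror the problem described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 500))          # 300 cases, 500 features

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.8)              # keep enough PCs to explain 80% of variance
X_reduced = pca.fit_transform(X_scaled)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
```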
Ok, so after some research online I can update my question a bit:
First of all, I have read some articles that discourage the use of PCA before clustering algorithms, for two reasons:
The PCs are functions of all features, so it is hard to relate the result to the initial data set, which makes interpretation harder.
Moreover, if in truth only a very small fraction of your features are helpful for the clustering, there is no guarantee that those features also account for the largest variance among the samples (which is what the PCs capture).
So PCA is off the table…
Now I am back to my initial idea to do a sequential forward selection for clustering.
What performance measure would you recommend? (I thought about the Dunn-Index)
Which clustering algorithm would lead to clusters of roughly equal size? (With hierarchical clustering I usually get one cluster containing a single outlier and another containing everything else, so I need something that is somehow robust to outliers.)
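For the Dunn index, I haven't found an implementation in scikit-learn, so I would probably compute it by hand. A minimal sketch of my own (the usual definition: minimum between-cluster separation divided by maximum within-cluster diameter, higher is better):

```python
# Minimal Dunn index: min inter-cluster distance / max intra-cluster diameter.
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest diameter (max pairwise distance) within any single cluster.
    max_diam = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points belonging to different clusters.
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam
```

For two tight, well-separated clusters this gives a large value; overlapping clusters push it toward zero.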
Hope you guys can help me…
I have a few thoughts to share about dimension reduction in unsupervised learning problems. In answering, I've assumed that your interest is in a "high-touch" approach, with human involvement in cluster interpretation, as opposed to an automated, turnkey, "low-touch" black-box machine learning approach in which interpretation is deliberately de-emphasized. If it were the latter, why would you even be asking the question? Also, note that I've had a ton of experience running cluster solutions across a wide range of business environments over the years, including strategic B2C marketing, B2B tech arenas, and education policy (clustering students and schools).
First though, I do have a question about your comment concerning “grouping different datasets.” I didn’t know what you meant by that or how it might impact the approach and was hoping you could elaborate.
I would like to challenge your assumption in #1 above that solutions based on PCA are "hard to interpret." The reasons for running a PCA as a preliminary step in clustering have mostly to do with the hygiene of the resulting solution, insofar as many clustering algorithms are sensitive to feature redundancy. PCA collapses this redundancy into a manageable handful of components, thereby minimizing the feature-selection challenges you note. While it is true that the components output from a PCA blur the granularity and specificity of the individual features, this is a problem only if you rely solely on those components in analyzing the results. In other words, you are in no way locked into using only the components for cluster interpretation. Beyond that, you don't necessarily even need to care what the factor dimensions "mean." They are only an intermediate and (ultimately) disposable means to an end, facilitating an actionable solution. In making this point I differ from many practitioners, since teams can, will, and do spend weeks carefully building a "meaningful" factor solution. To me, this is an inefficient waste of client time and money.
At this point there will be a boatload of technical considerations to address. For one, if your PCA algorithm is not scale invariant (e.g., is based on OLS rather than ML), then any resulting PCA solution will be distorted, loading more heavily on the high-variance features. In these cases your features need to be preprocessed or transformed in some way to flatten this variance out. There are many possibilities here, including mean standardizing, range or IQR standardizing, ipsative scaling, and so on. Use whichever transformation delivers the best, most interpretable solution.
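Two of the transforms just mentioned can be sketched in a few lines of Python with scikit-learn (the feature scales here are made up purely to show the effect):

```python
# Sketch of two "variance-flattening" transforms: mean/SD standardizing
# and IQR (robust) standardizing. Feature scales are synthetic.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(1)
# Five features whose raw variances differ by orders of magnitude.
X = rng.normal(size=(250, 5)) * np.array([1, 10, 100, 1000, 10000])

X_std = StandardScaler().fit_transform(X)   # (x - mean) / sd
X_iqr = RobustScaler().fit_transform(X)     # (x - median) / IQR, outlier-resistant

# After standardizing, no single feature dominates the total variance.
```

The robust (IQR) variant is often the safer choice when, as in your case, a handful of outliers distort cluster shapes.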
Once a cluster solution is generated, interpretation is best motivated (in my experience) by ignoring the components and folding the original features back in, along with any additional descriptive information not directly used in the solution. At this point a few heuristics are the best guides to qualitative insight. This can be as easy as generating a spreadsheet that profiles your clusters, with a row for each feature, a column for each cluster containing that cluster's average or median on the feature, plus an additional column holding the grand mean for your total sample. By indexing the cluster averages for each feature against the grand mean (and multiplying by 100), you create a heuristic that works like an IQ score: around 100 is "normal" or average behavior, indexes of 120+ suggest a high likelihood that a feature is "true" of a cluster's behavior, and indexes of 80 or less flag features that are "not true" of a cluster. These 120+/80-or-less indexes act as proxy t-tests for the significance of a given feature in driving the solution. Of course, you can run between-group tests of significance and, depending on sample sizes, will get answers that vary around these quick-and-dirty rules of thumb.
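That profiling spreadsheet is easy to build in pandas. A sketch with hypothetical feature names and random cluster assignments (in practice the `cluster` column comes from your clustering algorithm):

```python
# Profile sheet: rows = features, columns = clusters, each cell indexed
# against the grand mean and multiplied by 100 (IQ-style heuristic).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(loc=5, scale=1, size=(200, 3)),
                  columns=["feat_a", "feat_b", "feat_c"])   # made-up features
df["cluster"] = rng.integers(0, 3, size=200)                # stand-in labels

cluster_means = df.groupby("cluster").mean()     # clusters x features
grand_mean = df.drop(columns="cluster").mean()   # per-feature grand mean
profile = (cluster_means / grand_mean * 100).T.round(0)     # features x clusters
# Read off: cells of 120+ are "true" of that cluster, 80 or less "not true".
```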
Ok, after all of that, suppose you're still opposed to using PCA as direct input into a clustering algorithm; the problem remains of how to select a reduced set of features. PCA can still be useful here, since running a PCA is like running a regression without a dependent variable: the top-loading features on each component can become the inputs to the clustering algorithm.
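In scikit-learn terms, that trick looks like the sketch below: the PCA is fit only to read off which original features load most heavily on each component, and the clusterer then sees those original features, not the components. The component count and top-3 cutoff are arbitrary choices for illustration:

```python
# Use PCA loadings only to pick a reduced feature set, then cluster on
# the original (untransformed) features rather than the components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(250, 40))                    # synthetic data

pca = PCA(n_components=5).fit(StandardScaler().fit_transform(X))

top_per_component = 3
selected = set()
for comp in pca.components_:                      # shape: (n_components, n_features)
    top = np.argsort(np.abs(comp))[::-1][:top_per_component]
    selected.update(top.tolist())                 # keep the top |loading| features

X_selected = X[:, sorted(selected)]               # feed this to the clusterer
```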
To your point about the large number of features and relatively small sample size, the typical rule of thumb in many "full information" multivariate analyses is a minimum of about 10 observations per feature. There are some specialized methods that can be leveraged to work around this challenge. For instance, partial least squares (PLS) was developed by Herman Wold (see, e.g., his book Theoretical Empiricism) for use in fields such as chemometrics, which face this precise issue. It is factor-analytic in nature but much less stringent in requiring a large n to generate the dimensions. Other solutions include the random-forest-like, "divide and conquer" machine learning approaches used with massive amounts of information. These methods are reviewed in this pdf http://www.wisdom.weizmann.ac.il/~harel/papers/Divide%20and%20Conquer.pdf
But suppose you've decided that you still want nothing to do with factor analysis and are dead set on running some kind of supervised, "sequential" selection process. In my view, the most important issue is less about finding a post-hoc performance metric (such as the Dunn index) and more about identifying a suitable proxy, i.e., a dependent variable, to even make this approach possible. That decision is entirely a function of your judgement and subject-matter expertise with respect to your data. There are no "best practices," much less easy answers, and given how you've described your data, it is no small challenge.
Once that decision is made, there are literally hundreds of possible variable-selection solutions to choose from; variable selection is a topic on which every statistician and their brother has published a paper. Your preferred approach, sequential forward selection, is fine.
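Once you do have a proxy dependent variable, sequential forward selection is a one-liner in scikit-learn (you mention mlr in R; this is the Python analogue). The `proxy_y` here is synthetic, standing in for whatever proxy your domain judgement produces, and the choice of five features and logistic regression is arbitrary:

```python
# Sequential forward selection against a proxy dependent variable.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 30))                    # synthetic features
proxy_y = (X[:, 0] + X[:, 1] > 0).astype(int)     # stand-in dependent variable

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,   # stop after five features
    direction="forward",      # add features one at a time
    cv=3,                     # score each candidate set by cross-validation
).fit(X, proxy_y)

keep = np.flatnonzero(sfs.get_support())          # indices of selected features
```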
It's worth noting that there are supervised learning models which fold a cluster solution into the algorithm itself. Examples include the large and highly flexible family of latent class (LC) models. The essence of LC models is that they are two-stage: in stage one a DV is defined and a regression model is built; in stage two, any heterogeneity in the residual output from that model (a single latent vector) is partitioned into latent "classes." There's an overview of LC modeling in this CV discussion here … Latent class multinomial logit model doubt
Hope this helps.