# Do I need to drop variables that are correlated/collinear before running kmeans?

I am running kmeans to identify clusters of customers, using approximately 100 variables. Each variable represents the % of a customer's spend on one category, so with 100 categories the variables sum to 100% for each customer. These variables are strongly correlated with each other. Do I need to drop some of them to remove collinearity before I run kmeans?

Here is the sample data. In reality I have 100 variables and 10 million customers.

```
Customer CatA CatB CatC
1         10%  70%  20%
2         15%  60%  25%
```


Don’t drop any variables, but do consider using PCA. Here’s why.

Firstly, as pointed out by Anony-mousse, k-means is not badly affected by collinearity/correlations. You don’t need to throw away information because of that.

Secondly, if you drop your variables in the wrong way, you’ll artificially bring some samples closer together. An example:

```
Customer CatA CatB CatC
1        1    0    0
2        0    1    0
3        0    0    1
```


(I’ve removed the % notation and just put values between 0 and 1, constrained so they all sum to 1.)

The Euclidean distance between any two of those customers in their natural 3-d space is $\sqrt{(1-0)^2+(0-1)^2+(0-0)^2} = \sqrt{2}$.

Now let’s say you drop CatC.

```
Customer CatA CatB
1        1    0
2        0    1
3        0    0
```


Now the distance between customers 1 and 2 is still $\sqrt{2}$, but between customers 1 and 3, and 2 and 3, it’s only $\sqrt{(1-0)^2+(0-0)^2}=1$. You’ve artificially made customer 3 more similar to 1 and 2, in a way the raw data doesn’t support.
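You can verify this artifact in a couple of lines of numpy (the matrix below is just the toy table above, not real data):

```python
import numpy as np

# Spend shares for the three toy customers (rows), categories as columns.
X = np.array([
    [1.0, 0.0, 0.0],  # customer 1
    [0.0, 1.0, 0.0],  # customer 2
    [0.0, 0.0, 1.0],  # customer 3
])

def dist(a, b):
    """Euclidean distance between two customers."""
    return np.linalg.norm(a - b)

# Full 3-d space: every pair is sqrt(2) apart.
print(dist(X[0], X[1]), dist(X[0], X[2]))  # ~1.414, ~1.414

# Drop CatC (the last column): customer 3 collapses toward the others.
X2 = X[:, :2]
print(dist(X2[0], X2[1]), dist(X2[0], X2[2]))  # ~1.414, 1.0
```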

Thirdly, collinearity/correlations are not the problem. Your dimensionality is. 100 variables is large enough that even with 10 million datapoints, I worry that k-means may find spurious patterns in the data and fit to them. Instead, think about using PCA to compress the data down to a more manageable number of dimensions – say 10 or 12 to start with (maybe much higher, maybe much lower – you'll have to look at the variance along each component, and play around a bit, to find the right number). You'll artificially bring some samples closer together doing this, yes, but you'll do so in a way that preserves most of the variance in the data and preferentially removes correlations.
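Here's a sketch of that PCA step using numpy's SVD. The 1,000 × 100 matrix of random spend shares is a stand-in I've invented for illustration (10 million rows is too big for a demo), and `k = 10` is just the starting guess from above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: 1,000 customers x 100 categories,
# with rows normalised to sum to 1, like % of spend.
X = rng.random((1000, 100))
X = X / X.sum(axis=1, keepdims=True)

# PCA via SVD on the mean-centred data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Explained-variance ratio per component: inspect this to choose k.
evr = S ** 2 / np.sum(S ** 2)

k = 10                # starting guess; tune by looking at evr
Z = Xc @ Vt[:k].T     # projected data, shape (1000, k) – feed this to k-means
```

Note that because the rows sum to 1, the centred data has at most 99 degrees of freedom, so the last component carries essentially zero variance – one correlation PCA removes for free.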

~~~~~

EDIT:

Re: the comments below about PCA. Yes, it absolutely does have pathologies. But it's pretty quick and easy to try, so it still seems not a bad bet to me if you want to reduce the dimensionality of the problem.

On that note though, I tried quickly throwing a few sets of 100-dimensional synthetic data into a k-means algorithm to see what it came up with. While the cluster centre position estimates weren't that accurate, the cluster membership (i.e. whether two samples were assigned to the same cluster or not, which seems to be what the OP is interested in) was much better than I thought it would be. So my gut feeling earlier was quite possibly wrong – k-means might work just fine on the raw data.
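For anyone who wants to repeat that kind of check, here's a minimal version of the experiment using scikit-learn. The cluster count, centre separation, and noise level are arbitrary choices of mine (and deliberately easy – well-separated Gaussian blobs), so treat this as illustrative rather than a claim about real customer data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic data: 3 true clusters in 100 dimensions, Gaussian noise.
n_per, dim, k = 500, 100, 3
centres = rng.normal(scale=5.0, size=(k, dim))
X = np.vstack([c + rng.normal(size=(n_per, dim)) for c in centres])
truth = np.repeat(np.arange(k), n_per)

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Agreement of cluster membership with the truth (1.0 = perfect,
# and it is invariant to how the cluster labels are permuted).
print(adjusted_rand_score(truth, labels))
```

With clusters this well separated the membership recovery is essentially perfect; the interesting cases are when you shrink the centre separation and watch how membership agreement degrades before the centre estimates do.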