I am running k-means to identify clusters of customers, using approximately 100 variables. Each variable represents the % of a customer's spend on a category, so if I have 100 categories, I have 100 variables whose sum is 100% for each customer. These variables are strongly correlated with each other. Do I need to drop some of them to remove collinearity before I run k-means?

Here is the sample data. In reality I have 100 variables and 10 million customers.

```
Customer CatA CatB CatC
1        10%  70%  20%
2        15%  60%  25%
```

**Answer**

Don’t drop any variables, but do consider using PCA. Here’s why.

Firstly, as pointed out by Anony-mousse, k-means is not badly affected by collinearity/correlations. You don’t need to throw away information because of that.

Secondly, if you drop variables in the wrong way, you'll artificially bring some samples closer together. An example:

```
Customer CatA CatB CatC
1 1 0 0
2 0 1 0
3 0 0 1
```

(I’ve removed the % notation and just put values between 0 and 1, constrained so they all sum to 1.)

The Euclidean distance between each pair of those customers in their natural 3-D space is √((1−0)² + (0−1)² + (0−0)²) = √2.

Now let’s say you drop CatC.

```
Customer CatA CatB
1 1 0
2 0 1
3 0 0
```

Now the distance between customers 1 and 2 is still √2, but between customers 1 and 3, and 2 and 3, it's only √((1−0)² + (0−0)²) = 1. You've artificially made customer 3 more similar to 1 and 2, in a way the raw data doesn't support.
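You can verify the distortion numerically. A minimal sketch with NumPy, using the three one-hot customers from the tables above:

```python
import numpy as np

# The three customers above: each spends 100% in a single category.
X = np.array([
    [1.0, 0.0, 0.0],  # customer 1
    [0.0, 1.0, 0.0],  # customer 2
    [0.0, 0.0, 1.0],  # customer 3
])

def dist(a, b):
    """Euclidean distance between two row vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Full 3-D space: every pair of customers is sqrt(2) apart.
print(dist(X[0], X[1]))  # 1.4142...
print(dist(X[0], X[2]))  # 1.4142...

# Drop CatC (the last column) and recompute.
Y = X[:, :2]
print(dist(Y[0], Y[1]))  # still 1.4142...
print(dist(Y[0], Y[2]))  # now 1.0 -- customer 3 pulled closer
```

The distance between customers 1 and 2 survives the column drop, but customer 3, whose spend lived entirely in the dropped column, gets pulled in toward both of the others.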

Thirdly, collinearity/correlation is not the problem; your dimensionality is. 100 variables is large enough that even with 10 million data points, I worry that k-means may find spurious patterns in the data and fit to them. Instead, consider using PCA to compress the data down to a more manageable number of dimensions – say 10 or 12 to start with (maybe much higher, maybe much lower – you'll have to look at the variance explained by each component, and experiment a bit, to find the right number). Yes, this will also artificially bring some samples closer together, but it does so in a way that should preserve most of the variance in the data, and that preferentially removes correlations.
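The PCA-then-cluster pipeline might look like the sketch below (scikit-learn, with Dirichlet draws as a stand-in for the real spend-share data; `n_components=10` and `n_clusters=5` are illustrative guesses, not recommendations):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for the real data: rows are customers, columns are
# category spend shares that sum to 1 (Dirichlet draws).
X = rng.dirichlet(np.ones(100) * 0.5, size=5000)

# Compress to a handful of components, then cluster on the scores.
# n_components=10 is a starting point -- inspect the cumulative
# explained variance to decide how many components to keep.
pca = PCA(n_components=10)
Z = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Z)
labels = km.labels_
```

The key diagnostic is `explained_variance_ratio_`: if 10 components capture most of the variance, the compression is cheap; if not, raise `n_components` until the curve flattens.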

~~~~~

EDIT:

Re: the comments below about PCA. Yes, it absolutely does have pathologies. But it's quick and easy to try, so it still seems a reasonable bet if you want to reduce the dimensionality of the problem.

On that note though, I tried quickly throwing a few sets of 100-dimensional synthetic data into a k-means algorithm to see what it came up with. While the cluster centre position estimates weren't that accurate, the cluster *membership* (i.e. whether two samples were assigned to the same cluster or not, which seems to be what the OP is interested in) was much better than I expected. So my gut feeling earlier was quite possibly wrong – k-means might work just fine on the raw data.
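An experiment along those lines is easy to reproduce. A hedged sketch (not the exact data I used): generate well-separated 100-dimensional blobs, cluster the raw data, and score membership recovery with the adjusted Rand index, where 1.0 means the true memberships are recovered exactly:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic 100-dimensional data with known cluster memberships.
# cluster_std and n_samples are arbitrary choices for illustration.
X, true_labels = make_blobs(n_samples=2000, n_features=100,
                            centers=5, cluster_std=5.0,
                            random_state=0)

# Cluster the raw 100-D data, no dimensionality reduction.
pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index compares predicted vs. true memberships.
ari = adjusted_rand_score(true_labels, pred)
print(ari)
```

With clusters this well separated, the ARI comes out high even in 100 dimensions, which is consistent with the observation above that membership recovery holds up better than centre estimation.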

**Attribution**
*Source : Link , Question Author : Ashish Jha , Answer Author : Pat*