Cosine Distance as Similarity Measure in KMeans [duplicate]

I am currently solving a problem where I have to use cosine distance as the similarity measure for k-means clustering. However, the standard k-means implementation (scikit-learn's `KMeans`) uses Euclidean distance by default and does not allow you to change this.

My understanding is that if I normalise my original dataset with the code below, I can then run the standard k-means package (using Euclidean distance) and it will be equivalent to having changed the distance metric to cosine distance. Is that correct?

from sklearn import cluster, preprocessing  # cluster was missing from the original import

X_Norm = preprocessing.normalize(X)  # scale each row of X to unit L2 norm

km2 = cluster.KMeans(n_clusters=5, init='random').fit(X_Norm)

Please let me know if my mathematical understanding of this is incorrect.

Answer

It should be the same: for normalized vectors, cosine similarity and squared Euclidean distance are linearly related. Here's the explanation:

Cosine distance is actually based on cosine similarity: $\cos(x,y) = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2}\sqrt{\sum y_i^2}}$.

Now, let’s see what we can do with Euclidean distance for normalized vectors ($\sum x_i^2 = \sum y_i^2 = 1$):

$$\|x-y\|^2 = \sum (x_i - y_i)^2 = \sum \left(x_i^2 + y_i^2 - 2 x_i y_i\right) = \sum x_i^2 + \sum y_i^2 - 2\sum x_i y_i = 1 + 1 - 2\cos(x,y) = 2\left(1 - \cos(x,y)\right)$$

Note that for normalized vectors $\cos(x,y) = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2}\sqrt{\sum y_i^2}} = \sum x_i y_i$.

So you can see that there is a direct linear connection between these distances for normalized vectors.
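The identity above is easy to check numerically. This is a small sketch (using NumPy and randomly generated vectors, not the asker's actual data) verifying that for unit-norm vectors the squared Euclidean distance equals $2(1 - \cos(x,y))$:

```python
import numpy as np

# Draw two random vectors and normalize each to unit L2 norm.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

# For unit vectors, cosine similarity reduces to the dot product.
cos_sim = np.dot(x, y)
sq_euclid = np.sum((x - y) ** 2)  # squared Euclidean distance

# The linear relation derived above: ||x - y||^2 = 2 * (1 - cos(x, y))
assert np.isclose(sq_euclid, 2 * (1 - cos_sim))
```

Since squared Euclidean distance is a monotonically increasing function of cosine distance on normalized data, minimizing one orders points the same way as minimizing the other.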

Attribution
Source : Link , Question Author : MSalty , Answer Author : Jan Kukacka
