k-means vs k-median?

I know there are the k-means and k-medians clustering algorithms. One uses the mean as the center of the cluster and the other uses the median. My question is: when/where should I use which?


k-means minimizes within-cluster variance, which amounts to minimizing the sum of squared Euclidean distances to the cluster center.

In general, the arithmetic mean is the optimal center for this objective. It does not minimize distances, but squared deviations from the center.

k-medians minimizes absolute deviations, which amounts to minimizing the sum of Manhattan distances to the cluster center.

In general, the per-axis median does this. It is a good estimator of the center if you want to minimize the sum of absolute deviations (that is, sum_i |x_i − y_i|) instead of the squared ones.
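The two claims above can be checked directly with a small grid search (a minimal sketch in plain Python; the data values and grid resolution are arbitrary choices for illustration): the best center under squared loss lands on the mean, and the best center under absolute loss lands on the median.

```python
# Compare candidate centers under the two loss functions.
def sum_sq(points, c):
    return sum((x - c) ** 2 for x in points)

def sum_abs(points, c):
    return sum(abs(x - c) for x in points)

data = [1.0, 2.0, 2.0, 3.0, 10.0]          # note the outlier at 10
mean = sum(data) / len(data)               # 3.6
median = sorted(data)[len(data) // 2]      # 2.0 (odd number of points)

candidates = [i / 10 for i in range(0, 121)]  # grid from 0.0 to 12.0
best_sq = min(candidates, key=lambda c: sum_sq(data, c))
best_abs = min(candidates, key=lambda c: sum_abs(data, c))

print(best_sq)   # coincides with the mean
print(best_abs)  # coincides with the median
```

Note how the single outlier drags the mean to 3.6 while the median stays at 2.0; this robustness is another practical difference between the two updates.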

It’s not a question about accuracy. It’s a question of correctness. 😉

So here’s your decision tree:

  • If your distance is squared Euclidean distance, use k-means
  • If your distance is the Taxicab (Manhattan) metric, use k-medians
  • If you have any other distance, use k-medoids
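The first two branches of the decision tree differ only in the center-update step of the usual Lloyd-style loop. Here is a minimal sketch (assumptions mine: points as tuples, a fixed iteration count, and given initial centers) where swapping the update function turns k-means into k-medians:

```python
def mean_center(points):
    # k-means update: per-axis arithmetic mean
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def median_center(points):
    # k-medians update: per-axis median
    def med(vals):
        s = sorted(vals)
        m = len(s) // 2
        return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2
    return tuple(med([p[d] for p in points]) for d in range(len(points[0])))

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def lloyd(points, centers, dist, update, iters=10):
    # Alternate assignment and center-update steps.
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [update(c) if c else centers[i]  # keep empty clusters' centers
                   for i, c in enumerate(clusters)]
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
init = [(0.0, 0.0), (10.0, 10.0)]
km = lloyd(pts, init, sq_euclidean, mean_center)      # k-means
kmed = lloyd(pts, init, manhattan, median_center)     # k-medians
```

Pairing the distance with the matching update is the point of the decision tree: the mean is only guaranteed to decrease the squared-Euclidean objective, and the per-axis median only the Manhattan one.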

One exception: as far as I can tell, maximizing cosine similarity is related to minimizing squared Euclidean distance on L2-normalized data, since ||a − b||² = 2 − 2·cos(a, b) for unit vectors a and b. So if your data is L2-normalized, and you L2-normalize your means after each iteration, then you can use k-means again.
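That exception is often called spherical k-means. A minimal sketch of the modified update step (plain Python; the example vectors are arbitrary): normalize the data once, take the usual mean, then project the mean back onto the unit sphere.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def normalized_mean(vectors):
    # Standard k-means mean update, followed by re-normalization,
    # so the center stays on the unit sphere with the data.
    n = len(vectors)
    mean = [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]
    return l2_normalize(mean)

data = [l2_normalize(v) for v in [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]]
center = normalized_mean(data)  # unit-length cluster center
```

With both the data and the centers on the unit sphere, ranking points by squared Euclidean distance to a center gives the same assignments as ranking by cosine similarity.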

Source: Link, Question Author: Jack Twain, Answer Author: Has QUIT–Anony-Mousse
