I know there are the k-means and k-median clustering algorithms: one uses the mean as the cluster center and the other uses the median. My question is: when/where should I use which?
k-means minimizes within-cluster variance, which is the sum of squared Euclidean distances to the center.
In general, the arithmetic mean does this. It does not optimize distances, but squared deviations from the mean.
k-medians minimizes the sum of absolute deviations, which corresponds to the Manhattan distance.
In general, the per-axis median does this. It is the estimator of choice if you want to minimize the sum of absolute deviations (that is, sum_i |x_i - y_i|) instead of the squared ones.
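To see the difference, here is a quick sketch in plain NumPy (the sample values are arbitrary, chosen to include one outlier) comparing the two objectives:

```python
import numpy as np

# A 1D sample with one outlier (arbitrary illustrative values).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

mean, median = x.mean(), np.median(x)

def sq_dev(c):
    # Sum of squared deviations: the k-means objective.
    return ((x - c) ** 2).sum()

def abs_dev(c):
    # Sum of absolute deviations: the k-medians objective.
    return np.abs(x - c).sum()
```

The outlier drags the mean to 22, while the median stays at 3; each one is optimal only for its own objective: `sq_dev(mean) < sq_dev(median)`, but `abs_dev(median) < abs_dev(mean)`.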
It’s not a question of accuracy. It’s a question of correctness. 😉
So here’s your decision tree:
- If your distance is squared Euclidean distance, use k-means
- If your distance is Manhattan distance (the taxicab metric), use k-medians
- If you have any other distance, use k-medoids
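The first two options can be sketched with the same Lloyd-style loop, differing only in the distance function and the center estimator (a minimal sketch in plain NumPy; the function names and the naive initialization are my own, not from any particular library):

```python
import numpy as np

def cluster(X, k, center_fn, dist_fn, iters=20):
    """Generic Lloyd-style loop: assign points to centers, then re-estimate."""
    centers = X[:k].astype(float).copy()  # naive init: first k points (real code would randomize)
    for _ in range(iters):
        # Assign each point to its nearest center under the chosen distance.
        labels = np.array([np.argmin([dist_fn(x, c) for c in centers]) for x in X])
        # Re-estimate each center; keep the old one if a cluster went empty.
        centers = np.array([center_fn(X[labels == j]) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

# k-means: squared Euclidean distance, arithmetic mean as the center.
def kmeans(X, k):
    return cluster(X, k,
                   center_fn=lambda pts: pts.mean(axis=0),
                   dist_fn=lambda x, c: ((x - c) ** 2).sum())

# k-medians: Manhattan distance, per-axis median as the center.
def kmedians(X, k):
    return cluster(X, k,
                   center_fn=lambda pts: np.median(pts, axis=0),
                   dist_fn=lambda x, c: np.abs(x - c).sum())
```

The point of the sketch is the pairing: each distance comes with the estimator that actually minimizes it, which is exactly what k-medoids relaxes by restricting centers to data points.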
One exception: maximizing cosine similarity is equivalent to minimizing squared Euclidean distance on L2-normalized data. So if your data is L2-normalized, and you L2-normalize your means after each iteration, then you can use k-means again (this is known as spherical k-means).
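The identity behind that exception: for unit vectors, ||x - y||² = 2 - 2·cos(x, y), so ranking by squared Euclidean distance and ranking by cosine similarity agree. A quick numerical check (random vectors, plain NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)

# L2-normalize both vectors.
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

sq_euclidean = ((x - y) ** 2).sum()
cosine = x @ y  # dot product of unit vectors = cosine similarity

# For unit vectors: ||x - y||^2 = 2 - 2 * cos(x, y)
assert np.isclose(sq_euclidean, 2 - 2 * cosine)
```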