How to define number of clusters in K-means clustering?

Is there any way to determine the optimal cluster number or should I just try different values and check the error rates to decide on the best value?


The method I use is to use CCC (Cubic Clustering Criteria). I look for CCC to increase to a maximum as I increment the number of clusters by 1, and then observe when the CCC starts to decrease. At that point I take the number of clusters at the (local) maximum. This would be similar to using a scree plot to picking the number of principal components.

SAS Technical Report A-108 Cubic Clustering Criterion (pdf)

n = number of observations
nk = number in cluster k
p = number of variables
q = number of clusters
X = n×p data matrix
M = q×p matrix of cluster means
Z = cluster indicator (zik=1 if obs. i in cluster k, 0 otherwise)

Assume each variable has mean 0:
ZZ=diag(n1,,nq), M=(ZZ)1ZX

SS(total) matrix = T= XX
SS(between clusters) matrix = B = MZZM
SS(within clusters) matrix = W = TB

R^2 = 1 – \frac{\text{trace(W)}}{\text{trace}(T)}
(trace = sum of diagonal elements)

Stack columns of X into one long column.
Regress on Kronecker product of Z with p\times p identity matrix
Compute R^2 for this regression – same R^2

The CCC idea is to compare the R^2 you get for a given set of clusters with the R^2 you would get by clustering a uniformly distributed set of points in p dimensional space.

Source : Link , Question Author : berkay , Answer Author : gung – Reinstate Monica

Leave a Comment