# How to define number of clusters in K-means clustering?

Is there any way to determine the optimal cluster number or should I just try different values and check the error rates to decide on the best value?

The method I use is to use CCC (Cubic Clustering Criteria). I look for CCC to increase to a maximum as I increment the number of clusters by 1, and then observe when the CCC starts to decrease. At that point I take the number of clusters at the (local) maximum. This would be similar to using a scree plot to picking the number of principal components.

SAS Technical Report A-108 Cubic Clustering Criterion (pdf)

$n$ = number of observations
$n_k$ = number in cluster $k$
$p$ = number of variables
$q$ = number of clusters
$X$ = $n\times p$ data matrix
$M$ = $q\times p$ matrix of cluster means
$Z$ = cluster indicator ($z_{ik}=1$ if obs. $i$ in cluster $k$, 0 otherwise)

Assume each variable has mean 0:
$Z’Z = \text{diag}(n_1, \cdots, n_q)$, $M = (Z’Z)-1Z’X$

$SS$(total) matrix = $T$= $X’X$
$SS$(between clusters) matrix = $B$ = $M’ Z’Z M$
$SS$(within clusters) matrix = $W$ = $T-B$

$R^2 = 1 – \frac{\text{trace(W)}}{\text{trace}(T)}$
(trace = sum of diagonal elements)

Stack columns of $X$ into one long column.
Regress on Kronecker product of $Z$ with $p\times p$ identity matrix
Compute $R^2$ for this regression – same $R^2$

The CCC idea is to compare the $R^2$ you get for a given set of clusters with the $R^2$ you would get by clustering a uniformly distributed set of points in $p$ dimensional space.