# Why doesn’t k-means give the global minimum?

I read that the k-means algorithm only converges to a local minimum and not to a global minimum. Why is this? I can logically think of how initialization could affect the final clustering and there is a possibility of sub-optimum clustering, but I did not find anything that will mathematically prove that.

Also, why is k-means an iterative process? Can’t we just partially differentiate the objective function w.r.t. to the centroids, equate it to zero to find the centroids that minimizes this function? Why do we have to use gradient descent to reach the minimum step by step?

Say you are estimating a multivariate normal distribution for each cluster with the covariance matrix fixed to the identity matrix for all, but variable mean $\mu_i$ where $i$ is the cluster’s index. Clearly, if the parameters $\{\mu_i\}$ are known, you can assign each point $p$ its maximum likelihood cluster (ie. the $\mu_i$ for which the distance to $p$ in minimal). The EM algorithm for this problem is almost equivalent to k-means.
The other way around, if you know which points belong to which cluster, you can estimate the optimal $\mu_i$. The closed form solution to this (which finds a global optimum) basically says that to find the maximum likelihood models $\{\hat\mu_i\}$ you integrate over all possible assignments of points to clusters. Since even with just thirty points and two clusters, there are about a billion such possible assignments, this is unfeasible to calculate.