I have gotten the following question as a test question for my exam and I simply cannot understand the answer.
A scatter plot of the data projected onto the first two principal components
is shown below. We wish to examine if there exists some group structure in the data set. To do this, we have run the k-means algorithm with k = 2 using the Euclidean distance measure. The result of the k-means algorithm can vary between runs depending on the random initial conditions. We ran the algorithm several times and got some different clustering results.
Only three of the four clusterings shown can be obtained by running the k-means algorithm on the data. Which one can not be obtained by k-means?
(there’s nothing special about the data)
The correct answer is D.
Can any of you explain why?
To put some more meat on Peter Flom’s answer, k-means clustering looks for k groups in data. The method assumes that Each cluster has a centroid at a certain
(x,y). The k-means algorithm minimizes the distance of each point to the centroid (this could be euclidian or manhattan distance depending on your data).
To identify the clusters, an initial guess is made of which data points belong in which cluster, and the centroid is calculated for each cluster. The distance metric is then calculated, and then some points are swapped between clusters to see if the fit improves. There are lots of variations on the details, but fundamentally k-means is a brute force solution that is dependent on the initial conditions, as there are local minima to the clustering solution.
So, in your case it looks like case A had initial conditions that were widely separated in
x and so the clusters resolve because the distances from the centroids to the data are small, and it’s a stable solution. Conversely, you can’t obtain D because that single red point is closer to the centroid of the blue points than many others, so the red point should have become part of the blue set.
Therefore the only way you could get D is if you interrupt the clustering process before it’s finished (or the code that made the clusters is broken).