I am currently trying to compute the BIC for my toy data set (ofc iris (: ). I want to reproduce the results as shown here (Fig. 5). That paper is also my source for the BIC formulas.
I have 2 problems with this:
- Notation:
- ni = number of elements in cluster i
- Ci = center coordinates of cluster i
- xj = data points assigned to cluster i
- m = number of clusters
1) The variance as defined in Eq. (2):
∑i=1ni−mni∑j=1‖
As far as I can see it is problematic and not covered that the variance can be negative when there are more clusters m than elements in the cluster. Is this correct?
2) I just cannot make my code work to compute the correct BIC. Hopefully there is no error, but it would be highly appreciated if someone could check. The whole equation can be found in Eq. (5) in the paper. I am using scikit learn for everything right now (to justify the keyword 😛 ).
from sklearn import cluster from scipy.spatial import distance import sklearn.datasets from sklearn.preprocessing import StandardScaler import matplotlib.pyplot as plt import numpy as np def compute_bic(kmeans,X): """ Computes the BIC metric for a given clusters Parameters: ----------------------------------------- kmeans: List of clustering object from scikit learn X : multidimension np array of data points Returns: ----------------------------------------- BIC value """ # assign centers and labels centers = [kmeans.cluster_centers_] labels = kmeans.labels_ #number of clusters m = kmeans.n_clusters # size of the clusters n = np.bincount(labels) #size of data set N, d = X.shape #compute variance for all clusters beforehand cl_var = [(1.0 / (n[i] - m)) * sum(distance.cdist(X[np.where(labels == i)], [centers[0][i]], 'euclidean')**2) for i in xrange(m)] const_term = 0.5 * m * np.log10(N) BIC = np.sum([n[i] * np.log10(n[i]) - n[i] * np.log10(N) - ((n[i] * d) / 2) * np.log10(2*np.pi) - (n[i] / 2) * np.log10(cl_var[i]) - ((n[i] - m) / 2) for i in xrange(m)]) - const_term return(BIC) # IRIS DATA iris = sklearn.datasets.load_iris() X = iris.data[:, :4] # extract only the features #Xs = StandardScaler().fit_transform(X) Y = iris.target ks = range(1,10) # run 9 times kmeans and save each result in the KMeans object KMeans = [cluster.KMeans(n_clusters = i, init="k-means++").fit(X) for i in ks] # now run for each cluster the BIC computation BIC = [compute_bic(kmeansi,X) for kmeansi in KMeans] plt.plot(ks,BIC,'r-o') plt.title("iris data (cluster vs BIC)") plt.xlabel("# clusters") plt.ylabel("# BIC")
My results for the BIC look like this:
Which is not even close to what I have expected and also makes no sense… I looked at the equations now for some time and am not getting any further locating my mistake ):
Answer
It seems you have a few errors in your formulas, as determined by comparing to:
- https://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf (there are some
errors in the paper) - https://github.com/bobhancock/goxmeans/blob/master/km.go
- https://github.com/mynameisfiber/pyxmeans/blob/master/pyxmeans/xmeans.py
- https://github.com/bobhancock/goxmeans/blob/master/doc/BIC_notes.pdf
1.
np.sum([n[i] * np.log(n[i]) -
n[i] * np.log(N) -
((n[i] * d) / 2) * np.log(2*np.pi) -
(n[i] / 2) * np.log(cl_var[i]) -
((n[i] - m) / 2) for i in range(m)]) - const_term
Here there are three errors in the paper, fourth and fifth lines are missing a factor of d, the last line substitute m for 1. It should be:
np.sum([n[i] * np.log(n[i]) -
n[i] * np.log(N) -
((n[i] * d) / 2) * np.log(2*np.pi*cl_var) -
((n[i] - 1) * d/ 2) for i in range(m)]) - const_term
2.
The const_term:
const_term = 0.5 * m * np.log(N)
should be:
const_term = 0.5 * m * np.log(N) * (d+1)
3.
The variance formula:
cl_var = [(1.0 / (n[i] - m)) * sum(distance.cdist(p[np.where(label_ == i)], [centers[0][i]], 'euclidean')**2) for i in range(m)]
should be a scalar:
cl_var = (1.0 / (N - m) / d) * sum([sum(distance.cdist(p[np.where(labels == i)], [centers[0][i]], 'euclidean')**2) for i in range(m)])
4.
Use natural logs, instead of your base10 logs.
5.
Finally, and most importantly, the BIC you are computing has an inverse sign from the regular definition. so you are looking to maximize instead of minimize
Attribution
Source : Link , Question Author : Kam Sen , Answer Author : Seanny123