Why are most of my points classified as noise using DBSCAN?

I’m using several clustering algorithms from sklearn to cluster some data, and can’t seem to figure out what’s happening with DBSCAN. My data is a document-term matrix from TfidfVectorizer, with a few hundred preprocessed documents.

Code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as pl

tfv = TfidfVectorizer(stop_words=STOP_WORDS, tokenizer=StemTokenizer())
data = tfv.fit_transform(dataset)

db = DBSCAN(eps=eps, min_samples=min_samples)
result = db.fit_predict(data)

# 2-D SVD projection, used only for plotting
svd = TruncatedSVD(n_components=2).fit_transform(data)

# Set the colour of noise points (label -1) to black
for i in range(len(result)):
    if result[i] == -1:
        result[i] = 7
colors = [LABELS[l] for l in result]
pl.scatter(svd[:, 0], svd[:, 1], c=colors, s=50, linewidths=0.5, alpha=0.7)

Here’s what I get for eps=0.5, min_samples=5:

Basically I can’t get any clusters at all unless I set min_samples to 3, which gives this:

I’ve tried various combinations of eps/min_samples values and get similar results. It always seems to cluster areas of low density first. Why is it clustering like this? Am I maybe using TruncatedSVD incorrectly?

Answer

The scatter plot of the SVD projection scores of the original TF-IDF data does suggest that some density structure should be detectable. However, that projection is not what DBSCAN is presented with: you appear to be passing the original TF-IDF matrix as input.

The original TF-IDF dataset is almost certainly sparse and high-dimensional. Detecting density-based clusters in such a domain is very demanding: high-dimensional density estimation is a genuinely hard problem, and a typical scenario where the curse of dimensionality kicks in. What you are seeing is a manifestation of that curse; the clustering returned by DBSCAN is rather sparse itself and assumes (probably wrongly) that the data at hand are riddled with outliers.
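One way to see this, as a quick check, is to look at how far each document is from its nearest neighbours in the raw TF-IDF space; in high dimensions these distances tend to be large and nearly uniform, so no single eps separates "dense" from "sparse" regions. A minimal sketch, re-using the data matrix from the question (the neighbour count simply mirrors min_samples=5):

from sklearn.neighbors import NearestNeighbors
import numpy as np

# Distance from each document to its 5th nearest other document
# (n_neighbors=6 because each query point is returned as its own first neighbour)
nn = NearestNeighbors(n_neighbors=6).fit(data)
dists, _ = nn.kneighbors(data)
kth = np.sort(dists[:, -1])
print(kth.min(), np.median(kth), kth.max())

If these 5th-neighbour distances are all roughly the same size as eps or larger, DBSCAN will find no core points and label almost everything as noise.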

I would suggest that, at least in the first instance, you provide DBSCAN with the projection scores used to create the scatter plot as its inputs. This approach is effectively Latent Semantic Analysis (LSA): in LSA we use the SVD decomposition of a matrix containing word counts of the text corpus being analysed (or a normalised term-document matrix such as the one returned by TF-IDF) to investigate the relations between the text units of the corpus at hand.
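A minimal sketch of that approach, re-using the TF-IDF matrix from the question; the number of SVD components (50 here) and the DBSCAN parameters are placeholders you would need to tune, and the Normalizer step rescales the projection scores to unit length so that Euclidean distances behave like cosine distances:

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import DBSCAN

# LSA: project the TF-IDF matrix onto a low-dimensional SVD basis, then re-normalise
lsa = make_pipeline(TruncatedSVD(n_components=50), Normalizer(copy=False))
data_lsa = lsa.fit_transform(data)

# Cluster the projection scores instead of the raw TF-IDF matrix
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(data_lsa)

You would still need to experiment with eps and min_samples on the reduced data, but distances between documents in the low-dimensional LSA space are far more meaningful for a density-based method than distances in the raw sparse matrix.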

Attribution
Source: Link, Question Author: filaments, Answer Author: usεr11852
