Curse of dimensionality: does cosine similarity work better, and if so, why? [duplicate]

When working with high-dimensional data, it is almost useless to compare data points using Euclidean distance – this is the curse of dimensionality.
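To make that premise concrete, here is a minimal sketch (assuming NumPy; the point counts and dimensions are arbitrary illustration values, not from the original post): as the dimension grows, the nearest and farthest neighbours of a query point become almost equally distant, so Euclidean rankings stop carrying much information.

```python
import numpy as np

rng = np.random.default_rng(0)

# As the dimension grows, the farthest and nearest Euclidean distances from a
# query point become nearly equal, so "near" vs. "far" loses its meaning.
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d={d:4d}  max/min distance ratio = {dists.max() / dists.min():.2f}")
```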

However, I have read that other distance metrics, such as cosine similarity, perform better on high-dimensional data.

Why is this? Is there some mathematical proof / intuition?

My intuition is that this is because the cosine metric is only concerned with the angle between data points, and the plane containing any two data points and the origin is only 2-dimensional, no matter how high the ambient dimension. But what if two data points have a very small angle between them yet lie “far away” from each other (in an absolute-distance sense) – how would they still be considered close or similar?
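That scenario is easy to construct; a minimal sketch (assuming NumPy, with vectors invented purely for illustration) of two nearly collinear vectors that cosine calls almost identical while Euclidean distance calls far apart:

```python
import numpy as np

# Two vectors pointing in almost the same direction but with very different
# magnitudes: cosine similarity says "almost identical", Euclidean says "far".
x = np.array([1.0, 0.0])
y = np.array([100.0, 1.0])

cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
euclid = np.linalg.norm(x - y)
print(f"cosine similarity = {cos_sim:.4f}, Euclidean distance = {euclid:.1f}")
```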

Answer

Contrary to various unproven claims, cosine cannot be significantly better.

It is easy to see that cosine is essentially the same as Euclidean on L2-normalized data: for unit vectors, $\|x - y\|^2 = 2(1 - \cos(x, y))$. The normalization takes away one degree of freedom. Thus, cosine on a 1000-dimensional space is about as “cursed” as Euclidean on a 999-dimensional space.
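A minimal sketch of that identity (assuming NumPy; the vectors are random illustration data): the squared Euclidean distance between L2-normalized vectors equals 2 times (1 minus the cosine similarity), so nearest-neighbour rankings under the two measures coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=1000), rng.normal(size=1000)

# Cosine similarity of the raw vectors.
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Squared Euclidean distance after L2-normalizing both vectors.
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
sq_euclid = np.sum((xn - yn) ** 2)

# ||x_hat - y_hat||^2 == 2 * (1 - cos(x, y)), up to floating-point error.
print(np.isclose(sq_euclid, 2 * (1 - cos_sim)))  # True
```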

What is usually different is the data on which you would use one or the other. Euclidean is commonly used on dense, continuous variables. There, every dimension matters, and even a 20-dimensional space can be challenging. Cosine is mostly used on very sparse, discrete domains such as text. Here, most dimensions are 0 and do not matter at all. A 100,000-dimensional vector may have only some 50 nonzero dimensions involved in a distance computation, and many of those will carry low weight (stopwords). So it is the typical use case of cosine that is not cursed, even though the representation is, in theory, a very high-dimensional space.
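A sketch of that sparse use case (assuming SciPy and scikit-learn; the 100,000 dimensions and roughly 50 nonzero weights mirror the figures above, while the specific vectors are invented for illustration): the cosine computation only ever touches the handful of nonzero dimensions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Two "documents" in a 100,000-dimensional term space: each has ~50 nonzero
# term weights, ten of which they share; every other dimension is exactly 0.
n_dims = 100_000
rng = np.random.default_rng(2)
a, b = np.zeros(n_dims), np.zeros(n_dims)
shared = rng.choice(n_dims, size=10, replace=False)
a[shared] = rng.random(10)
b[shared] = rng.random(10)
a[rng.choice(n_dims, size=40, replace=False)] = rng.random(40)
b[rng.choice(n_dims, size=40, replace=False)] = rng.random(40)

a_sp, b_sp = csr_matrix(a), csr_matrix(b)

# The cosine depends only on dimensions where the vectors are nonzero, so the
# computation touches ~90 of the 100,000 dimensions, not all of them.
print(a_sp.nnz, b_sp.nnz, cosine_similarity(a_sp, b_sp)[0, 0])
```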

There is a term for this: intrinsic dimensionality vs. representation dimensionality.

Attribution
Source: Link, Question Author: PyRsquared, Answer Author: Has QUIT–Anony-Mousse
