I know that in high-dimensional space, the distance between almost all pairs of points has almost the same value (“Distance Concentration”). See Aggarwal et al. 2001, On the Surprising Behavior of Distance Metrics in High Dimensional Space.
Is there a mathematical way to observe this phenomenon?
There is a simple mathematical thought experiment that sheds light on this phenomenon, although it might not seem immediately applicable. I will therefore describe this experiment briefly and follow that, in a separate section, by a computer analysis of a concrete situation.
A Thought Experiment
An old cartographic chestnut is that most of the area of a map lies near its edges. Similarly, much of a pizza–more than you might think–consists of its crust. Even more so is the fact that a great deal of the volume of a thick-skinned fruit, like a grapefruit or watermelon, is in its skin.
More than half this pizza lies near its edge, outside the shaded area. However, the width of this “crust” is only 18% of the diameter of the pie.
Maps and pizzas and grapefruits don’t have a common shape, but there is a common underlying explanation. We may model the border of a map, the crust of a pizza, or the skin of a fruit by supposing its basic shape — a rectangle, circle, sphere, or whatever — has been shrunk uniformly by some factor α and that the “crust” or “rind” consists of what lies between these two concentric similar shapes.
In n dimensions (the examples involve n=2 or n=3), the n-dimensional volume of the interior will therefore be αn times the volume of the original shape. (This volume scaling law is sometimes used to define the number of dimensions of a space.) The volume of the rind relative to the original volume therefore is
As a function of α its rate of growth is
Beginning with no shrinking (α=1) and noting α is decreasing (dα is negative), we find the initial rate of growth of the rind equals n.
This shows that the volume of the rind initially grows much faster — n times faster — than the rate at which the object is being shrunk. The factor of n implies
in higher dimensions, relatively tiny changes in distance translate to much larger changes in volume.
Let’s call this the “edge-of-map principle.”
Consider, now, a tabular dataset consisting of observations of n numerical features. We may view each observation as a point in Rn and (at least in our imagination) might also suppose this collection of points is contained within some kind of compact region D with relatively simple boundary.
If we choose to use Euclidean distances to compare these points to each other (and to other points in D) and consider an arbitrary observation x, the edge-of-map principle implies that most of the room in D is nearly as far as possible from x. (The fudge term “nearly” is needed to account for what goes on around the boundary of D.)
Another implication that goes to the heart of the question is the generalization of the cartographer’s quandary: if our observations are somewhat “spread out” over D, then the cartographer’s question is “what proportion of this dataset is near the boundary?” To express this in a quantitative fashion, let’s invert it: we ask, by how much should we shrink D to make it, say, only half its original volume? Let’s call this the “half-length” of D, analogously to the half-life of a radioactive decay.
If the half-length is α, we need only solve the equation
In two dimensions the half-length is 1−0.35. Since half of the shrinking occurs on one side of the map or pizza and the other half on the other side (refer to the preceding figure), half of the area of a map (n=2) lies within (approximately) 35/2=18% of its diameter from the boundary.
In three dimensions the half-length is 1−0.23: now, half the volume of a fruit lies within 12% of its diameter from its boundary. A fruit whose skin is just one-eighth the width of the entire fruit is more than half skin.
Despite appearances, approximately half the volume of this grapefruit is rind. (Source: FreeDigitalPhotos.net.)
In very large dimensions the half-length is very close to 1. In n=350 dimensions it is greater than 98%, within two percent of 1. Thus, expect half of any 350-dimensional dataset to lie within 1% of its diameter from its boundary. Unless the data are strongly clustered, this generalization will be accurate.
Another way to express these results is:
Absent strong clustering, in higher dimensions n we can expect most Euclidean distances between observations in a dataset to be very nearly the same and to be very close to the diameter of the region in which they are enclosed. “Very close” means on the order of 1/n.
Several parts of this analysis are really just hand-waving and approximations, due to the vagueness of D and the very general assumptions about the dataset. How is D defined, anyway? In some applications it is determined by inherent limits; for instance, when all features are proportions. In many applications the features are arbitrarily scaled to lie within a fixed interval (“normalized”) and we often take D to be the corresponding hypercube. But that’s only an artifice and it is exquisitely sensitive to any outlying data values. The rest of this post explores an alternative in which the boundary plays a less important role in the results. It comes to similar conclusions.
Analysis of distances in a closed Euclidean space
I find the paper’s setting rather arbitrary, because it is exploring distances within unit cubes. The distance distributions depend strongly on the shapes of the boundaries of those cubes.
There’s a way to avoid boundary effects. In one dimension, the “cube” is just the unit interval, [0,1].
Because this interval has two ends, some of the points are far from the rest; others (near the middle) tend to be close to all the points. This is asymmetric. To remove the asymmetry, roll the interval around into a loop where the beginning point 0 meets the end point 1:
Geometrically, all its points are equivalent.
We can do the same in higher dimensions by rolling up each coordinate separately into a loop. The result in dimension d is the d-torus. It has no boundaries and all points are geometrically equivalent. It’s not perfectly symmetrical like a sphere, though: unlike the (Euclidean) sphere, whose geometry no longer is Euclidean due to its curvature, these d-tori are flat, without curvature. They can give us insight into Euclidean distances without the complication of dealing with boundaries.
Analytical study of the distances in a torus is complicated, at least for dimensions greater than 1. Let’s study these distances by generating random points from the uniform distribution on a d-torus and computing all their mutual distances (apart from the necessarily zero distances between each point and itself). For the following figures I generated 500 points in each of eight separate dimensions, resulting in over 100,000 distances in each dataset. How are these distances distributed and how do those distributions vary with the dimension d?
Here is an array of histograms of these distances, one per dimension.
It’s not hard to prove mathematically what the eye already sees: the distributions tend to a Gaussian, or “Normal,” shape, as the dimension increases.
There is another remarkable regularity: the spreads of these histograms are nearly constant. Beneath each I have printed the standard deviation (SD) of the distances. It hardly changes from 1 through 128 dimensions. In this sense, there is no “concentration” of distances in high dimensions at all!
Here are the same figures shown on a common plot for easier comparison:
The colors mean the same as before, showing that the average distances increase with dimension. They do so roughly with a square-root law: the average distance is about one-quarter the square root of the dimension. (Those familiar with the Pythagorean Theorem in higher dimensions will at once understand why.) The greatest possible distance in the d-torus is achieved by pairs of points whose coordinates all differ by 1/2 (because you cannot get any further apart than that along a loop); that distance obviously is √d/2.
Thus, it makes sense to compare the relative distances in each dimension. Here we go with one more plot of the same datasets, now with the distances all divided by √d/2:
This normalization has centered the histograms near 0.58, regardless of dimension. Here we are looking at the clearest manifestation of a “concentration of distance:” although the relative distances are typically the same in each dimension, as the dimension increases the distances concentrate more closely around a central value. As you can tell from the posted standard deviations, they too enjoy an inverse square-root law: the spread of the relative distances is approximately 1/(4√d).
In other words, around any given point on a high-dimensional torus (and all points are geometrically the same, so it doesn’t matter which point), nearly all other points on the torus are nearly the same distance away! If you were an inhabitant of a high-dimensional flat Euclidean space, albeit one with no boundaries, most of that space would seem to lie close to a spherical shell surrounding you. In d = a million dimensions, for instance, the maximum possible distance is 500, the average distance would be around 288.7, and virtually all distances would be within 0.5 of that value.
All these general conclusions about the shape, typical value, and spread of Euclidean distances hold in other domain shapes, but the details vary. The general result, though, is that randomly selected points within reasonably compact high-dimensional domains tend not to cluster appreciably. This has obvious implications for statistical (and machine-learning) methods based on clustering and nearest neighbor analyses.