I’m having trouble understanding the curse of dimensionality. Specifically, I came across it while doing the
scikit-learntutorial in python. Can someone please explain the below in a simpler manner? Sorry I have been trying to understand for the longest time and can’t understand how they came up with the calculation for number of training examples to achieve an efficient KNN estimator?
Here is the explanation:
For an estimator to be effective, you need the distance between neighboring points to be less than some value d, which depends on the problem. In one dimension, this requires on average n ~ 1/d points. In the context of the above KNN example, if the data is described by just one feature with values ranging from 0 to 1 and with n training observations, then new data will be no further away than 1/n. Therefore, the nearest neighbor decision rule will be efficient as soon as 1/n is small compared to the scale of between-class feature variations.
If the number of features is p, you now require n ~ 1/d^p points. Let’s say that we require 10 points in one dimension: Now 10^p points are required in p dimensions to pave the [0, 1] space. As p becomes large, the number of training points required for a good estimator grows exponentially.
EDIT: also is the tilde (
~) supposed to represent approximate in that example? or the python tilde operator?
Translating that paragraph:
Let there be a set of features that describe a data point. Maybe you’re looking at the weather. That set of features might include things like temperature, humidity, time of day, etc. So each data point might have one feature (if you’re only looking at temperature) or it might have 2 features (if you’re looking at temperature and humidity) and so on. What this paragraph is saying is that based on the number of dimensions your data has (how many features it has), the more difficult it is to make an estimator. This is because if you simply have one feature of data, or 1-dimensional data, then when you go to graph this data, you get a line graph, and imagining a line graph between let’s say 0-50 degrees C, it only takes 50 random points before each data point is about 1 degree from any other data point. Now let’s think about 2 dimensions, talking about humidity and temperature, now it’s trickier to find that d such that all the points are within “d” units of each other. Imagine temperature is still between 0-50 but now humidity is also between 0-100%. How many random points does it take to get all the points within 1 or 2 of each other? Now it’s 100 * 50 or ~5,000! Now imagine 3 dimensions, etc etc. You start needing way more points to ensure that every point is within d of some other point. To make your life easier try assuming “d” is 1 and see what happens. Hope that helps!