How to measure distance for features with different scales?

I’m reading the book “Collective Intelligence”, and in one chapter they show how to measure the similarity between users of a movie review website using Euclidean distance.

There, all movies are rated on the same scale, from 1 to 5. But what if I want to find similar users based on features like, let’s say, body height, body width, weight and the ratio of eye distance to nose length? These features are on different scales, so body height, for example, would influence the distance much more than the eye-nose ratio.

My question is: what is the best way to approach this? Should one use a different distance measure (and if so, which?), or normalize the data somehow and keep using Euclidean distance?

Answer

A very common solution to this very common problem (i.e., variables on larger scales being over-weighted) is to standardize your data.

To do this, you just perform two successive column-wise operations on your data:

  • subtract the mean and

  • divide by the standard deviation

For instance, in NumPy:

>>> import numpy as NP

>>> # first create a small data matrix composed of three variables
>>> # having three different 'scales' (means and variances)

>>> a = 10*NP.random.rand(6)
>>> b = 50*NP.random.rand(6)
>>> c = 2*NP.random.rand(6)
>>> A = NP.column_stack((a, b, c))
>>> A   # the pre-standardized data
    array([[  1.753,  37.809,   1.181],
           [  1.386,   8.333,   0.235],
           [  2.827,  40.5  ,   0.625],
           [  5.516,  47.202,   0.183],
           [  0.599,  27.017,   1.054],
           [  8.918,  35.398,   1.602]])

>>> # mean center the data (columnwise)
>>> A -= NP.mean(A, axis=0)
>>> A
    array([[ -1.747,   5.099,   0.368],
           [ -2.114, -24.377,  -0.578],
           [ -0.673,   7.79 ,  -0.189],
           [  2.016,  14.493,  -0.631],
           [ -2.901,  -5.693,   0.24 ],
           [  5.418,   2.688,   0.789]])

>>> # divide by the standard deviation (columnwise; NP.std defaults to ddof=0)
>>> A /= NP.std(A, axis=0)
>>> A
    array([[-0.606,  0.409,  0.716],
           [-0.734, -1.957, -1.125],
           [-0.233,  0.626, -0.367],
           [ 0.7  ,  1.164, -1.228],
           [-1.007, -0.457,  0.468],
           [ 1.881,  0.216,  1.536]])
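After standardizing, each column has mean 0 and standard deviation 1, so ordinary Euclidean distance no longer lets the large-scale features dominate. As a rough sketch of how that might be used to find similar users (the helper names `standardize` and `pairwise_euclidean` and the example user data are made up for illustration, not taken from the book or the answer above):

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    X = np.asarray(X, dtype=float)
    # Note: a constant column has std 0 and would cause a division by zero.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pairwise_euclidean(Z):
    """Return the matrix of Euclidean distances between all pairs of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]   # shape (n, n, n_features)
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical users: height (cm), weight (kg), eye-nose ratio
users = np.array([
    [180.0, 80.0, 0.90],
    [165.0, 60.0, 1.10],
    [175.0, 72.0, 0.95],
    [190.0, 95.0, 1.30],
])

Z = standardize(users)
D = pairwise_euclidean(Z)

# Index of the user closest to user 0 (position 0 of the sort is user 0 itself)
nearest_to_first = np.argsort(D[0])[1]
```

Whether every feature should really count equally is a modelling decision; standardizing only removes the effect of the measurement units.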

Attribution
Source : Link , Question Author : tobigue , Answer Author : doug
