# How to measure distance for features with different scales?

I’m reading the book “Collective Intelligence”, and one chapter shows how to measure similarity between users on a movie-review website with Euclidean distance.

There, all movies are rated on the same scale from 1–5. But what if I want to find similar users based on features like body height, body width, weight, and the ratio of eye distance to nose length? These features operate on different scales, so body height, for example, would influence the distance much more than the eye–nose ratio.

My question is: what is the best way to approach this example? Should one use a different distance measure (and if so, which?), or normalize the data somehow and keep using Euclidean distance?
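To see why scale matters, here is a small sketch (with made-up numbers) of how a raw Euclidean distance between two users is dominated almost entirely by the large-scale feature:

```python
import numpy as np

# two hypothetical users described by (height in cm, eye-to-nose ratio)
u = np.array([170.0, 0.45])
v = np.array([180.0, 0.55])

# raw Euclidean distance: sqrt(10**2 + 0.1**2), roughly 10.0005 --
# the 10 cm height difference swamps the 0.1 ratio difference
d = np.linalg.norm(u - v)
print(d)
```

Even a ratio difference ten times larger would barely move this distance, which is exactly the over-weighting problem.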

A very common solution to this very common problem (i.e., some variables over-weighting the distance) is to standardize your data.

To do this, you just perform two successive column-wise operations on your data:

• subtract the mean and

• divide by the standard deviation

For instance, in NumPy:

```
>>> import numpy as NP

>>> # first create a small data matrix comprised of three variables
>>> # having three different 'scales' (means and variances)

>>> a = 10*NP.random.rand(6)
>>> b = 50*NP.random.rand(6)
>>> c = 2*NP.random.rand(6)
>>> A = NP.column_stack((a, b, c))
>>> A   # the pre-standardized data
array([[  1.753,  37.809,   1.181],
       [  1.386,   8.333,   0.235],
       [  2.827,  40.5  ,   0.625],
       [  5.516,  47.202,   0.183],
       [  0.599,  27.017,   1.054],
       [  8.918,  35.398,   1.602]])

>>> # mean center the data (column-wise)
>>> A -= NP.mean(A, axis=0)
>>> A
array([[ -1.747,   5.099,   0.368],
       [ -2.114, -24.377,  -0.578],
       [ -0.673,   7.79 ,  -0.189],
       [  2.016,  14.493,  -0.631],
       [ -2.901,  -5.693,   0.24 ],
       [  5.418,   2.688,   0.789]])

>>> # divide by the standard deviation
>>> A /= NP.std(A, axis=0)
>>> A
array([[-0.606,  0.409,  0.716],
       [-0.734, -1.957, -1.125],
       [-0.233,  0.626, -0.367],
       [ 0.7  ,  1.164, -1.228],
       [-1.007, -0.457,  0.468],
       [ 1.881,  0.216,  1.536]])
```
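As a quick sanity check, the two steps can be collapsed into one expression, and after standardizing, every column should have mean 0 and standard deviation 1. A small sketch (using NumPy's newer `default_rng` interface, with an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.column_stack((10 * rng.random(6),
                     50 * rng.random(6),
                     2 * rng.random(6)))

# both standardization steps in one expression
Z = (A - A.mean(axis=0)) / A.std(axis=0)

print(np.allclose(Z.mean(axis=0), 0))  # True
print(np.allclose(Z.std(axis=0), 1))   # True
```

With every feature on this common scale, Euclidean distance no longer favors the variable that happens to have the largest raw units.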