I’m reading the book “Collective Intelligence”, and in one chapter they introduce how to measure similarity between users on a movie review website using Euclidean distance.
Now, the movies are all rated on a scale from 1 to 5. But what if I want to find similar users based on features like, let’s say, body height, body width, weight, and the ratio of eye distance to nose length? These features operate on different scales, so e.g. body height would influence the distance much more than the eye–nose ratio.
My question is: what is the best way to approach this example? Should one use a different distance measure (and if so, which?), or normalize the data somehow and use Euclidean distance?
Answer
A very common solution for this very common problem (i.e., some variables over-weighting the distance) is to standardize your data.
To do this, you just perform two successive column-wise operations on your data:

subtract the mean and

divide by the standard deviation
For instance, in NumPy:
>>> import numpy as NP
>>> # first create a small data matrix comprised of three variables
>>> # having three different 'scales' (means and variances)
>>> a = 10*NP.random.rand(6)
>>> b = 50*NP.random.rand(6)
>>> c = 2*NP.random.rand(6)
>>> A = NP.column_stack((a, b, c))
>>> A    # the pre-standardized data
array([[ 1.753, 37.809,  1.181],
       [ 1.386,  8.333,  0.235],
       [ 2.827, 40.5  ,  0.625],
       [ 5.516, 47.202,  0.183],
       [ 0.599, 27.017,  1.054],
       [ 8.918, 35.398,  1.602]])
>>> # mean-center the data (column-wise)
>>> A -= NP.mean(A, axis=0)
>>> A
array([[ -1.747,   5.099,   0.368],
       [ -2.114, -24.377,  -0.578],
       [ -0.673,   7.79 ,  -0.189],
       [  2.016,  14.493,  -0.631],
       [ -2.901,  -5.693,   0.24 ],
       [  5.418,   2.688,   0.789]])
>>> # divide by the standard deviation (column-wise)
>>> A /= NP.std(A, axis=0)
>>> A
array([[-0.606,  0.409,  0.716],
       [-0.734, -1.957, -1.125],
       [-0.233,  0.626, -0.367],
       [ 0.7  ,  1.164, -1.228],
       [-1.007, -0.457,  0.468],
       [ 1.881,  0.216,  1.536]])
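To check that the standardization worked, and to see why it matters for the original question, one could compare Euclidean distances before and after. A minimal sketch (using the modern `numpy.random.default_rng` API rather than the older `NP.random.rand` style above; the seed and scale factors are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# raw data: three features on very different scales
A = np.column_stack((10 * rng.random(6),   # roughly 0-10
                     50 * rng.random(6),   # roughly 0-50
                     2 * rng.random(6)))   # roughly 0-2

# standardize: subtract column means, divide by column standard deviations
Z = (A - A.mean(axis=0)) / A.std(axis=0)

# every column of Z now has mean ~0 and standard deviation 1
assert np.allclose(Z.mean(axis=0), 0.0)
assert np.allclose(Z.std(axis=0), 1.0)

# Euclidean distance between the first two rows, before and after:
# on the raw data the widest-scale column tends to dominate the distance;
# on the standardized data each feature contributes on a comparable scale
d_raw = np.linalg.norm(A[0] - A[1])
d_std = np.linalg.norm(Z[0] - Z[1])
```

On the standardized matrix `Z`, a body-height-sized feature and an eye–nose-ratio-sized feature carry comparable weight in the distance, which is exactly the fix the question asks for.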
Attribution
Source: Link, Question Author: tobigue, Answer Author: doug