# What is the optimal distance function for individuals when attributes are nominal?

I do not know which distance function between individuals to use when the attributes are nominal (unordered categorical).
Some textbooks suggest the Simple Matching function, while others suggest recoding the nominal attributes into binary ones and using the Jaccard coefficient. But what if a nominal attribute has more than two values, say three or four?

Which distance function should I use for nominal attributes?

Technically, to compute a (dis)similarity measure between individuals on nominal attributes, most programs first recode each nominal variable into a set of dummy binary variables and then compute some measure for binary variables. There are formulas for many frequently used binary similarity and dissimilarity measures.

What are dummy variables (also called one-hot encoding)? Below are 5 individuals and two nominal variables (A with 3 categories, B with 2 categories). Three dummies are created in place of A and two dummies in place of B.

ID   A    B      A1 A2 A3      B1 B2
1    2    1       0  1  0       1  0
2    1    2       1  0  0       0  1
3    3    2       0  0  1       0  1
4    1    1       1  0  0       1  0
5    2    1       0  1  0       1  0


(There is no need to eliminate one dummy variable as “redundant”, as we typically would in regression with dummies. This is not practised in clustering, although in special situations you might consider that option.)
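The recoding above can be sketched in plain Python; the variable names (`one_hot`, `dummies`) are my own and just illustrate the idea.

```python
# A minimal sketch of recoding nominal columns into dummy (one-hot) variables,
# using the 5-individual example above.
data = {
    "A": [2, 1, 3, 1, 2],  # nominal variable A with categories 1..3
    "B": [1, 2, 2, 1, 1],  # nominal variable B with categories 1..2
}

def one_hot(values):
    """Return a list of 0/1 dummy vectors, one dummy per observed category."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]

# Concatenate the A-dummies and B-dummies for each individual.
dummies = [a + b for a, b in zip(one_hot(data["A"]), one_hot(data["B"]))]
for row in dummies:
    print(row)
# individual 1 -> [0, 1, 0, 1, 0], matching the A1..A3, B1..B2 columns above
```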

There are many measures for binary variables; however, not all of them logically suit dummy binary variables, i.e. formerly nominal ones. You see, for a nominal variable the fact “the 2 individuals match” and the fact “the 2 individuals don’t match” are of equal importance. But consider the popular Jaccard measure $\frac{a}{a+b+c}$, where

• a – number of dummies 1 for both individuals
• b – number of dummies 1 for this and 0 for that
• c – number of dummies 0 for this and 1 for that
• d – number of dummies 0 for both

Here mismatch consists of two variants, $b$ and $c$; but for us, as already said, each of them is of the same importance as the match $a$. Hence we should double-weight $a$, and get the formula $\frac{2a}{2a+b+c}$, known as the Dice (after Lee Dice) or Czekanowski–Sørensen measure. It is more appropriate for dummy variables. Indeed, the famous composite Gower coefficient (which is recommended for your nominal attributes) is exactly equal to Dice when all the attributes are nominal. Note also that for dummy variables the Dice measure (between individuals) = the Ochiai measure (which is simply a cosine) = the Kulczynski 2 measure. And, for your information, 1 − Dice = the binary Lance–Williams distance, also known as the Bray–Curtis distance. Look how many synonyms – you are sure to find one of them in your software!
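A small sketch of the Dice formula from the $a$, $b$, $c$ counts defined above (the function name `dice` and the example pair are my own choices):

```python
# Dice similarity between two individuals' dummy vectors, following the
# a, b, c counts defined above (d is deliberately ignored).
def dice(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return 2 * a / (2 * a + b + c)

# Individuals 1 and 4 from the table: mismatch on A (2 vs 1), match on B (1 vs 1).
i1 = [0, 1, 0, 1, 0]
i4 = [1, 0, 0, 1, 0]
print(dice(i1, i4))  # 1 match out of 2 nominal variables -> 0.5
```

Note that the result equals the share of nominal variables on which the two individuals agree, which is exactly the "relative agreement" interpretation discussed next.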

The intuitive validity of the Dice similarity coefficient comes from the fact that it is simply the co-occurrence proportion (or relative agreement). For the data snippet above, take nominal column A and compute the 5x5 square symmetric matrix with entries 1 (both individuals fell in the same category) or 0 (not in the same category). Compute the matrix for B likewise.

A    1  2  3  4  5        B    1  2  3  4  5
_____________             _____________
1| 1                      1| 1
2| 0  1                   2| 0  1
3| 0  0  1                3| 0  1  1
4| 0  1  0  1             4| 1  0  0  1
5| 1  0  0  0  1          5| 1  0  0  0  1


Sum the corresponding entries of the two matrices and divide by 2 (the number of nominal variables), and you obtain the matrix of Dice coefficients. (So you actually don’t have to create dummies to compute Dice; with matrix operations you can probably do it faster the way just described.) See a related topic on Dice for the association of nominal attributes.
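The dummy-free route just described can be sketched directly on the nominal columns; this is my own illustration, not any particular package's routine.

```python
# Dice matrix computed straight from the nominal columns: for each pair of
# individuals, count the variables on which they agree and divide by the
# number of nominal variables.
A = [2, 1, 3, 1, 2]
B = [1, 2, 2, 1, 1]
cols = [A, B]
n = len(A)

dice_matrix = [
    [sum(col[i] == col[j] for col in cols) / len(cols) for j in range(n)]
    for i in range(n)
]
print(dice_matrix[0][4])  # individuals 1 and 5 agree on both A and B -> 1.0
print(dice_matrix[1][2])  # individuals 2 and 3 agree only on B -> 0.5
```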

Although Dice is the most natural measure to use when you want a (dis)similarity function between cases with categorical attributes, other binary measures could be used – if you find that their formulas satisfy the considerations about your nominal data.

Measures like Simple Matching (SM, or Rand) $\frac{a+d}{a+b+c+d}$, which contain $d$ in the numerator, won’t suit you, on the grounds that they treat 0-0 (both individuals lacking a specific category) as a match, which is obviously nonsense with originally nominal, qualitative features. So check the formula of the similarity or dissimilarity you plan to use with sets of dummy variables: if it has or implies $d$ as grounds for sameness, don’t use that measure for nominal data. For example, squared Euclidean distance, whose formula with binary data becomes just $b+c$ (and which is in this case synonymous with Manhattan distance or Hamming distance), does treat $d$ as a basis of sameness. Actually, $D^2_{euc} = p(1-\mathrm{SM})$, where $p$ is the number of binary attributes; thus Euclidean distance carries the same information as SM and shouldn’t be applied to originally nominal data.
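The identity between squared Euclidean distance, $b+c$, and $p(1-\mathrm{SM})$ is easy to check numerically; the helper `counts` and the example vectors are my own.

```python
# Check that on 0/1 dummy vectors: squared Euclidean distance = b + c
# = p * (1 - SM), where p is the number of dummies.
def counts(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

x = [0, 1, 0, 1, 0]   # individual 1 from the table
y = [1, 0, 0, 0, 1]   # individual 2
a, b, c, d = counts(x, y)
p = len(x)
sq_euclid = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
sm = (a + d) / p
print(sq_euclid, b + c, p * (1 - sm))  # all three coincide
```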

But

Having read the previous “theoretical” paragraph, I realized that – in spite of what I wrote – the majority of binary coefficients (including those using $d$) will do in practice most of the time. I verified that, with dummy variables obtained from a number of nominal ones, the Dice coefficient is strictly functionally related to a number of other binary measures (the acronym is the measure’s keyword in SPSS):

                                                       relation with Dice
Similarities
Russell and Rao (simple joint prob)    RR          proportional
Simple matching (or Rand)              SM          linear
Jaccard                                JACCARD     monotonic
Sokal and Sneath 1                     SS1         monotonic
Rogers and Tanimoto                    RT          monotonic
Sokal and Sneath 2                     SS2         monotonic
Sokal and Sneath 4                     SS4         linear
Hamann                                 HAMANN      linear
Phi (or Pearson) correlation           PHI         linear
Dispersion similarity                  DISPER      linear
Dissimilarities
Euclidean distance                     BEUCLID     monotonic
Squared Euclidean distance             BSEUCLID    linear
Pattern difference                     PATTERN     monotonic (linear if the d term is omitted from the formula)
Variance dissimilarity                 VARIANCE    linear
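For example, the linear relation between SM and Dice can be verified on the toy data: with dummies from $k$ nominal variables, each individual has exactly $k$ ones, so $b = c = k - a$ and $\mathrm{SM} = \frac{2k}{p}\,\mathrm{Dice} + \frac{p-2k}{p}$. A small check under those assumptions (all names are my own):

```python
# Verify that SM is a fixed linear function of Dice across all pairs when the
# dummies come from k nominal variables (here k = 2, p = 5 dummies in total).
A = [2, 1, 3, 1, 2]
B = [1, 2, 2, 1, 1]

def one_hot(values):
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]

rows = [x + y for x, y in zip(one_hot(A), one_hot(B))]
p, k = len(rows[0]), 2

dices, sms = [], []
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        x, y = rows[i], rows[j]
        a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
        b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
        c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
        d = p - a - b - c
        dices.append(2 * a / (2 * a + b + c))
        sms.append((a + d) / p)

# SM = (2k/p) * Dice + (p - 2k)/p for every pair
print(all(abs(s - ((2 * k / p) * dc + (p - 2 * k) / p)) < 1e-12
          for dc, s in zip(dices, sms)))  # prints True
```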


Since in many applications of a proximity matrix, such as many methods of cluster analysis, results will not change, or will change smoothly, under a linear (and sometimes even a monotonic) transform of proximities, it appears one may be justified in using a vast number of binary measures besides Dice to get the same or similar results. But you should first consider/explore how the specific method (for example, a linkage in hierarchical clustering) reacts to a given transformation of proximities.

If your planned clustering or MDS analysis is sensitive to monotonic transforms of distances, you had better refrain from using the measures marked “monotonic” in the table above (and thus, yes, it isn’t a good idea to use Jaccard similarity or non-squared Euclidean distance with dummy, i.e. formerly nominal, attributes).