Variance of a distribution of multi-level categorical data

I am currently analyzing large data sets with various categorical characteristics (such as city). I want to find a measure that says how much or how little variation there is across the data. This would be much more useful than simply counting the number of distinct elements.

For example, consider the following data:

City
----
Moscow
Moscow
Paris
London
London
London
NYC
NYC
NYC
NYC


I can see that there are 4 distinct cities, but that doesn’t tell me how the data are distributed among them. One ‘formula’ I came up with was taking the sum of the squared fractions of the total dataset for each element. In this case, it would be (2/10)^2 + (1/10)^2 + (3/10)^2 + (4/10)^2. I have no real mathematical justification for this; it was just an idea.

In this case, for example, in a set with 10 elements, if 9 were the same, and 1 was different, the number would be (9/10)^2 + (1/10)^2. However, if it were half and half, it would be (5/10)^2 + (5/10)^2.

I wanted to get an opinion on what similar formulas and areas of study there are. I really could not find anything with a few quick Google searches.

I think what you probably want is (Shannon’s) entropy. It is calculated like this:

$$H(x) = -\sum_{x_i} p(x_i)\,\log_2 p(x_i)$$

This represents a way of thinking about the amount of information in a categorical variable.

In R, we can calculate this as follows:

City = c("Moscow", "Moscow", "Paris", "London", "London",
"London", "NYC", "NYC", "NYC", "NYC")
table(City)
# City
# London Moscow    NYC  Paris
#      3      2      4      1
entropy = function(cat.vect){
  px  = table(cat.vect)/length(cat.vect)  # proportion of each level
  lpx = log(px, base=2)                   # log2 of each proportion
  ent = -sum(px*lpx)                      # Shannon entropy, in bits
  return(ent)
}
entropy(City)                                             # [1] 1.846439
entropy(rep(City, 10))                                    # [1] 1.846439
entropy(c(    "Moscow",       "NYC"))                     # [1] 1
entropy(c(    "Moscow",       "NYC", "Paris", "London"))  # [1] 2
entropy(rep(  "Moscow", 100))                             # [1] 0
entropy(c(rep("Moscow",   9), "NYC"))                     # [1] 0.4689956
entropy(c(rep("Moscow",  99), "NYC"))                     # [1] 0.08079314
entropy(c(rep("Moscow",  97), "NYC", "Paris", "London"))  # [1] 0.2419407


From this, we can see that the length of the vector doesn’t matter. Entropy increases with the number of possible options (‘levels’ of a categorical variable). If there is only one possibility, the value is $0$ (as low as you can get). For any given number of possibilities, the value is largest when the probabilities are equal.
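As a quick check of that last point, here is a minimal sketch showing that with $k$ equally likely levels the entropy equals $\log_2 k$, which is the most it can be for $k$ levels. The level names (`level1`, `level2`, ...) and the helper variables `ks` and `ents` are made up for illustration; `entropy()` is the same function as above, repeated so the snippet runs on its own.

```r
# Same definition as entropy() above, repeated so this snippet is self-contained.
entropy = function(cat.vect){
  px  = table(cat.vect)/length(cat.vect)
  lpx = log(px, base=2)
  -sum(px*lpx)
}

# Hypothetical variables with k equally likely levels (one observation each).
ks   = c(1, 2, 4, 8)
ents = sapply(ks, function(k) entropy(paste0("level", seq_len(k))))

# Compare the computed entropy with log2(k), side by side.
rbind(k=ks, entropy=ents, log2.k=log2(ks))
```

Any imbalanced variable with the same number of levels comes out below that ceiling, as the 97/1/1/1 example above does relative to $\log_2 4 = 2$.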

Somewhat more technically, with more possible options, it takes more information to represent the variable while minimizing error. With only one option, there is no information in your variable. Even with more options, but where almost all actual instances are a particular level, there is very little information; after all, you can just guess “Moscow” and nearly always be right.

your.metric = function(cat.vect){
  px   = table(cat.vect)/length(cat.vect)  # proportion of each level
  spx2 = sum(px^2)                         # sum of squared proportions
  return(spx2)
}
your.metric(City)                                             # [1] 0.3
your.metric(rep(City, 10))                                    # [1] 0.3
your.metric(c(    "Moscow",       "NYC"))                     # [1] 0.5
your.metric(c(    "Moscow",       "NYC", "Paris", "London"))  # [1] 0.25
your.metric(rep(  "Moscow", 100))                             # [1] 1
your.metric(c(rep("Moscow",   9), "NYC"))                     # [1] 0.82
your.metric(c(rep("Moscow",  99), "NYC"))                     # [1] 0.9802
your.metric(c(rep("Moscow",  97), "NYC", "Paris", "London"))  # [1] 0.9412


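Your suggested metric is the sum of squared probabilities (the same quantity as Simpson’s index in ecology and the Herfindahl index in economics). In some ways it behaves similarly (e.g., notice that it is invariant to the length of the variable), but note that it decreases as the number of levels increases and increases as the variable becomes more imbalanced. That is, it moves inversely to entropy, but the units (the size of the increments) differ. For a variable with $k$ observed levels, your metric is bounded by $1/k$ and $1$, so it can approach but never reach $0$, whereas entropy ranges from $0$ to $\log_2 k$ and so has no upper limit as the number of levels grows. Here is a plot of their relationship: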
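A comparable plot can be drawn with a sketch like the following. It assumes, purely for illustration, a two-level variable whose more common level has probability `p` between 0.5 and 0.99; that is not necessarily the data behind the original figure, and the names `p`, `ent` and `spx2` are chosen here.

```r
# Assumption: a two-level variable where the more common level has probability p.
p    = seq(0.5, 0.99, by=0.01)
ent  = -(p*log2(p) + (1-p)*log2(1-p))  # Shannon entropy, in bits
spx2 = p^2 + (1-p)^2                   # sum of squared probabilities

# As imbalance grows, spx2 rises toward 1 while entropy falls toward 0.
plot(spx2, ent, type="l",
     xlab="sum of squared probabilities", ylab="entropy (bits)")
```

Over this range the curve falls monotonically but is not a straight line, so the two measures rank distributions the same way (in opposite directions) while differing in the size of their increments.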