How to measure the statistical “distance” between two frequency distributions?

I am undertaking a data analysis project which involves investigating website usage times over the course of the year. What I would like to do is compare how “consistent” the usage patterns are: say, how close they are to a pattern of using the site for 1 hour once per week, or to one of using it for 10 minutes at a time, 6 times per week. I am aware of several things which can be calculated (a sketch of how to compute them follows the list):

  • Shannon entropy: measures the uncertainty in an outcome; since it is maximised by the uniform distribution, it indicates how far a probability distribution is from uniform;
  • Kullback-Leibler divergence: measures how much one probability distribution differs from another;
  • Jensen-Shannon divergence: similar to the KL divergence, but more useful because it is symmetric and always returns finite values;
  • Kolmogorov-Smirnov test: a test to determine whether two samples of a continuous random variable were drawn from the same distribution, based on their empirical cumulative distribution functions;
  • Chi-squared test: a goodness-of-fit test measuring how much an observed frequency distribution differs from an expected one.
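For concreteness, here is a minimal sketch of how these quantities can be computed with SciPy; the vectors p and q are made-up stand-ins for the observed and ideal distributions, not the actual data:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

# Made-up normalised usage distributions over four duration bins (minutes).
p = np.array([0.1, 0.4, 0.3, 0.2])  # observed usage
q = np.array([0.0, 0.5, 0.5, 0.0])  # "ideal" usage

print(stats.entropy(p))     # Shannon entropy of p (natural log)
print(stats.entropy(p, q))  # KL divergence D(p||q): infinite here, because
                            # q has zero mass where p does not
print(jensenshannon(p, q))  # JS distance (sqrt of JS divergence): finite

# The chi-squared test works on raw frequencies, not probabilities;
# observed and expected counts must have the same total.
obs = np.array([5, 20, 15, 10])
exp = np.array([2, 24, 22, 2])
print(stats.chisquare(obs, f_exp=exp))

# Kolmogorov-Smirnov compares two raw samples, not binned counts.
print(stats.ks_2samp([60, 55, 65, 60], [10, 12, 9, 11, 10, 8]))
```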

What I would like to do is compare how much the actual usage durations (blue) differ, in distribution, from the ideal usage times (orange). These distributions are discrete, and the versions below are normalised so that they become probability distributions. The horizontal axis represents the amount of time (in minutes) a user has spent on the website, recorded for each day of the year. Days on which the user did not visit the website at all count as zero durations, but these have been removed from the frequency distribution. On the right is the cumulative distribution function.

[Figure: distribution of website usage data versus ideal usage data]

My only problem is that, even though the JS divergence does return finite values, when I look at different users and compare their usage distributions to the ideal one, I get values that are mostly identical, so it is not a good indicator of how much they differ. Also, quite a bit of information is lost when normalising to probability distributions rather than keeping frequency distributions: if a student uses the platform 50 times, the blue distribution should be vertically scaled so that the lengths of its bars total 50, and the orange bar should have a height of 50 rather than 1. Part of what we mean by “consistency” is whether how often a user visits the website affects how much they get out of it, so if the number of visits is lost, comparing probability distributions is a bit dubious: even if the probability distribution of a user’s durations is close to the “ideal” usage, that user may have used the platform for only 1 week during the year, which is arguably not very consistent.
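To illustrate that last point: SciPy’s jensenshannon normalises its inputs, so two users whose visit counts differ tenfold but whose usage has the same shape come out at distance zero. A small sketch with made-up counts:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Made-up duration counts: two users with the same *shape* of usage,
# but user B visits the site ten times as often as user A.
user_a = np.array([1, 3, 1])     # 5 visits in total
user_b = np.array([10, 30, 10])  # 50 visits in total

# jensenshannon normalises its inputs, so the tenfold difference in
# visit frequency vanishes and the two users look identical.
print(jensenshannon(user_a, user_b))  # 0.0
```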

Are there any well-established techniques for comparing two frequency distributions and calculating some sort of metric which characterises how similar (or dissimilar) they are?

Answer

You may be interested in the Earth mover’s distance, also known as the Wasserstein metric. It is implemented in R (look at the emdist package) and in Python (e.g., scipy.stats.wasserstein_distance for the one-dimensional case). We also have a number of threads on it.

The EMD works for both continuous and discrete distributions. The emdist package for R works on discrete distributions.

The advantage over something like a χ² statistic is that the EMD yields interpretable results. Picture each distribution as a mound of earth; the EMD then tells you how much earth you would need to transport, and how far, to turn one distribution into the other.

Put another way: the two distributions (1,0,0) and (0,1,0) should be “more similar” than (1,0,0) and (0,0,1). The EMD recognizes this and assigns a smaller distance to the first pair than to the second. The χ² statistic assigns the same distance to both pairs, because it has no notion of an ordering among the distribution’s entries.
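As a concrete check, SciPy’s wasserstein_distance (the one-dimensional EMD) reproduces this ordering; the bin positions 0, 1, 2 are simply an assumed support for the three toy distributions:

```python
from scipy.stats import wasserstein_distance

bins = [0, 1, 2]  # positions of the three bins
d1, d2, d3 = [1, 0, 0], [0, 1, 0], [0, 0, 1]

# Moving all the mass one bin over costs 1; moving it two bins costs 2.
print(wasserstein_distance(bins, bins, d1, d2))  # 1.0
print(wasserstein_distance(bins, bins, d1, d3))  # 2.0
```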

Attribution
Source: Link, Question Author: Community, Answer Author: Stephan Kolassa
