Is there more than one “median” formula?

In my work, when individuals refer to the “mean” value of a data set, they’re typically referring to the arithmetic mean (i.e., the “average” or “expected value”). If I provided the geometric mean instead, people would likely think I was being snide or unhelpful, since everyone already understands which “mean” is intended.

I’m trying to determine if there are multiple definitions of the “median” of a data set. For example, one of the definitions provided by a colleague for finding the median of a data set with an even number of elements would be:

Algorithm ‘A’

  • Divide the number of elements by two and round down.
  • That value is the index (counting from 1) of the median in the sorted data.
  • e.g. For the following set, the index is 2, so the median would be 5.
  • [4, 5, 6, 7]

This seems to make sense, though the rounding-down aspect seems a bit arbitrary.
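A minimal Python sketch of Algorithm ‘A’ as described above. It assumes the data are sorted first and that positions are counted from 1 (neither is spelled out in the description); the function name `median_algorithm_a` is hypothetical, and the sketch refuses odd-sized input because the rule is only stated for an even number of elements.

```python
def median_algorithm_a(data):
    """Algorithm 'A' for an even number of elements: the lower middle element."""
    ordered = sorted(data)  # assumption: the rule applies to sorted values
    n = len(ordered)
    if n == 0 or n % 2 != 0:
        # The description only covers even-sized data sets.
        raise ValueError("this sketch only covers non-empty, even-sized data")
    position = n // 2             # 1-based position: n / 2, rounded down
    return ordered[position - 1]  # convert to Python's 0-based indexing


print(median_algorithm_a([4, 5, 6, 7]))  # prints 5
```

The returned value is always one of the original elements, which is the property the test code mentioned later appears to rely on.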

Algorithm ‘B’

Another colleague has proposed a separate algorithm, taken from a stats textbook of his (I still need to get the title and author):

  • Take (n + 1)/2, where n is the number of elements, and keep both the rounded-down and rounded-up integers. Name them n_lo and n_hi.
  • Take the arithmetic mean of the elements at positions n_lo and n_hi (counting from 1).
  • e.g. For the following set, n_lo = 2 and n_hi = 3, so the median would be (5+6)/2 = 5.5.
  • [4, 5, 6, 7]

This seems wrong though, as the median value, 5.5 in this case, isn’t actually in the original data set. When we swapped out algorithm ‘A’ for ‘B’ in some test code, it broke horribly (as we expected).
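A minimal Python sketch of Algorithm ‘B’, read so that it matches the worked example (averaging the two middle elements, 5 and 6). It assumes sorted data and 1-based positions; the function name `median_algorithm_b` and the integer-division expressions for n_lo and n_hi are my own choices for illustration, not taken from the textbook.

```python
def median_algorithm_b(data):
    """Algorithm 'B': arithmetic mean of the two middle elements."""
    ordered = sorted(data)  # assumption: the rule applies to sorted values
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    n_lo = (n + 1) // 2  # 1-based position of the lower middle element
    n_hi = n // 2 + 1    # 1-based position of the upper middle element
    return (ordered[n_lo - 1] + ordered[n_hi - 1]) / 2


data = [4, 5, 6, 7]
result = median_algorithm_b(data)
print(result)          # prints 5.5
print(result in data)  # prints False: the value is not an observed element
```

For an odd number of elements n_lo and n_hi coincide, so the sketch returns the single middle element (as a float).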

Question

Is there a formal “name” for these two approaches to calculating the median of a data set? For example, “lesser-of-the-two median” versus “average-the-middle-elements-and-make-new-data median”?

Answer

TL;DR – I’m not aware of specific names being given to the different estimators of the sample median. Conventions for computing sample statistics are rather fussy, and different resources give different definitions.

In Hogg, McKean and Craig’s Introduction to Mathematical Statistics, the authors provide a definition of the median of a random sample, but only for the case where the sample size is odd! The authors write:

Certain functions of the order statistics are important statistics themselves… if n is odd, Y_{(n+1)/2} … is called the median of the random sample.

The authors provide no guidance on what to do when the sample size is even. (Note that Y_i is the ith smallest datum.)

But this seems unnecessarily restrictive; I would prefer to be able to define a median of a random sample for even or odd n. Moreover, I would like the median to be unique. Given these two requirements, I have to make some decisions about how to best find a unique sample median. Both Algorithm A and Algorithm B satisfy these requirements. Imposing additional requirements could eliminate either or both from consideration.

For an even number of distinct values, Algorithm B has the property that half the data fall above the value and half fall below it. In light of the definition of the median of a random variable, this seems nice.
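As a rough illustration of that property, here is a small Python sketch that counts how many data points fall strictly below and strictly above each candidate median. It leans on `statistics.median_low` and `statistics.median` from Python's standard library, which on this example return the same values as Algorithm A and Algorithm B respectively; the choice of Python and the helper name `split_counts` are mine, not something from the question.

```python
import statistics


def split_counts(data, value):
    """Return (count strictly below value, count strictly above value)."""
    below = sum(1 for x in data if x < value)
    above = sum(1 for x in data if x > value)
    return below, above


data = [4, 5, 6, 7]
low = statistics.median_low(data)  # lower of the two middle values, like Algorithm A
mid = statistics.median(data)      # mean of the two middle values, like Algorithm B

print(low, split_counts(data, low))  # prints 5 (1, 2)
print(mid, split_counts(data, mid))  # prints 5.5 (2, 2)
```

Only the averaged value splits this data set evenly; the lower middle value leaves one point below and two above.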


Whether a particular estimator breaks unit tests is a property of the unit tests: tests written against one specific estimator won’t necessarily pass when you substitute another. In the ideal case, the unit tests were chosen because they reflect the critical needs of your organization, not because of a doctrinaire argument over definitions.

Attribution
Source: Link, Question Author: Cloud, Answer Author: Sycorax
