The Ninther — Approximating Medians

Andy Greatorex
2 min readMar 23, 2017

--

The main benefit of using the median as oppose to other average approximation aggregates like the mean, is because it is less skewed by extremely large or small values. It offers a better approximation of what is called a typical value of the data.

To estimate the median of a dataset, you must read all the elements into memory, sort them and then find the middle value; a process which is inefficient and often impractical if the dataset is very large.

Ideally, what we want is to estimate a value for the median that uses as few comparisons and arithmetic computations as possible.

I recently came across a simple and effective median estimation technique, written by Tukey from 1978 called “median of medians”.

It works as follows. Given a list of data points S, we partition S into buckets and calculate the median for each bucket. We then take those medians and calculate the median of them (in his paper, Tukey calls this value the ninther).

If S is small, as in the following demonstration, then finding the median is simple, but as the number of elements in S gets very large, the technique becomes more powerful.

Suppose S={4,7,1,3,20,8,3,9,2}. If we split S into 3 equal(ish) sets. So we might end up with the following sets:

--

--

Andy Greatorex

London based data scientist @Revolut. Formerly in NYC @Barclays. Building stuff for the fun of it.