We saw in the previous section that the mean, while an intuitive way to measure central tendency, can generate skewed results when there are outliers in the dataset. In this section we'll consider the median, which is a robust measure of central tendency. That is, it resists being changed much when the dataset contains outliers.
The idea behind the median is that we want to pick the value that's in the middle of the sorted dataset. That way half of the values are above the value we pick, and the other half are below. The slight wrinkle here is that the exact calculation depends on whether the dataset size is odd or even.
If the size is odd, then it's easy. There's a single element in the exact center of the sorted data. That's the median.
Otherwise, the size is even and there are two numbers in the center. The median is the mean of these two numbers.
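The two cases above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation; the function name `median` and the choice to raise an error on an empty dataset are our own assumptions.

```python
def median(values):
    """Return the median of a non-empty collection of numbers."""
    if not values:
        # Assumption: we treat the median of an empty dataset as undefined.
        raise ValueError("median of an empty dataset is undefined")

    ordered = sorted(values)   # the median is defined on the sorted data
    n = len(ordered)
    mid = n // 2

    if n % 2 == 1:
        # Odd size: a single element sits in the exact center.
        return ordered[mid]

    # Even size: take the mean of the two center elements.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

Note that the function sorts a copy of the data, so the caller's list is left in its original order.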
Let's try some examples.
Example: Median of a dataset with an odd size
Consider the dataset
4, 8, 5, 10, 6
First we sort it:
4, 5, 6, 8, 10
This dataset has an odd size. The number in the middle is 6, so that's the median.
Example: Median of a dataset with a repeated element
Now consider the dataset
4, 8, 6, 10, 6
This one's pretty similar to the one above, but there's a repeated element. Doesn't matter though. Again we sort it:
4, 6, 6, 8, 10
The median is 6 again since that's the number in the middle.
Example: Median of a dataset with an even size
Consider the dataset
4, 8, 5, 10, 6, 12
First we sort it:
4, 5, 6, 8, 10, 12
This time the dataset has an even number of elements. The two numbers in the middle are 6 and 8. The mean of these two numbers is 7, so that's the median of the overall dataset.
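If you'd like to check these three examples by machine, Python's standard library has a `statistics.median` function that handles both the odd and even cases. The variable names below are ours, chosen just for this sketch.

```python
from statistics import median

odd_size = [4, 8, 5, 10, 6]
repeated = [4, 8, 6, 10, 6]
even_size = [4, 8, 5, 10, 6, 12]

print(median(odd_size))   # 6
print(median(repeated))   # 6
print(median(even_size))  # 7.0
```

There's no need to sort the data first, since the function does that internally.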
Example: Median of the housing dataset
Finally, let's return to the augmented version of our housing dataset from the previous section, the one where a wealthy homeowner placed his $3.125M home on the market.
To calculate the median, we first note that the size of the dataset is 19, so we just need to sort the values and pick the number in the middle.
The sorted values (in thousands of dollars) are
400, 465, 475, 489, 499, 520, 525, 527, 535, 550, 552, 555, 560, 562, 571, 575, 590, 605, 3125
The middle value, the 10th of 19, is 550, so the median is $550,000. And indeed, that would seem to be a more reasonable summary of this dataset than the mean, $667,368. The median, being robust, "resisted" the influence of the large outlier in the dataset.
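Here is a short sketch that reproduces both numbers from the sorted values listed above. The variable names are our own, and we assume the values are in thousands of dollars, which matches the $3.125M listing.

```python
from statistics import median

# Housing prices in thousands of dollars (assumed unit), including the outlier
prices_thousands = [
    400, 465, 475, 489, 499, 520, 525, 527, 535, 550,
    552, 555, 560, 562, 571, 575, 590, 605, 3125,
]

median_price = median(prices_thousands)
mean_price = sum(prices_thousands) / len(prices_thousands)

print(f"median: ${median_price * 1000:,.0f}")  # median: $550,000
print(f"mean:   ${mean_price * 1000:,.0f}")    # mean:   $667,368
```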
Mean vs. median: which is better?
We might ask then which is a better measure of central tendency: the mean or the median? If the median is robust, doesn't that make it better?
The answer is that it depends. In the housing data example above, the median definitely seems better. Indeed, if you go to a real estate website like Zillow, you'll usually see prices for an area given in terms of the median rather than the mean.
We often use the median to measure central tendency for certain financial statistics too. For example, the mean net worth across all American families as of 2016 was $692,100, which sounds implausibly high. (It's true though.) The reason for such a high measurement is that a relatively small percentage of families with very high net worth skews the mean way, way up. So the mean is misleading here. The median net worth, on the other hand, was a much more believable $97,300.
As a general rule, the median is better when you have outliers, or else when the data are skewed such that a fairly small number of values are either very large or very small. The median reduces their influence on the measurement. On the other hand, the mean is great when the values are fairly similar in scale, since it incorporates every value into the numerical calculation, instead of simply looking for the middle value.
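To see this rule of thumb in action, here is a small sketch that adds a single outlier to a dataset and compares how far each measure moves. The numbers are hypothetical, made up purely for illustration.

```python
from statistics import mean, median

# Hypothetical values, similar in scale
typical = [50, 52, 55, 58, 60]

# The same values plus one large outlier
with_outlier = typical + [1000]

for label, data in [("without outlier", typical), ("with outlier", with_outlier)]:
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

Without the outlier, the mean and the median are both 55. With it, the mean jumps to 212.5 while the median only moves to 56.5.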
Exercise 1. Compute the median of the following dataset: 14, 10, 8, 12. Plot it on a number line. Is the median a reasonable measurement of the center of this dataset? How does it compare with the mean?
Exercise 2. Compute the median of the following dataset: 3, 1, 2, 14, 18. Plot it on a number line. Is the median a reasonable measurement of the center of this dataset? How does it compare with the mean?
Exercise 3. Compute the median of the following dataset: 14, 10, 8, 12, 220. Once again, plot it on a number line. Is the median a reasonable measurement of the center of this dataset? How does it compare with the mean?