Mean

The most common approach to measuring the center of a numerical dataset is called the mean, which is also known as the arithmetic mean or the average.

Definition. The mean \(\bar{x}\) (pronounced "x-bar") of a dataset \(\{x_{1},\cdots,x_{n}\}\) is given by:

$${\bar{x}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}}}$$
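The definition translates almost directly into code. The following is a minimal Python sketch; the function name `mean` and the small sample list are illustrative choices, not part of the original text.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        # The formula divides by n, so an empty dataset has no mean.
        raise ValueError("mean of an empty dataset is undefined")
    # Numerator: the sum of all x_i. Denominator: the count n.
    return sum(values) / len(values)


# Hypothetical three-element dataset, used only to exercise the function.
print(mean([2, 4, 9]))  # 5.0
```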

Let's consider an example.

Example: Housing prices

Let's use the formula above to calculate the mean of the house price dataset that we considered in the previous section. The following are prices in thousands of USD:

499  525  555
489  475  400
560  571  465
552  535  527
520  562  605
575  550  590

The prices are the \(x_{i}\) in the formula. To get the numerator, we need to add them up. Their total is 9,555. The denominator is 18 since the dataset contains 18 elements. So:

$${\bar{x}={\frac {9555}{18}}\approx 530.833}$$

That is, the mean house price is about $530,833. Here's how it looks on the number line:

House prices with mean
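The arithmetic in this example can be checked with a short Python sketch. The variable names are illustrative, and the cross-check uses `statistics.fmean` from the standard library (Python 3.8 or later).

```python
from math import isclose
from statistics import fmean

# House prices in thousands of USD, as listed in the example above.
prices = [
    499, 525, 555,
    489, 475, 400,
    560, 571, 465,
    552, 535, 527,
    520, 562, 605,
    575, 550, 590,
]

total = sum(prices)      # numerator of the formula
n = len(prices)          # denominator of the formula
mean_price = total / n

print(total, n, round(mean_price, 3))  # 9555 18 530.833

# Cross-check against the standard library's implementation.
assert isclose(mean_price, fmean(prices))
```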

Is the mean a good measure of central tendency?

Judging from the number line above, the mean would appear to be a reasonable measure of central tendency. Unlike our earlier approach of simply taking the midpoint of the range of values, here we're accounting for every data point in the dataset, and so the heavier concentration of prices toward the upper end of the range "pulls" the mean toward the right.

In many cases, the mean is an outstanding way to measure the central tendency of a dataset. It's intuitive and often generates reasonable results. There is, however, one sort of case where the results are more questionable: when the dataset contains a so-called "outlier". An outlier is a data point that is significantly different from the other data points. For numerical data, outliers are usually data points that are either much larger or much smaller than the other data points.

Another way to say this is that the mean is not robust. A measurement is robust if it resists being changed much by outliers. As we'll see in the next example, outliers strongly impact the mean. So the mean isn't a robust measurement of central tendency.

This concept of the robustness of a given measurement technique will come up again and again. For example, we'll see later that the range is not a robust measure of spread. As a general rule, we don't want outliers to unduly influence measurements. So we either need to preprocess our data to remove outliers, or else use robust measurements.

Example: Housing prices with an outlier

Now let's make a single change to the house price dataset. A wealthy homeowner puts his $3.125M home on the market (3,125 in our units of thousands of USD), which certainly counts as an outlier for this dataset:

499  525  555
489  475  400
560  571  465
552  535  527
520  562  605
575  550  590
3,125

The total rises to 9,555 + 3,125 = 12,680, and the dataset now contains 19 elements. Recalculating the mean, we now have:

$${\bar{x}={\frac {12680}{19}}\approx 667.368}$$

The outlier has clearly pulled the mean to the right, above the values of all houses other than the outlier itself. In this case, the mean is a misleading way to summarize the dataset:

The mean of house prices with an outlier
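To see how far a single outlier moves the mean, the two datasets can be compared side by side. This is a small Python sketch with illustrative variable names; it also counts how many prices fall below the new mean.

```python
# Original house prices in thousands of USD.
prices = [
    499, 525, 555,
    489, 475, 400,
    560, 571, 465,
    552, 535, 527,
    520, 562, 605,
    575, 550, 590,
]

# Same dataset with the $3.125M home added.
with_outlier = prices + [3125]

mean_before = sum(prices) / len(prices)
mean_after = sum(with_outlier) / len(with_outlier)

# How many houses are priced below the new mean?
below = sum(1 for p in with_outlier if p < mean_after)

print(round(mean_before, 3), round(mean_after, 3))  # 530.833 667.368
print(below, "of", len(with_outlier))               # 18 of 19
```

Every house except the outlier sits below the recalculated mean, which is why the mean no longer describes a typical price in this dataset.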

In the next section, we'll learn about a new measurement of central tendency—one that's robust in the face of outliers.

Exercises

Exercise 1. Compute the mean of the following dataset: 14, 10, 8, 12. Plot it on a number line. Is the mean a reasonable measurement of the center of this dataset?

Exercise 2. Now compute the mean of the following dataset: 14, 10, 8, 12, 220. Once again, plot it on a number line. Is the mean a reasonable measurement of the center of this dataset?