# Mean

The most common approach to measuring the center of a numerical
dataset is called the *mean*, which is also known as the *arithmetic mean* or the
*average*.

**Definition.** The *mean* \(\bar{x}\) (pronounced "x-bar") of a dataset
\(\{x_{1},\ldots,x_{n}\}\) is given by:

\[
\bar{x} = \frac{x_{1} + \cdots + x_{n}}{n} = \frac{\sum_{i=1}^{n} x_{i}}{n}
\]

Let's consider an example.

## Example: Housing prices

Let's use the formula above to calculate the mean of the house price dataset that we considered in the previous section. The following are prices in thousands of USD:

| | | |
|---|---|---|
| 499 | 525 | 555 |
| 489 | 475 | 400 |
| 560 | 571 | 465 |
| 552 | 535 | 527 |
| 520 | 562 | 605 |
| 575 | 550 | 590 |

The prices are the \(x_{i}\) in the formula. To get the numerator, we add them up: their total is 9,555. The denominator is 18, since the dataset contains 18 elements. So:

\[
\bar{x} = \frac{9{,}555}{18} \approx 530.833
\]

That is, the mean house price is about $530,833. Here's how it looks on the number line:

[Figure: number line of the house prices with the mean marked]
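The arithmetic can also be checked with a few lines of Python. This is only a minimal sketch: the variable names (`prices`, `total`, `n`, `x_bar`) are chosen here for illustration, and the values are copied from the table above.

```python
# House prices in thousands of USD, copied from the table above.
prices = [
    499, 525, 555,
    489, 475, 400,
    560, 571, 465,
    552, 535, 527,
    520, 562, 605,
    575, 550, 590,
]

total = sum(prices)   # numerator: the sum of all data points
n = len(prices)       # denominator: the number of data points
x_bar = total / n     # the mean

print(total, n)         # 9555 18
print(round(x_bar, 3))  # 530.833
```

Python's standard library also provides `statistics.mean`, which should agree with this hand-rolled calculation.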

## Is the mean a good measure of central tendency?

Judging from the number line above, the mean would appear to be a reasonable measure of central tendency. Unlike our earlier approach of simply taking the midpoint of the range of values, here we're accounting for every data point in the dataset, and so the heavier concentration of prices toward the upper end of the range "pulls" the mean toward the right.

In many cases, the mean is an outstanding way to measure the central tendency of a dataset. It's intuitive
and often generates reasonable results. There is, however, one sort of case where the results are more
questionable: when the dataset contains a so-called "outlier". An *outlier* is a data point that is
significantly different from the other data points. For numerical data, outliers are usually data points that
are either much larger or much smaller than the other data points.

Another way to say this is that the mean is not robust. A measurement is *robust* if it resists being
changed much by outliers. As we'll see in the next example, outliers strongly impact the mean. So the mean
isn't a robust measurement of central tendency.
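
To see why a single extreme value has so much influence, it may help to write the new mean in terms of the old one. The following is a short derivation from the definition of the mean, where \(x_{n+1}\) denotes a data point added to a dataset of \(n\) points with mean \(\bar{x}\). The notation \(\bar{x}_{\text{new}}\) is introduced here only for illustration:

\[
\bar{x}_{\text{new}} = \frac{n\bar{x} + x_{n+1}}{n+1} = \bar{x} + \frac{x_{n+1} - \bar{x}}{n+1}
\]

So the mean shifts by the new point's distance from the old mean, divided by \(n+1\). Nothing caps that distance, so one sufficiently extreme point can move the mean as far as we like.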

This concept of the *robustness* of a given measurement technique will come up again and again.
For example, we'll see later that the range is not a robust
measure of spread. As a general rule, we don't want outliers to unduly influence measurements. So we either
need to preprocess our data to remove outliers, or else use robust measurements.

## Example: Housing prices with an outlier

Now let's make a single change to the house price dataset. A wealthy homeowner puts his $3.125M home on the market, which certainly counts as an outlier for this dataset:

| | | |
|---|---|---|
| 499 | 525 | 555 |
| 489 | 475 | 400 |
| 560 | 571 | 465 |
| 552 | 535 | 527 |
| 520 | 562 | 605 |
| 575 | 550 | 590 |
| 3,125 | | |

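Recalculating the mean, we now have

\[
\bar{x} = \frac{9{,}555 + 3{,}125}{19} = \frac{12{,}680}{19} \approx 667.368
\]

That is, the mean house price is now about $667,368.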

The outlier has clearly pulled the mean to the right, above the values of all houses other than the outlier itself. In this case, the mean is a misleading way to summarize the dataset:
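
One way to see this numerically is to compare the mean before and after adding the outlier, and to count how many houses fall below the new mean. The following Python sketch does that. The variable names are illustrative, and the prices are again in thousands of USD.

```python
# Original house prices in thousands of USD.
prices = [
    499, 525, 555,
    489, 475, 400,
    560, 571, 465,
    552, 535, 527,
    520, 562, 605,
    575, 550, 590,
]
outlier = 3125  # the $3.125M home

with_outlier = prices + [outlier]

mean_before = sum(prices) / len(prices)
mean_after = sum(with_outlier) / len(with_outlier)

# Count the houses priced below the new mean.
below = sum(1 for p in with_outlier if p < mean_after)

print(round(mean_before, 3))           # 530.833
print(round(mean_after, 3))            # 667.368
print(below, "of", len(with_outlier))  # 18 of 19
```

Every house except the outlier is priced below the new mean, so the mean no longer describes a typical house in this dataset.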

In the next section, we'll learn about a new measurement of central tendency—one that's robust in the face of outliers.

## Exercises

**Exercise 1.** Compute the mean of the following dataset: 14, 10, 8, 12. Plot it on a number
line. Is the mean a reasonable measurement of the center of this dataset?

**Exercise 2.** Now compute the mean of the following dataset: 14, 10, 8, 12, 220. Once again,
plot it on a number line. Is the mean a reasonable measurement of the center of this dataset?