Empirical Rule

You may have heard of the so-called "bell curve". (If not, don't worry: we'll get to this later in the course.) It looks like this:

[Figure: bell-shaped data]

With bell-shaped data, the center (whether measured using the mean, median or mode) is the x value right in the middle, corresponding to the peak of the curve. The data on either side of the mean is symmetric, and drops off in a characteristic fashion.

A great many real-world phenomena have roughly bell-shaped distributions. Examples include:

• IQ scores
• Test scores
• Shoe size
• Heights of people
• Skin color
• Blood pressure
• Daily returns for a given stock
• Measurement errors

The list goes on and on, but that gives you an idea as to how ubiquitous bell curves are.

A useful fact about bell curves is that for any given number of standard deviations, the percentage of data within that many standard deviations of the mean is known and constant. Here are the approximate percentages for one, two, and three standard deviations:

| # of standard deviations | % of data within that many standard deviations of the mean |
|---|---|
| 1 | 68% |
| 2 | 95% |
| 3 | 99.7% |
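These percentages are not arbitrary: they come from the normal distribution, the formal model behind the bell curve that we'll meet later in the course. As a sketch (assuming only Python's standard library), the fraction of bell-shaped data within k standard deviations of the mean can be computed from the error function:

```python
import math

def fraction_within(k):
    """Fraction of normally distributed data within k standard
    deviations of the mean, via the normal CDF (error function)."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"{k} standard deviation(s): {fraction_within(k):.1%}")
```

Running this prints approximately 68.3%, 95.4%, and 99.7%, which round to the familiar 68/95/99.7 figures.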

Visually, it looks like this:

[Figure: bell curve marked with the regions containing 68%, 95%, and 99.7% of the data]

This mapping from numbers of standard deviations to percentages is called the empirical rule.

Now let's check out some examples showing how we can put the empirical rule to good use.

Example: Estimating data distribution

IQ scores have mean $$\mu = 100$$ and standard deviation $$\sigma = 15$$. Using the empirical rule, we can estimate the following data distribution:

• Approximately 68% of people have IQs in the range 85-115.
• Approximately 95% of people have IQs in the range 70-130.
• Approximately 99.7% of people have IQs in the range 55-145.
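The intervals above can be sketched with a few lines of Python, computing $$\mu \pm k\sigma$$ for k = 1, 2, 3:

```python
mu, sigma = 100, 15  # IQ mean and standard deviation

# Empirical-rule intervals: mu +/- k*sigma for k = 1, 2, 3
intervals = {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}

for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high = intervals[k]
    print(f"Approximately {pct} of people have IQs in the range {low}-{high}")
```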

Another use for the empirical rule is estimating the standard deviation from the range of a dataset. Take another look at the diagram above. The minimum and maximum of the dataset are each likely to lie between two and three standard deviations from the mean, since the interval from $$\mu - 2\sigma$$ to $$\mu + 2\sigma$$ contains about 95% of the data and the interval from $$\mu - 3\sigma$$ to $$\mu + 3\sigma$$ contains about 99.7%. The full range therefore spans roughly four to six standard deviations, so we can guess that the standard deviation is likely to be between 1/6 and 1/4 of the range. This is a handy quick estimate, since computing the range is easier than computing the standard deviation.

Example: Estimating the standard deviation from the range

Suppose that we have IQ data where the range is 143 - 72 = 71. Using the heuristic above allows us to estimate the standard deviation as being between 71/6 ≈ 11.83 and 71/4 = 17.75, which is a pretty good estimate of the true value $$\sigma = 15$$.
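The heuristic can be written as a small Python function (the function name here is ours, purely illustrative):

```python
def estimate_sigma_from_range(data_range):
    """Rough bounds on the standard deviation, assuming bell-shaped
    data whose range spans roughly 4 to 6 standard deviations."""
    return data_range / 6, data_range / 4

low, high = estimate_sigma_from_range(143 - 72)
print(f"sigma is likely between {low:.2f} and {high:.2f}")
# sigma is likely between 11.83 and 17.75
```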

Note that if the dataset contains extreme outliers, the estimate will be worse. For example, if the dataset range is 182 - 72 = 110, then the estimate for the standard deviation would be between 110/6 ≈ 18.33 and 110/4 = 27.5, which is much further from the true value. Thus this method of estimating the standard deviation is not robust (i.e., it's not resistant to outliers).

Another application of the empirical rule is as a rough indicator of the "normality" of a dataset (roughly speaking, how bell-shaped it is). If a dataset follows a normal distribution then we expect its values to comport with the empirical rule.

Example: Testing for normality

Consider the following dataset ($$n = 30$$):

 15.9249942  27.4487900  -1.2342597  30.6912334  13.6815346  23.6385636
12.2561295   4.1386065  -6.9577697   7.6475802 -35.2359455  13.7864672
26.0087985   9.3880788  10.4400216  34.8728048  39.9747441  11.4210232
35.9917044  25.7475065 -17.0352082  -9.1563771  -0.7987454  25.9819210
16.0954104 -46.5617619  46.0382561   0.9013163 -15.7993406  13.6762780

This dataset has a mean of $$\mu \approx 10.43$$ and a standard deviation of $$\sigma \approx 21.13$$. Let's take a look at the expected and actual counts within each standard deviation range:

| # of standard deviations | Range | Expected count | Actual count | Close match? |
|---|---|---|---|---|
| 1 | [-10.70, 31.56] | 20.4 | 22 | Yes |
| 2 | [-31.83, 52.69] | 28.5 | 28 | Yes |
| 3 | [-52.96, 73.82] | 29.91 | 30 | Yes |

From the table above we can see that the actual counts match the expected counts quite closely, and so this data "passes" this informal normality test. (This doesn't mean that we know for sure that it's normally distributed, but it does mean that it looks like it could be.)
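Here is a sketch of this informal check in Python, using the dataset above and the sample standard deviation from the standard library's statistics module:

```python
import statistics

data = [
    15.9249942, 27.4487900, -1.2342597, 30.6912334, 13.6815346, 23.6385636,
    12.2561295, 4.1386065, -6.9577697, 7.6475802, -35.2359455, 13.7864672,
    26.0087985, 9.3880788, 10.4400216, 34.8728048, 39.9747441, 11.4210232,
    35.9917044, 25.7475065, -17.0352082, -9.1563771, -0.7987454, 25.9819210,
    16.0954104, -46.5617619, 46.0382561, 0.9013163, -15.7993406, 13.6762780,
]

mu = statistics.mean(data)      # ~10.43
sigma = statistics.stdev(data)  # ~21.13 (sample standard deviation)

# Compare expected vs. actual counts within k standard deviations of the mean
for k, pct in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    actual = sum(mu - k * sigma <= x <= mu + k * sigma for x in data)
    expected = pct * len(data)
    print(f"{k} sd: expected {expected:.1f}, actual {actual}")
```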

If we know that a process generates bell-shaped data, then we know that values outside of the three-sigma range will be quite rare. This useful observation allows us to detect data anomalies, as the next example shows.

Example: Anomaly detection

A web-based travel company sells hotel bookings. Volume varies depending on the hour of the day, the day of the week, and the month. The distribution of bookings at any given point in time is roughly bell-shaped.

The company's operations team is charged with ensuring that the site is working correctly, meaning in part that customers can make hotel bookings. Sometimes there are technical or other issues that cause bookings to drop. When bookings are too low, the ops team needs to investigate.

For a certain one-minute window of time, the relevant historical dataset has a mean $$\mu = 205$$ and standard deviation $$\sigma = 12$$. The ops team wants to use this dataset to determine the threshold below which it should treat the bookings level as low enough to warrant an investigation.

To do this, they apply the so-called "three-sigma limit", which means that the threshold sits three standard deviations away from the mean. The empirical rule tells us that 99.7% of data lives within three standard deviations of the mean, and that the tail to the left of $$\mu - 3\sigma$$ has only 0.15% of the data. Thus if we see a data point below this threshold, chances are good that something is broken.

To calculate the threshold, we simply plug in the values:

$${\mu - 3\sigma = 205 - 3 \cdot 12 = 169}$$

Therefore, if the bookings drop below 169, the ops team should investigate.
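A minimal sketch of this alert logic in Python (the names are illustrative, not from any real monitoring system):

```python
mu, sigma = 205, 12  # historical bookings for this one-minute window

# Three-sigma lower limit: bookings below this are suspiciously low
threshold = mu - 3 * sigma

def needs_investigation(bookings):
    """Flag a bookings count that falls below the three-sigma limit."""
    return bookings < threshold

print(threshold)                 # 169
print(needs_investigation(150))  # True: well below normal variation
print(needs_investigation(198))  # False: within normal variation
```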

Exercises

Exercise 1. The distribution of heights across adult males follows a bell-shaped pattern. Suppose that the mean is 70 inches with a standard deviation of 3 inches. Apply the empirical rule to describe the distribution of heights across adult males.

Exercise 2. You have a dataset of adult female heights where the tallest woman is 71 inches and the shortest is 57 inches. Use this information to estimate the standard deviation.