You may have heard of the so-called "bell curve". (If not, don't worry: we'll get to this later in the course.) It looks like this:
With bell-shaped data, the center (whether measured by the mean, median, or mode) is the x value right in the middle, corresponding to the peak of the curve. The data is symmetric about that center and drops off on either side in a characteristic fashion.
A great many real-world phenomena have roughly bell-shaped distributions. Examples include:
- IQ scores
- Test scores
- Shoe size
- Heights of people
- Skin color
- Blood pressure
- Daily returns for a given stock
- Measurement errors
The list goes on and on, but that gives you an idea as to how ubiquitous bell curves are.
A useful fact about bell curves is that for any given number of standard deviations σ, the percentage of data within σ standard deviations of the mean is known and constant. Here are the approximate percentages for σ = 1, 2 and 3:
| # of standard deviations σ | % of data within σ standard deviations of the mean |
|---|---|
| 1 | 68% |
| 2 | 95% |
| 3 | 99.7% |
Visually, it looks like this:
This mapping from sigmas to percentages is called the empirical rule.
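To see where these percentages come from, here's a minimal Python sketch (standard library only). For a normal distribution, the proportion of data within k standard deviations of the mean is given by the error function, \(\operatorname{erf}(k/\sqrt{2})\):

```python
import math

# Fraction of a normal distribution within k standard deviations
# of the mean: erf(k / sqrt(2)).
for k in (1, 2, 3):
    frac = math.erf(k / math.sqrt(2))
    print(f"within {k} sigma: {frac:.1%}")
# within 1 sigma: 68.3%
# within 2 sigma: 95.4%
# within 3 sigma: 99.7%
```

The empirical rule's 68/95/99.7 figures are simply these values rounded.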
Now let's check out some examples showing how we can put the empirical rule to good use.
Example: Estimating data distribution
IQ scores have mean \(\mu = 100\) and standard deviation \(\sigma = 15\). Using the empirical rule, we can estimate the following data distribution:
- Approximately 68% of people have IQs in the range 85-115.
- Approximately 95% of people have IQs in the range 70-130.
- Approximately 99.7% of people have IQs in the range 55-145.
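These ranges are just \(\mu \pm k\sigma\) for k = 1, 2, 3. A short Python sketch, using the example's values \(\mu = 100\) and \(\sigma = 15\):

```python
mu, sigma = 100, 15  # IQ mean and standard deviation from the example

# Each empirical-rule band runs from mu - k*sigma to mu + k*sigma.
for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    lo, hi = mu - k * sigma, mu + k * sigma
    print(f"~{pct} of people have IQs in the range {lo}-{hi}")
# ~68% of people have IQs in the range 85-115
# ~95% of people have IQs in the range 70-130
# ~99.7% of people have IQs in the range 55-145
```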
Another use for the empirical rule is estimating the standard deviation from the range of the dataset. Take another look at the diagram above. The minimum and maximum of the dataset are each likely to sit between two and three standard deviations from the mean, since those bands contain between 95% and 99.7% of the data; the full range therefore spans roughly four to six standard deviations. Accordingly, we can guess that the standard deviation is likely to be between 1/6 and 1/4 of the range. This is a handy quick estimate, since computing the range is easier than computing the standard deviation.
Example: Estimating the standard deviation from the range
Suppose that we have IQ data where the range is 143 - 72 = 71. Using the heuristic above allows us to estimate the standard deviation as being between 71/6 = 11.83 and 71/4 = 17.75, which is a pretty good estimate of the true value \(\sigma = 15\).
Note that if the dataset contains extreme outliers, the estimate will be worse. For example, if the dataset range is 182 - 72 = 110, then the estimate for the standard deviation would be between 110/6 = 18.33 and 110/4 = 27.5, which is a much poorer estimate of the true value \(\sigma = 15\). Thus this method of estimating the standard deviation is not robust (i.e., it's not resistant to outliers).
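The heuristic is easy to wrap in a small helper. This is just an illustrative sketch (the function name is made up here), reproducing the IQ example's numbers:

```python
def estimate_sigma_from_range(data_min, data_max):
    """Rough bounds on the standard deviation via the empirical-rule
    heuristic: the range spans roughly 4 to 6 standard deviations,
    so sigma is likely between range/6 and range/4."""
    r = data_max - data_min
    return r / 6, r / 4

# IQ example: min = 72, max = 143, so the range is 71.
low, high = estimate_sigma_from_range(72, 143)
print(f"sigma is likely between {low:.2f} and {high:.2f}")
# sigma is likely between 11.83 and 17.75
```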
Another application of the empirical rule is as a rough indicator of the "normality" of a dataset (roughly speaking, how bell-shaped it is). If a dataset follows a normal distribution then we expect its values to comport with the empirical rule.
Example: Testing for normality
Consider the following dataset (\(n = 30\)):
15.9249942 27.4487900 -1.2342597 30.6912334 13.6815346
23.6385636 12.2561295 4.1386065 -6.9577697 7.6475802
-35.2359455 13.7864672 26.0087985 9.3880788 10.4400216
34.8728048 39.9747441 11.4210232 35.9917044 25.7475065
-17.0352082 -9.1563771 -0.7987454 25.9819210 16.0954104
-46.5617619 46.0382561 0.9013163 -15.7993406 13.6762780
This dataset has a mean of \(\mu \approx 10.43\) and a standard deviation of \(\sigma \approx 21.13\). Let's take a look at the expected and actual counts within each standard deviation range:
| Sigmas | Range | Expected Count | Actual Count | Close Match? |
|---|---|---|---|---|
| 1 | (-10.70, 31.56) | 20.4 | 22 | Yes |
| 2 | (-31.83, 52.69) | 28.5 | 28 | Yes |
| 3 | (-52.96, 73.82) | 29.9 | 30 | Yes |
From the table above we can see that the actual counts match the expected counts quite closely, and so this data "passes" this informal normality test. (This doesn't mean that we know for sure that it's normally distributed, but it does mean that it looks like it could be.)
If we know that a process generates bell-shaped data, then we know that values outside of the three-sigma range will be quite rare. This useful observation allows us to detect data anomalies, as the next example shows.
Example: Anomaly detection
A web-based travel company sells hotel bookings. Volume varies depending on the hour of the day, day of the week and month. Bookings at any given point in time are roughly bell-shaped.
The company's operations team is charged with ensuring that the site is working correctly, meaning in part that customers can make hotel bookings. Sometimes there are technical or other issues that cause bookings to drop. When bookings are too low, the ops team needs to investigate.
For a certain one-minute window of time, the relevant historical dataset has a mean \(\mu = 205\) and standard deviation \(\sigma = 12\). The ops team wants to use this dataset to determine the threshold below which it should treat the bookings level as low enough to warrant an investigation.
To do this, they apply the so-called "three-sigma limit", which means that the threshold sits three standard deviations away from the mean. The empirical rule tells us that 99.7% of data lives within three standard deviations of the mean, and that the tail to the left of \(\mu - 3\sigma\) has only 0.15% of the data. Thus if we see a data point below this threshold, chances are good that something is broken.
To calculate the threshold, we simply plug in the values:

\(\mu - 3\sigma = 205 - 3 \times 12 = 205 - 36 = 169\)

Therefore, if the bookings drop below 169, the ops team should investigate.
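In practice, a check like this might live in a small monitoring helper. The sketch below is illustrative only (the function name and interface are assumptions, not the company's actual code):

```python
MU, SIGMA = 205, 12  # historical bookings mean and standard deviation


def is_anomalous(bookings, mu=MU, sigma=SIGMA, k=3):
    """Flag a bookings count that falls below the k-sigma lower limit."""
    return bookings < mu - k * sigma


print(MU - 3 * SIGMA)    # 169  (the investigation threshold)
print(is_anomalous(150)) # True  -- well below the threshold
print(is_anomalous(190)) # False -- within normal variation
```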
Exercise 1. The distribution of heights across adult males follows a bell-shaped pattern. Suppose that the mean is 70 inches with a standard deviation of 3 inches. Apply the empirical rule to describe the distribution of heights across adult males.
Exercise 2. You have a dataset of adult female heights where the tallest woman is 71 inches and the shortest is 57 inches. Use this information to estimate the standard deviation.