Range

For numerical variables, the simplest measure of spread is the range.

Definition. The range is the difference between the maximum and the minimum values in the dataset. That is, given a dataset X,

$$\operatorname{range}(X) = \max(X) - \min(X)$$
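As a minimal sketch, the definition translates almost directly into Python. The function name `value_range` and the sample list are illustrative choices made here, not part of the definition itself.

```python
def value_range(values):
    """Return max(values) - min(values) for a non-empty collection of numbers."""
    if len(values) == 0:
        # The range is undefined when there is no maximum or minimum.
        raise ValueError("the range of an empty dataset is undefined")
    return max(values) - min(values)


# Hypothetical sample data, chosen only to illustrate the calculation.
sample = [4, 9, 1, 7]
sample_range = value_range(sample)  # largest value minus smallest value
print(sample_range)
```

Note that a dataset with a single value (or with all values equal) has a range of zero, since its maximum and minimum coincide.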

Here are a couple of examples showing how to calculate the range.

Example: Test scores for a reasonably challenging chemistry test

Recall the test scores for the chemistry test from the previous section:

83, 87, 61, 92, 38, 78, 73, 55, 98, 74, 86, 69, 40, 83

The largest value in the dataset is 98, and the smallest is 38. Therefore the range is 98 - 38 = 60. This corresponds to our intuition that the spread is reasonably high: assuming scores can run from 0 to 100, the range covers 60% of the possible values.

Example: Test scores for an easy math test

Recall the test scores for the math test from the previous section:

100, 100, 93, 92, 95, 98, 100, 100, 100, 95, 94, 88, 92

The largest value here is 100, and the smallest is 88. So the range is 100 - 88 = 12. This matches our intuition that the spread is low: assuming scores can run from 0 to 100, the range covers only 12% of the possible values.
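The two calculations above can be checked with a short Python sketch. It assumes both tests are scored on a 0-100 scale, which is what the percentages in the examples imply; the names `range_and_share`, `SCALE_MIN`, and `SCALE_MAX` are hypothetical helpers introduced for illustration.

```python
chemistry = [83, 87, 61, 92, 38, 78, 73, 55, 98, 74, 86, 69, 40, 83]
math_scores = [100, 100, 93, 92, 95, 98, 100, 100, 100, 95, 94, 88, 92]

# Assumption: both tests are scored from 0 to 100.
SCALE_MIN, SCALE_MAX = 0, 100


def range_and_share(scores):
    """Return the range and the fraction of the assumed scale it covers."""
    score_range = max(scores) - min(scores)
    share = score_range / (SCALE_MAX - SCALE_MIN)
    return score_range, share


chem_range, chem_share = range_and_share(chemistry)
math_range, math_share = range_and_share(math_scores)

print(f"chemistry: range = {chem_range}, share of scale = {chem_share:.0%}")
print(f"math:      range = {math_range}, share of scale = {math_share:.0%}")
```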

Now let's consider some pros and cons of the range as a measure of spread.

Strengths of the range as a measure of spread

Easy to calculate. The calculation involved is straightforward.

Easy to understand when the variable's bounds are known. For a single variable with known numerical bounds, the result is easy to interpret. (But see the discussion of scale dependence below.)

Matches our intuitions for many datasets. We want our measurement techniques to match our pre-formal intuitions about when the spread is high or low. In many cases, such as the cases above, it does just that.

Weaknesses of the range as a measure of spread

The range is highly scale-dependent. This means that the number it produces depends heavily on the scale on which the values are measured. So it's hard to compare ranges across datasets that use different scales.

For example, suppose that somebody told you that 30 students took a test, and the range was 80. Is that high or low? There's no way to tell without more information. If it's a test where the scale is 0-100, then we'd likely consider that to be a high range. But if the test is the SAT (a standardized test for college admissions in the United States), that range would be low indeed, as the scale for the SAT is 400-1600. Because we often want to compare datasets that use different scales to measure roughly the same thing (like test performance), it's inconvenient that the range requires additional context to interpret.
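To make the missing context concrete, here is a small Python sketch that divides the same range of 80 by the width of each scale. Expressing the range as a fraction of the scale is just one simple way to illustrate the point, not a standard statistic, and the function name `relative_range` is a hypothetical helper.

```python
def relative_range(observed_range, scale_min, scale_max):
    """Express a range as a fraction of the width of a known scale."""
    width = scale_max - scale_min
    if width <= 0:
        raise ValueError("scale_max must be greater than scale_min")
    return observed_range / width


# The same range of 80, interpreted against two different scales.
classroom_share = relative_range(80, 0, 100)   # test scored 0-100
sat_share = relative_range(80, 400, 1600)      # SAT scored 400-1600

print(f"0-100 test: {classroom_share:.1%} of the scale")
print(f"SAT:        {sat_share:.1%} of the scale")
```

The raw range is identical in both cases, yet it covers most of one scale and only a small slice of the other, which is why the number alone cannot be interpreted.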

Interpretability is an important consideration when evaluating different ways to measure things. All else being equal, we prefer measurements that are easy to interpret, since the goal of descriptive statistics is to describe datasets in a way that makes them easier to understand.

The range hides everything in between the upper and lower bounds. For variables like net worth or income, most values are going to be within a fairly restricted range, with a long tail of increasingly large values in the billions of dollars. Obviously a single number can do only so much work in characterizing a dataset. But the range applied to this kind of dataset can paint a misleading picture: it reports the enormous gap between the two extremes and says nothing about where most of the values actually sit.
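As a rough illustration, the following Python sketch builds two made-up datasets (hypothetical incomes in thousands of dollars, invented for this example) that share the same minimum and maximum. Their ranges are identical even though their values are distributed very differently; the median is printed only to show that difference.

```python
from statistics import median

# Hypothetical incomes in thousands of dollars (invented for illustration).
# Most values cluster at the low end, with one very large value.
skewed = [30, 35, 40, 42, 45, 50, 55, 60, 1030]

# Same minimum and maximum, but values spread evenly in between.
even = [30, 155, 280, 405, 530, 655, 780, 905, 1030]

skewed_range = max(skewed) - min(skewed)
even_range = max(even) - min(even)

print(f"skewed: range = {skewed_range}, median = {median(skewed)}")
print(f"even:   range = {even_range}, median = {median(even)}")
```

Both datasets report the same range, so the range alone cannot distinguish a dataset where nearly everyone earns about the same from one where incomes are spread across the whole interval.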

The range is sensitive to outliers (i.e., it isn't robust). This is closely related to the previous concern, but it's a little different. Suppose, for example, that our fearless set of students took a test, with the following results:

100, 100, 93, 92, 95, 98, 100, 0, 100, 100, 95, 94, 88, 92

Here, everybody received an 88 or higher with one exception: one student received a zero because he was caught cheating. The range would be 100 - 0 = 100, which is very high. The outlier score of zero has caused the range to generate a measurement that doesn't reflect the actual tightness of the spread: without that one score, the range would be only 100 - 88 = 12. In the same way that the mean is not a robust measurement of central tendency, the range is not a robust measure of spread.
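The effect of the single outlier can be seen by computing the range with and without it. This is only a sketch for illustration: dropping the zero by hand works here because we know which score is the outlier, and it is not meant as a general method for handling outliers.

```python
scores = [100, 100, 93, 92, 95, 98, 100, 0, 100, 100, 95, 94, 88, 92]

# Range of the full dataset, including the zero.
range_with_outlier = max(scores) - min(scores)

# Range after removing the one known outlier (the score of zero).
scores_without_zero = [s for s in scores if s != 0]
range_without_outlier = max(scores_without_zero) - min(scores_without_zero)

print(f"with the zero:    {range_with_outlier}")
print(f"without the zero: {range_without_outlier}")
```

One score out of fourteen is enough to move the range from 12 to 100, which is what it means for the range to be sensitive to outliers.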

Later we'll learn about the interquartile range, which is a robust version of the range.

Exercises

Exercise 1. What is the range of the following dataset?

91, 392, 44, 153, 20

Exercise 2. Create two datasets that have the same mean but different ranges.

Exercise 3. Create two datasets that have the same range but different means.

Exercise 4. Imagine a geographical dataset with longitude and latitude data, and think about all cities that lie within 50 miles of the equator. How would you describe the range of the latitudes? How about the range of the longitudes?