Stem and Leaf Plots

A stem and leaf plot provides a helpful way to visualize the shape of a numerical distribution. The idea is to take each data value in a dataset and split it into a "stem" and a "leaf". In general, the stems represent the larger place values, whereas the leaves represent the smaller place values. There are different ways to do this depending on the scale, range and precision of the actual data values.

Let's start by looking at an example.

Example: Annual precipitation in Central Park, 1869-2019

In this example we use the stem function from R to generate a stem and leaf plot for a precipitation dataset.

> precip.df <- read.csv("precip-central-park.csv")
> precip.annual <- precip.df[, "ANNUAL"]
> precip.annual
  [1] 45.25 39.25 51.26 42.49 47.99 45.83 40.90 41.77 40.17 48.66 39.01 36.66
 [13] 36.26 45.27 35.77 52.25 35.37 39.38 43.99 53.32 57.16 45.63 39.55 35.60
 [25] 48.26 41.01 35.37 41.96 44.55 47.90 38.57 41.19 48.69 52.77 58.32 41.64
 [37] 37.44 40.18 45.48 41.05 41.67 33.72 48.83 45.40 58.00 40.17 45.73 37.90
 [49] 41.04 38.27 53.29 53.20 37.76 44.66 40.57 41.72 41.43 47.83 56.06 45.62
 [61] 40.38 38.95 36.07 43.93 53.53 49.84 33.85 49.83 52.97 48.49 38.55 45.07
 [73] 39.04 48.51 36.74 45.01 44.98 38.36 40.79 46.87 36.24 36.89 44.40 41.51
 [85] 45.20 35.58 39.90 36.25 36.49 40.94 38.77 46.39 39.32 37.15 34.28 32.99
 [97] 26.09 39.90 49.12 43.57 48.54 35.29 56.77 67.03 57.23 47.69 61.21 41.28
[109] 54.73 49.81 52.13 44.55 38.11 41.40 80.56 57.03 38.82 42.95 46.39 44.67
[121] 65.11 60.92 45.18 43.35 44.28 47.39 40.42 56.19 43.93 48.69 42.50 45.42
[133] 35.92 45.21 58.56 51.97 55.97 59.89 61.67 53.61 53.62 49.37 72.81 38.51
[145] 46.32 53.79 40.97 42.17 45.04 65.55 53.03
> stem(precip.annual)

  The decimal point is 1 digit(s) to the right of the |

  2 | 6
  3 | 3444
  3 | 555666666666777778888899999999999
  4 | 0000000011111111111122222222333444444
  4 | 555555555555555566666667788888899999999
  5 | 00012223333334444
  5 | 56667777889
  6 | 0112
  6 | 567
  7 | 3
  7 |
  8 | 1

In the stem and leaf plot above, the row 2 | 6 represents the data value 26.09 (rounded to 26), and the row 8 | 1 represents the data value 80.56 (rounded to 81). Notice that there are two rows for the range 30-39, two for 40-49 and so forth. The first 3 stem holds the rounded values in the half-open interval [30, 35) and the second 3 stem holds the rounded values in the half-open interval [35, 40). Speaking more generally, in the plot above, each row represents an interval 5 units wide.
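
To make the stem/leaf split concrete, here is a small sketch that does it by hand for a few values from the dataset. It only illustrates the idea and is not meant to show how R's stem function works internally; the variable names are made up for this example.

```r
# A few values taken from the precipitation data above.
x <- c(26.09, 32.99, 34.28, 35.29, 80.56)

# Round to the nearest unit, since units are the leaf digit in this plot.
rounded <- round(x)

# Stem = tens digit, leaf = units digit.
stem.digit <- rounded %/% 10
leaf.digit <- rounded %% 10

# Lower bound of the 5-unit row that each rounded value falls into.
row.start <- 5 * (rounded %/% 5)

parts <- data.frame(value = x, stem = stem.digit, leaf = leaf.digit,
                    row.start = row.start)
print(parts)
```

Here 26.09 ends up with stem 2 and leaf 6, and 80.56 with stem 8 and leaf 1, matching the rows in the plot. The values 32.99 and 34.28 land in the row starting at 30, while 35.29 lands in the row starting at 35.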

As mentioned above, the details of the plot depend on the scale, range and precision of the data, just as they do with other plots. Let's dissect the example to see how this works in this particular case.

The first thing to notice is that the scale of the data is in the tens, since the lowest value is 26.09 and the highest is 80.56. The range covers several tens, which tells us that units are probably the appropriate leaf here, and we can just round the data points to the nearest unit. Finally, each row in the plot covers 5 units instead of (say) 10 to give us a little more detail on how the data are distributed across the full range. There's no right or wrong choice (5 vs. 10) here; it's really just about what level of detail you want to communicate.
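
The effect of the row width on the level of detail can be seen with a quick count. The sketch below uses only the first twelve values of the dataset, and row.counts is a hypothetical helper written for this illustration (it is not part of R): it rounds each value to the nearest unit and counts how many fall into each row of a given width.

```r
# The first twelve values of the precipitation data above.
x <- c(45.25, 39.25, 51.26, 42.49, 47.99, 45.83,
       40.90, 41.77, 40.17, 48.66, 39.01, 36.66)

# row.counts is a hypothetical helper: it rounds to the nearest unit and
# counts how many values fall in each row of the given width. The names
# of the result are the lower bounds of the rows.
row.counts <- function(values, width) {
  rounded <- round(values)
  table(width * (rounded %/% width))
}

counts.by5 <- row.counts(x, 5)    # rows start at 35, 40, 45, 50
counts.by10 <- row.counts(x, 10)  # rows start at 30, 40, 50
print(counts.by5)
print(counts.by10)
```

With rows 5 units wide, the twelve values spread over four rows (3, 4, 4 and 1 values). With rows 10 units wide, they collapse into three rows (3, 8 and 1 values), so the split within the 40s is no longer visible.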

Exercises

Exercise 1. Redraw the stem and leaf plot for the precipitation data, but with each row representing 10 units instead of 5.

Exercise 2. What if we were to add the value 7.13 to the precipitation dataset? How would that impact the plot?

Exercise 3. What if we were to add the value 102.66 to the precipitation dataset? How would that impact the plot?