A boxplot is a visualization that allows us to see multiple aspects of a dataset at once:
- the median
- the first and third quartiles
- the interquartile range
- depending on the approach, either outliers or else the data maximum and minimum
Here's a boxplot of annual precipitation in Central Park for the years 1869-2019:
In the boxplot above, we see a box with a heavy line through it, some "whiskers" above and below the box (boxplots are sometimes called "box and whisker plots"), and finally some open circles above the upper whisker. Here's what these represent:
- The heavy line in the middle of the box represents the median, which here happens to be 44.5.
- The upper bound of the box is the third quartile, and the lower bound is the first quartile. Thus the distance between them is the interquartile range.
- In this boxplot, each whisker extends to the most extreme data point that lies within 1.5 · IQR of the box.
- The open circles are outliers (here, any point beyond the whiskers).
Note that in the boxplot above, the maximum value appears as an open circle because it happens to be an outlier. The minimum isn't an outlier, so it isn't drawn as a separate point; instead it sits at the tip of the lower whisker.
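To make those pieces concrete, here is a minimal sketch in R that pulls the same numbers out of a dataset without drawing anything. The `precip` vector is synthetic stand-in data (not the Central Park record), and the variable names are my own; `boxplot.stats` is the base R helper that computes the box, whisker, and outlier values.

```r
# Synthetic stand-in data (an assumption, NOT the Central Park record):
# 150 roughly normal values plus one deliberately large value.
set.seed(1)
precip <- c(rnorm(150, mean = 45, sd = 8), 85)

# boxplot.stats() returns the five numbers a boxplot draws, plus the outliers.
s <- boxplot.stats(precip)  # coef = 1.5 by default

lower_whisker <- s$stats[1]
q1            <- s$stats[2]  # lower edge of the box (Tukey's lower hinge)
med           <- s$stats[3]  # heavy line
q3            <- s$stats[4]  # upper edge of the box (Tukey's upper hinge)
upper_whisker <- s$stats[5]

# Height of the box; can differ slightly from IQR(), which uses quantile().
box_height <- q3 - q1

# Points beyond these fences are drawn as open circles.
lower_fence <- q1 - 1.5 * box_height
upper_fence <- q3 + 1.5 * box_height
outliers <- precip[precip < lower_fence | precip > upper_fence]

print(c(median = med, Q1 = q1, Q3 = q3, IQR = box_height))
print(outliers)
```

Comparing `upper_whisker` with `upper_fence` shows that the whisker stops at an actual data point rather than at the fence itself.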
Oh, speaking of boxplots and outliers...
I couldn't resist. :)
It happens that while there is a standard practice around what the box and the heavy line represent, there's no standard for the whiskers. In the boxplot above, I used the boxplot function in R, which by default draws each whisker out to the most extreme data point within 1.5 · IQR of the box. The exact IQR scaling factor is configurable, and if we set it to 0, the whiskers instead extend to the maximum and minimum values, like this:
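For anyone who wants to try this, here is a small sketch of the two whisker settings side by side. In base R's boxplot the scaling factor is the `range` argument (default 1.5). The `precip` vector is again synthetic stand-in data rather than the Central Park record, and the other variable names are my own.

```r
# Synthetic stand-in data (an assumption, NOT the Central Park record).
set.seed(1)
precip <- c(rnorm(150, mean = 45, sd = 8), 85)

# plot = FALSE returns the computed statistics without drawing.
default_stats  <- boxplot(precip, plot = FALSE)             # range = 1.5
extremes_stats <- boxplot(precip, range = 0, plot = FALSE)  # whiskers at min/max

# Rows of $stats: lower whisker, Q1, median, Q3, upper whisker.
print(default_stats$stats[c(1, 5), 1])
print(extremes_stats$stats[c(1, 5), 1])
print(range(precip))

# With range = 0 no points are flagged as outliers.
print(extremes_stats$out)

# Draw the min/max version.
boxplot(precip, range = 0, ylab = "Annual precipitation (synthetic)")
```

The box and the heavy line are identical in both versions; only the whiskers and the open circles change.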
Exercise 1. Using the precipitation data, create a boxplot for the July column.