Histograms

On the surface, histograms appear similar to bar charts. There's an important difference, though. A bar chart involves a relationship between a categorical variable and a numerical variable. A histogram, on the other hand, involves taking a numerical variable, establishing intervals or "bins" for its values, and then displaying counts for each bin.

Here's an example.

Example: Annual precipitation in New York City, 1869-1957

Source: https://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture18.pdf

Here are the annual precipitation measurements, in inches, for New York City during the years 1869-1957:

43.6 37.8 49.2 40.3 45.5 44.2 38.6 40.6 38.7 46.0
37.1 34.7 35.0 43.0 34.4 49.7 33.5 38.3 41.7 51.0
54.4 43.7 37.6 34.1 46.6 39.3 33.7 40.1 42.4 46.2
36.8 39.4 47.0 50.3 55.5 39.5 35.5 39.4 43.8 39.4
39.9 32.7 46.5 44.2 56.1 38.5 43.1 36.7 39.6 36.9
50.8 53.2 37.8 44.7 40.6 41.7 41.4 47.8 56.1 45.6
40.4 39.0 36.1 43.9 53.5 49.8 33.8 49.8 53.0 48.5
38.6 45.1 39.0 48.5 36.7 45.0 45.0 38.4 40.8 46.9
36.2 36.9 44.4 41.5 45.2 35.6 39.9 36.2 36.5

We can create a histogram using 5-inch bins as follows:

[Figure: Annual precipitation in New York City, 1869-1957]
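
To make the binning step concrete, here is a minimal Python sketch that counts the measurements above in 5-inch bins. It assumes half-open bins of the form [30, 35), [35, 40), and so on; the source figure may place values that land exactly on a boundary (such as 35.0 or 45.0) differently. The names precipitation, BIN_WIDTH, and bin_start are illustrative and not from the source.

```python
from collections import Counter

# Annual precipitation (inches), New York City, 1869-1957,
# copied from the table above (89 values).
precipitation = [
    43.6, 37.8, 49.2, 40.3, 45.5, 44.2, 38.6, 40.6, 38.7, 46.0,
    37.1, 34.7, 35.0, 43.0, 34.4, 49.7, 33.5, 38.3, 41.7, 51.0,
    54.4, 43.7, 37.6, 34.1, 46.6, 39.3, 33.7, 40.1, 42.4, 46.2,
    36.8, 39.4, 47.0, 50.3, 55.5, 39.5, 35.5, 39.4, 43.8, 39.4,
    39.9, 32.7, 46.5, 44.2, 56.1, 38.5, 43.1, 36.7, 39.6, 36.9,
    50.8, 53.2, 37.8, 44.7, 40.6, 41.7, 41.4, 47.8, 56.1, 45.6,
    40.4, 39.0, 36.1, 43.9, 53.5, 49.8, 33.8, 49.8, 53.0, 48.5,
    38.6, 45.1, 39.0, 48.5, 36.7, 45.0, 45.0, 38.4, 40.8, 46.9,
    36.2, 36.9, 44.4, 41.5, 45.2, 35.6, 39.9, 36.2, 36.5,
]

BIN_WIDTH = 5  # inches


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the half-open bin [start, start + width)
    that contains value (assumed convention, see the note above)."""
    return int(value // width) * width


# Count how many measurements fall into each bin.
counts = Counter(bin_start(v) for v in precipitation)

# Print a rough text histogram: one '#' per year in the bin.
for start in sorted(counts):
    label = f"[{start}, {start + BIN_WIDTH})"
    print(f"{label:>9} {'#' * counts[start]} {counts[start]}")
```

Every measurement lands in exactly one bin, so the bin counts always add up to the number of measurements (89 here).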

In the example above, each of the histogram's bins has equal width. But that's not at all necessary, as the next example illustrates.

Example: Distribution of household income in Moreno Valley, CA in 2017

The City-Data.com web site has lots of interesting data about U.S. cities. One such city is Moreno Valley, CA. The site presents the following histogram depicting the distribution of household income in Moreno Valley for 2017:

[Figure: Distribution of household income in Moreno Valley, CA in 2017]

Notice that not only do the bin widths differ, but the final bin doesn't even have an upper bound. It's simply a catch-all bin for every household with an income over $200K per year.

Generally, we use wider bins for ranges that contain relatively few values. It makes more sense to treat those values as a group than to have many narrow bins with few or no values in them.

This example highlights one of the key differences between bar charts and histograms. With a bar chart, the width of the bar doesn't mean anything—it's a purely visual/stylistic decision on the part of whoever created the chart. With a histogram, the bar's width directly reflects the bin's width.

An aside

There are a couple of things about the Moreno Valley histogram that are either confusing or wrong. Let's talk about those since you're going to run into this kind of thing when you do real-world work.

First, City-Data.com calls this "Distribution of median household income...", but that isn't right and doesn't make sense. This is the distribution of household income, not the distribution of median household income. Each household has an income that we count and assign to a bin. The median would be a single measurement applying to the dataset as a whole.

Second, the x-axis in the histogram is a little confusing because City-Data.com opted to put each bin's upper bound directly under its bar instead of directly under the boundary between bars. Sometimes you have to pay attention to how somebody decided to draw the chart and figure out what it means. Here, we can reverse engineer the chart by noticing that the last bin means "everything strictly over $200K". If we think through that a bit, we can conclude that the numbers under the bars are inclusive upper bounds for the intervals they represent. For example, in thousands of dollars, the first bar is for the range [0, 10], the second is for (10, 20], the third is for (20, 30], and so forth. This must be the case because the second-to-last bar must have 200 as an inclusive upper bound, given that the last bar is everything over 200.
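
The inclusive-upper-bound convention can be sketched in a few lines of Python. This is only an illustration: the list upper_bounds is a hypothetical set of bin edges (in thousands of dollars) chosen to have unequal widths, not values read off the City-Data.com chart, and bin_label is a made-up helper name. The sketch also assumes incomes are non-negative.

```python
import bisect

# Hypothetical inclusive upper bounds, in thousands of dollars.
# These are illustrative only; they are NOT taken from the actual chart.
upper_bounds = [10, 20, 30, 40, 50, 60, 75, 100, 125, 150, 200]


def bin_label(income):
    """Return the interval containing income (in thousands of dollars),
    treating each listed number as an inclusive upper bound."""
    # bisect_left finds the first upper bound that is >= income,
    # so a value equal to a bound stays in the bin that ends there.
    i = bisect.bisect_left(upper_bounds, income)
    if i == len(upper_bounds):
        # Larger than every bound: the open-ended catch-all bin.
        return f"({upper_bounds[-1]}, +inf)"
    if i == 0:
        # The first bin also includes its lower bound, 0.
        return f"[0, {upper_bounds[0]}]"
    return f"({upper_bounds[i - 1]}, {upper_bounds[i]}]"


for income in [0, 10, 10.5, 200, 200.01]:
    print(income, "->", bin_label(income))
```

Under these assumptions, an income of exactly 200 falls in the second-to-last bin, while anything strictly greater than 200 falls in the catch-all bin, matching the reasoning above.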

The moral of the story is to pay attention so you can avoid misinterpreting or misrepresenting what a given chart is saying.

Exercises

Exercise 1. Create a new histogram for the Moreno Valley household income data, using the following bins: [0, 50], (50, 100], (100, 150], (150, 200], and (200, +∞). Note that you'll need to estimate the new bin counts based on the original histogram's bin counts.

Exercise 2. In reference to the two Moreno Valley household income histograms, which is more informative? Why? Are there situations in which one would be preferred over the other?