Martian crater.


So far we've learned how to describe a dataset in terms of its center, spread and position. Now it's time to take the big-picture view and look at the "shape" of a dataset, or at least of a single variable in that dataset.

First let's explain what we mean by the "shape" of a dataset. We'll assume a single variable, which can be either categorical or numerical. For categorical variables and for certain numerical variables, some values will be more common than others. For other numerical variables, we may find that the individual values are mostly distinct, but when we group them into "bins" (numerical ranges), some bins contain more values than others. When we plot how often each value or bin occurs, this pattern of more and less common values takes on a recognizable form. This distribution of values is what we call the "shape".
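To make the idea of binning concrete, here is a minimal Python sketch. The height values, the 10 cm bin width and the names `heights_cm` and `bin_start` are all made up for illustration; they are not part of any dataset used in this series.

```python
from collections import Counter

# Hypothetical measurements, invented for this illustration.
# Every value is distinct, so counting raw values would show no pattern.
heights_cm = [151.2, 158.7, 162.4, 163.9, 165.1, 166.8,
              168.3, 169.5, 171.0, 172.6, 174.9, 181.3]

BIN_WIDTH = 10  # assumed bin width: each bin covers a 10 cm range


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains value."""
    return int(value // width) * width


# Count how many values fall into each bin.
bin_counts = Counter(bin_start(h) for h in heights_cm)

# Print a simple text chart, one row per bin.
for start in sorted(bin_counts):
    print(f"{start}-{start + BIN_WIDTH}: {'#' * bin_counts[start]}")
```

Although no single height repeats, the 160-170 bin clearly holds more values than its neighbors, and that unevenness is the shape we are after.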

Some examples will help make sense of this idea.

Example: Rolling a single die

When we roll a single fair six-sided die, the possible values are 1, 2, 3, 4, 5 and 6, and each outcome is equally likely. So imagine that we were to roll the die 1,000 times, and count each outcome. We might see something like this:

Rolling a single die 1,000 times

Here, each outcome appears approximately 1,000 / 6 ≈ 167 times, give or take a little random variation. The resulting shape is "flat", or "uniform". Using technical language, we'd say that the outcome of rolling a single fair six-sided die is uniformly distributed.
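If you'd like to try this experiment yourself, the following is one possible sketch in Python using only the standard library. The function name `roll_die_counts` and the seed value are arbitrary choices for this illustration, and your counts will differ from the chart above because the rolls are random.

```python
import random
from collections import Counter


def roll_die_counts(num_rolls, seed=None):
    """Roll a fair six-sided die num_rolls times and count each outcome."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    return Counter(rng.randint(1, 6) for _ in range(num_rolls))


counts = roll_die_counts(1000, seed=42)

# Print a rough text chart: one '#' per 10 occurrences.
for face in range(1, 7):
    print(face, counts[face], "#" * (counts[face] // 10))
```

Each row of the text chart should come out at roughly the same length, which is the flat shape described above.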

Example: Rolling two dice

Now let's repeat the experiment, this time rolling two dice and taking their sum. We expect certain outcomes to be more common (say 6, 7 and 8), and others to be relatively rare (2, 3, 11 and 12). Well, here we go:

Rolling two dice 1,000 times

This clearly has a very different shape. It looks almost like a triangle. At any rate it definitely isn't flat.
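One way to see where the triangle comes from is to count how many of the 36 equally likely pairs of faces produce each sum. The sketch below does this by brute force; it computes exact counts rather than simulating rolls, and the variable names are just illustrative.

```python
from collections import Counter
from itertools import product

faces = range(1, 7)

# Enumerate all 6 x 6 = 36 equally likely (first die, second die) pairs
# and count how many pairs produce each sum.
ways = Counter(a + b for a, b in product(faces, repeat=2))

for total in range(2, 13):
    print(f"{total:2d}: {ways[total]} of 36  {'#' * ways[total]}")
```

A sum of 7 can be made in six ways (1+6, 2+5, 3+4, 4+3, 5+2, 6+1), while 2 and 12 can each be made in only one way. Over 1,000 rolls we would therefore expect about 1,000 × 6/36 ≈ 167 sevens but only about 1,000 × 1/36 ≈ 28 twos, with the counts rising and falling in even steps in between.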

As we can see, depending on the variable, some values may be more common than others (or not). We also see that visualization tools can really help to bring the shape into focus. In the following sections we'll learn about counting the different values, and also about different ways to visualize the data.


Exercise 1. Imagine that you were going to roll four dice 1,000 times. Which sums would you expect to be most common, and which would be rare? What sort of shape would you expect?

Exercise 2 (advanced). Write a computer program to calculate the outcomes for the experiment described in Exercise 1 above. Plot out the results. What happens if you roll the dice 10,000 times?