So far we've learned how to understand a dataset in terms of its center, spread and position. Now it's time to take the big-picture view and look at the "shape" of a dataset, or at least of a single variable in that dataset.
First let's explain what we mean by the "shape" of a dataset. We'll assume a single variable, which can be either categorical or numerical. For categorical variables and for certain numerical variables, some values will be more common than others. For other numerical variables, we may find that the individual values are mostly distinct, but when we group them into "bins" (numerical ranges), we find that certain bins are more common than others. This pattern, where certain values are more or less common, looks a certain way when we visualize the dataset using graphical techniques. This distribution of values is what we call the "shape".
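To make the idea of bins concrete, here is a minimal sketch in Python. The ten measurements are made up purely for illustration, and the bin width of 2 is an arbitrary choice. Every value is distinct, yet grouping the values into ranges shows that one range is clearly more common than the others.

```python
import math
from collections import Counter

# Hypothetical measurements, invented for illustration; every value is distinct.
values = [2.1, 3.7, 4.2, 4.8, 5.1, 5.5, 5.9, 6.3, 7.4, 9.0]

# Assumed bin width; each bin covers the range [start, start + bin_width).
bin_width = 2

def bin_start(value, width):
    """Return the lower edge of the bin that contains value."""
    return math.floor(value / width) * width

# Count how many values fall into each bin, ordered by bin start.
bin_counts = dict(sorted(Counter(bin_start(v, bin_width) for v in values).items()))

for start, count in bin_counts.items():
    print(f"[{start}, {start + bin_width}): {'#' * count} ({count})")
```

With these made-up values, the bin from 4 to 6 holds five of the ten measurements, while the other bins hold one or two each.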
Some examples will help make sense of this idea.
Example: Rolling a single die
When we roll a single fair six-sided die, the possible values are 1, 2, 3, 4, 5 and 6, and each outcome is equally likely. So imagine that we were to roll the die 1,000 times, and count each outcome. We might see something like this:
Here, each outcome appears approximately 1,000 / 6 ≈ 167 times, give or take a handful. The resulting shape is "flat", or "uniform". Using technical language, we'd say that the outcome of rolling a single fair six-sided die is uniformly distributed.
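If you'd like to try this experiment yourself, the following is a minimal sketch in Python using only the standard library. The function name `roll_die_counts` and the seed value are our own choices for illustration. The exact counts depend on the seed, so expect your numbers to vary around 167.

```python
import random
from collections import Counter

def roll_die_counts(n_rolls, seed=None):
    """Roll a fair six-sided die n_rolls times and count each face."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    tally = Counter(rolls)
    # Report all six faces in order, even if one never came up.
    return {face: tally.get(face, 0) for face in range(1, 7)}

counts = roll_die_counts(1000, seed=42)

# A rough text histogram: one '#' for every 5 occurrences.
for face, count in counts.items():
    print(f"{face}: {count:4d} {'#' * (count // 5)}")
```

The bars should come out at roughly the same length, which is the "flat" shape described above.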
Example: Rolling two dice
Now let's repeat the same experiment, but rolling two dice and taking their sum. We expect certain outcomes to be more common (say 6, 7 and 8), because more combinations of the two dice produce them, and other outcomes to be relatively rare (2, 3, 11, 12). Well, here we go:
This clearly has a very different shape. It looks almost like a triangle. At any rate it definitely isn't flat.
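The triangular shape can be checked with a short Python sketch, again using only the standard library. The names `ways` and `roll_two_dice_counts` and the seed are our own choices for illustration. The sketch first counts how many of the 36 equally likely pairs produce each sum, then compares those exact counts with a simulated run of 1,000 rolls.

```python
import random
from collections import Counter
from itertools import product

# Exact count of the ways each sum can occur among the 36 equally likely pairs.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

def roll_two_dice_counts(n_rolls, seed=None):
    """Roll two fair dice n_rolls times and count each sum from 2 to 12."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    sums = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]
    tally = Counter(sums)
    return {total: tally.get(total, 0) for total in range(2, 13)}

two_dice_counts = roll_two_dice_counts(1000, seed=42)

for total, count in two_dice_counts.items():
    expected = 1000 * ways[total] / 36
    print(f"{total:2d}: observed {count:4d}, "
          f"expected about {expected:6.1f} ({ways[total]} of 36 ways)")
```

A sum of 7 can be made in 6 of the 36 ways, while 2 and 12 can each be made in only 1, and the number of ways rises and falls by one at each step in between. That is where the triangle comes from.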
As we can see, depending on the variable, some values may be more common than others (or not). We also see that visualization tools can really help to bring the shape into focus. In the following sections we'll learn about counting the different values, and also about different ways to visualize the data.
Exercise 1. Imagine that you were going to roll four dice 1,000 times and take their sum each time. Which sums would you expect to be most common, and which would be rare? What sort of shape would you expect?
Exercise 2 (advanced). Write a computer program to simulate the experiment described in Exercise 1 above. Plot the results. What happens if you roll the dice 10,000 times?