Populations vs. Samples

Before we dive into descriptive statistics, I need to say a little more about the difference between descriptive and inferential statistics, and how the notions of populations and samples come into play.

You'll recall from the course overview that statistics divides into descriptive statistics and inferential statistics: the former concerns describing datasets, whereas the latter concerns drawing inferences about a larger population based on a more limited data sample. So we've already touched on populations vs. samples. But let's be explicit.

Populations

A population is some group of interest under study. It's the group that we actually care about describing or understanding. This could be as small as four people in a single family, or it could be as large as the population of the United States, or the world population for that matter.

Samples

A sample is a smaller subset of a population that serves as a means to the end of understanding a population. We use samples when it's impractical to study the entire population directly, such as when the population is too large. Usually we build a sample by drawing individual members of a population at random, and then study the sample so that we can make judgments about the larger population. But the key point here is that it's not the sample itself that we care about. We care about the population, and we're just using the sample as a way to learn something about the population.
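As a minimal sketch of what "drawing individual members at random" can look like in practice, here is a short Python example. The population of 1,000 ID numbers, the sample size of 50 and the seed are all made up for illustration; they don't come from any dataset in this course.

```python
import random

# Hypothetical population: ID numbers for 1,000 people (illustrative only).
population = list(range(1, 1001))

# A fixed seed makes the draw repeatable; the value 42 is arbitrary.
rng = random.Random(42)

# Draw 50 distinct members, each member equally likely to be chosen.
sample = rng.sample(population, k=50)

print(len(sample), "members drawn out of", len(population))
print(sorted(sample)[:10])  # a peek at the ten smallest IDs in the sample
```

Note that `random.sample` draws without replacement, so no member appears in the sample twice. That's one reasonable choice here, not the only one.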

Here's a visual to illustrate the concepts:

[Figure: A population vs. a sample]

OK, so why am I telling you this, when we're just starting with descriptive statistics? Can't the population vs. sample discussion wait until we get to inferential statistics?

The reason is this. In the following sections, we're going to cover lots of different ways to describe datasets: broadly, central tendency (where is the "center" of the dataset), spread (how spread out are the data), position (where are the individual elements of the dataset located relative to the dataset at large), shape (how are the data distributed across the range of values) and relationships (how do multiple variables behave with respect to one another). These descriptions involve mathematical concepts such as the mean, the standard deviation and correlation, just to name a few.

The mathematical formulas for measuring these concepts often depend on whether we're measuring a population or a sample. In such cases, the sample formula is a modified version of the population formula, to make it a better estimate of the corresponding population measurement. This means that when we cover the various ways of describing a dataset, we're going to need to distinguish between the population and sample measurements.

If the above interests and surprises you, here's an example to illustrate what I'm talking about.

Example: Estimating the standard deviation (optional)

We'll see a little later in the course that there's an important statistic called the standard deviation that helps us to measure how spread out a dataset is. Don't worry if you don't know what standard deviation is yet—it doesn't matter for the purpose of this example. What I want to show you is how we get systematically wrong estimates if we try to use the population standard deviation formula on a sample, and how we get better estimates when we use the sample standard deviation.

What I'm going to do is show you a dataset representing a population of IQ scores, and also show you the standard deviation for that population. Then I'm going to draw a bunch of smaller random samples from that population, and apply both the population formula and the sample formula to each sample. I'll show you how the population formula, when applied to the sample, tends to underestimate the true population standard deviation.

First, here's our population of IQ scores for a group of 89 university students who were awarded a particular merit-based scholarship. (This is simulated data, so please ignore the dubious precision on the IQ scores.)

115.07225 150.78431 156.82500 147.45731 124.45789 143.47918 124.02679 141.03491
119.34552 140.64775 119.25767 129.49132 135.11360 120.60776 132.32080 138.24203
152.94123 149.17588 127.03219 141.32129 105.99778 122.02125 127.08299 120.80996
135.63379 113.50084 146.65044 137.67319 119.00944 103.58349 103.67011 120.07658
129.26855 135.79004 131.48624 119.72113 129.46582 121.07851 144.63035 125.23805
118.53238 130.73670 125.61688 124.11164 105.55364  92.94535 111.59578 134.19099
133.74457 124.91177  97.01827 152.44976 110.13326 103.24348 124.77763 133.04959
112.83344 120.22513 141.67214 122.87383 121.91932 114.78806 124.35512 107.57671
113.40962 111.21891 126.33730 113.63691 151.22122 127.55691 137.38297 124.74266
132.38332 130.33233 126.68865 142.05544 116.49583 133.28924 124.81247 116.43184
114.73584 142.53964 130.54203 117.33898 115.52683 150.69917 121.65495 110.61654
125.56618

The population standard deviation for the IQ scores is 13.51172. You can calculate it yourself if you like, but it's probably easier to just take my word for it.

I used a computer program to draw 10,000 random samples from this dataset, each of size \(n = 15\), and then to apply both the population formula and the sample formula to each sample. I'm obviously not going to list all 10,000 resulting estimate pairs, but here are a handful just to give you a taste:

Sample #   Estimate based on population SD formula   Estimate based on sample SD formula
1          15.56686                                  16.11323
2          10.20163                                  10.55970
3          13.70025                                  14.18111
4          14.60159                                  15.11408
...        ...                                       ...

Across all 10,000 samples, the mean estimate based on the population formula is 12.92624, whereas the mean estimate based on the sample formula is 13.37993. Again, the actual population standard deviation is 13.51172, so in this case the sample formula provides the better estimate on average.
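If you'd like to try the experiment yourself, here is a rough Python sketch of the procedure. It is not the program I used. The helper names `sd` and `compare_formulas`, the seed, and the choice to draw each sample without replacement are all assumptions made for illustration, so the exact numbers you get will differ somewhat from the ones quoted above.

```python
import math
import random

# The 89 simulated IQ scores from the population listed above.
population = [
    115.07225, 150.78431, 156.82500, 147.45731, 124.45789, 143.47918, 124.02679, 141.03491,
    119.34552, 140.64775, 119.25767, 129.49132, 135.11360, 120.60776, 132.32080, 138.24203,
    152.94123, 149.17588, 127.03219, 141.32129, 105.99778, 122.02125, 127.08299, 120.80996,
    135.63379, 113.50084, 146.65044, 137.67319, 119.00944, 103.58349, 103.67011, 120.07658,
    129.26855, 135.79004, 131.48624, 119.72113, 129.46582, 121.07851, 144.63035, 125.23805,
    118.53238, 130.73670, 125.61688, 124.11164, 105.55364,  92.94535, 111.59578, 134.19099,
    133.74457, 124.91177,  97.01827, 152.44976, 110.13326, 103.24348, 124.77763, 133.04959,
    112.83344, 120.22513, 141.67214, 122.87383, 121.91932, 114.78806, 124.35512, 107.57671,
    113.40962, 111.21891, 126.33730, 113.63691, 151.22122, 127.55691, 137.38297, 124.74266,
    132.38332, 130.33233, 126.68865, 142.05544, 116.49583, 133.28924, 124.81247, 116.43184,
    114.73584, 142.53964, 130.54203, 117.33898, 115.52683, 150.69917, 121.65495, 110.61654,
    125.56618,
]


def sd(values, ddof):
    """Standard deviation with divisor len(values) - ddof.

    ddof=0 is the population formula; ddof=1 is the sample formula.
    """
    mean = sum(values) / len(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (len(values) - ddof))


def compare_formulas(population, n=15, trials=10_000, seed=0):
    """Return the mean estimate from each formula over many random samples."""
    rng = random.Random(seed)  # assumed seed, only for repeatability
    population_formula_total = 0.0
    sample_formula_total = 0.0
    for _ in range(trials):
        # Assumption: each sample is drawn without replacement.
        sample = rng.sample(population, n)
        population_formula_total += sd(sample, ddof=0)
        sample_formula_total += sd(sample, ddof=1)
    return population_formula_total / trials, sample_formula_total / trials


true_sd = sd(population, ddof=0)
mean_pop_estimate, mean_sample_estimate = compare_formulas(population)

print("Population SD:                      ", round(true_sd, 5))
print("Mean estimate, population formula:  ", round(mean_pop_estimate, 5))
print("Mean estimate, sample formula:      ", round(mean_sample_estimate, 5))
```

Whatever seed you pick, the sample-formula estimate for any given sample is always the larger of the two, because the only difference between the formulas is dividing by \(n - 1\) instead of \(n\).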

Though this is just one example, in general the population standard deviation formula, when applied to a sample, tends to undershoot the true population standard deviation.
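For the curious, here are the two formulas in their usual textbook form. We'll cover them properly later in the course, and the symbols are my notation here rather than anything introduced so far: \(N\) and \(\mu\) are the population size and population mean, while \(n\) and \(\bar{x}\) are the sample size and sample mean.

\[
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
\qquad\qquad
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\]

The only structural difference is the divisor: \(n - 1\) in the sample formula versus the number of data points in the population formula. So on any given sample of size \(n\), the sample formula gives a result that is larger by a factor of \(\sqrt{n / (n - 1)}\). With \(n = 15\) that factor is about 1.0351, which you can check against the first row of estimates above: \(15.56686 \times 1.0351 \approx 16.113\).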

The example above highlights why we must distinguish the population and sample formulas when measuring various dataset quantities. For a population, these "quantities" are called population parameters. For a sample, they are called sample statistics. We use sample statistics to estimate population parameters, and we use sample-specific formulas to calculate sample statistics.

Now we've covered the key preliminaries: datasets, variables, populations and samples. It's time to jump into our study of the different ways of describing datasets, starting with central tendency.

Exercises

Exercise 1. When studying a population, are there cases where it wouldn't make sense to use a sample? Explain.

Exercise 2. Why would we want to draw the members of a sample at random from the population? Describe a scenario where doing otherwise might give a misleading result.