Most real-world datasets involve capturing data for multiple variables at a time. I'll give you an example. Recently my son and I built a hygrometer from a Raspberry Pi. A hygrometer is a device that measures both temperature and relative humidity over time. So hygrometer data has a few different variables:
- a timestamp
- a temperature measurement
- a relative humidity measurement
This hygrometer project was something I did for a hackathon at work, so I didn't know anything about the relationship between temperature and relative humidity, or even that there was a relationship. But after we built the hygrometer, we captured its data for over a week and took a look. Here's what we saw:
Several things jump out at us right away:
- There's a relationship between both temperature and relative humidity, on the one hand, and the timestamp, on the other. We can see a clear pattern, especially with the temperature. That's not too surprising, at least as far as the temperature goes: days are warmer than nights, and to save money, most people set the thermostat to allow the temperature to drop lower at night.
- There's a relationship between temperature and relative humidity: as the temperature increases, the relative humidity decreases. After seeing this in the data I did a little research on the web and found out that basically what's going on is that as the temperature increases, the "carrying capacity" of the air also increases, which means that the amount of water relative to the capacity decreases.
So even though correlation isn't causation, we understand enough about what's happening here to explain the relationships: during the day, we're getting sunlight and so the temperature increases. This in turn increases carrying capacity of the air and hence the relative humidity drops.
In statistics, one of the major goals is to understand how different variables are related to one another, and ideally to support analyses of a more causal nature. After all, science as an enterprise concerns uncovering causal relationships in the world.
In the next few sections we'll look at some basic tools that statistics provides to help us understand relationships between two variables in a dataset. We'll limit our discussion to two variables at a time, since this is normally what an elementary statistics course covers. If you decide to pursue your studies, you can learn about multivariate analysis, which generally covers cases where you have more than two variables.