In a scatterplot, we visualize the relationship between two numerical variables in a dataset. The idea is to put one variable on the x-axis and the other on the y-axis, and then plot one point for each observation.
Suppose, for example, that we have a dataset of hygrometer data, which records both temperature and relative humidity over time. We saw in the previous section that the relative humidity tends to drop as the temperature increases, and vice versa.
How can we visualize the relationship between the temperature and relative humidity?
First, let's look at a small subset of the data:
|Timestamp|Average relative humidity|Average temperature|
|---|---|---|
|11/13/2020 12:00 AM|60.3720|19.8415|
|11/13/2020 12:06 AM|60.6801|18.7828|
|11/13/2020 12:12 AM|60.0155|20.1079|
|11/13/2020 12:18 AM|60.5750|18.8529|
|11/13/2020 12:24 AM|60.5309|19.5025|
|11/13/2020 12:30 AM|60.3386|19.0101|
|11/13/2020 12:36 AM|60.3915|19.6573|
|11/13/2020 12:42 AM|60.5946|18.8577|
|11/13/2020 12:48 AM|60.3363|19.6353|
|11/13/2020 12:54 AM|60.5608|18.5713|
In the dataset above we can see that there are three variables: a timestamp, an average relative humidity and an average temperature. The values under humidity and temperature above are too close together to see any obvious patterns, but if we plot the humidity against the temperature for the whole dataset, we get the following:
The scatterplot above makes it clear that there is indeed a relationship between these two variables, albeit perhaps a fairly loose one. The relative humidity tends to be higher when the temperature is lower, and as the temperature rises, the humidity values spread further into the lower part of the range.
Smaller datasets are easy to plot by hand. For a larger dataset, you'll typically use a software package to generate a scatterplot, since it would be tedious to plot the individual data points manually. See the tech tutorials below for information on using Excel, Python and R to generate scatterplots.
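As a rough illustration of the Python route, here is a minimal sketch that plots only the ten sample readings from the table above, not the full dataset. It assumes that the first numeric column is relative humidity and the second is temperature (the order in which the text names them), and that matplotlib is installed. The function name `plot_scatter` and the output file name are hypothetical, chosen just for this example.

```python
# Ten readings copied from the sample table above.
# Assumption: the first numeric column is relative humidity and the
# second is temperature (the order in which the text names them).
humidity = [60.3720, 60.6801, 60.0155, 60.5750, 60.5309,
            60.3386, 60.3915, 60.5946, 60.3363, 60.5608]
temperature = [19.8415, 18.7828, 20.1079, 18.8529, 19.5025,
               19.0101, 19.6573, 18.8577, 19.6353, 18.5713]

# Each observation becomes one (x, y) point:
# temperature on the x-axis, humidity on the y-axis.
points = list(zip(temperature, humidity))


def plot_scatter(points, path="scatter.png"):
    """Draw the (x, y) points as a scatterplot and save it as an image."""
    # Imported here so the data preparation above runs without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display window needed
    import matplotlib.pyplot as plt

    xs, ys = zip(*points)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("Average temperature")
    ax.set_ylabel("Average relative humidity")
    ax.set_title("Relative humidity vs. temperature (10 sample readings)")
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_scatter(points)` writes the image to `scatter.png` in the working directory. With only ten closely spaced readings, the plot will not show the pattern that is visible for the whole dataset. It only demonstrates the mechanics of putting one variable on each axis.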
Exercise 1. Create a scatterplot for the variables
in the following dataset:
Exercise 2. Use the same dataset to create a scatterplot for the variables