Covariance

When thinking about the relationship between two numerical variables, one of the things we often want to understand is the extent to which they behave in a similar way. "Similar" here means something like the following:

  • When one goes up, the other tends to go up too.
  • When one goes down, the other tends to go down too.
  • When one moves a lot (either up or down), the other tends to move a lot (in the same direction) too.

Covariance is the extent to which two numerical variables behave in the fashion I just described. Two variables exhibit covariance (they covary) when they tend to move together.

It's pretty easy to think of examples where we might expect to see high (even if imperfect) covariance:

  • a person's height and weight
  • outdoor temperature and ice cream consumption
  • exposure to direct sunlight and incidence of skin cancer
  • stock prices of two close competitors in the same industry

Now let's see how to measure the covariance between two variables.

Measuring covariance

As you might guess, there are both population and sample formulas for measuring the covariance between two variables.

Definition. The population covariance is given by

$${\operatorname{Cov}(X, Y) = \frac{\sum (X_i - \mu_X)(Y_i - \mu_Y)}{N}}$$

where \(\mu_X\) and \(\mu_Y\) are population means, and \(N\) is the size of the population.

Definition. The sample covariance estimates the population covariance from a random sample drawn from that population:

$${q_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}$$

where \(\bar{x}\) and \(\bar{y}\) are sample means and \(n\) is the sample size.
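
To make the two definitions concrete, here is a minimal Python sketch of both formulas. The helper names population_covariance and sample_covariance and the four-point dataset are my own, chosen only for illustration. NumPy's np.cov is used as a cross-check: it divides by n - 1 by default and by N when called with ddof=0.

```python
import numpy as np

def population_covariance(x, y):
    """Sum of products of deviations from the means, divided by N."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / len(x)

def sample_covariance(x, y):
    """Same sum of products, divided by n - 1 instead of N."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Small made-up dataset, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

print("population:", population_covariance(x, y))
print("sample:    ", sample_covariance(x, y))

# Cross-check against NumPy. np.cov returns a 2x2 covariance matrix,
# and the off-diagonal entry [0, 1] is the covariance between x and y.
print("np.cov (ddof=1 default):", np.cov(x, y)[0, 1])
print("np.cov (ddof=0):        ", np.cov(x, y, ddof=0)[0, 1])
```

The only difference between the two functions is the denominator, which is why the sample value is always a little larger in magnitude than the population value computed on the same data.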

Let's take a few moments to understand how these formulas work.

Understanding the covariance formulas

A geometric interpretation makes the covariance formula intuitive. (I'll speak of "the covariance formula" since the population and sample versions are conceptually identical.) In what follows we'll consider the example from the exercises in the previous section on scatterplots:

Name        Age  Height  Weight (lbs)
Ivo          51  6'2"    205
Lorelei      23  5'9"    138
Beatrix      48  5'6"    129
Persephone   12  5'4"    104
Pandora      30  5'5"    142

Let's interpret the covariance between height and weight. First, here's the scatterplot:

Scatterplot: weight vs height

In the scatterplot above, I've used closed circles for the data points, and an open circle to represent the (population) mean for the dataset.

The idea behind covariance is that we're taking the average of the products \((X_i - \mu_X)(Y_i - \mu_Y)\) across all observations \(i\). Each factor is a signed distance from the corresponding mean. Let's recenter the points around the origin so we can deal directly with the distances:

Recentering the points about the origin

Now we can see the distances involved more directly. Notice that each \((X_i - \mu_X)(Y_i - \mu_Y)\) is a product of mean-adjusted variables; geometrically, each is the signed area of a rectangle whose sides are the \(X\) and \(Y\) distances:

The data points as rectangles

In the plot above, there's a large rectangle in the upper right quadrant, three smaller rectangles in the lower left quadrant, and a small rectangle in the lower right quadrant. The sum in the formula is basically a sum of areas of rectangles. The upper right and lower left quadrants contribute positive terms since the product of two numbers with the same sign is positive. The upper left and lower right quadrants contribute negative terms since the product of two numbers with differing signs is negative.

Let's color the rectangles to reflect this difference, and label the rectangles with their areas:

Positive (green) and negative (red) contributions to the covariance

We can tell from the above that the covariance is going to be positive, since there is only one red rectangle, and it's small. Here's a table of the numerical values:

Name        Height offset (in)  Weight offset (lbs)  Area (in·lbs)
Ivo                        6.4                 61.4         392.96
Lorelei                    1.4                 -5.6          -7.84
Beatrix                   -1.6                -14.6          23.36
Persephone                -3.6                -39.6         142.56
Pandora                   -2.6                 -1.6           4.16
Mean area:                                                  111.04

Taking the average of these areas gives us the (population) covariance of 111.04.
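
The table can be reproduced with a few lines of plain Python. This is just a sketch of the arithmetic: the variable names are mine, and the heights are the values from the original table converted to inches.

```python
names = ["Ivo", "Lorelei", "Beatrix", "Persephone", "Pandora"]
heights_in = [74, 69, 66, 64, 65]       # 6'2", 5'9", 5'6", 5'4", 5'5" in inches
weights_lb = [205, 138, 129, 104, 142]

n = len(names)
mean_h = sum(heights_in) / n
mean_w = sum(weights_lb) / n

areas = []
for name, h, w in zip(names, heights_in, weights_lb):
    dh = h - mean_h          # height offset from the mean
    dw = w - mean_w          # weight offset from the mean
    area = dh * dw           # signed area of the rectangle
    areas.append(area)
    print(f"{name:<12}{dh:>6.1f}{dw:>8.1f}{area:>9.2f}")

# The population covariance is the mean of the signed areas.
population_cov = sum(areas) / n
print(f"Population covariance: {population_cov:.2f}")
```

Dividing the same sum of areas by n - 1 = 4 instead of 5 would give the sample covariance for this dataset.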

Strengths of the covariance as a measure of relationship

Captures both "positive" and "negative" relationships. The covariance captures not only the case where two variables tend to move together, but also the case where they tend to move in opposite directions.

Easy to understand. The covariance has an intuitive geometric interpretation, as we just saw. Even the purely algebraic presentation is mostly straightforward to understand once you notice that the covariance is essentially a mean.

Weaknesses of the covariance as a measure of relationship

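Scale-dependent. The covariance carries the units of both variables (inch-pounds in the height and weight example), so its magnitude changes whenever either variable is rescaled. This makes it difficult to compare covariances across different pairs of variables.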
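
Here is a small sketch of that scale dependence, using the height and weight data from earlier. The helper pop_cov is my own shorthand for the population formula. Converting inches to centimeters and pounds to kilograms leaves the relationship between the variables untouched, yet the covariance changes by the product of the two conversion factors.

```python
import numpy as np

heights_in = np.array([74, 69, 66, 64, 65], dtype=float)
weights_lb = np.array([205, 138, 129, 104, 142], dtype=float)

def pop_cov(x, y):
    """Population covariance: mean product of deviations from the means."""
    return np.mean((x - x.mean()) * (y - y.mean()))

IN_TO_CM = 2.54
LB_TO_KG = 0.45359237

cov_imperial = pop_cov(heights_in, weights_lb)
cov_metric = pop_cov(heights_in * IN_TO_CM, weights_lb * LB_TO_KG)

print(f"inches and pounds:  {cov_imperial:.2f}")
print(f"cm and kilograms:   {cov_metric:.2f}")
print(f"ratio:              {cov_metric / cov_imperial:.4f}")
print(f"product of factors: {IN_TO_CM * LB_TO_KG:.4f}")
```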

Sensitive to outliers. It should be clear from the preceding discussion that the covariance is sensitive to outliers. In the geometric explanation, we saw that a large rectangle has a large influence on the covariance. Adding a single arbitrarily large rectangle will have an arbitrarily large impact.
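
To see the effect, we can add one extreme point to the height and weight data and recompute. The sixth observation below (90 inches, 500 lbs) is entirely made up for the purpose of the demonstration, and pop_cov is again my own shorthand for the population formula.

```python
import numpy as np

def pop_cov(x, y):
    """Population covariance: mean product of deviations from the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

heights_in = [74, 69, 66, 64, 65]
weights_lb = [205, 138, 129, 104, 142]

# Hypothetical outlier, not part of the original dataset.
outlier_height, outlier_weight = 90, 500

cov_without = pop_cov(heights_in, weights_lb)
cov_with = pop_cov(heights_in + [outlier_height], weights_lb + [outlier_weight])

print(f"without outlier: {cov_without:.2f}")
print(f"with outlier:    {cov_with:.2f}")
```

A single added point is enough to change the covariance by roughly an order of magnitude here, because its rectangle dwarfs all the others.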

Less useful for nonlinear relationships. Covariance measures linear dependence, but often misses cases where there's a clear but nonlinear relationship. A covariance of zero indicates only the absence of a linear relationship between the two variables. Many datasets show a clear relationship and yet have a covariance of zero:

Datasets with covariance zero
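
A simple case we can check by hand is a parabola sampled symmetrically about zero. The data below are made up for illustration: y is completely determined by x, yet the positive and negative rectangles cancel exactly.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2                      # y is an exact (nonlinear) function of x

# Population covariance: mean product of deviations from the means.
products = (x - x.mean()) * (y - y.mean())
cov_xy = products.mean()

print("products of deviations:", products)
print("covariance:", cov_xy)
```

The rectangles on the left of the origin mirror those on the right with opposite signs, so their areas sum to zero even though the relationship is perfectly predictable.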

Applications of covariance

In finance, covariance is useful for portfolio management. Portfolio managers generally try to reduce portfolio risk by diversifying the stocks in a portfolio. In modern portfolio theory this means combining assets whose returns have low or, ideally, negative covariance with one another.
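
The reason covariance matters here is the identity Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y): a negative covariance term pulls the variance of the combination down. The sketch below illustrates this with made-up return series for three hypothetical assets and equal weights. The numbers, the names a, b, c, and the helper functions are all assumptions for illustration, not real market data or a portfolio-construction method.

```python
import numpy as np

# Made-up periodic returns for three hypothetical assets (not real data).
a = np.array([0.02, -0.01, 0.03, -0.02, 0.01])
b = np.array([0.01, -0.02, 0.04, -0.01, 0.02])   # tends to move with a
c = np.array([-0.01, 0.02, -0.01, 0.02, 0.00])   # tends to move against a

def pop_cov(x, y):
    """Population covariance: mean product of deviations from the means."""
    return np.mean((x - x.mean()) * (y - y.mean()))

def portfolio_variance(x, y, wx=0.5, wy=0.5):
    """Variance of wx*x + wy*y via the variance-of-a-sum identity."""
    return (wx ** 2 * pop_cov(x, x)
            + wy ** 2 * pop_cov(y, y)
            + 2 * wx * wy * pop_cov(x, y))

var_ab = portfolio_variance(a, b)
var_ac = portfolio_variance(a, c)

print(f"Cov(a, b) = {pop_cov(a, b):.6f}, variance of 50/50 a+b = {var_ab:.6f}")
print(f"Cov(a, c) = {pop_cov(a, c):.6f}, variance of 50/50 a+c = {var_ac:.6f}")
```

With these numbers, pairing a with the negatively covarying c gives a much lower portfolio variance than pairing it with the positively covarying b.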

In the next section we'll learn about a normalized version of covariance, the correlation coefficient.

Exercises

Exercise 1. Download the hygrometer3.csv dataset from williewheeler/datasets, and use a software package to calculate the sample covariance between temperature and relative humidity. What does the result tell you about the relationship between the two variables?

Exercise 2. What happens when you take the covariance of a variable with itself? How does it compare to the variance?

Exercise 3. Create a dataset whose covariance you expect to be a large positive number. Calculate the covariance to see if you are right.

Exercise 4. Create a dataset whose covariance you expect to be a large negative number. Calculate the covariance to see if you are right.

Exercise 5. Create a dataset whose covariance you expect to be close to zero. Calculate the covariance to see if you are right.