# Covariance

When thinking about the relationship between two numerical variables, one of the things we often want to understand is the extent to which they behave in a similar way. "Similar" here means something like the following:

- When one goes up, the other tends to go up too.
- When one goes down, the other tends to go down too.
- When one moves a lot (either up or down), the other tends to move a lot (in the same direction) too.

*Covariance* is the extent to which two numerical variables behave in the fashion I just described.
Two variables exhibit covariance—they *covary*—when they tend to move together.

It's pretty easy to think of examples where we might expect to see high (even if imperfect) covariance:

- a person's height and weight
- outdoor temperature and ice cream consumption
- exposure to direct sunlight and incidence of skin cancer
- stock prices of two close competitors in the same industry

Now let's see how to measure the covariance between two variables.

### Measuring covariance

As you might guess, there are both population and sample formulas for measuring the covariance between two variables.

**Definition.** The *population covariance* is given by

\[
\sigma_{XY} = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)
\]

where \(\mu_X\) and \(\mu_Y\) are the population means, and \(N\) is the size of the population.

**Definition.** The *sample covariance* estimates the population covariance from a
random sample drawn from that population:

\[
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\]

where \(\bar{x}\) and \(\bar{y}\) are the sample means and \(n\) is the sample size.
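The sample formula translates directly into code. Here's a minimal sketch in Python (the function name `sample_cov` and the data are my own, for illustration):

```python
# Sample covariance computed straight from the definition:
# average the products of deviations, dividing by n - 1.
def sample_cov(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]
print(sample_cov(x, y))  # approximately 2.0
```

If you use NumPy, `np.cov(x, y)[0][1]` computes the same quantity, since NumPy also defaults to the \(n-1\) denominator.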

Let's take a few moments to understand how these formulas work.

### Understanding the covariance formulas

A geometric interpretation makes the covariance formula pretty intuitive to understand. (I'll speak of "the covariance formula" since the population and sample versions are conceptually identical.) In the following we'll consider the example from the exercises in the previous section on scatterplots:

Name | Age | Height | Weight (lbs) |
---|---|---|---|
Ivo | 51 | 6'2" | 205 |
Lorelei | 23 | 5'9" | 138 |
Beatrix | 48 | 5'6" | 129 |
Persephone | 12 | 5'4" | 104 |
Pandora | 30 | 5'5" | 142 |

Let's interpret the covariance between height and weight. First, here's the scatterplot:

In the scatterplot above, I've used closed circles for the data points, and an open circle to represent the (population) mean for the dataset.

The idea behind covariance is that we're taking the average of \((X_i - \mu_X)(Y_i - \mu_Y)\) across all observations \(i\). Each factor is essentially a signed distance from the corresponding mean. Let's recenter the points around the origin so we can deal directly with the distances:

Now we can more directly see the distances involved. Notice that each \((X_i - \mu_X)(Y_i - \mu_Y)\) is a product of mean-adjusted variables; geometrically, each is the signed area of a rectangle whose sides are the \(X\) and \(Y\) distances:

In the plot above, there's a large rectangle in the upper right quadrant, three smaller rectangles in the lower left quadrant, and a small rectangle in the lower right quadrant. The sum in the formula is basically a sum of signed areas of rectangles. The upper right and lower left quadrants contribute positive terms, since the product of two numbers with the same sign is positive. The upper left and lower right quadrants contribute negative terms, since the product of two numbers with differing signs is negative.

Let's color the rectangles to reflect this difference, and label the rectangles with their areas:

We can tell from the above that the covariance is going to be positive, since there is only one red rectangle, and it's small. Here's a table of the numerical values:

Name | Height offset (in) | Weight offset (lbs) | Area |
---|---|---|---|
Ivo | 6.4 | 61.4 | 392.96 |
Lorelei | 1.4 | -5.6 | -7.84 |
Beatrix | -1.6 | -14.6 | 23.36 |
Persephone | -3.6 | -39.6 | 142.56 |
Pandora | -2.6 | -1.6 | 4.16 |
**Mean area** | | | **111.04** |

Taking the average of these areas gives us the (population) covariance of 111.04.
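We can check this calculation in code. The sketch below redoes the population covariance for the height/weight data, with heights converted to inches:

```python
# Population covariance for the height/weight data above.
# Heights converted to inches: 6'2" = 74 in, 5'9" = 69 in, etc.
heights = [74, 69, 66, 64, 65]        # Ivo, Lorelei, Beatrix, Persephone, Pandora
weights = [205, 138, 129, 104, 142]   # lbs

n = len(heights)
mean_h = sum(heights) / n             # 67.6
mean_w = sum(weights) / n             # 143.6

# Average of the signed rectangle areas
cov = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights)) / n
print(cov)  # approximately 111.04, the mean area from the table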

### Strengths of covariance as a measure of relationship

**Captures both "positive" and "negative" relationships.**
The covariance captures not only the case where two variables tend to move together, but also the case where
they tend to move in opposite directions.

**Easy to understand.** The covariance has an
intuitive geometric interpretation, as we just saw. Even the purely algebraic presentation is mostly
straightforward to understand once you notice that the covariance is essentially a mean.

### Weaknesses of covariance as a measure of relationship

**Scale-dependent.** The covariance reflects
the scales of the variables involved, so its magnitude depends on the units of measurement. This makes it difficult to compare covariances across different pairs of
variables.
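A quick sketch makes the scale dependence concrete: converting the heights from the earlier example from inches to centimeters multiplies the covariance by 2.54, even though the underlying relationship is unchanged.

```python
# Covariance is scale-dependent: rescaling a variable rescales the covariance.
def pop_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

heights_in = [74, 69, 66, 64, 65]
weights = [205, 138, 129, 104, 142]
heights_cm = [h * 2.54 for h in heights_in]   # inches -> centimeters

print(pop_cov(heights_in, weights))   # approximately 111.04
print(pop_cov(heights_cm, weights))   # approximately 282.04 = 111.04 * 2.54
```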

**Sensitive to outliers.** It should be clear
from the preceding discussion that the covariance is sensitive to outliers. In the geometric explanation, we
saw that a large rectangle has a large influence on the covariance. A single arbitrarily large
rectangle can have an arbitrarily large impact.

**Less useful for nonlinear relationships.**
Covariance measures linear dependence, so it often misses cases where there's a clear but nonlinear
relationship. A covariance of zero indicates only the absence of a *linear* relationship between the two
variables; many datasets show clear nonlinear relationships and yet have a covariance of
zero.
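Here's a minimal example of that failure mode: a symmetric parabola, where \(y\) is a perfect function of \(x\) but the covariance comes out exactly zero because the positive and negative products cancel.

```python
# y = x^2 over a symmetric range: a perfect (nonlinear) relationship
# with zero covariance, because products from the left and right
# halves of the parabola cancel exactly.
def pop_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]   # [4, 1, 0, 1, 4]
print(pop_cov(x, y))     # 0.0
```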

## Applications of covariance

In finance, covariance is useful for portfolio management. Portfolio managers generally try to reduce portfolio risk by diversifying the stocks in a portfolio. In modern portfolio theory this means including pairs of assets whose returns have negative covariance.
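A toy sketch shows why this works. The return series below are made up for illustration; the point is the identity \(\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)\): when the covariance term is negative, the combined portfolio can be less risky than either asset alone.

```python
# Hypothetical return series for two assets that tend to move
# in opposite directions (negative covariance).
def pop_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

asset_a = [0.02, -0.01, 0.03, -0.02, 0.01]
asset_b = [-0.01, 0.02, -0.02, 0.03, 0.01]

# 50/50 portfolio of the two assets
portfolio = [(a + b) / 2 for a, b in zip(asset_a, asset_b)]

var_a = pop_cov(asset_a, asset_a)        # variance = covariance with itself
var_b = pop_cov(asset_b, asset_b)
var_p = pop_cov(portfolio, portfolio)

print(pop_cov(asset_a, asset_b) < 0)     # True: the assets covary negatively
print(var_p < min(var_a, var_b))         # True: the mix is less volatile than either asset
```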

In the next section we'll learn about a normalized version of covariance, the correlation coefficient.

## Exercises

**Exercise 1.** Download the `hygrometer3.csv` dataset from williewheeler/datasets, and use a
software package to calculate the sample covariance between temperature and relative humidity. What does the
result tell you about the relationship between the two variables?

**Exercise 2.** What happens when you take the covariance of a variable with itself? How does
it compare to the variance?

**Exercise 3.** Create a dataset whose covariance you expect to be a large positive number.
Calculate the covariance to see if you are right.

**Exercise 4.** Create a dataset whose covariance you expect to be a large negative number.
Calculate the covariance to see if you are right.

**Exercise 5.** Create a dataset whose covariance you expect to be close to zero. Calculate the
covariance to see if you are right.