# Correlation Coefficient

Karl Pearson developed the correlation coefficient, also known as Pearson's `r`, but
Auguste Bravais published it first, in 1844.

In the previous section on covariance, we learned how to judge whether
two numerical variables tend to "move together", that is, whether they exhibit *covariance*. The
calculation itself is easy to understand: it is the mean of the products of the two variables' deviations from their respective means.

We also noted that the covariance, unfortunately, suffers from being highly scale-dependent. This limits its usefulness when comparing the covariance across different pairs of variables. If the prices for stocks A and B have covariance \(Cov(A, B)\) and the prices for stocks C and D have covariance \(Cov(C, D)\), as a general rule there isn't a good way to compare \(Cov(A, B)\) and \(Cov(C, D)\) to see which pair of stock prices "moves together" more.

In this section we'll learn a scale-independent way of measuring whether two numerical variables tend to
move together: the *correlation coefficient*. The correlation coefficient is normalized such that its
values lie in the closed interval [-1, 1].

## Correlation coefficient definition

As with the covariance, the correlation coefficient has both population and sample variants. We'll consider both.

**Definition.** The *population correlation coefficient* \(\rho\) is a
straightforward normalization of the covariance:

\[
\rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}
\]

where \(Cov(X, Y)\) is the population covariance and \(\sigma_X\) and \(\sigma_Y\) are population standard deviations.
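
To see why this normalization removes the scale dependence, here is a minimal Python sketch. The price lists are made-up illustrative numbers, not real market data, and the helper names `population_covariance` and `population_correlation` are chosen for this example only. Converting one variable from dollars to cents multiplies the covariance by 100 but leaves the correlation coefficient unchanged.

```python
from statistics import fmean, pstdev


def population_covariance(xs, ys):
    """Mean of the products of deviations from the population means."""
    mu_x, mu_y = fmean(xs), fmean(ys)
    return fmean([(x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)])


def population_correlation(xs, ys):
    """Covariance divided by the product of population standard deviations."""
    return population_covariance(xs, ys) / (pstdev(xs) * pstdev(ys))


# Hypothetical prices (illustrative values only).
a_dollars = [10.0, 12.0, 11.0, 15.0, 14.0]
b_dollars = [20.0, 23.0, 21.0, 28.0, 27.0]

# The same prices for A, expressed in cents instead of dollars.
a_cents = [100 * a for a in a_dollars]

cov_dollars = population_covariance(a_dollars, b_dollars)
cov_cents = population_covariance(a_cents, b_dollars)
rho_dollars = population_correlation(a_dollars, b_dollars)
rho_cents = population_correlation(a_cents, b_dollars)

print("covariance (dollars):", cov_dollars)
print("covariance (cents):  ", cov_cents)    # 100 times larger
print("correlation (dollars):", rho_dollars)
print("correlation (cents):  ", rho_cents)   # same value, up to rounding
```

The factor of 100 appears in both the covariance and \(\sigma_X\), so it cancels in the ratio.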

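Expanding this out and doing a bit of algebra, we have:

\[
\rho = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{N} (x_i - \mu_X)^2} \sqrt{\sum_{i=1}^{N} (y_i - \mu_Y)^2}}
\]

where \(\mu_X\) and \(\mu_Y\) are the population means and \(N\) is the population size.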

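**Definition.** The *sample correlation coefficient* \(r\) is analogous to the
population correlation coefficient, but we use the sample covariance and sample standard deviations instead. Here's the formula
after expanding things out:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

where \(\bar{x}\) and \(\bar{y}\) are sample means and \(n\) is the sample size.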
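
The expanded sample formula translates almost directly into code: sum the products of the deviations from the sample means, then divide by the square roots of the two sums of squared deviations. The following is a minimal Python sketch under that reading. The function name `sample_correlation` and the small dataset are illustrative assumptions, not part of the original material. The \(1/(n-1)\) factors from the sample covariance and sample standard deviations cancel, so they don't appear.

```python
from math import sqrt


def sample_correlation(xs, ys):
    """Sample correlation coefficient r for two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sum of products of deviations (numerator of r).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sums of squared deviations (under the square roots in the denominator).
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)

    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / (sqrt(sxx) * sqrt(syy))


# Small made-up sample (illustrative values only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = sample_correlation(xs, ys)
print(round(r, 4))  # 0.7746
```

Here the deviations give a numerator of 6 and a denominator of \(\sqrt{10} \cdot \sqrt{6}\), so \(r = 6 / \sqrt{60} \approx 0.7746\).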

## Interpreting the correlation coefficient

First, here are some key points about the correlation coefficient:

- The correlation coefficient is always in the interval [-1, 1].
- Positive values indicate that the two variables tend to move linearly in the same direction.
- Negative values indicate that the two variables tend to move linearly in opposite directions.
- Values near zero mean that there's no *linear* relationship between the two variables, even though there might still be a nonlinear relationship between them.
- When the correlation coefficient is either 1 or -1, it means that the relationship is perfectly linear. That is, all the points as plotted on a scatterplot lie on a straight line. Intermediate values indicate more fuzziness to the relationship.
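
The key points above can be checked numerically. The sketch below uses synthetic data and a hypothetical helper named `pearson_r`, both chosen for illustration. Points on an upward-sloping line give a coefficient of 1, points on a downward-sloping line give -1, and a symmetric parabola gives 0 even though \(y\) is completely determined by \(x\).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient for two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sqrt(sxx) * sqrt(syy))


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]

rising = [2 * x + 1 for x in xs]     # exact line with positive slope
falling = [-3 * x + 4 for x in xs]   # exact line with negative slope
parabola = [x ** 2 for x in xs]      # nonlinear, symmetric about x = 0

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_parabola = pearson_r(xs, parabola)

print("rising line: ", r_rising)     # 1, up to floating-point rounding
print("falling line:", r_falling)    # -1, up to floating-point rounding
print("parabola:    ", r_parabola)   # 0: no linear relationship
```

The parabola case is why a value near zero rules out only a *linear* relationship: the positive and negative products of deviations cancel exactly.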

Interpreting the correlation coefficient is not always straightforward. Judgments about correlation are a matter of degree and context. In physical sciences where we're able to isolate variables and measure them with high-quality instruments, we might require correlations of 0.9 or higher in magnitude before we consider variables to be correlated. On the other hand, in the social sciences it's usually more difficult to isolate and measure variables in this way, so we might consider a value of 0.8 to be evidence of strong correlation. In most contexts a correlation with a magnitude lower than 0.5 would be considered weak.

## Exercises

**Exercise 1.** Calculate the sample correlation coefficient for the two stocks in the
following dataset:

| Date | AMZN Close | GOOG Close |
|---|---|---|
| 2020-04-27 | $2,376.00 | $1,275.88 |
| 2020-04-28 | $2,314.08 | $1,233.67 |
| 2020-04-29 | $2,372.71 | $1,341.48 |
| 2020-04-30 | $2,474.00 | $1,348.66 |
| 2020-05-01 | $2,286.04 | $1,320.61 |

**Exercise 2.** Based on the very limited dataset in Exercise 1, is there evidence of
correlation between AMZN and GOOG stock prices?