Calculate the Correlation Coefficient in Python

Your goal

You need to calculate Pearson's correlation coefficient \(r\) between two numerical variables in Python. The variables are stored as columns of a Pandas DataFrame.

(For the coefficient of determination, \(r^2\), just calculate the correlation coefficient \(r\) and square it.)

Step-by-step tutorial

In Python, the two major libraries for calculating correlation coefficients are Pandas and NumPy. Their main entry points, DataFrame.corr() in Pandas and numpy.corrcoef() in NumPy, return a correlation matrix rather than a single coefficient, so you'll need to pluck the coefficient you want from the matrix.
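
For comparison, here is a minimal sketch of the NumPy route. The arrays x and y are made-up illustrative values, not the precipitation data used in this tutorial. With two 1-D inputs, numpy.corrcoef() returns a 2x2 matrix whose off-diagonal entries hold \(r\).

```python
import numpy as np

# Made-up illustrative data (an assumption for this sketch),
# not taken from the precipitation file.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# For two 1-D inputs, np.corrcoef returns a 2x2 correlation matrix.
corr_matrix = np.corrcoef(x, y)

# The diagonal is each variable's correlation with itself (1.0);
# the off-diagonal entry is Pearson's r between x and y.
r = corr_matrix[0, 1]

print(corr_matrix)
print(r)
```

If your data already lives in a DataFrame, the Pandas approach saves you from converting columns to arrays first.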

Let's use Pandas since we're already assuming a Pandas DataFrame.

>>> import pandas as pd
>>> precip = pd.read_csv("precip-central-park.csv")
>>> precip
     YEAR   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC  ANNUAL
0    1869  2.53  6.87  4.61  1.39  4.15  4.40  3.20  1.76  2.81  6.48  2.03  5.02   45.25
1    1870  4.41  2.83  3.33  5.11  1.83  2.82  3.76  3.07  2.52  4.97  2.42  2.18   39.25
2    1871  2.07  2.72  5.54  3.03  4.04  7.05  5.57  5.60  2.34  7.50  3.56  2.24   51.26
3    1872  1.88  1.29  3.74  2.29  2.68  2.93  7.83  6.29  2.95  3.35  4.08  3.18   42.49
4    1873  5.34  3.80  2.09  4.16  3.69  1.28  4.61  9.56  3.14  2.73  4.63  2.96   47.99
..    ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...     ...
146  2015  5.23  2.04  4.72  2.08  1.86  4.79  3.98  2.35  3.28  3.91  2.01  4.72   40.97
147  2016  4.41  4.40  1.17  1.61  3.75  2.60  7.02  1.97  2.79  4.15  5.41  2.89   42.17
148  2017  4.83  2.48  5.25  3.84  6.38  4.76  4.19  3.34  2.00  4.18  1.58  2.21   45.04
149  2018  2.18  5.83  5.17  5.78  3.53  3.11  7.45  8.59  6.19  3.59  7.62  6.51   65.55
150  2019  3.58  3.14  3.87  4.55  6.82  5.46  5.77  3.70  0.95  6.15  1.95  7.09   53.03

[151 rows x 14 columns]
>>> jan_jul = precip[["JAN", "JUL"]]
>>> jan_jul
      JAN   JUL
0    2.53  3.20
1    4.41  3.76
2    2.07  5.57
3    1.88  7.83
4    5.34  4.61
..    ...   ...
146  5.23  3.98
147  4.41  7.02
148  4.83  4.19
149  2.18  7.45
150  3.58  5.77

[151 rows x 2 columns]
>>> jan_jul.corr()
          JAN       JUL
JAN  1.000000 -0.146235
JUL -0.146235  1.000000
>>> jan_jul.corr()["JAN"]["JUL"]
-0.14623455484237252
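
If you only need one coefficient, you can likely skip the matrix. The sketch below assumes Pandas' Series.corr() method, which uses Pearson's method by default and returns a single float. It also shows a .loc lookup as an alternative to the chained ["JAN"]["JUL"] indexing, and the squaring step for \(r^2\). To stay self-contained it rebuilds only the first five rows printed above, so its result will differ from the full-dataset value of about -0.146.

```python
import pandas as pd

# Only the first five rows shown in the printout above, so this runs
# without the CSV file. The result will not match the full dataset.
df = pd.DataFrame({
    "JAN": [2.53, 4.41, 2.07, 1.88, 5.34],
    "JUL": [3.20, 3.76, 5.57, 7.83, 4.61],
})

# Series.corr computes Pearson's r by default and returns one float.
r = df["JAN"].corr(df["JUL"])

# The same value, taken from the correlation matrix in a single lookup.
r_from_matrix = df.corr().loc["JAN", "JUL"]

# Coefficient of determination: square the correlation coefficient.
r_squared = r ** 2

print(r)
print(r_from_matrix)
print(r_squared)
```

On the real data, the equivalent calls would be precip["JAN"].corr(precip["JUL"]) and jan_jul.corr().loc["JAN", "JUL"].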