Correlation

Introduction to Statistics in Python

Maggie Matsui

Content Developer, DataCamp

Relationships between two variables

Scatter plot of sleep habits of mammals, showing total sleep per day vs REM sleep per day

  • x = explanatory/independent variable
  • y = response/dependent variable

Correlation coefficient

  • Quantifies the linear relationship between two variables
  • Number between -1 and 1
  • Magnitude corresponds to strength of relationship
  • Sign (+ or -) corresponds to direction of relationship

Magnitude = strength of relationship

0.99 (very strong relationship)

Scatterplot with points very close to an invisible line

0.75 (strong relationship)

Scatterplot with points further from the invisible line


Magnitude = strength of relationship

0.56 (moderate relationship)

Scatterplot with points even further from the invisible line

0.21 (weak relationship)

Scatterplot with points that look almost totally randomly scattered


Magnitude = strength of relationship

0.04 (no relationship)

Scatterplot with points that look totally randomly scattered

  • Knowing the value of x doesn't tell us anything about y
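
To make the link between magnitude and scatter concrete, here is a minimal sketch on synthetic data (not the msleep dataset). The variable names, noise levels, and random seed are assumptions chosen for illustration: the same straight-line relationship is given increasing amounts of random noise, and the correlation shrinks toward 0 as the points move further from the line.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (hypothetical, not msleep)
rng = np.random.default_rng(0)
x = pd.Series(np.linspace(0, 10, 500))

# Same underlying line y = x, with progressively more noise added
correlations = {}
for noise_sd in [0.5, 3, 15]:
    y = x + rng.normal(0, noise_sd, size=len(x))
    correlations[noise_sd] = x.corr(y)

# Print the correlation for each noise level
for noise_sd, r in correlations.items():
    print(f"noise sd = {noise_sd}: r = {r:.2f}")
```

The more noise relative to the spread of x, the closer the coefficient gets to 0.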

Sign = direction

0.75: as x increases, y increases

Scatterplot where y increases as x increases

-0.75: as x increases, y decreases

Scatterplot where y decreases as x increases
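
As a small sketch of how the sign tracks direction, the snippet below uses hypothetical toy values (not taken from msleep). Flipping y to -y reverses the direction of the relationship, so the coefficient keeps its magnitude and changes sign.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([2.1, 2.9, 4.2, 4.8, 6.3, 6.9])  # tends to rise with x

r_up = x.corr(y)       # y increases as x increases
r_down = x.corr(-y)    # flipped: y decreases as x increases

print(r_up > 0, r_down < 0)
```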


Visualizing relationships

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of REM sleep against total sleep
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()

Scatter plot of sleep_rem vs. sleep_total


Adding a trendline

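import matplotlib.pyplot as plt
import seaborn as sns

# ci=None draws the linear trendline without a confidence interval band
sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()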

Scatter plot of sleep_rem vs. sleep_total with linear trendline
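
The trendline drawn by sns.lmplot() is a linear regression fit. As a rough sketch of what such a line is, the snippet below fits a least-squares straight line with NumPy. The values are hypothetical stand-ins for the two columns, and np.polyfit is used here as an illustration, not as a description of seaborn's internals.

```python
import numpy as np

# Hypothetical values standing in for sleep_total and sleep_rem
sleep_total = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
sleep_rem = np.array([1.0, 1.8, 2.1, 3.0, 3.4])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(sleep_total, sleep_rem, deg=1)

# Values of the fitted line at each x, i.e. the trendline
trend = slope * sleep_total + intercept
print(f"sleep_rem = {slope:.2f} * sleep_total + {intercept:.2f}")
```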


Computing correlation

msleep['sleep_total'].corr(msleep['sleep_rem'])
0.751755

 

msleep['sleep_rem'].corr(msleep['sleep_total'])
0.751755

Many ways to calculate correlation

  • Used in this course: Pearson product-moment correlation ($r$)
    • Most common
    • $\bar{x}, \bar{y} =$ means of $x$ and $y$
    • $\sigma_x, \sigma_y =$ sample standard deviations of $x$ and $y$
    • $n =$ number of observations

$$ r = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \cdot \sigma_y}$$

  • Rank-based alternatives:
    • Kendall's tau
    • Spearman's rho
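
The Pearson formula can be checked by hand against pandas. This is a minimal sketch on hypothetical toy data: each variable is standardized with the sample standard deviation, the products are summed, and the total is divided by n - 1. The last line sketches Spearman's rho as Pearson's r applied to the ranks of the data, computed with `.rank()` to avoid extra dependencies.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 7.0, 5.0])
n = len(x)

# Pearson's r from the formula: standardize each variable with the
# sample standard deviation (ddof=1), multiply, sum, divide by n - 1
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)
r_manual = (z_x * z_y).sum() / (n - 1)

# pandas computes the Pearson correlation by default
r_pandas = x.corr(y)

# Spearman's rho sketched as Pearson's r applied to the ranks
rho_from_ranks = x.rank().corr(y.rank())

print(r_manual, r_pandas, rho_from_ranks)
```

The manual value and the pandas value should agree up to floating-point rounding.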

Let's practice!
