Correlation

Introduction to Statistics in R

Maggie Matsui

Content Developer, DataCamp

Relationships between two variables

Scatter plot of sleep habits of mammals, showing total sleep per day vs REM sleep per day

  • x = explanatory/independent variable
  • y = response/dependent variable

Correlation coefficient

  • Quantifies the linear relationship between two variables
  • Number between -1 and 1
  • Magnitude corresponds to strength of relationship
  • Sign (+ or -) corresponds to direction of relationship

Magnitude = strength of relationship

0.99 (very strong relationship)

Scatterplot with points very close to an invisible line

0.75 (strong relationship)

Scatterplot with points further from the invisible line

0.56 (moderate relationship)

Scatterplot with points even further from the invisible line

0.21 (weak relationship)

Scatterplot with points that look almost totally randomly scattered

0.04 (no relationship)

Scatterplot with points that look totally randomly scattered

  • Knowing the value of x doesn't tell us anything about y
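A minimal sketch of how magnitude tracks scatter, using simulated data (the variable names, noise levels, and seed are assumptions for illustration, not the data behind the plots above). The same upward trend is kept while the noise around it grows, and the correlation shrinks toward 0:

```r
# Simulated data (an assumption for illustration, not the course's data).
set.seed(123)
n <- 500
x <- rnorm(n)

y_tight <- x + rnorm(n, sd = 0.1)  # points hug the line
y_loose <- x + rnorm(n, sd = 2)    # same trend, much more scatter
y_none  <- rnorm(n)                # generated independently of x

r_tight <- cor(x, y_tight)
r_loose <- cor(x, y_loose)
r_none  <- cor(x, y_none)

# Compare the three coefficients side by side
round(c(tight = r_tight, loose = r_loose, none = r_none), 2)
```

The exact values depend on the seed; the ordering (tight, then loose, then none) is the point.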

Sign = direction

0.75: as x increases, y increases

Scatterplot where y increases as x increases

-0.75: as x increases, y decreases

Scatterplot where y decreases as x increases
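A small sketch of the sign rule, again on simulated data (names and parameters are assumptions). Reversing the direction of y flips the sign of the correlation and leaves its magnitude unchanged:

```r
# Simulated data (an assumption for illustration).
set.seed(3)
x <- rnorm(200)
y <- 0.75 * x + rnorm(200, sd = 0.6)  # y tends to increase with x

r_pos <- cor(x, y)    # positive: as x increases, y increases
r_neg <- cor(x, -y)   # negative: as x increases, -y decreases

c(r_pos = r_pos, r_neg = r_neg)
```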

Visualizing relationships

library(ggplot2)

ggplot(df, aes(x, y)) +
  geom_point()

Scatterplot where y decreases as x increases

Adding a trendline

ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Scatterplot where y decreases as x increases with trendline
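With `method = "lm"`, the trendline is a least-squares straight line. A rough sketch of how that line relates to the correlation, using base R's `lm()` on simulated data (the data frame below is a stand-in, not the `df` from the slides): the slope has the same sign as the correlation and equals r scaled by sd(y) / sd(x).

```r
# Simulated stand-in for df (an assumption for illustration).
set.seed(7)
df <- data.frame(x = rnorm(100))
df$y <- -1.5 * df$x + rnorm(100)

# Fit the same kind of straight line that method = "lm" draws
fit <- lm(y ~ x, data = df)
slope <- unname(coef(fit)["x"])

r <- cor(df$x, df$y)

# For a simple linear fit, slope = r * sd(y) / sd(x)
slope_from_r <- r * sd(df$y) / sd(df$x)

c(slope = slope, slope_from_r = slope_from_r, r = r)
```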

Computing correlation

cor(df$x, df$y)
-0.7472765

 

cor(df$y, df$x)
-0.7472765

Correlation with missing values

df$x
-3.2508382  -9.1599807   3.4515013   4.1505899          NA   11.9806140   ...
cor(df$x, df$y)
NA
cor(df$x, df$y, use = "pairwise.complete.obs")
-0.7471757
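A sketch of what `use = "pairwise.complete.obs"` does for two vectors, on simulated data with one value set to `NA` by hand (the data and the position of the missing value are assumptions): it gives the same result as dropping the incomplete pairs yourself.

```r
# Simulated stand-in for df (an assumption for illustration).
set.seed(1)
df <- data.frame(x = rnorm(50))
df$y <- -0.8 * df$x + rnorm(50, sd = 0.5)
df$x[5] <- NA  # introduce a single missing value

r_default <- cor(df$x, df$y)  # NA, because one pair is incomplete

# Let cor() skip incomplete pairs
r_pairwise <- cor(df$x, df$y, use = "pairwise.complete.obs")

# Equivalent by hand: keep only rows where both x and y are present
keep <- complete.cases(df$x, df$y)
r_manual <- cor(df$x[keep], df$y[keep])

c(r_pairwise = r_pairwise, r_manual = r_manual)
```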

Many ways to calculate correlation

  • Used in this course: Pearson product-moment correlation ($r$)
    • Most common
    • $\bar{x} =$ mean of $x$, $\bar{y} =$ mean of $y$

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

  • Rank-based alternatives:
    • Kendall's tau
    • Spearman's rho
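A sketch that computes the Pearson formula by hand and compares it with `cor()`, on simulated data. The helper `pearson_r()` is a hypothetical name introduced here for illustration. Kendall's tau and Spearman's rho are available through the `method` argument of `cor()`, and Spearman's rho matches Pearson's r computed on the ranks.

```r
# Simulated data (an assumption for illustration).
set.seed(42)
x <- rnorm(30)
y <- 2 * x + rnorm(30)

# Hypothetical helper: the Pearson formula written out term by term
pearson_r <- function(x, y) {
  dx <- x - mean(x)  # deviations of x from its mean
  dy <- y - mean(y)  # deviations of y from its mean
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

r_manual  <- pearson_r(x, y)
r_builtin <- cor(x, y)  # method = "pearson" is the default

# Other measures via the method argument
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

# Spearman's rho is Pearson's r applied to the ranks
r_rank <- cor(rank(x), rank(y))

c(manual = r_manual, builtin = r_builtin,
  spearman = r_spearman, kendall = r_kendall)
```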

Let's practice!
