Correlation caveats

Introduction to Statistics in R

Maggie Matsui

Content Developer, DataCamp

Non-linear relationships

scatterplot of variables with a quadratic relationship

$$r = 0.18$$

Non-linear relationships

What we see:

scatterplot of variables with a quadratic relationship with quadratic trendline

What the correlation coefficient sees:

scatterplot of variables with a quadratic relationship with a linear trendline

Correlation only accounts for linear relationships

Correlation shouldn't be used blindly

cor(df$x, df$y)
0.1786163

Always visualize your data

scatterplot of variables with a quadratic relationship
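
As a minimal sketch of why the plot matters, the snippet below builds a made-up quadratic dataset and compares two correlations. The vectors `x` and `y` are illustrative assumptions, not the data behind the slide. Even though `y` is fully determined by `x`, `cor()` reports almost no linear relationship.

```r
# Hypothetical data: y depends on x exactly, but not linearly
x <- seq(-3, 3, by = 0.1)
y <- x^2

# Pearson correlation is essentially zero because the falling and
# rising halves of the parabola cancel each other out
r_raw <- cor(x, y)

# Correlating y with x^2 instead recovers the perfect relationship
r_squared_term <- cor(x^2, y)

print(c(raw = r_raw, squared_term = r_squared_term))
```

Calling `plot(x, y)` on these vectors shows the curve that the single number hides.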

Mammal sleep data

msleep
   name                       vore  sleep_total awake  bodywt
 1 Cheetah                    carni        12.1  11.9  50    
 2 Owl monkey                 omni         17     7     0.48 
 3 Mountain beaver            herbi        14.4   9.6   1.35 
 4 Greater short-tailed shrew omni         14.9   9.1   0.019
 5 Cow                        herbi         4    20   600    
 6 Three-toed sloth           herbi        14.4   9.6   3.85 
 ... 

Body weight vs. awake time

Scatterplot of body weight vs awake time

cor(msleep$bodywt, msleep$awake)
0.3119801

Distribution of body weight

Histogram of bodywt variable

Log transformation

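library(dplyr)
library(ggplot2)

# Save the log-transformed weight back to msleep so cor() can find it later
msleep <- msleep %>%
  mutate(log_bodywt = log(bodywt))

ggplot(msleep, aes(log_bodywt, awake)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

cor(msleep$log_bodywt, msleep$awake)
0.5687943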

Scatterplot of log bodywt vs awake

Other transformations

  • Log transformation (log(x))
  • Square root transformation (sqrt(x))
  • Reciprocal transformation (1 / x)

  • Combinations of these, e.g.:

    • log(x) and log(y)
    • sqrt(x) and 1 / y
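
To make the list above concrete, here is a small sketch that tries each transformation on invented data. The vectors are hypothetical and deliberately constructed so that `y` is the reciprocal of `x`, which means the reciprocal transformation is the one that linearizes the relationship. With real data, the best choice is usually found by plotting and comparing.

```r
# Hypothetical data: y falls off as the reciprocal of x
x <- 1:50
y <- 1 / x

# Correlation on the raw scale understates the relationship
r_raw <- cor(x, y)

# Correlation after each candidate transformation of x
r_log   <- cor(log(x), y)
r_sqrt  <- cor(sqrt(x), y)
r_recip <- cor(1 / x, y)

print(c(raw = r_raw, log = r_log, sqrt = r_sqrt, reciprocal = r_recip))
```

Because `y` was built as `1 / x`, the reciprocal transformation gives a correlation of 1 here. The other values depend on how well each transformation happens to straighten the curve.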

Why use a transformation?

  • Certain statistical methods rely on variables having a linear relationship
    • Correlation coefficient
    • Linear regression
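
The following sketch illustrates the linear regression case under simple assumptions. The data are an invented, noise-free power law (`y = 2 * x^1.5`), so taking logs of both variables turns the curve into an exact straight line that `lm()` can describe.

```r
# Hypothetical power-law data: y = 2 * x^1.5 (no noise, for clarity)
x <- 1:100
y <- 2 * x^1.5

# After taking logs of both variables the relationship is linear:
# log(y) = log(2) + 1.5 * log(x)
fit_log <- lm(log(y) ~ log(x))

# The intercept estimates log(2) and the slope estimates the exponent 1.5
print(coef(fit_log))

# Correlation before and after the transformation
print(c(raw = cor(x, y), log_log = cor(log(x), log(y))))
```

The fitted slope recovers the exponent only because the data were generated that way; with noisy real data the estimates would be approximate.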

Correlation does not imply causation

"x is correlated with y" does not mean "x causes y"

Scatterplot of per capita margarine consumption in the US vs divorce rate in Maine. The variables are highly correlated with a correlation coefficient of 0.99

Confounding

  Coffee drinking (x) pointing to lung cancer (y)

Confounding

  Coffee drinking (x) pointing to lung cancer (y) with smoking (confounder) above

Confounding

  Coffee drinking (x) pointing to lung cancer (y) with smoking (confounder). Double arrow between smoking and coffee drinking, labeled "association".

Confounding

  Coffee drinking (x) pointing to lung cancer (y) with smoking (confounder). Double arrow between smoking and coffee drinking, labeled "association". Arrow from smoking to lung cancer labeled "causation"

Confounding

  Coffee drinking (x) with double arrow to lung cancer (y) labeled "association". Double arrow between smoking and coffee drinking, labeled "association". Arrow from smoking to lung cancer labeled "causation".

  Holidays (x) points to retail sales (y). Special deals (confounder) has double arrow to holidays and single arrow to retail sales.
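
The coffee example can be imitated with a small simulation. This is only a sketch: every number below (the share of smokers, the means, the continuous `risk` score standing in for lung cancer) is a made-up assumption chosen to show the mechanism. Coffee has no effect on risk in the simulated data, yet the two are correlated because smoking influences both.

```r
set.seed(42)
n <- 10000

# Hypothetical confounder: about 30% of people smoke
smoker <- rbinom(n, size = 1, prob = 0.3)

# Smokers tend to drink more coffee (association with the confounder)
coffee <- rnorm(n, mean = 2 + 2 * smoker, sd = 1)

# Smoking raises a made-up risk score; coffee plays no role at all
risk <- rnorm(n, mean = 1 + 3 * smoker, sd = 1)

# Coffee and risk are correlated across the whole sample...
r_overall <- cor(coffee, risk)

# ...but not once smoking status is held fixed
r_smokers     <- cor(coffee[smoker == 1], risk[smoker == 1])
r_non_smokers <- cor(coffee[smoker == 0], risk[smoker == 0])

print(c(overall = r_overall, smokers = r_smokers, non_smokers = r_non_smokers))
```

Comparing the overall correlation with the within-group correlations is one simple way to check whether a suspected confounder accounts for an association.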

Let's practice!
