Correlation caveats

Introduction to Statistics in Python

Maggie Matsui

Content Developer, DataCamp

Non-linear relationships

scatterplot of variables with a quadratic relationship

$$r = 0.18$$

Non-linear relationships

What we see:

scatterplot of variables with a quadratic relationship with quadratic trendline

What the correlation coefficient sees:

scatterplot of variables with a quadratic relationship with a linear trendline

Correlation only accounts for linear relationships

Correlation shouldn't be used blindly

df['x'].corr(df['y'])
0.081094

Always visualize your data

scatterplot of variables with a quadratic relationship
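A minimal sketch of why a near-zero correlation does not rule out a relationship. The data here are made up for illustration (a perfect parabola on a symmetric range of x), not the data behind the plots above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y is fully determined by x, but through a quadratic curve.
x = np.linspace(-3, 3, 61)
df = pd.DataFrame({'x': x, 'y': x ** 2})

# Pearson correlation only measures the linear part of the relationship,
# so it comes out essentially zero here.
r_linear = df['x'].corr(df['y'])

# Correlating y with x squared recovers the relationship that is really there.
r_squared_term = (df['x'] ** 2).corr(df['y'])

print(f"x vs y: {r_linear:.3f}, x^2 vs y: {r_squared_term:.3f}")
```

The coefficient says "no linear relationship", which is true, while a scatterplot of the same data would show the curve immediately.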

Mammal sleep data

print(msleep)
                 name       genus   vore         order  ... sleep_cycle  awake  brainwt   bodywt
1             Cheetah    Acinonyx  carni     Carnivora  ...         NaN   11.9      NaN   50.000
2          Owl monkey       Aotus   omni      Primates  ...         NaN    7.0  0.01550    0.480
3     Mountain beaver  Aplodontia  herbi      Rodentia  ...         NaN    9.6      NaN    1.350
4 Greater short-ta...     Blarina   omni  Soricomorpha  ...    0.133333    9.1  0.00029    0.019
5                 Cow         Bos  herbi  Artiodactyla  ...    0.666667   20.0  0.42300  600.000
..                ...         ...    ...           ...  ...         ...    ...      ...      ...
79         Tree shrew      Tupaia   omni    Scandentia  ...    0.233333   15.1  0.00250    0.104
80 Bottle-nosed do...    Tursiops  carni       Cetacea  ...         NaN   18.8      NaN  173.330
81              Genet     Genetta  carni     Carnivora  ...         NaN   17.7  0.01750    2.000
82         Arctic fox      Vulpes  carni     Carnivora  ...         NaN   11.5  0.04450    3.380
83            Red fox      Vulpes  carni     Carnivora  ...    0.350000   14.2  0.05040    4.230

Body weight vs. awake time

Scatterplot of body weight vs awake time

msleep['bodywt'].corr(msleep['awake'])
0.3119801

Distribution of body weight

Histogram of bodywt variable

Log transformation

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

msleep['log_bodywt'] = np.log(msleep['bodywt'])

sns.lmplot(x='log_bodywt', y='awake', data=msleep, ci=None)
plt.show()
msleep['log_bodywt'].corr(msleep['awake'])
0.5687943

Scatterplot of log bodywt vs awake
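The same effect can be reproduced without the msleep file. This is a sketch on simulated data, assuming awake time is roughly linear in log body weight; the `animals` DataFrame, its coefficients, and the noise level are invented for illustration and are not the real mammal data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in for msleep (NOT the real data): body weights spread over
# several orders of magnitude, awake time roughly linear in log body weight.
log_wt = rng.uniform(-4, 8, size=200)
animals = pd.DataFrame({
    'bodywt': np.exp(log_wt),
    'awake': 10 + 0.8 * log_wt + rng.normal(0, 1, size=200),
})

# Correlation on the raw, highly skewed scale
r_raw = animals['bodywt'].corr(animals['awake'])

# Correlation after the log transformation
animals['log_bodywt'] = np.log(animals['bodywt'])
r_log = animals['log_bodywt'].corr(animals['awake'])

print(f"raw: {r_raw:.2f}, log: {r_log:.2f}")
```

A few very heavy animals dominate the raw scale, so the straight line fits poorly there; on the log scale the points spread out evenly and the correlation is higher.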

Other transformations

  • Log transformation (log(x))
  • Square root transformation (sqrt(x))
  • Reciprocal transformation (1 / x)

  • Combinations of these, e.g.:

    • log(x) and log(y)
    • sqrt(x) and 1 / y
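One way to choose among these is to try each transformation and compare the resulting correlations. The sketch below assumes made-up data in which y is exactly proportional to 1 / x; the `candidates` dictionary is only a convenience for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data where y is exactly proportional to 1 / x.
x = pd.Series(np.linspace(1, 20, 40))
y = 5 / x

# Candidate versions of x: untransformed, log, square root, reciprocal
candidates = {
    'x': x,
    'log(x)': np.log(x),
    'sqrt(x)': np.sqrt(x),
    '1 / x': 1 / x,
}

# Correlation of y with each version of x
correlations = {name: values.corr(y) for name, values in candidates.items()}
best = max(correlations, key=lambda name: abs(correlations[name]))

for name, r in correlations.items():
    print(f"{name:8s} r = {r:.3f}")
print("strongest linear relationship:", best)
```

With real data no transformation will fit perfectly, so the comparison is a guide to check against a scatterplot rather than a rule.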

Why use a transformation?

  • Certain statistical methods rely on variables having a linear relationship
    • Correlation coefficient
    • Linear regression

 

Introduction to Linear Modeling in Python
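For linear regression, the same point can be sketched with a straight-line fit. The example assumes made-up data that are linear in log(x), and uses `np.polyfit` with a hypothetical helper `r_squared`; it is an illustration, not the approach taught in the linear modeling course.

```python
import numpy as np

def r_squared(x, y):
    """Share of the variance in y explained by a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()

# Hypothetical data: y is linear in log(x), not in x itself.
x = np.exp(np.linspace(0, 6, 30))
y = 2 + 1.5 * np.log(x)

r2_raw = r_squared(x, y)          # straight line fitted on the raw scale
r2_log = r_squared(np.log(x), y)  # straight line fitted after transforming x

print(f"R^2 on x: {r2_raw:.3f}, R^2 on log(x): {r2_log:.3f}")
```

The line fitted on the raw scale leaves part of the variation unexplained, while the line fitted on log(x) explains essentially all of it.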

Correlation does not imply causation

"x is correlated with y" does not mean "x causes y"

Scatterplot of per capita margarine consumption in the US vs divorce rate in Maine. The variables are highly correlated with a correlation coefficient of 0.99

Confounding

  Coffee drinking (x) pointing to lung cancer (y)

Confounding

  Coffee drinking (x) pointing to lung cancer (y) with smoking (confounder) above

Confounding

  Coffee drinking (x) pointing to lung cancer (y) with smoking (confounder). Double arrow between smoking and coffee drinking, labeled "association".

Confounding

  Coffee drinking (x) pointing to lung cancer (y) with smoking (confounder). Double arrow between smoking and coffee drinking, labeled "association". Arrow from smoking to lung cancer labeled "causation"

Confounding

  Coffee drinking (x) with double arrow to lung cancer (y) labeled "association". Double arrow between smoking and coffee drinking, labeled "association". Arrow from smoking to lung cancer labeled "causation".

  Holidays (x) points to retail sales (y). Special deals (confounder) has double arrow to holidays and single arrow to retail sales.
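The coffee example can be sketched as a small simulation. Everything here is invented for illustration: the variables are unitless scores, smoking is made to drive both coffee drinking and lung cancer risk, and coffee is given no effect on cancer at all. The helper `remove_linear_effect` is hypothetical and is one simple way to adjust for a confounder, not the only one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000

# Simulated, made-up quantities: smoking is associated with coffee drinking
# and causes lung cancer risk; coffee itself has no effect on cancer risk.
smoking = rng.normal(size=n)
sim = pd.DataFrame({
    'smoking': smoking,
    'coffee': smoking + rng.normal(size=n),
    'cancer_risk': smoking + rng.normal(size=n),
})

# Coffee and cancer risk are correlated even though neither affects the other.
r_observed = sim['coffee'].corr(sim['cancer_risk'])

def remove_linear_effect(target, control):
    """Residuals of target after a straight-line fit on control."""
    slope, intercept = np.polyfit(control, target, deg=1)
    return target - (slope * control + intercept)

# Correlate what is left of each variable once smoking is accounted for.
coffee_resid = remove_linear_effect(sim['coffee'], sim['smoking'])
cancer_resid = remove_linear_effect(sim['cancer_risk'], sim['smoking'])
r_adjusted = coffee_resid.corr(cancer_resid)

print(f"observed: {r_observed:.2f}, adjusted for smoking: {r_adjusted:.2f}")
```

The observed correlation is clearly positive, but it largely disappears once smoking is accounted for, which is what the diagram describes: the association between coffee and lung cancer comes from the confounder.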

Let's practice!
