Introduction to Statistics in Python
Maggie Matsui
Content Developer, DataCamp
$$r = 0.18$$
What we see:
What the correlation coefficient sees:
Correlation shouldn't be used blindly
df['x'].corr(df['y'])
0.081094
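A minimal sketch of why a near-zero coefficient can hide a strong relationship. The data here are hypothetical (not the `df` from the slide): `y` is fully determined by `x`, but through a curve rather than a line.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y depends on x exactly, but through a quadratic curve.
x = np.arange(-10, 11)  # values symmetric around zero
df = pd.DataFrame({"x": x, "y": x ** 2})

# Pearson correlation only measures the linear part of a relationship,
# so the symmetric curve yields a coefficient of (numerically) zero.
r = df["x"].corr(df["y"])
print(round(r, 6))
```

The coefficient alone would suggest no relationship here, which is why a scatter plot should come first.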
Always visualize your data
print(msleep)
                   name       genus   vore         order  ...  sleep_cycle  awake  brainwt   bodywt
 1              Cheetah    Acinonyx  carni     Carnivora  ...          NaN   11.9      NaN   50.000
 2           Owl monkey       Aotus   omni      Primates  ...          NaN    7.0  0.01550    0.480
 3      Mountain beaver  Aplodontia  herbi      Rodentia  ...          NaN    9.6      NaN    1.350
 4  Greater short-ta...     Blarina   omni  Soricomorpha  ...     0.133333    9.1  0.00029    0.019
 5                  Cow         Bos  herbi  Artiodactyla  ...     0.666667   20.0  0.42300  600.000
..                  ...         ...    ...           ...  ...          ...    ...      ...      ...
79           Tree shrew      Tupaia   omni    Scandentia  ...     0.233333   15.1  0.00250    0.104
80   Bottle-nosed do...    Tursiops  carni       Cetacea  ...          NaN   18.8      NaN  173.330
81                Genet     Genetta  carni     Carnivora  ...          NaN   17.7  0.01750    2.000
82           Arctic fox      Vulpes  carni     Carnivora  ...          NaN   11.5  0.04450    3.380
83              Red fox      Vulpes  carni     Carnivora  ...     0.350000   14.2  0.05040    4.230
msleep['bodywt'].corr(msleep['awake'])
0.3119801
msleep['log_bodywt'] = np.log(msleep['bodywt'])
sns.lmplot(x='log_bodywt', y='awake', data=msleep, ci=None)
plt.show()
msleep['log_bodywt'].corr(msleep['awake'])
0.5687943
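The same effect can be reproduced on synthetic data. This is only a sketch: the `toy` DataFrame, its column names, and the assumed relationship (`y` linear in the log of a highly skewed variable) are hypothetical stand-ins, not the `msleep` data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a skewed variable such as body weight:
# 50 values spanning several orders of magnitude.
toy = pd.DataFrame({"weight": np.logspace(-2, 3, 50)})

# Assumed relationship for illustration: y is linear in log(weight).
toy["y"] = 10 + 1.5 * np.log(toy["weight"])

# Correlation on the raw scale vs. after a log transformation.
r_raw = toy["weight"].corr(toy["y"])
toy["log_weight"] = np.log(toy["weight"])
r_log = toy["log_weight"].corr(toy["y"])
print(r_raw, r_log)
```

Because the relationship is linear only on the log scale, the coefficient is noticeably higher after the transformation than before it.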
Log transformation (log(x))
Square root transformation (sqrt(x))
Reciprocal transformation (1 / x)
Combinations of these, e.g.:
log(x) and log(y)
sqrt(x) and 1 / y
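One way to choose among these transformations is to try each and compare the resulting coefficients. The sketch below uses hypothetical data in which `y` is linear in `1 / x`; the names `candidates`, `results`, and `best` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y is linear in 1 / x, so the reciprocal should fit best.
x = pd.Series(np.linspace(1, 20, 40))
y = 3 + 2 / x

# Candidate transformations of x to compare.
candidates = {
    "x": x,
    "log(x)": np.log(x),
    "sqrt(x)": np.sqrt(x),
    "1 / x": 1 / x,
}

# Correlation of each transformed x with y.
results = pd.Series({name: t.corr(y) for name, t in candidates.items()})

# The transformation with the largest absolute correlation.
best = results.abs().idxmax()
print(results)
print(best)
```

The highest coefficient is a guide rather than a rule; a plot of the transformed data is still the better check.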
x is correlated with y does not mean x causes y
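A small simulation can illustrate one way this happens. It is a sketch under stated assumptions: a hypothetical third variable `z` drives both `x` and `y`, while neither `x` nor `y` influences the other. The `residual` helper is illustrative, not a library function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)      # hypothetical hidden common cause
x = z + rng.normal(size=n)  # x depends on z, not on y
y = z + rng.normal(size=n)  # y depends on z, not on x
sim = pd.DataFrame({"x": x, "y": y, "z": z})

# x and y are clearly correlated even though neither causes the other.
r_xy = sim["x"].corr(sim["y"])

def residual(col):
    """Return sim[col] with the linear effect of z removed."""
    slope, intercept = np.polyfit(sim["z"], sim[col], 1)
    return sim[col] - (slope * sim["z"] + intercept)

# Once z is accounted for, the correlation largely disappears.
r_after = residual("x").corr(residual("y"))
print(r_xy, r_after)
```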