More Pandas

Python for R Users

Daniel Chen

Instructor

Missing data

NaN missing values come from numpy
np.nan plays the role of R's NA (older numpy releases also accept the aliases np.NaN and np.NAN, which numpy 2.0 removed)
Check missing with pd.isnull
- Check non-missing with pd.notnull
- pd.isnull is an alias for pd.isna
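
A minimal sketch of the checks listed above. The Series `s` is made-up illustration data, not part of the course material.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): one missing value in the middle.
s = pd.Series([1.0, np.nan, 3.0])

missing = pd.isnull(s)    # True where the value is missing
present = pd.notnull(s)   # True where the value is present

print(missing.tolist())   # [False, True, False]
print(present.tolist())   # [True, False, True]

# NaN never compares equal to itself, so test with isnull, not ==.
print(np.nan == np.nan)   # False
```

This is similar to R, where `NA == NA` does not return TRUE and `is.na()` is the usual test.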

Working with missing data

df

           name  treatment_a  treatment_b
0    John Smith          NaN            2
1      Jane Doe         16.0           11
2  Mary Johnson          3.0            1

a_mean = df['treatment_a'].mean()
a_mean

9.5

Fillna

df['a_fill'] = df['treatment_a'].fillna(a_mean)
df

           name  treatment_a  treatment_b  a_fill
0    John Smith          NaN            2     9.5
1      Jane Doe         16.0           11    16.0
2  Mary Johnson          3.0            1     3.0
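
The slides never show how `df` is built, so here is one way to reproduce the example end to end. The constructor call is an assumption; the values are copied from the printed table.

```python
import numpy as np
import pandas as pd

# Assumed construction of the df printed on the slides.
df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatment_a': [np.nan, 16, 3],
    'treatment_b': [2, 11, 1],
})

# mean() skips NaN by default (skipna=True), so this is (16 + 3) / 2.
a_mean = df['treatment_a'].mean()

# fillna returns a new Series; the original column keeps its NaN.
df['a_fill'] = df['treatment_a'].fillna(a_mean)

print(a_mean)                              # 9.5
print(df['a_fill'].tolist())               # [9.5, 16.0, 3.0]
print(df['treatment_a'].isnull().sum())    # 1
```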

More Pandas

Applying custom functions
Groupby operations
Tidying data

Apply your own functions

Built-in functions
Custom functions
apply method
Pass in an axis

R

df = data.frame('a' = c(1, 2, 3),
                'b' = c(4, 5, 6))
apply(df, 2, mean)

a b 
2 5

apply(df, 1, mean)

2.5 3.5 4.5

Python

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})
df.apply(np.mean, axis=0)

A    2.0
B    5.0
dtype: float64

df.apply(np.mean, axis=1)

0    2.5
1    3.5
2    4.5
dtype: float64
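
The examples above pass a built-in function (np.mean). A custom function works the same way. The helper `value_range` below is hypothetical and only here for illustration. Note that R's margin 2 (columns) corresponds to axis=0 in pandas, and margin 1 (rows) to axis=1.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

def value_range(x):
    """Hypothetical helper: largest value minus smallest value."""
    return x.max() - x.min()

# axis=0: the function receives each column as a Series.
col_ranges = df.apply(value_range, axis=0)

# axis=1: the function receives each row as a Series.
row_ranges = df.apply(value_range, axis=1)

print(col_ranges.tolist())   # [2, 2]
print(row_ranges.tolist())   # [3, 3, 3]
```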

Tidy

Reshaping and tidying our data
Hadley Wickham, Tidy Data Paper
- Each row is an observation
- Each column is a variable
- Each type of observational unit forms a table

Tidy Data Paper: http://vita.had.co.nz/papers/tidy-data.pdf

Tidy melt

df

           name  treatment_a  treatment_b
0    John Smith          NaN            2
1      Jane Doe         16.0           11
2  Mary Johnson          3.0            1

df_melt = pd.melt(df, id_vars='name')
df_melt

           name     variable  value
0    John Smith  treatment_a    NaN
1      Jane Doe  treatment_a   16.0
2  Mary Johnson  treatment_a    3.0
3    John Smith  treatment_b    2.0
...

Tidy pivot_table

df_melt_pivot = pd.pivot_table(df_melt,
                               index='name',
                               columns='variable',
                               values='value')
df_melt_pivot

variable      treatment_a  treatment_b
name                                  
Jane Doe             16.0         11.0
John Smith            NaN          2.0
Mary Johnson          3.0          1.0

Reset index

df_melt_pivot.reset_index()

variable          name  treatment_a  treatment_b
0             Jane Doe         16.0         11.0
1           John Smith          NaN          2.0
2         Mary Johnson          3.0          1.0

Groupby

groupby: split-apply-combine
split data into separate partitions
apply a function on each partition
combine the results
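
To make the three steps concrete, here is a sketch that performs them by hand on the melted treatment data and compares the result with a regular groupby call. The `df_melt` constructor is an assumption that reproduces the melted table from the earlier slides.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the melted treatment data.
df_melt = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'] * 2,
    'variable': ['treatment_a'] * 3 + ['treatment_b'] * 3,
    'value': [np.nan, 16, 3, 2, 11, 1],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: take the mean of 'value' within each partition (NaN is skipped).
partial = {key: part['value'].mean()
           for key, part in df_melt.groupby('name')}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(partial, name='value')

# The one-line groupby does all three steps at once.
direct = df_melt.groupby('name')['value'].mean()

print(manual.to_dict())
# {'Jane Doe': 13.5, 'John Smith': 2.0, 'Mary Johnson': 2.0}
print(manual.to_dict() == direct.to_dict())   # True
```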

Performing a groupby

           name     variable  value
0    John Smith  treatment_a    NaN
1      Jane Doe  treatment_a   16.0
2  Mary Johnson  treatment_a    3.0
3    John Smith  treatment_b    2.0
4      Jane Doe  treatment_b   11.0
5  Mary Johnson  treatment_b    1.0

df_melt.groupby('name')['value'].mean()

name
Jane Doe        13.5
John Smith       2.0
Mary Johnson     2.0
Name: value, dtype: float64
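
The same pattern works with a different grouping column and more than one summary. This sketch groups by variable instead of name and uses agg to request the mean and the count of non-missing values. The `df_melt` constructor is an assumption that reproduces the table above.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the melted treatment data.
df_melt = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'] * 2,
    'variable': ['treatment_a'] * 3 + ['treatment_b'] * 3,
    'value': [np.nan, 16, 3, 2, 11, 1],
})

# One row per treatment, one column per summary function.
stats = df_melt.groupby('variable')['value'].agg(['mean', 'count'])

print(stats.loc['treatment_a', 'mean'])    # 9.5
print(stats.loc['treatment_a', 'count'])   # 2 (the NaN is not counted)
print(stats.loc['treatment_b', 'count'])   # 3
```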

Let's practice!
