More Pandas

Python for R Users

Daniel Chen

Instructor

Missing data

  • NaN missing values come from numpy
  • np.nan plays the role of R's NA; older NumPy releases also expose the aliases np.NaN and np.NAN
  • Check missing with pd.isnull
    • Check non-missing with pd.notnull
    • pd.isnull is an alias for pd.isna
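
A minimal sketch of the checks listed above, assuming a small made-up Series (the values are illustrative, not from the course data). It also shows why a dedicated check is needed: NaN does not compare equal to itself.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value (made-up data)
values = pd.Series([1.0, np.nan, 3.0])

missing_mask = pd.isnull(values)    # True where a value is missing
present_mask = pd.notnull(values)   # True where a value is present

# NaN never compares equal to itself, so == cannot be used to find it
nan_equals_itself = (np.nan == np.nan)

print(missing_mask.tolist())
print(present_mask.tolist())
print(nan_equals_itself)
```

The two masks are element-wise opposites, much like is.na(x) and !is.na(x) in R.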

Working with missing data

df
           name  treatment_a  treatment_b
0    John Smith          NaN            2
1      Jane Doe         16.0           11
2  Mary Johnson          3.0            1
a_mean = df['treatment_a'].mean()
a_mean
9.5
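
The mean above comes out as 9.5 rather than NaN because pandas skips missing values by default, whereas R's mean() returns NA unless na.rm = TRUE is passed. A small sketch, assuming the data frame is rebuilt by hand from the table shown on the slide:

```python
import numpy as np
import pandas as pd

# Rebuild the slide's data frame by hand (construction is an assumption)
df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatment_a': [np.nan, 16, 3],
    'treatment_b': [2, 11, 1],
})

# Default behaviour: missing values are skipped (skipna=True)
a_mean = df['treatment_a'].mean()

# Closer to R's default: any missing value makes the result NaN
a_mean_strict = df['treatment_a'].mean(skipna=False)

print(a_mean)         # 9.5
print(a_mean_strict)  # nan
```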

Fillna

df['a_fill'] = df['treatment_a'].fillna(a_mean)
df
           name  treatment_a  treatment_b  a_fill
0    John Smith          NaN            2     9.5
1      Jane Doe         16.0           11    16.0
2  Mary Johnson          3.0            1     3.0

More Pandas

  • Applying custom functions
  • Groupby operations
  • Tidying data

Apply your own functions

  • Built-in functions
  • Custom functions
  • apply method
  • Pass in an axis
R
df = data.frame('a' = c(1, 2, 3),
                'b' = c(4, 5, 6))
apply(df, 2, mean)
a b 
2 5 
apply(df, 1, mean)
2.5 3.5 4.5
Python
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
                   'B':[4, 5, 6]})
df.apply(np.mean, axis=0)
A    2.0
B    5.0
dtype: float64
df.apply(np.mean, axis=1)
0    2.5
1    3.5
2    4.5
dtype: float64
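
The same apply call accepts a custom function in place of np.mean. The sketch below assumes a hypothetical helper, value_range, written only for illustration; it is not part of pandas or of the course material.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

def value_range(x):
    """Hypothetical helper: largest minus smallest value in a Series."""
    return x.max() - x.min()

# axis=0 passes each column to the function (like apply(df, 2, f) in R)
col_ranges = df.apply(value_range, axis=0)

# axis=1 passes each row to the function (like apply(df, 1, f) in R)
row_ranges = df.apply(value_range, axis=1)

print(col_ranges)
print(row_ranges)
```

With this data, each column spans 2 and each row spans 3.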

Tidy

  • Reshaping and tidying our data
  • Hadley Wickham, Tidy Data Paper
    • Each row is an observation
    • Each column is a variable
    • Each type of observational unit forms a table

Tidy Data Paper: http://vita.had.co.nz/papers/tidy-data.pdf


Tidy melt

df
           name  treatment_a  treatment_b
0    John Smith          NaN            2
1      Jane Doe         16.0           11
2  Mary Johnson          3.0            1
df_melt = pd.melt(df, id_vars='name')
df_melt
           name     variable  value
0    John Smith  treatment_a    NaN
1      Jane Doe  treatment_a   16.0
2  Mary Johnson  treatment_a    3.0
3    John Smith  treatment_b    2.0
...
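
By default pd.melt names the new columns variable and value. As a hedged sketch, the var_name and value_name parameters can give them more descriptive names; the names treatment and result below are illustrative choices, not the ones used on the slides.

```python
import numpy as np
import pandas as pd

# Rebuild the slide's data frame by hand (construction is an assumption)
df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatment_a': [np.nan, 16, 3],
    'treatment_b': [2, 11, 1],
})

# Keep 'name' as the identifier; stack every other column into two columns
df_long = pd.melt(df, id_vars='name',
                  var_name='treatment',   # illustrative column name
                  value_name='result')    # illustrative column name

print(df_long.columns.tolist())
print(df_long.shape)
```

Three people times two treatment columns gives six rows in the long format.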

Tidy pivot_table

df_melt_pivot = pd.pivot_table(df_melt,
                               index='name',
                               columns='variable',
                               values='value')
df_melt_pivot
variable      treatment_a  treatment_b
name                                  
Jane Doe             16.0         11.0
John Smith            NaN          2.0
Mary Johnson          3.0          1.0

Reset index

df_melt_pivot.reset_index()
variable          name  treatment_a  treatment_b
0             Jane Doe         16.0         11.0
1           John Smith          NaN          2.0
2         Mary Johnson          3.0          1.0
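
After reset_index, the label variable still appears above the row numbers because it is the name of the columns index. One possible way to go from long back to a plain wide table, sketched under the assumption that the data frame is rebuilt by hand, is to clear that name as a last step.

```python
import numpy as np
import pandas as pd

# Rebuild the slide's data frame by hand (construction is an assumption)
df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatment_a': [np.nan, 16, 3],
    'treatment_b': [2, 11, 1],
})

df_melt = pd.melt(df, id_vars='name')

# Long to wide, then move 'name' from the index back into a column
df_wide = pd.pivot_table(df_melt,
                         index='name',
                         columns='variable',
                         values='value').reset_index()

# Drop the leftover 'variable' label on the columns index
df_wide.columns.name = None

print(df_wide)
```

Note that the rows come back sorted by name, so the order differs from the original data frame.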

Groupby

  • groupby: split-apply-combine
  • split data into separate partitions
  • apply a function on each partition
  • combine the results
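
The three steps above can be sketched by hand and compared with the one-line groupby call. This is an illustration of the idea, not how pandas implements it internally; the long-format data frame is typed in by hand here.

```python
import numpy as np
import pandas as pd

# Long-format data typed in by hand (construction is an assumption)
df_melt = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'] * 2,
    'variable': ['treatment_a'] * 3 + ['treatment_b'] * 3,
    'value': [np.nan, 16, 3, 2, 11, 1],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs
pieces = {name: part for name, part in df_melt.groupby('name')}

# Apply: compute the mean of 'value' within each partition
means = {name: part['value'].mean() for name, part in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(means, name='value')

# The one-liner performs all three steps
auto = df_melt.groupby('name')['value'].mean()

print(manual)
print(auto)
```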

Performing a groupby

           name     variable  value
0    John Smith  treatment_a    NaN
1      Jane Doe  treatment_a   16.0
2  Mary Johnson  treatment_a    3.0
3    John Smith  treatment_b    2.0
4      Jane Doe  treatment_b   11.0
5  Mary Johnson  treatment_b    1.0
df_melt.groupby('name')['value'].mean()
name
Jane Doe        13.5
John Smith       2.0
Mary Johnson     2.0
Name: value, dtype: float64
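
John Smith's mean of 2.0 is based on a single non-missing value, which the mean alone does not reveal. As a sketch, agg with a list of function names can report the count of non-missing values next to the mean; the choice of summaries here is illustrative.

```python
import numpy as np
import pandas as pd

# Long-format data typed in by hand (construction is an assumption)
df_melt = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'] * 2,
    'variable': ['treatment_a'] * 3 + ['treatment_b'] * 3,
    'value': [np.nan, 16, 3, 2, 11, 1],
})

# Several summaries per group; 'count' ignores missing values
summary = df_melt.groupby('name')['value'].agg(['mean', 'count', 'max'])

print(summary)
```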

Let's practice!
