Grouped summary statistics

Data Manipulation in Julia

Katerina Zahradova

Instructor

What we know now

# Calculate mean minimum wage
combine(wages, :effective_min_wage_2020_dollars => mean)
1x1 DataFrame
Row | effective_min_wage_2020_dollars_mean
    | Float64
____|_______________________
   1| 8.37093
Data Manipulation in Julia

What we know know

# Filter and calculate mean
first(combine(filter(r -> r.region =="W", wages), :effective_min_wage_2020_dollars => mean))
first(combine(filter(r -> r.region =="S", wages), :effective_min_wage_2020_dollars => mean))
first(combine(filter(r -> r.region =="NE", wages), :effective_min_wage_2020_dollars => mean))
DataFrameRow (1 columns)
Row | effective_min_wage_2020_dollars_mean
    | Float64
____|_______________________
   1| 8.75413
...
Data Manipulation in Julia

Using combine() and groupby()

# Group by region
wages_by_region = groupby(wages, :region)

# Compute average per group combine(wages_by_region, :effective_min_wage_2020_dollars => mean)
4x2 DataFrame
Row region    effective_min_wage_2020_dollars_mean
    String    Float64
__________________________
1    S        8.15458
2    W        8.59119
3    NE       8.75413
4    MW       8.1514
Data Manipulation in Julia

Combining combine() and groupby()

# Combine them together
combine(groupby(wages, :region), :effective_min_wage_2020_dollars => mean)
4x2 DataFrame
Row region    effective_min_wage_2020_dollars_mean
    String    Float64
__________________________
1    S        8.15458
2    W        8.59119
3    NE       8.75413
4    MW       8.1514
Data Manipulation in Julia

Using combine() and groupby()

# Rename the column
combine(groupby(wages, :region), 
        :effective_min_wage_2020_dollars => mean => :average_min_wage_2020_dollars)
4x2 DataFrame
Row region    average_min_wage_2020_dollars
    String    Float64
__________________________
1    S        8.15458
2    W        8.59119
3    NE       8.75413
4    MW       8.1514
Data Manipulation in Julia

Multiple functions on one column

# Use multiple functions on a single column
combine(groupby(wages, :region), 
                    :effective_min_wage_2020_dollars .=> [mean, median, maximum])
4x4 DataFrame
Row  region  effective_min_wage_2020_dollars_mean  ...
     String  Float64                               ...
____________________________________________________________________________________
1    S       8.15458                               ...
2    W       8.59119                               ...
3    NE      8.75413                               ...
4    MW      8.1514                                ...
Data Manipulation in Julia

Multiple functions on one column

# Use multiple functions on a single column
combine(groupby(wages, :region), :effective_min_wage_2020_dollars .=> [mean, median] .=> [:average, :median])
4x2 DataFrame
Row region  average  median
    String  Float64  Float64
____________________________
1    S      8.15458  8.0
2    W      8.59119  8.34
...
# DON'T forget the dot!
combine(groupby(wages, :region), :effective_min_wage_2020_dollars => [mean, median])
ArgumentError: Unrecognized column selector ...
Data Manipulation in Julia

Multiple columns with one function

combine(groupby(wages, :region), [:state_min_wage, :federal_min_wage] .=> mean)
4x2 DataFrame
Row region  state_min_wage_mean  federal_min_wage_mean
    String  Float64              Float64
______________________________________________________
1   S       2.73128              4.35566
2    W       4.26638              4.35566
...
# DON'T forget the dot
combine(groupby(wages, :region), [:state_min_wage, :federal_min_wage] => mean)
MethodError: objects of type ...
Data Manipulation in Julia

Multiple columns with multiple functions

# Functions as 1-row matrix
combine(groupby(wages, :region), [:state_min, :federal_min] .=> [mean median])
Row region  state_min_mean  federal_min_mean  state_min_median  federal_min_median
_________________________________________________________________________________
1   S       2.73128         4.35566           2.0                4.25
...
# Functions as a vector
combine(groupby(wages, :region), [:state_min, :federal_min] .=> [mean, median])
Row region  state_min_mean  federal_min_median 
______________________________________________
1   S       2.73128         4.25
...
Data Manipulation in Julia

Possible functions

Functions that can be used:

  • Usual statistics functions as sum(), mean(), minimum(), ...
  • User predefined functions (broadcasted)
  • Anonymous functions, wrapped in ByRow()
  • Special DataFrames function: nrow, proprow, ...
Data Manipulation in Julia

Cheat sheet

# Grouped DataFrame gdf
# 1 column + 1 function
combine(gdf, :c => f => :new_c)

# 1 column + 2+ functions
combine(gdf, :c .=> [f1, f2, ...] .=> [:new_c_f1, :new_c_f2, ...])

# 2+ columns + 1 function
combine(gdf, [:c1, :c2, ...] .=> f .=> [:new_c1_f, :new_c2_f, ...])

# 2+ columns + 2+ functions - all combinations
combine(gdf, [:c1, :c2, ...] .=> [f1 f2 ...] .=> [:c1_f1, :c2_f1, ..., :c1_f2, ...])

# 2+ columns + 2+ functions - pairwise
combine(gdf, [:c1, :c2, ...] .=> [f1, f2, ...] .=> [:new_c1_f1, :new_c2_f2, ...])
Data Manipulation in Julia

Let's practice!

Data Manipulation in Julia

Preparing Video For Download...