Data Manipulation in Julia
Katerina Zahradova
Instructor
# Calculate mean minimum wage
combine(wages, :effective_min_wage_2020_dollars => mean)
1x1 DataFrame
Row | effective_min_wage_2020_dollars_mean
| Float64
____|_______________________
1| 8.37093
# Filter and calculate mean
first(combine(filter(r -> r.region =="W", wages), :effective_min_wage_2020_dollars => mean))
first(combine(filter(r -> r.region =="S", wages), :effective_min_wage_2020_dollars => mean))
first(combine(filter(r -> r.region =="NE", wages), :effective_min_wage_2020_dollars => mean))
DataFrameRow (1 columns)
Row | effective_min_wage_2020_dollars_mean
| Float64
____|_______________________
1| 8.75413
...
# Group by region wages_by_region = groupby(wages, :region)
# Compute average per group combine(wages_by_region, :effective_min_wage_2020_dollars => mean)
4x2 DataFrame
Row region effective_min_wage_2020_dollars_mean
String Float64
__________________________
1 S 8.15458
2 W 8.59119
3 NE 8.75413
4 MW 8.1514
# Combine them together
combine(groupby(wages, :region), :effective_min_wage_2020_dollars => mean)
4x2 DataFrame
Row region effective_min_wage_2020_dollars_mean
String Float64
__________________________
1 S 8.15458
2 W 8.59119
3 NE 8.75413
4 MW 8.1514
# Rename the column
combine(groupby(wages, :region),
:effective_min_wage_2020_dollars => mean => :average_min_wage_2020_dollars)
4x2 DataFrame
Row region average_min_wage_2020_dollars
String Float64
__________________________
1 S 8.15458
2 W 8.59119
3 NE 8.75413
4 MW 8.1514
# Use multiple functions on a single column
combine(groupby(wages, :region),
:effective_min_wage_2020_dollars .=> [mean, median, maximum])
4x4 DataFrame
Row region effective_min_wage_2020_dollars_mean ...
String Float64 ...
____________________________________________________________________________________
1 S 8.15458 ...
2 W 8.59119 ...
3 NE 8.75413 ...
4 MW 8.1514 ...
# Use multiple functions on a single column
combine(groupby(wages, :region), :effective_min_wage_2020_dollars .=> [mean, median] .=> [:average, :median])
4x2 DataFrame
Row region average median
String Float64 Float64
____________________________
1 S 8.15458 8.0
2 W 8.59119 8.34
...
# DON'T forget the dot!
combine(groupby(wages, :region), :effective_min_wage_2020_dollars => [mean, median])
ArgumentError: Unrecognized column selector ...
combine(groupby(wages, :region), [:state_min_wage, :federal_min_wage] .=> mean)
4x2 DataFrame
Row region state_min_wage_mean federal_min_wage_mean
String Float64 Float64
______________________________________________________
1 S 2.73128 4.35566
2 W 4.26638 4.35566
...
# DON'T forget the dot
combine(groupby(wages, :region), [:state_min_wage, :federal_min_wage] => mean)
MethodError: objects of type ...
# Functions as 1-row matrix
combine(groupby(wages, :region), [:state_min, :federal_min] .=> [mean median])
Row region state_min_mean federal_min_mean state_min_median federal_min_median
_________________________________________________________________________________
1 S 2.73128 4.35566 2.0 4.25
...
# Functions as a vector
combine(groupby(wages, :region), [:state_min, :federal_min] .=> [mean, median])
Row region state_min_mean federal_min_median
______________________________________________
1 S 2.73128 4.25
...
Functions that can be used:
sum()
, mean()
, minimum()
, ...ByRow()
nrow
, proprow
, ...# Grouped DataFrame gdf
# 1 column + 1 function
combine(gdf, :c => f => :new_c)
# 1 column + 2+ functions
combine(gdf, :c .=> [f1, f2, ...] .=> [:new_c_f1, :new_c_f2, ...])
# 2+ columns + 1 function
combine(gdf, [:c1, :c2, ...] .=> f .=> [:new_c1_f, :new_c2_f, ...])
# 2+ columns + 2+ functions - all combinations
combine(gdf, [:c1, :c2, ...] .=> [f1 f2 ...] .=> [:c1_f1, :c2_f1, ..., :c1_f2, ...])
# 2+ columns + 2+ functions - pairwise
combine(gdf, [:c1, :c2, ...] .=> [f1, f2, ...] .=> [:new_c1_f1, :new_c2_f2, ...])
Data Manipulation in Julia