Split-Apply-Combine

Scalable Data Processing in R

Michael Kane

Assistant Professor, Yale University

Split-Apply-Combine

  • Split: split()
  • Apply: Map()
  • Combine: Reduce()
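
As a rough end-to-end sketch of the three steps, the snippet below runs them on a small made-up data frame. The `toy` data, its column names, and the use of 9 as a missing-value code are assumptions for illustration, not the mortgage data used in the course.

```r
# Hypothetical toy data: 9 stands in for a missing value
toy <- data.frame(
  year  = c(2008, 2008, 2009, 2009, 2009),
  score = c(1, 9, 9, 9, 3),
  rate  = c(9, 2, 4, 5, 9)
)

# Split: collect the row indices belonging to each year
parts <- split(seq_len(nrow(toy)), toy$year)

# Apply: count the 9s in each column, one partition at a time
counts <- Map(
  function(rows) colSums(toy[rows, c("score", "rate")] == 9),
  parts)

# Combine: add the per-year counts into overall totals
totals <- Reduce(`+`, counts)
totals   # score = 3, rate = 2
```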

Partition using split()

The split() function partitions data

  • First argument is a vector or data.frame to split
  • Second argument is a factor or integer whose values define the partitions
# Get the row indices corresponding to each year in the mortgage data
year_splits <- split(seq_len(nrow(mort)), mort[, "year"])
# year_splits is a list
class(year_splits)
"list"
# The years that we've split over
names(year_splits)
"2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"
# The first few rows corresponding to the year 2010
year_splits[["2010"]][1:10]
1  6  7 10 21 23 24 27 29 38
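
The slide splits row indices, but the first argument can also be the data frame itself. The sketch below shows both forms on a made-up data frame; `toy` and its columns are hypothetical.

```r
# Hypothetical toy data frame standing in for the mortgage data
toy <- data.frame(
  year   = c(2010, 2011, 2010, 2011),
  amount = c(10, 20, 30, 40)
)

# Splitting the data frame yields a list of smaller data frames
by_year <- split(toy, toy$year)
names(by_year)             # "2010" "2011"
by_year[["2010"]]$amount   # 10 30

# Splitting row indices yields a list of integer vectors instead
idx <- split(seq_len(nrow(toy)), toy$year)
idx[["2011"]]              # 2 4
```

Splitting the indices rather than the rows keeps only integer vectors in the list, which is one plausible reason to prefer it when the underlying data is large.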

Compute using Map()

The Map() function processes the partitions

  • First argument is the function to apply to each partition
  • Second argument is the partitions
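
A minimal sketch of the call shape, using a made-up list of partitions (`parts` and the weights are hypothetical). Map() always returns a list and keeps the names of the partitions.

```r
# Hypothetical partitions: a named list of numeric vectors
parts <- list(a = c(1, 2, 3), b = c(10, 20))

# Apply one function to every partition
sums <- Map(function(p) sum(p), parts)
class(sums)    # "list"
sums[["a"]]    # 6

# Further arguments are walked in parallel with the partitions
scaled <- Map(function(p, w) sum(p) * w, parts, c(2, 0.5))
scaled[["b"]]  # 15
```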

# Count the values equal to 9 (the code counted as missing here) in each column
col_missing_count <- function(df) {
  apply(df, 2, function(x) sum(x == 9))
}
# For each year, count the missing values in every column
missing_by_year <- Map(
  function(x) col_missing_count(mort[x, ]),
  year_splits)

missing_by_year[["2008"]]
enterprise         record_number                   msa 
        0                    12                     0 
# ...

Combine using Reduce()

The Reduce() function combines the results for all partitions

  • First argument is the function to combine with
  • Second argument is the list of per-partition results
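
Reduce() folds the list pairwise from left to right: the combining function is applied to the first two elements, then to that result and the third, and so on. A small sketch with made-up per-partition results (`per_part` is hypothetical):

```r
# Hypothetical per-partition results: one named vector per partition
per_part <- list(
  p1 = c(x = 1, y = 0),
  p2 = c(x = 2, y = 5),
  p3 = c(x = 0, y = 1)
)

# Element-wise sum across partitions: (p1 + p2) + p3
total <- Reduce(`+`, per_part)
total          # x = 3, y = 6

# Stack the partitions into a matrix with one row per partition
stacked <- Reduce(rbind, per_part)
dim(stacked)   # 3 2

# accumulate = TRUE keeps every intermediate result
running <- Reduce(`+`, per_part, accumulate = TRUE)
length(running)  # 3
```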
# Calculate the total missing values by column
Reduce(`+`, missing_by_year)
enterprise         record_number                   msa 
         0                    64                     0 
# ... 
# Stack the per-year counts into a matrix and label its rows with the year
mby <- Reduce(rbind, missing_by_year)
row.names(mby) <- names(year_splits)
mby[1:3, 1:3]
     enterprise record_number msa
2008          0            12   0
2009          0             8   0
2010          0            10   0
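
As an aside, and not the approach used on the slide, `do.call(rbind, ...)` on a named list carries the list names over as row names, so the separate labelling step is not needed. The list `missing_toy` below is a hypothetical stand-in for `missing_by_year`.

```r
# Hypothetical stand-in for missing_by_year: one named vector per year
missing_toy <- list(
  `2008` = c(enterprise = 0, record_number = 12),
  `2009` = c(enterprise = 0, record_number = 8)
)

# rbind() receives the list elements as named arguments,
# so the years become the row names of the result
mby_toy <- do.call(rbind, missing_toy)
rownames(mby_toy)   # "2008" "2009"
```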


Let's practice!
