Split-Apply-Combine

Scalable Data Processing in R

Michael Kane

Assistant Professor, Yale University

Split-Apply-Combine

Split: split()
Apply: Map()
Combine: Reduce()

Partition using split()

The split() function partitions data

First argument is a vector or data.frame to split
Second argument is a factor or integer whose values define the partitions

# Get the rows corresponding to each of the years in the mortgage data
year_splits <- split(seq_len(nrow(mort)), mort[, "year"])
# year_splits is a list
class(year_splits)

"list"

# The years that we've split over
names(year_splits)

"2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"

# The first few rows corresponding to the year 2010
year_splits[["2010"]][1:10]

1  6  7 10 21 23 24 27 29 38
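
To see the same partitioning on something small enough to check by hand, here is a minimal sketch. The data frame toy and its columns are hypothetical stand-ins for the mortgage data, not part of the course material.

```r
# Hypothetical toy data standing in for the mortgage data
toy <- data.frame(
  year   = c(2008, 2009, 2008, 2010, 2009, 2008),
  amount = c(10, 20, 30, 40, 50, 60)
)

# Partition the row numbers by the values in the year column
toy_splits <- split(seq_len(nrow(toy)), toy$year)

# The partition labels are stored as character names
names(toy_splits)

# Row numbers that belong to the year 2008
toy_splits[["2008"]]
```

Splitting the row numbers rather than the rows themselves keeps each partition small; the rows are only pulled out later, when a partition is processed.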

Compute using Map()

The Map() function processes the partitions

First argument is the function to apply to each partition
Second argument is the list of partitions

# Count the entries equal to 9, the code treated as missing here, in each column
col_missing_count <- function(mort) {
  apply(mort, 2, function(x) sum(x == 9))
}

# For each year, count the missing values in every column
missing_by_year <- Map(
  function(x) col_missing_count(mort[x, ]),
  year_splits
)

missing_by_year[["2008"]]

enterprise         record_number                   msa 
        0                    12                     0 
# ...
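
The apply step can be tried on a small, self-contained example. This is a sketch under stated assumptions: the data frame toy, its columns a and b, and the helper count_nines are hypothetical, and colSums() on a logical comparison is swapped in for the apply() call used above as an equivalent way to count matches per column.

```r
# Hypothetical toy data; the value 9 plays the role of the missing-value code
toy <- data.frame(
  year = c(2008, 2008, 2009, 2009),
  a    = c(9, 1, 9, 9),
  b    = c(2, 3, 9, 4)
)

# Split: row numbers for each year
toy_splits <- split(seq_len(nrow(toy)), toy$year)

# Hypothetical helper: number of entries equal to 9 in each column
count_nines <- function(d) colSums(d == 9)

# Apply: run the helper on the rows of each partition
toy_missing <- Map(
  function(rows) count_nines(toy[rows, c("a", "b")]),
  toy_splits
)

# Per-column counts for 2009
toy_missing[["2009"]]
```

Map() carries the names of the partition list over to its result, which is why the counts can be looked up by year.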

Combine using Reduce()

The Reduce() function combines the results for all partitions

First argument is the function to combine with
Second argument is the list of per-partition results

# Calculate the total missing values by column
Reduce(`+`, missing_by_year)

enterprise         record_number                   msa 
         0                    64                     0
# ...

# Stack the per-year counts into a matrix and label its rows with the year
mby <- Reduce(rbind, missing_by_year)
row.names(mby) <- names(year_splits)
mby[1:3, 1:3]

     enterprise record_number msa
2008          0            12   0
2009          0             8   0
2010          0            10   0
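
Both ways of combining can be checked on a small hand-made list. The sketch below assumes a hypothetical toy_missing list shaped like missing_by_year, with one named vector of counts per year; the numbers are made up for illustration.

```r
# Hypothetical per-year results, shaped like missing_by_year
toy_missing <- list(
  "2008" = c(a = 1, b = 0),
  "2009" = c(a = 2, b = 1),
  "2010" = c(a = 0, b = 3)
)

# Combine by adding the vectors element-wise: totals per column
toy_total <- Reduce(`+`, toy_missing)

# Combine by stacking the vectors: one row per year
toy_table <- Reduce(rbind, toy_missing)
rownames(toy_table) <- names(toy_missing)

toy_total
toy_table
```

Reduce() applies the combining function pairwise from left to right, so `+` accumulates a running total while rbind grows the matrix one row at a time.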

Let's practice!
