Scalable Data Processing in R
Michael Kane
Assistant Professor, Yale University
split()
Map()
Reduce()
The split() function partitions data
split() takes the data.frame to split and a factor or integer whose values define the partitions.
# Get the rows corresponding to each of the years in the mortgage data
year_splits <- split(1:nrow(mort), mort[,"year"])
# year_splits is a list
class(year_splits)
"list"
# The years that we've split over
names(year_splits)
"2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"
# The first few rows corresponding to the year 2010
year_splits[["2010"]][1:10]
1 6 7 10 21 23 24 27 29 38
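The same index-splitting pattern can be tried on a small in-memory table. This is a minimal sketch, assuming a made-up data.frame named toy that stands in for the mortgage data; its columns and values are hypothetical and are not taken from mort.

```r
# Hypothetical stand-in for the mortgage data (values are made up)
toy <- data.frame(
  year   = c(2008, 2009, 2008, 2010, 2009, 2008),
  amount = c(10, 20, 30, 40, 50, 60)
)

# Split the row numbers by year, as on the slide
idx_by_year <- split(seq_len(nrow(toy)), toy$year)

# The list names are the distinct years, as character strings
names(idx_by_year)

# Row numbers that belong to 2008
idx_by_year[["2008"]]

# split() also accepts the data.frame itself and returns one
# smaller data.frame per year
df_by_year <- split(toy, toy$year)
nrow(df_by_year[["2009"]])
```

Splitting row numbers rather than the data itself keeps each partition small, which is likely why the slide splits 1:nrow(mort) instead of mort.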
The Map() function processes the partitions
# Count, for each column, the entries equal to 9 (the value counted as missing here)
col_missing_count <- function(df) {
  apply(df, 2, function(x) sum(x == 9))
}
# For each of the years count the number of missing values for
# all columns
missing_by_year <- Map(
  function(x) col_missing_count(mort[x, ]),
  year_splits)
missing_by_year[["2008"]]
enterprise record_number msa
         0            12   0
# ...
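The Map() step can be reproduced end to end on toy data. The sketch below assumes a hypothetical data.frame toy and a hypothetical helper count_nines(); neither comes from the slides. It uses vapply() with na.rm = TRUE instead of apply(), a swapped-in variant that keeps real NA values from turning a count into NA.

```r
# Hypothetical data: 9 plays the role of the "missing" code
toy <- data.frame(
  year = c(2008, 2008, 2009, 2009, 2009),
  a    = c(1, 9, 9, 9, 2),
  b    = c(9, 3, 4, NA, 9)
)

# Hypothetical helper: count the 9s in every column of a data.frame.
# na.rm = TRUE skips genuine NA values instead of returning NA.
count_nines <- function(df) {
  vapply(df, function(x) sum(x == 9, na.rm = TRUE), integer(1))
}

# Partition the row numbers by year
splits <- split(seq_len(nrow(toy)), toy$year)

# Apply the helper to each partition; the result is a list
# that keeps the names of splits
counts <- Map(
  function(rows) count_nines(toy[rows, c("a", "b")]),
  splits
)

counts[["2008"]]
counts[["2009"]]
```

Because Map() returns a list named after the partitions, results can be looked up by year just as on the slide.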
The Reduce() function combines the results for all partitions
# Calculate the total missing values by column
Reduce(`+`, missing_by_year)
enterprise record_number msa
         0            64   0
# ...
# Stack the per-year counts into a matrix, then label the rows with the year
mby <- Reduce(rbind, missing_by_year)
row.names(mby) <- names(year_splits)
mby[1:3, 1:3]
     enterprise record_number msa
2008          0            12   0
2009          0             8   0
2010          0            10   0
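Both uses of Reduce() can be checked on a small list of per-year counts. This is a sketch under stated assumptions: the list counts and its values are hypothetical, and the do.call(rbind, ...) line is an alternative shown for comparison, not something the slides use.

```r
# Hypothetical per-year counts, shaped like the output of the Map() step
counts <- list(
  "2008" = c(a = 1L, b = 1L),
  "2009" = c(a = 2L, b = 1L),
  "2010" = c(a = 0L, b = 3L)
)

# Element-wise sum across all years: one total per column
totals <- Reduce(`+`, counts)

# Stack the per-year vectors into a matrix, one row per year.
# Reduce() does not pass the list names on to rbind(), so the
# rows are labeled afterwards.
by_year <- Reduce(rbind, counts)
rownames(by_year) <- names(counts)

# Alternative: do.call() passes the list names to rbind() as
# argument names, so the rows come out labeled by year
by_year_alt <- do.call(rbind, counts)

totals
by_year
```

Reduce() folds the list pairwise from left to right, so it works with any two-argument combiner such as `+` or rbind.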