Introduction

Scalable Data Processing in R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

bigmemory

  • All data must be stored on a single disk
  • Data must be represented as a matrix

iotools

  • Data can contain multiple types - e.g., data frames
  • Stored across multiple machines
  • Processes data in "chunks"

Process one chunk at a time sequentially

  • Limits resource usage by controlling chunk size
  • Allows results to be carried over
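A minimal base-R sketch of this idea (the vector and chunk size are illustrative): only a running sum and count are carried from one chunk to the next, so memory use is bounded by the chunk size rather than the data size.

```r
# Simulate a large vector processed in fixed-size chunks
set.seed(1)
x <- rnorm(1e5)
chunk_size <- 1e4

# State carried over from one chunk to the next
total <- 0
count <- 0

for (start in seq(1, length(x), by = chunk_size)) {
  chunk <- x[start:min(start + chunk_size - 1, length(x))]
  # Only the running sum and count are kept between iterations
  total <- total + sum(chunk)
  count <- count + length(chunk)
}

total / count   # same value as mean(x)
```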

Process each chunk independently

  • Corresponds to split-compute-combine
  • No information can be shared between chunks
  • Allows parallel and distributed processing
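A sketch of the same mean computation with independent chunks, using the parallel package from base R (the chunk size and mc.cores = 2 are illustrative; mclapply runs serially on Windows). Because no information is shared between chunks, each one can be processed on a separate core or machine.

```r
library(parallel)

set.seed(1)
x <- rnorm(1e5)

# Split: break the vector into independent chunks
chunks <- split(x, ceiling(seq_along(x) / 1e4))

# Compute: each chunk is processed with no shared state
sl <- mclapply(chunks, function(v) c(sum(v), length(v)),
               mc.cores = 2)

# Combine: add up the per-chunk sums and lengths
slr <- Reduce(`+`, sl)
slr[1] / slr[2]   # same value as mean(x)
```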

Mapping and Reducing for More Complex Operations

# Create a random vector
x <- rnorm(100)

# Find the mean
mean(x)
[1] -0.01996644

# Take the sum and length of
# each chunk of the vector
sl <- Map(function(v) c(sum(v), length(v)),
          list(x[1:25], x[26:100]))

# Add the sums and lengths
slr <- Reduce(`+`, sl)

# Find the mean
slr[1] / slr[2]
[1] -0.01996644

Not all things fit into Split-Apply-Combine

Operations that require all the data at once can't be computed using the split-apply-combine approach.

Example: Median
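A small demonstration of why the median resists this approach: the median of per-chunk medians is generally not the overall median, because each chunk's median discards the ranking information needed to combine results.

```r
x <- c(0, 0, 0, 0, 10, 20, 30)

# The true median needs all the data at once
median(x)                        # 0

# Per-chunk medians cannot be combined correctly
chunk_medians <- c(median(x[1:4]), median(x[5:7]))
median(chunk_medians)            # 10, not 0
```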


However ...

Many regression routines can be written in terms of split-apply-combine
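For example, ordinary least squares only needs the cross-products X'X and X'y, which are sums over rows and can therefore be accumulated chunk by chunk. A sketch on simulated data (the data, chunk size, and variable names are illustrative):

```r
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n))          # intercept plus one predictor
y <- 2 + 3 * X[, 2] + rnorm(n)

# Split the rows into chunks; compute X'X and X'y per chunk
chunks <- split(seq_len(n), ceiling(seq_len(n) / 250))
xp <- Map(function(i) list(xtx = crossprod(X[i, , drop = FALSE]),
                           xty = crossprod(X[i, , drop = FALSE], y[i])),
          chunks)

# Combine: cross-products are additive across chunks
xtx <- Reduce(`+`, lapply(xp, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(xp, `[[`, "xty"))

# Solve the normal equations; matches coef(lm(y ~ X[, 2]))
solve(xtx, xty)
```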


Let's practice!

