Introduction

Scalable Data Processing in R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

bigmemory

  • All data must be stored on a single disk
  • Data must be represented as a matrix

iotools

  • Data can contain multiple types - e.g., data frames
  • Stored across multiple machines
  • Processes data in "chunks"

Process one chunk at a time sequentially

  • Limits resource usage by controlling chunk size
  • Allows results to be carried over
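A minimal base-R sketch of this idea (the vector and chunk size are illustrative): only a running sum and count are carried from one chunk to the next, so memory use is bounded by the chunk size rather than the data size.

```r
# Simulate a large vector processed in fixed-size chunks
set.seed(1)
x <- rnorm(1e5)
chunk_size <- 1e4

# State carried over from one chunk to the next
total <- 0
count <- 0

for (start in seq(1, length(x), by = chunk_size)) {
  chunk <- x[start:min(start + chunk_size - 1, length(x))]
  # Only the running sum and count are kept between iterations
  total <- total + sum(chunk)
  count <- count + length(chunk)
}

total / count   # same value as mean(x)
```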

Process each chunk independently

  • Corresponds to split-compute-combine
  • No information can be shared between chunks
  • Allows parallel and distributed processing
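A sketch of the same mean computation with independent chunks, using the parallel package from base R (the chunk size and mc.cores = 2 are illustrative; mclapply runs serially on Windows). Because no information is shared between chunks, each one can be processed on a separate core or machine.

```r
library(parallel)

set.seed(1)
x <- rnorm(1e5)

# Split: break the vector into independent chunks
chunks <- split(x, ceiling(seq_along(x) / 1e4))

# Compute: each chunk is processed with no shared state
sl <- mclapply(chunks, function(v) c(sum(v), length(v)),
               mc.cores = 2)

# Combine: add up the per-chunk sums and lengths
slr <- Reduce(`+`, sl)
slr[1] / slr[2]   # same value as mean(x)
```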

Mapping and Reducing for More Complex Operations

# Create a random vector
x <- rnorm(100)

# Find the mean
mean(x)
[1] -0.01996644

# Take the sum and length of
# each chunk of the vector
sl <- Map(function(v) c(sum(v), length(v)),
          list(x[1:25], x[26:100]))

# Add the sums and lengths
slr <- Reduce(`+`, sl)

# Find the mean
slr[1] / slr[2]
[1] -0.01996644

Not all things fit into Split-Apply-Combine

Operations that require all the data at once can't be computed using the split-apply-combine approach.

Example: Median
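A small demonstration of why the median resists this approach: the median of per-chunk medians is generally not the overall median, because each chunk's median discards the ranking information needed to combine results.

```r
x <- c(0, 0, 0, 0, 10, 20, 30)

# The true median needs all the data at once
median(x)                        # 0

# Per-chunk medians cannot be combined correctly
chunk_medians <- c(median(x[1:4]), median(x[5:7]))
median(chunk_medians)            # 10, not 0
```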


However ...

Many regression routines can be written in terms of split-apply-combine
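For example, ordinary least squares only needs the cross-products X'X and X'y, which are sums over rows and can therefore be accumulated chunk by chunk. A sketch on simulated data (the data, chunk size, and variable names are illustrative):

```r
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n))          # intercept plus one predictor
y <- 2 + 3 * X[, 2] + rnorm(n)

# Split the rows into chunks; compute X'X and X'y per chunk
chunks <- split(seq_len(n), ceiling(seq_len(n) / 250))
xp <- Map(function(i) list(xtx = crossprod(X[i, , drop = FALSE]),
                           xty = crossprod(X[i, , drop = FALSE], y[i])),
          chunks)

# Combine: cross-products are additive across chunks
xtx <- Reduce(`+`, lapply(xp, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(xp, `[[`, "xty"))

# Solve the normal equations; matches coef(lm(y ~ X[, 2]))
solve(xtx, xty)
```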


Let's practice!

