chunk.apply

Scalable Data Processing in R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

chunk.apply()

  • Abstracts the looping process
  • Enables parallel execution
  • iotools is the basis of the hmr package, which lets you process data on Apache Hadoop infrastructure
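
The examples on the following slides read from a file called foo.csv. As a minimal setup sketch (the file name and its three-column numeric layout are assumptions taken from the examples), you could create such a file like this:

# Hypothetical setup: write a headerless, 3-column numeric CSV
# called foo.csv so the chunk.apply() examples below run as-is
set.seed(1)
foo <- matrix(rnorm(3e5), ncol = 3)
write.table(foo, "foo.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)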

mstrsplit() reads chunks as matrices

# Load the iotools package, which provides chunk.apply() and mstrsplit()
library(iotools)

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",

 # A function to process each chunk
 function(chunk) {
   # Turn the chunk into a matrix
   m <- mstrsplit(chunk, type = "numeric", sep = ",")
   # Return the column sums
   colSums(m)
 },
 # Maximum chunk size in bytes
 CH.MAX.SIZE = 1e5)

# Get the total sum
colSums(chunk_col_sums)
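
For intuition about what mstrsplit() does on its own, here is a minimal sketch on a made-up character vector, where each element is one delimited row:

# Split two comma-separated rows into a 2 x 3 numeric matrix
lines <- c("1,2,3", "4,5,6")
mstrsplit(lines, type = "numeric", sep = ",")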

dstrsplit() reads chunks as data frames

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",

 # A function to process each chunk
 function(chunk) {
   # Turn the chunk into a data frame
   d <- dstrsplit(chunk, col_types = rep("numeric", 3), sep = ",")
   # Return the column sums
   colSums(d)
 }, 
 # Maximum chunk size in bytes
 CH.MAX.SIZE = 1e5)

# Get the total sum
colSums(chunk_col_sums)
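
dstrsplit() can likewise be called directly on a character vector; a minimal sketch with made-up input, showing that the rows become a data frame with three numeric columns:

# Split two comma-separated rows into a 2-row, 3-column data frame
lines <- c("1,2,3", "4,5,6")
d <- dstrsplit(lines, col_types = rep("numeric", 3), sep = ",")
str(d)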

Parallelizing chunk.apply()

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",

 # A function to process each chunk
 function(chunk) {

   # Turn the chunk into a data frame
   d <- dstrsplit(chunk, col_types = rep("numeric", 3), sep = ",")
   colSums(d)
 }, 
 # 2 processors read and process data
 CH.PARALLEL = 2)

# Get the total sum
colSums(chunk_col_sums)

Note about parallelization

  • Increasing the number of processors won't always speed up your code
  • There are usually diminishing returns when you add processors on a single machine; the timing sketch below shows how to measure this on your own data
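
To check how much parallelism actually helps on your machine, you can time the same chunked computation at several CH.PARALLEL settings. A minimal sketch, reusing the hypothetical foo.csv from earlier (note that parallel chunk processing relies on forking, so this applies to Unix-like systems):

# Compute per-chunk column sums, as in the earlier examples
col_sum_chunk <- function(chunk) {
  colSums(dstrsplit(chunk, col_types = rep("numeric", 3), sep = ","))
}

# Time the computation with 1, 2, and 4 parallel readers;
# elapsed time often stops improving past a few processors
for (p in c(1, 2, 4)) {
  t <- system.time(chunk.apply("foo.csv", col_sum_chunk, CH.PARALLEL = p))
  print(c(processors = p, elapsed = t[["elapsed"]]))
}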

Let's practice!
