chunk.apply

Scalable Data Processing in R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

chunk.apply()

  • Abstracts the looping process
  • Enables parallel execution
  • iotools is the basis of the hmr package, which lets you process data on Apache Hadoop infrastructure
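
The examples on the following slides read from a file called foo.csv. As a minimal setup sketch (the file name and its three-column numeric layout are assumptions taken from the examples), you could create such a file like this:

# Hypothetical setup: write a headerless, 3-column numeric CSV
# called foo.csv so the chunk.apply() examples below run as-is
set.seed(1)
foo <- matrix(rnorm(3e5), ncol = 3)
write.table(foo, "foo.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)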

mstrsplit() reads chunks as matrices

# Load the iotools package, which provides chunk.apply() and mstrsplit()
library(iotools)

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",

 # A function to process each chunk
 function(chunk) {
   # Turn the chunk into a matrix
   m <- mstrsplit(chunk, type = "numeric", sep = ",")
   # Return the column sums
   colSums(m)
 },
 # Maximum chunk size in bytes
 CH.MAX.SIZE = 1e5)

# Get the total sum
colSums(chunk_col_sums)
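
For intuition about what mstrsplit() does on its own, here is a minimal sketch on a made-up character vector, where each element is one delimited row:

# Split two comma-separated rows into a 2 x 3 numeric matrix
lines <- c("1,2,3", "4,5,6")
mstrsplit(lines, type = "numeric", sep = ",")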

dstrsplit() reads chunks as data frames

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",

 # A function to process each chunk
 function(chunk) {
   # Turn the chunk into a data frame
   d <- dstrsplit(chunk, col_types = rep("numeric", 3), sep = ",")
   # Return the column sums
   colSums(d)
 }, 
 # Maximum chunk size in bytes
 CH.MAX.SIZE = 1e5)

# Get the total sum
colSums(chunk_col_sums)
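
dstrsplit() can likewise be called directly on a character vector; a minimal sketch with made-up input, showing that the rows become a data frame with three numeric columns:

# Split two comma-separated rows into a 2-row, 3-column data frame
lines <- c("1,2,3", "4,5,6")
d <- dstrsplit(lines, col_types = rep("numeric", 3), sep = ",")
str(d)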

Parallelizing chunk.apply()

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",

 # A function to process each chunk
 function(chunk) {

   # Turn the chunk into a data frame
   d <- dstrsplit(chunk, col_types = rep("numeric", 3), sep = ",")
   colSums(d)
 }, 
 # 2 processors read and process data
 CH.PARALLEL = 2)

# Get the total sum
colSums(chunk_col_sums)

Note about parallelization

  • Increasing the number of processors won't always speed up your code
  • There are usually diminishing returns when you add processors on a single machine; the timing sketch below shows how to measure this on your own data
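
To check how much parallelism actually helps on your machine, you can time the same chunked computation at several CH.PARALLEL settings. A minimal sketch, reusing the hypothetical foo.csv from earlier (note that parallel chunk processing relies on forking, so this applies to Unix-like systems):

# Compute per-chunk column sums, as in the earlier examples
col_sum_chunk <- function(chunk) {
  colSums(dstrsplit(chunk, col_types = rep("numeric", 3), sep = ","))
}

# Time the computation with 1, 2, and 4 parallel readers;
# elapsed time often stops improving past a few processors
for (p in c(1, 2, 4)) {
  t <- system.time(chunk.apply("foo.csv", col_sum_chunk, CH.PARALLEL = p))
  print(c(processors = p, elapsed = t[["elapsed"]]))
}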

Let's practice!
