Scalable Data Processing in R
Simon Urbanek
Member of R-Core, Lead Inventive Scientist, AT&T Labs Research
iotools is the basis of hmr, which allows you to process data on the Apache Hadoop infrastructure.
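As an illustration only, the column-sum example below might be run on Hadoop roughly as follows; the hmr()/hinput() calls and the HDFS path are sketched from the hmr package's general map/reduce interface and should be checked against its documentation:

library(hmr)
# Unverified sketch: assumes a configured Hadoop cluster and a
# hypothetical HDFS path; depending on the formatter given to
# hinput(), the map function may receive an already-parsed chunk
chunk_col_sums <- hmr(hinput("/user/me/foo.csv"),
  map = function(chunk) {
    m <- mstrsplit(chunk, type = "numeric", sep = ",")
    colSums(m)
  })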
library(iotools)

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",
  # A function to process each of the chunks
  function(chunk) {
    # Turn the chunk into a matrix
    m <- mstrsplit(chunk, type = "numeric", sep = ",")
    # Return the column sums
    colSums(m)
  },
  # Maximum chunk size in bytes
  CH.MAX.SIZE = 1e5)

# Add up the per-chunk rows to get the overall column sums
colSums(chunk_col_sums)
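To try the examples locally, one way to create a foo.csv with the three numeric, comma-separated columns they assume is:

# Write a small sample file: three numeric columns,
# comma-separated, no header or row names
set.seed(42)
write.table(matrix(runif(3e4), ncol = 3), "foo.csv",
  sep = ",", row.names = FALSE, col.names = FALSE)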
dstrsplit works the same way but parses each chunk into a data frame, with a declared type for every column:

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",
  # A function to process each of the chunks
  function(chunk) {
    # Turn the chunk into a data frame
    d <- dstrsplit(chunk, col_types = rep("numeric", 3), sep = ",")
    # Return the column sums
    colSums(d)
  },
  # Maximum chunk size in bytes
  CH.MAX.SIZE = 1e5)

# Add up the per-chunk rows to get the overall column sums
colSums(chunk_col_sums)
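Unlike mstrsplit, which returns a matrix of a single type, dstrsplit can mix column types. A minimal self-contained illustration, with a raw vector standing in for a chunk read from a file:

# dstrsplit parses raw input into a data frame with mixed types
chunk <- charToRaw("1,a,2.5\n2,b,3.7\n")
d <- dstrsplit(chunk,
  col_types = c("integer", "character", "numeric"),
  sep = ",")
str(d)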
Setting CH.PARALLEL spreads the chunk processing over multiple processes (this option is only available on Unix-like systems):

# Use chunk.apply to get chunks of rows from foo.csv
chunk_col_sums <- chunk.apply("foo.csv",
  # A function to process each of the chunks
  function(chunk) {
    # Turn the chunk into a data frame
    d <- dstrsplit(chunk, col_types = rep("numeric", 3), sep = ",")
    colSums(d)
  },
  # 2 parallel processes read and process the data
  CH.PARALLEL = 2)

# Add up the per-chunk rows to get the overall column sums
colSums(chunk_col_sums)
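When foo.csv is small enough to read in one pass, the chunked result can be sanity-checked against an in-memory computation:

# Compare the chunked column sums with a single in-memory read
stopifnot(all.equal(
  unname(colSums(chunk_col_sums)),
  unname(colSums(read.csv("foo.csv", header = FALSE)))))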