Scalable Data Processing in R
Simon Urbanek
Member of R-Core, Lead Inventive Scientist, AT&T Labs Research
# Create a random vector
x <- rnorm(100)
# Find the mean
mean(x)
-0.01996644
# Take the sum and length of each
# chunk of the vector
sl <- Map(function(v) c(sum(v), length(v)),
          list(x[1:25], x[26:100]))
# Add the sums and lengths
slr <- Reduce(`+`, sl)
# Find the mean
slr[1]/slr[2]
-0.01996644
Operations that require all the data at once can't be computed using the split-apply-combine approach.
Example: Median
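A minimal sketch of why the median does not fit this pattern. The vector and the chunk boundaries below are made up for illustration. No way of combining the per-chunk medians tried here recovers the median of the full vector.

```r
# Illustrative data, chosen so the chunks have very different values
x <- c(1, 2, 3, 4, 100, 200, 300)

# Split into two chunks and take the median of each
chunks <- list(x[1:2], x[3:7])
chunk_medians <- sapply(chunks, median)

# Median of the full vector
full_median <- median(x)                      # 4

# Attempts to combine the per-chunk medians
median_of_medians <- median(chunk_medians)    # 50.75
weighted_medians <- weighted.mean(chunk_medians, lengths(chunks))

full_median
median_of_medians
weighted_medians
```

Unlike the sum and the length, a chunk's median does not carry enough information about the chunk to be combined into the overall result.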
Many regression routines can be written in terms of split-apply-combine.
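As one possible illustration, ordinary least squares can be computed this way, because the cross-products X'X and X'y are sums over rows and so can be accumulated chunk by chunk. The simulated data, the seed, and the names `idx`, `parts`, and `total` are assumptions made for this sketch, not part of the original slides.

```r
# Simulated data (illustrative only)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
X <- cbind(1, x)  # design matrix with an intercept column

# Split: row indices for each chunk
idx <- list(1:25, 26:100)

# Apply: compute X'X and X'y on each chunk
parts <- Map(function(i) {
  Xi <- X[i, , drop = FALSE]
  list(xtx = crossprod(Xi), xty = crossprod(Xi, y[i]))
}, idx)

# Combine: add the per-chunk cross-products
total <- Reduce(function(a, b) {
  list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty)
}, parts)

# Solve the normal equations using the combined pieces
beta_chunked <- solve(total$xtx, total$xty)

# Compare with a fit on all the data at once
beta_full <- coef(lm(y ~ x))
all.equal(as.numeric(beta_chunked), as.numeric(beta_full))
```

Each chunk contributes only a small matrix and vector, so the full data never has to be in memory at the same time.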