Introduction

Scalable Data Processing in R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

bigmemory

  • All data must be stored on a single disk
  • Data must be represented as a matrix

iotools

  • Data can be of multiple types (e.g., data frames)
  • Stored across multiple machines
  • Processes data in "chunks"

Process one chunk at a time sequentially

  • Limits resource usage by controlling chunk size
  • Allows results to be carried over
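
The sequential approach can be sketched in base R as follows. This is a minimal illustration, not the course's own code: the in-memory `lines` vector stands in for a large file, and `chunk_size`, `running_sum`, and `running_n` are hypothetical names chosen for this example. Only one chunk is held in memory at a time, and the running totals are the results carried over between chunks.

```r
# Hypothetical data: ten lines of text, one number per line,
# standing in for a file too large to load at once.
lines <- as.character(1:10)
con <- textConnection(lines)

chunk_size <- 4    # illustrative; a smaller value uses less memory
running_sum <- 0   # state carried over from one chunk to the next
running_n <- 0

repeat {
  # Read at most chunk_size lines from the connection
  chunk <- readLines(con, n = chunk_size)
  if (length(chunk) == 0) break   # no data left
  values <- as.numeric(chunk)
  running_sum <- running_sum + sum(values)
  running_n <- running_n + length(values)
}
close(con)

# The mean of all values, computed without holding them all at once
running_mean <- running_sum / running_n
```

With a real file, `textConnection(lines)` could be replaced by `file(path, "r")`, where `path` is whatever file is being processed.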

Process each chunk independently

  • Corresponds to split-compute-combine
  • No information can be shared between chunks
  • Allows parallel and distributed processing
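
A small base R sketch of the independent approach, assuming the data fit in memory so the three steps are easy to see. The chunk count of 4 and the names `chunks`, `partial`, and `total` are illustrative choices, not from the slides.

```r
x <- 1:100

# Split: assign the elements to 4 chunks of 25 values each
chunks <- split(x, rep(1:4, each = 25))

# Compute: each chunk is summarised on its own, with no shared state
partial <- lapply(chunks, function(v) c(sum = sum(v), n = length(v)))

# Combine: add the per-chunk summaries together
total <- Reduce(`+`, partial)
chunk_mean <- total[["sum"]] / total[["n"]]
```

Because no chunk depends on another, the `lapply()` call could be swapped for a parallel equivalent such as `parallel::mclapply()` or `parallel::parLapply()` without changing the result.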

Mapping and Reducing for More Complex Operations

# Create a random vector
x <- rnorm(100)
# Find the mean
mean(x)
# [1] -0.01996644
# Map: compute the sum and length of
# each chunk of the vector
sl <- Map(function(v) c(sum(v), length(v)),
          list(x[1:25], x[26:100]))

# Reduce: add the sums and lengths
slr <- Reduce(`+`, sl)
# Find the mean
slr[1] / slr[2]
# [1] -0.01996644

Not all things fit into Split-Apply-Combine

Operations that require all of the data at once can't be computed using the split-apply-combine approach.

Example: Median
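
A short sketch of why the median does not combine the way the mean does. The two chunks below are made-up values chosen so the difference is visible: combining per-chunk medians gives a different answer from the median of the full data, while the sum-and-length trick still recovers the exact mean.

```r
# Hypothetical chunks of one data set
chunks <- list(c(1, 2, 100), c(3, 4, 5, 6))
x <- unlist(chunks)

# Median of the full data
true_median <- median(x)

# Naive attempt: take the median of each chunk, then combine them
chunk_medians <- sapply(chunks, median)
combined_median <- median(chunk_medians)

# The mean, by contrast, is recovered exactly from chunk sums and lengths
parts <- Map(function(v) c(sum(v), length(v)), chunks)
tot <- Reduce(`+`, parts)
chunked_mean <- tot[1] / tot[2]

true_median == combined_median              # FALSE
isTRUE(all.equal(chunked_mean, mean(x)))    # TRUE
```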


However...

Many regression routines can be written in terms of split-apply-combine
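
One way this can work for ordinary least squares, sketched under assumptions: the slides do not say which routines are meant, so this example uses the normal equations, where the cross-products X'X and X'y are sums over rows and can therefore be accumulated chunk by chunk. The simulated data and the names `partial`, `combined`, and `beta_chunked` are hypothetical.

```r
set.seed(1)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 3 * df$x2 + rnorm(n)

# Split: 4 chunks of 50 rows each (chunk count is illustrative)
chunks <- split(df, rep(1:4, each = n / 4))

# Apply: per-chunk cross-products X'X and X'y
partial <- lapply(chunks, function(d) {
  X <- cbind(1, d$x1, d$x2)   # design matrix with an intercept column
  list(xtx = crossprod(X), xty = crossprod(X, d$y))
})

# Combine: cross-products are sums over rows, so they simply add up
combined <- Reduce(
  function(a, b) list(xtx = a$xtx + b$xtx, xty = a$xty + b$xty),
  partial
)

# Solve the normal equations (X'X) beta = X'y
beta_chunked <- drop(solve(combined$xtx, combined$xty))

# Reference fit on the full data for comparison
beta_lm <- unname(coef(lm(y ~ x1 + x2, data = df)))
```

Here `beta_chunked` and `beta_lm` agree up to floating-point error. Solving the normal equations directly is less numerically stable than the QR decomposition used by `lm()`, so this is an illustration of the idea rather than a recommended implementation.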


Let's practice!

