Congratulations!

Scalable Data Processing in R

Michael J. Kane and Simon Urbanek

Instructors, DataCamp

Split-Apply-Combine

  • Break the data into parts
  • Compute on the parts
  • Combine the results

Split-Apply-Combine: Advantages

  • Manageable parts don't overwhelm your computer
  • Approach is easy to parallelize
  • Parts can be processed sequentially on one machine
  • Parts can also be processed on several machines in a cluster
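
As a rough illustration of the points above, the sketch below runs the same computation on the parts first sequentially and then in parallel. It assumes the base `parallel` package and two local worker processes rather than a real cluster; the data and the variable names are hypothetical.

```r
library(parallel)

# Split a vector of numbers into 4 parts of roughly equal size
x <- 1:1000
parts <- split(x, cut(seq_along(x), 4, labels = FALSE))

# Sequential: process one part after another
seq_results <- lapply(parts, sum)

# Parallel: same function, parts spread across 2 local worker processes
cl <- makeCluster(2)
par_results <- parLapply(cl, parts, sum)
stopCluster(cl)

# Combine the partial results from each approach
total_seq <- Reduce(`+`, seq_results)
total_par <- Reduce(`+`, par_results)
```

Because each part is processed independently, switching from `lapply()` to `parLapply()` leaves the split and combine steps unchanged.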

Split-Apply-Combine: R

  • split() partitions a set of row numbers or a data.frame

  • Map() computes on parts

  • Reduce() combines results
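
The three functions above can be sketched together on a toy example. This is a minimal illustration in base R, not the course's own code; the data frame `df` and its columns are made up for the purpose.

```r
# Toy data standing in for a larger table (hypothetical values)
df <- data.frame(
  group = c("a", "b", "a", "b", "c"),
  value = c(1, 2, 3, 4, 5)
)

# Split: partition the row numbers by group
parts <- split(seq_len(nrow(df)), df$group)

# Apply: compute a partial sum on each part
partial_sums <- Map(function(rows) sum(df$value[rows]), parts)

# Combine: add the partial results into one total
total <- Reduce(`+`, partial_sums)
```

Splitting row numbers rather than the data itself keeps the parts small, since each part only records which rows it covers.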

bigmemory

  • Good for larger data sets that can be represented as dense matrices and might be too big for RAM
  • Looks like a regular R matrix
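
To illustrate the "looks like a regular R matrix" point, here is a small sketch that assumes the `bigmemory` package is installed. The matrix is deliberately tiny and held in memory; a realistic use would involve a much larger, possibly file-backed, matrix.

```r
library(bigmemory)

# A small big.matrix for illustration; real use cases would be far larger
x <- big.matrix(nrow = 5, ncol = 2, type = "double", init = 0)

# Assign with the usual matrix syntax
x[, 1] <- c(1, 2, 3, 4, 5)
x[2, 2] <- 10

# Dimensions are queried as with a regular matrix
d <- dim(x)

# Subsetting returns an ordinary R matrix or vector
first_rows <- x[1:3, ]
col_sum <- sum(x[, 1])
```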

iotools

  • Good for much larger data that can be processed in sequential chunks
  • Supports data.frame and matrix
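
Chunk-wise processing might look roughly like the sketch below, which assumes the `iotools` package is installed. The temporary file, its contents, and the helper `chunk_sum` are hypothetical, and the chunk size is set very small only so that this tiny file is read in more than one piece.

```r
library(iotools)

# Write a small headerless CSV to a temporary file (hypothetical data)
path <- tempfile(fileext = ".csv")
write.table(data.frame(id = 1:100, value = rep(2, 100)), path,
            sep = ",", row.names = FALSE, col.names = FALSE)

# Apply: parse each raw chunk into a data.frame and compute a partial sum
chunk_sum <- function(chunk) {
  d <- dstrsplit(chunk, col_types = c("integer", "numeric"), sep = ",")
  sum(d[[2]])
}

# chunk.apply() reads the file piece by piece and merges the results with CH.MERGE
partial <- chunk.apply(path, chunk_sum, CH.MERGE = c, CH.MAX.SIZE = 200)

# Combine: add up the partial sums
total <- sum(partial)
```

Only one chunk needs to be in memory at a time, which is why this approach suits data much larger than RAM.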

Good luck!
