What is Scalable Data Processing?

Scalable Data Processing in R

Michael J. Kane and Simon Urbanek

Instructors, DataCamp

In this course ...

  • Work with data that is too large for your computer
  • Write scalable code
  • Import and process data in chunks

RAM

All R objects are stored in RAM
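
Because every object lives in RAM, it helps to check how much memory an object takes. The following is a minimal sketch using base R's object.size(); the variable names (x, y, size_bytes, ratio) are illustrative and not part of the course material.

```r
# Allocate one million doubles; each double takes 8 bytes, plus a small header.
x <- numeric(1e6)

# object.size() estimates the memory used by an R object.
size_bytes <- as.numeric(object.size(x))
print(object.size(x), units = "Mb")

# Doubling the length roughly doubles the memory footprint.
y <- numeric(2e6)
ratio <- as.numeric(object.size(y)) / size_bytes
ratio
```

Checking sizes this way gives a rough sense of whether a data set will fit comfortably in memory before you load all of it.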

How Big Can Variables Be?

"R is not well-suited for working with data larger than 10-20% of a computer's RAM." - The R Installation and Administration Manual

Swapping is inefficient

  • If the computer runs out of RAM, the operating system moves data to disk (swapping)
  • Because the disk is much slower than RAM, execution time increases

Scalable solutions

  • Move a subset into RAM
  • Process the subset
  • Keep the result and discard the subset
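
The three steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the temporary CSV file stands in for a data set too large for RAM, and the names path, chunk_size, and total are hypothetical.

```r
# Write a small CSV to a temporary file to stand in for a large data set.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), path, row.names = FALSE)

con <- file(path, open = "r")
header <- readLines(con, n = 1)  # consume the header line
chunk_size <- 100                # rows to hold in RAM at a time
total <- 0

repeat {
  # Move a subset into RAM.
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break

  # Process the subset and keep only the running result.
  total <- total + sum(as.numeric(lines))

  # The subset is overwritten on the next iteration, so it is discarded.
}

close(con)
unlink(path)
total
```

Only chunk_size rows are in memory at any point, so the same loop should work whether the file has a thousand rows or many millions.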

Why is my code slow?

  • Complexity of calculations

  • Carefully consider disk operations to write fast, scalable code
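
To see the cost of disk operations, one rough approach is to time the same calculation on data already in RAM and on data that must first be read from disk. The sketch below uses base R's system.time(); the file is a temporary one created for illustration, the variable names are hypothetical, and the timings will differ from machine to machine.

```r
# Create some data and save a copy to a temporary file on disk.
x <- rnorm(1e5)
path <- tempfile(fileext = ".rds")
saveRDS(x, path)

# Time the calculation on data that is already in RAM.
in_memory <- system.time(m1 <- mean(x))

# Time the same calculation when the data must be read from disk first.
from_disk <- system.time(m2 <- mean(readRDS(path)))

unlink(path)

# Elapsed seconds for each approach.
c(in_memory = in_memory[["elapsed"]], from_disk = from_disk[["elapsed"]])
```

Both approaches return the same mean; only the time spent getting the data into RAM differs.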

Benchmarking Performance

library(microbenchmark)

# Compare the time to generate 100 vs. 10,000 random normal values
microbenchmark(rnorm(100), rnorm(10000))
Unit: microseconds
         expr    min      lq     mean  median      uq     max neval
   rnorm(100)   7.84   8.440   9.5459   8.773   9.355   29.56   100
 rnorm(10000) 679.51 683.706 755.5693 690.876 712.416 2949.03   100

Let's practice!
