What is Scalable Data Processing?

Scalable Data Processing in R

Michael J. Kane and Simon Urbanek

Instructors, DataCamp

In this course...

  • Work with data that is too large for your computer
  • Write scalable code
  • Import and process data in chunks

RAM

All R objects are stored in RAM
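
One way to see how much RAM an object occupies is base R's object.size(). The sketch below assumes a numeric vector of one million values, a size chosen only for illustration. Each double takes 8 bytes, so the vector should need about 8 million bytes plus a small header.

```r
# Create a numeric (double) vector with one million elements.
x <- numeric(1e6)

# object.size() returns an estimate of the memory used by an R object, in bytes.
size_bytes <- as.numeric(object.size(x))

# Print the size in a human-readable unit.
print(format(object.size(x), units = "MB"))
```

The same call works on data frames and lists, which makes it a quick check before loading more data into a session.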

How Big Can Variables Be?

"R is not well-suited for working with data larger than 10-20% of a computer's RAM." - The R Installation and Administration Manual
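
To make the guideline concrete, the arithmetic can be sketched for a hypothetical machine. The 16 GB of RAM and the 10 numeric columns are assumptions for illustration, not figures from the course.

```r
# Hypothetical machine with 16 GB of RAM (an assumption for illustration).
ram_bytes <- 16 * 1024^3

# Apply the 10-20% guideline quoted above.
budget_low  <- 0.10 * ram_bytes
budget_high <- 0.20 * ram_bytes

# A double takes 8 bytes, so a table with 10 numeric columns
# needs roughly 80 bytes per row (ignoring object overhead).
bytes_per_row <- 8 * 10

# Approximate number of rows that fit within each budget.
rows_low  <- floor(budget_low / bytes_per_row)
rows_high <- floor(budget_high / bytes_per_row)

print(c(rows_low = rows_low, rows_high = rows_high))
```

Under these assumptions the comfortable range is on the order of tens of millions of rows; wider tables or character columns would lower it.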

Swapping is inefficient

  • If the computer runs out of RAM, data is moved to disk (swapping)
  • Since the disk is much slower than RAM, execution time increases

Scalable solutions

  • Move a subset into RAM
  • Process the subset
  • Keep the result and discard the subset
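
The three steps above can be sketched in base R with a file connection. This is a minimal illustration, not the course's own implementation: the temporary file, the column name value, and the chunk size of 100 rows are all assumptions, and the small file stands in for one that would not fit in RAM.

```r
# Write a small example file to disk (a stand-in for data too large for RAM).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), path, row.names = FALSE)

# Open a connection so that successive reads continue where the last one stopped.
con <- file(path, open = "r")
header <- readLines(con, n = 1)  # consume the header line

chunk_size <- 100  # rows per chunk (an arbitrary choice)
total <- 0
n_rows <- 0

repeat {
  # Move a subset into RAM.
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break

  # Process the subset.
  values <- as.numeric(lines)

  # Keep the result and discard the subset.
  total <- total + sum(values)
  n_rows <- n_rows + length(values)
}

close(con)
unlink(path)

# The mean is computed from the running totals, never from the full data.
chunk_mean <- total / n_rows
print(chunk_mean)
```

Only one chunk is held in memory at a time, so peak memory use depends on chunk_size rather than on the size of the file.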

Why is my code slow?

  • Complexity of calculations

  • Carefully consider disk operations to write fast, scalable code
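
As a rough illustration of why disk operations matter, the sketch below computes the same result two ways: reading a file once and reusing the in-memory copy, versus re-reading it from disk on every iteration. The file, the vector length, and the 20 repetitions are assumptions chosen for illustration, and the timings will vary by machine and by operating-system file caching.

```r
# Save a vector to a temporary file so there is something to read from disk.
path <- tempfile(fileext = ".rds")
x <- rnorm(1e5)
saveRDS(x, path)

# Option A: read from disk once, then reuse the in-memory copy.
time_once <- system.time({
  y <- readRDS(path)
  means_once <- vapply(1:20, function(i) mean(y), numeric(1))
})

# Option B: re-read the file from disk on every iteration.
time_repeat <- system.time({
  means_repeat <- vapply(1:20, function(i) mean(readRDS(path)), numeric(1))
})

unlink(path)

# Both options give the same answer; only the elapsed time differs.
print(rbind(once = time_once["elapsed"], repeated = time_repeat["elapsed"]))
```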

Benchmarking Performance

library(microbenchmark)

# Compare the time needed to generate 100 versus 10,000 random normal values
microbenchmark(rnorm(100), rnorm(10000))
Unit: microseconds
         expr    min      lq     mean  median      uq     max neval
   rnorm(100)   7.84   8.440   9.5459   8.773   9.355   29.56   100
 rnorm(10000) 679.51 683.706 755.5693 690.876 712.416 2949.03   100

Let's practice!
