What is Scalable Data Processing?

Scalable Data Processing in R

Michael J. Kane and Simon Urbanek

Instructors, DataCamp

In this course ...

  • Work with data that is too large for your computer
  • Write scalable code
  • Import and process data in chunks

RAM

All R objects are stored in RAM
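
A minimal sketch of how to check the RAM an object occupies, using base R's `object.size()`. The vector `x` and its length are illustrative assumptions, not part of the course material.

```r
# Allocate one million doubles; each double takes 8 bytes.
x <- numeric(1e6)

# object.size() returns an estimate of the memory used by the object, in bytes.
size_bytes <- as.numeric(object.size(x))

# Print the size in a human-readable unit.
print(object.size(x), units = "auto")
```

The reported size should be slightly more than 8 million bytes, since R stores a small header alongside the values.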


How Big Can Variables Be?

"R is not well-suited for working with data larger than 10-20% of a computer's RAM." - The R Installation and Administration Manual


Swapping is inefficient

  • If the computer runs out of RAM, data is moved to disk
  • Since the disk is much slower than RAM, execution time increases

Scalable solutions

  • Move a subset into RAM
  • Process the subset
  • Keep the result and discard the subset
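
The three steps above can be sketched in base R as follows. This is one possible approach rather than the course's own implementation: the temporary file, the `chunk_size` value, and the running `total` are hypothetical names chosen for illustration, and the small file stands in for data that would not fit in RAM.

```r
# Create a small example file standing in for data too large for RAM.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), path, row.names = FALSE)

chunk_size <- 100  # hypothetical chunk size; tune to the available RAM
total <- 0         # the only result kept between chunks

con <- file(path, open = "r")
header <- readLines(con, n = 1)  # read past the header line

repeat {
  # Move a subset into RAM.
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break

  # Process the subset.
  chunk <- as.numeric(lines)

  # Keep the result; the chunk is overwritten on the next iteration.
  total <- total + sum(chunk)
}

close(con)
unlink(path)

total  # sum of 1:1000, i.e. 500500
```

Only `total` and one chunk are ever held in memory at once, so the memory footprint depends on `chunk_size` rather than on the size of the file.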

Why is my code slow?

  • Complexity of calculations

  • Carefully consider disk operations to write fast, scalable code
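
A rough sketch of how the cost of a disk operation can be observed, assuming a temporary `.rds` file as the on-disk copy. The vector size and variable names are illustrative, and the timings depend heavily on hardware and operating-system caching.

```r
# Example data held in RAM, plus a copy saved to disk.
x <- rnorm(1e6)
path <- tempfile(fileext = ".rds")
saveRDS(x, path)

# Time a computation on data already in RAM.
in_ram <- system.time(mean(x))

# Time the same computation when the data must first be read from disk.
from_disk <- system.time(mean(readRDS(path)))

unlink(path)

# Compare the two timings side by side (seconds).
rbind(in_ram, from_disk)
```

On most machines the second timing is noticeably larger, because reading and decompressing the file dominates the cost of the calculation itself.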


Benchmarking Performance

library(microbenchmark)

microbenchmark(rnorm(100), rnorm(10000))
Unit: microseconds
         expr    min      lq     mean  median      uq     max neval
   rnorm(100)   7.84   8.440   9.5459   8.773   9.355   29.56   100
 rnorm(10000) 679.51 683.706 755.5693 690.876 712.416 2949.03   100

Let's practice!
