Measuring the benefits

Parallel Programming in R

Nabeel Imam

Data Scientist

Toy example

numbers <- 1:1000000


library(parallel)

# Sequential
sqroots <- lapply(numbers, sqrt)

# Parallel
cl <- makeCluster(4)
sqroots <- parLapply(cl, numbers, sqrt)
stopCluster(cl)

Which will perform better?


Benchmarking performance

Run code several times to estimate average execution time

library(microbenchmark)


microbenchmark(
  "Sequential" = lapply(numbers, sqrt),
  "Parallel" = {
    cl <- makeCluster(4)
    parLapply(cl, numbers, sqrt)
    stopCluster(cl)
  },
  times = 10
)

Unit: milliseconds
      expr     min    mean     max neval
Sequential  633.96  838.09  993.59    10
  Parallel 1136.95 1247.29 1557.58    10
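  • Simple numerical operations rarely benefit from parallelization: starting the cluster and moving data to and from the workers costs more than the computation itself
  • Profiling gives a line-by-line report; benchmarking gives overall execution times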
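
One way to see where the parallel version spends its time is to time each stage separately. The sketch below uses only base R (`system.time()`) and the `parallel` package. The smaller input and the two-worker cluster are assumptions chosen to keep the run short, not the settings used in the benchmark above.

```r
library(parallel)

numbers <- 1:100000  # smaller than the slide's input, to keep the run short

# Time each stage of the parallel version on its own
t_start <- system.time(cl <- makeCluster(2))[["elapsed"]]
t_work  <- system.time(par_roots <- parLapply(cl, numbers, sqrt))[["elapsed"]]
t_stop  <- system.time(stopCluster(cl))[["elapsed"]]

# Time the sequential version for comparison
t_seq <- system.time(seq_roots <- lapply(numbers, sqrt))[["elapsed"]]

# Both approaches should return the same list of square roots
identical(par_roots, seq_roots)

# Elapsed seconds per stage
round(c(startup = t_start, parallel_work = t_work,
        shutdown = t_stop, sequential = t_seq), 3)
```

The exact timings depend on the machine, so no expected output is shown. The point is to compare the startup and transfer costs against the sequential run.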

The elephant in the room

sqroots <- sqrt(numbers)

An elephant sits on the couch in a living room and people acknowledge its presence.


Vectorization

sqroots <- sqrt(numbers)
  • Many base R functions, like sqrt(), are vectorized
  • A vectorized function applies one operation to many inputs in a single call
  • Very fast, but only applicable to simple operations
microbenchmark(
  "Vectorized" = sqrt(numbers),
  "Sequential" = lapply(numbers, sqrt),
  "Parallel" = {
    cl <- makeCluster(4)
    parLapply(cl, numbers, sqrt)
    stopCluster(cl)
  },
  times = 10)
Unit: milliseconds
      expr       min      mean      max neval
Vectorized    2.3904    9.2071   66.303    10
Sequential  352.1166  771.7491 1004.753    10
  Parallel 1191.3176 1377.6926 1700.316    10
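
As a quick check that the vectorized call is a drop-in replacement, the sketch below compares `sqrt()` on the whole vector with the element-by-element `lapply()` version. It also shows that arithmetic operators vectorize in the same way; the `celsius` values are made up for illustration.

```r
numbers <- 1:1000000

vectorized <- sqrt(numbers)                  # one call over the whole vector
looped     <- unlist(lapply(numbers, sqrt))  # one call per element

# The two approaches should produce the same values
identical(vectorized, looped)

# Arithmetic operators are vectorized too (illustrative values)
celsius    <- c(-40, 0, 37, 100)
fahrenheit <- celsius * 9 / 5 + 32
fahrenheit
```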

The bootstrap

Sampling from the current data with replacement

print(ls_df)
$`2001`
   Country             Life_expectancy  Year
 1 Afghanistan                    56.3  2001
 2 Albania                        74.3  2001
 3 Algeria                        71.1  2001
...
$`2002`
   Country             Life_expectancy  Year
 1 Afghanistan                    56.8  2002
 2 Albania                        74.6  2002
 3 Algeria                        71.6  2002
...

Classic version

df <- ls_df$`2001`


estimates <- rep(0, 10000)
for (i in 1:10000) {
  b <- sample(df$Life_expectancy, replace = TRUE)
  estimates[i] <- mean(b)
}

A histogram of bootstrapped estimates of the global average life expectancy in 2001, showing the classic bell curve.

  • Confidence interval using quantiles: quantile(estimates, c(0.025, 0.975))
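
The quantile-based interval can be tried end to end with the sketch below. The life-expectancy values are simulated stand-ins (the real `ls_df` data is not reproduced here), and the seed is set only so the run is repeatable.

```r
set.seed(42)  # for repeatability only

# Simulated stand-in for df$Life_expectancy (not the real data)
life_expectancy <- rnorm(180, mean = 67, sd = 10)

# Resample with replacement and record the mean of each resample
estimates <- replicate(10000, mean(sample(life_expectancy, replace = TRUE)))

# 95% percentile confidence interval for the mean
ci <- quantile(estimates, c(0.025, 0.975))
ci
```

`replicate()` is used here in place of the explicit `for` loop; it performs the same resampling with less bookkeeping.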

The good news

Bootstraps can be parallelized

estimates <- rep(0, 10000)

for (i in 1:10000) {

  b <- sample(df$Life_expectancy,
              replace = T)

  estimates[i] <- mean(b)
}

# The same loop wrapped in a function, so it can be applied to each year's data frame
boot_dist <- function(df) {

  estimates <- rep(0, 10000)

  for (i in 1:10000) {
    b <- sample(df$Life_expectancy, replace = T)
    estimates[i] <- mean(b)
  }

  return(estimates)
}


cl <- makeCluster(4)
ls_dists <- parLapply(cl, ls_df, boot_dist)
stopCluster(cl)
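
Because each worker draws its own random numbers, a parallel bootstrap is not automatically repeatable from run to run. The sketch below shows one way to make it so, using `clusterSetRNGStream()` from the `parallel` package. The simulated `ls_df`, the `n_boot` argument, the `run_once()` helper, the two-worker cluster, and the seed values are all assumptions added for illustration.

```r
library(parallel)

# Simulated stand-in for ls_df: one data frame per year (not the real data)
set.seed(1)
years <- c("2001", "2002", "2003", "2004")
ls_df <- lapply(years, function(yr) {
  data.frame(Life_expectancy = rnorm(180, mean = 67, sd = 10),
             Year = as.integer(yr))
})
names(ls_df) <- years

# Bootstrap distribution of the mean; n_boot is a hypothetical parameter
boot_dist <- function(df, n_boot = 2000) {
  estimates <- numeric(n_boot)
  for (i in seq_len(n_boot)) {
    estimates[i] <- mean(sample(df$Life_expectancy, replace = TRUE))
  }
  estimates
}

# Hypothetical helper: run the parallel bootstrap with a fixed seed
run_once <- function() {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))              # shut the workers down even on error
  clusterSetRNGStream(cl, iseed = 123)  # give each worker its own seeded stream
  parLapply(cl, ls_df, boot_dist)
}

first  <- run_once()
second <- run_once()

# With the same seed and the same number of workers, both runs should match
identical(first, second)
```

The results are repeatable only while the number of workers stays the same, since the work is split across workers before the streams are used.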

The benefits

microbenchmark(
  "lapply" = lapply(ls_df, boot_dist),
  "parLapply" = {
    cl <- makeCluster(4)
    parLapply(cl, ls_df, boot_dist)
    stopCluster(cl)
  },
  times = 10
)
Unit: seconds
     expr    min   mean    max neval
   lapply 3.6938 4.2184 4.5267    10
parLapply 1.9006 2.5166 2.7292    10

How to get there:

  • Profile existing code, identify slowest part
  • Parallelize/optimize this step
  • Benchmark and compare

Let's practice!
