Reproducibility in parallel

Parallel Programming in R

Nabeel Imam

Data Scientist

What is reproducibility?

Same input produces the same results every time we run the code
  • Code can be tested
  • Results can be replicated by others

Figure: lines of code produce a pie chart. A second run of the same lines of code produces exactly the same pie chart.
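
A minimal sketch of the same idea for random draws in a single (non-parallel) R session. The vector `ids` is made-up stand-in data, and the seed value 42 is arbitrary.

```r
# Hypothetical stand-in data: ten made-up customer IDs
ids <- 100001:100010

set.seed(42)                   # Fix the random number generator state
first_draw <- sample(ids, 1)

set.seed(42)                   # Reset to the same state
second_draw <- sample(ids, 1)

# Same seed, same code, same input: the two draws match
identical(first_draw, second_draw)
```

Resetting the seed before each run is what makes the second draw repeat the first.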


The customer lucky draw

print(customer_ids)
$USA
   [1] 465500 612953 106420 279492 376941 163474 164493 801983 898941 406844 829157 ...
$Canada
   [1] 140521 398164 817703 715385 771801 656814 721270 719120 425819 774558 111418 ...
$Mexico
   [1] 714842 486725 706765 858020 790364 390760 198667 419197 352989 202494 756636 ...
$UK
   [1] 886285 151731 274940 779966 375535 431644 880434 649074 765423 449147 408041 ...
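
The slides do not show how `customer_ids` is built. To follow along, one possible stand-in is a named list of made-up ID vectors, sketched below. The ID range, the vector length of 1000, and the seed are all assumptions, so the winners will not match the ones printed in the slides.

```r
# Made-up data with the same shape as customer_ids:
# a named list with one vector of six-digit IDs per country
set.seed(2024)  # Arbitrary seed so the stand-in data is itself reproducible
countries <- c("USA", "Canada", "Mexico", "UK")

customer_ids <- lapply(
  setNames(countries, countries),             # Names carry over to the result
  function(country) sample(100000:999999, 1000)
)

str(customer_ids)
```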

The customer lucky draw

lucky_draw <- function (ids) {
  sample(ids, 1)
}


library(parallel)

cl <- makeCluster(4)
set.seed(1234)  # Seeds the main session only, not the workers
parLapply(cl, customer_ids, lucky_draw)
stopCluster(cl)
$USA
[1] 673576

$Canada
[1] 164613

$Mexico
[1] 769658

$UK
[1] 683102

The reproducibility problem

Winners from first run

$USA
[1] 673576

$Canada
[1] 164613

$Mexico
[1] 769658

$UK
[1] 683102

Winners from second run

$USA
[1] 638051

$Canada
[1] 133431

$Mexico
[1] 522137

$UK
[1] 856141
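
Why the winners change: `set.seed()` fixes the random number generator of the main R session only, while every worker is a separate R process with its own generator state. The sketch below illustrates this with `runif()` instead of the lucky draw. The helper `draw_once()` and the cluster size of 2 are choices made for this illustration.

```r
library(parallel)

# Hypothetical helper: start a fresh cluster, seed the main session,
# then draw one number in the main session and one on each worker
draw_once <- function() {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  set.seed(1234)
  list(
    main    = runif(1),                    # Uses the seeded main session
    workers = clusterEvalQ(cl, runif(1))   # Uses each worker's own generator
  )
}

a <- draw_once()
b <- draw_once()

identical(a$main, b$main)        # The seeded main-session draws match
identical(a$workers, b$workers)  # The workers were never seeded, so these typically differ
```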

Solution

cl <- makeCluster(4)

# A seed for all worker processes in cluster
clusterSetRNGStream(cl, 1234)
parLapply(cl, customer_ids, lucky_draw)
stopCluster(cl)
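
A small sketch of what `clusterSetRNGStream()` does: each worker receives its own random number stream derived from the one seed, so workers do not repeat each other, yet every run starts from the same streams. The helper `worker_draws()` and the cluster size of 2 are assumptions for illustration.

```r
library(parallel)

# Hypothetical helper: seed a fresh cluster, then draw one number per worker
worker_draws <- function(seed) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, seed)   # One distinct stream per worker
  clusterEvalQ(cl, runif(1))
}

first  <- worker_draws(1234)
second <- worker_draws(1234)

identical(first, second)   # Same seed and cluster size: same draws
first[[1]] != first[[2]]   # Each worker draws from its own stream
```

Because `parLapply()` divides the input among the workers, the results may change if the number of workers changes. Keeping both the seed and the cluster size fixed is the safer habit.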

Multiple runs with same results

Winners from first run

$USA
[1] 421408

$Canada
[1] 877562

$Mexico
[1] 460786

$UK
[1] 658513

Winners from second run

$USA
[1] 421408

$Canada
[1] 877562

$Mexico
[1] 460786

$UK
[1] 658513

Multiple runs with same results

First run

cl <- makeCluster(4)

clusterSetRNGStream(cl, 1234)


run1 <- parLapply(cl, customer_ids, lucky_draw)
stopCluster(cl)

Second run

cl <- makeCluster(4)

clusterSetRNGStream(cl, 1234)

run2 <- parLapply(cl, customer_ids, lucky_draw)
stopCluster(cl)


identical(run1, run2)
[1] TRUE

Reproducible results with furrr

First run
library(furrr)

config <- furrr_options(seed = 1234)

plan(multisession, workers = 4)
run1 <- future_map(customer_ids, lucky_draw, .options = config)
plan(sequential)
Second run
plan(multisession, workers = 4)

run2 <- future_map(customer_ids, lucky_draw,
                  # Using the same configuration
                  .options = config)
plan(sequential)

identical(run1, run2)
[1] TRUE
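
furrr's documentation describes seeded results as independent of how the work is split across workers. The sketch below checks this with two different worker counts. The list `ids` and the function `draw()` are made-up stand-ins, and the worker counts are arbitrary.

```r
library(furrr)

# Hypothetical stand-in data and draw function
ids  <- list(a = 1:1000, b = 1001:2000, c = 2001:3000)
draw <- function(x) sample(x, 1)

config <- furrr_options(seed = 1234)

plan(multisession, workers = 2)
two_workers <- future_map(ids, draw, .options = config)

plan(multisession, workers = 3)
three_workers <- future_map(ids, draw, .options = config)

plan(sequential)

# Expected to match if the documented guarantee holds
identical(two_workers, three_workers)
```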

Reproducible results with foreach

First run
install.packages("doRNG")  # Only needed once
library(doParallel)
library(doRNG)

cl <- makeCluster(4)
registerDoParallel(cl)
registerDoRNG(1234)
run1 <- foreach(i = customer_ids) %dopar% {
  lucky_draw(i)
}
stopCluster(cl)
Second run
cl <- makeCluster(4)
registerDoParallel(cl)
registerDoRNG(1234) # Same seed

run2 <- foreach(i = customer_ids) %dopar% {
  lucky_draw(i)
}
stopCluster(cl)

identical(run1, run2)
[1] TRUE
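
doRNG also provides the `%dorng%` operator, which can be used in place of `%dopar%` together with a plain `set.seed()` call before each loop. A minimal sketch, assuming a made-up list `ids` and a cluster of 2 workers:

```r
library(doParallel)
library(doRNG)

# Hypothetical stand-in data
ids <- list(a = 1:1000, b = 1001:2000)

cl <- makeCluster(2)
registerDoParallel(cl)

set.seed(1234)
run1 <- foreach(i = ids) %dorng% {
  sample(i, 1)
}

set.seed(1234)  # Same seed before the second loop
run2 <- foreach(i = ids) %dorng% {
  sample(i, 1)
}

stopCluster(cl)

identical(run1, run2)
```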

When to think about reproducibility

  • Direct call to random number generators
    • rnorm(), rbinom(), etc.
  • Sampling randomly
    • Bootstraps
    • sample_n() from dplyr
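
As an illustration of the bootstrap case, here is a sketch of a parallel bootstrap of the mean made reproducible with `clusterSetRNGStream()`. The data `x`, the helpers `boot_mean()` and `run_bootstrap()`, the 200 resamples, and the cluster size of 2 are all assumptions for this example.

```r
library(parallel)

# Made-up sample to resample from
x <- c(12, 15, 9, 20, 14, 11, 18, 16)

# Hypothetical helper: one bootstrap resample and its mean
# (i is only the iteration index and is not used in the calculation)
boot_mean <- function(i, values) {
  mean(sample(values, replace = TRUE))
}

# Hypothetical helper: seed a fresh cluster and run 200 resamples
run_bootstrap <- function(seed) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, seed)
  parSapply(cl, 1:200, boot_mean, values = x)
}

boot1 <- run_bootstrap(1234)
boot2 <- run_bootstrap(1234)

identical(boot1, boot2)  # Same seed and cluster size: same resamples
```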

Let's practice!
