Monitoring and managing memory

Parallel Programming in R

Nabeel Imam

Data Scientist

The queue and the space

Three tellers attend to customers at a bank, while some customers wait for their turn.

The parallel flow

A parallel flow is shown, where a task is divided into multiple smaller subtasks. These subtasks are performed by different cores and the results are combined.
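
To make the figure concrete, here is a minimal split-apply-combine sketch in base R; the task, the four-way split, and the use of sum() are illustrative stand-ins rather than anything from the course.

# Illustrative only: divide a task into subtasks, work on each piece,
# then combine. In a real parallel run, each piece would go to a core.
task <- 1:100
subtasks <- split(task, rep(1:4, each = 25))  # divide into 4 subtasks
partials <- lapply(subtasks, sum)             # apply work to each piece
result <- Reduce(`+`, partials)               # combine the results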

The parallel flow

The entire parallel workflow is housed inside random access memory (RAM).

The births data

print(ls_files)
 [1] "./births/AK.csv"
 [2] "./births/AL.csv"
 [3] "./births/AR.csv"
 [4] "./births/AZ.csv"
 [5] "./births/CA.csv"
 [6] "./births/CO.csv"
 [7] "./births/CT.csv"
 [8] "./births/DC.csv"
 [9] "./births/DE.csv"
 [10] "./births/FL.csv"
...
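
For context, ls_files could have been built with list.files(); the exact call below is an assumption based on the ./births paths shown above.

# Hypothetical construction of ls_files (assumes the CSVs live in ./births)
ls_files <- list.files(path = "./births", pattern = "\\.csv$",
                       full.names = TRUE)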

Mapping with futures

library(future)  # provides plan()
library(furrr)   # provides future_map()

plan(multisession, workers = 2)

ls_df <- future_map(ls_files, read.csv)
plan(sequential)
print(ls_df)
[[1]]
   state month plurality weight_gain_pounds mother_age
      AK     1         1                 30         43
   ...
[[2]]
   state month plurality weight_gain_pounds mother_age
      AL    10         1                 60         33
   ...
...
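
A common next step, not shown above, is to stack the per-state data frames into a single one; do.call() with rbind is one base-R way to do it.

# Hypothetical follow-up: combine the list of data frames into one
births <- do.call(rbind, ls_df)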

Profiling with two workers

library(profvis)  # interactive profiling of time and memory

profvis({
  plan(multisession, workers = 2)
  ls_df <- future_map(ls_files, read.csv)
  plan(sequential)
})

A code profiling output generated from the profvis function. Reading CSV files in parallel with two workers using future_map utilizes 1.6 megabytes of memory, while the other lines of code do not register any usage.

Profiling with four workers

profvis({
  plan(multisession, workers = 4)
  ls_df <- future_map(ls_files, read.csv)
  plan(sequential)
})

A code profiling output generated from the profvis function. Reading CSV files in parallel with four workers using future_map utilizes 3.1 megabytes of memory, while planning the multisession uses 0.3 megabytes.

Behind the scenes

A map of USA, showing the country divided into four regions, corresponding to the West, Midwest, South, and Northeast. Each region corresponds to a list of CSV files containing data for each state in the region.
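
As a rough illustration of that division, the file paths can be cut into as many chunks as there are workers; this sketch only mimics the split and does not reproduce furrr's internal scheduling.

# Illustrative only: divide the file paths into 4 chunks, one per worker
chunks <- split(ls_files, cut(seq_along(ls_files), 4, labels = FALSE))
lengths(chunks)  # how many files each worker would receive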

Managing memory by chunking

config <- furrr_options(chunk_size = 26)

plan(multisession, workers = 4)
ls_df <- future_map(ls_files, read.csv,
                    .options = config)
plan(sequential)

Managing memory by chunking

profvis({
  config <- furrr_options(chunk_size = 26)
  plan(multisession, workers = 4)
  ls_df <- future_map(ls_files, read.csv,
             .options = config)
  plan(sequential)
})

A code profiling output generated from the profvis function. With a chunk size of 26, reading the CSV files in parallel with four workers using future_map utilizes 2.5 megabytes of memory, down from the 3.1 megabytes observed without explicit chunking.

Chunking with parallel

library(parallel)  # provides makeCluster() and parLapply()

cl <- makeCluster(4)

ls_df <- parLapply(cl, ls_files, read.csv)
stopCluster(cl)

Reading CSV files in parallel using parLapply utilizes 2.4 megabytes of memory, all of it used in the parLapply call.

Chunking with parallel

cl <- makeCluster(4)
ls_df <- parLapply(cl, ls_files, read.csv,
                   chunk.size = 26)
stopCluster(cl)

Reading CSV files in parallel using parLapply utilizes only 1 megabyte of memory when the chunk size is set to 26, down from 2.4 megabytes with default chunking.

When to chunk?

  • By default, chunking is handled automatically and is usually close to optimal
  • Chunk manually when working with large data objects and running low on memory
    • Try using fewer cores if feasible
    • Experiment with a few chunk sizes to find the optimum (see the sketch below)
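
As a starting point for that experimentation, here is a sketch that times a few candidate chunk sizes; the candidate values and the use of system.time() are assumptions, and memory would still be inspected interactively with profvis().

library(future)
library(furrr)

# Sketch: compare a few chunk sizes (the candidate values are arbitrary)
plan(multisession, workers = 4)
for (size in c(7, 13, 26)) {
  config <- furrr_options(chunk_size = size)
  elapsed <- system.time(
    ls_df <- future_map(ls_files, read.csv, .options = config)
  )["elapsed"]
  cat("chunk_size =", size, "-> elapsed:", elapsed, "seconds\n")
}
plan(sequential)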

Let's practice!
