Monitoring and managing memory

Parallel Programming in R

Nabeel Imam

Data Scientist

The queue and the space

Three tellers attend to customers at a bank, while some customers wait for their turn.

The parallel flow

A parallel flow is shown, where a task is divided into multiple smaller subtasks. These subtasks are performed by different cores and the results are combined.
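
To make the figure concrete, here is a minimal split-apply-combine sketch in base R; the task, the four-way split, and the use of sum() are illustrative stand-ins rather than anything from the course.

# Illustrative only: divide a task into subtasks, work on each piece,
# then combine. In a real parallel run, each piece would go to a core.
task <- 1:100
subtasks <- split(task, rep(1:4, each = 25))  # divide into 4 subtasks
partials <- lapply(subtasks, sum)             # apply work to each piece
result <- Reduce(`+`, partials)               # combine the results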

The parallel flow

The entire parallel workflow is housed inside random access memory (RAM).

The births data

print(ls_files)
 [1] "./births/AK.csv"
 [2] "./births/AL.csv"
 [3] "./births/AR.csv"
 [4] "./births/AZ.csv"
 [5] "./births/CA.csv"
 [6] "./births/CO.csv"
 [7] "./births/CT.csv"
 [8] "./births/DC.csv"
 [9] "./births/DE.csv"
 [10] "./births/FL.csv"
...
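
For context, ls_files could have been built with list.files(); the exact call below is an assumption based on the ./births paths shown above.

# Hypothetical construction of ls_files (assumes the CSVs live in ./births)
ls_files <- list.files(path = "./births", pattern = "\\.csv$",
                       full.names = TRUE)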

Mapping with futures

library(future)  # provides plan()
library(furrr)   # provides future_map()

plan(multisession, workers = 2)

ls_df <- future_map(ls_files, read.csv)
plan(sequential)
print(ls_df)
[[1]]
   state month plurality weight_gain_pounds mother_age
      AK     1         1                 30         43
   ...
[[2]]
   state month plurality weight_gain_pounds mother_age
      AL    10         1                 60         33
   ...
...
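
A common next step, not shown above, is to stack the per-state data frames into a single one; do.call() with rbind is one base-R way to do it.

# Hypothetical follow-up: combine the list of data frames into one
births <- do.call(rbind, ls_df)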

Profiling with two workers

library(profvis)  # interactive profiling of time and memory

profvis({
  plan(multisession, workers = 2)
  ls_df <- future_map(ls_files, read.csv)
  plan(sequential)
})

A code profiling output generated from the profvis function. Reading CSV files in parallel with two workers using future_map utilizes 1.6 megabytes of memory, while the other lines of code do not register any usage.

Profiling with four workers

profvis({
  plan(multisession, workers = 4)
  ls_df <- future_map(ls_files, read.csv)
  plan(sequential)
})

A code profiling output generated from the profvis function. Reading CSV files in parallel with four workers using future_map utilizes 3.1 megabytes of memory, while planning the multisession uses 0.3 megabytes.

Behind the scenes

A map of USA, showing the country divided into four regions, corresponding to the West, Midwest, South, and Northeast. Each region corresponds to a list of CSV files containing data for each state in the region.
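
As a rough illustration of that division, the file paths can be cut into as many chunks as there are workers; this sketch only mimics the split and does not reproduce furrr's internal scheduling.

# Illustrative only: divide the file paths into 4 chunks, one per worker
chunks <- split(ls_files, cut(seq_along(ls_files), 4, labels = FALSE))
lengths(chunks)  # how many files each worker would receive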

Managing memory by chunking

config <- furrr_options(chunk_size = 26)

plan(multisession, workers = 4)
ls_df <- future_map(ls_files, read.csv,
                    .options = config)
plan(sequential)

Managing memory by chunking

profvis({
  config <- furrr_options(chunk_size = 26)
  plan(multisession, workers = 4)
  ls_df <- future_map(ls_files, read.csv,
             .options = config)
  plan(sequential)
})

A code profiling output generated from the profvis function. With a chunk size of 26, reading the CSV files in parallel with four workers using future_map utilizes 2.5 megabytes of memory, down from the 3.1 megabytes observed without explicit chunking.

Chunking with parallel

library(parallel)  # provides makeCluster() and parLapply()

cl <- makeCluster(4)

ls_df <- parLapply(cl, ls_files, read.csv)
stopCluster(cl)

Reading CSV files in parallel using parLapply utilizes 2.4 megabytes of memory, all of it used in the parLapply call.

Chunking with parallel

cl <- makeCluster(4)
ls_df <- parLapply(cl, ls_files, read.csv,
                   chunk.size = 26)
stopCluster(cl)

Reading CSV files in parallel using parLapply utilizes only 1 megabyte of memory when the chunk size is set to 26, down from 2.4 megabytes with default chunking.

When to chunk?

  • By default, chunking is handled automatically and is usually close to optimal
  • Chunk manually when working with large data objects and running low on memory
    • Try using fewer cores if feasible
    • Experiment with a few chunk sizes to find the optimum (see the sketch below)
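
As a starting point for that experimentation, here is a sketch that times a few candidate chunk sizes; the candidate values and the use of system.time() are assumptions, and memory would still be inspected interactively with profvis().

library(future)
library(furrr)

# Sketch: compare a few chunk sizes (the candidate values are arbitrary)
plan(multisession, workers = 4)
for (size in c(7, 13, 26)) {
  config <- furrr_options(chunk_size = size)
  elapsed <- system.time(
    ls_df <- future_map(ls_files, read.csv, .options = config)
  )["elapsed"]
  cat("chunk_size =", size, "-> elapsed:", elapsed, "seconds\n")
}
plan(sequential)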

Let's practice!
