Parallelization in R

Parallel Programming in R

Nabeel Imam

Data Scientist

A practical example

The data

print(file_list)
 [1] "./uni_data_country/Argentina.csv"
 [2] "./uni_data_country/Armenia.csv"
 [3] "./uni_data_country/Australia.csv"
 [4] "./uni_data_country/Austria.csv"
 [5] "./uni_data_country/Azerbaijan.csv"
 [6] "./uni_data_country/Bahrain.csv"
 [7] "./uni_data_country/Bangladesh.csv"
 [8] "./uni_data_country/Belarus.csv"
 [9] "./uni_data_country/Belgium.csv"
[10] "./uni_data_country/Bolivia.csv"
...

Three university buildings are shown with graduating students, each assigned a rank from one to three.

Add a column

for (file in file_list) {

  df <- read.csv(file)
  df$top100 <- NA

  for (r in 1:nrow(df)) {
    df$top100[r] <- df$Rank[r] <= 100
  }
  write.csv(df, file)
}

Profiling

Code

library(profvis)


profvis({
  for (file in file_list) {
    df <- read.csv(file)
    df$top100 <- NA
    for (r in 1:nrow(df)) {
      df$top100[r] <- df$Rank[r] <= 100
    }
    write.csv(df, file)
  }
})

Output

Profiling output from profvis(). In the sample run, reading the data takes about 40 milliseconds and filling in the `top100` column takes about 80 milliseconds. All other steps are near-instantaneous.
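When profvis is not available, a rough per-step timing can be had with base R's system.time(). The sketch below is only an illustration: it assumes a synthetic data frame with a `Rank` column and a temporary file (both hypothetical stand-ins for one country file), and the timings will differ from the profvis figures above.

```r
# Hypothetical stand-in for one country file (not the course data)
df <- data.frame(Rank = sample(1:2000, 5000, replace = TRUE))
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Time the read step
read_time <- system.time(df <- read.csv(csv_path))

# Time the row-by-row fill of the new column
fill_time <- system.time({
  df$top100 <- NA
  for (r in 1:nrow(df)) {
    df$top100[r] <- df$Rank[r] <= 100
  }
})

# Time the write step
write_time <- system.time(write.csv(df, csv_path, row.names = FALSE))

# Elapsed seconds per step
timings <- c(read  = read_time[["elapsed"]],
             fill  = fill_time[["elapsed"]],
             write = write_time[["elapsed"]])
print(timings)
```

system.time() only gives one number per step, so it complements rather than replaces the line-level view from profvis.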

Let's parallelize

The loop

  for (file in file_list) {

    df <- read.csv(file)
    df$top100 <- NA

    for (r in 1:nrow(df)) {
      df$top100[r] <- df$Rank[r] <= 100
    }
    write.csv(df, file)
  }

Function

add_col <- function(file_path) {

  df <- read.csv(file_path)
  df$top100 <- NA

  for (r in 1:nrow(df)) {
    df$top100[r] <- df$Rank[r] <= 100
  }
  write.csv(df, file_path)
}


library(parallel)

cl <- makeCluster(6)
dummy <- parLapply(cl, file_list, add_col)
stopCluster(cl)
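To try the pattern without the course data, here is a minimal end-to-end sketch. It assumes three small hypothetical CSV files written to a temporary directory, a two-worker cluster, and a variant of the function named `add_col_demo` (an illustrative name) that returns the file path and writes without row names so that repeated runs do not add an index column.

```r
library(parallel)

# Hypothetical stand-in data: three small CSV files in a temporary directory
demo_dir <- file.path(tempdir(), "uni_data_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (name in c("A", "B", "C")) {
  write.csv(data.frame(Rank = c(50, 100, 150)),
            file.path(demo_dir, paste0(name, ".csv")),
            row.names = FALSE)
}
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Same steps as add_col, but returning the file path so the result is informative
add_col_demo <- function(file_path) {
  df <- read.csv(file_path)
  df$top100 <- NA
  for (r in 1:nrow(df)) {
    df$top100[r] <- df$Rank[r] <= 100
  }
  write.csv(df, file_path, row.names = FALSE)
  file_path
}

# Two workers are enough for the demo; stopCluster() runs even if a worker fails
cl <- makeCluster(2)
done <- tryCatch(
  parLapply(cl, demo_files, add_col_demo),
  finally = stopCluster(cl)
)

# Inspect one processed file
result <- read.csv(demo_files[1])
print(result)
```

Wrapping the call in tryCatch(..., finally = stopCluster(cl)) is one way to avoid leaving worker processes running when a file cannot be read.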

Practical considerations: number of cores

Detecting cores

detectCores()
[1] 8

Parallelized code

cl <- makeCluster(detectCores() - 2)
dummy <- parLapply(cl, file_list, add_col)
stopCluster(cl)
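The `detectCores() - 2` rule above can misbehave on small machines (it yields zero on a two-core system), and detectCores() is documented to return NA when the core count is unknown. The sketch below shows one possible guard; `choose_workers` and its `reserve` argument are hypothetical names introduced here for illustration.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1 in that case
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1

# Hypothetical helper: leave `reserve` cores free, never ask for fewer than
# one worker, and never start more workers than there are tasks
choose_workers <- function(n_cores, n_tasks, reserve = 2) {
  max(1, min(n_cores - reserve, n_tasks))
}

choose_workers(8, 50)  # 6: eight cores, two kept free
choose_workers(2, 50)  # 1: never drops below one worker
choose_workers(8, 3)   # 3: no more workers than tasks
```

The result could then be passed to makeCluster(), for example `makeCluster(choose_workers(n_cores, length(file_list)))`.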

Practical considerations: cluster type

PSOCK cluster (default)

cl <- makeCluster(detectCores() - 2)
  • Starts new R sessions as workers
  • Workers do not share memory with the main session
  • Works on any OS (Windows, Mac, Linux)

FORK cluster

cl <- makeCluster(detectCores() - 2,
                  type = "FORK")
  • Forks the current R session into subprocesses
  • Workers share memory with the main session (typically faster than PSOCK)
  • Does not work on Windows
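The two cluster types can be combined in a script that runs on any OS. The sketch below is an illustration under stated assumptions: the `threshold` variable and the toy rank values are hypothetical, and two workers are used to keep the demo light. Because PSOCK workers do not see objects from the main session, the variable is sent to them with clusterExport(); forked workers inherit it.

```r
library(parallel)

# FORK is unavailable on Windows, so fall back to PSOCK there
cluster_type <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"

# Defined only in the main session (hypothetical example value)
threshold <- 100

cl <- makeCluster(2, type = cluster_type)

# PSOCK workers start empty, so the variable must be exported explicitly;
# FORK workers already share the main session's memory
if (cluster_type == "PSOCK") {
  clusterExport(cl, "threshold", envir = environment())
}

flags <- tryCatch(
  parLapply(cl, c(50, 100, 150), function(rank) rank <= threshold),
  finally = stopCluster(cl)
)
print(unlist(flags))
```

The same consideration applies to packages: on a PSOCK cluster, any package the worker function needs has to be loaded on the workers as well.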

Let's exercise!
