Debugging in parallel

Parallel Programming in R

Nabeel Imam

Data Scientist

What is debugging?

Programmers find a bug in code on a computer screen and try to remove it.

Reading files in parallel

print(file_list)
 [1] "./stocks/2011.csv"
 [2] "./stocks/2012.csv"
 [3] "./stocks/2013.csv"
 [4] "./stocks/2014.csv"
 [5] "./stocks/2015.csv"
 ...

The filter function

filterCSV <- function (filepath) {

  # Read CSV
  df <- read.csv(filepath)

  # Filter data
  df <- df %>%
    dplyr::filter(Company == "Tesla")

  # Write back to the same path
  write.csv(df, filepath)
}

The parallel apply

cl <- makeCluster(4)

clusterEvalQ(cl, library(dplyr))
dummy <- parLapply(cl, file_list, filterCSV)
stopCluster(cl)
Error in checkForRemoteErrors(val) : 
  one node produced an error: ℹ In argument: `Company == "Tesla"`.
Caused by error:
! object 'Company' not found

The sequential run

short_list <- file_list[1:5]


dummy <- lapply(short_list, filterCSV)
read.csv(short_list[1])
         Date  Open  High   Low Close Adj.Close   Volume Company Year
1  2011-01-03 5.368 5.400 5.180 5.324     5.324  6415000   Tesla 2011
2  2011-01-04 5.332 5.390 5.204 5.334     5.334  5937000   Tesla 2011
3  2011-01-05 5.296 5.380 5.238 5.366     5.366  7233500   Tesla 2011
...
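
The sequential run above covers only the first five files, so it can succeed even when a later file is the broken one. A minimal sketch of extending the idea: loop over the whole list sequentially and wrap each call in tryCatch(), so the failing path is reported instead of aborting the run. The temporary sample files, the helper names filter_rows and try_filter, and the use of base subset() in place of dplyr::filter() are assumptions made to keep the example self-contained.

```r
# Hypothetical sample data: three small CSV files, the last one without "Company"
tmp_dir <- tempfile("stocks")
dir.create(tmp_dir)
sample_list <- file.path(tmp_dir, c("2015.csv", "2016.csv", "2017.csv"))
good <- data.frame(Close = c(5.3, 27.1), Company = c("Tesla", "Microsoft"))
write.csv(good, sample_list[1], row.names = FALSE)
write.csv(good, sample_list[2], row.names = FALSE)
write.csv(good["Close"], sample_list[3], row.names = FALSE)

# Stand-in for filterCSV: base subset() instead of dplyr::filter()
filter_rows <- function(filepath) {
  df <- read.csv(filepath)
  subset(df, Company == "Tesla")
}

# Hypothetical helper: returns NA on success, the error message on failure
try_filter <- function(filepath) {
  tryCatch({
    filter_rows(filepath)
    NA_character_
  }, error = function(e) conditionMessage(e))
}

# One entry per file; non-NA entries mark the files that failed
errors <- vapply(sample_list, try_filter, character(1))
failed_files <- basename(sample_list[!is.na(errors)])
print(failed_files)
```

Because every file is attempted, this is slower than the five-file subset, but it names the offending file directly.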

Find the error

Error message

Error in checkForRemoteErrors(val) : 
  one node produced an error: 
  In argument: `Company == "Tesla"`.
Caused by error:
! object 'Company' not found

Programmers examine a large red exclamation mark on the screen.

Find the error

filterCSV <- function (filepath) {

  # Read CSV
  df <- read.csv(filepath)

  # Filter data
  df <- df %>%
    dplyr::filter(Company == "Tesla")

  # Write back to the same path
  write.csv(df, filepath)
}
filterCSV_debug <- function (filepath) {

  df <- read.csv(filepath)

  print(
    # Paste the file path and the column names together
    paste(filepath, ":",
          # Collapse the column names into a single string
          paste0(colnames(df), collapse = ","))
  )

  df <- df %>%
    dplyr::filter(Company == "Tesla")

  write.csv(df, filepath)
}

Find the error

cl <- makeCluster(4)
clusterEvalQ(cl, library(dplyr))

dummy <- parLapply(cl, file_list, filterCSV_debug)
stopCluster(cl)
Error in checkForRemoteErrors(val) : 
  one node produced an error: ℹ In argument: `Company == "Tesla"`.
Caused by error:
! object 'Company' not found

Find the error

cl <- makeCluster(4, outfile = "log.txt") # Log print messages into "log.txt"

clusterEvalQ(cl, library(dplyr))
parLapply(cl, file_list, filterCSV_debug)
stopCluster(cl)
Error in checkForRemoteErrors(val) : 
  one node produced an error: ℹ In argument: `Company == "Tesla"`.
Caused by error:
! object 'Company' not found

Viewing the logs

The log shows CSV file paths with their column names. The path for the 2017 data is highlighted; that file is missing the column named "Company".
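
With many files, scanning the log by eye gets tedious. A minimal sketch of doing the same check in code, assuming the log lines have the format printed by filterCSV_debug (path, " : ", comma-separated column names). The sample lines are written to a temporary file that stands in for "log.txt"; a real log may also contain other lines written by the workers, which the first grep() call filters out.

```r
# Sample lines in the format printed by filterCSV_debug
sample_lines <- c(
  '[1] "./stocks/2015.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Company,Year"',
  '[1] "./stocks/2016.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Company,Year"',
  '[1] "./stocks/2017.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Year"'
)

# Temporary file standing in for "log.txt"
log_file <- tempfile(fileext = ".txt")
writeLines(sample_lines, log_file)

# Keep only the lines produced by print(), then strip the [1] prefix and quotes
lines <- grep('^\\[1\\] "', readLines(log_file), value = TRUE)
entries <- sub('^\\[1\\] "(.*)"$', "\\1", lines)

# Split each entry into the file path and its vector of column names
parts <- strsplit(entries, " : ", fixed = TRUE)
paths <- vapply(parts, `[`, character(1), 1)
cols <- strsplit(vapply(parts, `[`, character(1), 2), ",", fixed = TRUE)

# Report the files whose column list lacks "Company"
has_company <- vapply(cols, function(x) "Company" %in% x, logical(1))
missing_company <- paths[!has_company]
print(missing_company)
```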

Debugging with foreach

cl <- makeCluster(4,
                  # Specify a file name to log print messages
                  outfile = "log.txt")

registerDoParallel(cl)

foreach(f = file_list,
        .packages = "dplyr") %dopar% {
  filterCSV_debug(f)
}

stopCluster(cl)
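
Besides logging, foreach can hand errors back instead of stopping. The sketch below uses its .errorhandling = "pass" argument, which places the error object in the result list at the position of the failing task. The temporary sample files, the helper filter_rows, and base subset() as a stand-in for dplyr::filter() are assumptions for illustration; the foreach and doParallel packages are assumed to be installed.

```r
library(foreach)
library(doParallel)

# Hypothetical sample data: the 2017 file lacks the "Company" column
tmp_dir <- tempfile("stocks")
dir.create(tmp_dir)
sample_list <- file.path(tmp_dir, c("2016.csv", "2017.csv"))
good <- data.frame(Close = c(5.3, 27.1), Company = c("Tesla", "Microsoft"))
write.csv(good, sample_list[1], row.names = FALSE)
write.csv(good["Close"], sample_list[2], row.names = FALSE)

# Stand-in for filterCSV: base subset() instead of dplyr::filter()
filter_rows <- function(filepath) {
  df <- read.csv(filepath)
  subset(df, Company == "Tesla")
}

cl <- makeCluster(2)
registerDoParallel(cl)

# "pass" keeps going and stores the error object as that task's result
results <- foreach(f = sample_list, .errorhandling = "pass") %dopar% {
  filter_rows(f)
}

stopCluster(cl)

# Tasks whose result is an error object are the ones that failed
failed <- vapply(results, inherits, logical(1), what = "error")
print(basename(sample_list[failed]))
```

The successful tasks still return their data frames, so one bad file does not discard the work done on the others.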

The nice thing about furrr

plan(multisession, workers = 4)
future_map(file_list, filterCSV_debug)
plan(sequential)

The nice thing about furrr

[1] "./stocks/2011.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Company,Year"
[1] "./stocks/2012.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Company,Year"
[1] "./stocks/2013.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Company,Year"
[1] "./stocks/2014.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Company,Year"
[1] "./stocks/2015.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Company,Year"
[1] "./stocks/2016.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Company,Year"
[1] "./stocks/2017.csv : Date,Open,High,Low,Close,Adj.Close,Volume,Year"
Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error in `dplyr::filter()`:
ℹ In argument: `Company == "Tesla"`.
Caused by error:
! object 'Company' not found

The steps

When errors occur in parallel code
  • Run sequentially on a subset of the input
  • Read the error and print targeted messages
  • Locate the error using prints or logs
  • Fix the error
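
How to fix the error depends on what the data should look like. One possible sketch of the last step, assuming files without a "Company" column can simply be skipped (an assumption; the slides do not say how the 2017 file is handled). The function name filterCSV_checked and the temporary sample files are hypothetical, and base R subsetting replaces dplyr::filter() so the workers need no extra packages.

```r
library(parallel)

# Hypothetical sample data: the 2017 file lacks the "Company" column
tmp_dir <- tempfile("stocks")
dir.create(tmp_dir)
sample_list <- file.path(tmp_dir, c("2016.csv", "2017.csv"))
good <- data.frame(Close = c(5.3, 27.1), Company = c("Tesla", "Microsoft"))
write.csv(good, sample_list[1], row.names = FALSE)
write.csv(good["Close"], sample_list[2], row.names = FALSE)

# Hypothetical guarded variant of filterCSV:
# returns TRUE if the file was filtered, FALSE if it was skipped
filterCSV_checked <- function(filepath) {
  df <- read.csv(filepath)

  # Skip files that do not have the column we filter on
  if (!"Company" %in% colnames(df)) {
    return(FALSE)
  }

  df <- df[df$Company == "Tesla", , drop = FALSE]
  write.csv(df, filepath, row.names = FALSE)
  TRUE
}

cl <- makeCluster(2)
filtered <- unlist(parLapply(cl, sample_list, filterCSV_checked))
stopCluster(cl)

# Summary of which files were filtered and which were skipped
print(data.frame(file = basename(sample_list), filtered = filtered))
```

Returning a flag per file keeps the parallel run from failing and still leaves a record of which inputs need attention.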

Let's practice!
