Should we parallelize?

Parallel Programming in R

Nabeel Imam

Data Scientist

Let's construct a building

   

Building a floor on top of the last one: sequential

 

Installing windows in the finished structure: parallel

A building is under construction. Floors can only be built in sequence, while windows can be installed in parallel.

The sequential-parallel scale

Common computational tasks are placed on a scale that runs from sequential at one end to parallel at the other. Creating new variables is near the parallel end, while a cumulative sum is near the sequential end.
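A small sketch of the two ends of the scale may help. The vector `x` and the variable names are illustrative only, not taken from the slides. Each square root depends only on its own input, whereas each running total needs the one before it.

```r
x <- c(4, 9, 16, 25)

# Near the parallel end: each result depends only on its own input,
# so the elements could be processed in any order, or at the same time.
new_variable <- sqrt(x)

# Near the sequential end: each running total needs the previous one,
# so step i cannot start before step i - 1 has finished.
running_total <- numeric(length(x))
running_total[1] <- x[1]
for (i in 2:length(x)) {
  running_total[i] <- running_total[i - 1] + x[i]
}
```

The explicit loop is written out only to make the dependency visible; in practice `cumsum(x)` gives the same running totals.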

A classic numerical operation

Calculating the square roots of a million numbers

numbers <- 1:1000000


start <- Sys.time()
sq_roots <- lapply(numbers, sqrt)
end <- Sys.time()
end - start
Time difference of 1.044573 secs

How could we parallelize the square root?

A flow chart for calculating square roots in parallel. Whole numbers from one to a million are divided into five groups, each of length 200,000.
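One possible way to write the splitting step in base R is sketched below. The names `n_groups`, `group_id` and `groups` are assumptions made for illustration; the slides do not show how the split is done.

```r
numbers <- 1:1000000
n_groups <- 5

# Label each number with its group (1 to 5), then split on that label.
group_id <- rep(seq_len(n_groups), each = length(numbers) / n_groups)
groups <- split(numbers, group_id)

# Number of elements in each group
lengths(groups)
```

This relies on the length of `numbers` being an exact multiple of `n_groups`, which holds for a million numbers and five groups.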

The split data is sent to a cluster, which is composed of more than one core. Each group of numbers is sent to one core for the square root calculation. If all available cores are busy, any new groups will wait for a core to be free.
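The sketch below makes the assignment of groups to workers visible. It rests on two assumptions: the five groups are built by hand with `split()`, and `clusterApply()` from the `parallel` package is used to send one group to each worker. The workers are separate R processes, so their process IDs stand in for the cores in the description above.

```r
library(parallel)

numbers <- 1:1000000
groups <- split(numbers, rep(1:5, each = 200000))

my_cluster <- makeCluster(3)

# clusterApply() hands group 1 to worker 1, group 2 to worker 2, and so on,
# reusing workers once every one of them has been given a group.
worker_ids <- clusterApply(my_cluster, groups, function(group) Sys.getpid())

stopCluster(my_cluster)

# Five groups were processed, but by only three distinct worker processes.
length(unique(unlist(worker_ids)))
```

With three workers and five groups, the fourth and fifth groups are only sent once the first batch has been handled.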

The square roots are collected from each core and combined to give a million square roots.
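The whole split, compute and combine flow could be sketched as follows. This is an illustration under assumptions rather than the course's own code: the groups are built by hand, `clusterApply()` does the distribution, and the combined result is a plain numeric vector.

```r
library(parallel)

numbers <- 1:1000000
groups <- split(numbers, rep(1:5, each = 200000))

my_cluster <- makeCluster(3)
# sqrt() is vectorized, so each worker takes the roots of its whole group at once.
roots_by_group <- clusterApply(my_cluster, groups, sqrt)
stopCluster(my_cluster)

# Combine: flatten the five result vectors back into one vector, in order.
sq_roots <- unlist(roots_by_group, use.names = FALSE)
```

`clusterApply()` returns the results in the same order as the groups it was given, so the combined vector lines up with `numbers`.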

A parallelized numerical operation

The square roots of a million numbers in parallel

library(parallel)


my_cluster <- makeCluster(3)
start <- Sys.time()
sq_roots <- parLapply(my_cluster, numbers, sqrt)
end <- Sys.time()
stopCluster(my_cluster)
end - start
Time difference of 0.8416824 secs

Not as fast as we expected

A flow chart for the parallel calculation of square roots of numbers one to a million.

Parallel execution involves several extra tasks. The first is splitting the data.

After splitting, each subgroup of the data needs to be copied to the cores in the cluster.

After computation, output from each core needs to be collected to give the final result.

Some computational resources are spent on orchestrating the whole process.

So, should we parallelize?

For a sufficiently complex task, consider:

Pros

  • Faster than sequential
  • More cost-efficient in the long run

 

Cons

  • Requires special programming skills (but you are all set!)
  • High memory usage
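To get a feel for what a "sufficiently complex" task might look like, the sketch below times a deliberately slow function both ways. `slow_sqrt()` is a hypothetical stand-in for an expensive computation; the 0.2 second pause, the six inputs and the three workers are arbitrary choices, not figures from the slides.

```r
library(parallel)

# Hypothetical stand-in for an expensive task: wait 0.2 seconds, then return the root.
slow_sqrt <- function(x) {
  Sys.sleep(0.2)
  sqrt(x)
}

inputs <- 1:6

# Sequential: the six pauses happen one after another.
sequential_time <- system.time(seq_result <- lapply(inputs, slow_sqrt))["elapsed"]

# Parallel: three workers share the six inputs.
my_cluster <- makeCluster(3)
parallel_time <- system.time(par_result <- parLapply(my_cluster, inputs, slow_sqrt))["elapsed"]
stopCluster(my_cluster)

sequential_time
parallel_time
```

When each call is this slow, the cost of splitting, copying and collecting is small next to the work itself, so the parallel run would be expected to finish well ahead of the sequential one. Exact timings depend on the machine.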

Let's practice!
