Scaling batch processing

Streaming Concepts

Mike Metzger

Data Engineer

What is scaling?

  • Improving performance
    • Processing more quickly
      • Less time to process the same amount of data
    • Processing more data
      • More data processed in the same amount of time
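
Both senses of scaling above come down to higher throughput. A minimal sketch, assuming throughput is measured in records per second; the `throughput` helper and all the numbers are hypothetical and only illustrate the arithmetic:

```python
def throughput(records, seconds):
    """Return records processed per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return records / seconds


# Hypothetical baseline: 1,000,000 records in 100 seconds.
baseline = throughput(1_000_000, 100)

# Processing more quickly: the same data in half the time.
faster = throughput(1_000_000, 50)

# Processing more data: twice the data in the same time.
bigger = throughput(2_000_000, 100)

print(baseline, faster, bigger)
```

In this sketch, both improvements double the throughput relative to the baseline.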

Vertical scaling

  • Better hardware in a single machine
    • Faster CPU
    • Faster IO
    • More memory
  • Typically the easiest kind of scaling
    • Least complexity
    • Rarely requires changing underlying programs / algorithms

[Images: CPU and memory. Courtesy https://unsplash.com/@jeremy0]

Vertical scaling cons

  • Inherently limited
    • A single machine can only be made so fast or so large
  • Can be expensive / low ROI
  • Future hardware improvements from the industry are not guaranteed

Horizontal scaling

  • Splitting a task into multiple parts that run on separate workers
    • Usually more computers
    • Could also be more CPUs in one machine
  • Best done on tasks that are "embarrassingly parallel"
    • Tasks that can be easily divided among workers
  • Can be very cost effective
  • Can have near-linear performance improvements for certain types of processes
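
The idea of dividing an "embarrassingly parallel" task among workers can be sketched as follows. This is a minimal illustration using Python's standard `concurrent.futures` module on one machine, not a replacement for a processing framework. The helper names (`split_into_chunks`, `process_chunk`, `process_in_parallel`) and the sum-of-squares workload are hypothetical stand-ins for real batch work:

```python
from concurrent.futures import Executor, ProcessPoolExecutor


def split_into_chunks(records, n_chunks):
    """Divide records into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """Hypothetical per-chunk work: needs no data from any other chunk."""
    return sum(x * x for x in chunk)


def process_in_parallel(records, executor: Executor, n_chunks=4):
    """Send each chunk to a worker, then combine the partial results."""
    chunks = split_into_chunks(records, n_chunks)
    partial_results = executor.map(process_chunk, chunks)
    return sum(partial_results)


if __name__ == "__main__":
    data = list(range(1_000))
    # Four worker processes, mirroring CPU1 through CPU4.
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(process_in_parallel(data, pool))
```

Because each chunk is independent, the only coordination is the final sum. That independence is what makes near-linear improvements possible for this kind of task.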

[Diagram: CPU1, CPU2, CPU3, CPU4]

Horizontal scaling cons

  • Complexity
    • Requires a processing framework (like Apache Spark or Dask)
    • Requires more extensive networking
  • Ongoing management
  • Can be expensive depending on requirements
  • "Non-parallel" tasks
    • Tasks that cannot be divided among workers gain little from more machines
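
One way to see why "non-parallel" work limits horizontal scaling is Amdahl's law, which is not named in the slides and is brought in here only as an illustration. The sketch below assumes a job where a fixed fraction of the work can be parallelized perfectly; the `max_speedup` helper and the 90% figure are hypothetical:

```python
def max_speedup(parallel_fraction, workers):
    """Amdahl's law: best-case speedup when only part of a job parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical job: 90% of the work can be divided among workers.
for workers in (1, 2, 4, 8, 16):
    print(workers, round(max_speedup(0.9, workers), 2))
```

Under this model, a fully parallel job speeds up linearly with the number of workers, while a fully serial job gains nothing from extra workers.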

Let's practice!
