Scaling batch processing
Streaming Concepts
Mike Metzger
Data Engineer
What is scaling?
Improving performance
Processing more quickly
Less time to process the same amount of data
Processing more data
More data processed in the same amount of time
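The two meanings of scaling above can be put into numbers. This is a minimal sketch: the function names and the record counts and timings are hypothetical, chosen only to illustrate the arithmetic.

```python
def speedup(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the same workload finishes."""
    return old_seconds / new_seconds


def throughput(records: int, seconds: float) -> float:
    """Records processed per second."""
    return records / seconds


# Hypothetical baseline: 1,000,000 records in 600 seconds.
baseline = throughput(1_000_000, 600)

# Processing more quickly: the same data now takes 300 seconds.
faster = speedup(600, 300)

# Processing more data: twice the records in the same 600 seconds.
throughput_gain = throughput(2_000_000, 600) / baseline

print(f"speedup: {faster:.1f}x, throughput gain: {throughput_gain:.1f}x")
```

Both cases describe a 2x improvement here; they differ only in whether the time or the amount of data is held fixed.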
Vertical scaling
Better computing
Faster CPU
Faster IO
More memory
Typically the easiest kind of scaling
Least complexity
Rarely requires changing underlying programs / algorithms
Images courtesy https://unsplash.com/@jeremy0
Vertical scaling cons
Inherently limited
Can be expensive / low ROI
Industry improvements are not guaranteed
Horizontal scaling
Splitting a task into multiple parts
More computers
Could also be more CPUs
Best done on tasks that are "embarrassingly parallel"
Tasks that can be easily divided among workers
Can be very cost effective
Can have near-linear performance improvements for certain types of processes
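Splitting an embarrassingly parallel task among workers might look like the sketch below. It uses Python's standard `concurrent.futures.ProcessPoolExecutor` on a single machine (the "more CPUs" case) rather than a cluster. The helper names `split_into_chunks`, `sum_of_squares`, and `parallel_sum_of_squares` are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide items into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work for one worker; it needs no data from any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Split the task, process each part in its own process, then combine."""
    chunks = split_into_chunks(items, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

Call `parallel_sum_of_squares(list(range(1_000_000)))` from under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module on some platforms. The task divides easily because each chunk is processed independently and the partial results combine with a single sum.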
Horizontal scaling cons
Complexity
Requires a processing framework (like Apache Spark or Dask)
Requires more extensive networking
Ongoing management
Can be expensive depending on requirements
"Non-parallel" tasks
Let's practice!