Serving modes

MLOps Deployment and Life Cycling

Nemanja Radojkovic

Senior Machine Learning Engineer

model as software

user perspective

service like any other

food service

model service

Serving and serving mode

Providing a prediction service == model serving
Implementing a specific type of serving == serving mode

Choose carefully!

when should the service run

scheduled

on demand

batch prediction 1

batch prediction 2

batch definition

also known as

Batch prediction: Keep it simple

Batch prediction is the simplest serving mode
If the use case allows it, go for it
Good fit: monthly generation of sales forecasts
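
A minimal sketch of what a scheduled batch job might look like, assuming the model can be called as a plain function. `predict_sales`, `run_batch_job`, and the record fields are hypothetical names for illustration, and the linear rule stands in for a real trained model. The point is that all records are scored in one pass and the results are stored for later use, so nobody is waiting on an individual prediction.

```python
from datetime import date


def predict_sales(features):
    """Stand-in for a trained model (hypothetical): a fixed linear rule."""
    return 100.0 + 2.5 * features["last_month_sales"]


def run_batch_job(records, run_date):
    """Score every record in one pass and return rows ready to be stored."""
    return [
        {
            "store_id": rec["store_id"],
            "forecast": predict_sales(rec),
            "generated_on": run_date.isoformat(),
        }
        for rec in records
    ]


# Illustrative input; a real job would read this from a database or file.
records = [
    {"store_id": "A", "last_month_sales": 40.0},
    {"store_id": "B", "last_month_sales": 10.0},
]

forecasts = run_batch_job(records, date(2024, 1, 1))
for row in forecasts:
    print(row)
```

In practice a scheduler (for example a monthly cron entry) would trigger `run_batch_job`; the scheduling itself is outside this sketch.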

on demand 1

on demand synonyms

on demand time importance

tech term

request time

response time

Acceptable latency

What is acceptable?

< 1 hour?
< 1 minute?
< 1 second?
< 1 millisecond?
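
To make the question concrete, here is a rough sketch that measures latency as the gap between request time and response time and checks it against a budget. `timed_predict`, `within_budget`, and `toy_model` are hypothetical helpers, and the measurement covers only the model call, not network or queueing time.

```python
import time


def timed_predict(model_fn, features):
    """Return the prediction and the elapsed seconds between request and response."""
    request_time = time.perf_counter()
    prediction = model_fn(features)
    response_time = time.perf_counter()
    return prediction, response_time - request_time


def within_budget(latency_s, budget_s):
    """True if the observed latency fits the acceptable latency."""
    return latency_s <= budget_s


def toy_model(features):
    """Stand-in for a real model (hypothetical): the mean of the inputs."""
    return sum(features) / len(features)


prediction, latency = timed_predict(toy_model, [1.0, 2.0, 3.0])
print(f"prediction={prediction}, latency={latency:.6f}s")
print("meets 1 s budget:", within_budget(latency, 1.0))
```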

Near-real-time prediction a.k.a. stream processing

Acceptable latency ~= X minutes

Also known as stream processing (requests and responses form "data streams")
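
One way to picture the two data streams is with Python generators: requests arrive one by one and responses are emitted as each is scored. This is only a sketch under simplifying assumptions. `score` and `predict_stream` are hypothetical names, the threshold rule stands in for a real model, and a production system would typically read from a message queue instead of an in-memory iterator.

```python
from typing import Iterable, Iterator


def score(event: dict) -> dict:
    """Hypothetical model: flag amounts above a fixed threshold."""
    return {"id": event["id"], "flagged": event["amount"] > 1000}


def predict_stream(requests: Iterable[dict]) -> Iterator[dict]:
    """Consume a stream of requests and lazily emit a stream of responses."""
    for event in requests:
        yield score(event)


# Illustrative request stream; real events would arrive over time.
incoming = iter([
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 5000},
])

responses = list(predict_stream(incoming))
print(responses)
```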

Real-time prediction

Acceptable latency < 1 second

Example:

Credit card fraud detection
A late prediction is as good as useless

When latency is a priority

A weaker but faster model can be more valuable than a stronger but slower one
Models deployed to end user devices to reduce latency => "edge deployment"
- ML-infused smartphone apps:
  - navigation apps
  - unlocking via facial recognition
  - image filters
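
The trade-off above can be sketched as a selection rule: among the candidate models, keep only those that fit the latency budget, then take the most accurate of what remains. The candidate names, accuracies, and latencies below are made-up illustrative values, not measurements, and `pick_model` is a hypothetical helper.

```python
# Made-up candidates for illustration only.
candidates = [
    {"name": "large", "accuracy": 0.95, "latency_ms": 1200},
    {"name": "medium", "accuracy": 0.92, "latency_ms": 300},
    {"name": "small", "accuracy": 0.88, "latency_ms": 20},
]


def pick_model(candidates, budget_ms):
    """Return the most accurate candidate within the latency budget, or None."""
    eligible = [c for c in candidates if c["latency_ms"] <= budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["accuracy"])


chosen = pick_model(candidates, budget_ms=500)
print(chosen["name"])
```

With a 500 ms budget the strongest model is excluded because it is too slow, so a weaker one is served instead.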

Let's practice!
