Monitoring and alerting

MLOps Deployment and Life Cycling

Nemanja Radojkovic

Senior Machine Learning Engineer

outside world

bugs in the service

catching bugs

many moving pieces

points of failure 2

alert

look here

logging 1

logging 2

logging 3
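
A minimal sketch of what logging around a prediction call could look like, using Python's standard `logging` module. The service name, the `build_log_record` and `predict_with_logging` helpers, and the record fields are hypothetical and chosen for illustration only; a real service would log whatever it needs to trace a request end to end.

```python
import json
import logging
import time

# Hypothetical logger name; use whatever identifies your service.
logger = logging.getLogger("prediction_service")


def build_log_record(request_id, features, prediction, latency_ms):
    """Bundle what is needed to trace one prediction later."""
    return {
        "request_id": request_id,
        "features": features,
        "prediction": prediction,
        "latency_ms": round(latency_ms, 2),
    }


def predict_with_logging(model, request_id, features):
    """Call the model and log inputs, output, and latency as one JSON line."""
    start = time.perf_counter()
    try:
        prediction = model(features)
    except Exception:
        # logger.exception records the stack trace together with the message.
        logger.exception("prediction failed for request %s", request_id)
        raise
    latency_ms = (time.perf_counter() - start) * 1000
    record = build_log_record(request_id, features, prediction, latency_ms)
    logger.info(json.dumps(record))
    return prediction
```

Writing each record as a single JSON line keeps the logs easy to search and aggregate later.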

data pipeline 1

data validation 2

data profiles validation
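
As a rough illustration of validating incoming data against an expected profile, the sketch below checks each column's null fraction and value range. The profile layout (`min`, `max`, `max_null_fraction`) and the `validate_batch` helper are assumptions made for this example, not a specific library's format.

```python
def validate_batch(rows, expected_profile):
    """Return a list of violations; an empty list means the batch passed."""
    violations = []
    for column, rules in expected_profile.items():
        values = [row.get(column) for row in rows]

        # Check how much of the column is missing.
        missing = sum(v is None for v in values)
        null_fraction = missing / len(values) if values else 0.0
        if null_fraction > rules["max_null_fraction"]:
            violations.append(
                f"{column}: null fraction {null_fraction:.2f} "
                f"exceeds {rules['max_null_fraction']:.2f}"
            )

        # Check that the values that are present stay in the expected range.
        present = [v for v in values if v is not None]
        if present and (min(present) < rules["min"] or max(present) > rules["max"]):
            violations.append(
                f"{column}: values outside [{rules['min']}, {rules['max']}]"
            )
    return violations


# Hypothetical expected profile for a single column.
profile = {"age": {"min": 0, "max": 120, "max_null_fraction": 0.1}}
problems = validate_batch([{"age": 30}, {"age": 250}, {"age": None}], profile)
```

Here `problems` holds two violations, one for the missing value and one for the out-of-range value, which could then be logged or turned into an alert.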

Statistical validation

Can be:

  • too sensitive
  • not informative enough

 

Risk

  • Too many alerts
  • "Alert fatigue"
  • Important alerts going unnoticed
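
The sensitivity trade-off above can be sketched with a simple mean-shift check. This is one possible statistical test among many, not the method the course prescribes; the `mean_shift_score` and `should_alert` helpers and the default threshold of 3.0 are assumptions for illustration.

```python
import math
import statistics


def mean_shift_score(reference, current):
    """Z-score of the current batch mean relative to the reference sample."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    standard_error = ref_std / math.sqrt(len(current))
    difference = abs(statistics.fmean(current) - ref_mean)
    if standard_error == 0:
        # Constant reference data: any difference at all is an infinite shift.
        return math.inf if difference > 0 else 0.0
    return difference / standard_error


def should_alert(reference, current, threshold=3.0):
    """A lower threshold is more sensitive and produces more alerts."""
    return mean_shift_score(reference, current) > threshold


reference = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
small_shift = [10.5, 11, 10, 10.5]
large_shift = [14, 15, 13, 14]
```

With the default threshold, `small_shift` stays quiet and `large_shift` triggers an alert. Lowering the threshold to 0.5 makes `small_shift` alert as well, which is the kind of over-sensitivity that leads to alert fatigue.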

inform everyone

Learn from your history

After resolving an incident, record its root cause and the resolution steps.

Example from Google[1]:

  • 10 years of incidents recorded and analyzed
  • More than 2/3 were not ML-related!

[1] How ML Breaks: A Decade of Outages for One Large ML Pipeline, https://www.usenix.org/conference/opml20/presentation/papasian

centralized monitoring

Let's practice!
