Monitoring and alerting

Deployment and life cycle in MLOps

Nemanja Radojkovic

Senior Machine Learning Engineer

outside world

bugs in the service

catching bugs

many moving pieces

points of failure 2

alert

look here

logging 1

logging 2

logging 3

data pipeline 1

data validation 2

data profiles validation

Statistical validation

Can be:

  • too sensitive
  • not informative enough

Risk

  • Too many alerts
  • "Alert fatigue"
  • Important alerts going unnoticed
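
The trade-off above can be illustrated with a minimal sketch of a statistical check, assuming the validation compares the mean of an incoming batch against a reference sample. The function names, the sample values, and the threshold of 3.0 are all hypothetical choices made for this illustration; they are not taken from the course.

```python
import math
import statistics


def drift_score(reference, batch):
    """Return how many standard errors the batch mean lies from the reference mean."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    standard_error = ref_std / math.sqrt(len(batch))
    return abs(statistics.fmean(batch) - ref_mean) / standard_error


def should_alert(reference, batch, threshold=3.0):
    """Raise an alert only when the drift score exceeds the threshold.

    The threshold is an assumed tuning knob: too low and ordinary noise
    triggers alerts (alert fatigue), too high and real shifts go unnoticed.
    """
    return drift_score(reference, batch) > threshold


# Made-up values for one numeric feature.
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
noisy_batch = [10.4, 10.0, 10.6, 9.8]      # ordinary fluctuation
shifted_batch = [13.0, 12.5, 13.5, 12.8]   # clear shift in the mean

print(should_alert(reference, noisy_batch))                 # default threshold
print(should_alert(reference, shifted_batch))               # default threshold
print(should_alert(reference, noisy_batch, threshold=0.5))  # overly sensitive setting
```

With the default threshold only the shifted batch triggers an alert, while the overly sensitive setting also fires on ordinary noise, which is the kind of alert that contributes to alert fatigue.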

inform everyone

Learn from your history

After resolving an incident, record its root cause and resolution steps

Example from Google[1]:

  • 10 years of incidents recorded and analyzed
  • More than 2/3 were not ML-related!
[1] How ML Breaks: A Decade of Outages for One Large ML Pipeline, https://www.usenix.org/conference/opml20/presentation/papasian
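
One possible way to keep such a history is a small structured incident log that can be analyzed later. The sketch below assumes each incident is tagged with a category; the `Incident` fields, the category labels, and the three sample entries are invented for illustration and are not data from the cited study.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    title: str
    root_cause: str
    category: str  # assumed labels: "ml" or "non-ml"
    resolution_steps: tuple[str, ...]


def non_ml_share(incidents):
    """Fraction of recorded incidents whose root cause was not ML-related."""
    counts = Counter(incident.category for incident in incidents)
    return counts["non-ml"] / len(incidents)


# Hypothetical entries, written after each incident was resolved.
incident_log = [
    Incident(
        title="Prediction service returned errors",
        root_cause="Expired credentials for the feature store",
        category="non-ml",
        resolution_steps=("Rotate credentials", "Add expiry monitoring"),
    ),
    Incident(
        title="Batch scoring job did not finish",
        root_cause="Worker ran out of disk space",
        category="non-ml",
        resolution_steps=("Clean temporary files", "Alert on disk usage"),
    ),
    Incident(
        title="Accuracy dropped after retraining",
        root_cause="Training data contained mislabeled records",
        category="ml",
        resolution_steps=("Roll back the model", "Add label validation"),
    ),
]

print(f"Non-ML share: {non_ml_share(incident_log):.0%}")
```

Recording the category alongside the root cause is what makes a later summary, such as the share of non-ML incidents, possible.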

centralized monitoring

Let's practice!
