Monitoring and troubleshooting AWS
John Q. Martin
Principal Consultant

Like a hospital: reception → ward → consultant
Top - Executive Summary
Middle - Service Health
Bottom - Detailed Diagnostics

Cover these four and you cover the vast majority of problems you will encounter.

Use RED (Rate, Errors, Duration) as your starting point, then add saturation for deeper diagnostics.
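To make the RED signals concrete, here is a minimal sketch that derives them from raw request records, with saturation as a separate helper. The `Request` type, the field names, and the sample data are all hypothetical; in practice CloudWatch computes these statistics for you.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False when the request ended in an error


def percentile(sorted_values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    rank = (p * len(sorted_values) + 99) // 100  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Rate, Errors, and Duration for the requests seen in one time window."""
    if not requests:
        return None  # nothing to summarise for an empty window
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_pct": 100 * errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


def saturation_pct(in_use, capacity):
    """How full a bounded resource is, e.g. connections in use versus pool size."""
    return 100 * in_use / capacity


# Hypothetical sample: 100 requests in a 50-second window, 5 of them failed.
sample = [Request(duration_ms=float(i), ok=(i > 5)) for i in range(1, 101)]
summary = red_summary(sample, window_seconds=50)
```

Percentiles are used for duration rather than the mean because a handful of slow requests can hide behind a healthy-looking average.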

Rate (request count):
["AWS/ApplicationELB", "RequestCount", {"stat": "Sum"}]

Errors (error rate):
"metrics": [
  ["AWS/Lambda", "Errors", {"stat": "Sum", "id": "errors"}],
  [".", "Invocations", {"stat": "Sum", "id": "invocations"}],
  [{"expression": "(errors/invocations)*100", "label": "Error Rate %"}]
]

Duration (latency percentiles):
["AWS/ApplicationELB", "TargetResponseTime", {"stat": "p50"}],
["AWS/ApplicationELB", "TargetResponseTime", {"stat": "p95"}],
["AWS/ApplicationELB", "TargetResponseTime", {"stat": "p99"}]
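The metric fragments above are pieces of a CloudWatch dashboard body. As a rough sketch of how they might fit together, the Python below assembles three metric widgets into one body. The load balancer identifier, function name, region, period, and widget layout are placeholder assumptions, and the dimension entries are added because real metrics need them to resolve.

```python
import json

# Hypothetical resource identifiers; replace with your own.
ALB = "app/example-alb/0123456789abcdef"
FUNCTION = "example-function"


def metric_widget(title, metrics, x, region="us-east-1", period=60):
    """One metric widget; the dashboard grid is 24 units wide."""
    return {
        "type": "metric",
        "x": x, "y": 0, "width": 8, "height": 6,
        "properties": {
            "title": title,
            "view": "timeSeries",
            "region": region,
            "period": period,
            "metrics": metrics,
        },
    }


def red_dashboard_body():
    """Rate, Errors, and Duration widgets side by side."""
    rate = [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", ALB,
             {"stat": "Sum"}]]
    errors = [
        ["AWS/Lambda", "Errors", "FunctionName", FUNCTION,
         {"stat": "Sum", "id": "errors"}],
        # "." repeats the namespace, dimension name, and value from the line above.
        [".", "Invocations", ".", ".", {"stat": "Sum", "id": "invocations"}],
        [{"expression": "(errors/invocations)*100", "label": "Error Rate %"}],
    ]
    duration = [
        ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", ALB,
         {"stat": stat}]
        for stat in ("p50", "p95", "p99")
    ]
    return {"widgets": [
        metric_widget("Rate", rate, x=0),
        metric_widget("Errors", errors, x=8),
        metric_widget("Duration", duration, x=16),
    ]}


body_json = json.dumps(red_dashboard_body(), indent=2)
```

If boto3 is available, a string like `body_json` could be passed as `DashboardBody` to the CloudWatch client's `put_dashboard` call along with a `DashboardName`.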
Recent errors query:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

Error count by type:
filter @message like /ERROR/
| stats count(*) as errors by errorType
| sort errors desc
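These queries can also be run outside the console. The sketch below only builds the request parameters, so it runs without AWS credentials. The log group name and time window are hypothetical, and the second query assumes the logs are structured (for example JSON) so that `errorType` is available as a field.

```python
import time

RECENT_ERRORS = "\n".join([
    "fields @timestamp, @message",
    "| filter @message like /ERROR/",
    "| sort @timestamp desc",
    "| limit 20",
])

ERRORS_BY_TYPE = "\n".join([
    "filter @message like /ERROR/",
    "| stats count(*) as errors by errorType",
    "| sort errors desc",
])


def build_query_request(log_group, query, minutes=60, now=None):
    """Parameters for a Logs Insights query over the last `minutes` minutes."""
    end = int(time.time() if now is None else now)  # epoch seconds
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,
        "endTime": end,
        "queryString": query,
    }


# Hypothetical log group; `now` is fixed so the example is reproducible.
request = build_query_request(
    "/aws/lambda/example-function", ERRORS_BY_TYPE, minutes=15, now=1_700_000_000
)
```

With boto3, a dictionary like this could be unpacked into the CloudWatch Logs client's `start_query` call, and the returned query ID polled with `get_query_results`.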


Like ER triage: stabilize → diagnose → treat → verify

Analysis: Resources overwhelmed by traffic surge

Analysis: The database is the bottleneck; slow queries are exhausting the connection pool.
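A back-of-the-envelope calculation shows why slow queries drain a pool so quickly: each connection is held for the duration of its query, so the busy connection count is roughly the query rate multiplied by the query time (Little's law). The pool size, latencies, and traffic figures below are illustrative assumptions, not measurements from this incident.

```python
import math


def max_queries_per_second(pool_size, avg_query_ms):
    """Upper bound on throughput when every connection is always busy."""
    return pool_size * 1000 / avg_query_ms


def connections_needed(queries_per_second, avg_query_ms):
    """Connections held at once for a given load (Little's law), rounded up."""
    return math.ceil(queries_per_second * avg_query_ms / 1000)


# Illustrative numbers: a pool of 20 connections.
healthy_limit = max_queries_per_second(pool_size=20, avg_query_ms=50)
degraded_limit = max_queries_per_second(pool_size=20, avg_query_ms=500)
needed_when_slow = connections_needed(queries_per_second=200, avg_query_ms=500)
```

In this hypothetical, a tenfold increase in query time cuts the pool's ceiling from 400 to 40 queries per second, and sustaining 200 queries per second would need 100 connections, five times what the pool holds.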

A small problem in Service A cascaded through the dependency chain
