Monitoring and troubleshooting AWS
John Q. Martin
Principal Consultant
Service maps are auto-generated from your trace data; no manual setup is needed.

At a glance: green = on time · amber = delayed · red = stopped

When you open a service map, go straight to anything that isn't green.
Click any node to see: response time distribution, HTTP status breakdown, error types.

API Gateway (50ms)
-> OrderService (50ms)
-> PaymentService (100ms)
-> External API (200ms)
Total: 400ms
External API = 50% of total latency
Compare alternative path:
OrderService -> DynamoDB
Total: 120ms (fast)
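The latency breakdown above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming you have already copied per-segment durations out of a trace; the function name `latency_shares` and the `slow_path` list are illustrative and not part of any AWS SDK.

```python
from typing import List, Tuple


def latency_shares(segments: List[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """Return (name, duration_ms, share_of_total), slowest segment first."""
    total = sum(ms for _, ms in segments)
    if total <= 0:
        return []
    rows = [(name, ms, ms / total) for name, ms in segments]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Durations copied from the slow path shown above.
slow_path = [
    ("API Gateway", 50),
    ("OrderService", 50),
    ("PaymentService", 100),
    ("External API", 200),
]

for name, ms, share in latency_shares(slow_path):
    print(f"{name:<15} {ms:>4} ms  {share:.0%}")
```

The first row returned is the segment to investigate first; here that is the External API call at half of the 400ms total.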


Query 1 (10-30ms)
-> Query 2 (30-50ms)
-> Query 3 (50-70ms)
-> Query 4 (70-90ms)

Get user list (10-20ms)
Get user 1 details (20-30ms)
Get user 2 details (30-40ms)
... ×100 sequential queries
Fix: batch queries, eager loading
Fix: batch requests, message queues, caching
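The "get user list, then get each user's details" pattern above is the classic N+1 query problem. The sketch below contrasts it with a batched query, using an in-memory SQLite database as a stand-in for the real data store; the table names, column names and helper functions are all hypothetical.

```python
import sqlite3


def make_db(n_users: int = 100) -> sqlite3.Connection:
    """Build a throwaway in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE details (user_id INTEGER, email TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(i, f"user{i}") for i in range(1, n_users + 1)],
    )
    conn.executemany(
        "INSERT INTO details VALUES (?, ?)",
        [(i, f"user{i}@example.com") for i in range(1, n_users + 1)],
    )
    return conn


def fetch_n_plus_one(conn: sqlite3.Connection):
    """Anti-pattern: one list query plus one detail query per user."""
    queries = 1
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    result = {}
    for user_id, name in users:
        row = conn.execute(
            "SELECT email FROM details WHERE user_id = ?", (user_id,)
        ).fetchone()
        queries += 1
        result[name] = row[0]
    return result, queries


def fetch_batched(conn: sqlite3.Connection):
    """Fix: a single JOIN returns the same data in one round trip."""
    rows = conn.execute(
        "SELECT u.name, d.email FROM users u "
        "JOIN details d ON d.user_id = u.id ORDER BY u.id"
    ).fetchall()
    return dict(rows), 1
```

With 100 users the first function issues 101 queries and the second issues 1, which is the difference between a long staircase of segments in the trace and a single short one.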

Fix: provisioned concurrency, optimize initialization
Service A [Error]
-> Service B [Timeout 5s]
-> Service C [Timeout 5s]
-> Database [Timeout 5s]
Matching timeout blocks stacked through the trace hierarchy
Fix: timeouts, circuit breakers, fallbacks
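A cascade like the one above can be cut short by failing fast once a dependency is known to be unhealthy. The following is a minimal circuit breaker sketch with a fallback, not a production library; the class name, thresholds and injected `clock` parameter are assumptions for illustration. It expects the wrapped call to enforce its own short timeout and raise an exception when it expires.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_s:
            # Cool-down elapsed: allow one trial call (half-open).
            # A single further failure re-opens the circuit.
            self._opened_at = None
            self._failures = self.max_failures - 1
            return False
        return True

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            # Fail fast: do not wait on a dependency known to be down.
            return fallback()
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result
```

Each service in the chain would wrap its downstream call this way, with a timeout shorter than its caller's, so that a slow database produces one quick fallback instead of a stack of identical 5s timeouts.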

Compare groups:
response_time > 1000ms, sort by duration
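The same filter-and-sort step can be applied offline to exported trace summaries. This is a sketch only: the dictionary keys `id` and `response_time_ms` and the sample values are hypothetical, not the shape of any AWS API response. Note that X-Ray's own filter expressions use the `responsetime` keyword measured in seconds, so the console equivalent of the threshold above would be `responsetime > 1`.

```python
from typing import Dict, List


def slow_traces(traces: List[Dict], threshold_ms: float = 1000.0) -> List[Dict]:
    """Keep traces slower than the threshold, slowest first."""
    slow = [t for t in traces if t["response_time_ms"] > threshold_ms]
    return sorted(slow, key=lambda t: t["response_time_ms"], reverse=True)


# Hypothetical sample data for illustration.
sample = [
    {"id": "trace-a", "response_time_ms": 420},
    {"id": "trace-b", "response_time_ms": 1850},
    {"id": "trace-c", "response_time_ms": 1000},
    {"id": "trace-d", "response_time_ms": 5200},
]

for trace in slow_traces(sample):
    print(trace["id"], trace["response_time_ms"])
```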
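# Namespace and metric names are as given on the slide; confirm they match
# the metrics actually published in your account before relying on these alarms.

# High latency alarm: average response time above 1000 over a 5-minute period
# --evaluation-periods is required by put-metric-alarm; 1 is an assumed value
aws cloudwatch put-metric-alarm \
  --alarm-name HighLatency \
  --metric-name ResponseTime \
  --namespace AWS/XRay \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1000 \
  --comparison-operator GreaterThanThreshold

# High error count alarm: more than 50 errors in a 5-minute period
aws cloudwatch put-metric-alarm \
  --alarm-name HighErrorCount \
  --metric-name ErrorCount \
  --namespace AWS/XRay \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold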