Application health dashboards

Monitoring and troubleshooting AWS

John Q. Martin

Principal Consultant

Dashboard design: hierarchy of information

 

Three-tier pyramid: a narrow dark top tier, a wider teal middle tier, and a widest green bottom tier

Like a hospital: reception → ward → consultant

 

Top - Executive Summary

  • Overall system health
  • Key business metrics, SLA, current incidents

Middle - Service Health

  • Error rates, latency percentiles, throughput

Bottom - Detailed Diagnostics

  • Traces, logs, resource utilization, dependency health

The four golden signals

 

The four golden signals of monitoring: latency, traffic, errors, and saturation

Cover these four and you cover the vast majority of problems you will encounter.


The RED method

 

RED for request-driven services:

RED method pattern showing rate, errors, and duration for request-driven services

 

Relationship to golden signals:

  • RED is a focused subset of the four golden signals: rate (traffic), errors, duration (latency)
  • Designed for services that handle user requests
  • Saturation (4th golden signal) added for infrastructure monitoring

Use RED as your starting point, add saturation for deeper diagnostics.
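
To make the three RED measurements concrete, here is a minimal Python sketch that summarizes one time window of request records. The `Request` type, the nearest-rank percentile helper, and the rule that only 5xx statuses count as errors are assumptions made for illustration, not something defined in the course.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one time window."""
    total = len(requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": total / window_seconds,
        "error_pct": 100 * errors / total if total else 0.0,
        "p50_ms": percentile(durations, 50) if durations else None,
        "p95_ms": percentile(durations, 95) if durations else None,
    }


# Four made-up requests observed in a 60-second window.
sample = [Request(120, 200), Request(80, 200), Request(950, 500), Request(100, 200)]
summary = red_summary(sample, window_seconds=60)
print(summary)
```

Saturation is deliberately absent here: it comes from resource metrics such as CPU or connection counts rather than from the request stream.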


Three data sources

 

Diagram of three dashboard data sources: CloudWatch metrics, X-Ray traces, and CloudWatch Logs


CloudWatch metrics widgets

Request volume:

["AWS/ApplicationELB",
 "RequestCount",
 {"stat": "Sum"}]

Error rate (metric math):

"metrics": [
  ["AWS/Lambda", "Errors",
   {"stat":"Sum","id":"errors"}],
  [".", "Invocations",
   {"stat":"Sum","id":"invocations"}],
  [{"expression":
    "(errors/invocations)*100",
    "label":"Error Rate %"}]
]

Response time percentiles:

["AWS/ApplicationELB",
 "TargetResponseTime",
 {"stat": "p50"}],
["AWS/ApplicationELB",
 "TargetResponseTime",
 {"stat": "p95"}],
["AWS/ApplicationELB",
 "TargetResponseTime",
 {"stat": "p99"}]
  • Request volume from ALB RequestCount
  • Error rate uses metric math - (errors / invocations) × 100
  • Response time as p50, p95, p99 percentiles
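
The three fragments above can be assembled into one dashboard body. The Python sketch below is one way to do it under stated assumptions: the `metric_widget` helper, the region, the 5-minute period, the widget sizes, and the `error_rate` id are all illustrative choices. In a real account the ALB and Lambda metrics would usually also need dimensions (for example a specific load balancer or function), which are omitted here as they are on the slide.

```python
import json


def metric_widget(title, metrics, x, y, width=8, height=6, region="us-east-1"):
    """Build one metric widget; region and sizes are assumed defaults."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "region": region,
            "view": "timeSeries",
            "period": 300,  # assumption: 5-minute aggregation
            "metrics": metrics,
        },
    }


widgets = [
    # Request volume from the load balancer
    metric_widget("Request volume", [
        ["AWS/ApplicationELB", "RequestCount", {"stat": "Sum"}],
    ], x=0, y=0),
    # Error rate via metric math; the raw inputs are hidden from the graph
    metric_widget("Error rate %", [
        ["AWS/Lambda", "Errors", {"stat": "Sum", "id": "errors", "visible": False}],
        [".", "Invocations", {"stat": "Sum", "id": "invocations", "visible": False}],
        [{"expression": "(errors/invocations)*100",
          "label": "Error Rate %", "id": "error_rate"}],
    ], x=8, y=0),
    # Response time at three percentiles
    metric_widget("Response time", [
        ["AWS/ApplicationELB", "TargetResponseTime", {"stat": stat}]
        for stat in ("p50", "p95", "p99")
    ], x=16, y=0),
]

dashboard_body = json.dumps({"widgets": widgets}, indent=2)
print(dashboard_body)
```

The resulting string is the kind of value that boto3's `put_dashboard` accepts as `DashboardBody`; that call is left out so the sketch runs without AWS credentials.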

X-Ray and logs widgets

 

X-Ray widgets:

  • Service map - application topology with health indicators
  • Trace statistics - average response time, error rate, fault rate
  • Service latency - per-service response time comparison

 

CloudWatch Logs widgets:

Recent errors query:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

Error count by type:

filter @message like /ERROR/
| stats count(*) as errors
    by errorType
| sort errors desc
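
Either query can be placed on a dashboard as a log widget. The Python sketch below wraps the two queries above in widget definitions. The log group name, region, sizes, and the `log_widget` helper are hypothetical, and the second query assumes the log events are structured so that an `errorType` field exists.

```python
import json

LOG_GROUP = "/aws/lambda/example-function"  # hypothetical log group name

RECENT_ERRORS = (
    "fields @timestamp, @message"
    " | filter @message like /ERROR/"
    " | sort @timestamp desc"
    " | limit 20"
)

ERRORS_BY_TYPE = (
    "filter @message like /ERROR/"
    " | stats count(*) as errors by errorType"
    " | sort errors desc"
)


def log_widget(title, query, x, y, width=12, height=6, region="us-east-1"):
    """Build one Logs Insights widget; the query is prefixed with its source."""
    return {
        "type": "log",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "region": region,  # assumed region
            "view": "table",
            "query": f"SOURCE '{LOG_GROUP}' | {query}",
        },
    }


log_widgets = [
    log_widget("Recent errors", RECENT_ERRORS, x=0, y=0),
    log_widget("Error count by type", ERRORS_BY_TYPE, x=12, y=0),
]
print(json.dumps({"widgets": log_widgets}, indent=2))
```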

Complete dashboard layout

 

Health dashboard layout with summary widgets on top, service map in the middle, and error logs below


Incident troubleshooting workflow

 

Emergency-room triage clipboard with colored priority bands and a stethoscope

Like ER triage: stabilize → diagnose → treat → verify

  1. Identify - red/yellow indicators, error spikes
  2. Scope - which services? when? worsening?
  3. Correlate - errors + latency, traffic + resources
  4. Drill down - service map → traces → logs
  5. Root cause - dependency, DB, memory, config
  6. Fix - circuit breaker, scale, rollback, failover
  7. Verify - error rate normal, map green, alarms clear

Scenario 1: sudden traffic spike

Monitoring dashboard during a traffic spike: line charts spiking upward, gauges pinned in the red zone, red alert banners

What you see on the dashboard:

  • Request volume 10× normal
  • Error rate 25%
  • Latency 3000ms
  • CPU 95%

Analysis: Resources overwhelmed by traffic surge

 

Actions:

  • Scale out immediately
  • Enable auto-scaling
  • Implement rate limiting
  • Add caching to reduce backend pressure
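
As an illustration of the rate-limiting action, here is a minimal token-bucket sketch in Python. The token bucket is one common technique, chosen here as an assumption; the course does not say which algorithm to use, and managed options (for example throttling at a gateway) may fit better in practice. The class name, parameters, and the injectable clock are hypothetical.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, then `rate_per_s` requests per second."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock            # injectable so the demo is deterministic
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Demo with a fake clock: a burst of four requests against a bucket of three.
fake_now = [0.0]
bucket = TokenBucket(rate_per_s=2, capacity=3, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(4)]
fake_now[0] = 1.0                     # one second later, two tokens are back
after_refill = [bucket.allow() for _ in range(3)]
print(burst, after_refill)
```

Rejected requests would typically receive an HTTP 429 response, so excess traffic fails fast instead of piling onto already saturated resources.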

Scenario 2: database bottleneck

Monitoring dashboard showing a database bottleneck: a cracked red database icon, CPU and connections gauges maxed in red, a steeply climbing query-latency chart, connection-pool bar nearly full

What you see on the dashboard:

  • DB CPU 95%
  • DB connections 95/100
  • Query latency 5000ms
  • App latency 5500ms

Analysis: Database is the bottleneck, slow queries exhausting the connection pool

 

Actions:

  • Identify slow queries in logs
  • Add indexes, optimize queries
  • Increase connection pool size
  • Scale the database
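
A short worked calculation with the numbers from this scenario shows why the database is identified as the bottleneck. The Python sketch below uses hypothetical function names, and the throughput estimate relies on Little's law (requests in flight = arrival rate × time per request) as a simplifying assumption: every request holds exactly one connection for the full query time.

```python
def db_share_of_latency(query_ms, app_ms):
    """Fraction of end-to-end latency spent waiting on the database."""
    return query_ms / app_ms


def max_queries_per_second(pool_size, query_ms):
    """Upper bound on throughput when each query holds one connection."""
    return pool_size * 1000 / query_ms


# Values read from the scenario's dashboard
share = db_share_of_latency(query_ms=5000, app_ms=5500)
pool_utilization = 95 / 100

# Throughput ceiling of a 100-connection pool at the observed query time,
# compared with a hypothetical 50 ms query after indexing and tuning.
slow_ceiling = max_queries_per_second(pool_size=100, query_ms=5000)
fast_ceiling = max_queries_per_second(pool_size=100, query_ms=50)

print(f"DB share of latency: {share:.1%}")
print(f"Pool utilization:    {pool_utilization:.0%}")
print(f"Ceiling at 5000 ms:  {slow_ceiling:.0f} queries/s")
print(f"Ceiling at 50 ms:    {fast_ceiling:.0f} queries/s")
```

Under these assumptions, doubling the pool size only doubles the ceiling, while making queries faster raises it by the same factor as the speed-up. That suggests treating a larger pool as a stopgap and query optimization as the main fix.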

Scenario 3: cascading failure

 

What you see on the service map:

Service map cascade: node A amber with client errors, B and C red with server faults, D dark and fully down, arrows flowing down the chain

 

Analysis:

A small problem in Service A cascaded through the dependency chain

Actions:

  • Circuit breakers to stop the cascade
  • Timeouts at each service level
  • Fallback mechanisms for failed dependencies
  • Fix the root cause in Service A
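
The circuit-breaker and fallback actions can be sketched together. The Python below is a minimal, single-threaded illustration under stated assumptions: the class, its thresholds, and the injectable clock are hypothetical, and a production service would more likely rely on an established resilience library than on hand-written code like this.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        """Run func, or short-circuit to the fallback while the circuit is open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("dependency is unavailable")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Demo with a fake clock and a dependency that always fails.
fake_now = [0.0]
attempts = [0]


def failing_dependency():
    attempts[0] += 1
    raise RuntimeError("dependency down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10, clock=lambda: fake_now[0])
results = [breaker.call(failing_dependency, fallback=lambda: "cached") for _ in range(3)]
print(results, "real attempts:", attempts[0])
```

In the demo the third call never reaches the failing dependency. Not sending traffic to a service that is already failing is what stops a failure from spreading along the dependency chain.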

Dashboard best practices

 

Six recommended dashboard best practices for application health monitoring


Video summary and course completion

  • Design using hierarchy of information and the four golden signals
  • Combine CloudWatch metrics, X-Ray traces, and logs in one view
  • Seven-step incident workflow: identify → scope → correlate → drill down → root cause → fix → verify

You can now:

  • Monitor with CloudWatch metrics, logs, and dashboards
  • Configure alarms and notifications with SNS and SQS
  • Implement distributed tracing with X-Ray
  • Build application health dashboards for operational excellence

Let's practice!
