Application health dashboards

Monitoring and troubleshooting AWS

John Q. Martin

Principal Consultant

Dashboard design: hierarchy of information

Three-tier pyramid: a narrow dark top tier, a wider teal middle tier, and a widest green bottom tier

Like a hospital: reception → ward → consultant

Top - Executive Summary

Overall system health
Key business metrics, SLA, current incidents

Middle - Service Health

Error rates, latency percentiles, throughput

Bottom - Detailed Diagnostics

Traces, logs, resource utilization, dependency health

The four golden signals

The four golden signals of monitoring: latency, traffic, errors, and saturation

Cover these four and you cover the vast majority of problems you will encounter.

The RED method

RED for request-driven services:

RED method pattern: rate, errors, and duration for request-driven services

Relationship to golden signals:

RED is a focused subset of the four golden signals: rate (traffic), errors, and duration (latency)
Designed for services that handle user requests
Saturation (the fourth golden signal) is added for infrastructure monitoring

Use RED as your starting point, add saturation for deeper diagnostics.
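
To make the three RED numbers concrete, here is a minimal Python sketch that computes them for one time window. The `Request` record and its field names are hypothetical; in practice the values would come from your metrics or logs.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    is_error: bool


def red_summary(requests, window_seconds):
    """Return Rate, Errors, and Duration figures for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_pct": 0.0, "p50_ms": None, "p95_ms": None}

    errors = sum(1 for r in requests if r.is_error)
    durations = sorted(r.duration_ms for r in requests)

    if total == 1:
        # A single sample is its own percentile.
        p50 = p95 = durations[0]
    else:
        # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
        cuts = quantiles(durations, n=100, method="inclusive")
        p50, p95 = cuts[49], cuts[94]

    return {
        "rate_per_s": total / window_seconds,   # Rate
        "error_pct": 100.0 * errors / total,    # Errors
        "p50_ms": p50,                          # Duration (median)
        "p95_ms": p95,                          # Duration (tail)
    }


# Example: 100 requests in a 10-second window, 5 of them failed.
sample = [Request(duration_ms=float(i), is_error=(i <= 5)) for i in range(1, 101)]
summary = red_summary(sample, window_seconds=10.0)
```

Saturation is deliberately absent here: it describes the resources behind the service (CPU, memory, connections) rather than the requests themselves.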

Three data sources

Diagram of three dashboard data sources: CloudWatch metrics, X-Ray traces, and CloudWatch Logs

CloudWatch metrics widgets

Request volume:

["AWS/ApplicationELB",
 "RequestCount",
 {"stat": "Sum"}]

Error rate (metric math):

"metrics": [
  ["AWS/Lambda", "Errors",
   {"stat":"Sum","id":"errors"}],
  [".", "Invocations",
   {"stat":"Sum","id":"invocations"}],
  [{"expression":
    "(errors/invocations)*100",
    "label":"Error Rate %"}]
]

Response time percentiles:

["AWS/ApplicationELB",
 "TargetResponseTime",
 {"stat": "p50"}],
["AWS/ApplicationELB",
 "TargetResponseTime",
 {"stat": "p95"}],
["AWS/ApplicationELB",
 "TargetResponseTime",
 {"stat": "p99"}]

Request volume from ALB RequestCount
Error rate uses metric math: (errors / invocations) × 100
Response time as p50, p95, p99 percentiles
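
The three fragments above can be assembled into a single dashboard body. The sketch below is one way to do that in Python, assuming the standard dashboard body layout of a `widgets` list whose entries carry `type`, position, size, and `properties`. The helper names, titles, region, and grid positions are illustrative choices, not part of the slides.

```python
import json


def metric_widget(title, metrics, x, y, width=8, height=6, region="us-east-1"):
    """Build one metric widget definition (region and sizes are placeholder values)."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "region": region,
            "view": "timeSeries",
            "period": 300,
            "metrics": metrics,
        },
    }


def build_dashboard_body():
    """Return the dashboard body as a JSON string with the three summary widgets."""
    # Note: real widgets usually add dimension pairs (for example a
    # LoadBalancer name) before the options object; they are omitted here,
    # as in the fragments above.
    request_volume = [["AWS/ApplicationELB", "RequestCount", {"stat": "Sum"}]]
    error_rate = [
        ["AWS/Lambda", "Errors", {"stat": "Sum", "id": "errors"}],
        [".", "Invocations", {"stat": "Sum", "id": "invocations"}],
        [{"expression": "(errors/invocations)*100", "label": "Error Rate %"}],
    ]
    latency = [
        ["AWS/ApplicationELB", "TargetResponseTime", {"stat": p}]
        for p in ("p50", "p95", "p99")
    ]
    widgets = [
        metric_widget("Request volume", request_volume, x=0, y=0),
        metric_widget("Error rate", error_rate, x=8, y=0),
        metric_widget("Response time percentiles", latency, x=16, y=0),
    ]
    return json.dumps({"widgets": widgets}, indent=2)


body = build_dashboard_body()
```

A string like `body` is the kind of value that boto3's `put_dashboard` call accepts as its `DashboardBody` argument, alongside a `DashboardName` of your choosing.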

X-Ray and logs widgets

X-Ray widgets:

Service map - application topology with health indicators
Trace statistics - average response time, error rate, fault rate
Service latency - per-service response time comparison

CloudWatch Logs widgets:

Recent errors query:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

Error count by type:

filter @message like /ERROR/
| stats count(*) as errors
    by errorType
| sort errors desc
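
A Logs Insights query becomes a dashboard widget by wrapping it in a widget definition of type `log`. The Python sketch below shows the general shape, assuming the query string starts with a `SOURCE` clause naming the log group. The log group name, title, and region are hypothetical, and `errorType` in the second query above only works if your logs expose a field with that name.

```python
def log_widget(title, log_group, query_lines, x, y, width=24, height=6,
               region="us-east-1"):
    """Build a log widget definition from a multi-line Logs Insights query."""
    # Collapse the multi-line query into a single line after the SOURCE clause.
    query = f"SOURCE '{log_group}' | " + " ".join(line.strip() for line in query_lines)
    return {
        "type": "log",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "region": region,
            "view": "table",
            "query": query,
        },
    }


recent_errors = log_widget(
    title="Recent errors",
    log_group="/aws/lambda/example-fn",  # hypothetical log group
    query_lines=[
        "fields @timestamp, @message",
        "| filter @message like /ERROR/",
        "| sort @timestamp desc",
        "| limit 20",
    ],
    x=0,
    y=12,
)
```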

Complete dashboard layout

Health dashboard layout: summary widgets on top, service map in the middle, and error logs below

Incident troubleshooting workflow

Emergency-room triage clipboard with colored priority bands and a stethoscope

Like ER triage: stabilize → diagnose → treat → verify

Identify - red/yellow indicators, error spikes
Scope - which services? when? worsening?
Correlate - errors + latency, traffic + resources
Drill down - service map → traces → logs
Root cause - dependency, DB, memory, config
Fix - circuit breaker, scale, rollback, failover
Verify - error rate normal, map green, alarms clear
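
The correlate step can be made quantitative. As a rough illustration, the Python sketch below computes a Pearson correlation between two metric series sampled at the same timestamps. The sample values are invented for the example, and a high coefficient only suggests the metrics move together; it does not establish cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Invented samples: error rate (%) and p95 latency (ms), one point per minute.
error_rate_pct = [1, 1, 2, 8, 15, 14]
p95_latency_ms = [200, 210, 250, 900, 1800, 1700]

errors_vs_latency = pearson(error_rate_pct, p95_latency_ms)
```

Running the same comparison for traffic against resource utilization helps separate a load problem from a fault in a dependency.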

Scenario 1: sudden traffic spike

Monitoring dashboard during a traffic spike: line charts spiking upward, gauges pinned in the red zone, red alert banners

What you see on the dashboard:

Request volume 10× normal
Error rate 25%
Latency 3000ms
CPU 95%

Analysis: Resources overwhelmed by traffic surge

Actions:

Scale out immediately
Enable auto-scaling
Implement rate limiting
Add caching to reduce backend pressure
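
Rate limiting is often implemented as a token bucket. The slides do not prescribe an algorithm, so treat the Python sketch below as one common option; the class name and parameters are illustrative. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s, capacity, now=0.0):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_s=1.0, capacity=2.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
```

In a real service you would pass `time.monotonic()` as `now` and reject or queue requests when `allow` returns False.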

Scenario 2: database bottleneck

Monitoring dashboard showing a database bottleneck: a cracked red database icon, CPU and connections gauges maxed in red, a steeply climbing query-latency chart, connection-pool bar nearly full

What you see on the dashboard:

DB CPU 95%
DB connections 95/100
Query latency 5000ms
App latency 5500ms

Analysis: Database is the bottleneck; slow queries are exhausting the connection pool

Actions:

Identify slow queries in logs
Add indexes, optimize queries
Increase connection pool size
Scale the database
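
The analysis follows from simple arithmetic on the dashboard numbers. This short Python sketch works it through using the values from this scenario; the function name is illustrative.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Values read from the dashboard in this scenario.
pool_utilization = percent(95, 100)       # DB connections in use vs. pool size
db_share_of_latency = percent(5000, 5500) # query latency vs. total app latency

print(f"Connection pool utilization: {pool_utilization:.1f}%")
print(f"Share of app latency spent in queries: {db_share_of_latency:.1f}%")
```

With roughly nine tenths of each request spent waiting on the database and the pool almost full, tuning application code would have little effect until the queries themselves are faster.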

Scenario 3: cascading failure

What you see on the service map:

Service map cascade: node A amber with client errors, B and C red with server faults, D dark and fully down, arrows flowing down the chain

Analysis:

A small problem in Service A cascaded through the dependency chain

Actions:

Circuit breakers to stop the cascade
Timeouts at each service level
Fallback mechanisms for failed dependencies
Fix the root cause in Service A
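
A circuit breaker stops calling a failing dependency for a while and serves a fallback instead. The Python sketch below is a deliberately small version of the pattern, not a production implementation; the class name, thresholds, and the explicit `now` argument are assumptions made for the example.

```python
class CircuitBreaker:
    """Open after repeated failures, then allow a trial call once a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback, now):
        """Run func() unless the circuit is open; use fallback() on failure or when open."""
        if self.opened_at is not None and now - self.opened_at < self.reset_after_s:
            # Open: skip the dependency entirely so it can recover.
            return fallback()
        try:
            # Closed, or half-open after the cool-down: attempt the real call.
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # (re)open the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Pair the breaker with a per-call timeout so that a slow dependency counts as a failure rather than holding a thread or connection indefinitely.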

Dashboard best practices

Six recommended dashboard best practices for application health monitoring

Video summary and course completion

Design using hierarchy of information and the four golden signals
Combine CloudWatch metrics, X-Ray traces, and logs in one view
Seven-step incident workflow: identify → scope → correlate → drill down → root cause → fix → verify

You can now:

Monitor with CloudWatch metrics, logs, and dashboards
Configure alarms and notifications with SNS and SQS
Implement distributed tracing with X-Ray
Build application health dashboards for operational excellence

Let's practice!
