Logs Insights and analysis

Monitoring and troubleshooting AWS

John Q. Martin

Principal Consultant

Logs Insights query structure

[Figure: Logs Insights query structure, showing the fields, filter, stats, sort, and limit commands chained with pipes]

Query languages: same query, three ways

CloudWatch Logs Insights QL

fields @timestamp, @message, level
| filter level = "ERROR"
| filter @timestamp > ago(1h)
| sort @timestamp desc
| limit 20

OpenSearch PPL

source = `/aws/lambda/my-function`
| where level = 'ERROR' and @timestamp > ago(1h)
| fields @timestamp, @message, level
| sort - @timestamp
| head 20

Query languages: OpenSearch SQL

OpenSearch SQL

SELECT `@timestamp`, `@message`, level
FROM `/aws/lambda/my-function`
WHERE level = 'ERROR'
  AND `@timestamp` > ago(1h)
ORDER BY `@timestamp` DESC
LIMIT 20

Standard SELECT / FROM / WHERE syntax
Adds JOINs across log groups
Pick whichever fits your team

Scenario 1: Finding error spikes

fields @timestamp, @message, level
| filter level = "ERROR"
| stats count() as error_count by bin(5m)
| sort @timestamp desc
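
To make the bin(5m) grouping concrete, here is a rough Python sketch of what the query above computes: keep the ERROR events, round each timestamp down to the start of its 5-minute bucket, and count per bucket. The event dictionaries and the helper names (floor_to_bin, count_errors_by_bin) are hypothetical and only illustrate the idea; CloudWatch does this work server-side.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def floor_to_bin(ts, minutes=5):
    """Round an aware timestamp down to the start of its N-minute bucket."""
    bin_seconds = minutes * 60
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % bin_seconds, tz=timezone.utc)


def count_errors_by_bin(events, minutes=5):
    """Count ERROR events per time bucket, newest bucket first."""
    counts = Counter(
        floor_to_bin(event["timestamp"], minutes)
        for event in events
        if event.get("level") == "ERROR"
    )
    return sorted(counts.items(), key=lambda item: item[0], reverse=True)


# Hypothetical sample events, not real log data.
base = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample_events = [
    {"timestamp": base + timedelta(minutes=1), "level": "ERROR"},
    {"timestamp": base + timedelta(minutes=4), "level": "ERROR"},
    {"timestamp": base + timedelta(minutes=6), "level": "ERROR"},
    {"timestamp": base + timedelta(minutes=7), "level": "INFO"},
]
spikes = count_errors_by_bin(sample_events)
```

A bucket whose count is far above its neighbors is the spike to investigate.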

Enhanced: group by error type

fields @timestamp, error_type, error_message
| filter level = "ERROR"
| stats count() as count by error_type, bin(5m)
| sort count desc

Scenario 2: Slow API endpoints

fields @timestamp, endpoint, response_time, status_code
| filter response_time > 1000
| stats avg(response_time) as avg_time,
        max(response_time) as max_time,
        count() as slow_requests
        by endpoint
| sort avg_time desc

Drill-down

fields @timestamp, endpoint, response_time, user_id, request_id
| filter endpoint = "/api/users" and response_time > 1000
| sort response_time desc
| limit 20

Scenario 3: Failed authentication

fields @timestamp, user_id, ip_address, action
| filter action = "login_failed"
| stats count() as failed_attempts by user_id, ip_address
| sort failed_attempts desc
| limit 50

Time-based analysis

fields @timestamp, user_id, ip_address
| filter action = "login_failed"
| stats count() as attempts by ip_address, bin(1h)
| filter attempts > 10
| sort attempts desc

Scenarios 4 & 5: Database timeouts and memory leaks

Database timeouts

fields @timestamp, @message
| filter @message like /database/ and @message like /timeout|error|failed/
| parse @message "timeout after * seconds" as timeout_duration
| stats count() as timeout_count,
        avg(timeout_duration) as avg_timeout
        by bin(5m)

Memory leak detection

fields @timestamp, memory_used_mb, heap_size_mb
| stats avg(memory_used_mb) as avg_memory,
        max(memory_used_mb) as max_memory
        by bin(1h)
| sort @timestamp asc
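
The query returns hourly averages; deciding whether they indicate a leak is left to the reader. One possible check, sketched below in Python, fits a least-squares line to the hourly averages and looks for a sustained upward slope. The function names, the sample values, and the 5 MB per hour threshold are assumptions for illustration, not part of CloudWatch.

```python
def slope_per_hour(hourly_avg_mb):
    """Least-squares slope of evenly spaced hourly averages, in MB per hour."""
    n = len(hourly_avg_mb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(hourly_avg_mb) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(hourly_avg_mb))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def looks_like_leak(hourly_avg_mb, min_slope_mb_per_hour=5.0):
    """Flag a steady climb; the threshold is an assumed, tunable value."""
    return slope_per_hour(hourly_avg_mb) >= min_slope_mb_per_hour


# Hypothetical avg_memory values, one per hour, oldest first.
steady = [510.0, 505.0, 512.0, 508.0, 509.0, 507.0]
leaking = [500.0, 520.0, 540.0, 560.0, 580.0, 600.0]
```

A process that restarts regularly resets its memory, so a check like this is most useful within a single uptime window.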

Scenario 6: Request tracing

fields @timestamp, @message, request_id, service, action
| filter request_id = "abc123def456"
| sort @timestamp asc

Multi-service trace

fields @timestamp, service, action, duration_ms, status
| filter request_id = "abc123def456"
| sort @timestamp asc
| display @timestamp, service, action, duration_ms, status

Advanced techniques: parse and regex

Parsing unstructured logs

fields @timestamp, @message
| parse @message "[*] User * failed to access resource * from IP *"
    as level, user, resource, ip
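
As a mental model for the glob form of parse, each * captures the text between the surrounding literals and the captures are assigned to the names after "as", in order. The Python sketch below emulates that idea with a regular expression. The parse_glob helper and the sample log line are hypothetical, and the exact matching rules CloudWatch applies may differ.

```python
import re


def parse_glob(message, pattern, names):
    """Extract one field per '*' wildcard; returns None when the message does not match."""
    parts = pattern.split("*")
    if len(parts) - 1 != len(names):
        raise ValueError("need exactly one field name per wildcard")
    # Literal text is escaped; each wildcard becomes a lazy capture group.
    regex = "(.*?)".join(re.escape(part) for part in parts)
    if pattern.endswith("*"):
        regex += "$"  # let a trailing wildcard capture the rest of the message
    match = re.search(regex, message)
    if match is None:
        return None
    return dict(zip(names, match.groups()))


# Hypothetical log line for illustration.
line = "[WARN] User alice failed to access resource /admin from IP 10.0.0.5"
fields = parse_glob(
    line,
    "[*] User * failed to access resource * from IP *",
    ["level", "user", "resource", "ip"],
)
```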

Regex parsing

fields @timestamp, @message
| parse @message /Request completed in (?<duration>\d+)ms with status (?<status>\d+)/
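
The same named-group pattern can be tried out locally before it goes into a query. A minimal Python sketch follows; note that Python's re module spells named groups (?P<name>...) rather than (?<name>...). The helper name and the sample message are made up for illustration.

```python
import re

# Same pattern as the query, using Python's (?P<name>...) named-group syntax.
COMPLETED = re.compile(
    r"Request completed in (?P<duration>\d+)ms with status (?P<status>\d+)"
)


def extract_request_fields(message):
    """Return duration (ms) and status as integers, or None if the line does not match."""
    match = COMPLETED.search(message)
    if match is None:
        return None
    return {"duration": int(match["duration"]), "status": int(match["status"])}


# Hypothetical log line for illustration.
parsed = extract_request_fields(
    "GET /api/users Request completed in 1250ms with status 200"
)
```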

Advanced techniques: calculated fields

fields @timestamp, requests, errors
| fields error_rate = (errors / requests) * 100

fields @timestamp, status_code
| fields status_category =
    case(status_code < 300, "success",
         status_code < 500, "client_error",
         status_code >= 500, "server_error")
| stats count() as request_count by status_category

Query optimization

[Figure: four query optimization rules: set a time range, filter before aggregating, limit results, and prefer aggregations]
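
When a query is run through the API instead of the console, the time range and the result limit are explicit request parameters. The sketch below assumes a client created with boto3.client("logs") and uses its start_query and get_query_results methods. The helper name, the polling interval, and the defaults are hypothetical choices, not recommendations from AWS.

```python
import time


def run_insights_query(client, log_group, query, hours=1, limit=20, poll_seconds=1.0):
    """Run a query over the last `hours` hours and return the rows as dictionaries.

    `client` is assumed to be boto3.client("logs").
    """
    end = int(time.time())
    start = end - hours * 3600
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start,  # epoch seconds; a narrow range scans less data
        endTime=end,
        queryString=query,
        limit=limit,  # cap the number of rows returned
    )["queryId"]

    # Queries run asynchronously, so poll until a terminal status is reported.
    while True:
        response = client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout", "Unknown"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, without AWS credentials.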

Anomaly detection and correlation

Statistical anomaly detection

Calculate baseline: avg and stddev over 7 days
Compare current period against baseline
Flag values > 2 standard deviations as anomalous
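
The three steps above can be sketched in a few lines of Python, applied to per-period counts exported from a query. This is a minimal illustration under stated assumptions: it uses the sample standard deviation, flags deviations in either direction, and the function name and numbers are made up.

```python
from statistics import mean, stdev


def find_anomalies(baseline, current, threshold=2.0):
    """Return (index, value) pairs from `current` that sit more than
    `threshold` standard deviations away from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two data points")
    avg = mean(baseline)
    sd = stdev(baseline)  # sample standard deviation (an assumption)
    if sd == 0:
        # A perfectly flat baseline: any different value is unusual.
        return [(i, v) for i, v in enumerate(current) if v != avg]
    return [(i, v) for i, v in enumerate(current) if abs(v - avg) > threshold * sd]


# Hypothetical error counts: a baseline window and the period being checked.
baseline_counts = [10, 12, 11, 9, 10, 13, 12]
current_counts = [11, 10, 25, 12]
anomalies = find_anomalies(baseline_counts, current_counts)
```

Here the baseline mean is 11 with a standard deviation of about 1.41, so only the value 25 is flagged.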

Error rate correlation

fields @timestamp, level
| stats count() as total_requests,
        sum(case(level = "ERROR", 1, 0)) as errors
        by bin(5m)
| fields error_rate = (errors / total_requests) * 100
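
For reference, the arithmetic behind error_rate is shown below as a small Python sketch: per time bucket, count all events and the ERROR events, then take the ratio as a percentage. The input shape and names are hypothetical, and the guard for empty buckets is an added assumption.

```python
def error_rate_by_bin(levels_by_bin):
    """Map each time bucket to its error rate in percent."""
    rates = {}
    for bucket, levels in levels_by_bin.items():
        total_requests = len(levels)
        errors = sum(1 for level in levels if level == "ERROR")
        # Avoid dividing by zero when a bucket has no events.
        rates[bucket] = (errors / total_requests) * 100 if total_requests else 0.0
    return rates


# Hypothetical log levels grouped by 5-minute bucket.
sample_levels = {
    "12:00": ["INFO", "ERROR", "INFO", "INFO"],
    "12:05": ["ERROR", "ERROR", "INFO", "WARN"],
    "12:10": [],
}
rates = error_rate_by_bin(sample_levels)
```

Comparing the rate rather than the raw error count separates a real regression from a plain increase in traffic.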

Summary

CloudWatch Logs Insights: interactive query service, no infrastructure needed
Three query languages: Logs Insights QL, OpenSearch PPL, OpenSearch SQL
Six troubleshooting scenarios: error spikes, slow endpoints, failed auth, DB timeouts, memory leaks, request tracing
Advanced techniques: parse for unstructured text, bin() for time-series, pct() for percentiles
Correlation and anomaly detection: error rate analysis, statistical baseline comparison
From investigation to automation: metric filters and alarms close the loop

Let's practice!