Logs Insights and analysis

Monitoring and troubleshooting AWS

John Q. Martin

Principal Consultant

Logs Insights query structure

 

 

[Figure: Logs Insights query structure. The fields, filter, stats, sort, and limit commands are chained with pipes.]
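The pipe-chained structure can be sketched in Python as a small helper that joins commands in order. `build_query` and its parameters are hypothetical names used only for illustration; the helper assembles query text and does not call any AWS API.

```python
def build_query(fields, filters=(), stats=None, sort=None, limit=None):
    """Chain Logs Insights commands with pipes: fields, filter, stats, sort, limit."""
    parts = ["fields " + ", ".join(fields)]
    parts += [f"filter {condition}" for condition in filters]
    if stats:
        parts.append(f"stats {stats}")
    if sort:
        parts.append(f"sort {sort}")
    if limit is not None:
        parts.append(f"limit {limit}")
    # Each command after the first starts on its own line with a leading pipe.
    return "\n| ".join(parts)


query = build_query(
    fields=["@timestamp", "@message", "level"],
    filters=['level = "ERROR"'],
    sort="@timestamp desc",
    limit=20,
)
print(query)
```

Keeping the commands in a fixed order makes it harder to build a query that aggregates before it filters.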

Query languages: same query, three ways

CloudWatch Logs Insights QL

fields @timestamp, @message, level
| filter level = "ERROR"
| filter @timestamp > ago(1h)
| sort @timestamp desc
| limit 20

OpenSearch PPL

source = `/aws/lambda/my-function`
| where level = 'ERROR' and @timestamp > ago(1h)
| fields @timestamp, @message, level
| sort - @timestamp
| head 20

Query languages: OpenSearch SQL

OpenSearch SQL

SELECT `@timestamp`, `@message`, level
FROM `/aws/lambda/my-function`
WHERE level = 'ERROR'
  AND `@timestamp` > ago(1h)
ORDER BY `@timestamp` DESC
LIMIT 20

 

  • Standard SELECT / FROM / WHERE syntax
  • Adds JOINs across log groups
  • Pick whichever fits your team

Scenario 1: Finding error spikes

fields @timestamp, @message, level
| filter level = "ERROR"
| stats count() as error_count by bin(5m)
| sort @timestamp desc

Enhanced: group by error type

fields @timestamp, error_type, error_message
| filter level = "ERROR"
| stats count() as count by error_type, bin(5m)
| sort count desc
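
What `bin(5m)` does to each timestamp can be sketched in plain Python: round the timestamp down to the start of its 5-minute bucket, then count per bucket. The function names and sample events here are hypothetical, and this is only an approximation of the service's behavior, not its implementation.

```python
from collections import Counter
from datetime import datetime, timezone

BIN_SECONDS = 5 * 60  # mirrors bin(5m)


def bin_start(ts, width=BIN_SECONDS):
    """Round a timezone-aware timestamp down to the start of its bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % width, tz=timezone.utc)


def count_errors_by_bin(events):
    """Count ERROR events per bucket, newest bucket first."""
    counts = Counter(
        bin_start(e["timestamp"]) for e in events if e["level"] == "ERROR"
    )
    return sorted(counts.items(), reverse=True)


# Hypothetical sample events
sample_events = [
    {"timestamp": datetime(2024, 1, 1, 12, 1, tzinfo=timezone.utc), "level": "ERROR"},
    {"timestamp": datetime(2024, 1, 1, 12, 4, tzinfo=timezone.utc), "level": "ERROR"},
    {"timestamp": datetime(2024, 1, 1, 12, 6, tzinfo=timezone.utc), "level": "ERROR"},
    {"timestamp": datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc), "level": "INFO"},
]
error_counts = count_errors_by_bin(sample_events)
```

A spike shows up as one bucket whose count is far above its neighbors.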

Scenario 2: Slow API endpoints

fields @timestamp, endpoint, response_time, status_code
| filter response_time > 1000
| stats avg(response_time) as avg_time,
        max(response_time) as max_time,
        count() as slow_requests
        by endpoint
| sort avg_time desc

Drill-down

fields @timestamp, endpoint, response_time, user_id, request_id
| filter endpoint = "/api/users" and response_time > 1000
| sort response_time desc
| limit 20

Scenario 3: Failed authentication

fields @timestamp, user_id, ip_address, action
| filter action = "login_failed"
| stats count() as failed_attempts by user_id, ip_address
| sort failed_attempts desc
| limit 50

Time-based analysis

fields @timestamp, user_id, ip_address
| filter action = "login_failed"
| stats count() as attempts by ip_address, bin(1h)
| filter attempts > 10
| sort attempts desc
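
The time-based query aggregates first and filters on the aggregate afterwards. A minimal Python sketch of that two-step logic follows; `suspicious_ips`, the sample data, and the documentation-range IP addresses are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def suspicious_ips(events, threshold=10):
    """Count failed logins per (ip, hour), then keep buckets above the threshold."""
    counts = Counter()
    for e in events:
        if e["action"] != "login_failed":
            continue
        hour = e["timestamp"].replace(minute=0, second=0, microsecond=0)
        counts[(e["ip_address"], hour)] += 1
    # The threshold applies to the aggregated count, not to individual events.
    flagged = [(ip, hour, n) for (ip, hour), n in counts.items() if n > threshold]
    return sorted(flagged, key=lambda row: row[2], reverse=True)


# Hypothetical data: 12 failures from one IP and 3 from another in the same hour.
base = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
login_events = [
    {"timestamp": base.replace(minute=m), "ip_address": "203.0.113.7", "action": "login_failed"}
    for m in range(12)
] + [
    {"timestamp": base.replace(minute=m), "ip_address": "198.51.100.4", "action": "login_failed"}
    for m in range(3)
]
flagged = suspicious_ips(login_events)
```

Only the first address exceeds 10 attempts in the hour, so it is the only one returned.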

Scenarios 4 & 5: Database timeouts and memory leaks

Database timeouts

fields @timestamp, @message
| filter @message like /database/ and @message like /timeout|error|failed/
| parse @message "timeout after * seconds" as timeout_duration
| stats count() as timeout_count,
        avg(timeout_duration) as avg_timeout
        by bin(5m)

Memory leak detection

fields @timestamp, memory_used_mb, heap_size_mb
| stats avg(memory_used_mb) as avg_memory,
        max(memory_used_mb) as max_memory
        by bin(1h)
| sort @timestamp asc
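
Once the hourly averages are exported, one way to judge whether memory is steadily climbing is a least-squares slope over the buckets. This is a sketch under assumptions: the function names, the sample series, and the 5 MB-per-hour threshold are hypothetical and would need tuning for a real workload.

```python
def slope(values):
    """Least-squares slope of values against their index (units per bucket)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(hourly_avg_mb, min_growth_mb_per_hour=5.0):
    """Flag a series whose average memory grows at least this fast (assumed threshold)."""
    return slope(hourly_avg_mb) >= min_growth_mb_per_hour


# Hypothetical hourly avg_memory values, oldest first
steady = [512, 515, 510, 514, 511, 513]
leaking = [512, 530, 551, 569, 590, 611]
```

A sustained positive slope is only a hint; a growing cache or rising traffic can produce the same shape.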

Scenario 6: Request tracing

fields @timestamp, @message, request_id, service, action
| filter request_id = "abc123def456"
| sort @timestamp asc

Multi-service trace

fields @timestamp, service, action, duration_ms, status
| filter request_id = "abc123def456"
| sort @timestamp asc
| display @timestamp, service, action, duration_ms, status

Advanced techniques: parse and regex

Parsing unstructured logs

fields @timestamp, @message
| parse @message "[*] User * failed to access resource * from IP *"
    as level, user, resource, ip

Regex parsing

fields @timestamp, @message
| parse @message /Request completed in (?<duration>\d+)ms with status (?<status>\d+)/
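
To see what the two parse styles extract, both can be approximated with Python's `re` module. `glob_to_regex` is a hypothetical helper that treats each `*` as a lazily matched capture; the service's own matching rules may differ. Note that Python spells named groups `(?P<name>...)` rather than `(?<name>...)`.

```python
import re


def glob_to_regex(pattern, names):
    """Turn a parse-style glob into a regex with one named group per *."""
    literals = pattern.split("*")
    if len(literals) - 1 != len(names):
        raise ValueError("need exactly one name per *")
    parts = [re.escape(literals[0])]
    for name, literal in zip(names, literals[1:]):
        parts.append(f"(?P<{name}>.*?)")
        parts.append(re.escape(literal))
    # Anchor the end so a trailing * captures the rest of the line.
    return re.compile("".join(parts) + "$")


access_re = glob_to_regex(
    "[*] User * failed to access resource * from IP *",
    ["level", "user", "resource", "ip"],
)
# Hypothetical log line
access = access_re.search(
    "[WARN] User alice failed to access resource /admin from IP 10.0.0.5"
).groupdict()

duration_re = re.compile(
    r"Request completed in (?P<duration>\d+)ms with status (?P<status>\d+)"
)
timing = duration_re.search("Request completed in 245ms with status 200").groupdict()
```

Extracted values are strings here, so convert them (for example with `int(timing["duration"])`) before doing arithmetic.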

Advanced techniques: calculated fields

 

fields @timestamp, requests, errors
| fields (errors / requests) * 100 as error_rate

fields @timestamp, status_code
| fields case(status_code < 300, "success",
              status_code < 400, "redirect",
              status_code < 500, "client_error",
              status_code >= 500, "server_error") as status_category
| stats count() as request_count by status_category

Query optimization

 

Four query optimization rules:
  • Set a time range
  • Filter before aggregating
  • Limit results
  • Prefer aggregations

Anomaly detection and correlation

Statistical anomaly detection

  • Calculate baseline: avg and stddev over 7 days
  • Compare current period against baseline
  • Flag values > 2 standard deviations as anomalous
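
The three steps above can be sketched with Python's `statistics` module. The function name and sample counts are hypothetical, and treating deviations in either direction as anomalous is an assumption, since the steps do not specify a direction.

```python
from statistics import mean, stdev


def find_anomalies(baseline, current, threshold=2.0):
    """Return current values more than `threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two points")
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    if sigma == 0:
        # A flat baseline makes any different value anomalous.
        return [v for v in current if v != mu]
    return [v for v in current if abs(v - mu) / sigma > threshold]


# Hypothetical daily error counts: a 7-day baseline and the current period
baseline_counts = [10, 12, 11, 9, 13, 10, 12]
current_counts = [11, 13, 25]
anomalies = find_anomalies(baseline_counts, current_counts)
```

With this baseline (mean 11, standard deviation about 1.41), only the value 25 is flagged.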

Error rate correlation

fields @timestamp, level
| stats count() as total_requests,
        sum(case(level = "ERROR", 1, 0)) as errors
        by bin(5m)
| fields (errors / total_requests) * 100 as error_rate
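
The same per-bucket calculation in Python, with a guard for empty buckets. The bucket labels and totals are hypothetical, and returning 0.0 for an empty bucket is a design choice of this sketch rather than something the query does.

```python
def error_rate(total_requests, errors):
    """Error percentage for one time bucket; 0.0 when the bucket has no requests."""
    if total_requests == 0:
        return 0.0
    return 100 * errors / total_requests


# Hypothetical per-bucket totals: (bucket label, total_requests, errors)
buckets = [("12:00", 200, 4), ("12:05", 180, 45), ("12:10", 0, 0)]
rates = {label: error_rate(total, errs) for label, total, errs in buckets}
```

Comparing the rate rather than the raw error count keeps a traffic surge from looking like an error spike.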

Summary

 

  • CloudWatch Logs Insights: interactive query service, no infrastructure needed
  • Three query languages: Logs Insights QL, OpenSearch PPL, OpenSearch SQL
  • Six troubleshooting scenarios: error spikes, slow endpoints, failed auth, DB timeouts, memory leaks, request tracing
  • Advanced techniques: parse for unstructured text, bin() for time-series, pct() for percentiles
  • Correlation and anomaly detection: error rate analysis, statistical baseline comparison
  • From investigation to automation: metric filters and alarms close the loop

Let's practice!
