Handling the event lifecycle: retries, DLQs and destinations

Serverless Applications with AWS Lambda

Claudio Canales

Senior DevOps Engineer

The event lifecycle at a glance

Lambda invokes your handler.
The handler either succeeds or fails.
When it fails, retries and routing decide what happens next.

Event lifecycle flow

Two ways failures show up

Synchronous

Caller waits and receives an error response.

Asynchronous

Caller is acknowledged first.
Lambda retries in the background.
Failure handling depends on the invocation mode.
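
A minimal sketch of how the two modes look from the caller's side, assuming the caller uses boto3. The helper names (build_invoke_args, describe_response, invoke) are hypothetical; InvocationType, StatusCode and FunctionError are fields of the Lambda Invoke API.

```python
import json


def build_invoke_args(function_name, payload, asynchronous):
    """Build keyword arguments for boto3's Lambda `invoke` call."""
    return {
        "FunctionName": function_name,
        # "RequestResponse" waits for the result; "Event" only queues the event.
        "InvocationType": "Event" if asynchronous else "RequestResponse",
        "Payload": json.dumps(payload).encode("utf-8"),
    }


def describe_response(response):
    """Interpret the parts of an Invoke response that signal the outcome."""
    if response.get("StatusCode") == 202:
        # Asynchronous: Lambda accepted the event; the outcome is not known yet.
        return "accepted for asynchronous processing"
    if "FunctionError" in response:
        # Synchronous: the function failed and the caller sees it directly.
        return "function error returned to caller"
    return "success"


def invoke(function_name, payload, asynchronous=False):
    """Invoke a function and describe the outcome (needs AWS credentials)."""
    import boto3  # imported here so the helpers above work without AWS access

    client = boto3.client("lambda")
    response = client.invoke(
        **build_invoke_args(function_name, payload, asynchronous)
    )
    return describe_response(response)
```

In the asynchronous case the caller never sees the function's error, which is why the retry and routing settings in the rest of this section matter.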

Sync vs async failure paths

Retries are normal

Retries are often a feature, not a bug.
A transient failure might succeed on the next attempt.
Retries can cause duplicate processing; your handler must account for it.

Retry redialing analogy

Retries over time

Retries can recover from transient issues.
But the same event may run multiple times.
Idempotency and clear error handling are essential.

Retry attempts timeline

When retries are dangerous

Retries are risky when work is not idempotent.
Examples: charging a card, sending an email.
Use idempotency keys and safe updates so duplicates do not cause harm.
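
One way to picture an idempotency key, as a sketch only. The event fields (idempotency_key, order_id, amount), the charge_card function and the in-memory store are all assumptions made for illustration; a real function would need a durable store shared across invocations.

```python
# In-memory stand-ins, for illustration only. Lambda execution environments
# do not share memory, so a real function would use a durable store.
_processed = {}
charges_made = []


def charge_card(order_id, amount):
    """Hypothetical side effect that must not run twice for one order."""
    charges_made.append((order_id, amount))
    return {"order_id": order_id, "charged": amount}


def handler(event, context):
    key = event["idempotency_key"]  # assumed field supplied by the producer
    if key in _processed:
        # Duplicate delivery: return the stored result, skip the side effect.
        return _processed[key]
    result = charge_card(event["order_id"], event["amount"])
    _processed[key] = result
    return result
```

Calling handler twice with the same event charges the card once. The check-then-write here is not atomic; with a real store, a conditional write that fails when the key already exists would close that gap.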

Idempotency goal

DLQ (Dead-Letter Queue)

A safe place for events that still fail after retries.
Often an SQS queue, AWS's managed message queue.
Inspect the payload, fix the issue, and re-drive.
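
A rough sketch of a re-drive step, assuming the DLQ is an SQS queue and the root cause is already fixed. The redrive function is hypothetical; it takes client objects as parameters (boto3 clients in practice) and uses the SQS receive_message and delete_message calls and the Lambda invoke call.

```python
def redrive(sqs, lambda_client, queue_url, function_name, max_messages=10):
    """Re-send messages from an SQS dead-letter queue to a Lambda function.

    `sqs` and `lambda_client` are expected to behave like boto3 clients.
    Returns the number of messages re-driven in this batch.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=1,
    )
    count = 0
    for message in response.get("Messages", []):
        # For a Lambda DLQ, the message body holds the original event payload.
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",
            Payload=message["Body"].encode("utf-8"),
        )
        # Delete only after Lambda has accepted the re-sent event.
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        count += 1
    return count
```

With boto3 this could be called as redrive(boto3.client("sqs"), boto3.client("lambda"), queue_url, function_name), where the queue URL and function name are your own. Re-driven events can fail again, so the handler should still tolerate duplicates.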

DLQ lost-and-found analogy

DLQ vs destinations

DLQ lost-and-found analogy

Captures failed events after retries.
Use for investigation.

Destinations routing diagram

Route outcomes on success or failure.
Build explicit success and failure paths.

Destinations: success and failure routes

On success, Lambda sends a record of the invocation to the onSuccess destination.
On failure, Lambda sends the error details to the onFailure destination.
This makes the next step explicit.
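
A sketch of wiring both routes with boto3's put_function_event_invoke_config. The helper names and the ARNs passed in are placeholders, not values from this course.

```python
def destination_config(on_success_arn, on_failure_arn):
    """Build the DestinationConfig structure for asynchronous invocations."""
    return {
        "OnSuccess": {"Destination": on_success_arn},
        "OnFailure": {"Destination": on_failure_arn},
    }


def configure_destinations(function_name, on_success_arn, on_failure_arn):
    """Apply the destinations to a function (needs AWS credentials)."""
    import boto3  # imported here so destination_config works without AWS access

    client = boto3.client("lambda")
    return client.put_function_event_invoke_config(
        FunctionName=function_name,
        DestinationConfig=destination_config(on_success_arn, on_failure_arn),
    )
```

A destination can be an SQS queue, an SNS topic, an EventBridge event bus, or another Lambda function. The put call replaces the function's whole asynchronous invocation configuration, so any retry settings left out of the call are reset.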

Destinations routing diagram

Tuning retry policy

Tune how many times Lambda retries an asynchronous invocation (0 to 2; the default is 2).
Limit event age (60 seconds to 6 hours) to avoid processing stale data.
More retries improve reliability but increase duplicates and delay.
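
A small sketch of setting both controls for asynchronous invocations, assuming boto3. The helper names are hypothetical; MaximumRetryAttempts and MaximumEventAgeInSeconds are the parameter names used by the Lambda API, and the ranges below reflect the service limits as I understand them, so check them against the current documentation.

```python
RETRY_RANGE = (0, 2)          # retry attempts allowed for async invocations
EVENT_AGE_RANGE = (60, 21600)  # seconds: one minute up to six hours


def retry_policy(max_retries, max_event_age_seconds):
    """Validate the two settings and return them as API keyword arguments."""
    if not RETRY_RANGE[0] <= max_retries <= RETRY_RANGE[1]:
        raise ValueError(f"max_retries must be within {RETRY_RANGE}")
    if not EVENT_AGE_RANGE[0] <= max_event_age_seconds <= EVENT_AGE_RANGE[1]:
        raise ValueError(f"max_event_age_seconds must be within {EVENT_AGE_RANGE}")
    return {
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
    }


def apply_retry_policy(function_name, max_retries, max_event_age_seconds):
    """Apply the policy to a function (needs AWS credentials)."""
    import boto3  # imported here so retry_policy works without AWS access

    client = boto3.client("lambda")
    # put replaces any existing asynchronous invocation configuration.
    return client.put_function_event_invoke_config(
        FunctionName=function_name,
        **retry_policy(max_retries, max_event_age_seconds),
    )
```

For example, retry_policy(1, 3600) describes one retry and a one-hour age limit: fewer duplicate runs than the default, at the cost of giving up on an event sooner.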

Retry policy controls

Maximum event age: an expiration date

Maximum event age is an expiration policy.
If an event is too old, processing it may be pointless.
The trade-off: a shorter limit discards more events, but the events that do run are still timely.

Event age expiration analogy

Observability: where to look

Logs answer what happened.
Metrics answer how often it is happening.
Alarms help you catch spikes quickly.
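
As an illustration of the alarm idea, a sketch that alarms on the function's Errors metric in CloudWatch using boto3's put_metric_alarm. The helper names, the alarm name and the default threshold are assumptions chosen for the example.

```python
def error_alarm_args(function_name, threshold=1, period_seconds=60):
    """Build put_metric_alarm arguments for a function's Errors metric."""
    return {
        "AlarmName": f"{function_name}-errors",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Periods with no invocations report no data; do not alarm on them.
        "TreatMissingData": "notBreaching",
    }


def create_error_alarm(function_name, threshold=1, period_seconds=60):
    """Create or update the alarm (needs AWS credentials)."""
    import boto3  # imported here so error_alarm_args works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        **error_alarm_args(function_name, threshold, period_seconds)
    )
```

The alarm tells you that errors are happening and how often; the function's logs are still where you find out what went wrong.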

Logs vs metrics

What to do with failed events

Inspect the payload and error.
Fix the root cause.
Re-drive the event, then monitor errors and throughput.
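
For the inspection step, a sketch that pulls the useful fields out of a failure record, assuming the failed event arrived through an onFailure destination. The summarize_failure helper is hypothetical, and the field names follow the destination record format as I understand it, so verify them against a real record.

```python
def summarize_failure(record):
    """Extract the fields most useful for triage from a failure record."""
    request = record.get("requestContext", {})
    response = record.get("responsePayload")
    if not isinstance(response, dict):
        response = {}  # the payload may be missing or not a JSON object
    return {
        "request_id": request.get("requestId"),
        # Why Lambda gave up, for example retries exhausted or event too old.
        "condition": request.get("condition"),
        "attempts": request.get("approximateInvokeCount"),
        "error_type": response.get("errorType"),
        "error_message": response.get("errorMessage"),
        # The original event, which is what you would re-drive after a fix.
        "payload": record.get("requestPayload"),
    }
```

The summary separates the two questions from the list above: the error fields point at the root cause, and the payload is what gets re-driven once that cause is fixed.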

Failed events recovery cycle

Key takeaways

Reliability comes from retries, routing, and observability.
Synchronous errors reach the caller.
Asynchronous errors need retries plus DLQs or destinations.
Together, these keep failures visible instead of silently dropped.

Reliability formula diagram

Let's practice!
