Designing fault-tolerant and resilient applications on AWS

Developing applications on AWS

Ricardo Sueiras

Principal Technologist

Building for resilience

Everything will fail.

Temporary failures

Understand the common failure modes:
Temporary errors: go away on their own and are usually safe to retry.

Timeouts

Timeouts: external service takes too long to respond.

Permanent errors

Permanent errors: request is fundamentally broken, retrying will not help.

API limits

API rate limits: too many requests in a given period; expect throttling responses such as HTTP 429 Too Many Requests.
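
The four failure modes above can be told apart in code before deciding what to do next. The following is a minimal sketch, assuming the dependency speaks HTTP; the names `FailureMode`, `classify`, and `should_retry` are hypothetical, and the status-code mapping is a simplification rather than a complete rule set.

```python
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    TEMPORARY = "temporary"
    TIMEOUT = "timeout"
    PERMANENT = "permanent"
    RATE_LIMITED = "rate_limited"


def classify(status_code: Optional[int], timed_out: bool = False) -> Optional[FailureMode]:
    """Map an HTTP outcome to a failure mode; None means the call did not fail."""
    # Assumption: the caller sets timed_out when no response arrived in time.
    if timed_out or status_code in (408, 504):
        return FailureMode.TIMEOUT
    if status_code == 429:
        return FailureMode.RATE_LIMITED
    if status_code is not None and 500 <= status_code < 600:
        # Simplification: treat remaining server-side errors as temporary.
        return FailureMode.TEMPORARY
    if status_code is not None and 400 <= status_code < 500:
        # The request itself is broken; sending it again will not help.
        return FailureMode.PERMANENT
    return None


def should_retry(mode: Optional[FailureMode]) -> bool:
    """Only temporary errors, timeouts, and throttling are worth retrying."""
    return mode in {FailureMode.TEMPORARY, FailureMode.TIMEOUT, FailureMode.RATE_LIMITED}
```

For example, `should_retry(classify(503))` is true, while `should_retry(classify(404))` is false.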

Retry strategies

Failures in distributed systems are often temporary.
A retried request often succeeds.
Add retry logic carefully: blind retries can make things worse.
Exponential backoff: increase the delay between retries.
Jitter: add randomness to avoid retry spikes.
Retry limits: prevent infinite loops.
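
The three ideas above (exponential backoff, jitter, retry limits) fit in one small helper. This is a sketch under stated assumptions, not a production implementation: `TransientError`, `retry_with_backoff`, and the default values are hypothetical, and the jitter shown picks a random delay between zero and the current backoff cap.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Retry limit: give up after the last allowed attempt.
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: the cap doubles on every failed attempt.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait a random fraction of the cap to spread out retries.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the delays can be replaced in tests; real callers would leave the defaults.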

AWS SDK native capabilities

AWS SDKs handle retry logic automatically.
Built-in retry behavior includes exponential backoff and jitter.
Failed or throttled requests are retried with increasing delays.
Reduces the amount of error handling code you need to write.
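
The SDK's built-in retries can usually be tuned rather than rewritten. The sketch below assumes the Python SDK (boto3); the helper name `make_s3_client`, the choice of Amazon S3, and the specific values are illustrative assumptions. The `retries` keys and timeout parameters are options of `botocore.config.Config`.

```python
# Retry settings kept as plain data so they are easy to review.
# "total_max_attempts" counts the initial request plus retries.
RETRY_SETTINGS = {"total_max_attempts": 5, "mode": "standard"}


def make_s3_client():
    """Create an S3 client with explicit retry and timeout settings."""
    # Imported lazily so this module still loads where the SDK is not installed.
    import boto3
    from botocore.config import Config

    config = Config(
        retries=RETRY_SETTINGS,
        connect_timeout=5,   # seconds to wait while opening a connection
        read_timeout=10,     # seconds to wait for a response
    )
    return boto3.client("s3", config=config)
```

The retry mode can be `legacy`, `standard`, or `adaptive`; which one suits a workload is a design decision this sketch does not make for you.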

Managing retry logic

Retry logic isn't always the answer.
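HTTP 4xx errors: the server understood your request and rejected it, so resending it unchanged will fail again (throttling responses such as 429 are the exception).
Non-idempotent requests: retrying can cause duplicate or inconsistent actions.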
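
One common way to make a non-idempotent request safe to retry is to attach a client-generated token and have the receiver deduplicate on it. The sketch below is an in-memory illustration only; `OrderService`, `create_order`, and the token format are hypothetical and do not represent any specific AWS API.

```python
import uuid


class OrderService:
    """Toy service that remembers which tokens it has already handled."""

    def __init__(self):
        self._orders_by_token = {}

    def create_order(self, token: str, item: str) -> str:
        # A repeated token means this is a retry: return the earlier result
        # instead of creating a second order.
        if token in self._orders_by_token:
            return self._orders_by_token[token]
        order_id = f"order-{len(self._orders_by_token) + 1}"
        self._orders_by_token[token] = order_id
        return order_id

    def order_count(self) -> int:
        return len(self._orders_by_token)


service = OrderService()
token = str(uuid.uuid4())                      # generated once per logical request
first = service.create_order(token, "book")
retry = service.create_order(token, "book")    # same token, so no duplicate order
```

Here `first` and `retry` hold the same order ID, and the service has created exactly one order.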

Managing timeouts

Retrying without limits can amplify problems under load.
Combine retries with timeouts.
A timeout sets the max time you'll wait for a response.
Every external service call should have a timeout.
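
Combining retries with timeouts often means giving each attempt its own timeout while also capping the total time spent. The following is a minimal sketch of that idea; `call_with_deadline`, `DeadlineExceeded`, and the parameter names are hypothetical, and `operation` is assumed to accept a timeout in seconds and raise `TimeoutError` when it is exceeded.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the time budget or the attempt limit is used up."""


def call_with_deadline(operation, per_attempt_timeout, total_budget,
                       max_attempts=3, clock=time.monotonic):
    """Retry operation(timeout) without ever exceeding total_budget seconds."""
    deadline = clock() + total_budget
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # the overall budget is spent, so stop retrying
        try:
            # Never wait longer than what is left of the overall budget.
            return operation(min(per_attempt_timeout, remaining))
        except TimeoutError as exc:
            last_error = exc
    raise DeadlineExceeded("no response within the time budget") from last_error
```

The `clock` parameter is there so the helper can be exercised without real waiting.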

Circuit breakers

When a service keeps failing, retries alone aren't enough.
Continuing to send requests can overload the failing service.
The circuit breaker pattern temporarily stops requests to an unhealthy dependency.
Closed state: requests flow normally.

When failures exceed a threshold, the circuit opens.
Subsequent requests are blocked.

After a cooldown, the circuit enters a half-open state to test recovery.
Service responds successfully: the circuit closes and normal operation resumes.
Still failing: the circuit reopens and the cooldown starts again.
Implement circuit breakers at the application level.
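
The closed, open, and half-open states described above can be sketched as a small application-level class. This is an illustration under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and the default threshold and cooldown are hypothetical, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a request is blocked because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown:
            return "half-open"  # cooldown elapsed: allow a trial request
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency marked unhealthy; request blocked")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial request, or too many failures while closed,
            # opens the circuit and restarts the cooldown.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each dependency call as `breaker.call(lambda: ...)` and treat `CircuitOpenError` as a signal to fail fast or fall back.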

Dead letter queues

Messages that keep failing can block the system.
Move them out of the main processing flow.
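Failed messages land in a dead-letter queue (DLQ).
Key part of building resilient applications.
However, understand the trade-offs: for example, messages in the DLQ still need to be monitored and either reprocessed or discarded.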
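
The flow above can be illustrated with an in-memory stand-in for a queue. This is a sketch only: `process_queue` and `max_receives` are hypothetical names, and a managed service such as Amazon SQS does the equivalent through a redrive policy with a `maxReceiveCount` setting rather than application code.

```python
from collections import deque


def process_queue(messages, handler, max_receives=3):
    """Process messages; return those that failed max_receives times."""
    main_queue = deque((message, 0) for message in messages)
    dead_letters = []
    while main_queue:
        message, receives = main_queue.popleft()
        try:
            handler(message)
        except Exception:
            receives += 1
            if receives >= max_receives:
                # Stop retrying: park the message so it no longer blocks others.
                dead_letters.append(message)
            else:
                # Put it back at the end of the queue for another attempt.
                main_queue.append((message, receives))
    return dead_letters
```

Healthy messages are processed once, while a message that always fails is attempted `max_receives` times and then set aside for inspection.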

AWS API limits

You interact with AWS services via APIs.
Each service defines its own API rate limits.
Exceeding limits triggers throttling, for example HTTP 429 Too Many Requests or a service-specific throttling error.
Design your application to handle throttling gracefully.
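
Besides retrying throttled calls with backoff, an application can pace its own requests so it rarely hits the limit in the first place. The following token bucket is one possible sketch of that; `TokenBucket`, `rate`, and `capacity` are hypothetical names, and the right values depend on the limits of the service being called.

```python
import time


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        """Return True if a request may be sent now, False if it should wait."""
        now = self._clock()
        # Refill in proportion to the time elapsed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

When `try_acquire()` returns false, the caller can delay, queue, or drop the request instead of sending it and being throttled.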

Integrating with third-party services

Third-party services introduce uncertainty.
You have no control over their performance or availability.
Set timeouts to prevent long waits.
Use retries with backoff for temporary issues.
Isolate dependencies to limit blast radius.

Consider shifting from synchronous to asynchronous communication.
Your system keeps processing even if the external service is delayed.
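
The shift to asynchronous communication can be sketched with a local queue and a background worker. This is an in-process illustration only; `start_worker` and `send_to_third_party` are hypothetical names, and a real system would more likely use a managed queue between the two sides.

```python
import queue
import threading


def start_worker(jobs, send_to_third_party, results, failures):
    """Drain `jobs` in the background so callers never wait on the third party."""

    def run():
        while True:
            job = jobs.get()
            if job is None:  # sentinel: no more work
                break
            try:
                results.append(send_to_third_party(job))
            except Exception as exc:
                # Record the failure instead of crashing the worker.
                failures.append((job, exc))

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

The caller only does `jobs.put(job)` and moves on; slow or failed calls to the external service are absorbed by the worker.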

Let's practice!
