Designing fault-tolerant and resilient applications on AWS
Developing applications on AWS
Ricardo Sueiras
Principal Technologist
Building for resilience
Temporary failures
Understand the common failure modes:
Temporary errors: go away on their own, usually safe to retry.
Timeouts
Timeouts: external service takes too long to respond.
Permanent errors
Permanent errors: request is fundamentally broken, retrying will not help.
API limits
API rate limits: too many requests, expect HTTP 429 Too Many Requests.
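One way to make these failure modes actionable is to map each error to a retry decision. The sketch below is illustrative only: the names `classify_status` and `should_retry`, and the exact grouping of status codes, are assumptions rather than anything the course prescribes.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse failure mode (assumed grouping)."""
    if status_code == 429:
        return "rate_limited"   # too many requests: back off, then retry
    if status_code in (408, 504):
        return "timeout"        # request or gateway timeout: retry with a limit
    if status_code in (500, 502, 503):
        return "temporary"      # often transient: usually safe to retry
    if 400 <= status_code < 500:
        return "permanent"      # the request itself is broken: do not retry
    return "ok" if status_code < 400 else "unknown"


RETRYABLE = {"rate_limited", "timeout", "temporary"}


def should_retry(status_code: int) -> bool:
    """Retry only the failure modes that can clear up on their own."""
    return classify_status(status_code) in RETRYABLE
```

With this grouping, `should_retry(503)` is true while `should_retry(404)` is false, which keeps permanent errors out of the retry path.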
Retry strategies
Failures in distributed systems are often temporary.
A retried request often succeeds.
Add retry logic carefully: blind retries can make things worse.
Exponential backoff: increase the delay between retries.
Jitter: add randomness to avoid retry spikes.
Retry limits: prevent infinite loops.
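The three ideas above (exponential backoff, jitter, and a retry limit) can be combined in a few lines. This is a minimal sketch under stated assumptions: `TemporaryError` and `retry_with_backoff` are hypothetical names, the default delays are arbitrary, and the jitter variant shown ("full jitter", a random delay between zero and the current cap) is one common choice among several.

```python
import random
import time


class TemporaryError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying TemporaryError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TemporaryError:
            if attempt == max_attempts:
                raise  # retry limit reached: give up instead of looping forever
            # Exponential backoff: the cap doubles on every attempt.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount between 0 and the cap.
            sleep(random.uniform(0, cap))
```

Only `TemporaryError` is retried, so permanent errors surface immediately. The `sleep` parameter exists so the delay can be replaced in tests.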
AWS SDK native capabilities
AWS SDKs handle retry logic automatically.
Built-in retry logic includes exponential backoff and jitter.
Failed or throttled requests are retried with increasing delays.
Reduces the amount of error handling code you need to write.
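In the Python SDK (boto3), the built-in retry behavior can be tuned through a `Config` object. The sketch below assumes boto3 is installed and credentials are configured. The helper name `build_client`, the default region, and the chosen values are illustrative assumptions.

```python
# Retry settings in the shape botocore's Config accepts.
# "standard" is one of the SDK's retry modes (the others are "legacy" and "adaptive").
RETRY_SETTINGS = {"max_attempts": 5, "mode": "standard"}


def build_client(service_name: str, region_name: str = "us-east-1"):
    """Create a boto3 client with explicit retry settings (hypothetical helper)."""
    # Imported here so this sketch loads even where boto3 is not installed.
    import boto3
    from botocore.config import Config

    return boto3.client(
        service_name,
        region_name=region_name,
        config=Config(retries=RETRY_SETTINGS),
    )
```

A call such as `build_client("s3")` would then return a client whose failed or throttled requests are retried by the SDK itself, without custom retry code.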
Managing retry logic
Retry logic isn't always the answer.
HTTP 4xx errors: the server rejected your request, so retrying it unchanged will not help (429 throttling is the exception).
Non-idempotent requests: retrying can cause duplicate or inconsistent actions.
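A common way to make a non-idempotent request safe to retry is to attach an idempotency key, so the receiver can recognize a repeat. The sketch below is an in-memory illustration only: `PaymentService`, `charge`, and the receipt format are hypothetical, and a real service would persist the keys.

```python
import uuid


class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self):
        self.charges = []    # side effects that must not be duplicated
        self._results = {}   # idempotency key -> result of the first attempt

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self._results:
            # A retry of a request we already handled: return the saved result.
            return self._results[idempotency_key]
        self.charges.append(amount)
        receipt = f"receipt-{len(self.charges)}"
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
key = str(uuid.uuid4())              # generated once per logical request
first = service.charge(key, 100)
second = service.charge(key, 100)    # simulated retry after a lost response
```

Both calls return the same receipt and only one charge is recorded, so the retry cannot create a duplicate action.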
Managing timeouts
Retrying without limits can amplify problems under load.
Combine retries with timeouts.
A timeout sets the max time you'll wait for a response.
Every external service call should have a timeout.
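As a rough illustration of bounding how long a caller waits, the sketch below wraps any function call in a timeout using a worker thread. The names `call_with_timeout` and `slow_service` are hypothetical. For real HTTP calls, the client library's own timeout setting is usually the better tool.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_seconds, *args):
    """Run func(*args) and stop waiting after timeout_seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        # Raises concurrent.futures.TimeoutError if no result arrives in time.
        return future.result(timeout=timeout_seconds)
    finally:
        # Do not block on the worker; note that the call itself keeps running.
        executor.shutdown(wait=False)


def slow_service():
    time.sleep(0.5)   # stand-in for an external call that responds too slowly
    return "late response"
```

Calling `call_with_timeout(slow_service, 0.05)` raises a timeout error instead of blocking for the full half second. The caller stops waiting, but the underlying work is not cancelled.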
Circuit breakers
When a service keeps failing, retries alone aren't enough.
Continuing to send requests can overload the failing service.
The circuit breaker pattern temporarily stops requests to an unhealthy dependency.
Closed state: requests flow normally.
When failures exceed a threshold, the circuit opens.
Subsequent requests are blocked.
After a cooldown, the circuit enters a half-open state to test recovery.
Service responds successfully: the circuit closes and normal operation resumes.
Still failing: the circuit returns to the open state.
Implement circuit breakers at the application level.
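The closed, open, and half-open states described above can be sketched as a small application-level class. This is a minimal, single-threaded illustration: the class and parameter names (`CircuitBreaker`, `failure_threshold`, `cooldown_seconds`) and the default values are assumptions, and a production version would need locking and per-dependency tuning.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker blocks a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("dependency marked unhealthy")
            self.state = "half_open"   # cooldown elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        # Success: close the circuit and reset the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

While the circuit is open, callers get `CircuitOpenError` immediately, which spares the failing service and lets the application fall back quickly.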
Dead letter queues
Messages that keep failing can block the system.
Move them out of the main processing flow.
Failed messages land in a Dead Letter Queue.
Key part of building resilient applications.
However, understand the trade-offs.
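To illustrate the flow, here is an in-memory sketch of a consumer that gives each message a limited number of attempts before moving it to a dead letter queue. The function name `process_queue` is hypothetical, and `max_receive_count` is an assumed parameter loosely modeled on the receive-count setting used by queue services such as Amazon SQS. No real queue service is involved.

```python
from collections import deque


def process_queue(messages, handler, max_receive_count=3):
    """Process messages; move ones that keep failing to a dead letter queue."""
    main_queue = deque((message, 0) for message in messages)
    dead_letter_queue = []
    processed = []
    while main_queue:
        message, receive_count = main_queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception:
            receive_count += 1
            if receive_count >= max_receive_count:
                # Stop retrying and keep the message for later inspection.
                dead_letter_queue.append(message)
            else:
                # Put it back at the end of the main queue to retry later.
                main_queue.append((message, receive_count))
    return processed, dead_letter_queue
```

The healthy messages are processed even though one message keeps failing, and the failing message is preserved rather than dropped.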
AWS API limits
You interact with AWS services via APIs.
Each service defines its own API rate limits.
Exceeding limits triggers throttling, for example HTTP 429 Too Many Requests or a service-specific throttling error.
Design your application to handle throttling gracefully.
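Besides retrying throttled calls with backoff, an application can pace its own requests so it rarely hits the limit in the first place. The token bucket below is one possible sketch. The class name, `rate`, and `capacity` are illustrative assumptions and are not tied to any specific AWS quota.

```python
import time


class TokenBucket:
    """Client-side limiter: allow at most `rate` calls per second on average."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should wait, queue, or shed the request
```

A caller checks `try_acquire()` before each API request. When it returns false, the request can be delayed or queued instead of being sent and throttled.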
Integrating with third-party services
Third-party services introduce uncertainty.
You have no control over their performance or availability.
Set timeouts to prevent long waits.
Use retries with backoff for temporary issues.
Isolate dependencies to limit blast radius.
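One way to isolate a dependency is to cap how many calls to it may be in flight at once, often called a bulkhead. The sketch below is an assumption about how that could look in application code. `Bulkhead` and `BulkheadFullError` are hypothetical names.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the dependency already has its maximum concurrent calls."""


class Bulkhead:
    def __init__(self, max_concurrent_calls):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, operation):
        # Fail fast instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many calls in flight")
        try:
            return operation()
        finally:
            self._slots.release()
```

If the third-party service slows down, only the limited number of slots is tied up, so the rest of the application keeps its threads and connections for other work.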
Integrating with third parties
Consider shifting from synchronous to asynchronous communication.
Your system keeps processing even if the external service is delayed.
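The shift to asynchronous communication can be sketched with an in-process queue and a background worker. In a real AWS application a managed queue would typically play this role. Here `queue.Queue` is only a stand-in, and `place_order` and `send_to_external_service` are hypothetical names.

```python
import queue
import threading

outbox = queue.Queue()
delivered = []


def send_to_external_service(notification):
    """Hypothetical stand-in for a slow third-party call."""
    delivered.append(notification)


def worker():
    while True:
        notification = outbox.get()
        if notification is None:   # sentinel: stop the worker
            outbox.task_done()
            break
        send_to_external_service(notification)
        outbox.task_done()


def place_order(order_id):
    # Enqueue the slow work and return immediately.
    outbox.put(f"order {order_id} confirmed")
    return "accepted"


thread = threading.Thread(target=worker, daemon=True)
thread.start()
status = place_order(42)
outbox.join()      # only for the demo: wait until the worker has caught up
outbox.put(None)   # ask the worker to stop
thread.join()
```

`place_order` returns as soon as the message is queued, so a delay in the external service slows the worker but not the caller.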
Let's practice!