Designing fault-tolerant and resilient applications on AWS

Developing applications on AWS

Ricardo Sueiras

Principal Technologist

Building for resilience

Everything will fail.

Temporary failures

  • Understand the common failure modes:
  • Temporary errors: go away on their own and are usually safe to retry.

Timeouts

  • Timeouts: external service takes too long to respond.

Permanent errors

  • Permanent errors: the request is fundamentally broken, so retrying will not help.

API limits

  • API rate limits: too many requests; expect throttling errors such as HTTP 429 Too Many Requests.
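
The four failure modes above can be told apart in code before deciding whether to retry. The sketch below is one possible mapping from an HTTP status code to a failure mode; the `FailureMode` enum, the `classify_failure` helper, and the exact status-code groupings are illustrative assumptions, not part of any AWS SDK.

```python
from enum import Enum


class FailureMode(Enum):
    TEMPORARY = "temporary"
    TIMEOUT = "timeout"
    PERMANENT = "permanent"
    THROTTLED = "throttled"


def classify_failure(status_code=None, timed_out=False):
    """Map a failed call to one of the four failure modes (illustrative mapping)."""
    if timed_out or status_code in (408, 504):
        return FailureMode.TIMEOUT
    if status_code == 429:
        return FailureMode.THROTTLED
    if status_code is not None and 400 <= status_code < 500:
        return FailureMode.PERMANENT
    # Assumption: anything else (for example 500, 502, 503) is treated as temporary.
    return FailureMode.TEMPORARY


# Only permanent errors are excluded from retries in this sketch.
RETRYABLE = {FailureMode.TEMPORARY, FailureMode.TIMEOUT, FailureMode.THROTTLED}
```

A caller would check `classify_failure(...) in RETRYABLE` before scheduling another attempt.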

Retry strategies

  • Failures in distributed systems are often temporary.
  • A retried request often succeeds.
  • Add retry logic carefully: blind retries can make things worse.
  • Exponential backoff: increase the delay between retries.
  • Jitter: add randomness to avoid retry spikes.
  • Retry limits: prevent infinite loops.
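
The three ingredients above (exponential backoff, jitter, and a retry limit) fit in a few lines. This is a minimal sketch, not the SDK's implementation: `retry_with_backoff`, its parameters, and the choice of "full jitter" (a random delay between zero and the current cap) are assumptions made for illustration.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying retryable errors with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # retry limit reached: surface the error to the caller
            # Exponential backoff: the cap doubles after every failed attempt.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: pick a random delay up to the cap to avoid retry spikes.
            sleep(random.uniform(0, cap))
```

Exceptions outside `retryable` propagate immediately, which keeps permanent errors from being retried.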

 

AWS SDK native capabilities

  • AWS SDKs handle retry logic automatically.
  • Built-in retry logic includes exponential backoff and jitter.
  • Failed or throttled requests are retried with increasing delays.
  • Reduces the amount of error handling code you need to write.
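
As one example, the Python SDK (boto3) lets you tune the built-in retry behaviour through a `Config` object. The fragment below is a sketch: the specific values, the `standard` retry mode, the S3 client, and the region are arbitrary choices for illustration, and it requires `boto3` to be installed.

```python
import boto3
from botocore.config import Config

# Assumed values for illustration; tune them for your workload.
retry_config = Config(
    retries={"max_attempts": 5, "mode": "standard"},
    connect_timeout=3,   # seconds to wait when opening a connection
    read_timeout=10,     # seconds to wait for a response
)

# The client retries failed or throttled calls according to retry_config.
s3 = boto3.client("s3", region_name="us-east-1", config=retry_config)
```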

 

Managing retry logic

  • Retry logic isn't always the answer.
  • Most HTTP 4xx errors: the server understood and rejected your request, so retrying it unchanged will not help (429 throttling is the exception).
  • Non-idempotent requests: retrying can cause duplicate or inconsistent actions.
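
One common way to make a non-idempotent operation safe to retry is an idempotency key: the client sends the same key on every attempt, and the server returns the stored result instead of repeating the action. The sketch below illustrates the idea with a hypothetical in-memory `PaymentService`; the class and its fields are invented for this example.

```python
import uuid


class PaymentService:
    """Hypothetical service that remembers results by idempotency key."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount):
        # A repeated key means a retry: return the earlier result, do not charge again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount": amount}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())            # generated once per logical request
first = service.charge(key, 25)
retried = service.charge(key, 25)  # same key, so no second charge
```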

Managing timeouts

  • Retrying without limits can amplify problems under load.
  • Combine retries with timeouts.
  • A timeout sets the max time you'll wait for a response.
  • Every external service call should have a timeout.
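
Many client libraries accept a timeout argument directly. Where one does not, a call can be wrapped so the caller stops waiting after a fixed time. The sketch below uses only the standard library; `call_with_timeout` and `slow_dependency` are hypothetical names, and note that the worker thread itself keeps running after the caller gives up.

```python
import concurrent.futures
import time

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func in a worker thread and wait at most timeout_seconds for its result."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort: an already running thread cannot be interrupted
        raise TimeoutError(f"call exceeded {timeout_seconds}s")


def slow_dependency():
    """Stand-in for an external service that responds too slowly."""
    time.sleep(0.5)
    return "late"
```

Calling `call_with_timeout(slow_dependency, 0.05)` raises `TimeoutError` instead of blocking for the full half second.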

Circuit breakers

  • When a service keeps failing, retries alone aren't enough.
  • Continuing to send requests can overload the failing service.
  • The circuit breaker pattern temporarily stops requests to an unhealthy dependency.
  • Closed state: requests flow normally.
  • When failures exceed a threshold, the circuit opens.
  • Open state: subsequent requests are blocked and fail fast.
  • After a cooldown, the circuit enters a half-open state to test recovery.
  • Service responds successfully: the circuit closes and normal operation resumes.
  • Still failing: the circuit reopens and the cooldown starts again.
  • Implement circuit breakers at the application level.
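
The closed, open, and half-open states can be sketched as a small application-level class. This is a minimal, single-threaded illustration under assumed names (`CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `cooldown_seconds`); production libraries add locking, metrics, and finer control over trial requests.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is blocked because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half-open"  # cooldown elapsed: allow a trial call
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures while closed, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each call to the dependency as `breaker.call(lambda: client.get(...))` keeps the breaker logic out of the business code.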

 

Dead letter queues

  • Messages that keep failing can block the system.
  • Move them out of the main processing flow.
  • Failed messages land in a Dead Letter Queue.
  • Key part of building resilient applications.
  • However, understand the trade-offs: messages in a Dead Letter Queue still need to be monitored and reprocessed.
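
The flow above can be simulated in memory to show the mechanics. In this sketch a message that fails `max_receive_count` times is moved to a dead letter list instead of returning to the main queue; the function name and the in-memory queues are assumptions for illustration (the parameter name loosely mirrors the `maxReceiveCount` setting used by Amazon SQS redrive policies).

```python
from collections import deque


def process_with_dlq(messages, handler, max_receive_count=3):
    """Process messages; move any that fail max_receive_count times to a DLQ."""
    main_queue = deque((msg, 0) for msg in messages)
    dead_letter_queue = []
    processed = []
    while main_queue:
        msg, receive_count = main_queue.popleft()
        try:
            processed.append(handler(msg))
        except Exception:
            receive_count += 1
            if receive_count >= max_receive_count:
                dead_letter_queue.append(msg)  # stop retrying, keep it for inspection
            else:
                main_queue.append((msg, receive_count))  # try again later
    return processed, dead_letter_queue
```

Healthy messages keep flowing while the failing one is set aside for later analysis.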

AWS API limits

  • You interact with AWS services via APIs.
  • Each service defines its own API rate limits.
  • Exceeding limits triggers throttling, surfaced as HTTP 429 Too Many Requests or a service-specific throttling error.
  • Design your application to handle throttling gracefully.
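
Besides retrying throttled calls with backoff, an application can pace its own requests so it rarely reaches the limit in the first place. The token bucket below is one such technique, introduced here as an illustration rather than something AWS prescribes; `TokenBucket` and its parameters are hypothetical names.

```python
import time


class TokenBucket:
    """Client-side rate limiter: allow a call only when a token is available."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        # Refill tokens for the time elapsed since the last check, up to capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # caller should wait or queue the request
```

When `try_acquire()` returns `False`, the caller can delay or queue the request instead of sending it and being throttled.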

Integrating with third-party services

  • Third-party services introduce uncertainty.
  • You have no control over their performance or availability.
  • Set timeouts to prevent long waits.
  • Use retries with backoff for temporary issues.
  • Isolate dependencies to limit blast radius.
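
One way to isolate a dependency is the bulkhead pattern: cap how many calls to that dependency may be in flight, so a slow third party cannot consume every worker. The sketch below is an assumption about how isolation might be implemented, with `Bulkhead` as a hypothetical class built on a standard-library semaphore.

```python
import threading


class Bulkhead:
    """Limit concurrent calls to one dependency and reject the overflow."""

    def __init__(self, max_concurrent_calls):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, operation):
        # Non-blocking acquire: fail fast rather than pile up waiting callers.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call instead of queueing")
        try:
            return operation()
        finally:
            self._slots.release()
```

Giving each third-party service its own `Bulkhead` keeps a problem in one of them from spreading to the others.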

 

Integrating with third-party services

  • Consider shifting from synchronous to asynchronous communication.
  • Your system keeps processing even if the external service is delayed.
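
The shift to asynchronous communication can be sketched with an in-process queue: the request path only enqueues work and returns, while a background worker talks to the external service. Everything here (`submit`, `worker`, the in-memory `queue.Queue`) is an illustrative stand-in; on AWS the queue would typically be a managed service rather than a Python object.

```python
import queue
import threading

work_queue = queue.Queue()
results = []
failures = []


def worker(call_external_service):
    """Drain the queue in the background, calling the slow external service."""
    while True:
        item = work_queue.get()
        try:
            if item is None:  # sentinel value: stop the worker
                break
            results.append(call_external_service(item))
        except Exception as exc:
            failures.append((item, exc))  # record the failure, keep the worker alive
        finally:
            work_queue.task_done()


def submit(item):
    """Request path: enqueue and return immediately without waiting for the service."""
    work_queue.put(item)
    return "accepted"
```

The caller receives `"accepted"` right away, and a delay in the external service only slows the worker, not the request path.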

 

Let's practice!
