Failing Gracefully: Mitigating Risks

Building Scalable Agentic Systems

Korey Stegared-Pace

Senior AI Cloud Advocate, Microsoft

Tool call failures

Solutions

Defining clear parameters and usage

Tool output validation (unit testing!)

Verification checks on tool selection

MCP!

Problems

Tool call failures - retry mechanism

External services may break or be temporarily slow

Retry
- Exponential backoff → slowly decreasing retries
- Inform users of issues using callbacks

Tool call failures - caching

Tool outputs are cached as a fallback

Works for: Static data with infrequent updates

Won't work for: tools that serve real-time information

Tool call failures - queue management

Tool calls may be reliant on each another

Example: flights must be booked before the taxi

Unsuccessful tool calls can be moved down the queue

Authentication

Tools may require access and permissions to private data

Example: IT support agent
- Valid access: system logs
- Blocked access: passwords, location, etc.

Authentication - unique agent identifiers

Authentication - isolated environments

Authentication - guardrails and action restraints

Let's practice!

Building Scalable Agentic Systems

Preparing Video For Download...