WHO AM I?
NAME: Nilesh Gule
WORK: Avanade
SOCIALS: @nileshgule
PASSIONS: Photography, Cricket
Code with Passion, Strive for Excellence
Slide 6
API Management - GenAI Gateway
Azure-Samples/AI-Gateway: APIM
Slide 7
Challenges in managing GenAI APIs
• Track token usage across multiple applications
• Ensure a single app doesn’t consume the whole tokens-per-minute (TPM) quota
• Secure API keys across multiple applications
• Distribute load across multiple endpoints
• Ensure committed capacity in provisioned throughput units (PTUs) is exhausted before falling back to a pay-as-you-go (PAYG) instance
Slide 8
Provisioned Throughput Units (PTU)
• Lets you specify the amount of throughput required for a model deployment
• Granted to a subscription as quota
• Quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region
• PTU provides:
  • Predictable performance
  • Allocated processing capacity
  • Cost savings
Understanding costs associated with provisioned throughput units (PTU)
Slide 9
Token Metrics Emitting
• Sends token usage metrics to Application Insights
• Provides an overview of Azure OpenAI model utilization across multiple applications or API consumers
GenAI Gateway Capabilities in Azure API Management
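To make the idea concrete, here is a minimal Python sketch of the kind of per-consumer aggregation that token metrics make possible. It is an illustration only: in API Management the metrics are emitted by a policy and collected in Application Insights, and the class and field names below (`TokenMetrics`, `TokenUsage`, `consumer_id`) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


@dataclass
class TokenUsage:
    """Running token counts for one API consumer (hypothetical structure)."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


class TokenMetrics:
    """Aggregates token usage per consumer, as a metrics backend would."""

    def __init__(self) -> None:
        self._usage: Dict[str, TokenUsage] = defaultdict(TokenUsage)

    def record(self, consumer_id: str, prompt_tokens: int, completion_tokens: int) -> None:
        # Add the token counts reported for one completed request.
        usage = self._usage[consumer_id]
        usage.prompt_tokens += prompt_tokens
        usage.completion_tokens += completion_tokens

    def totals(self) -> Dict[str, int]:
        # Total tokens consumed so far, keyed by consumer.
        return {consumer: usage.total_tokens for consumer, usage in self._usage.items()}


metrics = TokenMetrics()
metrics.record("app-a", prompt_tokens=120, completion_tokens=80)
metrics.record("app-b", prompt_tokens=50, completion_tokens=25)
metrics.record("app-a", prompt_tokens=30, completion_tokens=20)
usage_by_consumer = metrics.totals()
print(usage_by_consumer)
```

Grouping by consumer is what lets the gateway answer "which application is using the quota?" without any change to the applications themselves.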
Slide 10
Token Rate Limiting
• Manages and enforces limits per API consumer based on token usage
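A per-consumer token limit can be pictured as a counter that resets every minute. The Python sketch below assumes a simple fixed one-minute window and assumes the caller already knows the token count of each request; the actual API Management policy may count and window tokens differently. `TokenRateLimiter` and its parameters are hypothetical names used for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TokenRateLimiter:
    """Fixed-window tokens-per-minute limiter, tracked per consumer (illustrative)."""

    def __init__(self, tokens_per_minute: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.tokens_per_minute = tokens_per_minute
        self._clock = clock  # injectable so the example can control time
        # consumer id -> (window index, tokens used in that window)
        self._windows: Dict[str, Tuple[int, int]] = {}

    def allow(self, consumer_id: str, tokens: int) -> bool:
        window = int(self._clock() // 60)
        seen_window, used = self._windows.get(consumer_id, (window, 0))
        if seen_window != window:
            used = 0  # a new minute has started, so the budget resets
        if used + tokens > self.tokens_per_minute:
            self._windows[consumer_id] = (window, used)
            return False  # a gateway would reject the call at this point
        self._windows[consumer_id] = (window, used + tokens)
        return True


now = [0.0]  # fake clock, in seconds
limiter = TokenRateLimiter(tokens_per_minute=1000, clock=lambda: now[0])

first = limiter.allow("app-a", 800)        # within app-a's budget
second = limiter.allow("app-a", 300)       # would exceed app-a's budget
other = limiter.allow("app-b", 300)        # app-b has its own budget
now[0] = 61.0                              # move into the next minute
after_reset = limiter.allow("app-a", 300)  # app-a's budget has reset
print(first, second, other, after_reset)
```

Because each consumer has its own counter, one busy application is throttled without affecting the others.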
Slide 11
Load Balancer and Circuit Breaker
• Spreads load across multiple Azure OpenAI endpoints
• Supports round-robin, weighted, or priority-based load distribution
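The priority-based strategy combined with a circuit breaker can be sketched as follows. This Python example covers only the priority case (not round-robin or weighted), and it assumes a breaker that trips after a fixed number of consecutive failures and reopens after a cooldown. The names `Backend`, `PriorityBalancer`, `ptu-endpoint`, and `payg-endpoint` are hypothetical; in API Management this behavior is configured on backend resources rather than written in code.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Backend:
    name: str
    priority: int            # lower value is tried first
    failures: int = 0        # consecutive failures seen so far
    open_until: float = 0.0  # backend is skipped until this time


class PriorityBalancer:
    """Picks the highest-priority backend whose circuit is not tripped."""

    def __init__(self, backends: List[Backend], failure_threshold: int = 3,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._backends = sorted(backends, key=lambda b: b.priority)
        self._failure_threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock

    def pick(self) -> Optional[Backend]:
        now = self._clock()
        for backend in self._backends:
            if now >= backend.open_until:
                return backend
        return None  # every backend is currently tripped

    def report_failure(self, backend: Backend) -> None:
        backend.failures += 1
        if backend.failures >= self._failure_threshold:
            # Trip the breaker: skip this backend for the cooldown period.
            backend.open_until = self._clock() + self._cooldown
            backend.failures = 0

    def report_success(self, backend: Backend) -> None:
        backend.failures = 0


now = [0.0]  # fake clock, in seconds
balancer = PriorityBalancer(
    [Backend("payg-endpoint", priority=2), Backend("ptu-endpoint", priority=1)],
    failure_threshold=2, cooldown_seconds=30.0, clock=lambda: now[0],
)

first_choice = balancer.pick().name  # the PTU endpoint has priority
ptu = balancer.pick()
balancer.report_failure(ptu)
balancer.report_failure(ptu)         # second failure trips the breaker
fallback = balancer.pick().name      # traffic falls back to PAYG
now[0] = 31.0                        # cooldown has elapsed
recovered = balancer.pick().name     # the PTU endpoint is tried again
print(first_choice, fallback, recovered)
```

Giving the PTU deployment the highest priority is one way to use committed capacity first and fall back to a PAYG instance only when the PTU endpoint is failing or throttled.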
Slide 12
Semantic Caching
• Optimizes token usage by leveraging semantic caching
• Stores completions and reuses them for prompts with similar meanings
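A semantic cache compares the meaning of prompts rather than their exact text. The Python sketch below assumes each prompt has already been turned into an embedding vector (in practice by an embeddings model, which is not shown) and uses cosine similarity with a threshold to decide on a cache hit. The `SemanticCache` class, the threshold value, and the toy vectors are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached completion when a prompt embedding is close enough."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # minimum similarity that counts as a hit
        self._entries: List[Tuple[Tuple[float, ...], str]] = []

    def store(self, embedding: Sequence[float], completion: str) -> None:
        self._entries.append((tuple(embedding), completion))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_completion = 0.0, None
        for cached_embedding, completion in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_completion = score, completion
        return best_completion if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
# Toy 3-dimensional embeddings; real embeddings have many more dimensions.
cache.store([1.0, 0.0, 0.0], "Paris is the capital of France.")

hit = cache.lookup([0.99, 0.05, 0.0])  # close in meaning: served from cache
miss = cache.lookup([0.0, 1.0, 0.0])   # unrelated prompt: None, call the model
print(hit, miss)
```

Every cache hit is a completion that did not have to be generated again, which is where the token savings come from. The threshold controls the trade-off: a stricter value produces fewer hits but lowers the risk of returning an answer to a different question.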
Slide 13
Summary
• Track token usage across multiple applications
  • Emit Token Metrics policy
• Ensure a single app doesn’t consume the whole TPM quota
  • Token Limit policy
• Secure API keys across multiple applications
  • Subscription keys
• Distribute load across multiple endpoints
  • Backend pool load balancing and circuit breaker
Nilesh Gule
ARCHITECT | MICROSOFT MVP
“Code with Passion and
Strive for Excellence”
nileshgule @nileshgule Nilesh Gule
NileshGule
www.handsonarchitect.com
https://www.youtube.com/@nilesh-gule