API Management in the AI Era

The slide deck from the Azure Singapore meetup presentation on 10 July 2025.
The session covers the AI Gateway features of Azure API Management, with live demos of:
- Token rate limiting
- Token metrics emitting
- Backend pool load balancing

https://www.meetup.com/mssgug/events/308207674/


Nilesh Gule

July 10, 2025

Transcript

  1. $whoami {
       "name": "Nilesh Gule",
       "website": "https://www.HandsOnArchitect.com",
       "github": "https://GitHub.com/NileshGule",
       "twitter": "@nileshgule",
       "linkedin": "https://www.linkedin.com/in/nileshgule",
       "YouTube": "https://www.YouTube.com/@nilesh-gule",
       "likes": "Technical Evangelism, Cricket"
     }
  2. AI Gateway capabilities of Azure API Management
     - Security & safety: Keyless managed identities; AI Apps & Agents Authorizations (New); Content Safety (GA); Credential Manager
     - Resiliency: Weighted load balancing; Priority routing to provisioned capacity models; Backend pools with circuit breaker; Session-aware load balancing (GA)
     - Scalability: Token rate limits and token quotas; Semantic Caching (GA); Model load balancing; Multi-regional deployments
     - Traffic mediation & control: Azure AI Foundry & Azure OpenAI; OpenAI-compatible models (GA); Responses API (GA); WebSockets for Realtime APIs; MCP server pass-through (Soon); Expose APIs as built-in MCP server (Preview)
     - Developer velocity: Wizard policy configuration experience; Self-service with the Developer Portal; API Center Copilot Studio connector (Preview); Policy Toolkit
     - Observability: Token counting per consumer; Prompts and completions logging (GA); Built-in reporting dashboard (GA)
     - Governance: Policy engine with custom expressions; API Center MCP server registry (Preview); Federated API Management
  3. Challenges in managing GenAI APIs
     - Track token usage: ensure tokens are used properly across multiple applications.
     - Manage TPM quota: ensure a single app doesn't consume the whole TPM quota.
     - Secure API keys: secure API keys across multiple applications.
     - Distribute load across multiple endpoints: ensure committed PTU capacity is exhausted before falling back to the PAYG instance.
  4. Provisioned Throughput Units (PTU)
     - Let you specify the amount of throughput required in a model deployment.
     - Granted to a subscription as quota.
     - Quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region.
     - PTU provides predictable performance, allocated processing capacity, and cost savings.
     Reference: Understanding costs associated with provisioned throughput units (PTU)
  5. Token Metrics Emitting (GenAI Gateway Capabilities in Azure API Management)
     - Sends token metric usage to Application Insights.
     - Provides an overview of the utilization of Azure OpenAI models across multiple applications or API consumers.
     A policy sketch follows this item.
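
A minimal sketch of how the Azure OpenAI Emit Token Metric policy might be placed in an API's inbound policy section. The namespace and dimensions below are illustrative assumptions rather than the exact demo configuration, and the policy assumes an Application Insights logger is already configured for the APIM instance or API.

```xml
<policies>
    <inbound>
        <base />
        <!-- Emit prompt, completion and total token counts for each request
             to Application Insights as custom metrics, split by the
             dimensions declared below. -->
        <azure-openai-emit-token-metric namespace="openai">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
            <dimension name="API ID" value="@(context.Api.Id)" />
        </azure-openai-emit-token-metric>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```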
  6. Token Rate Limiting (GenAI Gateway Capabilities in Azure API Management)
     - Manage and enforce limits per API consumer based on the usage of API tokens.
     A policy sketch follows this item.
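
A minimal sketch of the Azure OpenAI Token Limit policy, keyed here on the APIM subscription ID. The limit value and header names are assumptions for illustration; requests that exceed the limit are rejected by the gateway with HTTP 429.

```xml
<policies>
    <inbound>
        <base />
        <!-- Cap tokens per minute for each APIM subscription; requests over
             the limit are rejected with HTTP 429 before reaching the model. -->
        <azure-openai-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="500"
            estimate-prompt-tokens="true"
            remaining-tokens-header-name="x-remaining-tokens"
            tokens-consumed-header-name="x-tokens-consumed" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```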
  7. Load Balanced Pool and Circuit Breaker (GenAI Gateway Capabilities in Azure API Management)
     - Helps spread load across multiple Azure OpenAI endpoints.
     - Supports round-robin, weighted, or priority-based load distribution strategies.
     A policy sketch follows this item.
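
A minimal sketch of routing traffic to a load-balanced backend pool from policy. The pool itself (its member backends, weights or priorities, and circuit breaker rules) is defined on the APIM backend resource rather than in policy; the backend-id and the retry condition below are assumptions for illustration.

```xml
<policies>
    <inbound>
        <base />
        <!-- Route requests to a load-balanced pool of Azure OpenAI backends.
             Pool members, weights/priorities and circuit breaker rules are
             configured on the "openai-backend-pool" backend resource. -->
        <set-backend-service backend-id="openai-backend-pool" />
    </inbound>
    <backend>
        <!-- Retry once against the pool when a member is throttled, so a
             healthy or lower-priority member can pick up the request. -->
        <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="0" first-fast-retry="true">
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```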
  8. Semantic Caching (GenAI Gateway Capabilities in Azure API Management)
     - Optimize token usage by leveraging semantic caching.
     - Stores completions for prompts with similar meanings.
     A policy sketch follows this item.
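
A minimal sketch of the Azure OpenAI semantic caching policies: a cache lookup in the inbound section and a cache store in the outbound section. The score threshold, embeddings backend id, and cache duration are assumptions for illustration; the policies also rely on an external cache (for example, Azure Cache for Redis) and an embeddings deployment exposed as an APIM backend.

```xml
<policies>
    <inbound>
        <base />
        <!-- Return a cached completion when a semantically similar prompt
             has been answered before; similarity is scored via embeddings. -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.8"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
        <!-- Store the completion for 120 seconds so similar prompts can be
             served from the cache instead of consuming model tokens. -->
        <azure-openai-semantic-cache-store duration="120" />
    </outbound>
</policies>
```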
  9. Summary
     - Track token usage across multiple applications: Emit Token Metric policy
     - Ensure a single app doesn't consume the whole TPM quota: Token Limit policy
     - Secure API keys across multiple applications: subscription keys
     - Distribute load across multiple endpoints: backend pool load balancing and circuit breaker
  10. Resources
     - Azure OpenAI Gateway topologies
     - Azure OpenAI Token Limit Policy
     - LLM Token Limit Policy
     - Azure OpenAI Emit Token Metric Policy
     - LLM Emit Token Metric Policy
     - Houssem Dellai YouTube videos
     - GenAI Labs
     - Designing and implementing GenAI gateway solution
  11. Nilesh Gule ARCHITECT | MICROSOFT MVP “Code with Passion and

    Strive for Excellence” nileshgule @nileshgule Nilesh Gule NileshGule www.handsonarchitect.com https://www.youtube.com/@nilesh-gule
  12. Source Code & slide deck
     - GenAI Labs (Nilesh Gule fork): https://github.com/NileshGule/AI-Gateway
     - GenAI Labs: https://aka.ms/apim/genai/labs
     - https://speakerdeck.com/nileshgule/
     - https://www.slideshare.net/nileshgule/
  13. Q&A