API Management in the AI Era

Slide 1

Slide 1 text

MELBOURNE EDITION

Slide 2

Slide 2 text

2025 SPONSORS

Slide 3

Slide 3 text

API MANAGEMENT IN THE AI ERA

Slide 4

Slide 4 text

WHO AM I?
NAME: Nilesh Gule
WORK: Avanade
SOCIALS: @nileshgule
PASSIONS: Photography, Cricket
Code with Passion, Strive for Excellence

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

API Management - GenAI Gateway
Azure-Samples/AI-Gateway: APIM

Slide 7

Slide 7 text

Challenges in managing GenAI APIs
• Track token usage across multiple applications
• Ensure a single app doesn't consume the whole tokens-per-minute (TPM) quota
• Secure API keys across multiple applications
• Distribute load across multiple endpoints
• Ensure committed capacity in PTUs is exhausted before falling back to a pay-as-you-go (PAYG) instance

Slide 8

Slide 8 text

Provisioned Throughput Units (PTU)
• Let you specify the amount of throughput required for a model deployment
• Granted to a subscription as quota
• Quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region
• PTUs provide
  • Predictable performance
  • Allocated processing capacity
  • Cost savings
Understanding costs associated with provisioned throughput units (PTU)

Slide 9

Slide 9 text

Token Metrics Emitting
• Sends token usage metrics to Application Insights
• Provides an overview of the utilization of Azure OpenAI models across multiple applications or API consumers
GenAI Gateway Capabilities in Azure API Management
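The idea of tracking token usage per API consumer can be illustrated with a minimal Python sketch. This is not how Azure API Management implements its policy; the `TokenMetrics` class and the consumer IDs are hypothetical, and the `usage` dictionary is assumed to mirror the `prompt_tokens` and `completion_tokens` fields of a chat completions response.

```python
from collections import defaultdict


class TokenMetrics:
    """Accumulates token counts per API consumer (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(
            lambda: {"prompt": 0, "completion": 0, "total": 0}
        )

    def record(self, consumer_id, usage):
        # `usage` is assumed to look like the `usage` object of a response.
        prompt = usage.get("prompt_tokens", 0)
        completion = usage.get("completion_tokens", 0)
        bucket = self.totals[consumer_id]
        bucket["prompt"] += prompt
        bucket["completion"] += completion
        bucket["total"] += prompt + completion


metrics = TokenMetrics()
metrics.record("app-a", {"prompt_tokens": 120, "completion_tokens": 30})
metrics.record("app-b", {"prompt_tokens": 50, "completion_tokens": 10})
metrics.record("app-a", {"prompt_tokens": 80, "completion_tokens": 20})
```

In a real gateway the per-consumer totals would be exported to a monitoring backend rather than held in memory.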

Slide 10

Slide 10 text

Token Rate Limiting
• Manage and enforce limits per API consumer based on the number of tokens consumed
GenAI Gateway Capabilities in Azure API Management
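A per-consumer token limit can be sketched as a fixed one-minute window counter. This is a conceptual illustration only, not the gateway's actual algorithm; `TokenRateLimiter`, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import time


class TokenRateLimiter:
    """Fixed-window tokens-per-minute limiter keyed by consumer (sketch)."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.tokens_per_minute = tokens_per_minute
        self.clock = clock
        self.windows = {}  # consumer_id -> (window_start, tokens_used)

    def allow(self, consumer_id, tokens):
        now = self.clock()
        start, used = self.windows.get(consumer_id, (now, 0))
        if now - start >= 60:
            # The previous window has expired, so start a fresh one.
            start, used = now, 0
        if used + tokens > self.tokens_per_minute:
            # Reject the call; a gateway would typically answer HTTP 429.
            self.windows[consumer_id] = (start, used)
            return False
        self.windows[consumer_id] = (start, used + tokens)
        return True
```

Because each consumer has its own window, one application exhausting its limit does not affect the others.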

Slide 11

Slide 11 text

Load Balancer and Circuit Breaker
• Helps spread load across multiple Azure OpenAI endpoints
• Round-robin, weighted, or priority-based load distribution strategy
GenAI Gateway Capabilities in Azure API Management
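Priority-based distribution combined with a circuit breaker can be sketched as follows. This is a simplified model under stated assumptions, not the Azure API Management implementation: the `Backend` and `BackendPool` classes, the failure threshold, and the cooldown period are all hypothetical. A PTU backend with priority 1 is tried first, and a PAYG backend with priority 2 is used only while the PTU circuit is open.

```python
import time


class Backend:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority  # lower number is tried first
        self.failures = 0
        self.open_until = 0.0     # circuit stays open until this time


class BackendPool:
    """Priority-based selection with a simple circuit breaker (sketch)."""

    def __init__(self, backends, failure_threshold=3,
                 cooldown_seconds=30, clock=time.monotonic):
        self.backends = sorted(backends, key=lambda b: b.priority)
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock

    def pick(self):
        now = self.clock()
        for backend in self.backends:
            if now >= backend.open_until:  # circuit is closed
                return backend
        return None  # every circuit is open

    def report_failure(self, backend):
        backend.failures += 1
        if backend.failures >= self.failure_threshold:
            # Trip the circuit and skip this backend during the cooldown.
            backend.open_until = self.clock() + self.cooldown_seconds
            backend.failures = 0

    def report_success(self, backend):
        backend.failures = 0
```

Once the cooldown elapses, the higher-priority backend is picked again, which matches the goal of using committed capacity before falling back.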

Slide 12

Slide 12 text

Semantic Caching
• Optimize token usage by leveraging semantic caching
• Stores completions for prompts with similar meanings
GenAI Gateway Capabilities in Azure API Management
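The idea of reusing completions for prompts with similar meanings can be sketched with cosine similarity over prompt embeddings. This is a conceptual illustration, not the gateway's implementation: `SemanticCache`, the `threshold` value, and the toy two-dimensional vectors are assumptions, and a real system would obtain embeddings from an embeddings model and store them in a vector-capable cache.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached completion when a prompt embedding is close enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, completion) pairs

    def lookup(self, embedding):
        best_score, best_completion = 0.0, None
        for cached_embedding, completion in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_completion = score, completion
        # Only a sufficiently similar prompt counts as a cache hit.
        return best_completion if best_score >= self.threshold else None

    def store(self, embedding, completion):
        self.entries.append((embedding, completion))
```

A cache hit avoids a model call entirely, which is where the token savings come from.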

Slide 13

Slide 13 text

Summary
• Track token usage across multiple applications
  • Emit Token Metrics policy
• Ensure a single app doesn't consume the whole TPM quota
  • Token Limit policy
• Secure API keys across multiple applications
  • Subscription keys
• Distribute load across multiple endpoints
  • Backend pool load balancing and circuit breaker

Slide 14

Slide 14 text

Resources
• Azure OpenAI Gateway topologies
• Azure OpenAI Token Limit Policy
• LLM Token Limit Policy
• Azure OpenAI Emit Token Metric Policy
• LLM Emit Token Metric Policy
• Houssem Dellai YouTube videos
• GenAI Labs
• Designing and implementing GenAI gateway solution

Slide 15

Slide 15 text

Nilesh Gule
ARCHITECT | MICROSOFT MVP
“Code with Passion and Strive for Excellence”
nileshgule
@nileshgule
Nilesh Gule
NileshGule
www.handsonarchitect.com
https://www.youtube.com/@nilesh-gule

Slide 16

Slide 16 text

Source Code & slide deck
Nilesh Gule fork - GenAI Labs: https://github.com/NileshGule/AI-Gateway
GenAI Labs: https://aka.ms/apim/genai/labs
https://speakerdeck.com/nileshgule/
https://www.slideshare.net/nileshgule/

Slide 17

Slide 17 text

Q&A