ABCS25: Azure OpenAI Unplugged: Real-World Lessons from My Latest GenAI Project by Dieter Gobeyn

Description
Thinking of using Azure OpenAI in production? In this session, I’ll share hard-earned lessons from my latest GenAI project, where I combined Azure OpenAI GPT-4 with data integration workflows. As both a solution architect and hands-on developer, I’ll walk you through what worked, what flopped, and what I wish I’d known earlier. We’ll dive into managing costs, securing data, reducing hallucinations, fine-tuning prompts, and keeping token usage in check. Whether you’re exploring AI or scaling it, this session is packed with tips to help you avoid pitfalls and build smarter, more efficient AI solutions.

Dieter Gobeyn
Enterprise Integration Architect & Azure Cloud Solutions Architect

Microsoft MVP (verified)

Dieter is a seasoned IT professional and Azure cloud expert, certified since 2012, with hands-on experience. Throughout his consulting career, he has specialized in crafting end-to-end enterprise architectures that prioritize clarity, simplicity, and scalability.

A certified TOGAF practitioner, Scrum Master, and Azure Architect, Dieter excels at designing clear and effective solutions to complex technical challenges. He has spent recent years working with international organizations, navigating the complexities of cross-border collaboration and fostering teamwork across multi-country setups.

As a passionate advocate for technology, Dieter shares his insights as a public speaker, drawing on his expertise in cloud solutions, enterprise integration, and the lessons learned from managing global teams in distributed environments.
[email protected]

linkedin.com/in/dietergobeyn/
azuretechinsider.com (blog)
@DieterGobeyn
azurewatcher.com (AzureWatcher)
integration.team (company)
bsky.app/profile/d... (BlueSky)

sessionize.com/dieter-gobeyn (public speaker profile)

More Decks by Azure Zurich User Group

Transcript

  1. Agenda
     • Introduction
     • My GenAI Journey
     • Quality Measurements
     • Model Evaluation
     • Cost Reduction Strategies
     • Azure OpenAI Limits & Security
     • JSON Integration
     • Q&A Session
  2. Dieter Gobeyn
     • Azure MVP – hands-on Azure Cloud Solutions Architect
     • Azure certified since 2012
     • Passionate about Cloud & Integration
     • Public speaker and blogger
     • Belgian-born, now based in London
     • Scuba diver – hiker – world traveller
  3. (image-only slide)

  4. A Community Project Built on Azure OpenAI
     1. Data Filtering: ignore typos, small rewordings, and URL tweaks.
     2. Classify Data: identify meaningful updates and provide clear summaries.
     3. Estimate Impact & Urgency: assess how changes affect users and implementations.
     4. Regroup or Reclassify by Feature: organize updates for easier navigation.
     5. Summarize Text: provide concise, structured content.
     6. Generate HTML: newsletter-ready HTML links.
     Keeping it cost-efficient without breaking the bank!
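The six stages above can be sketched as a small pipeline. This is a minimal Python sketch with the LLM-dependent stages stubbed by plain heuristics; all function names, thresholds, and heuristics here are illustrative assumptions, not from the actual project:

```python
# Compact sketch of the six stages; LLM-dependent parts are stubbed.

def filter_noise(updates, min_chars=20):
    """Stage 1: drop trivial edits (typos, URL tweaks) before spending tokens."""
    return [u for u in updates if len(u["diff"]) >= min_chars]

def classify(update):
    """Stage 2: tag the update; the real project calls the model here."""
    update["category"] = "breaking" if "retired" in update["diff"].lower() else "feature"
    return update

def estimate_impact(update):
    """Stage 3: toy urgency score derived from the classification."""
    update["urgency"] = "high" if update["category"] == "breaking" else "low"
    return update

def to_html(updates):
    """Stages 4-6 collapsed: group, summarise, and emit newsletter-ready links."""
    items = "".join(
        f'<li><a href="{u["url"]}">{u["title"]}</a> ({u["urgency"]})</li>'
        for u in updates
    )
    return f"<ul>{items}</ul>"

def run_pipeline(raw):
    kept = [estimate_impact(classify(u)) for u in filter_noise(raw)]
    return to_html(kept)
```

The design point is that each stage is small and independently testable, and only the stages that truly need an LLM would ever call one.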
  5. GenAI Output Quality vs. Development Effort
     (Chart: quality output and time spent, Weeks 1–9, on a 0–10 scale – a subjective learning curve.)
  6. Prompting Techniques
     • Master a New Skill – just like learning how to Google effectively, mastering prompting is essential.
     • Invest Time Wisely – spend a few hours upfront to get the basics right.
     • Know the Rules – read the OpenAI model guidelines to understand best practices and limitations.
     • Great references:
       • Prompt engineering techniques: Microsoft Learn
       • Advanced prompt techniques: https://www.promptingguide.ai/
  7. Quality Considerations
     GENAI IS NOT 100% ACCURATE – IS THAT GOOD ENOUGH FOR YOUR USE CASE?
     • AI is non-deterministic – some level of unpredictability is inherent.
     • Expect hallucinations and fabricated facts. These can be mitigated but not eliminated entirely.
     • My observation: reducing inaccuracies is an ongoing process, not a one-time fix.
     (My) Rule of Thumb
     • If you can write an algorithm instead of using an LLM, do it.
     • 100% consistent and cost-effective results.
     • Fewer total instructions → less confusion → higher-quality outputs.
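The "write an algorithm instead" rule applies directly to the data-filtering stage: detecting typos and small rewordings needs no LLM at all. A minimal sketch using Python's standard-library `difflib` (the threshold value is an illustrative assumption):

```python
import difflib

def is_trivial_change(old: str, new: str, threshold: float = 0.9) -> bool:
    # Deterministic, free, and 100% repeatable: no LLM call needed to decide
    # whether two versions of a text differ only by a typo or small reword.
    return difflib.SequenceMatcher(None, old, new).ratio() >= threshold
```

Unlike a model call, this costs nothing per invocation and always returns the same answer for the same input.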
  8. Practical Prompting Strategies
     Prompt Engineering: Writing Effective Prompts
     • Be Clear & Specific → avoid vague wording; make your request explicit.
     • Provide Context → more context ~ fewer hallucinations.
     • Refine Iteratively → adjust prompts to improve accuracy.
     Prompt Chaining: Structuring Multi-Step Tasks
     • Step-by-Step Breakdown → handle complex queries in (multiple) smaller steps.
     • Verification Points → insert checkpoints for self-validation.
     • Balance Accuracy vs. Cost → more steps improve results but increase cost.
     If everything else fails: Evaluator-Optimizer (~"are you sure?").
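The prompt-chaining pattern above (breakdown, then a verification checkpoint) can be sketched as plain functions; `ask` stands in for any single chat-completion call, and the prompts themselves are illustrative, not from the talk:

```python
def chained_summary(text, ask):
    # Prompt chaining sketch: `ask` is any callable that sends one prompt to
    # a model and returns the reply (stubbed in tests).
    facts = ask(f"List the key facts in: {text}")
    summary = ask(f"Summarise these facts in one sentence: {facts}")
    # Verification checkpoint (the "are you sure?" evaluator pattern):
    verdict = ask(f"Does this summary follow from the facts? Answer yes or no.\n"
                  f"Facts: {facts}\nSummary: {summary}")
    return summary if "yes" in verdict.lower() else facts
```

Each extra step is an extra model call, which is exactly the accuracy-versus-cost trade-off the slide describes.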
  9. Model Evaluation
     Assessing Model Performance & Quality
     • Trade off cost vs. quality (e.g., GPT-4 is ~15–20x more expensive than GPT-4o mini).
     • Standardize expectations with structured input/output testing – think unit tests in .NET, but for GenAI.
     • Azure frequently retires models – ensure adaptability and a migration plan.
     Evaluation Approaches
     • Custom unit tests for AI behaviour validation
     • Azure OpenAI Evaluation
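The "unit tests for GenAI" idea amounts to asserting a fixed contract on model replies. A minimal sketch (the schema and key names here are hypothetical, not from the project):

```python
import json

REQUIRED_KEYS = {"title", "summary", "urgency"}

def validate_reply(raw: str) -> dict:
    # Unit-test-style gate for model output: parse, check required keys,
    # and check value ranges before the reply is trusted downstream.
    data = json.loads(raw)  # raises on malformed output
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["urgency"] not in {"low", "medium", "high"}:
        raise ValueError("urgency out of range")
    return data
```

Running a gate like this against a fixed prompt set also gives you a regression suite for when Azure retires a model and you must migrate deployments.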
  10. Cost Reduction Strategies
     Optimizing Cost vs. Quality
     • Higher-cost models offer better accuracy and more up-to-date knowledge, but do you always need that?
     • Lower-cost models can be sufficient for many tasks, reducing expenses.
     Smart Cost Management
     • Mix & match models based on functionality.
     • Use simpler models for routine tasks and premium models only where needed.
     • Regularly evaluate model performance vs. cost to ensure efficiency.
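Mix-and-match in practice is just a routing decision made before the API call. A deliberately naive sketch; the deployment names and "hard task" markers are placeholder assumptions:

```python
def pick_deployment(task: str) -> str:
    # Hypothetical cost-aware router: routine work goes to the cheap model,
    # premium only when the task looks hard. Markers/names are illustrative.
    hard_markers = ("legal", "multi-step", "ambiguous")
    return "gpt-4o" if any(m in task.lower() for m in hard_markers) else "gpt-4o-mini"
```

In a real system the routing signal might be task type, input length, or a confidence score rather than keywords, but the shape is the same: the model choice is a per-request parameter, not a global constant.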
  11. Token Usage
     • Azure charges per token used.
     • On average, 1 token ≈ 4 English characters.
     • Fewer tokens = lower cost & faster responses.
     Optimisation Tips
     • Use clear, concise prompts to minimise token consumption.
     • Optimise outputs to avoid unnecessary verbosity.
     • Manage rate limits (see later).
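The ~4-characters-per-token rule of thumb gives a quick back-of-envelope cost estimator. This is an approximation only; for exact counts use a real tokenizer (such as tiktoken), and note that per-1K-token prices vary by model and region, so they are passed in as parameters:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the slide: 1 token ~= 4 English characters.
    return max(1, round(len(text) / 4))

def estimate_cost(prompt: str, completion: str,
                  in_per_1k: float, out_per_1k: float) -> float:
    # Input and output tokens are billed at different rates.
    return (estimate_tokens(prompt) / 1000 * in_per_1k
            + estimate_tokens(completion) / 1000 * out_per_1k)
```

For example, a 400-character prompt estimates to ~100 tokens; multiplying token counts by your current rates gives a per-call cost you can log and budget against.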
  12. (Advanced) Model Settings: Temperature
     • Controls the "creativity" or randomness of LLM output.
     • Higher values (e.g., 0.7) → more varied and imaginative responses.
     • Lower values (e.g., 0.2) → more deterministic, focused, and consistent.
     • (What temperature actually controls is the scaling of the token scores (logits) before sampling.)
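The "scaling of the scores" point can be made concrete: temperature divides the logits before the softmax, so low temperature sharpens the token distribution and high temperature flattens it. A self-contained illustration of that math:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Temperature divides the logits before softmax: T < 1 sharpens the
    # distribution (more deterministic), T > 1 flattens it (more varied).
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With logits [2.0, 1.0, 0.0], the top token gets ~99% of the probability at T=0.2 but only ~51% at T=2.0, which is exactly the deterministic-versus-varied behaviour the slide describes.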
  13. (Advanced) Model Settings: Top_p (Nucleus Sampling)
     • Restricts choices to the most probable tokens.
     • Top_p = 0.1 → only considers the smallest set of tokens whose cumulative probability reaches 10%.
     • Balances creativity with coherent, context-driven responses.
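Nucleus sampling keeps the smallest set of tokens whose cumulative probability reaches top_p, then renormalises and samples only from that set. A minimal illustration of the selection step:

```python
def nucleus_filter(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability >= top_p;
    # `probs` maps token -> probability and should sum to ~1.
    kept, cum = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}  # renormalise
```

Low top_p therefore trims the long tail of unlikely tokens, which is why it trades creativity for coherence much like low temperature does.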
  14. Rate Limiting
     • Azure enforces strict rate limits – it's not generous.
     • Two key constraints: requests per minute (RPM) & tokens per minute (TPM).
     • Regional quota documentation can be outdated – actual availability and quota limits may differ.
     • Quota increases require approval, but enterprise subscriptions start with higher limits.
     • Verify Azure OpenAI Service quotas & limits (in the portal) before scaling workloads.
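When a request exceeds RPM or TPM, the service replies with HTTP 429, so clients need retry logic. A minimal client-side sketch with exponential backoff; the shape of `request` (returning a status/body pair) is an assumption for illustration, and injecting `sleep` keeps it testable:

```python
import time

def call_with_backoff(request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    # Client-side handling for 429 Too Many Requests: retry with
    # exponential backoff, capped at 30 seconds per wait.
    for attempt in range(max_retries):
        status, body = request()
        if status != 429:
            return body
        sleep(min(base_delay * 2 ** attempt, 30))
    raise RuntimeError("rate limit: retries exhausted")
```

In production you would prefer the `Retry-After` header when the service provides one rather than a fixed backoff schedule.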
  15. Batch Jobs
     • 50% discount on Global Standard pricing.
     • Asynchronous processing for grouped requests, with a dedicated quota.
     • 24-hour target turnaround (best effort, not guaranteed).
     Ideal for: large-scale inference workloads, cost-efficient processing, and non-urgent requests.
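Batch jobs are submitted as a JSONL file with one request object per line. A sketch of building such a file; the exact line shape shown (custom_id, method, url, body) follows my reading of the current Azure OpenAI Batch docs and should be verified against your API version before use:

```python
import json

def build_batch_jsonl(prompts, deployment):
    # One JSON object per line, in the shape the Azure OpenAI Batch API
    # expects (verify field names against your API version).
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {"model": deployment,
                     "messages": [{"role": "user", "content": prompt}]},
        }))
    return "\n".join(lines)
```

The `custom_id` is what lets you match asynchronous results back to the original requests once the batch completes.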
  16. Azure API Management as a Gateway for OpenAI
     • Single-Region APIM: routes traffic across regions.
     • Active-Active Load Balancing: uses VNet peering/private endpoints.
     • Fault Handling: removes throttled/unhealthy instances, retries requests.
     • Health Checks: marks the gateway unhealthy if no instances are available.
  17. Smart Load Balancing for OpenAI Endpoints and Azure API Management
     https://github.com/Azure-Samples/openai-apim-lb
  18. Azure OpenAI Token Limit Policy (APIM)
     Purpose: ensures fair & efficient API usage by managing token consumption per key.
     Key Features
     • Enforcement: blocks requests exceeding the token limit (429 Too Many Requests).
     • Real-Time Monitoring: uses OpenAI metrics for accurate tracking.
     • Prompt Token Precalculation: prevents unnecessary API calls.
     • Configurable Limits: set per-minute token caps per key.
     • Custom Headers: provides retry and remaining-token info.
     <azure-openai-token-limit counter-key="@(context.Request.IpAddress)"
         tokens-per-minute="5000"
         estimate-prompt-tokens="false"
         remaining-tokens-variable-name="remainingTokens" />
     https://learn.microsoft.com/en-us/azure/api-management/azure-openai-token-limit-policy
  19. Token Monitoring
     • Custom Workbook: high-level summary of Azure OpenAI resources across subscriptions, detailing network access patterns and regional distributions.
     • Monitoring Capabilities: in-depth metrics on HTTP requests, token usage, Provisioned Throughput Unit (PTU) utilization, and fine-tuning activities by model name, deployment, and region.
     • Insights Integration: through diagnostic settings.
     Demo workbook & demo standard
     https://techcommunity.microsoft.com/blog/fasttrackforazureblog/azure-openai-insights-monitoring-ai-with-confidence/4026850
  20. Introduction to JSON Mode in Azure OpenAI
     • Challenge: AI models don't guarantee perfectly structured JSON output → application crash.
     • Most AI models support JSON Mode → implement JSON Mode to enforce structured outputs.
     • The Azure OpenAI NuGet package lacks JSON Mode support → but Semantic Kernel does!
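Whichever SDK enables JSON Mode for you, it pays to parse defensively rather than crash on a malformed reply. A minimal sketch of that pattern; `retry` stands in for one corrective re-prompt of the model (stubbed here, not a real SDK call):

```python
import json

def parse_model_json(reply: str, retry):
    # Defensive parsing: even with JSON Mode enabled, validate before
    # trusting the output. `retry` re-asks the model once if parsing fails.
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return json.loads(retry("The previous reply was not valid JSON. "
                                "Return ONLY valid JSON."))
```

If the retry also fails, the second `json.loads` raises, which is usually the right behaviour: fail loudly at the boundary instead of passing garbage downstream.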
  21. Semantic Kernel – What Is It?
     • Multi-platform & open-source.
     • Connectivity to AI services.
     • Custom "functions" to empower AI.
     • Integrated memory support.
     • Orchestrates AI to use available features.
     • Enterprise integration.
     • Consider using Semantic Kernel from the start.