
CAF Cost Optimization Guidebook



Establishing cost management practices for AI.


Transcript

  1. Azure AI Landing Zone: Design with a Cost-Efficiency Mindset

    Korkrid Kyle Akepanidtaworn AI Apps & Agents Factory Lead SPT MACC Delivery Lead 13 March 2026
  2. Agenda

    01 Pre AI LZ: Cost Models and Considerations
    02 Post AI LZ: Cost Optimization
    03 Q&A
  3. Develop Cost-Management Discipline

    1. Establish a foundational cost model early. Before tracking or optimizing spend, define a baseline cost model that clearly maps workloads, consumption drivers, and pricing constructs. This creates a shared financial language across engineering, architecture, and finance.
    2. Budget holistically, not just for infrastructure. Ensure budgets account for the full lifecycle cost of solutions, including core features, operational support, training, governance overhead, and future scale. Underestimating non-compute costs is a common source of budget overruns.
    3. Promote upstream cost visibility and accountability. Encourage continuous communication from architects to application owners so cost implications are considered during design, not discovered in production. Early alignment enables better design trade-offs and prevents reactive cost controls later.
    © Kyle Akepanidtaworn
  4. Design Framework (diagram): building blocks (Agents, Models, Tools, Gateway, Compute, Data, Network, Identity, Monitoring, Governance, Resource Organization, Platform Ops) mapped against the Well-Architected pillars (Reliability, Security, Cost Optimization, Operational Excellence, Performance Efficiency). © Kyle Akepanidtaworn
  5. Design Framework (same diagram as the previous slide). © Kyle Akepanidtaworn
  6. Reference Architectures: AI Landing Zone with Platform Landing Zone (brownfield) and AI Landing Zone without Platform Landing Zone (greenfield). Enterprise-scale and production-ready, to accelerate AI use cases. © Kyle Akepanidtaworn
  7. bicep-registry-modules/avm at main · Azure/bicep-registry-modules

    Resource Modules: updated the Cognitive Services module to allow for Foundry deployment.
    Pattern Modules: created the AI Foundry pattern module (leverages the Cognitive Services module).
    Pattern Modules: contributed to the AI Landing Zone AVM pattern (leverages the AI Foundry pattern module).
    Solution nesting: leveraging AVM pattern modules. © Kyle Akepanidtaworn
  8. Workload Planning & Forecasting

    Baseline Establishment
    • Conduct a 1-week or 2-week pilot, with detailed logging of the Azure components and documentation of all deployed resources (as a Factory Solution Architect, you need to know the difference between greenfield and brownfield landing zone deployment).
    • Track peak vs. average consumption, and set expectations with the customer accordingly.
    Forecasting Methods
    • Historical trend analysis with seasonality adjustments.
    • Business driver correlation (users, transactions, queries).
    • Scenario modeling (optimistic, expected, pessimistic).
    Growth Projections
    • Feature expansion impact (new use cases, prompt complexity).
    • Model upgrades and their cost implications.
    • External factors (if any).
    © Kyle Akepanidtaworn
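The scenario-modeling method above can be sketched as a tiny calculation. All figures (user counts, interactions, token counts, the $0.01 per 1K-token rate) are illustrative assumptions for this sketch, not numbers from a real engagement:

```python
# Illustrative scenario model: monthly token cost under three assumption sets.
PRICE_PER_1K_TOKENS = 0.01  # assumed blended rate, USD

scenarios = {
    "optimistic":  {"users": 7_000,  "interactions": 4, "tokens": 350},
    "expected":    {"users": 10_000, "interactions": 5, "tokens": 400},
    "pessimistic": {"users": 14_000, "interactions": 7, "tokens": 500},
}

def monthly_cost(users, interactions, tokens):
    """Monthly cost = users x interactions/user x tokens/interaction x rate."""
    return users * interactions * tokens / 1_000 * PRICE_PER_1K_TOKENS

for name, s in scenarios.items():
    print(f"{name}: ${monthly_cost(**s):,.0f}/month")
```

The spread between the optimistic and pessimistic outputs is itself a useful signal: a wide band means the forecast drivers need tightening before committing budget.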
  9. Outlook: Monte Carlo Simulation for Confidence Intervals

    Monte Carlo simulation is a computational technique that uses repeated random sampling to estimate the probability distribution of possible outcomes. For forecasting AI costs, it helps quantify uncertainty by running thousands of scenarios with randomly varied inputs.
    How it works:
    1. Define input variables with probability distributions (not single values).
    2. Randomly sample from each distribution.
    3. Calculate the outcome for that scenario.
    4. Repeat thousands of times.
    5. Analyze the distribution of results to get confidence intervals.
    Why it's valuable for AI cost forecasting:
    • Captures compounding uncertainty (users × interactions × tokens).
    • Provides probability-based ranges rather than false precision.
    • Helps set realistic budgets with appropriate contingency.
    • Identifies which variables drive the most variance.
    © Kyle Akepanidtaworn
  10. Example: Forecasting Monthly Azure OpenAI Costs

    Let's say you're forecasting token consumption but have uncertainty about user adoption and usage patterns.
    Inputs (with uncertainty):
    • Monthly users: 10,000 ± 3,000 (normal distribution)
    • Interactions per user: 5 ± 2 (normal distribution)
    • Tokens per interaction: 400 ± 100 (normal distribution)
    • Cost: $0.01 per 1,000 tokens
    Monte Carlo process: run 10,000 simulations, each time randomly picking values:
    • Simulation 1: 11,500 users × 4.2 interactions × 380 tokens = 18.4M tokens → $184
    • Simulation 2: 8,200 users × 6.1 interactions × 450 tokens = 22.5M tokens → $225
    • Simulation 3: 12,800 users × 3.8 interactions × 410 tokens = 19.9M tokens → $199
    ... (repeat 10,000 times)
    Results:
    • Mean estimate: $210/month
    • 80% confidence interval: $140-$295
    • 95% confidence interval: $95-$380
    © Kyle Akepanidtaworn
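The example above can be reproduced with a short Monte Carlo sketch. The distributions and the $0.01 per 1K-token rate come from the slide; the clipping at zero and the nearest-rank percentile indexing are implementation assumptions, so the exact output will differ slightly from the slide's figures:

```python
# Monte Carlo cost forecast: users ~ N(10k, 3k), interactions/user ~ N(5, 2),
# tokens/interaction ~ N(400, 100), $0.01 per 1K tokens.
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def one_scenario():
    users = max(0.0, random.gauss(10_000, 3_000))
    interactions = max(0.0, random.gauss(5, 2))
    tokens = max(0.0, random.gauss(400, 100))
    return users * interactions * tokens / 1_000 * 0.01  # USD/month

costs = sorted(one_scenario() for _ in range(10_000))

mean = sum(costs) / len(costs)
p10, p90 = costs[1_000], costs[9_000]   # ~80% interval (nearest rank)
p2, p97 = costs[250], costs[9_750]      # ~95% interval (nearest rank)
print(f"mean ${mean:.0f}, 80% CI ${p10:.0f}-${p90:.0f}, 95% CI ${p2:.0f}-${p97:.0f}")
```

Note how the 95% interval is far wider than the mean suggests: the compounding of three uncertain factors is exactly what single-point forecasts hide.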
  11. Example: Choosing Azure AI Search Tiers (SKUs)

    Tiers include Free, Basic, Standard, and Storage Optimized. Standard and Storage Optimized are available in several configurations and capacities. QPS stands for queries per second; it measures the throughput of your search service.
    Inputs (with uncertainty):
    • Monthly data ingest: mean of 5 GB with a standard deviation of 2 GB (accounts for document volatility).
    • Peak query load: mean of 10 QPS with a standard deviation of 5 QPS (accounts for traffic spikes).
    • Vector storage overhead: ±20% variation in memory consumption for embedding indexes.
    • Service level objective (SLO): target 99.9% availability, requiring a minimum of 2 replicas for production workloads.
    Monte Carlo process: run 10,000 simulations, each time randomly picking values:
    • Simulation 1: 4 GB data + 8 QPS → both under limits → stay on S1 ($838/mo).
    • Simulation 2: 6 GB data + 22 QPS → high traffic exceeds compute → upgrade to S2 ($3,356/mo).
    • Simulation 3: 28 GB data + 5 QPS → storage exceeds the S1 partition limit → scale S1 to 2 partitions ($1,676/mo).
    Results:
    • Mean monthly cost: $838 (Standard S1 with 2 replicas).
    • P95 "worst case" cost: $3,356 (forced upgrade to Standard S2 due to storage/latency overhead).
    • Recommended tier: Standard S1.
    Choose a service tier - Azure AI Search | Microsoft Learn © Kyle Akepanidtaworn
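The same approach works when the output is a tier decision rather than a dollar figure. In this hedged sketch, the S1 storage and QPS ceilings are hypothetical thresholds chosen only to mirror the slide's scenarios; check the Azure AI Search service limits page for real quotas:

```python
# Monte Carlo tier selection: sample demand, map each scenario to a tier.
import random

random.seed(0)
S1_COST, S2_COST = 838, 3356   # per-month list prices from the slide, USD
S1_STORAGE_GB = 25             # assumed usable storage per S1 partition (illustrative)
S1_MAX_QPS = 15                # assumed S1 compute ceiling (illustrative)

def pick_tier():
    data_gb = max(0.0, random.gauss(5, 2))
    qps = max(0.0, random.gauss(10, 5))
    if qps > S1_MAX_QPS:
        return "S2", S2_COST                    # traffic exceeds S1 compute
    if data_gb > S1_STORAGE_GB:
        return "S1 x2 partitions", 2 * S1_COST  # storage exceeds one partition
    return "S1", S1_COST

costs = sorted(pick_tier()[1] for _ in range(10_000))
print("median:", costs[5_000], "P95:", costs[9_500])
```

The median tells you what to budget for; the P95 tells you what contingency to hold for traffic spikes that force the S2 upgrade.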
  12. "FinOps is an evolving cloud financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology and business teams to collaborate on data-driven spending decisions." – The FinOps Foundation © Kyle Akepanidtaworn
  13. What is Cloud spend and how can it be optimized?

    The simplified view: Cloud Spend = Sum of (Usage × Rate). Optimization therefore splits into usage optimization and rate optimization.
  14. What is Cloud spend and how can it be optimized?

    Cloud Spend = Sum of (Usage × Rate). Key optimization activities to influence cloud spend:
    Rate optimization
    • CSP contract: pay-as-you-go price, CSP selection, CSP regions, enterprise contract (EA, MCA), special commitment discounts (MACC).
    • Committed use discounts: Reservations, Savings Plans, Azure Hybrid Benefits, Dev/Test subscriptions, Spot VMs, pre-purchase commitments (e.g. Defender).
    Usage optimization
    • Waste removal: deallocating resources, right-sizing.
    • Cloud rationalization: rehost, refactor, rearchitect, rebuild, replace.
  15. Optimize Cloud spend: quick wins vs. long-term effort

    Cloud Spend = Sum of (Usage × Rate). The slide plots the rate- and usage-optimization activities (CSP contract terms, EA/MCA/MACC, Reservations, Savings Plans, Azure Hybrid Benefits, Dev/Test subscriptions, Spot VMs, pre-purchase commitments such as Defender, waste removal, deallocating resources, right-sizing, and cloud rationalization: rehost, refactor, rearchitect, rebuild, replace) along an effort axis from quick win through short-/mid-term to long-term.
  16. Azure Cost Management + Billing

    Audience: FinOps & service owners, Azure technical operations.
    Access rights: read access to the subscription; Billing Reader access.
    Provided by: Azure (built-in).
    Purpose: analyzing cost, setting budgets, setting cost alerts, viewing and buying Reservations.
    Benefits: automatic alerts for budget thresholds; CSV data export.
    Automation: can be leveraged via API; Power BI connector available.
    Recommended actions: create budgets and alerts for each subscription.
    Overview of Billing - Microsoft Cost Management | Microsoft Learn
  17. What is Azure Advisor?

    Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure deployments. It analyzes your resource configuration and usage telemetry and then recommends solutions that can help you improve the cost effectiveness, performance, reliability (formerly called high availability), and security of your Azure resources. With Advisor, you can:
    • Get proactive, actionable, and personalized best-practice recommendations.
    • Improve the performance, security, and reliability of your resources as you identify opportunities to reduce your overall Azure spend.
    • Get recommendations with proposed actions inline.
    Advisor score - Azure Advisor | Microsoft Learn
  18. Azure Advisor: cost recommendation key areas

    AI Services, Analytics, Compute, Databases, Management and Governance, Networking, Reserved Instances & Savings Plans, Storage, Web.
  19. Infrastructure Cost Considerations

    1. AKS: application autoscaling, cluster autoscaling, node scaling, etc.
    2. Serverless costs: Azure Functions, etc.
    3. Availability zones
    4. Storage
    5. Load balancing
    6. Networking
    7. Security products
    8. Monitoring (e.g. sending data to Azure Monitor Logs)
    9. Backups
  20. Advisor Score

    The Advisor score consists of an overall score, which can be broken down into five category scores, one for each Advisor category, representing the five pillars of the Well-Architected Framework. Advisor displays your overall score and a breakdown by category, in percentages. A score of 100% in any category means all your resources assessed by Advisor follow the recommended best practices; at the other end of the spectrum, a score of 0% means that none of them do.
  21. Shutdown recommendation

    Advisor identifies resources that haven't been used at all over the last 7 days and recommends shutting them down.
    • Recommendation criteria include CPU and outbound network utilization metrics. Memory isn't considered, since CPU and outbound network utilization have proven sufficient.
    • The last 7 days of utilization data are analyzed. You can change the lookback period in the configuration.
    • Metrics are sampled every 30 seconds, aggregated to 1 minute, and then further aggregated to 30 minutes (taking the max of average values while aggregating to 30 minutes). On virtual machine scale sets, the metrics from individual virtual machines are aggregated using the average of the metrics across instances.
    • A shutdown recommendation is created if:
      • P95 of the maximum CPU utilization summed across all cores is less than 3%.
      • P100 of average CPU over the last 3 days (summed over all cores) is at most 2%.
      • Outbound network utilization is less than 2% over the seven-day period.
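The decision rule above can be sketched as a small check. The nearest-rank percentile helper and the sample data are illustrative; this is a reading of the documented thresholds, not Advisor's internal implementation:

```python
# Illustrative check of the shutdown criteria: P95 of max CPU < 3%,
# P100 of 3-day average CPU <= 2%, outbound network < 2% over 7 days.

def percentile(samples, p):
    """Nearest-rank percentile of a list of numbers (0 <= p <= 100)."""
    s = sorted(samples)
    idx = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[idx]

def recommend_shutdown(cpu_7d_max, cpu_3d_avg, net_7d):
    """cpu_7d_max: 7 days of 30-min max-of-avg CPU samples (%, all cores);
    cpu_3d_avg: last 3 days of average CPU samples (%);
    net_7d: 7 days of outbound network utilization samples (%)."""
    return (
        percentile(cpu_7d_max, 95) < 3   # P95 of max CPU under 3%
        and max(cpu_3d_avg) <= 2         # P100 of 3-day average CPU at most 2%
        and max(net_7d) < 2              # outbound network under 2% all week
    )

# 336 = 7 days of 30-minute samples; 144 = 3 days of 30-minute samples.
print(recommend_shutdown([1.0] * 336, [0.5] * 144, [0.3] * 336))
```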
  22. Key Recommendation: Resize SKU (1/3)

    Advisor recommends resizing virtual machines when it's possible to fit the current load on a more appropriate SKU that is less expensive (based on retail rates). On virtual machine scale sets, Advisor recommends resizing when it's possible to fit the current load on a more appropriate, cheaper SKU, or on a lower number of instances of the same SKU.
    • Recommendation criteria include CPU, memory, and outbound network utilization.
    • The last 7 days of utilization data are analyzed. You can change the lookback period in the configuration.
    • Metrics are sampled every 30 seconds, aggregated to 1 minute, and then further aggregated to 30 minutes (taking the max of average values while aggregating to 30 minutes). On virtual machine scale sets, the metrics from individual virtual machines are aggregated using the average of the metrics for instance-count recommendations, and the max of the metrics for SKU-change recommendations.
  23. Key Recommendation: Resize SKU (2/3)

    An appropriate SKU (for virtual machines) or instance count (for virtual machine scale set resources) is determined based on the following criteria:
    • Performance of the workloads on the new SKU shouldn't be impacted.
    • Target for user-facing workloads:
      • P95 of CPU and outbound network utilization at 40% or lower on the recommended SKU.
      • P100 of memory utilization at 60% or lower on the recommended SKU.
    • Target for non-user-facing workloads:
      • P95 of CPU and outbound network utilization at 80% or lower on the new SKU.
      • P100 of memory utilization at 80% or lower on the new SKU.
    • The new SKU, if applicable, has the same Accelerated Networking and Premium Storage capabilities.
    • The new SKU, if applicable, is supported in the current region of the virtual machine.
    • The new SKU, if applicable, is less expensive.
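The per-workload targets above can be expressed as a small fitness check. The function and its threshold table are an illustrative sketch of the documented targets, not Advisor's actual evaluation logic:

```python
# Illustrative check: does projected utilization on a candidate SKU stay
# under the resize targets for its workload class?

TARGETS = {
    # workload class: (P95 CPU/network ceiling %, P100 memory ceiling %)
    "user_facing":     (40, 60),
    "non_user_facing": (80, 80),
}

def fits_target(p95_cpu, p95_net, p100_mem, workload="user_facing"):
    """True if projected utilization on the candidate SKU is within targets."""
    cpu_net_cap, mem_cap = TARGETS[workload]
    return max(p95_cpu, p95_net) <= cpu_net_cap and p100_mem <= mem_cap

print(fits_target(35, 20, 55))                     # user-facing: within targets
print(fits_target(70, 20, 75, "non_user_facing"))  # non-user-facing: within targets
print(fits_target(70, 20, 75))                     # user-facing: CPU too hot
```

The same numbers pass for a background job but fail for a user-facing one, which is why Advisor classifies the workload before proposing a SKU.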
  24. Key Recommendation: Resize SKU (3/3)

    • Instance-count recommendations also take into account whether the virtual machine scale set is managed by Service Fabric or AKS. For Service Fabric-managed resources, recommendations take reliability and durability tiers into account.
    • Advisor determines whether a workload is user-facing by analyzing its CPU utilization characteristics. The approach is based on findings by Microsoft Research; see Prediction-Based Power Oversubscription in Cloud Platforms - Microsoft Research.
    • Based on the best fit and the cheapest cost with no performance impact, Advisor recommends not only smaller SKUs in the same family (for example, D3v2 to D2v2), but also SKUs in a newer version (for example, D3v2 to D2v3) or a different family (for example, D3v2 to E3v2).
    • For virtual machine scale set resources, Advisor prioritizes instance-count recommendations over SKU-change recommendations, because instance-count changes are easily actionable and result in faster savings.
  25. Key Recommendation: Savings Plans & Reservations

    • Advisor analyzes your compute usage over the last 30 days and recommends adding a savings plan to increase your savings.
    • Advisor analyzes usage patterns over the selected term and look-back period, and recommends a Reserved Instance purchase that maximizes your savings.
    Covered recommendations:
    • Virtual Machines
    • App Services
    • Databases: Cosmos DB, SQL PaaS, MariaDB, MySQL, PostgreSQL, Azure Synapse Analytics
    • Cache for Redis
    • Storage: Blob Storage, Azure Files, NetApp Storage, Azure Managed Disks
    • Azure Dedicated Host
    • Data Factory and Azure Data Explorer
    • 3rd-party solutions: Azure VMware Solution, Red Hat reserved instances, SAP HANA, SUSE Linux, VMware CloudSimple
    Caution: currently, Savings Plan and Reservation recommendations for VMs might be cumulative!
  26. Identifying AI Cost Drivers

    Primary cost drivers
    • Model selection: GPT-4 is 10-30x more expensive than GPT-3.5.
    • Token volume: both input and output token consumption.
    • Deployment type: Global vs. Regional vs. Data Zone pricing.
    • Feature usage: fine-tuning, embeddings, assistants, function calling.
    Hidden cost factors
    • Retry logic increasing token consumption on failures.
    • Verbose prompts with unnecessary context.
    • Conversation history growing unboundedly.
    • Fine-tuned model hosting charges (~$3/hour per model).
    • Infrastructure overhead: Key Vault, VNet endpoints, storage, monitoring.
    Cost attribution challenges
    • Shared model deployments across multiple applications.
    • Difficulty tracking token usage back to specific features.
    • Allocating PTU costs when utilization varies by team.
  27. Complex architecture leads to cost complexity

    Azure AI Foundry services
    • Azure OpenAI Service: GPT-4, GPT-4o, GPT-3.5, o1, o3 models.
    • Azure AI Foundry Models: DeepSeek, Llama, Grok, Mistral, FLUX.
    • Azure Machine Learning: custom model training and deployment.
    • Azure Cognitive Services: Vision, Speech, Language, Decision APIs.
    Deployment types
    • Global: access to models across all Azure regions with highest availability.
    • Data Zone: data residency within specific geographic zones.
    • Regional: dedicated capacity in specific Azure regions.
    Cost-relevant characteristics
    • Each service has distinct pricing models and cost drivers.
    • Model selection within a service dramatically impacts costs.
    • Deployment type affects both performance and pricing.
  28. Azure OpenAI Service: Pricing Examples (Mar 2025, USD)

    Language models (per 1M tokens):
    • GPT-5 (2025-08-07, Global): $1.25 input / $10.00 output.
    • GPT-5-nano (Global): $0.05 input / $0.40 output.
    Fine-tuning models:
    • o4-mini (Global): $100/hr training, $1.70/hr hosting, $1.10/$4.40 per 1M input/output tokens.
    • GPT-4.1: $25 per 1M training tokens, $1.70/hr hosting, $2/$8 per 1M input/output tokens.
    Image models:
    • DALL-E 3: $4.40-$13.20 per 100 images, depending on resolution and definition.
    Video models:
    • Sora 2 (Global): $0.10-$0.50 per second, depending on resolution.
    Embedding models:
    • text-embedding-3-small: $0.000022 per 1K tokens.
    • text-embedding-3-large: $0.000143 per 1K tokens.
    Chat completions API (per 1M tokens, text/audio):
    • GPT-4o-Mini-Audio-Preview: $0.15/$10 input, $0.60/$20 output.
    • GPT-4o-Audio-Preview: $2.50/$40 input, $10/$80 output.
    Region, model type, model, input tokens, output tokens, support level, and licensing agreement all impact pricing.
    Azure OpenAI Service - Pricing | Microsoft Azure
  29. Azure AI Search: Pricing Examples (Nov 2025, USD)

    Additional pricing applies for the custom entity lookup skill, document cracking (image extraction), and the semantic ranker.
  30. Azure AI Search Import a representative sample of data to

    index: Import Wizards in the Azure portal - Azure AI Search | Microsoft Learn
  31. The non-Foundry search service vs. the AI Foundry search service

    The architectural design pattern aligns with best practices: "To maximize reliability and minimize the blast radius of failures, strictly isolate the Foundry Agent Service dependencies from other workload components that use the same Azure services. Specifically, don't share AI Search, Azure Cosmos DB, or Storage resources between the agent service and other application components. Instead, provision dedicated instances for the agent service's required dependencies."
    Reliability in Azure AI Search - Azure AI Search | Microsoft Learn © Kyle Akepanidtaworn
  32. Azure AI Search FAQs

    Can I temporarily shut down a search service to save on costs? Search runs as a continuous service. Dedicated resources are always operational and allocated for your exclusive use for the lifetime of your service. To stop billing entirely, you must delete the service. Deleting a service is permanent and also deletes its associated data.
    Can I pause the service and stop billing? You can't pause a search service. In Azure AI Search, computing resources are allocated when the service is created. It's not possible to release and reclaim those resources on demand.
    Can I rename or move the service? Service name and region are fixed for the lifetime of the service.
    Can I change the billing rate (tier) of an existing search service? Existing services can switch between Basic and Standard (S1, S2, and S3) tiers. Your current service configuration can't exceed the limits of the target tier, and your region can't have capacity constraints on the target tier. For more information, see Change your pricing tier.
    Are "Azure Search," "Azure Cognitive Search," and "Azure AI Search" the same product? Yes. They're all the same product, with rebranding occurring in October 2019 and again in October 2023. You might occasionally see evidence of the former names at the programmatic level.
    Azure AI Search FAQ - Azure AI Search | Microsoft Learn © Kyle Akepanidtaworn
  33. Tokens

    One token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ≈ 75 words).
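As a planning heuristic only, the rule of thumb above can be sketched in one line. Real tokenizers vary by model and language, so billing-grade estimates should come from the model's actual tokenizer rather than this character count:

```python
# Rough planning heuristic: ~4 characters per token for common English text.

def estimate_tokens(text: str) -> int:
    """Crude token estimate; not a substitute for a real tokenizer."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello, how can I optimize my Azure costs?"))
```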
  34. Token Economics (Token-omics)

    Understanding tokens
    • 1 token ≈ 4 characters in English (varies by language).
    • Both input (prompts) and output (completions) consume tokens.
    • System messages, conversation history, and context all count as input.
    Token ratio impact
    • Input-heavy workloads (RAG, document analysis) optimize differently than output-heavy ones.
    • GPT-4.1: 1 output token = 4 input tokens for utilization calculation.
    • GPT-5: 1 output token = 8 input tokens for utilization calculation.
    • Cached tokens receive a 100% discount from utilization.
    Cost calculation example: 100 input tokens + 300 output tokens at GPT-4 Turbo pricing:
    • Input: 100 × $0.01/1K = $0.001
    • Output: 300 × $0.03/1K = $0.009
    • Total per request: $0.01 → $10,000 for 1M requests
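The per-request arithmetic above can be wrapped in a small helper. The default rates are the slide's illustrative GPT-4 Turbo per-1K-token prices; substitute current list prices for real estimates:

```python
# Per-request cost = input tokens x input rate + output tokens x output rate.

def request_cost(input_tokens, output_tokens,
                 in_rate_per_1k=0.01, out_rate_per_1k=0.03):
    """USD cost of one request at the given per-1K-token rates."""
    return (input_tokens / 1_000 * in_rate_per_1k
            + output_tokens / 1_000 * out_rate_per_1k)

per_request = request_cost(100, 300)
print(f"${per_request:.3f} per request, "
      f"${per_request * 1_000_000:,.0f} per 1M requests")
```

Because output tokens cost 3x input tokens here, trimming verbose completions usually saves more than trimming prompts of the same length.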
  35. AI Workload Cost Optimization

    • Consider using pre-built models in Microsoft Foundry to speed up deployment and reduce costs.
    • Use the right series of models.
    • Optimize token usage: combine or reduce requests. For example, do you need to send the previous conversation turns (including responses), a summary, or just the previous user inputs?
    • Right-size other application components.
    • Consider Model Router or token routing.
  36. Recommended Optimization Flow (diagram)

    Starting from unoptimized total spend, apply usage optimization first (remove waste, deallocate resources, right-size the workload), then rate optimization (Azure Hybrid Benefits to reduce license cost, Reservations, Savings Plans), with each step shown reducing the remaining spend.
  37. WAF-Aligned Optimization Trade-offs

    Cost vs. performance
    • Smaller models: lower cost, potentially lower quality.
    • PTU sizing: over-provision for performance, under-provision for cost.
    • Caching: reduces cost but may serve stale responses.
    • Batch processing: significant savings but adds latency.
    Cost vs. reliability
    • Multi-region deployment: higher cost, better availability.
    • Failover capacity: reserved but potentially underutilized.
    • Redundant deployments: insurance against capacity constraints.
    Cost vs. security
    • Private endpoints: additional network costs.
    • Customer-managed keys: operational overhead.
    • Data residency requirements: may limit cost-optimal regions.
    Optimization maximizes value, not just minimizes cost: consider all WAF pillars.
  38. Summary

    A cost-optimized workload isn't necessarily a low-cost workload; there are significant trade-offs. Quick fixes might save money short-term, but for long-term savings you need a solid plan that includes prioritization, continuous monitoring, and repeatable processes focused on optimization. As you prioritize business requirements to align with technology needs, you can adjust costs, but expect trade-offs in areas like security, scalability, resilience, and operability. If the cost of addressing challenges in those areas is high and these principles aren't applied properly, you might make risky choices in favor of a cheaper solution, and those choices can hurt your goals or reputation.
    Cost Optimization design principles · Design review checklist for Cost Optimization · Cost Optimization tradeoffs · Cloud design patterns that support cost optimization
  39. Q&A

  40. Thank you 谢谢 Gracias धन्यवाद شكراً Merci Obrigado Спасибо

    ありがとうございます 감사합니다 Teşekkür ederim תודה አመሰግናለሁ ขอบคุณ Cảm ơn Terima kasih Salamat ధన్యవాదాలు நன்றி شکریہ ಧನ್ಯವಾದಗಳು നന്ദി આભાર ਧੰਨਵਾਦ ধন্যবাদ Danke Grazie
  41. Learn more

    Azure Pricing Calculator: https://azure.microsoft.com/pricing/calculator
    Azure OpenAI pricing: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service
    Plan to manage costs for Azure OpenAI Service: https://learn.microsoft.com/azure/ai-services/openai/how-to/manage-costs
    Azure AI Search pricing: https://azure.microsoft.com/pricing/details/search/
    Plan and manage costs of an Azure AI Search service: https://learn.microsoft.com/azure/search/search-sku-manage-costs
  42. Learn more

    Introduction to cost management for AI workloads: https://learn.microsoft.com/training/modules/understand-cost-management-ai/
    Establishing cost management practices for AI: https://learn.microsoft.com/training/modules/establish-ai-cost-management-practices/
    Manage cost efficiency of Azure and AI investments: https://aka.ms/AzureEssentialsCostEfficiency
    Maximize the cost efficiency of AI agents on Azure: https://aka.ms/Learn-CostEfficientAIAgents
  43. Learn more

    The FinOps Foundation: https://www.finops.org/
    Adopt FinOps on Azure: https://learn.microsoft.com/training/modules/adopt-finops-on-azure/?WT.mc_id=modinfra-146475-socuff
    FinOps with Azure (eBook): https://info.microsoft.com/ww-landing-finops-with-azure-bringing-finops-to-life-through-organizational-and-cultural-alignment.html
    FinOps interactive guides: https://mslearn.cloudguides.com/guides/FinOps%20on%20Azure
    FinOps Review (assessment): https://aka.ms/FinOps-Assessment
    Microsoft FinOps blog: https://aka.ms/Finops/TCblog