ABCS26: DevOps in Azure: Why Alert Emails are not a Monitoring Strategy by Alexander Sameli & Daniel Steinmann

1 Why alert emails are not a monitoring strategy
Daniel Steinmann & Alex Sameli, 21.05.2026

2 Managed Service Provider in the logistics industry
Our technology stack:
− C# / .NET / Angular
− Microsoft SQL Server
− Microsoft Azure
− Terraform / Terragrunt

3 Our Azure infrastructure
- PaaS only
- 4 environments (DEV, AT, UAT, PROD)
- Geo-redundant, globally load-balanced workloads

5 Our Azure infrastructure in numbers
− 350 Resource Groups
− 210 App Service Plans
− 250 App Services
− 220 Azure SQL Databases
− 130 Key Vaults
− 3 Resource Naming Conventions

6 Who is using our services?

7 Parcel Handling
- 130’000 parcels daily
- Sorting and weighing
- Loaded onto vehicles or trains
- Operating hours are during the night

8 Couriers
- Tour data
- Next stops
- Delivery confirmations
- Drop-off locations

9 Customers
- Track deliveries
- Change drop-off locations
- Sign delivery confirmations

10 Key objectives of monitoring and alerting
- No stacked parcels
- No unhappy customers

12 Deploy Availability Alerts with Terraform
Azure Front Door
− Health Checks
− Certificate Expiration

14 Deploy Metric Alerts with Terraform
Azure App Service Plan
− HTTP Queue Length
− CPU Usage
− Memory Usage

15 Deploy Observability Alerts with Terraform
Azure App Service
− Traces
− Logs
− Events
− Exceptions
− Potential Memory Leak / Failure Anomalies

17 Deploy Metric Alerts with Terraform
Azure SQL Database
− Deadlocks
− Failed Connections
− Used Storage Percentage
− CPU Usage

18 https://azure.github.io/azure-monitor-baseline-alerts/welcome/

20 Deploy Event Grid subscription Alerts with Terraform
Key Vault
− Secret expiration

21 Alerts are sent to an action group (our inboxes)
Life is good, or is it?

22 Let's evaluate our alerting
Precision: The proportion of detected events that were significant.
Recall: The proportion of significant events that were detected.

23 Precision: 𝑃 = 98%, Recall: 𝑅 = 65%
[Figure: precision shown as 98 of 100, recall shown as 65 of 100]
Source: https://inselgruppe.ch/fileadmin/Insel_Gruppe/Bilder/Mediendienst/News/WissKomm/Wisskomm_divers/2021_Int_J_Infect_Dis_Jegerlehner.pdf

25
- Average time spent on alerting per 2 weeks: 7 hours
- Triage was less than optimal

27 [Figure: Recall and Precision, each on an axis from 0% to 100%]

28 Why are exceptions not ideal for alerting?
− Transient
− Abstraction over different root causes
− Missing context

29 Transient exceptions
− Filter them to improve precision

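30 Abstraction over different root causes
− Login failed for user '<token-identified principal>'.
− Cannot insert duplicate key row.
− Schwerer Fehler für den aktuellen Befehl. Die Ergebnisse (wenn vorhanden) sollten verworfen werden. (German: A severe error occurred on the current command. The results, if any, should be discarded.)
− Une erreur grave s'est produite pendant la commande actuelle. (French: A severe error occurred on the current command.)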

31 Missing context

32 [Figure: Recall and Precision, each on an axis from 0% to 100%]

33 https://sre.google/sre-book/

34 Resource alerts
- Many false positives (Precision)
- Unknown error sources (Recall)

35 Resource alerts

36 SLI, SLO and SLA for request-driven systems
Service Level Indicator (SLI): Quantitative measure of a service level aspect
- Throughput
- Error rate
- Latency
Service Level Objective (SLO): SLI ≥ target within a time window
Real-life SLOs of the MdS Secure Token Service:
- 99.9% of POST /connect/token requests will complete successfully over 30 days. (Error rate)
- 99% of POST /connect/token requests will complete in less than 130ms over 30 days. (Latency)
Service Level Agreement (SLA): Contract that defines the consequences of missing SLO targets
https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services
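To make the latency SLO concrete, here is a minimal sketch of how such an SLI could be computed from raw request durations. The function names and the sample data are hypothetical and not taken from the talk; only the 130ms threshold and the 99% target come from the slide.

```python
def latency_sli(durations_ms, threshold_ms=130.0):
    """Return the fraction of requests that completed in less than threshold_ms."""
    if not durations_ms:
        raise ValueError("no requests in the evaluation window")
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms)


def slo_met(sli, target):
    """An SLO holds when the SLI is at least the target within the time window."""
    return sli >= target


# Hypothetical sample: 995 fast requests and 5 slow ones.
durations = [42.0] * 995 + [250.0] * 5
sli = latency_sli(durations)
print(f"latency SLI = {sli:.1%}, SLO (99%) met: {slo_met(sli, 0.99)}")
```

In a real setup the durations would come from request telemetry over the 30-day window rather than from an in-memory list.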

37 Secure Token Service (STS)
[Diagram: three POST requests to /connect/token; error rate and latency are monitored with SLO alerts]
- 15k+ parcels stacked within 1h
- No navigation

39 Service endpoint: POST /connect/token
SLO: 99.9% over 30 days
SLI: Error rate (30d) = 5′110 / 10′080′000 ≈ 0.05%
0.05% < 0.1%, so the SLO is met.
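The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the request counts are the ones shown on the slide, while the function and variable names are made up for illustration.

```python
def error_rate(failed_requests, total_requests):
    """Error-rate SLI: share of requests that failed in the window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return failed_requests / total_requests


SLO = 0.999                   # 99.9% of requests succeed over 30 days
allowed_error_rate = 1 - SLO  # 0.1%

# Counts from the slide for POST /connect/token over 30 days.
rate = error_rate(5_110, 10_080_000)
print(f"error rate = {rate:.2%}, allowed = {allowed_error_rate:.1%}")
print("SLO met" if rate < allowed_error_rate else "SLO missed")
```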

40 SLO: 99.9% of HTTP requests will complete successfully over 30 days
Error rate = Failed request count / Request count
SLO target = 1 − SLO = 0.1%
[Figure: error rate (0.1% to 100%) over time, with the SLO target drawn across the SLO duration (30d)]

41 SLO: 99.9% of HTTP requests will complete successfully over 30 days
Error budget = SLO target ⋅ SLO duration
Burn rate = Error rate / SLO target
Budget consumption (30-day SLO):
  Burn rate   Budget consumed in
  0.5         60d
  1           30d
  6           5d
  14.4        2d
[Figure: error rate (0.1% to 100%) over time; a burn rate of 1 consumes the budget exactly over the SLO duration (30d); severity legend: Low, Moderate, High]
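The relation between error rate, burn rate and the time until the error budget is used up can be sketched as follows. The helper names are hypothetical; the formulas are the ones from the slide (burn rate = error rate / SLO target, with a burn rate of 1 consuming the budget in exactly the SLO duration).

```python
SLO = 0.999             # 99.9% success over 30 days, as on the slide
SLO_DURATION_DAYS = 30


def burn_rate(error_rate, slo=SLO):
    """How fast the error budget is consumed relative to the allowed error rate."""
    return error_rate / (1.0 - slo)


def days_until_budget_consumed(rate, slo_duration_days=SLO_DURATION_DAYS):
    """At a constant burn rate, the whole budget lasts duration / burn rate."""
    if rate <= 0:
        return float("inf")
    return slo_duration_days / rate


for rate in (0.5, 1.0, 14.4):
    days = days_until_budget_consumed(rate)
    print(f"burn rate {rate:>4}: budget consumed in {days:.1f} days")
```

A constant error rate of 0.05%, as measured on the earlier slide, corresponds to a burn rate of about 0.5.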

42 SLI: Error rate, SLO: 99.9% in 30d
Avg. Throughput: 14k requests/h
Avg. Error Budget: 10k errors/30d
[Figure: burn rate over time with thresholds at 1 (Low), 6 (Moderate) and 14.4 (High) across the SLO duration (30d); the alert rule evaluates the burn rate over its time window (10h) and fires with the matching severity; time markers 𝑡1 to 𝑡4; error budget consumed marked at 2% and 20%]

43 Alert fires when 2-10% of error budget is consumed, with severity based on burn rate.
[Figure: burn rate over time with thresholds at 1 (Low), 6 (Moderate) and 14.4 (High) across the SLO duration (30d); time windows from 1h to 3d; error budget consumed from 2% to 10%]

44 Multiwindow, Multi-Burn-Rate Alerts
https://sre.google/workbook/alerting-on-slos/
  Priority   T    Burn rate   Time window
  High       1h   14.4        1h
  Moderate   1d   6           6h
  Low        2w   1           3d
Service Level Objective (SLO)
- 99.9% of HTTP requests will complete successfully over 30 days. (Error rate)
- 99% of HTTP requests will complete in less than 130ms over 30 days. (Latency)
KQL
- 9 standardized alert rule parameters
- 3 SLO parameters per endpoint
- Optimized precision & recall out of the box
- Deterministic alert rules
- Automated deployment
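The talk implements these rules in KQL, which is not shown on the slides. As a rough illustration only, the following Python sketch evaluates the three burn-rate thresholds and time windows from the table against hourly request counts. The data layout, names and sample numbers are assumptions, and the sketch uses a single window per rule; the short confirmation windows described in the SRE workbook are left out.

```python
from dataclasses import dataclass

SLO = 0.999  # 99.9% success over 30 days


@dataclass(frozen=True)
class Rule:
    priority: str
    burn_rate: float  # threshold from the slide
    window_h: int     # time window in hours


# Thresholds and windows from the slide (3d = 72h), highest priority first.
RULES = [Rule("High", 14.4, 1), Rule("Moderate", 6.0, 6), Rule("Low", 1.0, 72)]


def window_burn_rate(hourly, window_h):
    """hourly: list of (failed, total) request counts per hour, newest last."""
    recent = hourly[-window_h:]
    failed = sum(f for f, _ in recent)
    total = sum(t for _, t in recent)
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - SLO)


def evaluate(hourly):
    """Return the highest priority whose threshold is reached, else None."""
    for rule in RULES:
        if window_burn_rate(hourly, rule.window_h) >= rule.burn_rate:
            return rule.priority
    return None


# Hypothetical traffic at the average throughput from the talk (14k requests/h).
healthy = [(1, 14_000)] * 72
spike = healthy[:-1] + [(300, 14_000)]          # one bad hour
slow_burn = [(20, 14_000)] * 72                 # mildly elevated for 3 days
print(evaluate(healthy), evaluate(spike), evaluate(slow_burn))
```

Ordering the rules from highest to lowest priority means a sharp spike is reported once, with the most urgent severity, instead of triggering every rule it also satisfies.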

45 Resource alerts, SLO alerts
- Increased coverage through SLO alerts (Recall)
- Resource alerts as early warnings only (Precision)

46 Results
September 2024: 103 alerts investigated, 6 actions taken, 0 incidents missed
𝑃 = 6 / 103 ≈ 6%, 𝑅 = 100%
March 2026: 13 alerts investigated, 7 actions taken, 0 incidents missed
𝑃 = 7 / 13 ≈ 54%, 𝑅 = 100%
- 8x fewer developer interruptions
- No incidents missed
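The precision and recall figures above follow directly from the alert counts. The sketch below reproduces them, assuming that "actions taken" counts as true positives, that investigated alerts without an action are false positives, and that "incidents missed" are false negatives; the function names are illustrative.

```python
def precision(true_positives, false_positives):
    """Share of investigated alerts that turned out to be significant."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of significant events that were actually alerted on."""
    return true_positives / (true_positives + false_negatives)


# (alerts investigated, actions taken, incidents missed) from the slide.
months = {"September 2024": (103, 6, 0), "March 2026": (13, 7, 0)}

for month, (investigated, actions, missed) in months.items():
    p = precision(actions, investigated - actions)
    r = recall(actions, missed)
    print(f"{month}: P = {p:.0%}, R = {r:.0%}")

# Interruptions dropped from 103 to 13 investigated alerts.
print(f"{103 / 13:.1f}x fewer alerts to investigate")
```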

47 Remaining challenges
• SLOs for non-request-driven systems
  • Pipelines (Throughput, End-to-end latency)
  • Storage (Latency, Availability, Durability)
• Shared understanding of SLOs
• Integrate sampling
• Faster iteration of alerting parameters

48 Life is good?
Developer on alerting duty: Yes, pretty good.
