Lessons Learnt Building and Maintaining Microservices at Scale on the Azure Platform by Alex Bulankou

I Don’t Miss Those Calls at Night: Lessons Learnt Building and Maintaining Microservices at Scale
Alex Bulankou, @bulankou

Introduction
• I’m Alex Bulankou, engineering manager on Microsoft Application Insights. My team owns the SDKs and the APM experience.
• Application Insights is an extensible APM service for developers on multiple platforms
• We are “Azure first but not only Azure”: premium integration and enablement for Azure-hosted services, plus support for on-prem services and other cloud platforms
• Before that: engineer on Office 365, Bing, Visual Studio, .NET Framework
• Nights & weekends: https://www.stopbystop.com, a web service and iOS/Android apps for road travelers
• Originally from Minsk, Belarus
• Lived in the US Pacific Northwest for the past 10 years

What is a Microservice #1
In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.
• Products not projects (“You build, you run it”)
  • Including live site support
• Smart endpoints and dumb pipes
  • Microservices are smart, but messaging is reliable, generic and clueless
• Decentralized governance
  • Use best tool for the job and shared libraries
• Decentralized data management
  • Polyglot persistence, transactionless coordination, idempotency of operations
• Infrastructure automation
  • Continuous delivery
• Design for failure
  • Monitoring is critical
• Evolutionary design
  • Independent replacement and upgradability
Martin Fowler, https://martinfowler.com/articles/microservices.html

Our Product is a Distributed Application Built on Microservice Architecture
• Ingestion service accepts data from SDKs and deposits it to Event Hub. It also routes data to Alerting service
  • Note the “It also”: usually an indication that something is fishy :)
• Loaders preprocess the data and load it from Event Hub into Metrics and Analytics Stores
• Insights detection, export and query services talk to Analytics and Metrics Stores
• Profile service maintains information about customer profiles and uses DocumentDB as its store
• Billing service encapsulates billing logic and communicates with Profile service
• UI service provides the presentation layer. It communicates with query service
• We are developing more and more services on Node.js, while continuing to support existing .NET services (“decentralized governance”)
• We could easily switch the store for profiles from Azure Storage to Azure DocumentDB (“decentralized data management”)
[Diagram: SDKs; Ingestion Service <<Node.JS>>; Loader (x2); Metrics Store; Analytics Store (codename Kusto); Query service <<Node.JS>>; UI service <<ASP.NET>>; API docs frontend <<Node.JS>>; Billing Service <<ASP.NET>>; Profile Service <<ASP.NET>>; Alerting & Notification Service <<ASP.NET>>; Azure DocumentDB; Azure Event Hub; Export service <<.NET>>; Insights detection <<.NET>>]

Our Path to Where We Are
• Our daily live site triages and weekly live site reviews today are boring and uneventful
  • We have enough time to talk about Safe Deploy and engineers using SAW machines
• 2 years ago it was different
  • 5-7 Sev1 incidents *daily*
  • DRIs generally skipping morning triage (catching up on sleep after the night)
  • No time for 5 Whys exercises
[Diagram: same service architecture as on the previous slide]

Lesson #1: Plan your Triage and Diagnostics #1
• Design your service monitoring with the future triage and diagnostics experience in mind
• Plan your alert metrics
• Plan your daily live site triage. What charts will you show?
  • For performance, percentiles are generally a much more indicative metric than averages. You can use “poor man’s percentiles”
  • For each chart there should be an investigation threshold
  • Focus on the number of users and sessions affected
• Plan your diagnostics experience
  • You cannot collect all data (storage cost, network cost)
  • When sampling, correlation is critical
  • Either collect an operation with all of its correlated data, or don’t collect it at all. Don’t include an operation with partial information.
Callouts: Percentiles, not averages. When to investigate? Focus on user impact.
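One way to honor the "all correlated data or nothing" rule is to make the sampling decision a pure function of the operation ID, so every telemetry item of an operation gets the same verdict wherever it is emitted. The sketch below assumes this approach; `keep_operation`, `sample` and the `operation_id` field are hypothetical names for illustration, not the Application Insights SDK API.

```python
import hashlib


def keep_operation(operation_id: str, sample_rate: float) -> bool:
    """Decide once per operation, so all correlated telemetry shares the verdict."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


def sample(items: list, sample_rate: float) -> list:
    """Keep or drop telemetry items according to the operation they belong to."""
    return [item for item in items if keep_operation(item["operation_id"], sample_rate)]


# Hypothetical telemetry: each operation emits a request, a dependency call
# and an exception record that share one operation ID.
telemetry = [
    {"operation_id": f"op-{i}", "kind": kind}
    for i in range(1000)
    for kind in ("request", "dependency", "exception")
]
kept = sample(telemetry, sample_rate=0.1)
```

Because the decision depends only on the ID, independent services can sample without coordinating and still never keep half of an operation.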

Lesson #1: Plan your Triage and Diagnostics #2
Isolate your monitoring pipeline
• Production environment is monitored by Pre-production
  • PPE health is just as critical as PROD health
• Pre-production environment is monitored by Production
• Other environments (DEV) are monitored by Production
[Diagram: Application Insights (Production) and Application Insights (Pre-Production); PPE and PROD monitor each other]

Lesson #2: Smart Geo-distribution #1
• We optimize for best user experience
• Basic physics demands proximity for best experience
  • 40ms for light to travel from Chicago to Beijing
• When latency matters, frontend edge nodes are placed as close to the customer as possible and are made easy to roll out
• However, be prepared for significant network costs
[Diagram: one Backend, four Frontends]
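The Chicago to Beijing figure can be sanity-checked with a great-circle distance and the speed of light. This is a rough sketch: the city coordinates are approximate, and the two-thirds-of-c factor for optical fiber is a common rule of thumb rather than a number from the talk. In vacuum the one-way time comes out in the mid-30s of milliseconds, the same ballpark as the slide's 40 ms, and real networks are slower still.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # in vacuum
EARTH_RADIUS_KM = 6371.0           # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def one_way_ms(distance_km, fraction_of_c=1.0):
    """One-way propagation delay in milliseconds at a given fraction of c."""
    return distance_km / (SPEED_OF_LIGHT_KM_S * fraction_of_c) * 1000


chicago = (41.88, -87.63)  # approximate coordinates (assumption)
beijing = (39.90, 116.41)  # approximate coordinates (assumption)

distance = great_circle_km(*chicago, *beijing)
vacuum_ms = one_way_ms(distance)
fiber_ms = one_way_ms(distance, fraction_of_c=2 / 3)  # rule-of-thumb fiber speed (assumption)
```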

Lesson #2: Smart Geo-distribution #2
• End-user latency to the frontend is not always critical
• Our example: for our customers sending data from server SDKs, latency doesn’t matter
  • However, it matters for client-side SDKs
• Best of both worlds: smart defaults with customizations
[Diagram: one Backend, four Frontends; some clients are pointed to the endpoint of the frontend closest to the backend]

Lesson #3: Alert Routing and Consolidation #1
• This is all too common
• FE on-call engineers (DRIs) are woken up for BE problems
  • Issue #1: all sibling service DRIs are woken up due to the same problem
  • Issue #2: time is wasted while the root cause is determined manually and escalated to the correct DRI
[Diagram: Storage; Billing service; Product Catalog service; Customer Data service; Storage on-call engineer; Billing on-call engineer; Catalog on-call engineer; Customer DB on-call engineer]

Lesson #3: Alert Routing and Consolidation #2
• Consolidation: identify alerts due to the same root cause and only fire one
• Routing: identify the root cause and automatically route the alert
• In reality, you still end up with conference bridges with 20 people waiting and checking on the fix
• But for simple mitigations, alert consolidation brings sanity into the DRI’s life
[Diagram: Storage; Billing service; Product Catalog service; Customer Data service; Storage on-call engineer; Billing on-call engineer; Catalog on-call engineer; Customer DB on-call engineer]
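Consolidation and routing can be sketched as a walk over the service dependency graph: among the services that are alerting, page only the owners of those with no alerting dependency underneath them. The dependency edges, service names and on-call table below are assumptions made for the example (the slide shows three services sitting on shared storage), and `root_causes` / `consolidate` are hypothetical helpers, not a real alerting API.

```python
def root_causes(alerting: set, depends_on: dict) -> set:
    """Alerting services that have no alerting service among their dependencies."""

    def has_alerting_dependency(service, seen):
        for dep in depends_on.get(service, set()):
            if dep in seen:
                continue  # already explored; also guards against cycles
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return {s for s in alerting if not has_alerting_dependency(s, {s})}


def consolidate(alerting: set, depends_on: dict, on_call: dict) -> dict:
    """Return one page per root cause, addressed to that service's on-call engineer."""
    return {cause: on_call[cause] for cause in root_causes(alerting, depends_on)}


# Assumed topology: three sibling services all depend on storage.
depends_on = {
    "billing": {"storage"},
    "product-catalog": {"storage"},
    "customer-data": {"storage"},
    "storage": set(),
}
on_call = {
    "billing": "billing on-call engineer",
    "product-catalog": "catalog on-call engineer",
    "customer-data": "customer DB on-call engineer",
    "storage": "storage on-call engineer",
}

# Storage is down, so every sibling alerts too; only storage's DRI is paged.
pages = consolidate({"billing", "product-catalog", "customer-data", "storage"}, depends_on, on_call)
```

If only a sibling alerts while storage is healthy, that sibling is its own root cause and its DRI is paged as before.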

Lesson #4: Configuring Auto-scale #1
• Without auto-scale, considerable time is spent monitoring CPU/memory and adjusting manually
• Important for auto-scale:
  • Define the max number of machines in the pool
    • To keep budget under control
  • Start with a generous initial instance count
    • Use recent peak with buffer as default and re-evaluate periodically

Lesson #4: Configuring Auto-scale #2
• Max CPU: use between 60% and 80%
• By how much to increase?
  • Increase is not instant
  • How quickly do you want to add instances?
  • What if you need 100 machines with 80% capacity and you only have 80 now?
• Consider provisioning time
  • For example: if provisioning time is 30 minutes and you want to be able to grow by 30 instances per hour, then 30*30/60=15, so you need to increase by 15 instances
• Scaling up is more aggressive than scaling down
Formula: increment = growth rate (instances per hour) × provisioning time (minutes) / 60 = 30 × 30 / 60 = 15
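The increment arithmetic from the slide, written out as a small helper. `scale_out_step` and `next_instance_count` are hypothetical names for this sketch, and rounding up to a whole instance is an added assumption; the cap reflects the earlier advice to define a maximum pool size.

```python
import math


def scale_out_step(growth_per_hour: float, provisioning_minutes: float) -> int:
    """Instances to request per scale-out action.

    Instances requested now only arrive after the provisioning time, so each
    action has to cover the growth expected during that window.
    """
    return math.ceil(growth_per_hour * provisioning_minutes / 60)


def next_instance_count(current: int, step: int, max_instances: int) -> int:
    """Apply one scale-out step without exceeding the pool maximum."""
    return min(current + step, max_instances)


step = scale_out_step(growth_per_hour=30, provisioning_minutes=30)
target = next_instance_count(current=80, step=step, max_instances=120)
```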

Lesson #5: Randomizing Scheduled Disruptive Events
• Any repeated pattern executed by multiple actors has dangerous impact
• Why is IIS AppPool recycle set to 1740 minutes?
  • 29 hours, smallest prime over 24
  • Staggered, non-repeating pattern spanning more than one day
• Why is randomness built into exponential retry algorithms (Ethernet, TCP/IP)?
  • Suppose retries are caused by an environmental factor: we don’t want all clients to retry at the same interval
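A minimal sketch of randomized exponential backoff, using the "full jitter" variant in which each delay is drawn uniformly between zero and the exponential ceiling. The function name and the `base` and `cap` defaults are illustrative assumptions, not values from the talk.

```python
import random


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0, rng=random):
    """Yield one randomized delay in seconds per retry attempt."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)  # 1, 2, 4, ... capped at `cap`
        # Drawing from [0, ceiling] spreads clients out, so a shared outage
        # does not make them all retry at the same instant.
        yield rng.uniform(0, ceiling)


# Two clients hit by the same failure end up with different retry schedules.
client_a = list(backoff_delays(6, rng=random.Random(1)))
client_b = list(backoff_delays(6, rng=random.Random(2)))
```

In real use the caller would sleep for each yielded delay before retrying; the seeded generators here only make the example repeatable.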

Lesson #6: Design for Resiliency #1
• Plan graceful degradation
• Consider a circuit breaker
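A minimal circuit breaker sketch, assuming a simple consecutive-failure threshold and a fixed reset timeout. The class and parameter names are hypothetical, and a production breaker would also need thread safety and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast and can fall back to a degraded response instead of piling retries onto a struggling dependency.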

Lesson #6: Design for Resiliency #2
• Bulkheads
  • Configure throttling and thread pools per route/area
• Inject failure
  • Netflix Chaos Monkey
  • Foster a culture that encourages and rewards engineers trying to break the system
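A bulkhead can be sketched as a separate concurrency limit per route, so that one saturated area rejects its own excess calls instead of starving the others. The `Bulkhead` class and the route names are hypothetical; semaphores stand in here for the per-route thread pools the slide mentions.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(RuntimeError):
    """Raised when a route has no free slots left."""


class Bulkhead:
    def __init__(self, limits: dict):
        # One independent semaphore per route, e.g. {"/billing": 10}.
        self._slots = {route: threading.BoundedSemaphore(n) for route, n in limits.items()}

    @contextmanager
    def enter(self, route: str):
        slot = self._slots[route]
        # Do not wait for a slot: reject immediately when the route is saturated.
        if not slot.acquire(blocking=False):
            raise BulkheadFullError(f"no capacity left for {route}")
        try:
            yield
        finally:
            slot.release()


bulkhead = Bulkhead({"/billing": 1, "/catalog": 1})
```

A request handler would wrap its work in `with bulkhead.enter(route):` and translate `BulkheadFullError` into a throttling response.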

Lesson #7: Design your Caching
• Improving caching is often the easiest (and cheapest!) way to scale
• Consider a combination of caching approaches
  • Client-side
  • Proxy (e.g. CDN)
    • Lazy-access caching: useful when you want to have complete protection for the origin
  • Server side
• Don’t over-complicate caching
  • Notoriously hard to debug
  • Beware of cache corruption
  • Have an easy way to discard cache
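A deliberately simple server-side cache in the spirit of "don't over-complicate": values are loaded on first access, expire after a time-to-live, and can be discarded with one call if corruption is suspected. `TtlCache` and its parameters are hypothetical names for this sketch.

```python
import time


class TtlCache:
    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # function that fetches a value from the origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (value, time stored)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # fresh hit, origin is not touched
        value = self._loader(key)  # miss or expired: go to the origin once
        self._entries[key] = (value, now)
        return value

    def discard(self, key=None):
        """Drop one key, or the whole cache when no key is given."""
        if key is None:
            self._entries.clear()
        else:
            self._entries.pop(key, None)
```

Passing the clock in as a parameter keeps expiry testable without sleeping.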

Lesson #8: Have Arrows on Data Flow Diagram Go One Way #1
• Is your architecture a DAG (directed acyclic graph)?
• Having no circular dependencies in the architecture is very helpful for diagnosability and issue isolation
• In our case we started with a data access routing service intended for front end services: it provided validation and smart routing
[Diagram: FE service; Data Access service; Billing service; Product catalog service; Customer data service]

Lesson #8: Have Arrows on Data Flow Diagram Go One Way #2
• However, we then started using it as the access layer for backend services as well
• E.g. when billing service needed to access customer data service, it went through data access service
[Diagram: FE service; BE service; Data Access service; Billing service; Product catalog service; Customer data service]

Lesson #8: Have Arrows on Data Flow Diagram Go One Way #3
• Now when one service is down, all services are down
• Compound impact of retries on the data access service
• Thread pool exhaustion
• Not having bulkheads makes the problem worse
• Synchronous operations make the problem even worse
[Diagram: FE service; BE service; Data Access service; Billing service; Product catalog service; Customer data service]
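Whether the arrows still go one way can be checked mechanically with a depth-first search over the call graph. The edges below are an assumed reading of the three diagrams (the original arrows did not survive in this transcript), and `find_cycle` is a hypothetical helper.

```python
from typing import Optional


def find_cycle(calls: dict) -> Optional[list]:
    """Return one dependency cycle as a list of services, or None if the graph is a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in calls.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge to a service already on the path: that is a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(calls):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Assumed edges: originally only the front end called the data access service.
before = {
    "fe": ["data-access"],
    "data-access": ["billing", "product-catalog", "customer-data"],
}
# Later, billing reached customer data through the data access service as well.
after = dict(before, billing=["data-access"])
```

Run against a service inventory in CI, a check like this flags a new circular dependency before it reaches production.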

Lesson #9: Use 5 Whys Exercise During Live Site Review
• Cause-effect discovery technique
• Developed by Toyota
• Requires “blameless culture”
• Important:
  • Distinguish causes from symptoms
    • There’s still one occurrence of this in my example, although I tried to get rid of most of them with #2
  • Never leave “human error” as root cause. Assess the process, not people
    • You can see we could have left it at #4
Example from Wikipedia
1. Why? - The battery is dead. (First why)
2. Why? - The alternator is not functioning. (Second why)
3. Why? - The alternator belt has broken. (Third why)
4. Why? - The alternator belt was well beyond its useful service life and not replaced. (Fourth why)
5. Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
Real life example :)
1. Why? - More than 5% of users experience failing pages (First why)
2. Why? - All functions calling Purchase service are failing and most of Purchase service endpoints are down as it cannot communicate with Storage service (Second why)
3. Why? - Certificate expired and was not replaced on time (Third why)
4. Why? - Joe is always handling cert updates and he is on vacation (Fourth why)
5. Why? - We don’t have instructions for everyone on how to update certs, nor alerts that fire to remind us when to do it (Fifth why, a root cause)

Microservices at Scale: Recap and Take-aways
• Fowler’s 7 principles: Products, not projects. Smart endpoints and dumb pipes. Decentralized governance. Decentralized data management. Infrastructure automation. Design for failure. Evolutionary design.
• Lesson #1: Design your monitoring
  • Think through your future daily triage and diagnostics in advance
• Lesson #2: Smart Geo-Distribution
  • What are you optimizing for?
• Lesson #3: Alert Routing and Consolidation
  • Your DRI’s comfort ultimately means better service to customers
• Lesson #4: Auto-Scale
  • Scaling is never instant
• Lesson #5: Randomize scheduled disruptive events
  • Remember 29 hours
• Lesson #6: Design for resiliency
• Lesson #7: Caching: cheapest way to scale
  • Don’t over-complicate it!
• Lesson #8: Your architecture should be a DAG
  • No cycles
• Lesson #9: Use 5 Whys exercise during live site reviews
  • Don’t leave it at human error

Other lessons: for further exploration
• Don’t forget the load balancer
  • What endpoint is it using?
  • What happens if all regions are down and then one goes online? (hint: imagine a pack of wild dogs and one gazelle)
• If a *clean* rollback is not possible, roll forward
  • Otherwise you are in a totally new state altogether
• Hotfixes should be targeted and minimal
  • Don’t attempt to fix multiple problems with one hotfix
  • Safe Deploy within reason
• Flaky tests should be disabled and bugs opened
  • Don’t ignore them and waste everyone’s time by continuously asking for reruns
• If your deployment train has multiple legs, start the next one optimistically, while testing the previous one
  • But don’t expect testing to always succeed (don’t overpromise ETA)
• For partner contributions, the Pull Request is your gate
  • Much easier to ask for fixes or changes before merge
• Make operations idempotent
  • Outcome doesn’t change after first application
• CAP theorem: which one will you sacrifice: consistency, availability or partition tolerance?
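"Outcome doesn't change after first application" can be illustrated with a request ID that the receiver remembers, so a retried call is recognized and not applied twice. `BillingLedger` and its fields are hypothetical; a real service would persist the applied IDs rather than hold them in memory.

```python
class BillingLedger:
    def __init__(self):
        self.balance = 0
        self._applied = {}  # request ID -> balance returned for that request

    def charge(self, request_id: str, amount: int) -> int:
        """Apply a charge once; replays of the same request ID change nothing."""
        if request_id in self._applied:
            return self._applied[request_id]
        self.balance += amount
        self._applied[request_id] = self.balance
        return self.balance


ledger = BillingLedger()
first = ledger.charge("req-42", 25)
retry = ledger.charge("req-42", 25)  # e.g. the client timed out and retried
```

This is what makes blind retries safe: the caller does not need to know whether the first attempt reached the service.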

Your examples?

Thank You!

Helper slides

What is a Microservice #1
In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.
Martin Fowler, https://martinfowler.com/articles/microservices.html

[Diagram: Your frontend service; SQL; Your backend service; External service; Web UI; Application Insights; Alerts; APM: Azure portal; Analytics; API; Continuous export; JSSDK; ASP.NET SDK; Node.JS SDK; AI SDK; mongo]

[Diagram: Your frontend service; SQL; Your backend service; Web UI; Application Insights; Alerts; APM: Azure portal; Analytics; API; Continuous export; JSSDK; ASP.NET SDK; Node.JS SDK]

[Diagram: SDKs; Ingestion Service <<Node.JS>>; Loader (x2); Metrics Store; Analytics Store (codename Kusto); Query service <<Node.JS>>; UI service <<ASP.NET>>; API docs frontend <<Node.JS>>; Billing Service <<ASP.NET>>; Profile Service <<ASP.NET>>; Alerting & Notification Service <<ASP.NET>>; Azure DocumentDB; Azure Event Hub; Export service <<.NET>>]

Lesson #1: Plan your Triage and Diagnostics #2
• Application Insights itself is our #0 canonical application
• Dogfooding is critical: we want to use our own product for monitoring
• Should we be sending data to the same pipeline?
[Diagram: SDKs; Ingestion Service <<Node.JS>>; Loader (x2); Metrics Store; Analytics Store (codename Kusto); Query service <<Node.JS>>; UI service <<ASP.NET>>; API docs frontend <<Node.JS>>; Billing Service <<ASP.NET>>; Profile Service <<ASP.NET>>; Alerting & Notification Service <<ASP.NET>>; Azure DocumentDB; Azure Event Hub; Export service <<.NET>>; annotated with seven “AI SDK” labels]

Azure Zurich User Group