Lessons Learnt Building and Maintaining Microservices at Scale on the Azure Platform by Alex Bulankou

Our product, Microsoft Application Insights, is a service that collects telemetry data that customers send to us, processes and aggregates it, and makes the aggregated data available for query. Internally we have a microservice architecture, with each service of the pipeline responsible for doing its share and passing the data to the next service in the pipeline. This talk summarizes lessons we learnt as we improved our live site operations and stability from an average of 5-7 Sev1-2 incidents daily two years ago to the same number of incidents per month. Every incident we had offered a unique opportunity to learn how to improve our system and also how to build our product such that it is easier to detect and diagnose similar issues for our customers.

Alex is a software development manager at Microsoft Application Insights, a service that helps to detect, triage and diagnose issues with web applications and services. Prior to that Alex was a developer on Visual Studio, Bing and Office 365 teams building UX and data visualizations for client and web platforms. As a hobby Alex is bootstrapping an Open Street Map service that helps road travellers find places to stop along the way to their destination: www.stopbystop.com.

Azure Zurich User Group

April 22, 2017

Transcript

  1. I Don’t Miss Those Calls at Night Lessons Learnt Building

    and Maintaining Microservices at Scale Alex Bulankou @bulankou
  2. Introduction • I’m Alex Bulankou. Engineering manager on Microsoft Application

    Insights. My team owns SDKs and APM experience. • AppInsights: is an extensible APM service for developers on multiple platforms • We are “Azure first but not only Azure”: premium integration and enablement for Azure-hosted services, but support on-prem services and other cloud platforms • Before That: Engineer on Office 365, Bing, Visual Studio, .NET Framework • Nights & Weekends: https://www.stopbystop.com: web service and iOS/Android apps for road travelers. • Originally from Minsk, Belarus • Lived in US Pacific Northwest for the past 10 years
  3. What is Microservice #1 In short, the microservice architectural style

    is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies. • Products not projects (“You build it, you run it”) • Including live site support • Smart endpoints and dumb pipes • Microservices are smart, but messaging is reliable, generic and clueless • Decentralized governance • Use best tool for the job and shared libraries • Decentralized data management • Polyglot persistence, transactionless coordination, idempotency of operations • Infrastructure automation • Continuous delivery • Design for failure • Monitoring is critical • Evolutionary design • Independent replacement and upgradability Martin Fowler, https://martinfowler.com/articles/microservices.html
  4. Our Product is a Distributed Application Built on Microservice Architecture

    • Ingestion service accepts data from SDKs and deposits it to Event Hub. It also routes data to Alerting service • Note “It also”: usually an indication that something is fishy :) • Loaders preprocess the data and load it from Event Hub into Metrics and Analytics Stores. • Insights detection, export and query services talk to Analytics and Metrics Stores • Profile service maintains information about customer profiles and uses DocumentDB as store • Billing service encapsulates billing logic and communicates with Profile service • UI service provides presentation layer. It communicates with query service • We are developing more and more services on Node.JS, while continuing to support existing .NET services (“decentralized governance”) • We could easily switch the store for profiles from Azure Storage to Azure DocumentDB (“decentralized data management”) [Diagram labels: SDKs, Ingestion Service <<Node.JS>>, Loader (×2), Metrics Store, Analytics Store (codename Kusto), Query service <<Node.JS>>, UI service <<ASP.NET>>, API docs frontend <<Node.JS>>, Billing Service <<ASP.NET>>, Profile Service <<ASP.NET>>, Alerting & Notification Service <<ASP.NET>>, Azure DocumentDB, Azure Event Hub, Export service <<.NET>>, Insights detection <<.NET>>]
  5. Our Path to Where we are • Our daily live

    site triages and weekly live site reviews today are boring and uneventful • We have enough time to talk about Safe Deploy and engineers using SAW machines • 2 years ago it was different • 5-7 Sev1 incidents *daily* • DRIs generally skipping morning triage (catching up on sleep after the night) • No time for 5 Whys exercises
  6. Lesson #1: Plan your Triage and Diagnostics #1 • Design

    your service monitoring with future triage and diagnostics experience in mind • Plan your alert metrics • Plan your daily live site triage. What charts will you show? • For performance, percentiles are generally a much more indicative metric than averages. You can use “poor man’s percentiles” • For each chart there should be an investigation threshold • Focus on # of users and # of sessions affected • Plan your diagnostics experience • You cannot collect all data (storage cost, network cost) • When sampling, correlation is critical • Either collect an operation with all its correlated data, or don’t. Don’t include an operation with partial information. Callouts: Percentiles, not averages. When to investigate? Focus on user impact.
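Two of the points on this slide can be sketched in a few lines. The first function shows why a percentile is more telling than an average on skewed latency data; the second makes an all-or-nothing sampling decision per operation, so that correlated telemetry is never partially collected. This is a minimal sketch under stated assumptions: the sample durations, the function names and the hashing scheme are illustrative and are not taken from the talk or from the Application Insights SDKs.

```python
import hashlib
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def keep_operation(operation_id, sample_rate_percent):
    """Decide once per operation id, so all items of an operation are kept or dropped together."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_rate_percent


# Hypothetical request durations in milliseconds: 95 fast requests and 5 very slow ones.
durations_ms = [100] * 95 + [5000] * 5

mean_ms = statistics.mean(durations_ms)   # 345: matches no request that actually happened
p50_ms = percentile(durations_ms, 50)     # 100: the typical request
p99_ms = percentile(durations_ms, 99)     # 5000: the slow tail the average hides
```

Because `keep_operation` depends only on the operation id, every service that sees the same id makes the same decision without any coordination.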
  7. Lesson #1: Plan your Triage and Diagnostics #2: Isolate your monitoring pipeline

    • Production environment is monitored by Pre-production • PPE health is just as critical as PROD health • Pre-production environment is monitored by Production • Other environments (DEV) are monitored by Production [Diagram: Application Insights (Production) and Application Insights (Pre-Production); PPE and PROD monitor each other]
  8. Lesson #2: Smart Geo-distribution #1 Backend Frontend Frontend Frontend Frontend

    • We optimize for best user experience • Basic physics demands proximity for best experience • 40ms for light to travel from Chicago to Beijing • When latency matters frontend edge nodes are placed as close to the customer as possible and are made easy to roll out • However be prepared for significant network costs
  9. Lesson #2: Smart Geo-distribution #2 Backend Frontend Frontend Frontend Frontend

    • End-user latency to frontend is not always critical • Our example: For our customers sending data from server SDKs latency doesn’t matter • However it matters for client-side SDKs • Best of both worlds: smart defaults with customizations Some clients are pointed to endpoint of the frontend closest to the backend
  10. Lesson #3: Alert Routing and Consolidation #1 Storage Billing service

    Product Catalog service Customer Data service Storage on-call engineer Billing on-call engineer Catalog on-call engineer Customer DB on-call engineer • This is all too common • FE on-call engineers (DRIs) are woken up for BE problems • Issue #1: all sibling service DRIs are woken up due to the same problem • Issue #2: time is wasted while root cause is determined manually and escalated to correct DRI
  11. Lesson #3: Alert Routing and Consolidation #2 Storage Billing service

    Product Catalog service Customer Data service Storage on-call engineer Billing on-call engineer Catalog on-call engineer Customer DB on-call engineer • Consolidation: identify alerts due to same root cause and only fire one • Routing: identify the root cause and automatically route the alert • In reality, you still end up with conference bridges with 20 people waiting and checking on the fix • But for simple mitigation alert consolidation brings sanity into DRI’s life
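Consolidation and routing can be illustrated with a dependency map: among all failing services, only those whose own dependencies are healthy are treated as root causes, and only their on-call engineers are paged. This is a hedged sketch, not the team's alerting logic. The dependency map assumes that the three services in the diagram all sit on top of Storage, and the function name is hypothetical.

```python
def root_causes(failing, depends_on):
    """Return the failing services whose own dependencies are all healthy."""
    return sorted(
        service
        for service in failing
        if not any(dep in failing for dep in depends_on.get(service, ()))
    )


# Assumed dependency map for the services named on the slide.
depends_on = {
    "billing": ["storage"],
    "product-catalog": ["storage"],
    "customer-data": ["storage"],
    "storage": [],
}

# Storage is down, so everything above it fails too.
failing = {"billing", "product-catalog", "customer-data", "storage"}

# One consolidated alert, routed to the storage on-call engineer only.
alerts = root_causes(failing, depends_on)
```

If only the billing service fails, the same function routes the alert to billing, so sibling DRIs stay asleep in both cases.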
  12. Lesson #4: Configuring Auto-scale #1 • Without auto-scale considerable time

    is spent monitoring CPU/Memory and manual adjustment • Important for auto-scale: • Define max number of machines in the pool • To keep budget under control • Start with generous initial instance count • Use recent peak with buffer as default and re-evaluate periodically
  13. Lesson #4: Configuring Auto-scale #2 • Max CPU: use between

    60% and 80% • By how much to increase? • Increase is not instant • How quickly do you want to add instances? • What if you need 100 machines at 80% capacity and you only have 80 now? • Consider provisioning time • For example: if provisioning time is 30 minutes and you want to be able to grow by 30 instances per hour, then 30*30/60=15, so you need to increase by 15 instances • Scaling up is more aggressive than scaling down • Increase step = growth per hour × provisioning time in minutes / 60
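The arithmetic on this slide can be written down as a small helper. This is only a sketch of the slide's example: the function names are hypothetical, and the second helper additionally assumes that scaling actions run one after another rather than in parallel.

```python
import math


def scale_out_step(target_growth_per_hour, provisioning_minutes):
    """Instances to request per scaling action so the pool can grow at the target hourly rate."""
    return math.ceil(target_growth_per_hour * provisioning_minutes / 60)


def minutes_to_close_gap(needed, current, step, provisioning_minutes):
    """Time to reach `needed` instances, assuming scaling actions run sequentially."""
    shortfall = max(0, needed - current)
    actions = math.ceil(shortfall / step)
    return actions * provisioning_minutes


# The slide's numbers: 30 minutes to provision, target growth of 30 instances per hour.
step = scale_out_step(target_growth_per_hour=30, provisioning_minutes=30)  # 30*30/60 = 15

# Needing 100 machines while having 80 takes two actions of 15, about 60 minutes.
gap_minutes = minutes_to_close_gap(needed=100, current=80, step=step, provisioning_minutes=30)
```

The second number is the reason to start with a generous initial instance count: with these settings a shortfall of 20 machines takes roughly an hour to close.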
  14. Lesson #5: Randomizing Scheduled Disruptive Events • Any repeated pattern

    executed by multiple actors has dangerous impact • Why is IIS AppPool recycle set to 1740 minutes? • 29 hours, smallest prime over 24 • Staggered and non-repeating pattern spanning more than one day • Why is randomness built into exponential retry algorithms (Ethernet, TCP/IP)? • Suppose retries are caused by an environmental factor: we don’t want all clients to retry at the same interval
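A common way to build randomness into retries is exponential backoff with jitter. The sketch below uses the "full jitter" variant, in which each delay is drawn uniformly between zero and an exponentially growing ceiling. The parameter names and default values are assumptions for illustration and are not taken from the talk.

```python
import random


def backoff_delay(attempt, base_seconds=1.0, cap_seconds=60.0, rng=random):
    """Exponential backoff with full jitter: random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Two clients that failed at the same moment use independent random sources,
# so their retry schedules drift apart instead of hitting the service in lockstep.
client_a = random.Random(1)
client_b = random.Random(2)
delays_a = [backoff_delay(n, rng=client_a) for n in range(5)]
delays_b = [backoff_delay(n, rng=client_b) for n in range(5)]
```

The fixed seeds are there only to make the example repeatable; real clients would use the default random source.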
  15. Lesson #6: Design for Resiliency #1 • Plan graceful degradation

    • Consider a circuit breaker
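A circuit breaker stops calling a dependency that keeps failing and fails fast instead, then lets a trial call through after a cool-down. The class below is a minimal single-threaded sketch of that idea. The thresholds, names and injectable clock are illustrative assumptions, and a production version would also need locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast is what enables graceful degradation: the caller can catch `CircuitOpenError` and serve a cached or reduced response instead of tying up threads on a dependency that is known to be down.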
  16. Lesson #6: Design for Resiliency #2 • Bulkheads • Configure

    throttling and thread pools per route/area • Inject failure • Netflix Chaos Monkey • Foster culture that encourages and rewards engineers trying to break the system
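A bulkhead gives each route its own bounded capacity, so that one saturated route cannot consume the resources of the others. The sketch below uses one semaphore per route and rejects work when a route is full. The route names and limits are hypothetical, and a real service would more likely configure this in its web framework or thread pools.

```python
import threading


class Bulkhead:
    """Caps the number of concurrent calls per route; excess calls are rejected, not queued."""

    def __init__(self, limits):
        self._slots = {
            route: threading.BoundedSemaphore(limit) for route, limit in limits.items()
        }

    def try_run(self, route, func):
        """Run func if the route has a free slot. Returns (accepted, result)."""
        slot = self._slots[route]
        if not slot.acquire(blocking=False):
            return False, None  # route is saturated: shed load instead of piling up
        try:
            return True, func()
        finally:
            slot.release()


# Hypothetical per-route limits.
bulkhead = Bulkhead({"billing": 1, "catalog": 1})
```

While the single "billing" slot is busy, further "billing" calls are rejected immediately, but "catalog" calls still run because they draw from a separate pool.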
  17. Lesson #7: Design your Caching • Improving caching is often

    the easiest (and cheapest!) way to scale • Consider combination of caching approaches • Client-side • Proxy (e.g. CDN) • Lazy-access caching: useful when you want to have complete protection for the origin • Server side • Don’t over-complicate caching • Notoriously hard to debug • Beware of cache corruption • Have an easy way to discard cache
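As an illustration of keeping server-side caching simple and easy to discard, here is a small time-to-live cache with a single method that drops everything. It is a sketch under assumptions: the class name, the TTL policy and the loader callback are illustrative and are not described in the talk.

```python
import time


class TtlCache:
    """Read-through cache with a fixed time-to-live and a one-call way to discard everything."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get(self, key, load):
        """Return the cached value, or call load(key) on a miss or an expired entry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        value = load(key)
        self._entries[key] = (now, value)
        return value

    def discard_all(self):
        """Escape hatch for suspected cache corruption: drop every entry."""
        self._entries.clear()
```

The expiry bounds how long a corrupted entry can live, and `discard_all` gives the on-call engineer a mitigation that needs no debugging first.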
  18. Lesson #8: Have Arrows on Data Flow Diagram Go One

    Way #1 • Is your architecture DAG (directed acyclic graph)? • Having no circular dependencies in the architecture is very helpful for diagnosability and issue isolation • In our case we started with a data access routing service intended for front end services: it provided validation and smart routing Data Access service Billing service Product catalog service Customer data service FE service
  19. Lesson #8: Have Arrows on Data Flow Diagram Go One

    Way #2 • However, we then started using it as the access layer for backend services as well • E.g. when the billing service needed to access the customer data service, it went through the data access service Data Access service Billing service Product catalog service Customer data service FE service BE service
  20. Lesson #8: Have Arrows on Data Flow Diagram Go One

    Way #3 • Now when one service is down, all services are down • Compound impact of retries on the data access service • Thread pool exhaustion • Not having bulkheads makes the problem worse • Synchronous operations make the problem even worse Data Access service Billing service Product catalog service Customer data service FE service BE service
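Whether an architecture is a DAG can be checked mechanically from a call map. The sketch below encodes the before and after states described on these three slides and uses the standard library's `graphlib` (Python 3.9+) to detect the cycle. The service identifiers and the exact edges are an assumed reading of the diagrams, not data from the talk.

```python
from graphlib import CycleError, TopologicalSorter


def has_cycle(calls):
    """Return True if the service call graph contains a circular dependency."""
    try:
        TopologicalSorter(calls).prepare()
    except CycleError:
        return True
    return False


# Original design: the front end goes through the data access service only.
before = {
    "fe": ["data-access"],
    "data-access": ["billing", "product-catalog", "customer-data"],
}

# Later: backend services also call the data access service,
# e.g. billing reaches customer data through it.
after = {
    "fe": ["data-access"],
    "be": ["data-access"],
    "data-access": ["billing", "product-catalog", "customer-data"],
    "billing": ["data-access"],
}

cycle_before = has_cycle(before)  # False
cycle_after = has_cycle(after)    # True: data-access -> billing -> data-access
```

A check like this could run in a build step against a declared dependency manifest, so that a new arrow pointing backwards is caught in review rather than during an outage.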
  21. Lesson #9: Use 5 Why’s Exercise During Live Site Review

    • Cause-effect discovery technique • Developed by Toyota • Requires “blameless culture” • Important: • Distinguish causes from symptoms • There’s still one occurrence of this in my example, although I tried to get rid of most of them with #2 • Never leave “human error” as root cause. Assess the process, not people • You can see we could have left it at #4 Example from Wikipedia: 1. Why? - The battery is dead. (First why) 2. Why? - The alternator is not functioning. (Second why) 3. Why? - The alternator belt has broken. (Third why) 4. Why? - The alternator belt was well beyond its useful service life and not replaced. (Fourth why) 5. Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause) Real-life example :) 1. Why? - More than 5% of users experience failing pages (First why) 2. Why? - All functions calling Purchase service are failing and most of Purchase service endpoints are down as it cannot communicate with Storage service (Second why) 3. Why? - Certificate expired and was not replaced on time (Third why) 4. Why? - Joe is always handling cert updates and he is on vacation (Fourth why) 5. Why? - We don’t have instructions for everyone on how to update certs, or alerts that fire to remind us when (Fifth why, a root cause)
  22. Microservices at Scale: Recap and take-aways • Fowler’s 7 principles

    Products, not projects. Smart endpoints and dumb pipes. Decentralized governance. Decentralized data management. Infrastructure automation. Design for failure. Evolutionary design. • Lesson #1: Design your monitoring • Think through your future daily triage and diagnostics in advance • Lesson #2: Smart Geo-Distribution • What are you optimizing for? • Lesson #3: Alert Routing and Consolidation • Your DRI comfort is ultimately better service to customers • Lesson #4: Auto-Scale • Scaling is never instant • Lesson #5: Randomize scheduled disruptive events • Remember 29 hours • Lesson #6: Design for resiliency • Lesson #7: Caching: cheapest way to scale • Don’t over-complicate it! • Lesson #8: Your architecture should be DAG • No cycles • Lesson #9: Use 5 Whys exercise during live site reviews • Don’t stop at human error
  23. Other lessons – for further exploration • Don’t forget load

    balancer • What endpoint is it using? • What happens if all regions are down and then one goes online? (hint: imagine a pack of wild dogs and one gazelle) • If *clean* rollback is not possible, roll forward • Otherwise you are in a totally new state altogether • Hotfixes should be targeted and minimal • Don’t attempt to fix multiple problems with one hotfix • Safe Deploy within reason • Flaky tests should be disabled and bugs opened • Don’t ignore them and waste everyone’s time asking continuously to rerun • If your deployment train has multiple legs, start the next one optimistically, while testing the previous one • But don’t expect testing to always succeed (don’t overpromise ETA) • For partner contributions, Pull Request is your gate • Much easier to ask for fixes or changes before merge • Make operations idempotent • Outcome doesn’t change after first application • CAP theorem: which one will you sacrifice: consistency, availability or partition tolerance?
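The idempotency point ("outcome doesn't change after first application") can be shown with an operation id that is remembered after the first application, so a retried request has no further effect. The ledger class, its method and the ids below are hypothetical and serve only to illustrate the property.

```python
class Ledger:
    """Applies each charge at most once, keyed by a caller-supplied operation id."""

    def __init__(self):
        self.balance = 0
        self._applied = set()

    def apply_charge(self, operation_id, amount):
        """Apply the charge unless this operation id was already seen; return the balance."""
        if operation_id in self._applied:
            return self.balance  # duplicate delivery or retry: no further effect
        self._applied.add(operation_id)
        self.balance += amount
        return self.balance


ledger = Ledger()
first = ledger.apply_charge("op-42", 10)
retry = ledger.apply_charge("op-42", 10)   # same outcome as the first application
other = ledger.apply_charge("op-43", 5)    # a different operation still applies
```

This is what makes aggressive retries safe: a client that never saw the response can resend the same operation id without risking a double charge.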
  24. Your examples?

  25. Thank You!

  26. Helper slides

  27. What is Microservice #1 In short, the microservice architectural style

    is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies. Martin Fowler, https://martinfowler.com/articles/microservices.html
  28. Your frontend service SQL Your backend service External service Web

    UI Application Insights Alerts APM: Azure portal Analytics API Continuous export JSSDK ASP.NET SDK Node.JS SDK AI SDK mongo
  29. Your frontend service SQL Your backend service Web UI Application

    Insights Alerts APM: Azure portal Analytics API Continuous export JSSDK ASP.NET SDK Node.JS SDK
  30. SDKs Ingestion Service <<Node.JS>> Loader Loader Metrics Store Analytics Store

    (codename Kusto) Query service <<Node.JS>> UI service <<ASP.NET>> API docs frontend <<Node.JS>> Billing Service <<ASP.NET>> Profile Service <<ASP.NET>> Alerting & Notification Service <<ASP.NET>> Azure DocumentDB Azure Event Hub Export service <<.NET>>
  31. SDKs Ingestion Service <<Node.JS>> Loader Loader Metrics Store Analytics Store

    (codename Kusto) Query service <<Node.JS>> UI service <<ASP.NET>> API docs frontend <<Node.JS>> Billing Service <<ASP.NET>> Profile Service <<ASP.NET>> Alerting & Notification Service <<ASP.NET>> Azure DocumentDB Azure Event Hub Export service <<.NET>> • Application Insights itself is our #0 canonical application • Dogfooding is critical – we want to use our own product for monitoring • Should we be sending data to the same pipeline? AI SDK AI SDK AI SDK AI SDK AI SDK AI SDK AI SDK Lesson #1: Plan your Triage and Diagnostics #2