
Why Systems Fail - Anurag Gupta at CTO Connection

Shoreline
May 11, 2021


The move to managed services and declarative configuration systems has helped simplify a variety of tasks that used to require operator intervention. Yet, at most companies, operators remain overburdened and struggle to meet customer SLAs. There are many observability, incident management, and runbook automation tools out there to help with the issues that remain. But they all require a human in the loop. Toil remains high as does remediation time and the potential for human error.

In this session, I’ll describe real outages I saw at AWS, group them into characteristic causes, and describe how we reduced tickets, improved availability, and reduced costs while growing our fleet 1000x. You’ll walk away with concrete ideas that you can put into place to improve availability and reduce burnout.


Transcript

  1. Agenda

     We'll cover a few major categories of system failures and mitigation approaches for each, based on my background at AWS running analytic and database services over 8 years.

     At AWS, operations leaders meet weekly to discuss:
     1. The prior week's issues
     2. Causes of errors
     3. How to mitigate in future, for that service and all others

     This talk is based on my experience there as we grew our database fleet a thousandfold.

     Caveat:
     • This is a necessarily superficial discussion of a large, complex topic
     • Reach out on LinkedIn (awgupta) or email ([email protected]) if you'd like to talk more
  2. Why systems fail – 1. We perturb them

     Deployments are the most common source of outage minutes for most companies:
     1. The blast radius is large
     2. Changes are complex
     3. It is hard to get the failure rate below 0.5% of deployments
     4. Detecting, debugging, and addressing failures takes time

     Frustrating, because these are largely unforced errors. Through significant effort, we were able to reduce failures from 1 in 200 deployments to 1 in 10,000.
  3. For each deployment, write a doc that covers:
     1. What is being changed
     2. Downstream services that may see impact
     3. The deployment schedule by Region and Availability Zone
     4. What metrics will be monitored during the deployment
     5. How rollback has been automated

     This doc is reviewed by someone outside the service team who is skilled at operations. Move from "good intentions" to mechanisms that can be iteratively improved.

     How this helped:
     1. First deploy to canaries to validate performance, resource usage, and functionality against a known workload
     2. Limit the blast radius of deployment by sequencing the rollout
     3. Automated rollback ensured decisions were made up front, not during on-call
     4. Aimed for 5-5-5 (5 minutes to deploy, 5 to evaluate success, 5 to roll back)
     5. Built a template "collective memory" of issues seen, to be avoided in future

     Virtuous cycle: as deployments become reliable, they are done more often, becoming yet smaller and yet more reliable. Deployment issues are a process problem.
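The blast-radius sequencing described above can be sketched in a few lines. This is a hypothetical illustration, not an AWS tool: one canary host goes first, then exponentially growing waves, so an early failure touches only a small fraction of the fleet. The function name and growth factor are assumptions.

```python
def plan_waves(hosts, first=1, growth=4):
    """Split hosts into rollout waves of 1, 4, 16, ... hosts.

    A failure in the canary wave (waves[0]) stops the rollout
    before most of the fleet has been touched.
    """
    waves, i, size = [], 0, first
    while i < len(hosts):
        waves.append(hosts[i:i + size])
        i += size
        size *= growth
    return waves

waves = plan_waves([f"host-{n}" for n in range(50)])
# waves[0] is the single canary; each later wave is 4x larger
```

Between waves, the 5-5-5 pattern would apply: deploy the wave, evaluate its metrics, and trigger the pre-automated rollback if they regress.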
  4. But I can't roll back automatically?

     There's a fierce debate between the auto-rollback and only-roll-forward camps. But if you can roll back, why wouldn't you? Many companies make this work, e.g. by splitting a change into multiple deployments:
     1. Make a database schema change
     2. Start writing to the new table from the app tier
     3. Start reading from the new table
     4. Clean up old artifacts no longer being called

     This is the general pattern of making an interface change in the provider, then the consumer, then removing the stale interface from the provider. Distributed systems require this: you can't atomically update everything, so you need to support old and new interfaces simultaneously.
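The four-step migration above is often called expand/contract. A minimal sketch of the phases, with hypothetical names (the `Phase` enum and table labels are illustrative, not from the talk): each phase is a separate deployment, and each is individually safe to roll back because reads and writes stay compatible with the adjacent phases.

```python
from enum import Enum

class Phase(Enum):
    OLD_ONLY = 1    # before the migration starts
    DUAL_WRITE = 2  # schema added; write both tables, still read old
    READ_NEW = 3    # write both, read new; old data backfilled
    NEW_ONLY = 4    # old table and code paths cleaned up

def write_targets(phase):
    """Which tables the app tier writes in a given phase."""
    if phase is Phase.OLD_ONLY:
        return ["old"]
    if phase in (Phase.DUAL_WRITE, Phase.READ_NEW):
        return ["old", "new"]
    return ["new"]

def read_source(phase):
    """Which table the app tier reads in a given phase."""
    return "new" if phase in (Phase.READ_NEW, Phase.NEW_ONLY) else "old"
```

Rolling back from READ_NEW to DUAL_WRITE, for instance, loses nothing: both phases keep writing to both tables, so the old table remains complete.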
  5. Why systems fail – 2. Operator error

     The largest AWS outages were either operator errors or cascading failures with bad remediations. Examples:
     1. Taking 10,000 load balancers out of service rather than 100
     2. An incorrect WHERE clause in a manual DELETE operation in a control plane database
     3. A replication storm causing disks to fall over, which in turn required further replication
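The first example above, taking 10,000 load balancers out of service instead of 100, is the kind of error a tooling guard can catch. A hypothetical sketch (the function name and 1% default are assumptions, not an AWS mechanism): refuse any bulk change that would touch more than a small fraction of the fleet unless the limit is explicitly raised.

```python
def guarded_change(targets, fleet_size, act, max_fraction=0.01):
    """Apply `act` to each target, but refuse outright if the batch
    would exceed max_fraction of the fleet (minimum 1 resource)."""
    limit = max(1, int(fleet_size * max_fraction))
    if len(targets) > limit:
        raise RuntimeError(
            f"refusing: {len(targets)} targets exceeds "
            f"blast-radius limit of {limit}")
    return [act(t) for t in targets]
```

An operator who mistypes a filter and matches 10,000 resources gets an error and a forced pause, rather than a Large-Scale Event.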
  6. Limiting the blast radius of human errors

     Humans intrinsically have a 1% error rate, particularly when doing repetitive, mundane tasks.
     • Manual changes need to be "pair programmed", with multiple eyes on each command before it is issued
     • Manual changes should be rare
     • Tool-based changes should limit the blast radius they impact
     • Per-resource changes should be rate-limited and escalate to an operator. For example, RDS Multi-AZ will stop failing over after X instances in Y period, raising a ticket instead
     • Ops orchestration tools should do this by default
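The rate-limit-then-escalate behavior attributed to RDS Multi-AZ above can be sketched as a sliding-window limiter. This is an illustrative reconstruction, not RDS's actual implementation; the class and method names are assumptions.

```python
from collections import deque

class FailoverLimiter:
    """Allow at most max_events automated actions per window_s
    seconds; beyond that, the caller should raise a ticket for a
    human instead of acting automatically."""

    def __init__(self, max_events, window_s):
        self.max_events = max_events
        self.window_s = window_s
        self.events = deque()  # timestamps of recent actions

    def allow(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        if len(self.events) >= self.max_events:
            return False  # escalate: too many failovers too fast
        self.events.append(now)
        return True
```

A burst of failovers in a short window is exactly the signature of a cascading failure, so stopping and asking a human is safer than continuing.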
  7. Why systems fail – 3. Black box components

     25% of AWS Large-Scale Events involved databases. Databases:
     1. Have a large blast radius
     2. Take a long time to recover
     3. Change query plans unexpectedly based on inaccurate statistics
     4. Are easily under-administered (e.g. PostgreSQL vacuum and transaction ID wrap-around)

     The same is true of edge routers, cloud services, …
  8. Avoid functionality you don't control

     At AWS, we tried to avoid the use of relational databases in our control planes. Many services moved to DynamoDB or other home-grown tech: less functionality and less expressibility, but they controlled their own fate in their own code.

     Try to build "escalators", not "elevators" – systems that reliably perform at some rate and degrade to a lower rate, not systems that may perform better in the normal case but fail absolutely or degrade under load.
     1. pDNS used to fail a lot. By caching IP addresses, most control plane APIs continued to work
     2. You can use a "warm pool" to buffer EC2 control plane failures
     3. You can keep enough local disk to buffer a 1-2 hour S3 outage
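The DNS-caching example above, serving a stale cached address when the resolver is down, is a classic "escalator": degraded but still moving. A hypothetical sketch (class name, TTL, and the injected `resolve` callable are assumptions for illustration):

```python
import time

class StaleOkCache:
    """Cache resolved addresses; if the resolver fails, serve the
    stale cached value rather than failing outright."""

    def __init__(self, resolve, ttl_s=60):
        self.resolve = resolve      # callable: name -> address
        self.ttl_s = ttl_s
        self.cache = {}             # name -> (address, stored_at)

    def lookup(self, name, now=None):
        now = time.monotonic() if now is None else now
        hit = self.cache.get(name)
        if hit and now - hit[1] < self.ttl_s:
            return hit[0]           # fresh cache hit
        try:
            addr = self.resolve(name)
        except Exception:
            if hit:
                return hit[0]       # degrade: stale beats nothing
            raise                   # no cached value to fall back on
        self.cache[name] = (addr, now)
        return addr
```

The same shape applies to the warm-pool and local-disk examples: pre-position a bounded buffer of the dependency's output so a control plane outage degrades throughput instead of zeroing it.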
  9. Why systems fail – 4. Everything eventually fails

     Commonplace failures: you can build runbooks to reduce human errors, but MTTR is an hour or more with a human in the loop. Google reported 200 downtime minutes on average across 150 GCP Large-Scale Events in 2020. These must be automated; leaving them manual means useless toil and extended unavailability. But each remediation is a custom, multi-month project. You need an automation platform that makes it easy to:
     1. Build a remediation in an hour with only basic shell scripting skills
     2. Handle the distributed systems complexity of changing infrastructure, cascading issues, and governance

     First-time failures: these are difficult. Observability tools have lag, and dashboards and logs often lack data for a new event. You need a platform that understands production ops is a real-time distributed systems problem:
     1. Support real-time views into resources and metrics
     2. Support per-second metrics
     3. Integrate resources, metrics, and Linux commands to view and modify the environment
     4. Control blast radius, partial failures, retries, etc.
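The automated-remediation loop described above pairs a per-resource check with a repair action, and, per the earlier blast-radius theme, caps how much one pass may change. A minimal hypothetical sketch (names and the cap are assumptions, not any particular platform's API):

```python
def remediate(resources, check, repair, max_repairs=3):
    """Repair resources that fail `check`, but never more than
    max_repairs per pass; anything beyond the cap is left for the
    next pass or a human, limiting the blast radius of a bad
    repair action."""
    repaired = []
    for r in resources:
        if len(repaired) >= max_repairs:
            break
        if not check(r):
            repair(r)
            repaired.append(r)
    return repaired
```

If `repair` itself is buggy, the cap turns a fleet-wide cascading remediation into a small, observable incident.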