Chaos Engineering: Why breaking things should be practiced

As presented in the AWS DevDay India series 2018

With the rise of microservices and large-scale distributed architectures, software systems have grown increasingly complex and hard to understand. The velocity of software delivery has also increased dramatically, making failures harder to predict and contain. While the cloud allows for high availability, redundancy, and fault tolerance, no single component can guarantee 100% uptime. We therefore have to understand availability and, above all, learn how to design architectures with failure in mind. And since failures have become more and more chaotic in nature, we must turn to chaos engineering to identify failures before they become outages. In this talk, I will dive deep into availability, reliability, and large-scale architectures, and introduce chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more resilient systems.

Adrian Hornsby

October 12, 2018
Transcript

  1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chaos Engineering: Why Breaking Things Should Be Practiced. Adrian Hornsby, Cloud Architecture Evangelist @ AWS @adhorn
  2. “Failures are a given and everything will eventually fail over time.” Werner Vogels, CTO, Amazon.com
  3. Partial failure mode
  4. Jesse Robbins GameDay: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  5. Netflix 2013 https://medium.com/netflix-techblog
  6. What “really” is Chaos Engineering?
  7. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org
  8. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.
  9. Failure Injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks • “Paul” attack
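
As an illustration of application-level failure injection, here is a minimal sketch in Python. The decorator `inject_failure`, its parameters, and the `get_price` function are hypothetical names invented for this example and do not come from the talk. It only shows the idea of starting small: add latency or errors to one call before moving on to hosts, networks, or regions.

```python
import random
import time
from functools import wraps


def inject_failure(latency_s=0.0, error_rate=0.0, rng=random.random, sleep=time.sleep):
    """Hypothetical decorator: delay a call and fail it with a given probability."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a slow dependency
            if rng() < error_rate:  # rng() is in [0, 1), so 0.0 never fails
                raise RuntimeError("injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Example target: a made-up lookup with 50 ms of injected latency and no errors.
@inject_failure(latency_s=0.05, error_rate=0.0)
def get_price(item):
    return {"item": item, "price": 42}
```

Passing `rng` and `sleep` as parameters keeps the experiment controllable, so the injected behavior can be switched off or replaced in tests.
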
  10. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” Nora Jones, Senior Chaos Engineer, Netflix
  11. Phases of Chaos Engineering
  12. Steady State • Hypothesis • Design & Run Experiment • Verify & Learn • Fix
  13. What is Steady State? • “normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero
  14. What is Steady State? • “normal” behavior of your system • Business Metric https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
  15. Business Metrics at work Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).
  16. What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” “What if latency increases by 300ms?” “What if the database stops?” Make it everyone’s problem!
  17. Design & Run Experiment
  18. Designing Experiment • Pick hypothesis • Scope the experiment • Identify metrics • Notify the organization
  19. Rules of thumb • Start very small • As close as possible to production • Minimize the blast radius • Have an emergency STOP!
  20. Running Chaos Experiment • Dynamic Routing (Route53) • Normal Version: 99% Users • Canary deployment: 1% Users (Start with ..)
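
The 99% / 1% split can be sketched in a few lines of Python. This is not how Route53 implements routing. It is a stand-in that assigns each user to a stable bucket by hashing a user ID, and the function name `route` and the `canary_percent` parameter are assumptions made for this example.

```python
import hashlib


def route(user_id: str, canary_percent: float = 1.0) -> str:
    """Send roughly `canary_percent` of users to the canary, the rest to normal.

    Hashing keeps the decision stable: the same user always lands on the
    same side for a given percentage.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "canary" if bucket < canary_percent * 100 else "normal"
```

Keeping the assignment stable per user limits the blast radius to a fixed, small group, and setting `canary_percent` to 0 acts as the emergency stop.
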
  21. Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick in? • Time for self-healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?
  22. DON’T blame that one person …
  23. PostMortems – COE (Correction of Errors) The 5 WHYs Outage Because of … Because of … Because of … Because of …
  24. Big Challenges to Chaos Engineering Mostly Cultural • no time or flexibility to simulate disasters. • teams already spending all of their time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.
  25. Before breaking things …
  26. People Application Network & Data Infrastructure
  27. Patterns for Resilient Architectures Infrastructure
  28. Availability
    Availability | Downtime per year
    99% (2-nines) | 3 days 15 hours
    99.99% (4-nines) | 52 minutes
    99.999% (5-nines) | 5 minutes
    99.9999% (6-nines) | 31 seconds
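
The downtime figures follow directly from the availability percentage. A small sketch, assuming a 365-day year and truncating to whole seconds (the function name `downtime_per_year` is chosen for this example):

```python
from datetime import timedelta

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the yearly downtime implied by an availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=int(SECONDS_PER_YEAR * unavailable_fraction))


if __name__ == "__main__":
    for nines in (99, 99.99, 99.999, 99.9999):
        print(f"{nines}% -> {downtime_per_year(nines)}")
```
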
  29. System Availability • Availability = Normal Operation Time / Total Time = MTBF / (MTBF + MTTR) • MTTR: Mean Time To Repair • MTBF: Mean Time Between Failure
  30. Availability in Parallel
    Component | Availability | Downtime
    X | 99% (2-nines) | 3 days 15 hours
    Two X in parallel | 99.99% (4-nines) | 52 minutes
    Three X in parallel | 99.9999% (6-nines) | 31 seconds
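
The parallel figures can be reproduced with one line of arithmetic: a group of redundant components is down only when all of them are down at once. The sketch below assumes the copies fail independently, which real systems only approximate, and the function name is made up for this example.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    return 1 - (1 - component_availability) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} x 99% in parallel -> {parallel_availability(0.99, n):.6%}")
```
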
  31. Availability Zone 1 Availability Zone 2 Availability Zone n Multi-AZ Support Instance Failure Application
  32. Auto-Scaling • Compute efficiency • Node failure • Traffic spikes • Performance bugs
  33. Immutable Infrastructure • No updates on live systems • Always start from a new resource being provisioned • Deploy the new software • Test in different environments (dev, staging) • Deploy to prod (inactive) • Change references (DNS or Load Balancer) • Keep old version around (inactive) • Fast rollback if things go wrong
  34. Patterns for Resilient Architectures Network & Data
  35. Decoupling with async pattern A Queue B A Queue B Listener Pub-Sub
  36. Web Instances Worker Instance Worker Instance Queue API Instance API Instance API Instance API: {DO foo} PUT JOB: {JobID: 0001, Task: DO foo} API: {JobID: 0001} GET JOB: {JobID: 0001, Task: DO foo} Cache Result: { JobID: 0001, Result: bar }
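
One reading of this diagram is that the API accepts a task, puts a job on a queue, and immediately returns a job ID, while a worker picks the job up and writes the result to a cache. The sketch below models that flow in a single process with Python's standard `queue` module and a dict standing in for the cache. The names `submit`, `worker`, `get_result`, and `run_task` are hypothetical, and a real deployment would use a managed queue and cache rather than in-memory objects.

```python
import queue
import threading

jobs = queue.Queue()  # stands in for the queue in the diagram
results = {}          # stands in for the cache


def run_task(task):
    """Hypothetical task runner: 'DO foo' produces 'bar', as on the slide."""
    return "bar" if task == "DO foo" else None


def submit(job_id, task):
    """API side: enqueue the job and return right away with its ID."""
    jobs.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def worker():
    """Worker side: take jobs off the queue and cache their results."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel used to stop the worker
            jobs.task_done()
            break
        results[job["JobID"]] = {"JobID": job["JobID"], "Result": run_task(job["Task"])}
        jobs.task_done()


def get_result(job_id):
    """API side: look up a finished job, or None if it is still pending."""
    return results.get(job_id)


if __name__ == "__main__":
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    print(submit("0001", "DO foo"))
    jobs.join()  # wait until the worker has processed the job
    print(get_result("0001"))
    jobs.put(None)
```

Because the caller never waits on the worker, a slow or failed worker delays results but does not block the API.
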
  37. Worker Instance Worker Instance Queue API Instance API Instance API Instance Cache Amazon SNS Push Notification User
  38. Read / Write Sharding RDS DB Instance Read Replica App Instance App Instance App Instance RDS DB Instance Master (Multi-AZ) RDS DB Instance Read Replica RDS DB Instance Read Replica
  39. Database Federation Users DB Products DB App Instance App Instance App Instance
  40. Database Sharding
    User | ShardID
    002345 | A
    002346 | B
    002347 | C
    002348 | B
    002349 | A
    Shards: C, B, A. App Instance, App Instance, App Instance
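
The user-to-shard table amounts to a lookup that every app instance consults before touching the database. A minimal sketch using the values from the slide follows. The host names in `SHARD_HOSTS` are invented placeholders, and the function names are assumptions for this example.

```python
# User-to-shard directory, copied from the slide.
SHARD_BY_USER = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical connection targets, one per shard.
SHARD_HOSTS = {
    "A": "shard-a.db.example.internal",
    "B": "shard-b.db.example.internal",
    "C": "shard-c.db.example.internal",
}


def shard_for(user_id: str) -> str:
    """Return the shard ID that holds this user's data."""
    return SHARD_BY_USER[user_id]


def host_for(user_id: str) -> str:
    """Return the database host an app instance should connect to."""
    return SHARD_HOSTS[shard_for(user_id)]
```

With this layout, losing one shard affects only the users mapped to it rather than the whole user base.
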
  41. Transient state does not belong in the database.
  42. Patterns for Resilient Architectures Application
  43. Stateless Services AZ1 AZ2 AWS Region Data Store Cache Auto-Scaling Group User
  44. Avoiding Cascading Failures
  45. Retries & Exponential Backoff
  46. No jitter With jitter https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
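
A minimal sketch of retries with exponential backoff and jitter, using the "full jitter" variant in which each wait is a random value between zero and the capped exponential delay. The function names and the default values for `base`, `cap`, and `max_attempts` are assumptions for this example, not values from the talk.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.uniform):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng(0, min(cap, base * 2 ** attempt))


def retry(operation, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

Without jitter, clients that failed together retry together and hit the recovering service in synchronized waves. Randomizing the delay spreads those retries out.
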
  47. Idempotent operation No additional effect if it is called more than once with the same input parameters.
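
One common way to make an operation idempotent is to have the caller send a unique request ID and to remember the outcome of each ID, so that a retried call returns the stored result instead of repeating the side effect. The `PaymentService` class below is a hypothetical illustration of that idea, not an API from the talk.

```python
class PaymentService:
    """Toy service whose `charge` call is safe to retry."""

    def __init__(self):
        self.balance = 0
        self._processed = {}  # request_id -> receipt already issued

    def charge(self, request_id, amount):
        # A repeated request returns the original receipt and changes nothing.
        if request_id in self._processed:
            return self._processed[request_id]
        self.balance += amount
        receipt = {"request_id": request_id, "amount": amount, "balance": self.balance}
        self._processed[request_id] = receipt
        return receipt
```

This matters for retries: a client that times out cannot tell whether the first attempt succeeded, so it must be able to send the same request again without charging twice.
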
  48. Service Degradation & Fallbacks
  49. Circuit Breaker • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. Producer Circuit Breaker Consumer Connection Monitoring Timeouts Breaking Circuit
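
The two bullets above can be sketched as a small Python class. This is a simplified illustration under stated assumptions: it counts consecutive failures, trips at `failure_threshold`, rejects calls while open, and allows a trial call once `reset_timeout` seconds have passed. The class name, parameter names, and default values are choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # time the breaker tripped, or None if closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open gives the struggling dependency room to recover and lets the caller fall back to a degraded response instead of queuing up timeouts.
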
  50. Non-blocking UI https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158
  51. Patterns for Resilient Architectures People
  52. Changing Culture takes time! Be patient…
  53. More Resources • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • https://www.gremlin.com • https://queue.acm.org/detail.cfm?id=2353017 • https://softwareengineeringdaily.com/ • https://github.com/dastergon/awesome-sre • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf • https://medium.com/@NetflixTechBlog • http://principlesofchaos.org • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp • https://github.com/adhorn/awesome-chaos-engineering • https://www.infoq.com/presentations/netflix-chaos-microservices • http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf • http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
  54. Thank you @adhorn https://medium.com/@adhorn