Chaos Engineering - AWS Pop-up Loft Stockholm Oct 31 2018

Chaos Engineering - Breaking stuff on purpose from AWS Pop-up Loft Stockholm Oct 31 2018

https://www.opsiocloud.com | https://www.opsio.se

Gunnar Grosch

October 31, 2018

Transcript

  1. Chaos Engineering – Breaking stuff on purpose
    AWS Pop-up Loft Stockholm
    Gunnar Grosch, @gunnargrosch
  2. ABOUT ME
    • Cloud Evangelist at Opsio (www.opsio.se).
    • Have been creating chaos with computers since the 80s.
    • Background in development and operations.
    • My own three chaos monkeys at home.
    • Skateboarder at heart.
    • Serverless aficionado.
  3. ABOUT OPSIO
    • Opsio specializes in the operation and support of public, private and hybrid cloud platforms.
    • Expertise, quality, cost-effectiveness and automation pervade our delivery of critical business IT operations.
    • Delivering everything from magical customer support to fully managed cloud services.
    • Implementing chaos engineering for customers.
  4. Don’t ask what happens if a system fails, but ask what happens when it fails.
  5. WHAT IS CHAOS ENGINEERING?
    Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. (principlesofchaos.org)
  6. WHAT IS CHAOS ENGINEERING?
    • Sooner or later all complex systems will fail. Not if, but when.
    • Building resilient systems requires experience with failure.
    • By simulating potential errors in advance we can verify that our systems behave as we expect, or fix them if they don’t.
    • Chaos Engineering is an emerging discipline, but the underlying concepts are not.
  7. WHAT IS CHAOS ENGINEERING?
    • Thoughtful, planned experiments designed to reveal the weaknesses in our systems.
    • Inject something harmful in order to build an immunity.
    • We are “breaking things on purpose” to learn new information about our systems through experimentation.
    • By triggering incidents intentionally in a controlled way we gain confidence that our systems can deal with failure.
    • Chaos Engineering requires a base level of resilience!
  8. WHAT IS CHAOS ENGINEERING?
    • Many chaos experiments originate in the eight fallacies of distributed computing.

    Fallacy                       Effect
    The network is reliable       Error handling/retries needed
    Latency is zero               Minimize number of requests
    Bandwidth is infinite         Send small payloads
    The network is secure         Secure data/Authenticate requests
    Topology doesn't change       Changes affect latency, bandwidth and endpoints
    There is one administrator    Changes affect ability to reach destination
    Transport cost is zero        Costs must be budgeted
    The network is homogeneous    Affects reliability, latency and bandwidth
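The first fallacy ("the network is reliable") maps to retries and error handling. A minimal sketch of what that could look like, assuming the failing call raises Python's built-in `ConnectionError`; the function name `call_with_retries` and its parameters are illustrative and not from the talk:

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay * 1, 2, 4, ... plus jitter so that many clients
            # do not all retry at the same moment.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

A chaos experiment that drops or corrupts packets is one way to check that code paths like this are actually exercised and behave as intended.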
  9. ADVANCE FROM TESTING TO EXPERIMENTING
    • Unit testing
    • Integration testing
    • System testing
    • Acceptance testing
    [Diagram labels: Input, Component X, Output; Input, Component X, Component Y, Output; Service X, Service Y]
  10. ADVANCE FROM TESTING TO EXPERIMENTING
    • Add latency
    • Service failures
    • Exhaust resources
    • DNS outage
    [Diagram labels: Service X, Service Y, marked with ⏱ and ❌]
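Added latency and service failures between Service X and Service Y can be simulated in application code as well as at the network level. The sketch below is one hypothetical way to do it: `with_chaos` and `InjectedFailure` are names made up for this example, and the wrapped callable stands in for the client call to Service Y.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_chaos(call, latency_s=0.0, failure_rate=0.0,
               rng=random.random, sleep=time.sleep):
    """Wrap a call to a dependency so it can be delayed or made to fail."""
    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate a slow dependency
        if rng() < failure_rate:
            raise InjectedFailure("injected failure")  # simulate an outage
        return call(*args, **kwargs)
    return wrapped
```

For example, `with_chaos(fetch_profile, latency_s=0.3)` would add 300 ms to every call of a hypothetical `fetch_profile` function, and `failure_rate=0.1` would fail roughly one call in ten.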
  11. WHERE TO USE CHAOS ENGINEERING?
    • Tier 0 – Critical Services.
    • Services with critical functionality.
    • Services with critical data.
  12. WHERE TO USE CHAOS ENGINEERING?
    • “We’ll test it in prod!”
    • Classical testing wants to identify bugs as far away from production as possible.
    • Reverse the strategy: Run your chaos experiments as close to production as possible.
  13. THE HISTORY OF CHAOS ENGINEERING
    • 2010: Netflix created Chaos Monkey in response to the move from physical infrastructure to AWS.
    • 2011: The Simian Army was born and added additional failure injection monkeys: Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, Chaos Gorilla and Latency Monkey.
    • 2012: Netflix shared the source code for Chaos Monkey on GitHub.
  14. THE HISTORY OF CHAOS ENGINEERING
    • 2014: Failure Injection Testing (FIT) is announced. Gives more granular control over the “blast radius”. Failure-as-a-Service.
    • 2017: Chaos Engineering book released on O’Reilly Media by members of the Netflix team.
  15. WHAT IS NEEDED TO PERFORM A CHAOS EXPERIMENT?
    • Monitoring and observability
    • Incident management
    • Impact of downtime
    • A base level of resilience (worth repeating)
  16. HOW TO PERFORM A CHAOS EXPERIMENT?
    1. Define steady state
    2. Plan the experiment
    3. Form your hypothesis
    4. Contain the blast radius
    5. Notify the organization
    6. Run your chaos experiment
    7. Measure the results
    8. Scale up or abort and fix
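The technical core of these steps (check steady state, inject, watch, abort or scale up) can be sketched as a small runner. This is an illustration under assumptions, not the speaker's tooling: `run_experiment` and its callables are hypothetical, and a real runner would wait between checks and notify people before starting.

```python
def run_experiment(steady_state_ok, inject, rollback, observe_checks=3):
    """Run one small chaos experiment and report the outcome.

    steady_state_ok: callable returning True while the system looks normal.
    inject / rollback: callables that start and undo the injected failure.
    """
    # Step 1: never start an experiment on a system that is already unhealthy.
    if not steady_state_ok():
        return "skipped: system not in steady state"
    inject()  # step 6: run the experiment
    try:
        # Step 7: measure. Any deviation disproves the hypothesis.
        for _ in range(observe_checks):
            if not steady_state_ok():
                return "aborted: hypothesis disproved, fix before scaling up"
        # Step 8: the hypothesis held for this blast radius.
        return "passed: hypothesis held, consider scaling up"
    finally:
        rollback()  # the "stop" button: always undo the injected failure
```

Putting the rollback in `finally` means the injected failure is undone whether the experiment passes, aborts, or crashes.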
  17. DEFINE STEADY STATE
    • The normal behavior of a system over time
    • Vital business metrics
    • Steady state is not necessarily continuous
    • Business metrics are more useful than system metrics for chaos experiments
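Because steady state is not necessarily continuous, a check against a business metric usually compares the observed value with what is normal for that time, not with a fixed number. A minimal sketch, assuming a hypothetical per-hour baseline of orders per minute and a 20% tolerance (both values are made up for illustration):

```python
# Hypothetical baseline: typical orders per minute, keyed by hour of day.
BASELINE_ORDERS_PER_MINUTE = {3: 5, 12: 120, 20: 200}


def within_steady_state(observed, baseline, tolerance=0.2):
    """True if observed deviates from baseline by at most the given fraction."""
    if baseline == 0:
        return observed == 0
    return abs(observed - baseline) / baseline <= tolerance


def orders_look_normal(observed, hour):
    """Compare observed orders per minute with the baseline for that hour."""
    return within_steady_state(observed, BASELINE_ORDERS_PER_MINUTE[hour])
```

With this baseline, 110 orders per minute at noon counts as steady state, while the same 110 at 20:00 does not.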
  18. DEFINE STEADY STATE
    • Infrastructure monitoring metrics
      • Resource: CPU, Disk, Memory and IO
      • Network: Latency, DNS, Packet loss
      • State: Processes, Shutdown, Time
    • Alerting metrics
      • Total alerts
      • Time to resolution
      • Self-resolving alerts
      • Most frequent alerts
  19. DEFINE STEADY STATE
    • SEV metrics
      • Total incidents by SEV level
      • Total SEVs
      • MTTD, MTTR and MTBF for SEVs
    • Application metrics
      • Events
      • Error rates
      • Performance counters
    • Business metrics
      • Orders per time unit
      • Messages per time unit
      • Number of items placed in cart
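MTTD, MTTR and MTBF can be computed from incident timestamps. Definitions vary between organizations; the sketch below assumes MTTD is start to detection, MTTR is start to resolution, and MTBF is the time from one incident's resolution to the next incident's start. The function name and record layout are hypothetical.

```python
from datetime import datetime
from statistics import mean


def sev_metrics(incidents):
    """Compute mean times in seconds.

    incidents: list of (started, detected, resolved) datetimes, sorted by start.
    """
    mttd = mean((d - s).total_seconds() for s, d, r in incidents)
    mttr = mean((r - s).total_seconds() for s, d, r in incidents)
    # Healthy time between the end of one incident and the start of the next.
    gaps = [(nxt[0] - cur[2]).total_seconds()
            for cur, nxt in zip(incidents, incidents[1:])]
    mtbf = mean(gaps) if gaps else None  # undefined with fewer than two incidents
    return {"mttd_s": mttd, "mttr_s": mttr, "mtbf_s": mtbf}
```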
  20. PLAN THE CHAOS EXPERIMENT
    • Only include teams and services that want to be included.
    • What could go wrong?
    • Whiteboard the system, services and dependencies.
    • Make sure to have a “stop” button.
  21. FORM YOUR HYPOTHESIS
    • Chaos can be injected at any layer of your stack to increase system resilience.
    Layers shown: API, application, caches, database, operating system, hardware, network, DNS, availability zone, region, people.
  22. FORM YOUR HYPOTHESIS
    • What if latency increases by 300 ms?
    • What if the database becomes read-only?
    • What if availability zone B is down?
    • What if Slack is down?
    • What if an instance terminates?
    • What if microservice X isn’t responding?
    Don’t perform chaos experiments in production if you know that it will cause damage. Always fix known problems first!
  23. CONTAIN THE BLAST RADIUS
    • Always design the smallest possible experiment.
    • By starting small, a failure will most likely not cause an outage or customer pain.
    • Remember “run experiments in production”? The closer to production, the more you will learn.
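One simple way to keep the blast radius small is to target only a small random fraction of instances, and never fewer than one. The helper below is a hypothetical sketch; the name `pick_targets` and the 5% default are assumptions for illustration.

```python
import math
import random


def pick_targets(instances, fraction=0.05, rng=None):
    """Choose a small random subset of instances, always at least one."""
    instances = list(instances)
    if not instances:
        return []
    rng = rng or random.Random()
    # Round down so the blast radius never exceeds the requested fraction,
    # except for the minimum of a single instance.
    count = max(1, math.floor(len(instances) * fraction))
    return rng.sample(instances, count)
```

Scaling up later is then a matter of raising `fraction` once the smaller experiment has passed.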
  24. NOTIFY THE ORGANIZATION
    • There can never be too much information initially.
    • Inform your organization about what you’re doing, why you’re doing it, and when you’re doing it.
    • As confidence grows, skip “when you’re doing it”.
  25. RUN YOUR CHAOS EXPERIMENT
    • Dare to press the button.
    • Watch the metrics!
    • Abort if needed.
  26. MEASURE THE RESULTS
    • Use the metrics to determine whether your hypothesis is correct.
    • Is the system resilient to what you injected?
    • Did anything unexpected happen?
    • Did you have the proper metrics? You should have.
    • Share your progress and success!
  27. SCALE UP OR ABORT AND FIX
    • With the confidence from smaller-scale experiments, the scope can scale up.
    • Increased scope can reveal effects that aren’t noticeable at a smaller scale.
  28. TOOLS FOR CHAOS ENGINEERING – CHAOS TOOLKIT & CHAOS HUB
    Chaos Toolkit is a project whose mission is to provide a free, open and community-driven toolkit and API to all the various forms of chaos engineering tools that the community needs. The Chaos Toolkit aims to be the simplest and easiest way to explore building, and automating, your own Chaos Engineering Experiments.
    Chaos Hub stands on the shoulders of the Chaos Toolkit to provide a complete, user-friendly, platform to automate and collaborate on your Chaos Engineering and Resiliency efforts.
    https://chaostoolkit.org/
    https://chaoshub.org/
    https://github.com/chaostoolkit
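Chaos Toolkit experiments are declared as JSON (or YAML) files with a steady-state hypothesis, a method and rollbacks. The sketch below builds such a file from Python. The field names follow the Chaos Toolkit experiment format as commonly documented and should be checked against the current documentation; the URL and the script path are placeholders, not real resources.

```python
import json

# Hypothetical experiment: the URL and script path below are placeholders.
experiment = {
    "version": "1.0.0",
    "title": "Site stays available when one web instance is lost",
    "description": "Illustrative example only.",
    "steady-state-hypothesis": {
        "title": "Front page responds with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "front-page-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "https://example.com/",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-one-instance",
            "provider": {"type": "process", "path": "./terminate-instance.sh"},
        }
    ],
    "rollbacks": [],
}

experiment_json = json.dumps(experiment, indent=2)
```

Saved as `experiment.json`, a file like this would typically be run with `chaos run experiment.json`.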
  29. TOOLS FOR CHAOS ENGINEERING – CHAOSPLATFORM
    Built on top of the free and open source Chaos Toolkit and Chaos Hub, the Chaos Platform provides an HA, secure, collaborative, observable and customizable production-ready platform for your own chaos engineering experiments.
    https://chaosplatform.com/
    https://chaosiq.io/
  30. TOOLS FOR CHAOS ENGINEERING – GREMLIN
    Gremlin provides you the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Downtime is expensive and damages customer trust. Gremlin's Failure as a Service finds weaknesses in your system before they cause problems.
    https://www.gremlin.com/
  31. TOOLS FOR CHAOS ENGINEERING – CHAOS MONKEY
    Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.
    https://github.com/Netflix/chaosmonkey
  32. TOOLS FOR CHAOS ENGINEERING – CHAOSLINGR
    ChaoSlingr is a Security Chaos Engineering Tool focused primarily on the experimentation on AWS Infrastructure to bring system security weaknesses to the forefront. Security is chaotic and the industry has traditionally put emphasis on the importance of preventative security control measures and defense-in-depth, whereas our mission is to drive new knowledge and perspective into the attack surface by delivering proactively through detective experimentation.
    https://github.com/Optum/ChaoSlingr
  33. TOOLS FOR CHAOS ENGINEERING
    Remember that proper implementation of chaos engineering is more important than what tools you use.
  34. LET’S BREAK SOMETHING – NETWORK CORRUPTION
    A reliable network is vital for most (all) applications. Networking issues cause errors, delays and other problems that affect the service.
    • Target: AWS Lightsail three tier web cluster (LB, web fronts, DB)
    • Attack: Network corruption
    • Scope: Single instance
    • Expected results:
  35. LET’S BREAK SOMETHING – INSTANCE TERMINATION
    Instance based architectures rely more or less on the single instance. How well does the architecture handle the loss or termination of an instance?
    • Target: AWS Wordpress reference architecture
    • Attack: Instance termination
    • Scope: Single instance
    • Expected results:
  36. LET’S BREAK SOMETHING – CACHE UNAVAILABILITY
    Using cache services like Memcached and Redis can greatly improve the performance of your application. What happens if the cache is unavailable?
    • Target: AWS Wordpress reference architecture
    • Attack: Cache cluster reboot
    • Scope: ElastiCache cluster
    • Expected results:
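An application that survives a cache outage typically treats the cache as optional and falls back to the database. The sketch below shows one hypothetical way to write that; `get_page` and its callables are made up for illustration, and the built-in `ConnectionError` stands in for whatever exception a real Memcached or Redis client raises.

```python
def get_page(key, cache_get, db_get, cache_set=None):
    """Read through the cache, falling back to the database if the cache is down."""
    try:
        value = cache_get(key)
        if value is not None:
            return value  # cache hit
    except ConnectionError:
        # Cache unreachable: serve from the database and skip the write-back.
        cache_set = None
    value = db_get(key)
    if cache_set is not None:
        try:
            cache_set(key, value)  # repopulate the cache after a miss
        except ConnectionError:
            pass  # a failed cache write must not fail the request
    return value
```

Rebooting the cache cluster then tests two things: that requests still succeed, and that the database can carry the extra load while the cache is cold.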
  37. ADVANCED USAGE OF CHAOS
    • Make chaos experiments a part of your CI/CD.
    • Use chaos in blue/green and canary releases.
    • Automate!
  38. TAKEAWAYS
    • Don’t ask what happens if a system fails, but ask what happens when it fails.
    • Prior to starting your chaos experiments it is vital to collect metrics.
    • Start with the smallest possible experiment that can cause impact.
    • Share your progress and success!
  39. Thanks to Tammy Bütow, Ana Medina, Kolton Andrus, Nora Jones, Adrian Hornsby and others for the inspiration.