Chaos Engineering: Why Breaking Things Should Be Practiced

As presented at the AWS Summit in Dubai - with Qais Ammouri, Head of Technology at Almosafer.

With the wide adoption of microservices and large-scale distributed systems, architectures have grown increasingly complex and hard to understand. Worse, the software systems running on them have become extremely difficult to debug and test, which increases the risk of outages. These new challenges require new tools, and because failures have become increasingly chaotic in nature, we must turn to chaos engineering to reveal failures before they become outages. This talk introduces chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more robust systems.

Adrian Hornsby

April 17, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    S U M M I T
    Chaos Engineering:
    Why breaking things should be practiced
    Adrian Hornsby
    Sr. Technical Evangelist
    Amazon Web Services
    Qais Ammouri
    Head of Technology
    Almosafer
    @adhorn

  2.
    Been there?

  3.
    Distributed Systems are hard
    Amazon Twitter Netflix

  4.
    Failures are a given and
    everything will eventually fail
    over time.
    Werner Vogels
    CTO – Amazon.com

  5.
    Resiliency: the ability of a system to handle and
    eventually recover from unexpected conditions

  6.
    Partial failure mode

  7.
    How do we build resilient software
    systems?

  8.
    People
    Application
    Network & Data
    Infrastructure

  9.
    Building confidence through testing
    Unit testing of components:
    • Tested in isolation to ensure function meets expectations.
    Functional testing of integrations:
    • Each execution path tested to assure expected results.
    Is it enough???

  10.
    GameDay at Amazon
    Creating Resiliency Through Destruction
    https://www.youtube.com/watch?v=zoz0ZjfrQ9s

  11.
    Chaos engineering
    https://github.com/Netflix/SimianArmy

  12.
    Failure injection
    • Start small & build confidence
    • Application level
    • Host failure
    • Resource attacks (CPU, memory, …)
    • Network attacks (dependencies, latency, …)
    • Region attack
    • Human attack
    https://www.gremlin.com
    https://github.com/Netflix/SimianArmy https://chaostoolkit.org
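
As a rough illustration of application-level failure injection, here is a minimal Python sketch of a decorator that adds latency and randomly raises an error. It is not taken from Gremlin, SimianArmy or Chaos Toolkit; the names `inject_failure`, `InjectedFailure`, `latency_s`, `error_rate`, `rng` and `sleep` are hypothetical and chosen only for this example.

```python
import random
import time
from functools import wraps


class InjectedFailure(RuntimeError):
    """Raised by the wrapper to simulate a failing dependency (hypothetical name)."""


def inject_failure(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so each call is delayed and may fail at random.

    latency_s:  seconds of artificial delay added before every call.
    error_rate: probability (0.0 to 1.0) that a call raises InjectedFailure.
    rng, sleep: injectable so the behaviour can be tested without real waiting.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated network latency
            if rng.random() < error_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Example target: a dependency call that would get 300 ms of extra latency.
@inject_failure(latency_s=0.3)
def get_price(item):
    return {"item": item, "price": 42}
```

Starting with a small `error_rate` on a single call path is one way to follow the "start small & build confidence" advice before moving on to host, resource or network attacks.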

  13.
    Break your systems on purpose.
    Find out their weaknesses and fix
    them before they break when
    least expected.

  14.
    Chaos engineering is NOT about breaking things randomly without a purpose.
    Chaos engineering is about breaking things in a controlled environment,
    through well-planned experiments, in order to build confidence in your
    application's ability to withstand turbulent conditions.

  15.
    Chaos Engineering

  16.
    Steady State → Hypothesis → Design & Run Experiment → Verify & Learn → Fix
    Build Resilient Systems

  17.
    Build Resilient Systems

  19.
    2012
    It all started from a handful of people between Riyadh and Egypt.
    Our sales were less than 1 million SAR.
    In 2012, Almosafer started between Egypt and Riyadh with a focus on hotels through social media and a call center.

  20.
    2015
    We grew to 70 employees & our sales reached 74 million SAR.
    Al Tayyar Travel Group (now Seera Group) acquired 60% of Almosafer…

  21.
    2018
    Crossing the billion line.
    We grew to 1000+ employees & our sales exceeded 1.3 billion SAR.
    Becoming the largest OTA in Saudi, fully acquired by Seera Group.
    In 2018, Almosafer became the largest OTA in Saudi Arabia in the flight market.

  22.
    AWS's largest client in KSA and the first on EKS in MENA

  23.
    Before Chaos Engineering

  27.
    Monitoring (Eagle Eye)
    Tech Capabilities
    Culture

  28.
    Start with people
    ● Try to avoid the word “Chaos” when talking to your business.

  29.
    Start with people
    ● Try to avoid the word “Chaos” when talking to your business.
    ● Embrace failure, and fix it.

  30.
    Start with people
    ● Try to avoid the word “Chaos” when talking to your business.
    ● Embrace failure, and fix it.
    ● Replace: “If it fails” with “when it fails”.

  31.
    Start with people
    ● Try to avoid the word “Chaos” when talking to your business.
    ● Embrace failure, and fix it.
    ● Replace: “If it fails” with “when it fails”.
    ● Everything fails, at least once!

  32.
    Start with people
    ● Try to avoid the word “Chaos” when talking to your business.
    ● Embrace failure, and fix it.
    ● Replace: “If it fails” with “when it fails”.
    ● Everything fails, at least once!
    ● Do fire drills, at least once a month.

  33.
    Resiliency in Almosafer
    ● Monitor everything, or die trying.

  34.
    Resiliency in Almosafer
    ● Monitor everything, or die trying.
    ● Architect with failure in mind; it is not an edge case.

  35.
    Resiliency in Almosafer
    ● Monitor everything, or die trying.
    ● Architect with failure in mind; it is not an edge case.
    ● Resiliency starts in the frontend; avoid blocking the UI.

  36.
    Resiliency in Almosafer
    ● Monitor everything, or die trying.
    ● Architect with failure in mind; it is not an edge case.
    ● Resiliency starts in the frontend; avoid blocking the UI.
    ● Automation testing is not a “nice to have”; it is a “must have”.

  37.
    Resiliency in Almosafer
    ● Monitor everything, or die trying.
    ● Architect with failure in mind; it is not an edge case.
    ● Resiliency starts in the frontend; avoid blocking the UI.
    ● Automation testing is not a luxury product.
    ● Use circuit breaking
    - timeouts, retries and fallbacks.
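
To make the last bullet concrete, here is a minimal Python sketch of a circuit breaker with retries and a fallback. It is an illustration only, not Almosafer's implementation; `CircuitBreaker`, `max_failures`, `reset_after_s` and `clock` are hypothetical names. The timeout is assumed to be enforced inside the wrapped call itself (for example, by the client library raising an exception when the call is too slow).

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # how long the circuit stays open
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Half-open: let one trial call through; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, func, fallback, retries=1):
        """Run func with retries; use fallback when open or when all attempts fail."""
        if self.is_open():
            return fallback()
        for _ in range(retries + 1):
            try:
                result = func()  # func is expected to raise on timeout or error
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = self.clock()  # open the circuit
                    break
            else:
                self.failures = 0
                return result
        return fallback()
```

A typical use would be `breaker.call(fetch_live_prices, fallback=serve_cached_prices)`, where both callables are placeholders for your own code. The fallback keeps the UI responsive while the dependency recovers.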

  38.
    Redundancy is fundamental.
    ● Don’t put all your eggs in one basket: be multi-Region and multi-AZ.

  39.
    What is Next?

  41.
    Steady State

  42.
    What is steady state?
    • “normal” behavior of your system
    https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

  43.
    What is steady state?
    • “normal” behavior of your system
    • Business Metric
    https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
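
One simple way to turn "normal behavior" into something checkable is a tolerance band around a baseline of the business metric. The Python sketch below assumes a band of the mean plus or minus `k` standard deviations. That choice, and the names `steady_state_band` and `within_steady_state`, are assumptions made for illustration and are not taken from the talk.

```python
from statistics import mean, stdev


def steady_state_band(baseline, k=3.0):
    """Return (low, high) bounds: baseline mean +/- k standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs 2+ samples
    return mu - k * sigma, mu + k * sigma


def within_steady_state(observed, band):
    """True if every observed sample stays inside the band."""
    low, high = band
    return all(low <= value <= high for value in observed)


# Hypothetical orders-per-minute samples collected before the experiment.
baseline = [100, 102, 98, 101, 99]
band = steady_state_band(baseline)
```

During an experiment, `within_steady_state(samples, band)` returning `False` would be the signal that the hypothesis has been disproved.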

  44.
    Business metrics at work
    Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
    Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
    Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number
    of people who clicked “back” before the page even loaded (Nicole Sullivan).

  45.
    Hypothesis

  46.
    What if…?
    “What if this load balancer breaks?”
    “What if Redis becomes slow?”
    “What if a host on Cassandra goes away?”
    “What if latency increases by 300ms?”
    “What if the database stops?”
    Make it everyone’s problem!

  47.
    Disclaimer!
    Don’t make a hypothesis that you know
    will break you!

  48.
    Design & Run Experiment

  49.
    Designing the experiment
    • Pick hypothesis
    • Scope the experiment
    • Identify metrics
    • Notify the organization

  50.
    Rules of thumb
    • Start very small
    • As close as possible to production
    • Minimize the blast radius.
    • Have an emergency STOP!
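
The emergency stop can be sketched as a loop that watches one metric while the fault is active and always rolls the fault back. This is a minimal Python sketch under stated assumptions: `run_experiment`, `inject`, `rollback`, `read_metric` and `abort_below` are hypothetical names, and a real setup would typically poll a monitoring system between reads.

```python
def run_experiment(inject, rollback, read_metric, abort_below, steps=10):
    """Inject a fault, watch one metric, and stop early if it degrades.

    inject / rollback: callables that start and undo the fault.
    read_metric:       callable returning the current value of the metric.
    abort_below:       emergency-stop threshold for the metric.
    Returns (completed, observations).
    """
    observations = []
    inject()
    try:
        for _ in range(steps):
            value = read_metric()
            observations.append(value)
            if value < abort_below:
                return False, observations  # emergency stop
        return True, observations
    finally:
        rollback()  # always undo the fault, even on abort or error
```

Putting the rollback in `finally` means the fault is removed whether the experiment completes, aborts, or the metric read itself fails.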

  51.
    Running Chaos Experiment
    Start with ..
    Diagram: users split between the Normal Version (99% of users) and a Canary deployment (1% of users)
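
One common way to get a stable 1% / 99% split of users is to hash each user ID into a bucket. The Python sketch below is an assumption about how such routing could be done, not a description of the setup shown on the slide; `in_experiment`, `percent` and `salt` are hypothetical names.

```python
import hashlib


def in_experiment(user_id, percent=1.0, salt="chaos-experiment-1"):
    """Deterministically place about `percent` % of users in the canary group.

    Hashing (salt, user_id) keeps each user in the same group across requests;
    changing the salt reshuffles the groups for the next experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=1.0 selects buckets 0..99
```

Because the assignment is deterministic, the same small group of users sees the experiment for its whole duration, which keeps the blast radius predictable.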

  52.
    Verify & Learn

  53.
    Quantifying the result of the experiment
    • Time to detect?
    • Time for notification? And escalation?
    • Time to public notification?
    • Time for graceful degradation to kick in?
    • Time for self-healing to happen?
    • Time to recovery – partial and full?
    • Time to all-clear and stable?
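
These questions can be answered from a timestamped timeline of the experiment. The Python sketch below assumes you record one timestamp per event; the function name `durations_since_start` and the event names are hypothetical.

```python
from datetime import datetime


def durations_since_start(timeline, start="fault_injected"):
    """Return seconds elapsed from the start event to every other event."""
    t0 = timeline[start]
    return {
        name: (moment - t0).total_seconds()
        for name, moment in timeline.items()
        if name != start
    }


# Hypothetical timeline recorded during an experiment.
timeline = {
    "fault_injected": datetime(2019, 4, 17, 10, 0, 0),
    "detected": datetime(2019, 4, 17, 10, 0, 45),
    "team_notified": datetime(2019, 4, 17, 10, 2, 0),
    "fully_recovered": datetime(2019, 4, 17, 10, 5, 0),
}
metrics = durations_since_start(timeline)
```

Comparing these numbers across repeated experiments shows whether detection and recovery are actually getting faster.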

  54.
    PostMortems – COE (Correction of Errors)
    The 5 WHYs
    Outage → Because of … → Because of … → Because of … → Because of …
    NOT ENOUGH

  55.
    More questions to ask
    • Can you clarify if there were any preceding events?
    • Why would they believe acting in this way was the best course of action to
    deliver the desired outcome?
    • Is there another failure mode that could present here?
    • What decisions or events prior to this made this work before?
    • Why stop there – are there places to dig deeper that could shine a light more
    on this?
    • Did others step in to help, to advise, or to intercede?

  56.
    Rules to remember!
    1. Failure requires multiple faults
    2. There is no isolated ‘cause’ of an accident.
    3. There are multiple contributors to accidents.

  57.
    DON’T blame that one person …

  58.
    Fix

  59.
    Fix

  60.
    Big challenges to chaos engineering
    Mostly Cultural
    • no time or flexibility to simulate disasters.
    • teams already spend all of their time fixing things.
    • can be very political.
    • might force deep conversations.
    • deeply invested in a specific technical roadmap (micro-services) that
    chaos engineering tests show is not as resilient to failures as originally
    predicted.

  61.
    Changing culture takes time!
    Be patient…

  63.
    More Resources
    • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
    • https://www.gremlin.com
    • https://queue.acm.org/detail.cfm?id=2353017
    • https://softwareengineeringdaily.com/
    • https://github.com/dastergon/awesome-sre
    • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
    • https://medium.com/@NetflixTechBlog
    • http://principlesofchaos.org
    • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
    • https://github.com/adhorn/awesome-chaos-engineering
    • https://www.infoq.com/presentations/netflix-chaos-microservices
    • http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
    • http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
    • https://medium.com/@adhorn

  65. Thank you!
    Adrian Hornsby
    @adhorn
    https://medium.com/@adhorn
