Chaos Engineering: getting out of the starting blocks

Architectures are growing increasingly distributed and hard to understand. As a result, software systems have become extremely difficult to debug and test, which increases the risk of failure. With these new challenges, chaos engineering has become attractive to many organizations as a mechanism for understanding the behavior of systems under unexpected circumstances.

Whilst interest is growing, few have managed to build sustainable chaos engineering practices. In this talk, I will review the state of chaos engineering and the issues customers are facing, based on what I have learned as an AWS Solutions Architect and Technologist focusing on chaos engineering, and explain why I started to build tools to help with failure injection.

Adrian Hornsby

January 23, 2020

Transcript

  1. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    Chaos Engineering:
    Getting out of the starting blocks
    Adrian Hornsby
    Principal Technical Evangelist
    Amazon Web Services

  2. What currently prevents the wide
    adoption of chaos engineering in your
    organization?

  4. Why is production chaos?

  6. #0 - DON’T CALL IT CHAOS ENGINEERING.

  8. #1 - DON’T FOCUS ON CHAOS ENGINEERING,
    LOOK AT THE BIGGER PICTURE.

  9. Good intentions never work [...]

  10. Because people already had
    good intentions

  12. If good intentions don’t work,
    what does?

  13. The Andon Cord

  14. 1902

  15. Toyota will not allow any defect that they know
    about to go down the manufacturing line.

  16. Source: http://www.autoexpress.co.uk/toyota/prius/34615/japanese-earthquake-hits-car-production

  17. Andon Customer Service

  18. • Erroneously listed recharge cable as included
    • Andon cord pulled and page corrected
    • Contacts per unit go from 33% to 3.7%

  20. "Good intentions never work, you
    need good mechanisms to make
    anything happen."
    Jeff Bezos

  21. People have good intentions to start
    with!

  22. Good Mechanisms ≈ Complete Processes
    Tools
    Adoption
    Audit

  23. #2 - CHANGE BEGINS WITH UNDERSTANDING.

  24. What are the top 5 “painful” reasons
    for your fires?

  25. 1. It is always DNS
    2. Configuration drift
    3. SSL Certificate expiration
    4. Deployment failure
    5. Failed link to 3rd party provider

  26. Anatomy of a COE
    • What happened?
    • What was the impact on customers and your business?
    • What were the contributing factors?
    • What data do you have to support this?
    • What lessons did you learn?
    • What corrective actions are you taking?

  27. Audit
    Weekly Operational Metrics Review
    • Continuous inspection mechanism
    • Maintains focus on operations
    • Foundation of a healthy operations program
    Typical Agenda - typically divided into fifteen-minute slots
    • Share successes and failings
    • Action items follow up
    • Review COEs
    • Review key service metrics
    • Identify new best practices

  28. Policy Engine
    • Automated risk and opportunity analyzer
    • Identifies potential risks to availability, infrastructure, security and more
    • Highlights opportunities to optimize resource utilization
    • Extensible and configurable
    • Provides a view into policy compliance
    • Allows acknowledgment
    • Reports roll up the organization hierarchy
    Mechanism to propagate local learnings globally

  29. #3 - CHOOSE YOUR TROJAN HORSE.

  30. Find the right team to start with:
    Not the best (improvements are harder)
    Not the worst (they have bigger problems)

  31. Choose the metrics to measure improvement:
    MTTR is __always__ a good default.
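
As a rough illustration of tracking MTTR over time, the sketch below averages recovery durations. The function name `mttr_minutes` and the input format (one recovery time in minutes per line) are assumptions made for this example and do not come from the talk.

```shell
# mttr_minutes: mean time to recovery, in minutes.
# Input (assumed format): one recovery duration in minutes per line on stdin.
# Blank lines are skipped; prints "n/a" when there are no incidents.
mttr_minutes() {
  awk 'NF { sum += $1; n++ }
       END { if (n == 0) print "n/a"; else printf "%.1f\n", sum / n }'
}

# Example: three incidents that took 30, 45 and 15 minutes to recover from.
printf '30\n45\n15\n' | mttr_minutes   # prints 30.0
```

Comparing this number for the same team before and after a round of experiments is one simple way to see whether the practice is paying off.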

  32. #4 - OVER-INDEX ON THE HYPOTHESIS.

  33. STEADY STATE
    HYPOTHESIS
    RUN EXPERIMENT
    VERIFY
    IMPROVE
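
The cycle on this slide could be sketched as a small wrapper script. This is only one possible reading: the function name `run_experiment` and its three command-string arguments (steady-state check, failure injection, rollback) are hypothetical hooks invented for this example, and the "improve" step is left to the humans reading the result.

```shell
# run_experiment STEADY_STATE_CHECK INJECT ROLLBACK
# Each argument is a command string supplied by the caller (hypothetical hooks).
# Returns 0 if the hypothesis held, 1 if it was rejected, 2 if never started.
run_experiment() {
  steady_check=$1; inject=$2; rollback=$3

  # 1. Steady state: refuse to start if the system is already unhealthy.
  if ! sh -c "$steady_check"; then
    echo "aborted: steady state not met before injection"
    return 2
  fi

  # 2. Run the experiment: inject the failure.
  sh -c "$inject"

  # 3. Verify: the hypothesis is that the steady state still holds.
  if sh -c "$steady_check"; then
    result="hypothesis held"; status=0
  else
    result="hypothesis rejected"; status=1
  fi

  # Always roll back the injected failure before reporting.
  sh -c "$rollback"
  echo "$result"
  return "$status"
}

# Possible usage, reusing the container ID and health URL from later slides:
# run_experiment 'curl -fsS http://127.0.0.1/api/health >/dev/null' \
#                'docker stop 94a214bbeebd' 'docker start 94a214bbeebd'
```

Refusing to inject when the steady state is already violated keeps an experiment from making an ongoing incident worse.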

  36. #5 - INTRODUCE CHAOS ENGINEERING
    EARLY IN THE JOURNEY.

  37. Start simple and local!!
    $ docker stop 94a214bbeebd

  38. DDoS yourself
    $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

  39. Burn CPU with Stress(–ng)
    $ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
    https://kernel.ubuntu.com/~cking/stress-ng/

  40. Adding latency to the network
    $ tc qdisc add dev eth0 root netem delay 300ms
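
The `tc` rule on this slide stays in place until it is deleted, so a local experiment usually wants a guaranteed cleanup step. The sketch below is one way to do that; the helper names `run` and `with_latency` and the `DRY_RUN` switch are assumptions for illustration, and the real `tc` calls need root privileges.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (rehearsal mode).
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# with_latency DEV DELAY COMMAND...
# Adds a netem delay on DEV, runs COMMAND, then always removes the delay.
with_latency() {
  dev=$1; delay=$2; shift 2
  run tc qdisc add dev "$dev" root netem delay "$delay" || return 1
  "$@"; status=$?
  # Roll back the injected latency whether or not COMMAND succeeded.
  run tc qdisc del dev "$dev" root netem
  return "$status"
}

# Possible usage (as root): probe the health endpoint while latency is injected.
# with_latency eth0 300ms curl -fsS http://127.0.0.1/api/health
```

Running with `DRY_RUN=1` first prints the exact `tc` commands without touching the interface, which is a cheap way to keep the blast radius small.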

  41. Blocks DNS resolution
    $ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP
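
The rule on this slide matches TCP traffic on the INPUT chain. Resolvers usually send queries over UDP port 53, and locally generated packets leave through the OUTPUT chain, so a client-side experiment may need a different rule. The sketch below assumes that client-side case and is a variation on the slide, not the author's command; the function name `dns_block_rules` is hypothetical. It only prints the rules, so the same function can produce both the injection and the rollback.

```shell
# dns_block_rules ACTION
# ACTION is -A to append the DROP rules or -D to delete them again.
# Prints one iptables command per protocol; nothing is executed here.
dns_block_rules() {
  case "$1" in
    -A|-D) ;;
    *) echo "usage: dns_block_rules -A|-D" >&2; return 1 ;;
  esac
  for proto in udp tcp; do
    echo "iptables $1 OUTPUT -p $proto --dport 53 -j DROP"
  done
}

# Possible usage:
# dns_block_rules -A | sudo sh   # inject: block outgoing DNS queries
# dns_block_rules -D | sudo sh   # roll back: remove the same rules
```

Generating the add and delete commands from one place makes it harder to forget the rollback or to delete a slightly different rule than the one that was added.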

  42. #6 - BLAST-RADIUS REDUCTION MINDSET.

  43. #7 - IF YOU HAVEN’T VERIFIED IT, IT’S PROBABLY BROKEN.

  44. Verification:
    1. Disaster Recovery & backups
    2. Auto scaling
    3. Multi-AZ
    4. Fault tolerance & self healing
    5. People

  46. Getting out of the starting blocks.

  47. Tools Processes
    Culture
    Technology

  48. Thank you!
    Adrian Hornsby
    https://medium.com/@adhorn
    adhorn