Verifying Unknown Conditions with Chaos Engineering

Yury Nino

July 16, 2021

Transcript

  1. Verifying Unknown Conditions: Chaos Engineering in AWS. © 2021 ADL - AWS www.sitereliabilityengineering.co
  2. Site Reliability Engineers and Chaos Engineering Advocates: @yurynino (Colombia) and @jhonnatangil (Colombia). www.ingenieriadelcaos.com, www.sitereliabilityengineering.co, www.yurynino.com
  3. Agenda (topics to be covered): challenges with distributed systems; what is Chaos Engineering?; why is testing in production hard?; AWS Fault Injection Simulator; FIS features; FIS use cases; FIS demo.
  4. Humans are central to both the problem and the solution of incidents in engineering!
  5. Distributed Systems. They provide a particular challenge to program! The infrastructure required by a software system can be as complex as the software itself. Every production failure is unique. No two incidents will share the precise chain of failure! Netflix, Twitter.
  6. Distributed Systems Patterns. Patterns provide a structured way of looking at a problem space, along with solutions that have been seen multiple times and proven. The concept of patterns was introduced by Christopher Alexander. Looking at distributed systems as a series of patterns is a useful way to gain insight into their implementation.
  7. Distributed Systems Patterns: Consistent Core, Follower Reads, Generation Clock, Gossip Dissemination, HeartBeat, Hybrid Clock, Idempotent Receiver, State Watch, Quorum.
  8. What could go wrong? Our systems face all kinds of adversity: hard disks fail, networks go down, customer traffic can overload them, and cyberattacks happen. In this chaotic world, how can they stay alive?
  9. Process Crashes. A process can be taken down for routine maintenance by system administrators. It can be killed while doing file I/O because the disk is full and the exception is not handled properly. In cloud environments it can be even trickier, as unrelated events can bring the servers down.
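
The disk-full crash described on the process crashes slide can be illustrated with a small sketch. It assumes a hypothetical `write` callable that performs the file I/O; `append_record` and `full_disk_writer` are names invented here for illustration, not something from the deck.

```python
import errno


def append_record(write, record):
    """Try to persist a record; report a full disk instead of crashing.

    `write` is a hypothetical callable that performs the actual file I/O.
    Returns True on success and False when the disk is full.
    """
    try:
        write(record)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Disk full: degrade gracefully rather than letting the process die.
            return False
        raise  # Any other I/O error is re-raised unchanged.


def full_disk_writer(record):
    """Stand-in that simulates a full disk for the example."""
    raise OSError(errno.ENOSPC, "No space left on device")


saved = []
ok_result = append_record(saved.append, "order-1")
full_result = append_record(full_disk_writer, "order-2")
```

Here the unhandled exception becomes an explicit return value that the caller can act on, for example by shedding load or alerting.
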
  10. Network Delays. There are two problems to tackle here. First, a server cannot wait indefinitely to learn whether another server has crashed. Second, there should not be two sets of servers, each considering the other set to have failed and therefore continuing to serve different sets of clients. This is called split brain.
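
A minimal sketch of how the two network delay problems are commonly approached, using the HeartBeat and Quorum patterns named earlier in the deck: a timeout bounds how long a server waits for a peer, and a strict-majority check keeps two partitions from both serving clients. The class and function names are assumptions made for this example, and time is passed in explicitly so the behaviour is deterministic.

```python
class FailureDetector:
    """Suspects a peer once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # peer name -> time of the last heartbeat

    def heartbeat(self, peer, now):
        """Record that `peer` was heard from at time `now`."""
        self.last_seen[peer] = now

    def suspected(self, peer, now):
        """True if `peer` was never heard from or has been silent too long."""
        last = self.last_seen.get(peer)
        return last is None or now - last > self.timeout


def has_quorum(reachable_peers, cluster_size):
    """True when this node plus the peers it can reach form a strict majority."""
    return reachable_peers + 1 > cluster_size // 2


detector = FailureDetector(timeout=3.0)
detector.heartbeat("server-b", now=10.0)
```

In a five-node cluster split 3/2, only the three-node side satisfies `has_quorum`, so only that side keeps serving clients. In a four-node cluster split 2/2, neither side does.
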
  11. What is Chaos Engineering? It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience. https://principlesofchaos.org/
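
The definition above can be sketched as a small experiment loop: check a steady-state hypothesis, inject a failure, observe, and always roll back. This is an illustrative toy, not a real tool; `run_experiment` and `ToyService` are hypothetical names introduced here.

```python
def run_experiment(steady_state, inject, rollback):
    """Run one chaos experiment and report whether the hypothesis held.

    steady_state: callable returning True while the system looks healthy.
    inject / rollback: callables that start and undo the fault.
    """
    if not steady_state():
        return "aborted"  # Never inject into an already unhealthy system.
    inject()
    try:
        held = steady_state()  # Observe the system while the fault is active.
    finally:
        rollback()  # Always undo the fault, even if observation fails.
    return "resilient" if held else "weakness found"


class ToyService:
    """Toy system under test: healthy while at least one replica is running."""

    def __init__(self, replicas):
        self.replicas = replicas

    def healthy(self):
        return self.replicas >= 1

    def kill_one(self):
        self.replicas -= 1

    def restore_one(self):
        self.replicas += 1


redundant = ToyService(replicas=2)
single = ToyService(replicas=1)
redundant_result = run_experiment(redundant.healthy, redundant.kill_one, redundant.restore_one)
single_result = run_experiment(single.healthy, single.kill_one, single.restore_one)
```

The service with two replicas survives losing one, while the single-replica service exposes a weakness. That is the kind of finding an experiment is meant to surface before a real incident does.
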
  12. Chaos History. 2008: Chaos Engineering began at Netflix. 2010: Chaos Monkey and Simian Army were launched. 2016: Gremlin born. 2017: SRE Usenix, Chaos IQ born, ChaosConf. 2018: 1 book, Chaos Monkey for Spring Boot. 2019: 1 book, Chaos massification. 2020: 1 book was published.
  13. Chaos Tools: Chaos Monkey, Chaos Toolkit, Gremlin, Chaos Monkey for Spring Boot, Chaos Mesh, AWS Fault Injection Simulator.
  14. Foundations: AWS Fault Injection Simulator
  15. "You can measure your success with Chaos Engineering by counting the number of vulnerabilities." Nora Jones
  16. AWS Fault Injection Simulator. AWS FIS is a service for running fault injection experiments on AWS to improve an application’s performance, observability, and resiliency. Fault injection experiments are used in chaos engineering to stress an application in testing or production environments by creating disruptive events, observing how the system responds, and implementing improvements. FIS provides controls and guardrails for running experiments, such as automatically rolling back or stopping the experiment if specific conditions are met.
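
As a rough illustration of the guardrails mentioned above, the sketch below builds a dictionary shaped like an FIS experiment template that stops one tagged EC2 instance and names a CloudWatch alarm as a stop condition. It does not call AWS. The field names reflect my understanding of the template format and should be checked against the current FIS documentation. The ARNs, the tag, and the function name `build_experiment_template` are placeholders invented for this example.

```python
import json


def build_experiment_template(role_arn, alarm_arn, tag_key, tag_value):
    """Return a dict shaped like an AWS FIS experiment template (no AWS calls)."""
    return {
        "description": "Stop one tagged EC2 instance and restart it after 5 minutes",
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # Limit the blast radius to one instance.
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # Guardrail: the experiment is stopped if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "roleArn": role_arn,
    }


# Placeholder account ID, role, alarm, and tag: all hypothetical values.
template = build_experiment_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-error-rate",
    tag_key="env",
    tag_value="test",
)
template_json = json.dumps(template, indent=2)
```

Keeping the template as plain data makes it easy to review in a pull request before anyone runs the experiment.
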
  17. How to begin? https://www.ingenieriadelcaos.com https://chaosengineering.slack.com https://github.com/dastergon/awesome-chaos-engineering https://www.infoq.com/chaos-engineering