An introduction to Chaos Engineering

We will discuss the main concepts of Chaos Engineering, its principles, and GameDays.

Transcript

  1. “The difference between average people and achieving people is their perception of and response to failure.” -John C. Maxwell
  2. LECTURER BIO Sebastian Alejandro Velasco, Sr. Advanced Software Development Engineer.
    M.Sc. Computer Science, B.Sc. Software and Computer Engineering, Universidad Nacional de Colombia.
    Passionate about building software applications, researching, and playing video games.
  3. TABLE OF CONTENTS ➤ Resilience and Reliability ➤ Software Development Lifecycle ➤ Chaos Engineering ➤ Principles of Chaos Engineering ➤ Gamedays ➤ Chaos Tools
  4. RESILIENCE AND RELIABILITY Difference between Resilience and Reliability:
    To keep the water supply working every time the tap is turned on.
    To keep the system working at any time.
    To recover the power capacity when the principal generator crashes.
    To call a fallback service when an internal error occurs.
    To use the auxiliary water tank when the main water supply is broken.
    To keep the lights on every time the switch is flipped.
  5. RESILIENCE AND RELIABILITY Reliability is “the probability of failure-free software operation for a specified period of time in a specified environment”.
    Resilience is “the ability of a cloud-based service to withstand certain types of failures and yet remain functional from the customer perspective”.
    Every system needs to be RESILIENT in order to be RELIABLE. But one concept does not, per se, imply the other.
  6. UNDERSTANDING COMPLEXITY AND SIMPLICITY If complexity is causing bad outcomes, and we cannot remove the complexity, then what is supposed to be done?
    Embrace complexity rather than avoid it; trying to optimize for simplicity leads to frustration.
    Learn to navigate complexity. Find tools to move quickly with confidence.
  7. CHAOS ENGINEERING “Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.”
    How much confidence can we have in the complex systems that we put into production?
    “Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make distributed systems inherently chaotic.”
    Chaos engineering is a form of experimentation rather than a form of testing.
    It is about making the chaos inherent in the system visible.
  8. CHAOS ENGINEERING Chaos engineering is about, but not limited to:
    Simulating failures of a datacenter.
    Injecting latency between services.
    Randomly causing exceptions.
    Emulating I/O errors.
    Injecting failures into source code.
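Two of the techniques on this slide, injecting latency and randomly causing exceptions, can be sketched as a small Python decorator. This is only an illustration, not how any particular chaos tool works: the names `chaos`, `InjectedFault`, `latency_s`, `error_rate`, and the `get_price` function are all hypothetical.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised by the wrapper to simulate a failing dependency (hypothetical name)."""


def chaos(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so each call is delayed and fails with a given probability.

    latency_s, error_rate, rng and sleep are assumptions made for this sketch.
    """
    rng = rng if rng is not None else random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # inject latency before the real call
            if rng.random() < error_rate:
                # randomly cause an exception instead of calling the function
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical service call: 50 ms of extra latency, roughly 20% injected failures.
@chaos(latency_s=0.05, error_rate=0.2)
def get_price(item_id):
    return {"item": item_id, "price": 10}
```

Passing `sleep` and `rng` as parameters keeps the wrapper easy to exercise without real delays or real randomness.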
  9. STEADY STATE Measurable output of a system that indicates normal behavior:
    Throughput
    Latency percentiles
    Error rates
    …
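As a rough illustration of the steady-state metrics named on this slide, the sketch below derives throughput, latency percentiles, and error rate from a window of request samples. The function names, the nearest-rank percentile method, and the choice of p50 and p99 are assumptions of this example, not something the deck prescribes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(latencies_ms, errors, window_s):
    """Summarize one observation window (hypothetical helper).

    latencies_ms: one latency per request in the window
    errors:       number of failed requests in the window
    window_s:     length of the window in seconds
    """
    total = len(latencies_ms)
    return {
        "throughput_rps": total / window_s,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "error_rate": errors / total,
    }
```

A summary like this, captured before an experiment, gives the baseline against which the experiment's window can be compared.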
  10. BUILD A HYPOTHESIS Then we build a hypothesis around the steady state:
    Circuit breaker builds resilience.
    Eureka enables failover of services.
    Redis is elastic.
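To make the first hypothesis concrete, here is a minimal circuit breaker sketch. It is a simplified illustration of the pattern, not the implementation of any specific library; the class name, `max_failures`, `reset_after_s`, and the injectable `clock` are hypothetical choices for this example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, stop calling the
    dependency for a while and answer with a fallback instead."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the failing dependency
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0  # a success closes the circuit again
        return result
```

A chaos experiment for this hypothesis would make the wrapped dependency fail and then check that the steady-state metrics stay within their normal range while the fallback is served.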
  11. VARY REAL-WORLD EVENTS Chaos variables reflect real-world events:
    servers dying, malformed responses, high traffic, low CPU resources, DDoS attacks, unresponsive core services, database bottlenecks, services data traffic.
    Prioritize events either by potential impact or estimated frequency.
    Any event capable of disrupting steady state is a potential candidate.
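The prioritization rule on this slide (rank events by potential impact or by estimated frequency) can be sketched as a simple sort. The `ChaosEvent` type, the 1 to 5 impact scale, and every number below are made up for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosEvent:
    name: str
    impact: int     # hypothetical scale: 1 (minor) to 5 (severe)
    frequency: int  # hypothetical estimate: occurrences per year


def prioritize(events, by="impact"):
    """Return events ordered from highest to lowest on the chosen criterion."""
    if by not in ("impact", "frequency"):
        raise ValueError("by must be 'impact' or 'frequency'")
    return sorted(events, key=lambda event: getattr(event, by), reverse=True)


# Illustrative values only; real estimates come from incident history.
candidates = [
    ChaosEvent("servers dying", impact=4, frequency=12),
    ChaosEvent("malformed responses", impact=2, frequency=50),
    ChaosEvent("DDoS attack", impact=5, frequency=1),
]
```

Here `prioritize(candidates)` puts the DDoS attack first, while `prioritize(candidates, by="frequency")` puts malformed responses first, which is why the slide treats the two criteria as alternatives.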
  12. RUN EXPERIMENT IN PRODUCTION (IDEAL) Systems behave differently depending on environment and traffic patterns.
    Sampling real traffic is the only way to reliably capture the request path.
    Chaos engineering strongly prefers to experiment directly on production.
    Keep detailed tracking of each experiment: application name, hypothesis, environment, duration, load, observability, results, actions.
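One possible way to keep the tracking described on this slide is a small record type with one field per item in the list. The field types, the `to_json` helper, and all the sample values are assumptions of this sketch.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    """One tracked chaos experiment; fields mirror the slide's list."""

    application_name: str
    hypothesis: str
    environment: str
    duration_s: int
    load: str
    observability: list = field(default_factory=list)  # dashboards, alerts, logs
    results: str = ""
    actions: list = field(default_factory=list)        # follow-up work items

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


# Hypothetical example values.
record = ExperimentRecord(
    application_name="checkout-service",
    hypothesis="The circuit breaker keeps the error rate at its steady state",
    environment="production",
    duration_s=600,
    load="normal weekday traffic",
    observability=["latency dashboard", "error-rate alert"],
)
record.results = "Fallback served; error rate unchanged"
record.actions.append("Lower the breaker timeout")
```

Serializing each record, for example with `record.to_json()`, makes it easy to store experiments next to the code or in a shared log.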
  13. MINIMIZE BLAST RADIUS Experimenting in production has the potential to cause customer pain.
    It is the responsibility and obligation of the Chaos Engineer to ensure that the fallout from experiments is minimized and contained.
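Two common ways to contain fallout are to expose only a small, stable slice of users to the experiment and to abort when the error rate drifts too far from its baseline. The sketch below shows both under stated assumptions: the function names, the hash-based bucketing, and the `tolerance` threshold are hypothetical and not taken from the deck.

```python
import hashlib


def in_experiment(user_id: str, percent: float) -> bool:
    """Deterministically place about `percent` % of users in the experiment.

    Hashing the user id keeps each user consistently in or out across
    requests, so the affected group stays small and stable.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def should_abort(observed_error_rate, baseline_error_rate, tolerance=0.01):
    """Stop the experiment once errors exceed the baseline by more than `tolerance`."""
    return observed_error_rate > baseline_error_rate + tolerance
```

Starting with a very small `percent` and widening it only after the hypothesis holds is one way to grow the blast radius gradually.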
  14. GAMEDAYS “GameDays were coined by Jesse Robbins when he worked at Amazon and was responsible for availability. Jesse created GameDays with the goal of increasing reliability by purposefully creating major failures on a regular basis”
    Engineering Teams, Support Teams, Management Teams
    Target, Time and Place, Goals, Have Fun, Whiteboarding
  15. GAMEDAYS - CHECKLIST
    Things to include ➤ Precise date ➤ War room for in-person attendance ➤ Dial-in information (conference link) ➤ Key people in attendance
    Agenda items ➤ Start ➤ Whiteboarding ➤ Test cases and scoping ➤ Execution ➤ Recap
  16. GAMEDAYS - ROLES Master of disaster, Detective, Support Team, First on call, Second on call, Incident commander
  17. CHAOS TOOLS Chaos Monkey and Simian Army, Chaos Monkey for Spring Boot, Gremlin, Chaos Toolkit, Chaos Mesh
  18. REFERENCES
    ➤ https://www.microsoft.com/security/blog/2014/03/24/reliability-series-1-reliability-vs-resilience/
    ➤ https://www.researchgate.net/profile/Aaron-Clark-Ginsberg/publication/320456274_What%27s_the_Difference_between_Reliability_and_Resilience/links/59e651230f7e9b13aca3c2ba/Whats-the-Difference-between-Reliability-and-Resilience.pdf
    ➤ https://principlesofchaos.org
    ➤ https://www.gremlin.com/community/tutorials/your-first-chaos-experiment/
    ➤ https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/
    ➤ https://searchsoftwarequality.techtarget.com/tip/How-to-set-up-a-chaos-engineering-game-day
    ➤ https://github.com/dastergon/awesome-chaos-engineering