An introduction to Chaos Engineering

AN INTRODUCTION TO CHAOS ENGINEERING
Sebastian Alejandro Velasco
Sr. Advanced Software Development Engineer

“The difference between average people and achieving people is their perception of and response to failure.”
- John C. Maxwell

LECTURER BIO
Sebastian Alejandro Velasco
Sr. Advanced Software Development Engineer
M.Sc. Computer Science
B.Sc. Software and Computer Engineering
Universidad Nacional de Colombia
Passionate about building software applications, researching, and playing video games

TABLE OF CONTENTS
➤ Resilience and Reliability
➤ Software Development Lifecycle
➤ Chaos Engineering
➤ Principles of Chaos Engineering
➤ Gamedays
➤ Chaos Tools

RESILIENCE AND RELIABILITY

RESILIENCE AND RELIABILITY
Difference between Resilience and Reliability
➤ To keep the water supply working every time the tap is turned on
➤ To keep the system working at any time
➤ To recover the power capacity when the principal generator crashes
➤ To call a fallback service when an internal error occurs
➤ To use the auxiliary water tank when the main water supply is broken
➤ To keep the lights on every time the switch is pushed

RESILIENCE AND RELIABILITY
Reliability is “the probability of failure-free software operation for a specified period of time in a specified environment”
Resilience is “the ability of a cloud-based service to withstand certain types of failures and yet remain functional from the customer perspective”
Every system needs to be RESILIENT in order to be RELIABLE. However, neither concept by itself implies the other.

RESILIENCE AND RELIABILITY
Building resilience is the key to turning challenges into successes

SOFTWARE DEVELOPMENT LIFECYCLE

SOFTWARE DEVELOPMENT LIFECYCLE (SDL)
So far, so good...

CHAOTIC SDL

CHAOS ENGINEERING

UNDERSTANDING COMPLEXITY AND SIMPLICITY

UNDERSTANDING COMPLEXITY AND SIMPLICITY
If complexity is causing bad outcomes, and we cannot remove the complexity, then what is supposed to be done?
➤ Embrace complexity rather than avoid it; trying to optimize for simplicity leads to frustration
➤ Learn to navigate complexity
➤ Find tools to move quickly with confidence

CHAOS ENGINEERING
“Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.”
How much confidence can we have in the complex systems that we put into production?
“Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make distributed systems inherently chaotic.”
➤ Chaos engineering is a form of experimentation, rather than a form of testing
➤ It is about making the chaos inherent in the system visible

CHAOS ENGINEERING
Chaos engineering is about, but not limited to:
➤ Simulating failures of a datacenter
➤ Injecting latency between services
➤ Randomly causing exceptions
➤ Emulating I/O errors
➤ Injecting failures into source code
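Two of the failure types listed above, injected latency and randomly raised exceptions, can be illustrated with a small wrapper. This is only a sketch under stated assumptions: the `chaos` decorator, `InjectedFault`, `fetch_price`, and the fallback are hypothetical names made up for this example and do not come from any chaos tool.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised by the wrapper to simulate a failing dependency."""


def chaos(latency_s=0.0, error_rate=0.0, rng=None):
    """Wrap a function so that calls are delayed and sometimes fail.

    latency_s  -- extra delay in seconds added before every call
    error_rate -- probability in [0, 1] that a call raises InjectedFault
    rng        -- random.Random instance, injectable for repeatable runs
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # injected latency
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(latency_s=0.001, error_rate=0.3, rng=random.Random(42))
def fetch_price(item_id):
    # Hypothetical stand-in for a real downstream call.
    return {"item": item_id, "price": 10}


def fetch_price_with_fallback(item_id):
    """Caller under test: degrades gracefully when the dependency fails."""
    try:
        return fetch_price(item_id)
    except InjectedFault:
        return {"item": item_id, "price": None}
```

Calling `fetch_price_with_fallback` in a loop shows whether the caller stays functional while roughly 30% of the underlying calls fail.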

CHAOS ENGINEERING TIMELINE

PRINCIPLES OF CHAOS ENGINEERING

STEADY STATE
A measurable output of a system that indicates normal behavior:
➤ Throughput
➤ Latency percentiles
➤ Error rates
➤ …
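As a rough illustration of how these steady-state indicators could be computed from a window of request samples, here is a minimal sketch. The `Request` record, the nearest-rank percentile method, and the metric names are assumptions chosen for the example; a real system would usually read these values from its monitoring stack.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def steady_state(requests, window_s):
    """Summarize a window of requests into steady-state metrics."""
    if not requests:
        raise ValueError("no requests in the observation window")
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "throughput_rps": len(requests) / window_s,
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(requests),
    }
```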

BUILD A HYPOTHESIS
Then we build a hypothesis around the steady state:
➤ Circuit breaker builds resilience
➤ Eureka enables failover of services
➤ Redis is elastic
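One way to make such a hypothesis checkable is to state it as tolerances on steady-state metrics and compare a control group with the group exposed to the fault. The sketch below assumes metrics where higher is worse (latency, error rate); the function name, metric names, and numbers are illustrative only.

```python
def hypothesis_holds(control, experiment, tolerances):
    """Check that the experiment group stays within tolerance of control.

    control, experiment -- dicts of metric name -> value (higher is worse)
    tolerances          -- dict of metric name -> allowed absolute increase
    Returns (ok, violations), where violations maps metric -> observed increase.
    """
    violations = {}
    for metric, allowed in tolerances.items():
        delta = experiment[metric] - control[metric]
        if delta > allowed:
            violations[metric] = delta
    return (not violations, violations)


# Illustrative numbers, not measurements.
control = {"p99_ms": 180.0, "error_rate": 0.010}
with_fault = {"p99_ms": 195.0, "error_rate": 0.012}
tolerances = {"p99_ms": 50.0, "error_rate": 0.005}

ok, violations = hypothesis_holds(control, with_fault, tolerances)
```

If `ok` is false, the `violations` dict names the metrics that left the steady state, which is the finding the experiment was designed to surface.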

VARY REAL-WORLD EVENTS
Chaos variables reflect real-world events:
➤ Servers dying
➤ Malformed responses
➤ High traffic
➤ Low CPU resources
➤ DDoS attacks
➤ Core services unresponsive
➤ Database bottlenecks
➤ Services data traffic
Prioritize events either by potential impact or estimated frequency.
Any event capable of disrupting steady state is a potential candidate.
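The prioritization step can be sketched as a simple sort over candidate events. The 1 to 5 scores below are made-up placeholders for illustration, not assessments of real systems, and `ChaosEvent` and `prioritize` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosEvent:
    name: str
    impact: int      # 1 (minor) to 5 (severe), illustrative score
    frequency: int   # 1 (rare) to 5 (frequent), illustrative score


def prioritize(events, by="impact"):
    """Order events by potential impact or by estimated frequency."""
    if by not in ("impact", "frequency"):
        raise ValueError("by must be 'impact' or 'frequency'")
    return sorted(events, key=lambda e: getattr(e, by), reverse=True)


candidates = [
    ChaosEvent("Servers dying", impact=4, frequency=3),
    ChaosEvent("Malformed responses", impact=2, frequency=5),
    ChaosEvent("DDoS attacks", impact=5, frequency=1),
]
```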

RUN EXPERIMENT IN PRODUCTION (IDEAL)
➤ Systems behave differently depending on environment and traffic patterns
➤ Sampling real traffic is the only way to reliably capture the request path
➤ Chaos engineering strongly prefers to experiment directly on production
Keep a detailed tracking of each experiment:
➤ Application name
➤ Hypothesis
➤ Environment
➤ Duration
➤ Load
➤ Observability
➤ Results
➤ Actions
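The tracking fields above map naturally onto a small record type. This is a minimal sketch; the field types, the `ExperimentRecord` name, and the sample values are assumptions for illustration, and a team may equally keep this in a spreadsheet or wiki.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    application_name: str
    hypothesis: str
    environment: str
    duration_s: int
    load: str
    observability: list = field(default_factory=list)  # dashboards, alerts
    results: str = ""
    actions: list = field(default_factory=list)        # follow-up items

    def to_json(self):
        """Serialize the record so it can be stored with the experiment log."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example entry.
record = ExperimentRecord(
    application_name="checkout-service",
    hypothesis="p99 latency stays within tolerance when one cache node is lost",
    environment="production",
    duration_s=600,
    load="normal weekday traffic",
    observability=["latency dashboard", "error-rate alert"],
)
```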

MINIMIZE BLAST RADIUS
➤ Experimenting in production has the potential to cause customer pain
➤ It is the responsibility and obligation of the Chaos Engineer to ensure that the fallout from experiments is minimized and contained
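Two common ways to contain the fallout are to expose only a small, stable slice of users to the fault and to stop the experiment as soon as a guard metric degrades. The sketch below assumes hash-based bucketing and an error-rate guard; the function names and the 2-point threshold are hypothetical choices for the example.

```python
import hashlib


def in_experiment(user_id, percent):
    """Deterministically place about `percent`% of users in the experiment.

    Hashing keeps the assignment stable: the same user is always in or out.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def should_abort(error_rate, baseline_error_rate, max_increase=0.02):
    """Abort when the error rate rises more than max_increase over baseline."""
    return error_rate - baseline_error_rate > max_increase
```

Starting with a small `percent` and raising it only after the hypothesis holds keeps each step's potential customer impact bounded.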

GAMEDAYS

GAMEDAYS
“GameDays were coined by Jesse Robbins when he worked at Amazon and was responsible for availability. Jesse created GameDays with the goal of increasing reliability by purposefully creating major failures on a regular basis”
➤ Engineering Teams
➤ Support Teams
➤ Management Teams
➤ Target
➤ Time and Place
➤ Goals
➤ Have Fun
➤ Whiteboarding

GAMEDAYS - CHECKLIST
Things to include
➤ Precise date
➤ War room for in-person attendance
➤ Dial-in information (conference link)
➤ Key people in attendance
Agenda items
➤ Start
➤ Whiteboarding
➤ Test cases and scoping
➤ Execution
➤ Recap

GAMEDAYS - ROLES
➤ Master of disaster
➤ Detective
➤ Support Team
➤ First on call
➤ Second on call
➤ Incident commander

CHAOS TOOLS

CHAOS TOOLS
➤ Chaos Monkey and Simian Army
➤ Chaos Monkey for Spring Boot
➤ Gremlin
➤ Chaos Toolkit
➤ Chaos Mesh

RECOMMENDED READING
Amazon store

REFERENCES
➤ https://www.microsoft.com/security/blog/2014/03/24/reliability-series-1-reliability-vs-resilience/
➤ https://www.researchgate.net/profile/Aaron-Clark-Ginsberg/publication/320456274_What%27s_the_Difference_between_Reliability_and_Resilience/links/59e651230f7e9b13aca3c2ba/Whats-the-Difference-between-Reliability-and-Resilience.pdf
➤ https://principlesofchaos.org
➤ https://www.gremlin.com/community/tutorials/your-first-chaos-experiment/
➤ https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/
➤ https://searchsoftwarequality.techtarget.com/tip/How-to-set-up-a-chaos-engineering-game-day
➤ https://github.com/dastergon/awesome-chaos-engineering
