
Resilience vs. Chaos Engineering @ Codemotion Meetup en Accenture C85

These are the slides from the talk I gave this past 12/12 on "resilience and chaos engineering" at our fantastic Accenture C85 office, co-organized with Codemotion.

Jorge Hidalgo

December 12, 2023

Transcript

  1. ABOUT ME: JORGE HIDALGO
    Associate Director – Software Engineering Group – Accenture Spain
    Global Java Community of Practice co-lead
    Iberia Cloud First DevOps lead
    Advanced Technology Center Cloud First Architecture & Platforms lead
    Communities matter: coordinating MálagaJUG / Málaga Scala / BoquerónSec
    Father of two, husband, whistle player, video gamer, sci-fi *.*, Lego, Raspberry Pi, Star Wars, Star Trek, LOTR, Halo, Borderlands, Watch Dogs, Diablo, StarCraft, Black Desert,… LLAP!
    @deors314 | in/deors
  2. BUILDING UP RESILIENCE(*) on software-intensive systems is a must-have capability as failures become normal; as such, systems architecture design must be oriented towards the early identification of problems, mitigation and fast reaction to outages.
    (*) From Google’s SRE book: https://sre.google/books/
  3. BUILDING UP RESILIENCE
    OBSERVABILITY: Avoid incidents if problems are noticed before an outage happens
    ROBUSTNESS: Components that handle internal failure well
    AVAILABILITY: Outages will happen and we must be ready for that
  4. RESILIENCE ENGINEERING
    Architecture and application design must be aware of the different outage types:
    FAULT. Is it observable? Possibly un-noticeable if remediated. What is the impact? Not a problem if contained (self-recovered, replaced invisibly) or undetectable. Will it last? Maybe transient. Resilience patterns: Retry, Isolate & replace, Restart.
    ERROR. Is it observable? Observable, minor loss of availability. What is the impact? A problem resulting in multiple events, impacting multiple users. Will it last? Will burn some error budget until solved. Resilience patterns: Circuit breaker, Bulkhead, Back pressure, Load shed, API throttling/limits, Isolate & replace, Restart.
    FAILURE. Is it observable? Major loss of availability. What is the impact? The problem impacts all or a significant number of users, for all or most system capabilities. Will it last? Unless solved fast, it will impact SLOs and SLAs. Resilience patterns: Redundancy, Load shed, API throttling/limits, Isolate & replace, Restart.
    (A sketch of one of these patterns, the circuit breaker, follows below.)
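    The following is a minimal plain-Java sketch of the circuit breaker pattern from the table above, written for illustration only: after a number of consecutive failures it opens and fails fast to a fallback, then lets a trial call through once a cool-down period has passed. The class and parameter names are illustrative and not taken from the deck or from any particular library.

        import java.time.Duration;
        import java.time.Instant;
        import java.util.function.Supplier;

        // Minimal illustrative circuit breaker: after maxFailures consecutive failures
        // the breaker opens and fails fast; once openDuration elapses, one trial call
        // is allowed through (a simplified "half-open" behavior).
        public class SimpleCircuitBreaker {

            private enum State { CLOSED, OPEN }

            private final int maxFailures;
            private final Duration openDuration;
            private int consecutiveFailures = 0;
            private State state = State.CLOSED;
            private Instant openedAt = Instant.MIN;

            public SimpleCircuitBreaker(int maxFailures, Duration openDuration) {
                this.maxFailures = maxFailures;
                this.openDuration = openDuration;
            }

            public synchronized <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
                if (state == State.OPEN) {
                    if (Instant.now().isBefore(openedAt.plus(openDuration))) {
                        return fallback.get();      // fail fast while the breaker is open
                    }
                    state = State.CLOSED;           // cool-down over: allow a trial call
                }
                try {
                    T result = protectedCall.get();
                    consecutiveFailures = 0;        // success resets the failure count
                    return result;
                } catch (RuntimeException e) {
                    if (++consecutiveFailures >= maxFailures) {
                        state = State.OPEN;         // too many failures: open the breaker
                        openedAt = Instant.now();
                    }
                    return fallback.get();
                }
            }
        }

    In practice, libraries such as Resilience4j or the traffic-management features of a service mesh already provide this pattern; the sketch is only meant to show its shape.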
  5. CHAOS ENGINEERING(*) is the thoughtful planning and execution of experiments designed to reveal weaknesses in a system before they become outages.
    (*) From Gremlin: https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
  6. By breaking things on purpose, we surface unknown issues that could impact our systems and customers.
    PLAN AN EXPERIMENT: Create a hypothesis about what could go wrong.
    CONTAIN THE BLAST RADIUS: Execute the smallest test that will teach you something (and have a rollback plan in case something goes wrong).
    SCALE OR SQUASH: Found an issue? Superb, it can be fixed now! Otherwise, increase the blast radius until the experiment runs at full scale. (A sketch of this loop follows below.)
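    As a rough illustration of that plan–contain–scale loop, here is a small Java sketch, not from the talk, in which all names are hypothetical placeholders: it starts with the smallest blast radius, verifies the steady-state hypothesis after each injection, and either stops to fix a surfaced weakness or widens the blast radius.

        import java.util.function.BooleanSupplier;
        import java.util.function.IntConsumer;

        // Illustrative "plan an experiment / contain the blast radius / scale or squash" loop.
        // steadyStateHolds, injectFault and rollback are hypothetical hooks into your own
        // telemetry and fault-injection tooling.
        public class ChaosExperiment {

            public static void run(BooleanSupplier steadyStateHolds,
                                   IntConsumer injectFault,
                                   Runnable rollback,
                                   int fullScale) {
                for (int blastRadius = 1; blastRadius <= fullScale; blastRadius++) {
                    // PLAN AN EXPERIMENT: hypothesis = steady state survives a fault of this size.
                    System.out.printf("Hypothesis: steady state survives a fault at blast radius %d%n", blastRadius);

                    injectFault.accept(blastRadius);   // CONTAIN THE BLAST RADIUS: start small
                    boolean hypothesisHeld = steadyStateHolds.getAsBoolean();
                    rollback.run();                    // rollback plan in case something went wrong

                    if (!hypothesisHeld) {
                        // SQUASH: a weakness surfaced; fix it before scaling the experiment further.
                        System.out.println("Weakness found at blast radius " + blastRadius + ": fix it, then re-run.");
                        return;
                    }
                    // SCALE: the hypothesis held, so widen the blast radius and repeat.
                }
                System.out.println("Experiment ran at full scale without surfacing new issues.");
            }
        }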
  7. “UNLEASHING” CHAOS
    1) Known knowns: things we are aware of and understand
    2) Known unknowns: things we are aware of but don’t understand
    3) Unknown knowns: things we understand but are not aware of
    4) Unknown unknowns: things we are neither aware of nor understand
  8. “UNLEASHING” CHAOS: How to plan an experiment
    Start by measuring normal activity and ensure that you have the right telemetry in place; only afterwards plan an experiment.
    Plan the first experiments on known knowns, and progressively go for more advanced (e.g., complex to simulate) and larger blast radius experiments.
    Every experiment should have one or more performance metrics to halt the experiment if they are impacted, and one or more scenario metrics to verify the hypothesis of the experiment (a guard sketch follows below).
    NEVER TEST A KNOWN BROKEN SYSTEM.
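    A sketch of those guardrails, again illustrative only: one performance metric acts as the halt condition, one scenario metric verifies the hypothesis, and the baseline is checked first so that a known broken system is never tested. The metric names, thresholds and helper methods are hypothetical placeholders for your own telemetry and fault-injection tooling.

        import java.time.Duration;
        import java.time.Instant;
        import java.util.function.DoubleSupplier;

        // Illustrative guard around a chaos experiment: halt on a degraded performance
        // metric, confirm the hypothesis through a scenario metric, never start from a
        // broken baseline. readMetric, injectFault and rollback are placeholders.
        public class GuardedExperiment {

            public static void main(String[] args) throws InterruptedException {
                DoubleSupplier errorRate = () -> readMetric("http_error_rate");    // performance metric
                DoubleSupplier replacedNodes = () -> readMetric("replaced_nodes"); // scenario metric

                // NEVER TEST A KNOWN BROKEN SYSTEM: check the baseline before injecting anything.
                if (errorRate.getAsDouble() > 0.01) {
                    System.out.println("Baseline is unhealthy, aborting before fault injection.");
                    return;
                }

                injectFault();
                Instant deadline = Instant.now().plus(Duration.ofMinutes(10));

                while (Instant.now().isBefore(deadline)) {
                    if (errorRate.getAsDouble() > 0.05) {      // halt condition: impact too large
                        System.out.println("Performance metric breached, halting and rolling back.");
                        break;
                    }
                    if (replacedNodes.getAsDouble() >= 1.0) {  // hypothesis verified
                        System.out.println("Scenario metric confirms the hypothesis.");
                        break;
                    }
                    Thread.sleep(5_000);
                }
                rollback();
            }

            private static double readMetric(String name) { return 0.0; }  // placeholder telemetry query
            private static void injectFault() { /* e.g., stop one instance */ }
            private static void rollback() { /* restore normal operation */ }
        }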
  9. “UNLEASHING” CHAOS: Experiment examples
    1) Known knowns: We know that if a node in our Kubernetes cluster fails, it is replaced with another node and workloads are distributed across the cluster. Experiment: shut down one cluster node and measure times along the recovery process (increase the cluster size beforehand to reduce app outages!).
    2) Known unknowns: We know that the node is replaced, and we have logs to check for that, but we don’t know the average time it takes from experiencing a failure until the cluster is back to its previous state. Experiment: repeat the experiment on different days and at different times, get an idea of the average MTTD/MTTR, and configure alerts accordingly (e.g., “this is taking too long to recover”). A measurement sketch follows below.
    3) Unknown knowns: If we shut down two or more nodes at the same time, the system should recover by itself, but we don’t know how long it will take to recover or how many running apps will be impacted by the outage. Experiment: shut down two cluster nodes at the same time (remember to increase the cluster size beforehand!), measure recovery times at different moments along the week, and repeat periodically (e.g., monthly).
    4) Unknown unknowns: We don’t know exactly what would happen if we shut down the entire Kubernetes cluster in our main region, and we don’t know the time needed to restore from the latest backups in a brand-new cluster in a different region. Experiment: you are not ready to run this experiment yet; first prioritize the engineering work to handle this failure scenario (e.g., build a backup cluster and run DR exercises) before a chaos experiment is planned and executed.
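    For the first two experiments, measuring MTTD/MTTR can be as simple as timestamping the moment the node is shut down and then polling a health signal until the system reports healthy again. The sketch below does this against a hypothetical health endpoint; in practice the same number would usually come from existing telemetry or the Kubernetes API rather than a hand-rolled probe.

        import java.io.IOException;
        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.time.Duration;
        import java.time.Instant;

        // Illustrative recovery-time measurement: record when the fault was injected
        // (the node shutdown happens outside this program), then poll a health endpoint
        // until it answers 200 again and report the elapsed time as a rough MTTR.
        // The endpoint URL is a hypothetical placeholder.
        public class RecoveryTimer {

            public static void main(String[] args) throws Exception {
                HttpClient client = HttpClient.newHttpClient();
                HttpRequest probe = HttpRequest.newBuilder(URI.create("http://example.internal/healthz"))
                        .timeout(Duration.ofSeconds(2))
                        .build();

                Instant faultInjectedAt = Instant.now();   // node shut down at this moment

                while (true) {
                    try {
                        HttpResponse<Void> response = client.send(probe, HttpResponse.BodyHandlers.discarding());
                        if (response.statusCode() == 200) {
                            Duration mttr = Duration.between(faultInjectedAt, Instant.now());
                            System.out.println("Recovered after " + mttr.toSeconds() + "s");
                            break;
                        }
                    } catch (IOException e) {
                        // endpoint unreachable while the cluster reschedules workloads: keep polling
                    }
                    Thread.sleep(5_000);
                }
            }
        }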