Disaster Recovery Testing vs Chaos Engineering

Chaos Engineering vs Disaster Recovery Testing
JAN 21, 2026

About me: who I am
Yury Niño Roa, Cloud AppMod Engineer @Google, @yurynino

He: Are you still mad? … I have memes.
Me: Which ones?

The References

Do you know what etymology is?

Chaos is about complete disorder and confusion.
Engineering is about designing, building, and using engines, machines, and structures.

What do you get when you join the words?

Chaos Engineering
It is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.
https://principlesofchaos.org

Disaster is an unexpected event that causes significant destruction or adverse consequences.
Recovery is a return to a normal state of health, mind, or strength.
Testing is the action of checking that someone or something is working as expected.

What do you get when you join the words?

Disaster Recovery Testing
It is a program performed internally at Google, in which a group of engineers plans and executes real and fictitious outages to test the effective response of the involved teams.

If I join Cha practices?
Am I here to see the same thing that I could have found in a dictionary?

I asked Gemini ...

Focus
Chaos Engineering: Proactive identification of systemic weaknesses in complex systems under turbulent conditions.
Disaster Recovery Testing: Test reactive response and recovery procedures in the event of a catastrophic failure or disaster.

Method
Chaos Engineering: Controlled experiments are conducted in production, introducing failures (e.g., network latency, server outages) to observe system behavior and uncover hidden vulnerabilities.
Disaster Recovery Testing: Involves simulating a major outage (e.g., data center failure) and testing the ability to restore systems and data within a defined recovery time objective (RTO).
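The chaos experiment method described on this slide can be sketched as a small simulation: measure a steady state, inject a failure, and check whether a hypothesis still holds. This is a minimal sketch under stated assumptions, not a real chaos tool. The toy `call_service` function, the latency range, and the threshold are all hypothetical.

```python
import random
import statistics


def call_service(extra_latency_ms: float, rng: random.Random) -> float:
    """Toy stand-in for a real request: returns a response time in ms."""
    base = rng.uniform(20.0, 40.0)  # hypothetical healthy latency range
    return base + extra_latency_ms


def p95(samples: list[float]) -> float:
    """95th percentile of the samples."""
    return statistics.quantiles(samples, n=20)[-1]


def run_experiment(injected_latency_ms: float, threshold_ms: float,
                   requests: int = 200, seed: int = 7) -> dict:
    """Compare p95 latency before and after injecting extra latency."""
    rng = random.Random(seed)

    # 1. Measure the steady state before touching anything.
    baseline = [call_service(0.0, rng) for _ in range(requests)]
    # 2. Inject the failure (here: added network latency) and measure again.
    disturbed = [call_service(injected_latency_ms, rng) for _ in range(requests)]

    disturbed_p95 = p95(disturbed)
    return {
        "baseline_p95_ms": p95(baseline),
        "disturbed_p95_ms": disturbed_p95,
        # 3. The hypothesis holds if p95 latency stays under the threshold.
        "hypothesis_holds": disturbed_p95 <= threshold_ms,
    }
```

Calling `run_experiment(50.0, 100.0)` keeps the toy service under the threshold, while `run_experiment(200.0, 100.0)` does not. In a real experiment the injected failure would be reverted at the end and the blast radius kept small.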

Goal
Chaos Engineering: To build confidence in the system's ability to withstand unexpected disruptions and improve overall resilience.
Disaster Recovery Testing: To minimize downtime and data loss in the event of a major disruption, ensuring business continuity.

It is not because I work for Google, but let
me talk a little bit more about DiRT.

What is DiRT about?

Disaster Recovery Testing
• Google tests software and systems, but also people, preparation, processes, and response tools.
• It's about learning and finding single points of failure, so the scope of services and systems is broad.
• Services are intentionally disrupted in order to know how to respond and provide reliability.
• Established in 2006 to exercise response to production emergencies.

What does Google Test?
• Software: Modifying live service configurations, or bringing up services with known bugs.
• Infrastructure: Stress testing large complex architectures, validating SLOs, and ensuring resilience is maintained during disruption.
• Access Controls: Including security, compliance, and privacy.
• People and Workflows: Removing people who might have knowledge or experience.

Disaster Recovery Testing
Tier 3: Testing resilience of a specific system or product [no expected external impact].
Tier 2: Testing resilience of the dependencies of a shared system or product.
Tier 1: Testing resilience of organizational response to an enterprise-level event.

Tier 3 Example: Deploy a bad configuration file.

Scenario
A bad configuration file is included in the next release, generating more CPU and memory consumption. This impacts only the users of an experimental feature in a product.

Response
• Incident management protocols from the service's owners.
• The continuous testing of the services defined by the owners.
• Validating disaster readiness and response of a service and the team.
• Identification and expansion of standard tests that can be used to de-risk Tier 2 and Tier 1 testing.

What can you learn?
• If a team is able to effectively perform IMAG.
• If a service is resilient to a specific class of failure.
• If the service is not overly dependent on a specific resource.
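One way to detect the Tier 3 scenario during a drill is to compare resource usage of the new release against a baseline. The check below is a minimal sketch under stated assumptions: the metric names, the sample values, and the 20% tolerance are hypothetical and do not come from the talk.

```python
def regression_ratio(baseline: float, candidate: float) -> float:
    """Relative increase of the candidate over the baseline (0.5 means +50%)."""
    return (candidate - baseline) / baseline


def should_roll_back(baseline: dict[str, float], candidate: dict[str, float],
                     max_increase: float = 0.20) -> bool:
    """True if any metric grew by more than the allowed fraction."""
    return any(
        regression_ratio(baseline[metric], candidate[metric]) > max_increase
        for metric in baseline
    )


# Hypothetical measurements before and after the bad configuration file.
before = {"cpu_cores": 2.0, "memory_mib": 1024.0}
after_bad_config = {"cpu_cores": 3.0, "memory_mib": 1100.0}
after_good_config = {"cpu_cores": 2.1, "memory_mib": 1050.0}
```

Here `should_roll_back(before, after_bad_config)` flags the release because CPU usage grew by 50%, while the second candidate stays within tolerance.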

Tier 2 Example: Run at Service Level

Scenario
An unusually large traffic spike degrades the latency of a heavily used shared internal service. The service remains barely within its published SLA for an extended period.

Response
• Communication to key service consumers.
• Incident management protocols from the service's owners.
• Emergency serving capacity increases.
• Graceful degradation and external messaging to customers.

What can you learn?
• Do service consumers tolerate worst-case scenarios, or do they assume the average experience as a baseline?
• Do your alerting and monitoring systems behave the way you want for both service providers and consumers in this scenario?
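The question of whether consumers plan for the worst case or for the average can be made concrete with a few numbers. The sketch below compares the mean latency of a shared service with its 99th percentile and checks a consumer timeout against the latter. The sample values and the timeouts are hypothetical.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict:
    """Mean and 99th percentile latency of the samples."""
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p99_ms": statistics.quantiles(samples_ms, n=100)[-1],
    }


def consumer_tolerates(samples_ms: list[float], timeout_ms: float) -> bool:
    """True if the consumer's timeout covers the provider's p99 latency."""
    return latency_summary(samples_ms)["p99_ms"] <= timeout_ms


# Hypothetical samples: mostly fast, with a slow tail during the traffic spike.
samples = [50.0] * 95 + [900.0] * 5
```

With these samples the mean is 92.5 ms, so a consumer that sizes its timeout at 200 ms from the average looks safe on paper, yet `consumer_tolerates(samples, 200.0)` is false because the slow tail reaches 900 ms.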

Tier 1 Example: Hacked!

Scenario
Redeploy an application that uses Apache Log4j2 version 2.14.0, which has a security vulnerability [CVE-2021-44228], launch a script that exploits this vulnerability, and validate that your controls are able to monitor and generate alerts.

Response
• Security team invokes incident management protocols and business continuity plan.
• All impacted users are notified.
• Support staff isolate impacted workstations and issue emergency alternative OS laptops.
• New policies on the fly. Higher demand on shared computing resources.

What can you learn?
• Creativity and a culture that promotes flexibility help a lot.
• Communication matters, especially when time is limited.
• Expect the unexpected. Back up (and restore) essential data automatically. Validate backup and restoration!

Practical Examples

Blackholing internal traffic
Add a VPC rule to a cloud project that routes traffic destined for the IP addresses of some of its in-region hosts (say, VMs running MySQL DBs) to some other project where the traffic is dropped. Reversion is by deleting the rule. The chief risk is that the rule captures more traffic than expected, so the outage is bigger than planned.

Redirect traffic away from a region using a load balancer
The customer can remove a backend from a load balancer and drain connections. This can be used to simulate failover from one zone or one region to another by removing a managed instance group that includes all the resources in one zone or region, forcing the LB to send the traffic to another region. The risk is that the resources making up the GCLB backends will still remain up but will not serve any traffic.
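The chief risk of the blackholing test, a rule that captures more traffic than planned, can be checked before the rule is applied. The following is a minimal sketch using Python's standard `ipaddress` module. The inventory, host names, and CIDR ranges are made up for illustration, and the code does not call any cloud API.

```python
import ipaddress


def hosts_captured(rule_cidr: str, hosts: dict[str, str]) -> list[str]:
    """Names of hosts whose IP address falls inside the rule's range."""
    network = ipaddress.ip_network(rule_cidr)
    return sorted(
        name for name, ip in hosts.items()
        if ipaddress.ip_address(ip) in network
    )


def unexpected_hosts(rule_cidr: str, hosts: dict[str, str],
                     intended: set[str]) -> list[str]:
    """Hosts the rule would capture that are not in the planned target set."""
    return sorted(set(hosts_captured(rule_cidr, hosts)) - intended)


# Hypothetical inventory; names and addresses are made up for illustration.
inventory = {
    "mysql-a": "10.0.1.10",
    "mysql-b": "10.0.1.11",
    "billing-api": "10.0.1.200",
    "frontend": "10.0.2.5",
}
```

A rule for `10.0.1.0/24` meant only for the two MySQL hosts would also blackhole `billing-api`, whereas the narrower `10.0.1.8/29` captures only the intended targets.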

You told me that this talk would be about CE vs DiRT.
Yes, but I have to sell DiRT.

How to compare these Practices?

Are you sure they are both practices?

Dimension: Objective
Disaster Recovery Testing: To ensure the system can recover from catastrophic failures (e.g., data center outages).
Chaos Engineering: To understand system behavior and improve reliability by intentionally causing failures in a controlled manner.

Dimension: Scope
Disaster Recovery Testing: Focuses on business continuity plans, backup systems, and recovery processes.
Chaos Engineering: Focuses on distributed systems' robustness, identifying weaknesses proactively.

Dimension: Scenario
Disaster Recovery Testing: Simulates severe, often rare, events such as data loss, hardware failures, or site outages.
Chaos Engineering: Simulates everyday failures like service crashes, latency spikes, or network disruptions.

Dimension: Methodology
Disaster Recovery Testing: Involves predefined tests and drills to validate recovery plans and timelines.
Chaos Engineering: Uses experiments designed to stress the system in various ways, often run continuously or periodically.

So bored. How many dimensions are there?

Dimension: Automation
Disaster Recovery Testing: Often relies on manual or semi-automated procedures and is usually not fully automated.
Chaos Engineering: Highly automated, with tools such as Chaos Monkey that run experiments autonomously.
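To give a feel for what running experiments autonomously can mean, here is a minimal sketch of an automated victim picker in the spirit of tools like Chaos Monkey. It is not Chaos Monkey's actual implementation. The function name, the group layout, and the rule of skipping groups without redundancy are assumptions made for illustration.

```python
import random


def pick_victims(groups: dict[str, list[str]], probability: float,
                 rng: random.Random) -> dict[str, str]:
    """For each group, with the given probability, pick one instance to terminate."""
    victims = {}
    for group, instances in sorted(groups.items()):
        # Never touch a group that has no redundancy left.
        if len(instances) < 2:
            continue
        if rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


# Hypothetical instance groups.
groups = {"web": ["web-1", "web-2", "web-3"], "db": ["db-1"]}
```

A scheduler could call `pick_victims(groups, 0.1, random.Random())` on a fixed cadence. Passing a seeded `random.Random` makes a run reproducible, which helps when reviewing an experiment afterwards.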

Dimension: Frequency
Disaster Recovery Testing: Conducted periodically (e.g., annually or quarterly) as a planned event.
Chaos Engineering: Can be conducted frequently, even as part of normal operations, to gain ongoing insights.

Dimension: Measurement
Disaster Recovery Testing: Evaluates recovery metrics like Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Chaos Engineering: Assesses system performance, resilience, and the ability to handle stress under failure scenarios.
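The two recovery metrics can be computed from three timestamps recorded during a drill. This is a minimal sketch. The function name, the timestamps, and the objectives are hypothetical, and the data loss window is approximated as the time between the last backup and the start of the outage.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_backup: datetime, outage_start: datetime,
                   service_restored: datetime,
                   rto: timedelta, rpo: timedelta) -> dict:
    """Compare achieved recovery time and data loss window with RTO and RPO."""
    recovery_time = service_restored - outage_start  # how long the service was down
    data_loss_window = outage_start - last_backup    # writes after the last backup are lost
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical drill: backup at 02:00, outage at 02:45, restored at 04:15.
result = evaluate_drill(
    last_backup=datetime(2026, 1, 21, 2, 0),
    outage_start=datetime(2026, 1, 21, 2, 45),
    service_restored=datetime(2026, 1, 21, 4, 15),
    rto=timedelta(hours=2),
    rpo=timedelta(minutes=30),
)
```

In this drill the service is back after 1 hour 30 minutes, within the 2 hour RTO, but 45 minutes of data were written after the last backup, which exceeds the 30 minute RPO.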

Dimension: Focus
Disaster Recovery Testing: Ensures data integrity, service continuity, and minimal downtime during recovery.
Chaos Engineering: Tests the system's ability to self-heal, maintain service levels, and minimize the blast radius of issues.

Dimension: Example
Disaster Recovery Testing: Testing if a backup system can restore data after a simulated data center failure.
Chaos Engineering: Simulating server outages to see if a distributed system can balance load and maintain performance.

Thank You!
