
Chaos Engineering or DiRT

Yury Nino
November 16, 2024


Transcript

  1. Chaos is about complete disorder and confusion. Engineering is about designing, building, and using engines, machines, and structures.
  3. Chaos Engineering: the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. https://principlesofchaos.org
  4. Disaster is an unexpected event that causes significant destruction or adverse consequences. Recovery is a return to a normal state of health, mind, or strength. Testing is the action of checking that someone or something is working as expected.
  6. Disaster Recovery Testing: a program performed internally at Google, in which a group of engineers plan and execute real and fictitious outages to test the effective response of the involved teams.
  7. So if I joined to learn Chaos practices, am I here to see the same that I could have found in a dictionary?
  8. Dimension: Focus. Chaos Engineering: proactive identification of systemic weaknesses in complex systems under turbulent conditions. Disaster Recovery Testing: testing reactive response and recovery procedures in the event of a catastrophic failure or disaster.
  9. Dimension: Method. Chaos Engineering: controlled experiments are conducted in production, introducing failures (e.g., network latency, server outages) to observe system behavior and uncover hidden vulnerabilities. Disaster Recovery Testing: involves simulating a major outage (e.g., data center failure) and testing the ability to restore systems and data within a defined recovery time objective (RTO).
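The controlled-experiment method above can be sketched in miniature. Everything here is a hypothetical stand-in, not Google's or the speaker's tooling: `flaky_call` simulates a downstream service, the injected latency is the chaos fault, and the 0.5 s timeout plays the role of the steady-state hypothesis (the SLO the experiment checks).

```python
import time

def flaky_call(base_latency_s=0.01, injected_latency_s=0.0):
    # Stand-in for a downstream service call; the injected latency
    # models the fault a chaos experiment would introduce.
    time.sleep(base_latency_s + injected_latency_s)
    return "ok"

def run_experiment(injected_latency_s, timeout_s=0.5, trials=5):
    # Steady-state hypothesis: every call completes within the timeout.
    # Returns how many calls violated it under the injected fault.
    slow_calls = 0
    for _ in range(trials):
        start = time.monotonic()
        flaky_call(injected_latency_s=injected_latency_s)
        if time.monotonic() - start > timeout_s:
            slow_calls += 1
    return slow_calls

baseline = run_experiment(0.0)               # hypothesis should hold
with_fault = run_experiment(0.6, trials=2)   # injected latency should break it
```

In a real experiment the fault would be injected at the network or infrastructure layer rather than inside the call, but the shape (baseline, inject, compare against the hypothesis) is the same.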
  10. Dimension: Goal. Chaos Engineering: to build confidence in the system's ability to withstand unexpected disruptions and improve overall resilience. Disaster Recovery Testing: to minimize downtime and data loss in the event of a major disruption, ensuring business continuity.
  11. It is not because I work for Google, but let

    me talk a little bit more about DiRT.
  12. Disaster Recovery Testing: Google tests software and systems, but also people, preparation, processes, and response tools. It's about learning and finding single points of failure, so the scope of services and systems is broad. Services are intentionally disrupted in order to learn how to respond and to provide reliability. Established in 2006 to exercise response to production emergencies.
  13. What does Google test? Software: modifying live service configurations, or bringing up services with known bugs. Infrastructure: stress testing large complex architectures, validating SLOs, and ensuring resilience is maintained during disruption. Access Controls: including security, compliance, and privacy. People and Workflows: removing people who might have knowledge or experience.
  14. Disaster Recovery Testing tiers. Tier 3: testing resilience of a specific system or product (no external impact expected). Tier 2: testing resilience of dependencies of a shared system or product. Tier 1: testing resilience of organizational response to an enterprise-level event.
  15. Tier 3 Example: Deploy a bad configuration file. Scenario: A bad configuration file is included in the next release, generating more CPU and memory consumption. This impacts only the users of an experimental feature in a product. Response: • Incident management protocols from the service's owners. • The continuous testing of the services defined by the owners. • Validating disaster readiness and response of a service and the team. • Identification and expansion of standard tests that can be used to de-risk Tier 2 and Tier 1 testing. What can you learn? • If a team is able to effectively perform IMAG. • If a service is resilient to a specific class of failure. • If the service is not overly dependent on a specific resource.
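One way to automate the "resilient to a specific class of failure" check from this scenario is a canary comparison: measure resource usage of the current release against a candidate carrying the new config and reject the rollout if it regresses too far. The 20% budget below is an invented example value, not a Google threshold.

```python
def canary_ok(baseline_cpu, canary_cpu, max_regression=0.20):
    # Reject the release if the canary's CPU usage exceeds the baseline
    # by more than the allowed regression budget (20% here, by assumption).
    if baseline_cpu <= 0:
        raise ValueError("baseline CPU must be positive")
    return (canary_cpu - baseline_cpu) / baseline_cpu <= max_regression

# A bad config driving CPU from 1.0 to 1.5 cores blows a 20% budget,
# catching the Tier 3 scenario before it reaches all users.
```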
  16. Tier 2 Example: Run at service level. Scenario: An unusually large traffic spike degrades the latency of a heavily used shared internal service. The service remains barely within its published SLA for an extended period. Response: • Communication to key service consumers. • Incident management protocols from the service's owners. • Emergency serving capacity increases. • Graceful degradation and external messaging to customers. What can you learn? • Do service consumers tolerate worst-case scenarios, or do they assume the average experience as a baseline? • Do your alerting and monitoring systems behave the way you want for both service providers and consumers in this scenario?
  17. Tier 1 Example: Hacked! Scenario: Redeploy an application that uses Apache Log4j2 version 2.14.0, which has a security vulnerability [CVE-2021-44228]; launch a script that exploits this vulnerability and validate that your controls are able to monitor and generate alerts. Response: • Security team invokes incident management protocols and the business continuity plan. • All impacted users are notified. • Support staff isolate impacted workstations and issue emergency alternative OS laptops. • New policies on the fly. Higher demand on shared computing resources. What can you learn? • Creativity and a culture that promotes flexibility help a lot. • Communication matters, especially when time is limited. • Expect the unexpected. Back up (and restore) essential data automatically. Validate backup and restoration!
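A first line of defense for this scenario is a dependency gate in CI. The sketch below is a deliberately simplified check, not a real scanner: CVE-2021-44228 affects Log4j2 releases before 2.15.0 (with follow-up hardening fixes through the 2.17.x line), and this version only parses plain "major.minor.patch" strings.

```python
def log4j2_vulnerable(version):
    # Simplified gate for CVE-2021-44228: Log4j2 releases before 2.15.0
    # are affected. Handles plain "major.minor.patch" strings only
    # (no beta/rc suffixes), so real builds need a proper version parser.
    major, minor = (int(p) for p in version.split(".")[:2])
    return major == 2 and minor < 15
```

A gate like this only catches the known vulnerability; the point of the Tier 1 exercise is testing what happens when the exploit lands anyway.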
  18. You told me that this talk would be about CE vs DiRT. Yes, but I have to sell DiRT.
  19. Dimension: Objective. Disaster Recovery Testing: to ensure the system can recover from catastrophic failures (e.g., data center outages). Chaos Engineering: to understand system behavior and improve reliability by intentionally causing failures in a controlled manner.
  20. Dimension: Scope. Disaster Recovery Testing: focuses on business continuity plans, backup systems, and recovery processes. Chaos Engineering: focuses on distributed systems' robustness, identifying weaknesses proactively.
  21. Dimension: Scenario. Disaster Recovery Testing: simulates severe, often rare, events such as data loss, hardware failures, or site outages. Chaos Engineering: simulates everyday failures like service crashes, latency spikes, or network disruptions.
  22. Dimension: Methodology. Disaster Recovery Testing: involves predefined tests and drills to validate recovery plans and timelines. Chaos Engineering: uses experiments designed to stress the system in various ways, often run continuously or periodically.
  23. Dimension: Automation. Disaster Recovery Testing: often relies on manual or semi-automated procedures and is usually not fully automated. Chaos Engineering: highly automated, with tools such as Chaos Monkey that run experiments autonomously.
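The core of a Chaos-Monkey-style tool is small: pick random victims from the fleet and terminate them on a schedule. This is a minimal sketch of the selection step only, with an invented `pick_victims` helper and a safety rule (never the whole fleet) chosen here for illustration; it is not Netflix's implementation.

```python
import random

def pick_victims(instances, fraction=0.1, seed=None):
    # Chaos-Monkey-style selection: terminate a random fraction of
    # instances, at least one when possible, but never the whole fleet.
    if len(instances) < 2:
        return []  # nothing we can safely terminate
    rng = random.Random(seed)  # seedable for reproducible game days
    count = min(len(instances) - 1, max(1, int(len(instances) * fraction)))
    return rng.sample(instances, count)

fleet = [f"instance-{i}" for i in range(10)]
victims = pick_victims(fleet, fraction=0.1, seed=42)  # one random victim
```

A real tool would wrap this in scheduling, opt-in/opt-out per service, and an actual termination call against the cloud provider's API.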
  24. Dimension: Frequency. Disaster Recovery Testing: conducted periodically (e.g., annually or quarterly) as a planned event. Chaos Engineering: can be conducted frequently, even as part of normal operations, to gain ongoing insights.
  25. Dimension: Measurement. Disaster Recovery Testing: evaluates recovery metrics like Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Chaos Engineering: assesses system performance, resilience, and the ability to handle stress under failure scenarios.
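The two recovery metrics named here fall out directly from incident timestamps: achieved RTO is the downtime between outage and restoration, and achieved RPO is the data window between the last good backup and the outage. The sample timestamps below are invented.

```python
from datetime import datetime, timedelta

def recovery_metrics(outage_start, service_restored, last_good_backup):
    # Achieved RTO: how long the service was actually down.
    # Achieved RPO: how much data (time-wise) could have been lost,
    # i.e., everything written after the last good backup.
    rto = service_restored - outage_start
    rpo = outage_start - last_good_backup
    return rto, rpo

rto, rpo = recovery_metrics(
    outage_start=datetime(2024, 11, 16, 10, 0),
    service_restored=datetime(2024, 11, 16, 13, 30),
    last_good_backup=datetime(2024, 11, 16, 9, 0),
)
# rto: 3h30m of downtime; rpo: up to 1h of lost writes
```

A DR test passes when these achieved values come in under the RTO/RPO targets the business has committed to.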
  26. Dimension: Focus. Disaster Recovery Testing: ensures data integrity, service continuity, and minimal downtime during recovery. Chaos Engineering: tests the system's ability to self-heal, maintain service levels, and minimize the blast radius of issues.
  27. Dimension: Example. Disaster Recovery Testing: testing if a backup system can restore data after a simulated data center failure. Chaos Engineering: simulating server outages to see if a distributed system can balance load and maintain performance.