Chaos + Observability + Resilience = Chaos Engineering

Yury Nino
November 26, 2020

Transcript

  1. Chaos + Observability + Resilience = Chaos Engineering

    Speaker: Yury Niño Roa. I am part of the IT community!
  2. Yury Niño Roa

    Site Reliability Engineer & Chaos Engineer Advocate. Loves software applications, automating, reading, writing and teaching about technology. @yurynino https://www.yurynino.dev/
  3. AGENDA

    Chaos Foundations: Disasters & Chaos. Reliability & Resilience: Resilience Patterns. Observability: Instrumentation & Telemetry. Chaos Engineering: Joining points!
  4. US FLIGHT 1549 DITCHED

    US Airways Flight 1549 ditched into the Hudson River, New York, on January 15th, 2009. This emergency ditching and evacuation, with the loss of no lives, was a heroic and unique aviation achievement.
  5. RESILIENCE IN THE CHAOS

    Was the successful Hudson River ditching a miracle, a heroic achievement, or more simply an expression of some fundamentally engineered response capabilities, namely its resilience?
  6. IS THE CHAOS SO BAD?

    “It turns out that an eerie type of chaos can lurk just behind a facade of order – and yet, deep inside the chaos lurks an even eerier type of order.” Douglas Hofstadter. Nature is both Ordered and Chaotic!
  7. IS THE CHAOS SO BAD?

    The World is Chaotic! Black swans take our systems down and keep them down for a long time. Laura Nolan, SRE at Slack
  8. SOFTWARE IS BECOMING CHAOTIC

    The infrastructure required by a software system can be as complex as the software itself. Every production failure is unique. No two incidents will share the precise chain of failure! (Figures: Netflix Architecture, Twitter Architecture.)
  9. CHAOS THEORY

    Artificial intelligence, big data, modern science, and the internet are all revealing a fundamental truth: the world is chaotic and unpredictable! Entropy is, in general, a measure of "disorder". Chaos Theory is about finding underlying patterns in systems that appear to be disordered. Even small changes made by humanity within ecosystems can result in huge and unexpected effects over time.
  10. CHAOS & OBSERVABILITY

    Incident analysis is a catalyst that helps you understand more about past chaos. Learning from chaos requires observability and actionable insights. Observability plays a key role in the middle of chaos! When the crew is confronted with an emergency or abnormal ... they are not flying; they are monitoring to make decisions to fly again!
  11. OBSERVABILITY

    Observability is the old wine of monitoring in a new bottle! It is not just about Logs, Metrics and Traces. If you do not have dashboards and analytical insights that show operations and business metrics separately, your ability to observe the chaos in your services is hampered.
  12. KEYS FOR OBSERVABILITY: 7 ATTRIBUTES

    (1) Service Health, (2) Transactions, (3) IT Resources, (4) Relevance, (5) Telemetry, (6) Actions, (7) Insights
  13. KEYS FOR OBSERVABILITY

    These 7 attributes reflect not only the maturity of your experiments but also the maturity of your ‘observability’. As you expand the practice of incident management into your teams, these attributes prominently start defining the observability of your business.
  14. SOURCES OF OBSERVABLE: 7 ATTRIBUTES

    (1) Traffic, (2) Latency, (3) Errors, (4) Saturation, (5) Logging, (6) Traces, (7) Metrics
  15. CHAOS & RESILIENCE

    Resilience is not about reducing negatives or errors. Resilience engineering is useful for identifying and then enhancing the positive capabilities of people in organizations that allow them to adapt effectively and safely under varying circumstances.
  16. Responding (ACTUAL CAPABILITY: Detecting)

    The system must first detect that something has happened, then recognise the event. Responding means knowing what to do, or being able to respond to regular and irregular variability.
  17. Learning (FACTUAL CAPABILITY: Knowing)

    Future performance can only be improved if something is learned from past performance. Learning means knowing what has happened and learning from experience, in particular learning the right lessons.
  18. Risks (POTENTIAL CAPABILITY: Anticipating)

    Risk assessment focuses on future threats and is suitable for systems where the principles of functioning are known. Anticipating means knowing what to expect, or being able to anticipate developments, threats, and opportunities.
  19. Monitoring (CRITICAL CAPABILITY: Looking for)

    Monitoring enables the system to address possible near-term threats before they become reality. Monitoring means knowing what to look for, or being able to monitor that which changes, or may change.
  20. CHAOS & RESILIENCE

    In Chaos Engineering, resilience plays a key role because our mission is to provide resilience. Validating hypotheses, defining steady-state behaviour, simulating real-world events, and optimizing the blast radius are all stages of your experiments where observability plays a key role.
  21. CHAOS ENGINEERING

    Chaos Engineering is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience capability. Chaos Engineering introduces real-time failures into systems to assess their ability to tolerate failures, along with their recoverability, resiliency and high availability.
  22. CHAOS PRINCIPLES

    Hypothesize about Steady State. Vary Real-world Events. Run Experiments. Automate the method.
  23. CHAOS TOOLS

    Gremlin, Chaos Monkey & Simian Army, Chaos Monkey for Spring Boot, Chaos Toolkit, ChaosMesh
  24. CHAOS ENGINEERING

    “I wonder, why are humans always trying to put the physical things in some sort of order when the disorder is more beautiful and inspiring.” Peggy Laffan
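
The chaos principles on slide 22 and the observable sources on slide 14 can be tied together in a small sketch. What follows is not taken from the talk. It is a minimal simulation, assuming a hypothetical service whose requests are generated in memory, with made-up thresholds (1% error rate, 300 ms p95 latency) standing in for a steady-state hypothesis. The names `Request`, `Signals`, `measure`, `steady_state`, `simulate_traffic` and `run_experiment` are all illustrative, and saturation is left out because the toy service has no resources to exhaust.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    ok: bool


@dataclass
class Signals:
    traffic: int           # number of requests observed
    error_rate: float      # fraction of failed requests
    p95_latency_ms: float  # 95th percentile latency (nearest-rank method)


def measure(requests: List[Request]) -> Signals:
    """Summarise raw request records into traffic, errors and latency."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(latencies)))
    failed = sum(1 for r in requests if not r.ok)
    return Signals(len(requests), failed / len(requests), latencies[rank - 1])


def steady_state(s: Signals, max_error_rate: float = 0.01,
                 max_p95_ms: float = 300.0) -> bool:
    """Hypothesis (assumed thresholds): errors and p95 latency stay bounded."""
    return s.error_rate <= max_error_rate and s.p95_latency_ms <= max_p95_ms


def simulate_traffic(n: int, extra_latency_ms: float, blast_radius: float,
                     timeout_ms: float = 1000.0) -> List[Request]:
    """Hypothetical stand-in for real telemetry.

    Only the first `blast_radius` fraction of requests receives the
    injected latency, which models a limited blast radius.
    """
    affected = round(n * blast_radius)
    requests = []
    for i in range(n):
        latency = 50.0 + (i % 100)  # deterministic 50..149 ms baseline
        if i < affected:
            latency += extra_latency_ms  # the injected "real-world event"
        requests.append(Request(latency, latency < timeout_ms))
    return requests


def run_experiment(extra_latency_ms: float, blast_radius: float,
                   n: int = 1000) -> dict:
    """Check the baseline, inject the fault, then re-check the hypothesis."""
    baseline = measure(simulate_traffic(n, 0.0, 0.0))
    if not steady_state(baseline):
        # Never start an experiment on a system that is already unhealthy.
        return {"started": False, "hypothesis_held": False, "signals": baseline}
    observed = measure(simulate_traffic(n, extra_latency_ms, blast_radius))
    return {"started": True, "hypothesis_held": steady_state(observed),
            "signals": observed}


if __name__ == "__main__":
    # "Automate the method": run the same experiment over growing blast radii.
    for radius in (0.01, 0.5):
        result = run_experiment(extra_latency_ms=500.0, blast_radius=radius)
        print(radius, result["hypothesis_held"], result["signals"])
```

With these invented numbers, adding 500 ms to 1% of requests leaves the p95 latency inside the threshold, while adding it to 50% of requests breaks the hypothesis. A real experiment would read the signals from production telemetry and use one of the tools listed on slide 23 to inject the fault.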