
Chaos Engineering: Injecting Failures for Building Resilience in Systems

In a distributed world, we all depend on distributed systems more than ever. As these systems become more complex, their failures become much harder to predict. Chaos Engineering introduces failure injection as a discipline for building confidence in the resilience of these systems.

Yury Nino

May 18, 2019

Transcript

  1. Nice to meet you: Yury Niño, Software Engineer and Chaos Engineer Advocate. Loves building software applications, solving resilience issues and teaching. Passionate about reading, writing and cycling.
  2. Agenda • Resilience vs Reliability • Why does the world need Resilience and Reliability? • Chaos Engineering • Principles of Chaos • Chaos in Practice • Game Days
  3. A recognition for ... This talk is dedicated to the well-caffeinated #SystemAdministrators who get woken up in the middle of the night when “things go bump”. #EngineeringTeam #DigitalFactory @jnhernandz @
  4. A resilient system can maintain an acceptable level of service in the face of failure. A resilient system can weather the storm, such as a large-scale natural disaster or a controlled chaos engineering experiment. Tammy Bütow, Principal SRE at Gremlin
  5. A distributed system in production needs to be resilient in order to be reliable, and this is precisely the target that we Software Engineers, Systems Engineers, Site Reliability Engineers and Chaos Engineers always aim for. Mine :)
  6. Because ... We are surrounded by distributed systems. When we read the news on our cellphones, send an email or buy our lunch ... We do not tolerate their failure!
  7. February 28th, 2017 will be remembered • Simple Storage Service (S3) went down in US-EAST. • The outage lasted about 4 hours. • More than 100,000 websites across the world were impacted.
  8. The World is Chaotic! • Distributed systems contain moving parts. • Many things can go wrong. ◦ Hard disks can fail. ◦ The network can go down. ◦ Customer traffic can overload the system.
  9. Chaos Engineering: It is the discipline of experimenting in production on a distributed system in order to reveal its weaknesses and to build confidence in its resilience. https://principlesofchaos.org/
  10. Chaos Engineering: It is deliberately inducing stress or faults into software and/or hardware as a way of learning or verifying things about systems. https://www.gremlin.com
  11. Chaos Engineering is about • Simulating the failure of a datacenter. • Injecting latency between services. • Randomly causing exceptions. • Changing the system time (time travel). • Emulating I/O errors. http://principlesofchaos.org/
  12. History of Chaos Engineering • 2008: Chaos Engineering began at Netflix. • 2010: Chaos Monkey was launched. • 2014: The role of Chaos Engineer was created. • 2018: A lot of resources for Chaos Engineering. Kolton Andrus
  13. 4. Run the Experiment • Application Name: Finer • Observability: DataDog • Hypothesis: Circuit Breaker works • Environment: My Home • Duration: 5 - 10 seconds • Load: 1 request • Results • Actions
  14. 4. Run the Experiment • Application Name: Finer • Observability: DataDog • Hypothesis: Circuit Breaker works. Latencies > 5 seconds between dashboard_api and smart_api should open the circuit. • Environment: My Home • Duration: 20 milliseconds • Load: 1 request • Results: Issue #4356 • Actions: Configure the proper Hystrix parameters according to the results. Implement a fallback.
  15. Game Days can Transform our Teams. Even though Game Days are not real incidents, they help Engineers gain confidence.
  16. Since we Engineers experience failure as part of our job, we should start designing for failure. Me :) “The best time to learn about fire is when you’re on fire.” Jen Hammond, New Relic engineering manager
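
The "Run the Experiment" slides state a hypothesis (latency between dashboard_api and smart_api should open the circuit) and mention implementing a fallback. The following is a minimal Python sketch of that kind of experiment, not the setup used in the talk. `CircuitBreaker`, `inject_latency` and `run_experiment` are hypothetical names invented for illustration and are not the Hystrix API; `smart_api` is a local stub; and the 5 second threshold is scaled down to 0.05 seconds so the sketch runs quickly. This simplified breaker only measures latency after the call returns (it does not interrupt slow calls) and has no half-open recovery state.

```python
import time


class CircuitBreaker:
    """Toy breaker: opens after `max_failures` consecutive slow or failed calls."""

    def __init__(self, timeout_s, max_failures):
        self.timeout_s = timeout_s
        self.max_failures = max_failures
        self.failures = 0
        self.is_open = False

    def call(self, func, fallback):
        # Once open, skip the dependency entirely and answer from the fallback.
        if self.is_open:
            return fallback()
        start = time.monotonic()
        try:
            result = func()
        except Exception:
            self._record_failure()
            return fallback()
        # Assumption: a call slower than the timeout counts as a failure.
        if time.monotonic() - start > self.timeout_s:
            self._record_failure()
            return fallback()
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.is_open = True


def inject_latency(func, delay_s):
    """Chaos wrapper: delays every call to `func` by `delay_s` seconds."""
    def wrapped():
        time.sleep(delay_s)
        return func()
    return wrapped


def run_experiment(delay_s=0.1, timeout_s=0.05, max_failures=3):
    """Inject latency into a stubbed smart_api and observe the breaker."""
    calls = {"count": 0}

    def smart_api():  # stand-in for the real dependency
        calls["count"] += 1
        return "live data"

    def fallback():
        return "cached data"

    breaker = CircuitBreaker(timeout_s, max_failures)
    slow_smart_api = inject_latency(smart_api, delay_s)
    # One call more than the failure threshold, to see the open circuit in action.
    responses = [breaker.call(slow_smart_api, fallback)
                 for _ in range(max_failures + 1)]
    return {
        "circuit_open": breaker.is_open,
        "responses": responses,
        "dependency_calls": calls["count"],
    }


if __name__ == "__main__":
    print(run_experiment())
```

In this sketch the hypothesis holds if `circuit_open` is true after the injected latency and the last request is served by the fallback without reaching the stubbed dependency. Running it with `delay_s=0.0` acts as a control: the circuit should stay closed.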