Chaos Engineering: Resilience Strategies in Cloud Architectures.

Yury Niño Roa

August 29, 2019

Transcript

  1. Nice to meet you. YURY NIÑO: DevOps Engineer and Chaos Engineer Advocate. Loves building software applications, solving resilience issues and teaching. Passionate about reading, writing and cycling.
  2. Agenda
     • Nature Is Ordered And Chaotic.
     • Humanity and Nature.
     • The systems are built by Humans!
     • Cloud Architectures.
     • Cloud Patterns.
     • Chaos Engineering.
     • Demo.
  3. Nature is both ordered and chaotic! “It turns out that an eerie type of chaos can lurk just behind a facade of order – and yet, deep inside the chaos lurks an even eerier type of order.” Douglas Hofstadter
  4. Chaos Theory is about finding underlying patterns in systems that appear to be disordered. Even small changes by humanity within ecosystems can result in huge and unexpected effects over time.
  5. Chaos Theory has made a valuable contribution to naturalists and now to software systems. “When a butterfly flutters its wings in one part of the world, it can eventually cause a hurricane in another.” Edward Norton Lorenz
  6. “I wonder, why are humans always trying to put the physical things in some sort of order when the disorder is more beautiful and inspiring.” Peggy Laffan
  7. Software Systems are chaotic!
     • Unpredictable events are bound to happen.
     • Distributed systems contain moving parts.
     • Many things can go wrong.
       ◦ Hard disks can fail.
       ◦ The network can go down.
       ◦ A surge in customer traffic can overload the system.
  8. “Screws fall out all the time; the world is an imperfect place. We talk a lot about building resilient systems, but all systems are (at least for now) built by humans.” Vicky Brasseur
     “The Software Systems are built by Humans! But we can control them.” Mine :)
  9. Challenges in Cloud
     • Availability
     • Data Management
     • Design & Implementation
     • Messaging
     • Management & Monitoring
     • Performance & Scalability
     • Security
     • Resilience
  17. “Performance, resilience, and power consumption are interdependent key system design factors. An increase in resilience (e.g., through redundancy) can result in higher performance and in higher power consumption (as more hardware is used).” Saurabh Hukerikar, Christian Engelmann
  18. “A design pattern describes a generalizable solution to a recurring problem that occurs within a well-defined context.” Saurabh Hukerikar, Christian Engelmann
  19. Resilience Patterns
     • Circuit Breaker.
     • Bulkhead.
     • Health Endpoint.
     • Leader Election.
     • Retry.
     • Elastic Load Balancer.
     Image: Dulle Griet by Pieter Bruegel
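
To make two of the patterns on this slide concrete, here is a minimal Python sketch of a Circuit Breaker combined with Retry. It is not taken from the talk: the class name `CircuitBreaker`, the helper `retry`, and the thresholds and delays are all illustrative assumptions, and a production system would more likely rely on an established library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: half-open state, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has isolated
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would typically wrap the protected call, for example `retry(lambda: breaker.call(fetch_profile))`, where `fetch_profile` stands in for any remote call. Retry absorbs brief glitches, while the breaker stops traffic to a dependency that keeps failing.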
  20. Chaos Engineering: the discipline of experimenting in production on a distributed system in order to reveal its weaknesses and to build confidence in its resilience capability. https://principlesofchaos.org/
  21. Chaos Engineering: deliberately inducing stress or fault into software and/or hardware as a way of learning/verifying things about systems. https://www.gremlin.com
  22. Chaos Recipes
     Attack: CPU / Memory / Disk.
     Scope: Single instance.
     Expected Results:
     • Rate of good responses goes down.
     • Errors increase at all layers.
     • Alerts fire.
     • Load balancer routes traffic away.
     www.gremlin.com
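
The last expected result of this recipe, the load balancer routing traffic away from the attacked instance, can be rehearsed with a small in-memory simulation. The sketch below is illustrative only: `Instance`, `LoadBalancer`, `run_experiment`, and the 0.9 load threshold are assumptions made for this example, not Gremlin's API or anything shown in the deck.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Instance:
    name: str
    cpu_load: float = 0.2  # fraction of CPU in use, 0.0 to 1.0

    def healthy(self, max_load=0.9):
        """Health-endpoint stand-in: report unhealthy when CPU is saturated."""
        return self.cpu_load < max_load


class LoadBalancer:
    def __init__(self, instances):
        self.instances = instances
        self._ring = cycle(instances)

    def route(self):
        """Round-robin over instances, skipping any that fail the health check."""
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate.healthy():
                return candidate
        raise RuntimeError("no healthy instances")


def run_experiment(balancer, victim, requests=100):
    """Inject a CPU attack on one instance and count where traffic lands."""
    victim.cpu_load = 1.0  # the injected fault
    hits = {}
    for _ in range(requests):
        target = balancer.route()
        hits[target.name] = hits.get(target.name, 0) + 1
    return hits
```

The hypothesis under test is that the victim receives no requests while the rest of the fleet absorbs them. If the victim still shows up in `hits`, the health check or the routing logic needs attention.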
  23. Chaos Recipes
     Attack: DNS blackhole.
     Scope: Single instance.
     Expected Results:
     • Inbound traffic may drop.
     • Traffic to external systems may fail.
     • Startup may not complete successfully.
     www.gremlin.com
  24. Chaos Recipes
     Attack: Network Blackhole / Latency.
     Scope: Single instance.
     Expected Results:
     • Traffic to dependency goes to 0.
     • Startup completes without errors.
     • Timeouts and concurrency limits.
     • Dependency alerts.
     www.gremlin.com
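
One way to check the "timeouts and concurrency limits" expectation before running a real attack is to wrap a dependency call with artificial latency and confirm that the caller gives up and falls back. This is a rough sketch under stated assumptions: `with_latency`, `GuardedClient`, and the default values are hypothetical names chosen for illustration, and the latency is injected in-process rather than at the network layer.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Wrap func so every call is delayed, imitating a latency attack."""
    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slowed


class GuardedClient:
    """Calls a dependency with a timeout and a cap on in-flight calls."""

    def __init__(self, dependency, timeout_seconds=0.5, max_concurrent=4, fallback=None):
        self.dependency = dependency
        self.timeout_seconds = timeout_seconds
        self.fallback = fallback
        # The pool size doubles as the concurrency limit. Extra calls queue
        # behind it, and time spent queued counts against the timeout.
        self.pool = ThreadPoolExecutor(max_workers=max_concurrent)

    def call(self, *args):
        future = self.pool.submit(self.dependency, *args)
        try:
            return future.result(timeout=self.timeout_seconds)
        except FutureTimeout:
            # The worker thread still finishes in the background;
            # the caller simply stops waiting and degrades gracefully.
            return self.fallback
```

With a healthy dependency the client returns the real value. Once the dependency is wrapped in `with_latency` with a delay longer than the timeout, the client returns the fallback instead of hanging, which is the behaviour the recipe expects.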
  25. Possibly the most useful trait in life is resilience, and you build resilience through experiencing difficulty and challenges. You cannot control the environment! You can control your systems. Me :)