Chaos Engineering: The Discipline based on Failure and Resilience

Today most of our activities depend on distributed systems that are complex, changing, and dynamic. As those systems grow, the likelihood of failure becomes harder to predict and mitigate. Chaos Engineering introduces a discipline based on the scientific method which, through the controlled injection of errors, seeks to identify how systems behave in failure scenarios and to develop resilience strategies that build reliability.
The talk begins by defining Chaos Engineering, its principles, and its phases, then analyzes its impact in areas such as medicine, transportation, and finance. Midway through, it includes a demo with a practical Chaos Engineering example, and it closes by presenting a methodology for running chaos days in an engineering team.
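The abstract frames Chaos Engineering as the scientific method applied to systems: measure a steady state, hypothesize that it survives a fault, inject the fault in a controlled way, and try to refute the hypothesis. The Python sketch below illustrates that loop under heavy assumptions: `Service`, `kill_replica`, `run_experiment`, and the 99% per-replica success probability are hypothetical stand-ins for a real system and a real fault-injection tool, not anything taken from the talk.

```python
import contextlib
import random


class Service:
    """Hypothetical service: a request succeeds if any healthy replica answers."""

    def __init__(self, replicas: int, replica_success: float = 0.99):
        self.up = set(range(replicas))
        self.replica_success = replica_success

    def handle(self, rng: random.Random) -> bool:
        # Try each healthy replica in turn (a naive retry); fail if none answers.
        return any(rng.random() < self.replica_success for _ in self.up)


@contextlib.contextmanager
def kill_replica(service: Service, replica: int):
    """Controlled fault: take one replica down, then restore it afterwards."""
    was_up = replica in service.up
    service.up.discard(replica)
    try:
        yield
    finally:
        if was_up:
            service.up.add(replica)


def success_rate(service: Service, requests: int, rng: random.Random) -> float:
    return sum(service.handle(rng) for _ in range(requests)) / requests


def run_experiment(service, fault, min_success_rate, requests=1000, seed=42):
    rng = random.Random(seed)
    baseline = success_rate(service, requests, rng)  # 1. measure the steady state
    with fault:  # 2. inject the fault
        observed = success_rate(service, requests, rng)  # 3. observe under failure
    # 4. try to refute the hypothesis "success rate stays >= min_success_rate"
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": observed >= min_success_rate,
    }


svc = Service(replicas=2)
print(run_experiment(svc, kill_replica(svc, 0), min_success_rate=0.95))
```

With two replicas, losing one should leave the success rate near 99%, so the hypothesis is expected to hold; with a single replica the same fault drives the observed rate to zero and refutes it. The fault is a context manager so the system is always restored, which mirrors the "controlled" part of controlled error injection.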

Yury Nino

June 18, 2019

Transcript

  1. #PurpleHeart @GEEKGIRLSLATAM #WomenSupportEachOther The heart of technology: Inspiration, Empowerment, and Connection of girls, young women, and women as agents of change through the use, appropriation, and CREATION of technology. Work that starts from BEING in order to DO.
  2. YURY NIÑO Solutions Architect and Chaos Engineering Advocate. Loves building software applications, solving resilience issues, and teaching. Passionate about reading, writing, and cycling. Nice to meet you
  3. Agenda • What does it mean to fail? • Black swans • Resilience! • Chaos Engineering • Chaos Recipes • How to get started with Chaos Engineering?
  4. Black swans are events that come as a surprise, have a major effect, and are often inappropriately rationalized after the fact. The term is based on an ancient saying that presumed black swans did not exist.
  5. Black swans take our systems down and keep them down for a long time. • Hitting limits • Spreading slowness • Cyberattacks • Dependency problems. Laura Nolan, SRE at Slack
  6. The World is Chaotic! • Distributed systems contain moving parts. • Many things can go wrong. ◦ Hard disks can fail. ◦ The network can go down. ◦ Client traffic can overload the system. Dulle Griet by Pieter Bruegel
  7. You cannot control the environment! You can control your systems. Me :) Let go of plans gone wrong. Things have a way of working out in the end. Embrace the Chaos. Bob Miglani.
  8. A resilient system can maintain an acceptable level of service in the face of failure. A resilient system can weather the storm, whether it is a large-scale natural disaster or a controlled chaos engineering experiment. Tammy Bütow, Principal SRE at Gremlin
  9. Because ... We are surrounded by distributed systems. When we read the news on our cellphones, send an email, or buy our lunch ... We need them to always work!
  10. Chaos Engineering is the discipline of experimenting on a distributed system in production in order to reveal its weaknesses and to build confidence in its resilience capability. https://principlesofchaos.org/
  11. Chaos Engineering is deliberately inducing stress or faults into software and/or hardware as a way of learning about or verifying things about systems. https://www.gremlin.com
  12. Infrastructure Attacks • CPU: Generates high load on CPU cores. • Memory: Allocates a specific amount of RAM. • Disk: Writes files to disk to fill it to a specific percentage. www.gremlin.com
  13. Network Attacks • Latency: Injects latency into network traffic. • Packet loss: Induces packet loss in network traffic. • DNS: Blocks access to DNS servers. www.gremlin.com
  14. State Attacks • Shutdown: Reboots the host operating system. • Time travel: Changes the host's system time. • Process killer: Kills a specific process to simulate an application or dependency crash. www.gremlin.com
  15. Chaos Recipes. Attack: CPU / Memory / Disk. Scope: Single instance. Expected Results: • Rate of good responses goes down. • Errors increase at all layers. • Alerts fire. • Load balancer routes traffic away. www.gremlin.com
  16. Chaos Recipes. Attack: DNS blackhole. Scope: Single instance. Expected Results: • Inbound traffic may drop. • Traffic to external systems may fail. • Startup may not complete successfully. www.gremlin.com
  17. Chaos Recipes. Attack: Network Blackhole / Latency. Scope: Single instance. Expected Results: • Traffic to the dependency drops to 0. • Startup completes without errors. • Timeouts and concurrency limits are triggered. • Dependency alerts fire. www.gremlin.com
  18. Chaos Engineering can transform our teams. Even though the experiments are not real, they help engineers gain confidence.
  19. “I just do things.” … Joker. Add some chaos to your life. A certain amount is healthy. It stimulates growth and change and passion and excitement. Mark Manson
  20. The purple heart stands for creativity, collaboration, talent, and sorority (women supporting other women), and it flows to bring together actors who promote opportunities for technical learning, leadership, and greater participation of women in the technology industry, helping to close gender gaps and to consolidate policies on gender, diversity, inclusion, and equity. #PurpleHeart
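Slide 8 defines a resilient system as one that keeps an acceptable level of service in the face of failure. One common way to get there is graceful degradation: serve a recent stale answer instead of an error while a dependency is down. The sketch below assumes that pattern; the slide does not prescribe it, and `FallbackCache` and `max_age_s` are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class FallbackCache:
    """Hypothetical wrapper: serve the last good value when the dependency fails."""

    def __init__(self, fetch: Callable[[str], str], max_age_s: float):
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._last: Dict[str, Tuple[str, float]] = {}  # key -> (value, fetched_at)

    def get(self, key: str) -> Tuple[str, str]:
        try:
            value = self._fetch(key)
        except ConnectionError:
            cached = self._last.get(key)
            # Degrade gracefully: a recent stale answer beats an error page.
            if cached is not None and time.monotonic() - cached[1] <= self._max_age_s:
                return cached[0], "degraded"
            raise  # nothing usable cached: surface the failure
        self._last[key] = (value, time.monotonic())
        return value, "live"
```

The second element of the returned tuple makes degraded answers visible, so a chaos experiment can measure how often the system falls back rather than only whether it stayed up.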
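Slide 12 lists three infrastructure attacks: CPU load, memory allocation, and filling a disk to a percentage. The sketch below gives toy, in-process approximations of each, assuming a single Python process is an acceptable stand-in; it is not how Gremlin implements its attacks, and `cpu_burn`, `allocate_memory`, and `bytes_to_fill` are hypothetical helpers. The disk helper only computes how much would need to be written, so running it is harmless.

```python
import shutil
import time


def cpu_burn(seconds: float) -> int:
    """CPU attack (one core): busy-loop until `seconds` elapse; return loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def allocate_memory(megabytes: int) -> bytearray:
    """Memory attack: allocate a zero-filled buffer held while it is referenced."""
    return bytearray(megabytes * 1024 * 1024)


def bytes_to_fill(path: str, target_percent: float) -> int:
    """Disk attack planning: bytes to write under `path` to reach target_percent usage."""
    usage = shutil.disk_usage(path)
    target_used = usage.total * target_percent / 100
    return max(0, int(target_used - usage.used))
```

`cpu_burn` loads only one core; loading every core would mean running one burner per core, for example in separate `multiprocessing` processes. Releasing the buffer returned by `allocate_memory` ends the memory attack.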
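Slide 13 lists latency and packet-loss attacks on network traffic. Host-level tools act on the network stack itself (on Linux, `tc` with `netem` is one common way to add delay and loss); the sketch below only simulates the effect inside one process by wrapping a call. `with_network_faults` and `PacketLost` are hypothetical names.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class PacketLost(Exception):
    """Raised when the simulated network drops a request."""


def with_network_faults(
    call: Callable[..., T],
    latency_s: float = 0.0,
    loss_rate: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Callable[..., T]:
    """Wrap `call` so every invocation is delayed and may be dropped."""
    rng = rng or random.Random()
    name = getattr(call, "__name__", "call")

    def wrapped(*args, **kwargs) -> T:
        time.sleep(latency_s)  # latency attack: add a fixed delay
        if rng.random() < loss_rate:  # packet-loss attack: drop a fraction of calls
            raise PacketLost(f"dropped call to {name}")
        return call(*args, **kwargs)

    return wrapped
```

Passing a seeded `random.Random` makes the drop pattern reproducible, which helps when the same experiment has to be rerun.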
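Slide 14 lists state attacks. The sketch below covers two of them in a deliberately gentle form and leaves out shutdown: the process killer starts a hypothetical child process and kills it abruptly, and time travel is approximated by injecting a skewed clock into application code instead of changing the host clock, which would need privileges and affect every process. `start_victim`, `kill_process`, and `SkewedClock` are hypothetical names.

```python
import subprocess
import sys
from datetime import datetime, timedelta, timezone


def start_victim() -> subprocess.Popen:
    """Start a hypothetical long-running child process to act as the target."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_process(proc: subprocess.Popen) -> int:
    """Process-killer attack: terminate abruptly (SIGKILL on POSIX) and reap it."""
    proc.kill()
    return proc.wait(timeout=5)


class SkewedClock:
    """Time-travel stand-in: shifts 'now' for code that reads time through it."""

    def __init__(self, offset: timedelta):
        self.offset = offset

    def now(self) -> datetime:
        return datetime.now(timezone.utc) + self.offset
```

On POSIX the killed child's return code is the negative signal number, so a supervisor under test should notice a non-zero exit and restart it; code that reads time through `SkewedClock` can be checked for expired tokens or misfiring schedules.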
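Slide 15 expects that, under a CPU, memory, or disk attack on a single instance, errors rise and the load balancer routes traffic away from it. The sketch below models that expected result with a hypothetical round-robin balancer that skips instances whose observed error rate exceeds a threshold; real load balancers usually rely on health-check endpoints, so treat `Instance`, `Balancer`, and `max_error_rate` as assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    requests: int = 0
    errors: int = 0

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


class Balancer:
    """Hypothetical round-robin balancer that skips instances failing too often."""

    def __init__(self, instances: List[Instance], max_error_rate: float = 0.5):
        self.instances = instances
        self.max_error_rate = max_error_rate
        self._next = 0

    def healthy(self) -> List[Instance]:
        return [i for i in self.instances if i.error_rate() <= self.max_error_rate]

    def pick(self) -> Instance:
        pool = self.healthy()
        if not pool:
            raise RuntimeError("no healthy instances")  # the point where alerts fire
        chosen = pool[self._next % len(pool)]
        self._next += 1
        return chosen


web1 = Instance("web-1", requests=10, errors=8)  # instance under the CPU attack
web2 = Instance("web-2", requests=10, errors=0)
lb = Balancer([web1, web2])
print([lb.pick().name for _ in range(3)])  # prints ['web-2', 'web-2', 'web-2']
```

If the experiment shows traffic still reaching the attacked instance, the recipe's hypothesis is refuted and the health-check thresholds are the first thing to revisit.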
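Slide 16 warns that a DNS blackhole may keep startup from completing. The sketch below simulates a blackhole in-process by patching `socket.getaddrinfo` with `unittest.mock.patch`, then runs a hypothetical startup check that resolves each dependency; a real attack would block DNS traffic on the host instead. The hostnames and the `startup_check` and `dns_blackhole` helpers are made up for illustration.

```python
import socket
from typing import List, Tuple
from unittest import mock


def startup_check(dependencies: List[Tuple[str, int]]) -> List[str]:
    """Resolve each dependency at startup; return the hosts that failed to resolve."""
    unresolved = []
    for host, port in dependencies:
        try:
            socket.getaddrinfo(host, port)
        except socket.gaierror:
            unresolved.append(host)
    return unresolved


def dns_blackhole():
    """Simulated DNS blackhole: every lookup in this process fails while active."""
    return mock.patch(
        "socket.getaddrinfo",
        side_effect=socket.gaierror(socket.EAI_NONAME, "DNS blackhole (simulated)"),
    )


with dns_blackhole():
    missing = startup_check([("payments.internal", 443), ("db.internal", 5432)])
    # prints: startup blocked by: ['payments.internal', 'db.internal']
    print("startup complete" if not missing else f"startup blocked by: {missing}")
```

Reporting which hosts failed, rather than crashing on the first lookup, makes it easier to confirm the recipe's expected result and to decide which dependencies should be optional at startup.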
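Slide 17 expects timeouts and concurrency limits to be triggered when a dependency is blackholed or slow. The sketch below shows one way a client could enforce both, assuming a thread pool is acceptable: each call has a timeout, and a bounded semaphore caps calls in flight. `DependencyClient`, `timeout_s`, and `max_in_flight` are hypothetical names, not an API from the talk.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable


class DependencyClient:
    """Hypothetical client guarding a dependency with a timeout and an in-flight cap."""

    def __init__(self, call: Callable[[], Any], timeout_s: float, max_in_flight: int):
        self._call = call
        self._timeout_s = timeout_s
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)

    def request(self) -> Any:
        # Concurrency limit: shed load instead of queueing behind hung calls.
        if not self._slots.acquire(blocking=False):
            return "rejected"
        try:
            future = self._pool.submit(self._call)
        except RuntimeError:  # e.g. the pool was already shut down
            self._slots.release()
            raise
        # Free the slot only when the call really finishes, so calls stuck in a
        # blackholed dependency keep counting against the limit.
        future.add_done_callback(lambda _f: self._slots.release())
        try:
            return future.result(timeout=self._timeout_s)  # timeout: stop waiting
        except FutureTimeout:
            return "timeout"
```

Against a dependency that never answers in time, the first `max_in_flight` requests time out and later ones are rejected immediately while the hung calls still hold their slots, which is the fail-fast behavior the recipe expects.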