Why Software Developers should build a Resilience Culture based on Chaos Engineering

Yury Nino
November 17, 2021

Transcript

  1. Everybody who implements software knows that our systems must be resilient, but what about us, the humans? Should we be resilient too?
  2. If, for SLAs, there is no such thing as 100% uptime, why should humans be available all the time?
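To make the "no 100% uptime" point concrete, the downtime an availability target allows can be worked out with simple arithmetic. The following is a minimal sketch, not something taken from the deck: the function name, the 30-day period, and the sample targets are all assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, implied by an availability target.

    Both the function name and the default 30-day period are illustrative
    assumptions; an actual SLA defines its own measurement window.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Sample targets chosen for illustration only.
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Even a 99.9% target leaves a budget of roughly 43 minutes per 30 days, which is the kind of slack the slide suggests humans should be allowed as well.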
  3. AGENDA
    * Applications must be resilient
    * How to probe those patterns?
    * Using Chaos Engineering
    * Building software is complex
    * What about humans?
    * Should they be resilient?
    * How? With Chaos Game Days
  4. DESIGN PRINCIPLES
    * Design for Least Privilege
    * Design for Understandability
    * Design for Changing Landscape
    * Design for Resilience
    * Design for Recovery
  5. CODING PRINCIPLES
    * Programming Language Choice
    * Complexity vs Understandability
    * Securing Third-Party Software
    * Testing Code
    * Identifying Weaknesses
    * Implementing Patterns for Resilience
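The slide does not say which resilience patterns it has in mind. As one common illustration, a retry with exponential backoff can be sketched as below. The function `call_with_retry`, its parameters, and its defaults are hypothetical names invented for this example and do not come from the deck or from any particular framework.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with exponential backoff.

    Hypothetical helper for illustration. The delay doubles after each
    failed attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised so callers still see the original error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff can be tested without real waiting. In practice a maintained library is usually a better choice than hand-rolled retry logic.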
  6. Code will inevitably include bugs, but we can avoid many of them by using frameworks hardened for resilience.
  7. When we write software, we mentally try to execute the code in order to understand what is happening. That process is called TRACING. The part of the brain used for tracing is called the WORKING MEMORY.
  8. Confusion while coding can be caused by:
    * A lack of knowledge
    * A lack of easy-to-access information
    * A lack of processing power in the brain
    Mental models are mental representations that we form while thinking of problems. People can hold multiple mental models that can compete with each other.
  9. CHAOS ENGINEERING
    It is the discipline of experimenting with failures in production in order to reveal systems' weaknesses and to build confidence in their resilience capability. https://principlesofchaos.org/
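The shape of such an experiment (check a steady state, inject a failure, check again, roll back) can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions, not a production tool: `ToyService`, `run_experiment`, and `ExperimentResult` are hypothetical names invented here, and the "failure" only decrements a counter.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool   # was the system healthy before injecting anything?
    injected: bool        # did we actually inject the failure?
    steady_during: bool   # did the steady state hold while the failure was active?


class ToyService:
    """Hypothetical stand-in for a replicated service (illustration only)."""

    def __init__(self, replicas: int) -> None:
        self.total = replicas
        self.healthy = replicas

    def kill_one(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        self.healthy = self.total

    def is_serving(self) -> bool:
        return self.healthy > 0


def run_experiment(
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment: verify steady state, inject, verify again, roll back."""
    if not steady_state():
        # Never inject failures into a system that is already unhealthy.
        return ExperimentResult(steady_before=False, injected=False, steady_during=False)
    inject_failure()
    try:
        held = steady_state()
    finally:
        rollback()  # always undo the failure, even if the check raises
    return ExperimentResult(steady_before=True, injected=True, steady_during=held)


if __name__ == "__main__":
    for replicas in (2, 1):
        service = ToyService(replicas)
        result = run_experiment(service.is_serving, service.kill_one, service.restore)
        print(replicas, "replica(s):", result)
```

With two replicas the hypothesis holds. With one replica it does not, which is the kind of weakness an experiment is meant to reveal before a real outage does.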
  10. CHAOS HISTORY
    * 2008: Chaos Engineering was born at Netflix
    * 2010: Chaos Monkey & Simian Army were launched
    * 2016: Gremlin was born
    * 2017: SRE USENIX, Chaos IQ, ChaosConf
    * 2018: Book Chaos Eng
    * 2019: Chaos Massification
    * 2020: Book Chaos Eng
  11. When we are not punished or humiliated for speaking up with ideas, questions, concerns, or mistakes, we feel comfortable being ourselves. David Altman
  12. Resilience is the ability to positively adapt to difficult situations and overcome adversity. Resilience includes both physical and mental positive adaptation. Resilience sounds like something you want, but why do you need it? Software development is filled with mental challenges. @nadrosia
  13. Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Erik Hollnagel
  14. 4 Essential Capabilities: 4 sets of answers to construct a resilience profile
    * Actual: Respond
    * Factual: Learn
    * Critical: Monitor
    * Potential: Anticipate
    https://www.yurynino.dev/
  15. IN AN EMERGENCY
    • To be able to construct a mental representation.
    • To be able to assess risks and threats as relevant.
    • To be able to switch from a situation under control.
    • To be able to maintain a relevant level of confidence.
    • To be able to make a decision in a complex situation.
    https://www.yurynino.dev/
  16. IN AN EMERGENCY
    • To be able to make intelligent use of procedures.
    • To be able to use available resources.
    • To be able to manage time and pressure.
    • To be able to cooperate with crew members.
    • To be able to properly use and manage information.
  17. Chaos GameDays
    GameDays are interactive, real-world learning exercises. They are designed to give players a chance to put their skills with a technology to the test. GameDays were created by Jesse Robbins, inspired by his experience and training as a firefighter.
  18. HOW TO RUN A GAMEDAY?
    * First on Call: Monitors, triages, and tries to mitigate failures caused by the Master of Disaster.
    * Master of Disaster: Decides the failure and declares the start of the incident and attack!
    * Team: Finds and solves the exhibited issues, and writes up the postmortem.
  19. Before
    • Pick a hypothesis.
    • Pick a style.
    • Decide who.
    • Decide where.
    • Decide when.
    • Document.
    • Get approval!
    During
    • Detect the situation.
    • Take a deep breath.
    • Communicate.
    • Visit dashboards.
    • Analyze data.
    • Propose solutions.
    • Apply and solve!
    After
    • Write a postmortem.
    • What Happened
    • Impact
    • Duration
    • Resolution Time
    • Resolution
    • Timeline
    • Action Items
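The postmortem fields listed on the last slide can be captured in a small record so that every GameDay produces a write-up with the same shape. This is a minimal sketch under stated assumptions: the `Postmortem` class is hypothetical, "Duration" is interpreted as the time between the start and the resolution of the incident, "Resolution Time" as the timestamp at which it was resolved, and the sample incident is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Postmortem:
    """Hypothetical record mirroring the postmortem fields named on the slide."""

    what_happened: str
    impact: str
    started_at: datetime
    resolved_at: datetime  # assumed meaning of "Resolution Time"
    resolution: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        # Assumed meaning of "Duration": start of incident until resolution.
        return self.resolved_at - self.started_at

    def render(self) -> str:
        """Render the postmortem as plain text, one section per slide field."""
        lines = [
            f"What Happened: {self.what_happened}",
            f"Impact: {self.impact}",
            f"Duration: {self.duration}",
            f"Resolution Time: {self.resolved_at.isoformat()}",
            f"Resolution: {self.resolution}",
            "Timeline:",
        ]
        lines += [f"  {ts.isoformat()} {event}" for ts, event in self.timeline]
        lines.append("Action Items:")
        lines += [f"  - {item}" for item in self.action_items]
        return "\n".join(lines)


if __name__ == "__main__":
    # Invented GameDay incident, for illustration only.
    start = datetime(2024, 1, 10, 10, 0)
    pm = Postmortem(
        what_happened="Master of Disaster stopped one database replica.",
        impact="Read latency increased for some requests.",
        started_at=start,
        resolved_at=start + timedelta(minutes=45),
        resolution="First on Call failed over to the healthy replica.",
        timeline=[(start, "Incident declared"), (start + timedelta(minutes=45), "Resolved")],
        action_items=["Add an alert on replica health."],
    )
    print(pm.render())
```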