Why Software Developers should build a Resilience Culture based on Chaos Engineering

Yury Nino

July 01, 2021

Transcript

  1. 4 in 10 enterprises reported that a single hour of downtime can cost them between $1 million and over $5 million, excluding fines and legal fees. The 2020 Global Server Hardware, Server OS Reliability Survey found that an 87% majority of organizations now require a minimum of 99.99% availability. An SMB that estimates that one hour of downtime “only” costs the firm $10,000 could still incur a cost of $167 for a single minute of per-server downtime. https://thenewstack.io/chaos-engineering-on-ci-cd-pipelines/
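The arithmetic behind these figures can be checked with a short sketch. The function names are illustrative and do not come from the talk, and the yearly downtime budget assumes a 365-day year.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def cost_per_minute(hourly_cost: float) -> float:
    """Spread an hourly downtime cost evenly over 60 minutes."""
    return hourly_cost / MINUTES_PER_HOUR


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # $10,000 per hour is roughly $167 per minute, as quoted on the slide.
    print(round(cost_per_minute(10_000)))
    # A 99.99% target leaves under an hour of downtime per year.
    print(round(downtime_budget_minutes(99.99), 2))
```

Under these assumptions, a 99.99% availability target leaves about 52.56 minutes of downtime per year.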
  2. AGENDA
    * Building software is complex
    * Developers need to be resilient
    * How to Cultivate Resilience?
    * Using Chaos Engineering
    * Chaos in CI/CD
    * Training with Chaos GameDays
  3. DESIGN PRINCIPLES
    * Design for Least Privilege
    * Design for Understandability
    * Design for a Changing Landscape
    * Design for Resilience
    * Design for Recovery
  4. Google reported that 85% of all bugs in Android were caused by memory management errors. They concluded that “they need to move towards memory safe languages”. How to guarantee resilience? Code to fail.
  5. CODING PRINCIPLES
    * Programming Language Choice
    * Complexity vs Understandability
    * Securing Third-Party Software
    * Testing Code
    * Data Validation
  6. OPERATIONS PRINCIPLES
    * Define a Disaster
    * Prepare a Disaster Plan
    * Identify Team and Roles
    * Establish Severity Models
    * Develop Response Plans
    * Create Detailed Playbooks
  7. DEPLOYMENT PRINCIPLES
    * Require Code Reviews
    * Rely on Automation
    * Verify Artifacts, Not Just People
    * Treat Configuration as Code
    * Securing Against the Threat Model
    * Policies Verifiable Builds
    * Post-Deployment Verification
  8. When we write software, we mentally execute the code to understand what is happening. That process is called TRACING. The part of the brain used for tracing is called WORKING MEMORY.
  9. Confusion while coding can be caused by:
    * A lack of knowledge
    * A lack of easy-to-access information
    * A lack of processing power in the brain
    Mental models are mental representations that we form while thinking about problems. People can hold multiple mental models that compete with each other.
  10. Resilience is the ability to positively adapt to difficult situations and overcome adversity. Resilience includes both physical and mental positive adaptation. Resilience sounds like something you want, but why do you need it? Software development is filled with mental challenges. @nadrosia
  11. Code will inevitably include bugs, but we can avoid some of them by using hardened frameworks designed for resilience.
  12. @nadrosia
    * Seek Discomfort
    * Seek Purpose and Find Your “Why”
    * Take Care of Yourself
    * Cultivate Social Connections
  13. RESILIENCE IN THE CHAOS
    * To be able to construct a mental representation of the situation.
    * To be able to assess risk and threats as relevant for the flight.
    * To be able to switch from a situation under control.
    * To be able to maintain a relevant level of confidence.
    * To be able to make a decision in a complex.
  14. RESILIENCE IN THE CHAOS
    * To be able to make intelligent use of procedures.
    * To be able to use available technical and human resources.
    * To be able to manage time and time pressure.
    * To be able to cooperate with crew members and other staff.
    * To be able to properly use and manage information.
  15. CHAOS ENGINEERING
    It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience capability. https://principlesofchaos.org/
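A chaos experiment in the sense described above can be sketched as a small loop: check a steady state, inject a failure, check the steady state again, and roll back. The `run_experiment` helper and the `ToyService` class below are hypothetical names chosen for illustration, not part of any chaos engineering tool, and the "failure" is simulated in memory rather than injected into a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool

    @property
    def hypothesis_held(self) -> bool:
        # The resilience hypothesis holds only if the system stayed healthy.
        return self.steady_before and self.steady_during


def run_experiment(
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment: verify, inject, verify again, always roll back."""
    if not steady_state():
        # Never inject failure into a system that is already unhealthy.
        return ExperimentResult(False, False)
    inject_failure()
    try:
        steady_during = steady_state()
    finally:
        rollback()
    return ExperimentResult(True, steady_during)


class ToyService:
    """Hypothetical service that serves traffic while one replica is up."""

    def __init__(self, replicas: int) -> None:
        self.total = replicas
        self.up = replicas

    def is_serving(self) -> bool:
        return self.up >= 1

    def kill_one_replica(self) -> None:
        self.up = max(0, self.up - 1)

    def restore(self) -> None:
        self.up = self.total


if __name__ == "__main__":
    for replicas in (2, 1):
        svc = ToyService(replicas)
        result = run_experiment(svc.is_serving, svc.kill_one_replica, svc.restore)
        print(replicas, "replica(s): hypothesis held =", result.hypothesis_held)
```

In this toy setup the two-replica service survives the loss of one replica, while the single-replica service does not, which is the kind of weakness an experiment is meant to reveal.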
  16. CHAOS HISTORY
    * 2008 Chaos Engineering was born at Netflix
    * 2010 Chaos Monkey & Simian Army were launched
    * 2016 Gremlin was born
    * 2017 SRE USENIX, Chaos IQ, ChaosConf
    * 2018 Book Chaos Eng
    * 2019 Chaos Massification
    * 2020 Book Chaos Eng
  17. GameDays were created by Jesse Robbins, inspired by his experience and training as a firefighter. A Chaos GameDay is an event hosted to conduct chaos experiments that validate or invalidate a resilience hypothesis.
  18. GAMEDAYS -- CHAOS GAMEDAYS
    GameDays are interactive team-based learning exercises designed to give players a chance to put their skills to the test in a real-world, gamified, risk-free environment. A Chaos GameDay is a practice event, and although it can take a whole day, it usually requires only a few hours. The goal of a GameDay is to practice how you, your team, and your supporting systems deal with real-world turbulent conditions.
  19. THE FRAMEWORK https://www.yurynino.dev/
    Before
    • Pick a hypothesis.
    • Pick a style.
    • Decide who.
    • Decide where.
    • Decide when.
    • Document.
    • Get approval!
    During
    • Detect the situation.
    • Take a deep breath.
    • Communicate.
    • Visit dashboards.
    • Analyze data.
    • Propose solutions.
    • Apply and solve!
    After
    • Write a postmortem.
    • What Happened
    • Impact
    • Duration
    • Resolution Time
    • Resolution
    • Timeline
    • Action Items
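The postmortem fields listed in the framework could be captured in a small record, sketched below. The `Postmortem` class and its field names are hypothetical. It also assumes that "Duration" means time from the start of the incident to its resolution and that "Resolution Time" means time from detection to resolution; the slide does not define either term.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Postmortem:
    what_happened: str
    impact: str
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime
    resolution: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        # Assumption: incident start to resolution.
        return self.resolved_at - self.started_at

    @property
    def resolution_time(self) -> timedelta:
        # Assumption: detection to resolution.
        return self.resolved_at - self.detected_at


if __name__ == "__main__":
    # Invented sample data for a GameDay exercise.
    pm = Postmortem(
        what_happened="One replica was terminated during the GameDay.",
        impact="Elevated latency for a subset of requests.",
        started_at=datetime(2021, 7, 1, 10, 0),
        detected_at=datetime(2021, 7, 1, 10, 5),
        resolved_at=datetime(2021, 7, 1, 10, 35),
        resolution="Replica restarted and traffic rebalanced.",
        action_items=["Add an alert for replica count."],
    )
    print("Duration:", pm.duration)
    print("Resolution time:", pm.resolution_time)
```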
  20. CHAOS GAMEDAYS ROLES
    * Master of Disaster: Decides the failure and declares the start of the incident and the attack!
    * First on Call: Monitors, triages, and tries to mitigate failures caused by the Master of Disaster.
    * Team: Finds and solves the exhibited issues, and writes up the postmortem.
  21. Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Erik Hollnagel