Why Software Developers should build a Resilience Culture based on Chaos Engineering

Slide 1

Slide 1 text

Slide 2

Slide 2 text

YURY NIÑO, SITE RELIABILITY ENGINEERING, CHAOS ENGINEERING ADVOCATE, COLOMBIA

Slide 3

Slide 3 text

4 in 10 enterprises reported that a single hour of downtime can cost them between $1 million and over $5 million, excluding fines and legal fees. The 2020 Global Server Hardware, Server OS Reliability Survey found that an 87% majority of organizations now require a minimum of 99.99% availability. An SMB that estimates that one hour of downtime “only” costs the firm $10,000 could still incur a cost of $167 for a single minute of per-server downtime. https://thenewstack.io/chaos-engineering-on-ci-cd-pipelines/
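The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and the yearly budget assumes a 365-day year; the slide itself only states the $10,000 per hour and 99.99% figures.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def cost_per_minute(cost_per_hour: float) -> float:
    """Spread an hourly downtime cost evenly over 60 minutes."""
    return cost_per_hour / MINUTES_PER_HOUR


def allowed_downtime_minutes_per_year(availability_percent: float) -> float:
    """Downtime budget per year implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # $10,000 per hour is roughly $167 per minute.
    print(round(cost_per_minute(10_000)))
    # 99.99% availability leaves roughly 52.56 minutes of downtime per year.
    print(round(allowed_downtime_minutes_per_year(99.99), 2))
```

Under these assumptions, a 99.99% target leaves less than one hour of downtime per year, which is why a single unplanned outage can consume the whole budget.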

Slide 4

Slide 4 text

AGENDA
* Building software is complex
* Developers need to be resilient
* How to Cultivate Resilience?
* Using Chaos Engineering
* Chaos in CI/CD
* Training with Chaos Gamedays

Slide 5

Slide 5 text

HOW DO WE BUILD SOFTWARE?

Slide 6

Slide 6 text

With Software Development Life Cycles: Plan, Requirements, Design, Code, Deploy, Maintain

Slide 7

Slide 7 text

With Software Development Life Cycles: Plan, Requirements, Design, Code, Deploy, Maintain

Slide 8

Slide 8 text

DESIGN PRINCIPLES
* Design for Least Privilege
* Design for Understandability
* Design for Changing Landscape
* Design for Resilience
* Design for Recovery

Slide 9

Slide 9 text

Code to fail. How to guarantee resilience? Google reported that 85% of all bugs in Android were caused by memory management errors. They concluded that “they need to move towards memory safe languages”.

Slide 10

Slide 10 text

CODING PRINCIPLES
* Programming Language Choice
* Complexity vs Understandability
* Securing Third-Party Software
* Testing Code
* Data Validation

Slide 11

Slide 11 text

OPERATIONS PRINCIPLES
* Define a Disaster
* Prepare a Disaster Plan
* Identify Team and Roles
* Establish Severity Models
* Develop Response Plans
* Create Detailed Playbooks

Slide 12

Slide 12 text

DEPLOYMENT PRINCIPLES
* Require Code Reviews
* Rely on Automation
* Verify Artifacts, Not Just People
* Treat Configuration as Code
* Securing Against the Threat Model
* Policies Verifiable Builds
* Post-Deployment Verification

Slide 13

Slide 13 text

BUILDING SOFTWARE IS COMPLEX

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

When we write software, we are mentally trying to execute the code, to understand what is happening. That process is called TRACING. The part of the brain used to do tracing is called the WORKING MEMORY.

Slide 16

Slide 16 text

Confusion while coding can be caused by:
* A lack of knowledge
* A lack of easy-to-access information
* A lack of processing power in the brain
Mental models are mental representations that we form while thinking of problems. People can hold multiple mental models that can compete with each other.

Slide 17

Slide 17 text

Despite your best efforts, your code probably won’t always behave as expected.

Slide 18

Slide 18 text

By anticipating failures and other mistakes, the SDLC can eliminate redundant rework and after-the-fact fixes.

Slide 19

Slide 19 text

DEVELOPERS NEED TO BE RESILIENT

Slide 20

Slide 20 text

Resilience is the ability to positively adapt to difficult situations and overcome adversity. Resilience includes both physical and mental positive adaptation. Resilience sounds like something you want, but why do you need it? Software development is filled with mental challenges. @nadrosia

Slide 21

Slide 21 text

Code will inevitably include bugs, but we can avoid many of them by using frameworks hardened for resilience.

Slide 22

Slide 22 text

HOW TO CULTIVATE RESILIENCE AS A DEVELOPER

Slide 23

Slide 23 text

* Seek Discomfort
* Seek Purpose and Find Your “Why”
* Take Care of Yourself
* Cultivate Social Connections
@nadrosia

Slide 24

Slide 24 text

That is fine, but how do we cultivate resilience in practice? By learning from other fields.

Slide 25

Slide 25 text

RESILIENCE IN THE CHAOS

Slide 26

Slide 26 text

RESILIENCE IN THE CHAOS
* To be able to construct a mental representation of the situation.
* To be able to assess risks and threats as relevant for the flight.
* To be able to switch from a situation under control.
* To be able to maintain a relevant level of confidence.
* To be able to make a decision in a complex situation.

Slide 27

Slide 27 text

RESILIENCE IN THE CHAOS
* To be able to make intelligent use of procedures.
* To be able to use available technical and human resources.
* To be able to manage time and time pressure.
* To be able to cooperate with crew members and other staff.
* To be able to properly use and manage information.

Slide 28

Slide 28 text

Using Chaos Engineering

Slide 29

Slide 29 text

CHAOS ENGINEERING It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience. https://principlesofchaos.org/

Slide 30

Slide 30 text

CHAOS PRINCIPLES
* Hypothesize about Steady State
* Run Experiments
* Vary Real-World Events
* Automate Experiments
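A minimal sketch of how these principles fit together in code, assuming a toy in-memory service rather than a real production system. `FakeService`, `run_experiment`, `kill_primary`, `kill_all`, `restore`, and the 0.99 threshold are all hypothetical names and values chosen for illustration; they are not part of any chaos tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class FakeService:
    """Hypothetical system under test: a primary and a replica instance."""
    healthy: Dict[str, bool] = field(
        default_factory=lambda: {"primary": True, "replica": True}
    )

    def handle_request(self) -> bool:
        # A request succeeds while at least one instance is healthy.
        return any(self.healthy.values())


def success_rate(service: FakeService, requests: int = 100) -> float:
    """Steady-state metric: fraction of successful requests."""
    ok = sum(service.handle_request() for _ in range(requests))
    return ok / requests


def run_experiment(
    service: FakeService,
    inject: Callable[[FakeService], None],
    rollback: Callable[[FakeService], None],
    threshold: float = 0.99,  # assumed steady-state hypothesis
) -> Dict[str, Optional[object]]:
    # 1. Hypothesize about steady state: measure it before doing anything.
    before = success_rate(service)
    if before < threshold:
        return {"status": "aborted", "before": before, "during": None}
    # 2. Vary a real-world event: inject the failure.
    inject(service)
    try:
        # 3. Run the experiment: measure the same metric under failure.
        during = success_rate(service)
    finally:
        # Always roll back, even if the measurement raises.
        rollback(service)
    status = "passed" if during >= threshold else "failed"
    return {"status": status, "before": before, "during": during}


def kill_primary(service: FakeService) -> None:
    service.healthy["primary"] = False


def kill_all(service: FakeService) -> None:
    for name in service.healthy:
        service.healthy[name] = False


def restore(service: FakeService) -> None:
    for name in service.healthy:
        service.healthy[name] = True


if __name__ == "__main__":
    print(run_experiment(FakeService(), kill_primary, restore)["status"])
    print(run_experiment(FakeService(), kill_all, restore)["status"])
```

Because `run_experiment` is an ordinary function, the fourth principle (automate experiments) amounts to calling it from a scheduler or pipeline step; that wiring is left out of this sketch.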

Slide 31

Slide 31 text

CHAOS HISTORY
* 2008 Chaos Engineering was born at Netflix
* 2010 Chaos Monkey & Simian Army were launched
* 2016 Gremlin was born
* 2017 SRE USENIX, Chaos IQ, ChaosConf
* 2018 Book Chaos Eng
* 2019 Chaos Massification
* 2020 Book Chaos Eng

Slide 32

Slide 32 text

CHAOS TOOLS
* Chaos Monkey
* Chaos Toolkit
* Gremlin
* Chaos Mesh
* Chaos for Spring Boot

Slide 33

Slide 33 text

Chaos in CI/CD

Slide 34

Slide 34 text

CHAOS IN PIPELINES

Slide 35

Slide 35 text

Training with Chaos Gamedays

Slide 36

Slide 36 text

GameDays were created by Jesse Robbins, inspired by his experience and training as a firefighter. A Chaos GameDay is an event hosted to conduct chaos experiments that validate or invalidate a hypothesis about resilience.

Slide 37

Slide 37 text

GAMEDAYS -- CHAOS GAMEDAYS GameDays are interactive team-based learning exercises designed to give players a chance to put their skills to the test in a real-world, gamified, risk-free environment. A Chaos GameDay is a practice event, and although it can take a whole day, it usually requires only a few hours. The goal of a GameDay is to practice how you, your team, and your supporting systems deal with real-world turbulent conditions.

Slide 38

Slide 38 text

THE FRAMEWORK
Before
* Pick a hypothesis.
* Pick a style.
* Decide who.
* Decide where.
* Decide when.
* Document.
* Get approval!
During
* Detect the situation.
* Take a deep breath.
* Communicate.
* Visit dashboards.
* Analyze data.
* Propose solutions.
* Apply and solve!
After
* Write a postmortem covering: What Happened, Impact, Duration, Resolution Time, Resolution, Timeline, Action Items.
https://www.yurynino.dev/
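The postmortem sections named in the framework can be captured as a small record, so a team can check that nothing was left blank after a GameDay. This is only a sketch: the `Postmortem` class, its field types, and the `missing_sections` helper are assumptions made for illustration, not a template from the talk.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class Postmortem:
    """One field per postmortem section listed in the framework."""
    what_happened: str = ""
    impact: str = ""
    duration: timedelta = timedelta(0)
    resolution_time: timedelta = timedelta(0)
    resolution: str = ""
    timeline: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the section titles that are still empty."""
        sections = {
            "What Happened": self.what_happened,
            "Impact": self.impact,
            "Duration": self.duration,
            "Resolution Time": self.resolution_time,
            "Resolution": self.resolution,
            "Timeline": self.timeline,
            "Action Items": self.action_items,
        }
        # Empty strings, empty lists, and zero durations all count as missing.
        return [title for title, value in sections.items() if not value]


if __name__ == "__main__":
    draft = Postmortem(
        what_happened="Primary instance was terminated during the GameDay.",
        impact="Requests were served by the replica.",
        duration=timedelta(minutes=12),
        resolution_time=timedelta(minutes=9),
        resolution="Primary instance restarted.",
        timeline=["Attack declared", "Alert received", "Instance restarted"],
    )
    print(draft.missing_sections())
```

Treating a zero duration as missing is a design choice of this sketch; a team that records instantaneous incidents would need a different check.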

Slide 39

Slide 39 text

CHAOS GAMEDAYS ROLES
* First on Call: Monitors, triages, and tries to mitigate failures caused by the Master of Disaster.
* Master of Disaster: Decides the failure and declares the start of the incident and the attack.
* Team: Finds and solves the exhibited issues, and writes up the postmortem.

Slide 40

Slide 40 text

A USE CASE

Slide 41

Slide 41 text

THE EXPERIMENT

Slide 42

Slide 42 text

THE ATTACK

Slide 43

Slide 43 text

THE RESULTS

Slide 44

Slide 44 text

THE POSTMORTEM

Slide 45

Slide 45 text

RESILIENCE

Slide 46

Slide 46 text

HOW TO BEGIN
https://chaosengineering.slack.com
https://github.com/dastergon/awesome-chaos-engineering
https://www.infoq.com/chaos-engineering
@yurynino

Slide 47

Slide 47 text

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Erik Hollnagel

Slide 48

Slide 48 text

@yurynino THANK YOU