Slide 1

Slide 1 text

Why Software Developers should build a Resilience Culture based on Chaos Engineering

Slide 2

Slide 2 text

YURY NIÑO
CLOUD INFRASTRUCTURE ENGINEER
SITE RELIABILITY ENGINEERING
CHAOS ENGINEERING ADVOCATE

Slide 3

Slide 3 text

https://www.infoq.com/articles/architecture-trends-2021/

Slide 4

Slide 4 text

Everybody who implements software knows that our systems must be resilient, but what about us, the humans? Should we be resilient?

Slide 5

Slide 5 text

If, for SLAs, there is no such thing as 100% uptime, why should humans be available all the time?

Slide 6

Slide 6 text

AGENDA
* Applications must be resilient
* How to test those patterns?
* Using Chaos Engineering
* Building software is complex
* What about humans?
* Should they be resilient?
* How? With Chaos Game Days

Slide 7

Slide 7 text

HOW DO WE BUILD SOFTWARE?

Slide 8

Slide 8 text

DESIGN PRINCIPLES
* Design for Least Privilege
* Design for Understandability
* Design for Changing Landscape
* Design for Resilience
* Design for Recovery

Slide 9

Slide 9 text

CODING PRINCIPLES
* Programming Language Choice
* Complexity vs Understandability
* Securing Third-Party Software
* Testing Code
* Identifying Weaknesses
* Implementing Patterns for Resilience

Slide 10

Slide 10 text

APPLICATIONS MUST BE RESILIENT

Slide 11

Slide 11 text

Code will inevitably include bugs, but we can avoid many of them by using frameworks hardened for resilience.

Slide 12

Slide 12 text

* Steady State
* Let it Crash
* Circuit Breakers
* Fail Fast
* Handshaking
* Test Harnesses
* Governor
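To make one of the patterns listed above concrete, here is a minimal sketch of a Circuit Breaker combined with Fail Fast. It is an illustration under stated assumptions, not a production implementation: the class name, the parameters (max_failures, reset_timeout) and the injectable clock are hypothetical names chosen for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a timeout."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not touch the unhealthy dependency.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a remote dependency in breaker.call(...) and treat CircuitOpenError as a signal to use a fallback instead of waiting on a dependency that is known to be failing.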

Slide 13

Slide 13 text

BUILDING SOFTWARE IS COMPLEX

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

When we write software, we are mentally trying to execute the code, to understand what is happening. That process is called TRACING. The part of the brain used to do tracing is called the WORKING MEMORY.

Slide 16

Slide 16 text

Confusion while coding can be caused by:
* A lack of knowledge
* A lack of easy-to-access information
* A lack of processing power in the brain

Mental models are mental representations that we form while thinking of problems. People can hold multiple mental models that can compete with each other.

Slide 17

Slide 17 text

By anticipating mistakes such as failures, the SDLC can eliminate redundant rework and after-the-fact fixes.

Slide 18

Slide 18 text

That is fine, but how do we prove that it works?

Slide 19

Slide 19 text

Using Chaos Engineering

Slide 20

Slide 20 text

CHAOS ENGINEERING

It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience capability.

https://principlesofchaos.org/

Slide 21

Slide 21 text

CHAOS PRINCIPLES
* Hypothesize about Steady State
* Run Experiments
* Vary Real-World Events
* Automate Experiments
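The principles above can be sketched as a small experiment loop: check the steady-state hypothesis, inject a failure, observe, and always roll back. Everything in this sketch is an assumption made for illustration: ToyService, success_rate, run_experiment and the 0.99 threshold are hypothetical, and the "system" is an in-memory toy rather than a real production target.

```python
import random


class ToyService:
    """Hypothetical in-memory service with two replicas."""

    def __init__(self):
        self.replicas = {"a": True, "b": True}  # replica name -> healthy?

    def handle(self):
        # A request succeeds if at least one replica is healthy.
        return any(self.replicas.values())


def success_rate(service, requests=100):
    """Steady-state metric: fraction of successful requests."""
    ok = sum(1 for _ in range(requests) if service.handle())
    return ok / requests


def kill_random_replica(service):
    """Injected real-world event: one replica goes down."""
    victim = random.choice(sorted(service.replicas))
    service.replicas[victim] = False


def restore_all(service):
    """Rollback: bring every replica back."""
    for name in service.replicas:
        service.replicas[name] = True


def run_experiment(service, inject, rollback, threshold=0.99):
    # 1. Hypothesize about steady state: abort if the system is already unhealthy.
    baseline = success_rate(service)
    if baseline < threshold:
        return {"status": "aborted", "baseline": baseline}
    # 2. Run the experiment: inject the failure, then measure again.
    inject(service)
    try:
        observed = success_rate(service)
    finally:
        # 3. Always roll back, even if the measurement itself raised.
        rollback(service)
    status = "passed" if observed >= threshold else "failed"
    return {"status": status, "baseline": baseline, "observed": observed}


if __name__ == "__main__":
    print(run_experiment(ToyService(), kill_random_replica, restore_all))
```

Because run_experiment is a plain function, it can be scheduled to run repeatedly, which is one simple way to approach the "Automate Experiments" principle.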

Slide 22

Slide 22 text

CHAOS HISTORY
2008 Chaos Engineering was born at Netflix
2010 Chaos Monkey & Simian Army were launched
2016 Gremlin was born
2017 SRE USenix, Chaos IQ, ChaosConf
2018 Book Chaos Eng
2019 Chaos Massification
2020 Book Chaos Eng

Slide 23

Slide 23 text

Humans are central to both the problem and the solution of challenges in engineering!

Slide 24

Slide 24 text

When we are not punished or humiliated for speaking up with ideas, questions, concerns, or mistakes, we feel comfortable being ourselves.
David Altman

Slide 25

Slide 25 text

Resilience is the ability to positively adapt to difficult situations and overcome adversity. Resilience includes both physical and mental positive adaptation. Resilience sounds like something you want, but why do you need it? Software development is filled with mental challenges. @nadrosia

Slide 26

Slide 26 text

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. Erik Hollnagel

Slide 27

Slide 27 text

4 Essential Capabilities
4 Sets of answers to construct resilience profile

Actual: Respond
Factual: Learn
Critical: Monitor
Potential: Anticipate

https://www.yurynino.dev/

Slide 28

Slide 28 text

HUMAN FACTORS

Slide 29

Slide 29 text

IN AN EMERGENCY
● To be able to construct a mental representation.
● To be able to assess risks and threats as relevant.
● To be able to switch from a situation under control.
● To be able to maintain a relevant level of confidence.
● To be able to make a decision in a complex situation.

https://www.yurynino.dev/

Slide 30

Slide 30 text

IN AN EMERGENCY
● To be able to make intelligent use of procedures.
● To be able to use available resources.
● To be able to manage time and pressure.
● To be able to cooperate with crew members.
● To be able to properly use and manage information.

Slide 31

Slide 31 text

Humans operate differently when they expect things to fail! Aaron Rinehart

Slide 32

Slide 32 text

Chaos GameDays

GameDays are interactive, real-world learning exercises. They are designed to give players a chance to put their skills with a technology to the test. GameDays were created by Jesse Robbins, inspired by his experience and training as a firefighter.

Slide 33

Slide 33 text

HOW TO RUN A GAMEDAY?

First on Call: Monitors, triages, and tries to mitigate failures caused by the Master of Disaster.
Master of Disaster: Decides the failure and declares the start of the incident and attack!
Team: Finds and solves the exhibited issues, and writes up the postmortem.

Slide 34

Slide 34 text

Before
● Pick a hypothesis.
● Pick a style.
● Decide who.
● Decide where.
● Decide when.
● Document.
● Get approval!

During
● Detect the situation.
● Take a deep breath.
● Communicate.
● Visit dashboards.
● Analyze data.
● Propose solutions.
● Apply and solve!

After
● Write a postmortem.
  ● What Happened
  ● Impact
  ● Duration
  ● Resolution Time
  ● Resolution
  ● Timeline
  ● Action Items
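The postmortem fields listed above can be captured in a small record type so that every GameDay produces a write-up with the same shape. This is only a sketch: the Postmortem class, its attribute names and the render format are hypothetical, and it assumes that "Duration" means the time between the start and the resolution of the incident and that "Resolution Time" means the timestamp at which it was resolved.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Postmortem:
    """Hypothetical record mirroring the postmortem sections from the slide."""

    what_happened: str
    impact: str
    started_at: datetime
    resolved_at: datetime
    resolution: str
    timeline: list = field(default_factory=list)      # (datetime, note) pairs
    action_items: list = field(default_factory=list)  # plain strings

    @property
    def duration(self) -> timedelta:
        # Assumption: duration is measured from start to resolution.
        return self.resolved_at - self.started_at

    def render(self) -> str:
        """Render the postmortem as plain text, one section per field."""
        lines = [
            f"What Happened: {self.what_happened}",
            f"Impact: {self.impact}",
            f"Duration: {self.duration}",
            f"Resolution Time: {self.resolved_at.isoformat()}",
            f"Resolution: {self.resolution}",
            "Timeline:",
        ]
        lines += [f"  {ts.isoformat()} {note}" for ts, note in self.timeline]
        lines.append("Action Items:")
        lines += [f"  - {item}" for item in self.action_items]
        return "\n".join(lines)


if __name__ == "__main__":
    start = datetime(2021, 1, 1, 10, 0)
    pm = Postmortem(
        what_happened="Master of Disaster stopped one database replica.",
        impact="Read latency increased for some requests.",
        started_at=start,
        resolved_at=start + timedelta(minutes=25),
        resolution="Replica restarted and resynchronized.",
        timeline=[(start, "Incident declared")],
        action_items=["Add an alert for replica lag"],
    )
    print(pm.render())
```

The example values in the usage section are invented for illustration and do not come from a real GameDay.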

Slide 35

Slide 35 text

Chaos References

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

HOW TO BEGIN

https://chaosengineering.slack.com
https://github.com/dastergon/awesome-chaos-engineering
https://www.infoq.com/chaos-engineering
@yurynino

Slide 38

Slide 38 text

@yurynino THANK YOU