Slide 1

Slide 1 text

Resilient Software Design
The past, the present and the future
Uwe Friedrichsen – codecentric AG – 2013-2022

Slide 2

Slide 2 text

Uwe Friedrichsen
Works @ codecentric
https://twitter.com/ufried
https://www.speakerdeck.com/ufried
https://ufried.com/

Slide 3

Slide 3 text

The past

Slide 4

Slide 4 text

Perception of resilience in IT in the past (and often still in the present)

Slide 5

Slide 5 text

Resilience = Fault tolerance
Perception of resilience in IT in the past

Slide 6

Slide 6 text

Fault tolerance (early days)
• Fault tolerance started decades ago
  • SAPO (1950s)
  • NASA LLNM computing (1960s), e.g., for Apollo and Voyager
  • F-14 CADC (1970s)
  • Telecommunication switches (1970s)
  • Tandem Computers, Inc. (1970s)
• Fault tolerance was typically solved at the hardware and OS level
• Software development was usually only marginally affected

Slide 7

Slide 7 text

Fault tolerance (continued)
• Boom with the rise of cloud and microservices (early 2010s)
  • E.g., Netflix OSS (especially Hystrix)
  • More software development attention
  • Called “resilience”, but still focused on fault tolerance
• Meanwhile more infrastructure-level support
  • E.g., service meshes, API gateways, cloud infrastructure
• Often neglected for the sake of cost-efficiency

Slide 8

Slide 8 text

Perception of resilience outside of IT in the past (and still in the present)

Slide 9

Slide 9 text

Resilience ≠ Fault tolerance
Perception of resilience outside of IT in the past

Slide 10

Slide 10 text

Resilience outside of IT
• Multidisciplinary field
  • Psychological resilience
  • Organizational resilience
  • Supply chain resilience
  • Ecological resilience
  • Resilience engineering (safety)
  • Materials resilience
  • Cyber resilience (security)
  • ...

Slide 11

Slide 11 text

Resilience outside of IT (continued)
• Often different focus
  • Robustness/fault tolerance: Handle known failure modes
  • Resilience: Adapt to unknown failure modes (“surprises”)
• Still, no generally accepted definition
  • Definition depends on the field
  • Sometimes robustness is part of it, sometimes it is not
  • Often related to safety and dependability
• Common ground: Resilience is more than just robustness

Slide 12

Slide 12 text

Bottom line

Slide 13

Slide 13 text

Bottom line (past)
• Broad multidisciplinary field
• No generally accepted definition
• Used as synonym for fault tolerance in IT
• Usually expected to be solved at infrastructure level
• Often ignored for the sake of maximizing cost-efficiency

Slide 14

Slide 14 text

The present

Slide 15

Slide 15 text

Things have gone a bit quiet around resilient software design

Slide 16

Slide 16 text

Microservices became popular, but nothing else changed much ...

Slide 17

Slide 17 text

Ignoring the effects of distribution
• Architects ignore effects of distribution
• Developers ignore effects of distribution
• Everyone else expects things to become faster and cheaper
• Development expects infrastructure to solve the problems
• Operations curses and is stressed out

Slide 18

Slide 18 text

You build it. You ignore it.
Build things as you like and neglect the consequences of your actions!

Slide 19

Slide 19 text

Why is this a problem?

Slide 20

Slide 20 text

Distributed systems in a nutshell

Slide 21

Slide 21 text

Everything fails, all the time. -- Werner Vogels

Slide 22

Slide 22 text

Effects of distributed systems
• Distributed systems introduce non-determinism regarding
  • Execution completeness
  • Message ordering
  • Communication timing
• You will be affected by this at the application level
  • Don’t expect your infrastructure to hide all effects from you
  • Better to know how to detect when it hits you and how to respond
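
As an illustration of detecting such effects at the application level, here is a minimal sketch of a receiver that uses sequence numbers to notice duplicated, reordered and missing messages. The class `OrderedReceiver` and the assumption that the sender attaches a gap-free sequence number to every message are hypothetical; they are not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedReceiver:
    """Delivers messages in sequence order, drops duplicates, buffers gaps."""

    next_seq: int = 0                             # next sequence number to deliver
    pending: dict = field(default_factory=dict)   # out-of-order messages by seq
    delivered: list = field(default_factory=list)

    def receive(self, seq: int, payload: str) -> None:
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate (e.g., caused by a sender retry): ignore it
        self.pending[seq] = payload
        # Deliver every buffered message that is now in order.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def missing(self) -> list:
        """Sequence numbers that were skipped so far (detected gaps)."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]
```

How to respond to a detected gap (wait, request a resend, continue without the message) remains a functional decision that the sketch deliberately leaves open.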

Slide 23

Slide 23 text

What can the infrastructure do for us in such a setting?

Slide 24

Slide 24 text

Infrastructure level means
• Detect if a peer does not respond (in time)
• Retry accessing the peer
• Try to access a different instance from the failover group
• Try to fire up new instances
  • After instance loss is detected
  • If load exceeds a certain level (“autoscale”)
• Throttle incoming requests
• Notify administrators if additional action is required
• …
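
The first three means (timeout detection, retry, failover) can be sketched as follows. This is a simplified illustration, not how any particular service mesh or gateway implements it: the function `call_with_failover` is hypothetical, and each instance is modeled as a plain callable that raises `TimeoutError` when the peer does not respond in time.

```python
from typing import Callable, Sequence


def call_with_failover(
    instances: Sequence[Callable[[], str]],
    retries_per_instance: int = 2,
) -> str:
    """Try each instance in turn, retrying on timeouts, before giving up."""
    last_error = None
    for instance in instances:
        for _attempt in range(retries_per_instance):
            try:
                return instance()
            except TimeoutError as error:
                # Peer did not respond in time: retry, then fail over.
                last_error = error
    raise RuntimeError("all instances failed") from last_error
```

The sketch retries immediately and treats every request as safe to repeat. Real retries usually add a delay between attempts, and they are only safe if the called operation is idempotent.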

Slide 25

Slide 25 text

This is quite a bit, but …

Slide 26

Slide 26 text

Infrastructure level limitations
• Not all failure modes supported (e.g., response failures)
• Not all patterns supported (e.g., idempotency, fallback)
• Not ubiquitously available (e.g., on-premises autoscale)
• Often support from application level required (e.g., metrics)
• Only undifferentiated, coarse-grained actions possible
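
Idempotency and fallback are examples of patterns that need knowledge of the business logic and therefore live in the application. A minimal sketch, assuming a hypothetical payment handler that receives a client-generated request ID, and a hypothetical helper `with_fallback`:

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class IdempotentPaymentHandler:
    """Processes each request ID at most once, so retried requests are harmless."""

    def __init__(self) -> None:
        self.results: Dict[str, str] = {}  # stored results by request ID
        self.charges = 0                   # stands in for the real side effect

    def handle(self, request_id: str, amount: int) -> str:
        if request_id in self.results:
            return self.results[request_id]  # replay: return the stored result
        self.charges += amount  # side effect happens only once per request ID
        result = f"charged {amount}"
        self.results[request_id] = result
        return result


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return the primary result, or a business-defined fallback if it fails."""
    try:
        return primary()
    except Exception:
        return fallback()
```

What a sensible fallback value is (a cached result, a default, a reduced feature) is a business decision, which is why the infrastructure cannot provide it on its own.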

Slide 27

Slide 27 text

The effects of distribution will still hit you at the application level

Slide 28

Slide 28 text

The question is no longer if failures will hit you. The only question left is when and how badly they will hit you.

Slide 29

Slide 29 text

Complexity of IT system landscapes grows continually ...

Slide 30

Slide 30 text

System landscape complexity
• New development projects only focus on local optimization
  • Ignoring impact on complexity of whole system landscape
  • Leads to disproportionate increase in complexity
• New paradigms only focus on their advantages
  • Ignoring effects on complexity of whole system landscape
  • Leads to disproportionate increase in complexity
• Only a matter of time until IT collapses beyond repair

Slide 31

Slide 31 text

This requires resilience thinking beyond application robustness.
But it also requires more focus on application robustness, i.e., resilient software design.

Slide 32

Slide 32 text

Slowly, companies start to realize that there might be a problem ...

Slide 33

Slide 33 text

... that they might steer towards an abyss ...

Slide 34

Slide 34 text

... that they might need more resilience ...

Slide 35

Slide 35 text

... and must not neglect IT ...

Slide 36

Slide 36 text

... that their future viability might depend on their resilience

Slide 37

Slide 37 text

Still, they usually act based on habits and old practice ...

Slide 38

Slide 38 text

... neglecting to build resilience in favor of short-term gains

Slide 39

Slide 39 text

Bottom line

Slide 40

Slide 40 text

Bottom line (present)
• Understanding grows that resilience is needed at all levels
• Complexity of IT landscapes has become a problem
• Still, investments are scarce
• “It’s going to be alright” mindset still prevalent

Slide 41

Slide 41 text

The future

Slide 42

Slide 42 text

Resilience will become the topic of the 21st century alongside sustainability

Slide 43

Slide 43 text

Not because companies want to but because they do not have a choice

Slide 44

Slide 44 text

Still, several challenges need to be solved first

Slide 45

Slide 45 text

Homework that needs to be done

Slide 46

Slide 46 text

Homework that needs to be done
• Stop fighting about the “right” definition of resilience
  • In the end, all resilience proponents have the same goal
  • Debates about the “right” definition only confuse other people
  • Makes it harder to spread the ideas and their implementation

Slide 47

Slide 47 text

Here is my suggestion ...

Slide 48

Slide 48 text

resilience
The ability to successfully cope with adverse events and situations, including
1. handling expected adverse events and situations (robustness)
2. handling unexpected adverse events and situations (surprise)
3. improving due to adverse events and situations (anti-fragility)

resilient software design
Designing and building software-based systems in ways that improve their dependability and thus support resilience according to the definition above

Slide 49

Slide 49 text

Homework that needs to be done
• Stop fighting about the “right” definition of resilience
• Break traditional company habits
  • Maximizing efficiency cripples resilience

Slide 50

Slide 50 text

[Figure: achievable resilience and achievable efficiency in relation to acceptable variance (scale labels: large, small)]

Slide 51

Slide 51 text

Homework that needs to be done
• Stop fighting about the “right” definition of resilience
• Break traditional company habits
  • Maximizing efficiency cripples resilience
  • Short-term thinking compromises resilience
  • Focus on minimizing short-term development costs compromises resilient software design
  • Huge change of ingrained mindset

Slide 52

Slide 52 text

Homework that needs to be done
• Stop fighting about the “right” definition of resilience
• Break traditional company habits
• Understand resilience in IT
  • Resilience is a socio-technical topic
  • Cannot be solved at the technical level alone
  • Cannot be solved with tools or products
  • Technology can only support

Slide 53

Slide 53 text

Homework that needs to be done
• Stop fighting about the “right” definition of resilience
• Break traditional company habits
• Understand resilience in IT
• Understand resilient software design
  • Cannot be solved at the infrastructure level
  • Requires tight ops-dev feedback loops to be effective
  • Without a proper functional design, nothing else matters

Slide 54

Slide 54 text

Now what?

Slide 55

Slide 55 text

Some recommendations
• Regarding system design
  • Mind the functional design
  • Strive for functional independence of runtime units
  • Then augment with resilience patterns
  • Domain-driven design can support
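
One well-known resilience pattern of this kind is the circuit breaker, which libraries such as Hystrix popularized. The following is a minimal sketch under simplifying assumptions, not Hystrix's implementation: the class names, the default thresholds and the injectable `clock` parameter are hypothetical, and the breaker is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the operation."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means "closed"

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A pattern like this only limits the damage of a failing dependency. Whether the caller can do anything useful while the circuit is open depends on the functional design, which is why the functional design comes first.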

Slide 56

Slide 56 text

Some recommendations
• Regarding system design
• Regarding software landscape grooming
  • Simplify! – Complexity is the enemy of resilience
  • Coordinate infrastructure, application and organization level measures

Slide 57

Slide 57 text

Some recommendations
• Regarding system design
• Regarding software landscape grooming
• Regarding IT organization and processes
  • Establish short feedback loops across the IT value chain
  • Make resilience a continuous improvement process
  • Include chaos engineering
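
To give an idea of what a chaos experiment does at the smallest possible scale, here is a sketch of a fault-injection wrapper that makes a configurable share of calls fail. The names `with_fault_injection` and `InjectedFault` are hypothetical; real chaos engineering tools operate at the level of whole systems and include safeguards this sketch lacks.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised instead of calling the real operation during an experiment."""


def with_fault_injection(
    operation: Callable[[], T],
    failure_rate: float,
    rng: random.Random,
) -> Callable[[], T]:
    """Wrap an operation so that roughly `failure_rate` of its calls fail."""

    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos experiment")
        return operation()

    return wrapped
```

Passing in a seeded `random.Random` keeps an experiment repeatable, which helps when comparing system behavior before and after a fix.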

Slide 58

Slide 58 text

Some recommendations
• Regarding system design
• Regarding software landscape grooming
• Regarding IT organization and processes
• Regarding product functionality
  • Simplify! – Keep the product simple
  • Regularly remove features that are rarely or never used
  • Implement business metrics

Slide 59

Slide 59 text

“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” -- Antoine de Saint-Exupéry

Slide 60

Slide 60 text

Some recommendations
• Regarding system design
• Regarding software landscape grooming
• Regarding IT organization and processes
• Regarding product functionality
• Regarding humans
  • Provide great user experience for all types of users
  • Provide training for all parties along the IT value chain

Slide 61

Slide 61 text

More to ponder

Slide 62

Slide 62 text

More to ponder
• Organic computing
• Interplay between resilience and sustainability
• Interplay between resilience and security
• Resilience beyond robustness, withstanding and recovery

Slide 63

Slide 63 text

Summing up

Slide 64

Slide 64 text

Summing up
• Resilience is a huge multidisciplinary topic
  • Started as fault tolerance in IT
  • Had a little hype a few years ago
  • Will become an essential topic of the 21st century
  • Much more than fault tolerance or robustness alone
• Awareness is increasing
  • Yet currently little investment
  • Lots of homework to be done

Slide 65

Slide 65 text

The future is already here – it's just not evenly distributed. ― William Gibson
