Resilient software design - The past, the present and the future

This slide deck is a bit of stocktaking and a look ahead regarding resilience in IT and resilient software design. It starts with the past: how resilience was treated inside and outside of IT.

Then it moves to the present, where it has become a bit quiet around resilient software design, while at the same time microservices are everywhere and the complexity of the overall IT system landscape explodes, getting a bit worse every day. One would expect the problems to become apparent to everyone, but surprisingly they do not.

Still, companies start to realize at the business level that they might need more resilience due to unexpected adverse events and uncertainty becoming the norm. And as business and IT have become inseparable due to the ongoing digital transformation, it also affects IT. But as old habits die hard, investments are still scarce.

In the third part, the slide deck looks at the future of resilience and resilient software design. The prediction is that resilience, alongside sustainability, will become the most important topic of IT (and of the companies it supports) in the 21st century. Still, to get there, some homework needs to be done first. The slide deck lists a selection.

Finally, it gives a few recommendations on where to start and how to organize your way towards a resilient IT organization, including a robust IT system landscape.

As always, the slide deck does not contain the voice track which means a lot of details are missing. Still, I hope it gives you a few ideas to ponder ...

Uwe Friedrichsen

November 17, 2022

Transcript

  1. Resilient Software Design
    The past, the present and the future
    Uwe Friedrichsen – codecentric AG – 2013-2022

  2. Uwe Friedrichsen
    Works @ codecentric
    https://twitter.com/ufried
    https://www.speakerdeck.com/ufried
    https://ufried.com/

  3. The past

  4. Perception of resilience in IT in the past
    (and often still in the present)

  5. Resilience = Fault tolerance
    Perception of resilience in IT in the past

  6. Fault tolerance (early days)
    • Fault tolerance started decades ago
    • SAPO (1950s)
    • NASA LLNM computing (1960s), e.g., for Apollo and Voyager
    • F14 CADC (1970s)
    • Telecommunication switches (1970s)
    • Tandem Computers, Inc. (1970s)
    • Fault tolerance typically solved at hardware and OS level
    • Software development usually only affected marginally

  7. Fault tolerance (continued)
    • Boom with rise of cloud and microservices (early 2010s)
    • E.g., Netflix OSS (especially Hystrix)
    • More software development attention
    • Called “resilience”, but still focused on fault tolerance
    • Meanwhile more infrastructure-level support
    • E.g., service meshes, API gateways, cloud infrastructure
    • Often neglected for the sake of cost-efficiency

  8. Perception of resilience outside of IT in the past
    (and still in the present)

  9. Resilience ≠ Fault tolerance
    Perception of resilience outside of IT in the past

  10. Resilience outside of IT
    • Multidisciplinary field
    • Psychological resilience
    • Organizational resilience
    • Supply chain resilience
    • Ecological resilience
    • Resilience engineering (safety)
    • Materials resilience
    • Cyber resilience (security)
    • ...

  11. Resilience outside of IT (continued)
    • Often different focus
    • Robustness/fault tolerance: Handle known failure modes
    • Resilience: Adapt to unknown failure modes (“surprises”)
    • Still, no generally accepted definition
    • Definition depends on the field
    • Sometimes robustness is part of it, sometimes it is not
    • Often related to safety and dependability
    • Common ground: Resilience is more than just robustness

  12. Bottom line

  13. Bottom line (past)
    • Broad multidisciplinary field
    • No generally accepted definition
    • Used as synonym for fault-tolerance in IT
    • Usually expected to be solved at infrastructure level
    • Often ignored for the sake of maximizing cost-efficiency

  14. The present

  15. It became a bit quiet about resilient software design

  16. Microservices became popular, but nothing else changed much ...

  17. Ignoring the effects of distribution
    • Architects ignore effects of distribution
    • Developers ignore effects of distribution
    • Everyone else expects things to become faster and cheaper
    • Development expects infrastructure to solve the problems
    • Operations curses and is stressed out

  18. You build it. You ignore it.
    Build things as you like and neglect the consequences of your actions!

  19. Why is this a problem?

  20. Distributed systems in a nutshell

  21. Everything fails, all the time.
    -- Werner Vogels

  22. Effects of distributed systems
    • Distributed systems introduce non-determinism regarding
    • Execution completeness
    • Message ordering
    • Communication timing
    • You will be affected by this at the application level
    • Don’t expect your infrastructure to hide all effects from you
    • Better know how to detect if it hit you and how to respond
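
The effects listed on this slide can be made concrete with a small sketch. It shows one possible way for an application to detect duplicated and out-of-order messages itself. The class `OrderedReceiver` and its sequence-number scheme are hypothetical and not taken from the deck; the sketch assumes the sender numbers its messages from 0 without gaps.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedReceiver:
    """Delivers messages in sequence order and drops duplicates.

    Assumption: the sender assigns sequence numbers starting at 0, without gaps.
    """
    next_seq: int = 0
    pending: dict = field(default_factory=dict)    # seq -> payload, waiting for a gap to close
    delivered: list = field(default_factory=list)  # payloads in delivery order

    def receive(self, seq: int, payload: str) -> None:
        # A retry on the sender side can deliver the same message twice.
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate: already delivered or already buffered
        self.pending[seq] = payload
        # Deliver everything that is now contiguous.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


receiver = OrderedReceiver()
# Messages arrive reordered, and message 1 arrives twice.
for seq, payload in [(1, "b"), (0, "a"), (1, "b"), (3, "d"), (2, "c")]:
    receiver.receive(seq, payload)
print(receiver.delivered)
```

The sketch only covers ordering and duplication. Incomplete execution and timing issues need separate handling, for example timeouts.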

  23. What can the infrastructure do for us in such a setting?

  24. Infrastructure level means
    • Detect if a peer does not respond (in time)
    • Retry accessing the peer
    • Try to access a different instance from failover group
    • Try to fire up new instances
    • After instance loss is detected
    • If load exceeds a certain level (“autoscale”)
    • Throttle incoming requests
    • Notify administrators if additional action is required
    • …
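
As an illustration of the first three measures (detect a missing response, retry, try another instance of the failover group), here is a minimal sketch. It is not how any particular service mesh or gateway implements them. The function `call_with_retry_and_failover`, the exception `PeerUnavailable` and the two stand-in instances are hypothetical; a real peer call would raise on a timeout instead of being simulated.

```python
import time


class PeerUnavailable(Exception):
    """Raised by a (hypothetical) instance that fails or does not answer in time."""


def call_with_retry_and_failover(instances, request, retries_per_instance=2, backoff_s=0.0):
    """Try each instance of a failover group, retrying a few times before moving on."""
    errors = []
    for instance in instances:
        for attempt in range(retries_per_instance):
            try:
                return instance(request)
            except PeerUnavailable as exc:
                errors.append(exc)
                time.sleep(backoff_s * (attempt + 1))  # simple linear backoff
    raise PeerUnavailable(f"all instances failed after {len(errors)} attempts")


def dead_instance(request):
    # Stand-in for a peer that never responds in time.
    raise PeerUnavailable("no timely response")


def healthy_instance(request):
    return f"handled {request}"


result = call_with_retry_and_failover([dead_instance, healthy_instance], "order-42")
print(result)
```

Note that blind retries are only safe if the called operation is idempotent, which is something the infrastructure cannot guarantee on its own.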

  25. This is quite a bit, but …

  26. Infrastructure level limitations
    • Not all failure modes supported (e.g., response failures)
    • Not all patterns supported (e.g., idempotency, fallback)
    • Not ubiquitously available (e.g., on-premises autoscale)
    • Often support from application level required (e.g., metrics)
    • Only undifferentiated, coarse-grained actions possible
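
The two patterns named above, idempotency and fallback, need application knowledge, which a short sketch can illustrate. All names here (`PaymentService`, `recommendations_with_fallback`, the idempotency key and the default value) are hypothetical and chosen only for illustration.

```python
class PaymentService:
    """Hypothetical service that processes each idempotency key at most once."""

    def __init__(self):
        self.results = {}  # idempotency key -> stored result
        self.charged = 0   # total amount actually charged

    def charge(self, idempotency_key: str, amount: int) -> str:
        # A repeated request with the same key returns the stored result
        # instead of charging again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charged += amount
        result = f"charged {amount}"
        self.results[idempotency_key] = result
        return result


def recommendations_with_fallback(fetch, default):
    """Return live recommendations, or a functionally chosen default if the call fails."""
    try:
        return fetch()
    except Exception:
        return default


service = PaymentService()
first = service.charge("key-1", 100)
second = service.charge("key-1", 100)  # e.g., a retry after a lost response
print(service.charged)


def broken_fetch():
    raise TimeoutError("recommendation service did not respond")


fallback_result = recommendations_with_fallback(broken_fetch, ["bestsellers"])
print(fallback_result)
```

Which default is acceptable, and which requests count as "the same", are business decisions. That is why these patterns cannot be delegated to generic infrastructure.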

  27. The effects of distribution will still hit you at the application level

  28. The question no longer is if failures will hit you.
    The only question left is when and how badly they will hit you.

  29. Complexity of IT system landscapes grows continually ...

  30. System landscape complexity
    • New development projects only focus on local optimization
    • Ignoring impact on complexity of whole system landscape
    • Leads to disproportionate increase in complexity
    • New paradigms only focus on their advantages
    • Ignoring effects on complexity of whole system landscape
    • Leads to disproportionate increase in complexity
    • Only a matter of time until IT will collapse beyond repair

  31. This requires resilience thinking beyond application robustness
    But it also requires more focus on application robustness, i.e., resilient software design

  32. Slowly, companies start to realize that there might be a problem ...

  33. ... that they might steer towards an abyss ...

  34. ... that they might need more resilience ...

  35. ... and must not neglect IT ...

  36. ... that their future viability might depend on their resilience

  37. Still, they usually act based on habits and old practice ...

  38. ... neglecting building resilience for short-term gains

  39. Bottom line

  40. Bottom line (present)
    • Understanding grows that resilience is needed at all levels
    • Complexity of IT landscapes has become a problem
    • Still, investments are scarce
    • “It’s going to be alright” mindset still prevalent

  41. The future

  42. Resilience will become the topic of the 21st century
    alongside sustainability

  43. Not because companies want to
    but because they do not have a choice

  44. Still, several challenges need to be solved first

  45. Homework that needs to be done

  46. Homework that needs to be done
    • Stop fighting about the “right” definition of resilience
    • In the end, all resilience proponents have the same goal
    • Debates about the “right” definition only confuse other people
    • Makes it harder to spread the ideas and their implementation

  47. Here is my suggestion ...

  48. resilience
    The ability to successfully cope with adverse events and situations, including
    1. handling expected adverse events and situations (robustness)
    2. handling unexpected adverse events and situations (surprise)
    3. improving due to adverse events and situations (anti-fragility)
    resilient software design
    Designing and building software-based systems in ways that improve their
    dependability and thus support resilience according to the definition above

  49. Homework that needs to be done
    • Stop fighting about the “right” definition of resilience
    • Break traditional company habits
    • Maximizing efficiency cripples resilience

  50. [Figure relating acceptable variance (large to small) to achievable resilience and achievable efficiency (large to small)]

  51. Homework that needs to be done
    • Stop fighting about the “right” definition of resilience
    • Break traditional company habits
    • Maximizing efficiency cripples resilience
    • Short-term thinking compromises resilience
    • Focus on minimizing short-term development costs compromises resilient software design
    • Huge change of ingrained mindset

  52. Homework that needs to be done
    • Stop fighting about the “right” definition of resilience
    • Break traditional company habits
    • Understand resilience in IT
    • Resilience is a socio-technical topic
    • Cannot be solved at the technical level alone
    • Cannot be solved with tools or products
    • Technology can only support

  53. Homework that needs to be done
    • Stop fighting about the “right” definition of resilience
    • Break traditional company habits
    • Understand resilience in IT
    • Understand resilient software design
    • Cannot be solved at the infrastructure level
    • Requires tight ops-dev feedback loops to be effective
    • Without a proper functional design, nothing else matters

  54. Now what?

  55. Some recommendations
    • Regarding system design
    • Mind the functional design
    • Strive for functional independence of runtime units
    • Then augment with resilience patterns
    • Domain-driven design can support
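
As one example of what "augment with resilience patterns" can mean, here is a minimal circuit breaker sketch, the pattern popularized by Hystrix. It is a simplified illustration under stated assumptions, not the Hystrix implementation. The class `CircuitBreaker`, its parameters and the thresholds used are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the peer."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open, call rejected")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


calls = []


def flaky_peer():
    calls.append("attempt")
    raise ConnectionError("peer down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_peer)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")
print(outcomes, len(calls))
```

In line with the slide, such a pattern only helps once the functional design is sound: a breaker around a call that the caller cannot function without merely fails faster.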

  56. Some recommendations
    • Regarding system design
    • Regarding software landscape grooming
    • Simplify! – Complexity is the enemy of resilience
    • Coordinate infrastructure, application and organization level measures

  57. Some recommendations
    • Regarding system design
    • Regarding software landscape grooming
    • Regarding IT organization and processes
    • Establish short feedback loops across the IT value chain
    • Make resilience a continuous improvement process
    • Include chaos engineering

  58. Some recommendations
    • Regarding system design
    • Regarding software landscape grooming
    • Regarding IT organization and processes
    • Regarding product functionality
    • Simplify! – Keep the product simple
    • Regularly remove features that are rarely or not used
    • Implement business metrics

  59. “Perfection is achieved,
    not when there is nothing more to add,
    but when there is nothing left to take away.”
    -- Antoine de Saint-Exupéry

  60. Some recommendations
    • Regarding system design
    • Regarding software landscape grooming
    • Regarding IT organization and processes
    • Regarding product functionality
    • Regarding humans
    • Provide great user experience for all types of users
    • Provide training for all parties along the IT value chain

  61. More to ponder

  62. More to ponder
    • Organic computing
    • Interplay between resilience and sustainability
    • Interplay between resilience and security
    • Resilience beyond robustness, withstanding and recovery

  63. Summing up

  64. Summing up
    • Resilience is a huge multidisciplinary topic
    • Started as fault-tolerance in IT
    • Had a little hype a few years ago
    • Will become an essential topic of the 21st century
    • Much more than fault-tolerance or robustness alone
    • Awareness increases
    • Yet currently little investment
    • Lots of homework to be done

  65. The future is already here – it's just not evenly distributed.
    ― William Gibson

  66. Uwe Friedrichsen
    Works @ codecentric
    https://twitter.com/ufried
    https://www.speakerdeck.com/ufried
    https://ufried.com/
