
Resilient software design - The past, the present and the future

This slide deck takes stock of resilience in IT and resilient software design and then looks ahead. It starts with the past: how resilience was treated inside and outside of IT.

Then it moves to the current state, where it has become a bit quiet around resilient software design while microservices are everywhere and the complexity of the overall IT system landscape explodes, getting a bit worse every day. One would expect the problems to become apparent to everyone, but surprisingly they do not.

Still, companies start to realize at the business level that they might need more resilience due to unexpected adverse events and uncertainty becoming the norm. And as business and IT have become inseparable due to the ongoing digital transformation, it also affects IT. But as old habits die hard, investments are still scarce.

In the third part, the slide deck looks at the future of resilience and resilient software design. The prediction is that resilience, alongside sustainability, will become the most important topics for IT (and the companies it supports) in the 21st century. Still, some homework needs to be done to get there. The slide deck lists a selection.

Finally, the deck gives a few recommendations on where to start and how to organize your way towards a resilient IT organization, including a robust IT system landscape.

As always, the slide deck does not contain the voice track which means a lot of details are missing. Still, I hope it gives you a few ideas to ponder ...

Uwe Friedrichsen

November 17, 2022

Transcript

  1. Resilient Software Design: The past, the present and the future (Uwe Friedrichsen – codecentric AG – 2013-2022)

  2. Uwe Friedrichsen Works @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/

  3. The past

  4. Perception of resilience in IT in the past (and often still in the present)

  5. Resilience = Fault tolerance Perception of resilience in IT in the past

  6. Fault tolerance (early days) • Fault tolerance started decades ago • SAPO (1950s) • NASA LLNM computing (1960s), e.g., for Apollo and Voyager • F14 CADC (1970s) • Telecommunication switches (1970s) • Tandem Computers, Inc. (1970s) • Fault tolerance typically solved at hardware and OS level • Software development usually only affected marginally

  7. Fault tolerance (continued) • Boom with rise of cloud and microservices (early 2010s) • E.g., Netflix OSS (especially Hystrix) • More software development attention • Called “resilience”, but still focus on fault tolerance • Meanwhile more infrastructure-level support • E.g., service meshes, API gateways, cloud infrastructure • Often neglected for the sake of cost-efficiency

  8. Perception of resilience outside of IT in the past (and still in the present)

  9. Resilience ≠ Fault tolerance Perception of resilience outside of IT in the past

  10. Resilience outside of IT • Multidisciplinary field • Psychological resilience • Organizational resilience • Supply chain resilience • Ecological resilience • Resilience engineering (safety) • Materials resilience • Cyber resilience (security) • ...

  11. Resilience outside of IT (continued) • Often different focus • Robustness/fault tolerance: Handle known failure modes • Resilience: Adapt to unknown failure modes (“surprises”) • Still, no generally accepted definition • Definition depends on the field • Sometimes robustness is part of it, sometimes it is not • Often related to safety and dependability • Common ground: Resilience is more than just robustness

  12. Bottom line

  13. Bottom line (past) • Broad multidisciplinary field • No generally accepted definition • Used as synonym for fault tolerance in IT • Usually expected to be solved at infrastructure level • Often ignored for the sake of maximizing cost-efficiency

  14. The present

  15. It became a bit quiet about resilient software design

  16. Microservices became popular, but nothing else changed much ...

  17. Ignoring the effects of distribution • Architects ignore effects of distribution • Developers ignore effects of distribution • Everyone else expects things to become faster and cheaper • Development expects infrastructure to solve the problems • Operations curses and is stressed out

  18. You build it. You ignore it. Build things as you like and neglect the consequences of your actions!

  19. Why is this a problem?

  20. Distributed systems in a nutshell

  21. Everything fails, all the time. -- Werner Vogels

  22. Effects of distributed systems • Distributed systems introduce non-determinism regarding • Execution completeness • Message ordering • Communication timing • You will be affected by this at the application level • Don’t expect your infrastructure to hide all effects from you • Better know how to detect if it hit you and how to respond

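As one possible illustration of "know how to detect if it hit you and how to respond" on slide 22, here is a minimal Python sketch of application code that detects duplicated and reordered messages. It assumes the sender attaches a consecutive sequence number to every message, which the slides do not prescribe; `OrderedReceiver` and its fields are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedReceiver:
    """Hypothetical receiver: delivers messages in sequence order,
    buffers messages that arrive early and drops duplicates."""

    next_seq: int = 0                              # next sequence number to deliver
    pending: dict = field(default_factory=dict)    # early arrivals, keyed by seq
    delivered: list = field(default_factory=list)  # payloads in delivery order

    def receive(self, seq: int, payload: str) -> None:
        # Assumption: sequence numbers start at 0 and increase by 1 per message.
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate, e.g. caused by a sender-side retry
        self.pending[seq] = payload
        # Deliver everything that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

Feeding it the arrivals `(1, "b")`, `(0, "a")`, `(1, "b")`, `(3, "d")`, `(2, "c")` leaves `delivered` in sender order with the duplicate dropped. A message that never arrives would block delivery indefinitely in this sketch; handling that case (timeouts, gap requests) is deliberately left out.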
  23. What can the infrastructure do for us in such a setting?

  24. Infrastructure level means • Detect if a peer does not (timely) respond • Retry accessing the peer • Try to access a different instance from failover group • Try to fire up new instances • After instance loss is detected • If load exceeds a certain level (“autoscale”) • Throttle incoming requests • Notify administrators if additional action is required • …

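The first three bullets of slide 24 (detect a missing response, retry, fail over) can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular service mesh or load balancer works: each instance of the failover group is modelled as a zero-argument callable that raises `TimeoutError` or `ConnectionError` when the peer does not respond in time, and `call_with_failover` and `retries_per_instance` are hypothetical names.

```python
from typing import Callable, Sequence


def call_with_failover(
    instances: Sequence[Callable[[], str]],
    retries_per_instance: int = 2,
) -> str:
    """Try each instance in order, retrying it a few times before failing over."""
    last_error = None
    for call in instances:
        for _ in range(retries_per_instance):
            try:
                return call()  # first successful response wins
            except (TimeoutError, ConnectionError) as exc:
                last_error = exc  # peer did not respond: retry or fail over
    # Every instance was tried and none answered.
    raise RuntimeError("all instances failed") from last_error
```

Note that the function knows nothing about what the call does. Blind retries like this are only safe if the called operation can be repeated without side effects, and that is something the infrastructure cannot decide on its own.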
  25. This is quite a bit, but …

  26. Infrastructure level limitations • Not all failure modes supported (e.g., response failures) • Not all patterns supported (e.g., idempotency, fallback) • Not ubiquitously available (e.g., on-premises autoscale) • Often support from application level required (e.g., metrics) • Only undifferentiated, coarse-grained actions possible

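Idempotency and fallback, the two patterns slide 26 names as not covered by infrastructure, both need application knowledge. The following Python sketch shows one way they might look; it assumes an in-memory store and a caller-supplied idempotency key, and `PaymentService`, `charge` and `with_fallback` are hypothetical names invented for this example.

```python
class PaymentService:
    """Hypothetical service whose charge() can safely be retried."""

    def __init__(self):
        self._results = {}  # idempotency key -> receipt (in-memory assumption)
        self.charges = 0    # number of charges actually executed

    def charge(self, idempotency_key: str, amount: int) -> str:
        # A repeated request with a known key returns the stored result
        # instead of charging a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1
        receipt = f"receipt-{self.charges}-{amount}"
        self._results[idempotency_key] = receipt
        return receipt


def with_fallback(fetch, fallback_value):
    """Return fetch(), or a business-defined default if the peer is unavailable."""
    try:
        return fetch()
    except (TimeoutError, ConnectionError):
        return fallback_value
```

Which key identifies "the same request" and which default is acceptable to the business are functional decisions, which is why these patterns cannot be delegated to a generic infrastructure component.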
  27. The effects of distribution will still hit you at the application level

  28. The question no longer is if failures will hit you. The only question left is when and how bad they will hit you.

  29. Complexity of IT system landscapes grows continually ...

  30. System landscape complexity • New development projects only focus on local optimization • Ignoring impact on complexity of whole system landscape • Leads to disproportionate increase in complexity • New paradigms only focus on their advantages • Ignoring effects on complexity of whole system landscape • Leads to disproportionate increase in complexity • Only a matter of time until IT will collapse beyond repair

  31. This requires resilience thinking beyond application robustness. But it also requires more focus on application robustness, i.e., resilient software design

  32. Slowly, companies start to realize that there might be a problem ...

  33. ... that they might steer towards an abyss ...

  34. ... that they might need more resilience ...

  35. ... and must not neglect IT ...

  36. ... that their future viability might depend on their resilience

  37. Still, they usually act based on habits and old practice ...

  38. ... neglecting building resilience for short-term gains

  39. Bottom line

  40. Bottom line (present) • Understanding grows that resilience is needed at all levels • Complexity of IT landscapes has become a problem • Still, investments are scarce • “It’s going to be alright” mindset still prevalent

  41. The future

  42. Resilience will become the topic of the 21st century alongside sustainability

  43. Not because companies want to but because they do not have a choice

  44. Still, several challenges need to be solved first

  45. Homework that needs to be done

  46. Homework that needs to be done • Stop fighting about the “right” definition of resilience • In the end, all resilience proponents have the same goal • Debates about the “right” definition only confuse other people • Makes it harder to spread the ideas and their implementation

  47. Here is my suggestion ...

  48. resilience: The ability to successfully cope with adverse events and situations, including 1. handling expected adverse events and situations (robustness) 2. handling unexpected adverse events and situations (surprise) 3. improving due to adverse events and situations (anti-fragility). resilient software design: Designing and building software-based systems in ways that improve their dependability and thus support resilience according to the definition above

  49. Homework that needs to be done • Stop fighting about the “right” definition of resilience • Break traditional company habits • Maximizing efficiency cripples resilience

  50. [Figure: achievable resilience and achievable efficiency in relation to acceptable variance; both axes range from small to large]

  51. Homework that needs to be done • Stop fighting about the “right” definition of resilience • Break traditional company habits • Maximizing efficiency cripples resilience • Short-term thinking compromises resilience • Focus on minimizing short-term development costs compromises resilient software design • Huge change of ingrained mindset

  52. Homework that needs to be done • Stop fighting about the “right” definition of resilience • Break traditional company habits • Understand resilience in IT • Resilience is a socio-technical topic • Cannot be solved at the technical level alone • Cannot be solved with tools or products • Technology can only support

  53. Homework that needs to be done • Stop fighting about the “right” definition of resilience • Break traditional company habits • Understand resilience in IT • Understand resilient software design • Cannot be solved at the infrastructure level • Requires tight ops-dev feedback loops to be effective • Without a proper functional design, nothing else matters

  54. Now what?

  55. Some recommendations • Regarding system design • Mind the functional design • Strive for functional independence of runtime units • Then augment with resilience patterns • Domain-driven design can support

  56. Some recommendations • Regarding system design • Regarding software landscape grooming • Simplify! – Complexity is the enemy of resilience • Coordinate infrastructure, application and organization level measures

  57. Some recommendations • Regarding system design • Regarding software landscape grooming • Regarding IT organization and processes • Establish short feedback loops across the IT value chain • Make resilience a continuous improvement process • Include chaos engineering

  58. Some recommendations • Regarding system design • Regarding software landscape grooming • Regarding IT organization and processes • Regarding product functionality • Simplify! – Keep the product simple • Regularly remove features that are rarely or not used • Implement business metrics

  59. “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” -- Antoine de Saint-Exupéry

  60. Some recommendations • Regarding system design • Regarding software landscape grooming • Regarding IT organization and processes • Regarding product functionality • Regarding humans • Provide great user experience for all types of users • Provide training for all parties along the IT value chain

  61. More to ponder

  62. More to ponder • Organic computing • Interplay between resilience and sustainability • Interplay between resilience and security • Resilience beyond robustness, withstanding and recovery

  63. Summing up

  64. Summing up • Resilience is a huge multidisciplinary topic • Started as fault tolerance in IT • Had a little hype a few years ago • Will become an essential topic of the 21st century • Much more than fault tolerance or robustness alone • Awareness increases • Yet currently little investment • Lots of homework to be done

  65. The future is already here – it's just not evenly distributed. ― William Gibson

  66. Uwe Friedrichsen Works @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/