Resilience as a Continuous Delivery Enabler

The traditional IT reliability strategy is robustness: optimising for MTBF (mean time between failures) by trying to maintain a failure-free production environment. This assumes failures are preventable, and depends on risk management theatre such as end-to-end testing. It leaves organisations trapped in discontinuous delivery, and in such circumstances a continuous delivery programme focused on throughput is unlikely to succeed.

A more effective reliability strategy in most scenarios is resilience: optimising for MTTR (mean time to repair) by responding rapidly to production failures. This assumes failures are inevitable, and depends on operability practices that increase the adaptive capacity of teams and services.
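As a rough illustration that is not taken from the talk itself, the two strategies can be compared with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below assumes that formula is an acceptable simplification, and every figure in it is hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 500 hours, 5 hours to repair.
baseline = availability(mtbf_hours=500.0, mttr_hours=5.0)

# Robustness investment (assumed): failures become half as frequent.
robust = availability(mtbf_hours=1000.0, mttr_hours=5.0)

# Resilience investment (assumed): repairs become ten times faster.
resilient = availability(mtbf_hours=500.0, mttr_hours=0.5)

for label, value in [("baseline", baseline), ("robust", robust), ("resilient", resilient)]:
    print(f"{label}: {value:.4%}")
```

With these made-up figures, a tenfold reduction in MTTR yields higher availability than doubling MTBF. Real services will differ, and availability is only one aspect of reliability.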

Resilience is also a great enabler of continuous delivery. A continuous delivery programme focused on resilience will increase stakeholder confidence, and lay the groundwork for challenging the risk management theatre of robustness in order to increase throughput.

During this talk, Steve Smith will explain why discontinuous delivery is part of the tradition of optimising for MTBF, and how optimising for MTTR can power continuous delivery adoption. This is an overview of a new approach to continuous delivery, backed by examples from private and public sector organisations.

The key learnings for participants are:

1. optimising for MTBF is an antiquated, flawed approach to IT reliability that results in long-term discontinuous delivery
2. if an organisation has optimised for MTBF, a continuous delivery programme focused on throughput is likely to fail
3. optimising for MTTR is a superior reliability strategy that advocates graceful extensibility to limit the impact of failures
4. resilience as a continuous delivery enabler is a heuristic that advocates resilience as the focus of a continuous delivery programme
5. improving the resilience of services by an order of magnitude makes it easier to offer practical alternatives to robustness risk management theatre


Steve Smith

May 31, 2018


  4. What is Reliability?
    “Reliability is the probability that a system will perform a required function without failure under stated conditions for a stated period of time”
    Patrick O’Connor and Andre Kleyner - Practical Reliability Engineering
    Reliable IT services keep an organisation running
    Without reliability, Continuous Delivery is worthless
  19. Optimise For Robustness
    “The complexity of these systems makes it impossible for them to run without multiple flaws being present”
    Richard Cook - How Complex Systems Fail
    A production environment is a complex system
    A production environment is always near failure
  22. Optimise For Resilience
  23. Optimise For Resilience
    “Graceful extensibility is the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries”
    David Woods - Four Concepts for Resilience
    Graceful extensibility comes from adaptive capacity
    Sources of adaptive capacity must be created
    Graceful extensibility leads to sustained adaptability
  40. Summary
    “Resilience and the ability to innovate... are essential”
    Dr Nicole Forsgren, Jez Humble, and Gene Kim - Accelerate
    Optimising For Robustness is a flawed strategy
    Optimising For Resilience is a superior approach, and is a great foundation for Continuous Delivery
