Engineering Resilient Systems through Cross-Disciplinary Insight

What can we learn from the resilience we see in organizations and software projects, and how can we apply it to build resilient IT infrastructure?

Volker Hilsheimer

February 02, 2013

  1. What do I know?
    • Trolltech, Nokia, CFEngine
    • Team lead, Facilitator
    • Couchsurfer
    • Taste-Discoverer
    • @vhilsheimer
  2. resilience [noun]
    The physical property of a material that can return to its original shape or position after deformation that does not exceed its elastic limit.
    The ability of an ecosystem to return to its original state after being disturbed.
    The ability to recover quickly from illness, change, or misfortune.
  3. Design vs Reality
    • Systems as Designed
        – State diagrams, flowcharts, models
        – Static and deterministic
    • Systems in Reality
        – Complex, non-linear
        – Dynamic, stochastic, non-deterministic
    • And yet, things work most of the time
  4. Why do things NOT go wrong?
    • Negatives have only limited use for improvement
        – Stability is not about the absence of something
        – It is not something a system has
        – It is something a system does
    • People hold the inherent imperfection together
        – Able to adjust the system beyond its limitations
        – Can anticipate, recognize, respond and learn
    How empowered are the people in your system?
  5. Resilience is the ability to
    Anticipate
    Recognize
    Respond
    Learn
    To engineer resilient systems we must therefore encourage the principles, methods and behaviors by which these qualities can be brought about.
  6. Resilience in Organizations
    Patterns
    • Try to understand
    • Establish a shared purpose
    • Engaging dialog, respect
    • Blame-free retrospection
    • Autonomy and self-organization
    • Promote collaboration, agreement
    • Create transparency
    • Encourage leadership at all levels
    • Take risks and fail fast
    Anti-Patterns
    • Focus on org-charts and hierarchy
    • Internal competition
    • Weed out failures
    • Over-commit, lack focus
    • Apply micro-management
    • Push change through top-down
    • Take past success for granted
    • Local optimization
    • Don’t take risks
  7. Resilience in Software Projects
    Patterns
    • Meritocratic, responsibility
    • Explicit policies, coding style
    • Short cycles creating user value
    • Modularity, testable code
    • Continuous integration
    • Time for refactoring, iterations
    • Fix bugs before writing new code
    • Build knowledge through code reviews and commit logs
    Anti-Patterns
    • Chief Architect and Planners
    • Change Control mechanisms
    • Painful release process
    • Spaghetti code, complex dependencies
    • (Waterfall) Plans over results
    • QA organization, late testing
    • Knowledge exists outside the code (if at all)
  8. Engineering Resilient IT Systems
    • Allow operators to anticipate and recognize
        – Automation frees the operators from mundane tasks
        – Connect monitoring to the ontology of the system
    • Enable reasoning about the system
        – Avoid black boxes – they block mental simulation
        – Make knowledge about the system part of the system
    • Enable fast failure and fast recovery
        – Infrastructure is testable code, specification, documentation
    • Remove bottlenecks and dependencies
        – Loose coupling and voluntary cooperation
        – Autonomy, agility
  9. Engineering Resilient IT Systems
    • Cross-functional teams with shared goals
        – DevOps, Kanban, Agile
    • Design for continuous maintenance
        – Not a periodic, planned activity
        – Make visible how components can safely be replaced
    • Empower the operators
        – We must trust them with the controls to our systems
        – Use tools and workflows that increase confidence, not control
  10. Can we measure resilience?
    Your system has a resilience score of 37. To increase the score, you can
    • Add more comments and use self-explanatory variable names in your policy code for service “webshop”
    • Share more knowledge – only one user has made 39 changes to the configuration of “apache” in the last 7 months
    • Increase redundancy – you have a linear dependency between hosts ws1, as1 and db1 for your mission-critical service “webshop”
    • Increase capacity for host db1 – CPU and disk IO are high at the same time as service “payment service” peaks
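
The slide does not say how such a score or its hints would be computed. As a minimal sketch only, two of the hints (knowledge held by a single user, and hosts without redundancy) could be derived from simple records. The `(user, component)` change-log format, the replica-count map, the user names, and both function names are assumptions made for illustration; only the host names and the figure of 39 changes come from the slide.

```python
from collections import Counter


def sole_contributors(changes):
    """Map each component changed by exactly one user to (user, change count).

    `changes` is an iterable of (user, component) pairs (hypothetical format).
    """
    per_component = {}
    for user, component in changes:
        per_component.setdefault(component, Counter())[user] += 1
    return {
        component: next(iter(counts.items()))
        for component, counts in per_component.items()
        if len(counts) == 1
    }


def unreplicated_hosts(chain, replicas):
    """Return the hosts in a dependency chain that run as a single instance.

    `replicas` maps host name to instance count; missing hosts count as 1.
    """
    return [host for host in chain if replicas.get(host, 1) < 2]


# Hypothetical change log: one user made all 39 changes to "apache".
changes = [("alice", "apache")] * 39 + [("alice", "nginx"), ("bob", "nginx")]
print(sole_contributors(changes))

# Host names from the slide; the replica counts are made up for illustration.
print(unreplicated_hosts(["ws1", "as1", "db1"], {"ws1": 2}))
```

With this sample data, "apache" is flagged as having a single contributor, and as1 and db1 are flagged as lacking a replica. Turning such findings into one number like 37 would need a weighting scheme that the slide leaves open.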