Engineering Resilient Systems through Cross-Disciplinary Insight
What can we learn from the resilience we see in organizations and software projects, and how can we apply it to build resilient IT infrastructure?
return to its original shape or position after deformation that does not exceed its elastic limit
The ability of an ecosystem to return to its original state after being disturbed
The ability to recover quickly from illness, change, or misfortune
flowcharts, models
  – Static and deterministic
• Systems in Reality
  – Complex, non-linear
  – Dynamic, stochastic, non-deterministic
• And yet, things work most of the time
limited use for improvement
  – Stability is not about the absence of something
  – It is not something a system has
  – It is something a system does
• People hold the inherent imperfection together
  – Able to adjust the system beyond its limitations
  – Can anticipate, recognize, respond and learn
How empowered are the people in your system?
a shared purpose
• Engaging dialog, respect
• Blame-free retrospection
• Autonomy and self-organization
• Promote collaboration, agreement
• Create transparency
• Encourage leadership at all levels
• Take risks and fail fast
Anti-Patterns
• Focus on org-charts and hierarchy
• Internal competition
• Weed out failures
• Over-commit, lack focus
• Apply micro-management
• Push change through top-down
• Take past success for granted
• Local optimization
• Don’t take risks
policies, coding style
• Short cycles creating user value
• Modularity, testable code
• Continuous integration
• Time for refactoring, iterations
• Fix bugs before writing new code
• Build knowledge through code reviews and commit logs
Anti-Patterns
• Chief Architect and Planners
• Change Control mechanisms
• Painful release process
• Spaghetti code, complex dependencies
• (Waterfall) Plans over results
• QA organization, late testing
• Knowledge exists outside the code (if at all)
recognize
  – Automation frees the operators from mundane tasks
  – Connect monitoring to the ontology of the system
• Enable reasoning about the system
  – Avoid black boxes: they block mental simulation
  – Make knowledge about the system part of the system
• Enable fast failure and fast recovery
  – Infrastructure is testable code, specification, documentation
• Remove bottlenecks and dependencies
  – Loose coupling and voluntary cooperation
  – Autonomy, agility
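The point that infrastructure can be testable code, specification and documentation at once can be illustrated with a minimal sketch. It assumes a hypothetical `ServiceSpec` schema and a hypothetical `check` function; neither comes from the slides or from any particular configuration management tool. The declared spec documents the intended state, and the same object drives an automated check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Desired state of one service (hypothetical schema, for illustration)."""
    name: str
    port: int
    min_instances: int


def check(spec: ServiceSpec, observed: dict) -> list[str]:
    """Compare observed state with the spec and return readable violations."""
    problems = []
    running = observed.get("instances", 0)
    if running < spec.min_instances:
        problems.append(
            f"{spec.name}: {running} instance(s) running, "
            f"{spec.min_instances} required"
        )
    if observed.get("port") != spec.port:
        problems.append(
            f"{spec.name}: listening on {observed.get('port')}, "
            f"expected {spec.port}"
        )
    return problems


# The spec is the documentation: it states what "healthy" means for the service.
# The port and instance count are assumed values, not taken from the slides.
WEBSHOP = ServiceSpec(name="webshop", port=443, min_instances=2)

if __name__ == "__main__":
    # Observed state would normally come from monitoring; here it is hard-coded.
    for line in check(WEBSHOP, {"instances": 1, "port": 443}):
        print(line)
```

Because the check returns plain messages rather than raising, an operator can read the result directly, which fits the aim of avoiding black boxes.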
  – DevOps, Kanban, Agile
• Design for continuous maintenance
  – Not a periodic, planned activity
  – Make visible how components can safely be replaced
• Empower the operators
  – We must trust them with the controls to our systems
  – Use tools and workflows that increase confidence, not control
of 37. To increase the score, you can:
• Add more comments and use self-explanatory variable names in your policy code for service “webshop”
• Share more knowledge: only one user has made 39 changes to the configuration of “apache” in the last 7 months
• Increase redundancy: you have a linear dependency between hosts ws1, as1 and db1 for your mission-critical service “webshop”
• Increase capacity for host db1: CPU and disk IO are high at the same time as service “payment service” peaks
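Two of the suggestions above, redundancy and knowledge sharing, could be derived mechanically from data the system already holds. The sketch below is one possible way to do so, not the method behind the score shown on the slide. The tier layout, the change-log format, the user name "alice" and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical inventory: each tier of "webshop" and the hosts that serve it.
WEBSHOP_TIERS = [
    ("web", ["ws1"]),
    ("app", ["as1"]),
    ("database", ["db1"]),
]

# Hypothetical change log as (user, component) pairs; "alice" is made up.
CHANGE_LOG = [("alice", "apache")] * 39


def single_points_of_failure(tiers):
    """Return hosts that are the only member of their tier."""
    return [hosts[0] for _tier, hosts in tiers if len(hosts) == 1]


def sole_maintainers(change_log):
    """Return {component: user} where exactly one user made every change."""
    users = defaultdict(set)
    for user, component in change_log:
        users[component].add(user)
    return {c: next(iter(u)) for c, u in users.items() if len(u) == 1}


def suggestions(service, tiers, change_log):
    """Turn the two findings into operator-readable hints."""
    hints = []
    spofs = single_points_of_failure(tiers)
    if spofs:
        hosts = ", ".join(spofs)
        hints.append(
            f'Increase redundancy: {hosts} have no replica for service "{service}"'
        )
    for component, user in sole_maintainers(change_log).items():
        hints.append(f'Share more knowledge: only {user} has changed "{component}"')
    return hints


if __name__ == "__main__":
    for hint in suggestions("webshop", WEBSHOP_TIERS, CHANGE_LOG):
        print(hint)
```

Keeping each finding in its own small function lets operators inspect and test the reasoning behind every hint.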