Crystal
September 26, 2019

The Future of Chaos Engineering: In Pursuit of the Unknown Unknowns

"Systems fail all the time" goes the popular mantra in the reliability and resilience engineering fields. Given this premise, the practices of industry-leading organisations have accelerated and matured considerably compared with where we were even a few years ago. Organisations are beginning to stretch beyond homegrown approaches to building organisational resilience: they are leveraging expertise from across the industry and integrating resilience practices directly into the software deployment lifecycle through commoditised Chaos services.

However, our systems and organisations keep growing in complexity under ever-increasing pressure for efficiency and scale. Our architectural approaches and paradigms keep shifting to cope with the complexity of distributed systems, as seen in the wide adoption of microservices, serverless, multi-tenancy and micro front-end development approaches.

A current limiting factor in running Chaos experiments is their contrived nature: we must think ahead about what could go wrong. Is this true to experience? What about the sense of surprise that usually pervades failure situations? How can we facilitate more random, generative experiments?

Transcript

  1. The Future of Chaos Engineering: In Pursuit of the Unknown Unknowns. Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast, @cfhirschorn
  2. "Complexity doesn't allow us to think in linear, unidirectional terms along which progress or regress could be plotted."
  3. BRAZIL, AUSTRALIA, ARABIA, CHINA, FRANCE, GERMANY, INDIA, ITALY, JAPAN, KOREA, LATIN AMERICA, NETHERLANDS, POLAND, PORTUGAL, SOUTH AFRICA, SPAIN, TAIWAN, THAILAND, TURKEY, UK, UKRAINE, HUNGARY, BULGARIA, ICELAND, ROMANIA, CZECH REP, SLOVAKIA, MEXICO, RUSSIA
  4. COMPLEX: Unknown Unknowns, Emergent Practice. COMPLICATED: Known Unknowns, Good Practice. CHAOTIC: Unknowables, Novel Practice. SIMPLE: Known knowns, Best Practice. Disorder.
  5. “Progress depends on our changing the world to fit us. Not the other way around.” Halt and Catch Fire
  6. Organisational Pressures and Constraints: Regulators, Policies, Economics, Competition, Governance, Logistics, Management, Outside influences, Internal (org) influences, Operator influences, Efficiency Trade Offs, Automation, Time criticality, Esoteric knowledge, Mental models, Ergonomics, OpEx vs CapEx pressures, Lacking details, Culture norms, Geopolitical, Vendors, Societal culture, Workload, Cognitive switching. "The Sharp and Blunt Ends of Large Complex Systems" by Richard Cook and David Woods
  7. Actions. Other sources for learning opportunities.

    Action 1. Description: Gaps identified in architectural knowledge. Mary will do a two-week rotation to shadow and pair on team Orion. Artefacts: Whiteboard diagrams from post-incident review. Owner: Orion

    Action 2. Description: Incident Management process did not flow in the expected order. Escalations were delayed. Schedule more role playing and game days. Artefacts: Game Day template, Incident Management Process. Owner: SRE

    Action 3. Description: Too many graphs are displayed in a single dashboard. Many are not easily discernible by product engineering. Zenith to work with the Orion and Hydra teams on a system metrics visualisation strategy. Artefacts: DataDog dashboard (timestamped to match incident timings). Owner: Zenith
  8. THANK YOU. Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast, @cfhirschorn