THE FUTURE OF CHAOS ENGINEERING: IN PURSUIT OF THE UNKNOWN UNKNOWNS

Chaos Conf
September 26, 2019

Crystal Hirschorn, Condé Nast

"Systems fail all the time," goes the popular mantra in the reliability and resilience engineering fields. Given this premise, the practices of industry-leading organizations have accelerated and matured considerably compared with where we were even a few years ago. Organizations are beginning to stretch beyond homegrown approaches to building organizational resilience, leveraging expertise from across the industry and integrating these approaches directly into the software deployment lifecycle through commoditized Chaos services.

However, our systems and organizations keep growing in complexity under ever-increasing pressure for efficiency and scale. Our architectural approaches and paradigms keep shifting to cope with that complexity, as seen in the wide adoption of microservices and serverless development approaches.

A current limiting factor in running Chaos experiments is their contrived nature: we must think ahead about what could go wrong. Is this true to experience? What about the sense of surprise that usually pervades failure situations? How can we facilitate more random, generative experiments?
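One way to picture a "random, generative" experiment is to sample from a fault space rather than hand-picking each scenario. The sketch below is illustrative only and is not taken from the talk: the service names, fault types, the `Experiment` class and `generate_experiments` function are all hypothetical, and it only generates experiment descriptions without injecting any fault.

```python
import random
from dataclasses import dataclass

# Hypothetical fault space: targets and fault types are illustrative assumptions.
TARGETS = ["checkout-service", "search-service", "image-cdn"]
FAULTS = ["latency", "packet_loss", "process_kill", "dns_failure"]


@dataclass(frozen=True)
class Experiment:
    target: str
    fault: str
    magnitude: float  # fraction of traffic or hosts affected, 0.05 to max_magnitude


def generate_experiments(count, seed=None, max_magnitude=0.5):
    """Sample `count` experiments at random instead of choosing them by hand."""
    rng = random.Random(seed)  # a fixed seed lets a surprising run be replayed
    return [
        Experiment(
            target=rng.choice(TARGETS),
            fault=rng.choice(FAULTS),
            magnitude=round(rng.uniform(0.05, max_magnitude), 2),
        )
        for _ in range(count)
    ]


if __name__ == "__main__":
    for experiment in generate_experiments(3, seed=42):
        print(experiment)
```

Capping `max_magnitude` is one possible way to bound the blast radius while still leaving the choice of target and fault to chance.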

In this talk, Crystal will explore where our Chaos and Resilience practices must evolve to keep pace with the challenges of growing complexity.

Transcript

  1. The Future of Chaos Engineering: In Pursuit of the Unknown Unknowns. Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast, @cfhirschorn
  2. "Complexity doesn't allow us to think in linear, unidirectional terms along which progress or regress could be plotted."
  3. None
  4. BRAZIL, AUSTRALIA, ARABIA, CHINA, FRANCE, GERMANY, INDIA, ITALY, JAPAN, KOREA, LATIN AMERICA, NETHERLANDS, POLAND, PORTUGAL, SOUTH AFRICA, SPAIN, TAIWAN, THAILAND, TURKEY, UK, UKRAINE, HUNGARY, BULGARIA, ICELAND, ROMANIA, CZECH REP, SLOVAKIA, MEXICO, RUSSIA
  5. COMPLEX: Unknown Unknowns, Emergent Practice

    COMPLICATED: Known Unknowns, Good Practice
    CHAOTIC: Unknowables, Novel Practice
    SIMPLE: Known knowns, Best Practice
    Disorder
  6. https://www.youtube.com/watch?v=cefJd2v037U Experimenting effectively

  7. None
  8. Modern architecture evolution

  9. Modern architecture evolution

  10. Modern architectures: Microservices

  11. Modern architectures: Service Mesh

  12. Modern architectures: Serverless Synchronous (push) Asynchronous (event) Streaming

  13. Modern architectures: Applications

  14. Modern architectures: Front-end

  15. The Root Cause Fallacy: A Brief Story of a Web Platform Outage

  16. The Root Cause Fallacy: A Brief Story of a Web Platform Outage

  17. The Root Cause Fallacy: A Brief Story of a Web Platform Outage

  18. The Root Cause Fallacy: A Brief Story of a Web Platform Outage

  19. https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/ch04.html

  20. None
  21. None
  22. None
  23. None
  24. None
  25. “Progress depends on our changing the world to fit us. Not the other way around.”
  26. Organisational Pressures and Constraints: Regulators Policies Economics Competition Governance Logistics Management Outside influences Internal (org) influences Operator influences Efficiency Trade Offs Automation Time criticality Esoteric knowledge Mental models Ergonomics OpEx vs CapEx pressures Lacking details Culture norms Geopolitical Vendors Societal culture Workload Cognitive switching
  27. At what cost? https://www.gremlin.com/ecommerce-cost-of-downtime/

  28. http://www.safetydifferently.com/the-varieties-of-human-work/ An alternative approach to post mortems

  29. Invite a diverse audience to your post-incident learning reviews

  30. Actions. What can we turn into hypotheses / experiments?

  31. Actions. Other sources for learning opportunities.

    Action 1
    Description: Gaps identified in architectural knowledge. Mary will do a 2-week rotation to shadow and pair on team Orion.
    Artefacts: Whiteboard diagrams from post-incident review
    Owner: Orion

    Action 2
    Description: Incident Management process did not flow in expected order. Escalations were delayed. Schedule more role playing and game days.
    Artefacts: Game Day template, Incident Management Process
    Owner: SRE

    Action 3
    Description: Too many graphs are being displayed in a single dashboard. Many are not easily discernible by product engineering. Zenith to work with Orion and Hydra teams on system metrics visualisation strategy.
    Artefacts: DataDog dashboard (timestamped to match incident timings)
    Owner: Zenith
  32. CI/CD/CV pipelines

  33. Tooling and Toolchains

  34. Tooling and Toolchains https://medium.com/@adhorn/injecting-chaos-to-amazon-ec2-using-amazon-system-manager-ca95ee7878f5

  35. Multi-vector attacks

  36. It’s Stochastic, It’s Fantastic.

  37. THANK YOU. Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast, @cfhirschorn