The Future of Chaos Engineering: In Pursuit of the Unknown Unknowns

Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast (@cfhirschorn)

"Complexity doesn't allow us to think in linear, unidirectional terms
along which progress or regress could be plotted."

BRAZIL, AUSTRALIA, ARABIA, CHINA, FRANCE, GERMANY, INDIA, ITALY, JAPAN, KOREA, LATIN AMERICA, NETHERLANDS, POLAND, PORTUGAL, SOUTH AFRICA, SPAIN, TAIWAN, THAILAND, TURKEY, UK, UKRAINE, HUNGARY, BULGARIA, ICELAND, ROMANIA, CZECH REP, SLOVAKIA, MEXICO, RUSSIA

COMPLEX: Unknown Unknowns (Emergent Practice)
COMPLICATED: Known Unknowns (Good Practice)
CHAOTIC: Unknowables (Novel Practice)
SIMPLE: Known knowns (Best Practice)
Disorder

https://www.youtube.com/watch?v=cefJd2v037U

Experimenting effectively

Modern architecture evolution: Where we started

Modern architecture evolution

Modern architectures: Microservices

Modern architectures: Service Mesh

Modern architectures: Serverless. Synchronous (push), asynchronous (event), streaming

Modern architectures: Applications

Modern architectures: Micro front-ends

The Root Cause Fallacy: A Brief Story of a Web Platform Outage

https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/ch04.html

“Progress depends on our changing the world to fit us. Not the other way around.”
Halt and Catch Fire

Organisational Pressures and Constraints
Influence categories: Outside influences; Internal (org) influences; Operator influences
Factors: Regulators, Policies, Economics, Competition, Governance, Logistics, Management, Efficiency Trade Offs, Automation, Time criticality, Esoteric knowledge, Mental models, Ergonomics, OpEx vs CapEx pressures, Lacking details, Culture norms, Geopolitical, Vendors, Societal culture, Workload, Cognitive switching
Source: The Sharp and Blunt Ends of Large Complex Systems by Richard Cook and David Woods

At what cost? https://www.gremlin.com/ecommerce-cost-of-downtime/

http://www.safetydifferently.com/the-varieties-of-human-work/

An alternative approach to post mortems

Invite a diverse audience to your post-incident learning reviews

Actions. What can we turn into hypotheses / experiments?

Actions. Other sources for learning opportunities.

Action 1
Description: Gaps identified in architectural knowledge. Mary will do a two-week rotation to shadow and pair on team Orion.
Artefacts: Whiteboard diagrams from post-incident review
Owner: Orion

Action 2
Description: Incident Management process did not flow in the expected order. Escalations were delayed. Schedule more role playing and game days.
Artefacts: Game Day template, Incident Management Process
Owner: SRE

Action 3
Description: Too many graphs are displayed in a single dashboard. Many are not easily discernible by product engineering. Zenith to work with the Orion and Hydra teams on a system metrics visualisation strategy.
Artefacts: DataDog dashboard (timestamped to match incident timings)
Owner: Zenith
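
As a rough illustration of turning a post-incident action into a hypothesis and experiment, the sketch below records one as structured data. The `Experiment` schema, its field names, and the wording of the hypothesis are assumptions made for illustration; they are not taken from the talk or from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experiment:
    """A chaos experiment derived from a post-incident action (hypothetical schema)."""
    source_action: str   # the review action that prompted the experiment
    hypothesis: str      # what we expect to remain true during the experiment
    method: str          # how the condition is introduced
    owner: str           # team accountable for running it
    artefacts: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-line description suitable for a backlog or game day agenda."""
        return f"[{self.owner}] {self.hypothesis} (from: {self.source_action})"


# Illustrative only: the hypothesis text is an assumed phrasing of Action 2.
escalation_drill = Experiment(
    source_action="Action 2: escalations were delayed",
    hypothesis="During a game day, escalations follow the documented order",
    method="Role-played incident using the Game Day template",
    owner="SRE",
    artefacts=["Game Day template", "Incident Management Process"],
)

print(escalation_drill.summary())
```

Keeping the originating action on the record makes it possible to trace each experiment back to the review that motivated it.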

CI/CD/CV pipelines https://www.verica.io/continuous-verification/

Tooling and Toolchains https://github.com/dastergon/awesome-chaos-engineering#notable-tools

Tooling and Toolchains https://medium.com/@adhorn/injecting-chaos-to-amazon-ec2-using-amazon-system-manager-ca95ee7878f5

Multi-vector attacks

It’s Stochastic, It’s Fantastic.
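
A minimal sketch of one way to combine the two ideas above: pick several fault vectors at random and inject them together. The vector names and the `pick_attack` helper are hypothetical; a real chaos toolchain defines its own fault types and scheduling.

```python
import random
from typing import List, Tuple

# Hypothetical fault vectors, named here only for illustration.
FAULT_VECTORS = ["latency", "packet_loss", "cpu_pressure", "dependency_outage"]


def pick_attack(rng: random.Random, vectors: List[str], k: int = 2) -> Tuple[str, ...]:
    """Choose k distinct fault vectors at random to inject together."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    return tuple(sorted(rng.sample(vectors, k)))


# Seeding the generator lets a surprising run be replayed later.
rng = random.Random(42)
attack = pick_attack(rng, FAULT_VECTORS, k=2)
print(attack)
```

Passing the random generator in explicitly keeps the selection stochastic across runs while still allowing any single run to be reproduced from its seed.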

Thank you
