Embedding a culture of experimentation and resilience at Condé Nast

Crystal
January 28, 2020

Covering concepts from Chaos Engineering, Resilience Engineering, observability practices, and experimentation as part of the cultural norm throughout the entire software development and operations lifecycle. Also covers tooling for Chaos Engineering, architectural patterns to consider when deciding on a tooling strategy, and real-world examples of making Chaos Engineering part of the Condé Nast technology culture.

Transcript

  1. Embedding a culture of experimentation and resilience at Condé Nast
    Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast @cfhirschorn
  2. BRAZIL AUSTRALIA ARABIA CHINA FRANCE GERMANY INDIA ITALY JAPAN KOREA LATIN AMERICA NETHERLANDS POLAND PORTUGAL SOUTH AFRICA SPAIN TAIWAN THAILAND TURKEY UK UKRAINE HUNGARY BULGARIA ICELAND ROMANIA CZECH REP SLOVAKIA MEXICO RUSSIA
  3. "Complexity doesn't allow us to think in linear, unidirectional terms

    along which progress or regress could be plotted."
  4. Common areas to test to verify systems’ robustness and people’s resilience to failures:
    - Redundancy and failover states
    - Scaling (auto and manual)
    - Load testing
    - Stateful applications
    - Stateless applications and services
    - “Unhappy paths” of code execution (ALFI; a rough sketch follows this list)
    - Request lifecycles and dependency maps
    - Authentication modes
    - Certs
    - Vendor interfaces: what happens if a vendor “black box” does something unexpected?
    - Security: attack vectors (OWASP et al.)
    - Caching layers
    For further inspiration and reading: https://medium.com/@adhorn/the-chaos-engineering-collection-5e188d6a90e2
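    The “unhappy paths (ALFI)” item refers to application-level fault injection: deliberately raising errors or adding latency inside the application so that error-handling code paths actually get exercised. A minimal Python sketch of the idea, where the decorator name, probabilities and wrapped function are purely illustrative and not any specific tool, might look like this:

```python
import random
import time
from functools import wraps

def inject_fault(failure_rate=0.05, latency_s=2.0, latency_rate=0.05):
    """Hypothetical application-level fault injection (ALFI) decorator.

    With small probabilities, raise an error or add latency so that the
    "unhappy paths" around the wrapped call get exercised.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < failure_rate:
                # Simulate a dependency failure the caller must handle.
                raise RuntimeError(f"chaos: injected failure in {fn.__name__}")
            if roll < failure_rate + latency_rate:
                # Simulate a slow downstream call.
                time.sleep(latency_s)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_fault(failure_rate=0.1)
def fetch_front_page_image(image_id):
    # Placeholder for a real image-service call.
    return f"https://images.example.com/{image_id}.jpg"
```

    In practice the rates would come from experiment configuration (and default to zero outside a scheduled experiment) rather than being hard-coded.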
  5. Our first Game Day, early 2018. Issue 0: TheoryCraft. An image has gone missing on the Vogue UK front page. Every time a customer loads the page, they see a blank box instead of the image. This may have caused an increase in 500s, which have been reported by the load balancer and triggered a page. Customer complaints, editors and engineers are all sure to notice in the next few minutes. The on-call agent has just received the following page: “Vogue UK 500s per minute > 100 for 5 minutes. See Rocket runbook.” What happens next?
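    The page quoted above (“500s per minute > 100 for 5 minutes”) is essentially a sliding-window threshold. As a toy illustration only, not the actual monitoring configuration, the condition could be expressed like this:

```python
from collections import deque

class ErrorRateAlert:
    """Toy sketch of the pager condition: fire when 500s per minute
    exceed the threshold for 5 consecutive minutes."""

    def __init__(self, threshold_per_min=100, window_minutes=5):
        self.threshold = threshold_per_min
        self.window = deque(maxlen=window_minutes)

    def record_minute(self, count_500s):
        # Record the latest per-minute count and evaluate the condition.
        self.window.append(count_500s)
        return self.should_page()

    def should_page(self):
        return (len(self.window) == self.window.maxlen
                and all(c > self.threshold for c in self.window))

# Example: five consecutive minutes above the threshold triggers a page.
alert = ErrorRateAlert()
for count in [120, 150, 130, 160, 140]:
    fired = alert.record_minute(count)
print("page on-call" if fired else "no page")
```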
  6. Our first Game Day, early 2018, was followed by other scenarios to test our internal response to incidents. Issue 3: ETCD de-sync.
    Test procedure: the ETCD process will be halted on all nodes, simulating a network-partition-type scenario.
    Expected outcome:
    - This should immediately escalate to the Cloud Platforms team (via PagerDuty) due to the number of alarms triggered and the scope of the outage.
    - The Cloud Platforms team should be able to identify that ETCD is no longer running, and will either replace the nodes or manually trigger a restart.
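    A rough sketch of how such a procedure could be driven, assuming the nodes are reachable over SSH and etcd runs as a systemd unit (node names, unit name and access method are all assumptions, not the actual setup):

```python
import subprocess

# Hypothetical "ETCD de-sync" driver: halt etcd on every node to approximate
# a partition, observe the response, then restore the cluster afterwards.
ETCD_NODES = ["etcd-0.internal", "etcd-1.internal", "etcd-2.internal"]

def run_on_node(node, command):
    """Run a shell command on a node over SSH and return its exit code."""
    return subprocess.run(["ssh", node, command], check=False).returncode

def halt_etcd():
    for node in ETCD_NODES:
        run_on_node(node, "sudo systemctl stop etcd")

def restore_etcd():
    for node in ETCD_NODES:
        run_on_node(node, "sudo systemctl start etcd")

if __name__ == "__main__":
    halt_etcd()
    input("etcd halted on all nodes; observe alerts and escalation, "
          "then press Enter to restore...")
    restore_etcd()
```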
  7. Recent Game Days (Dec 2019 + Jan 2020). Scenario 1: Known Departures Issue.
    Test: The purpose of this scenario is to verify that people can marry up information in a runbook with incidents being reported by users. As such, we will trigger a Departures failure as an out-of-hours issue.
    Context/scene setting: It’s 9PM on a Friday. It is still US working hours and they’re rushing to get a change out for the weekend. (Let’s say an editor is unhappy…). Service Operations have received a ticket, have no idea what it is related to, and have escalated it to us. The ticket reads: “US Engineering teams have reported that they cannot deploy their tracking pixel change that must be in place by 9AM on Monday; Departures is broken.”
  8. Recent Game Days (Dec 2019 + Jan 2020). Injecting chaos: the experiment conditions.
    - Sabre (the cluster scheduler) is killed.
    - The expectation is that Departures a) won’t show any available builds and b) will therefore block any deployments.
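    One way to make that expectation checkable is to assert it against the deployment tool once the scheduler has been killed. The sketch below is illustrative only: the Departures host, endpoints, service name and response shapes are invented, not the real API.

```python
import requests

# Hypothetical verification of the experiment's hypothesis after Sabre
# (the cluster scheduler) has been killed.
DEPARTURES_API = "https://departures.example.internal"

def available_builds(service):
    resp = requests.get(f"{DEPARTURES_API}/services/{service}/builds", timeout=5)
    resp.raise_for_status()
    return resp.json().get("builds", [])

def deploy_is_blocked(service):
    resp = requests.post(f"{DEPARTURES_API}/services/{service}/deploy",
                         json={"build": "latest"}, timeout=5)
    # Expect the deploy request to be rejected while the scheduler is down.
    return resp.status_code >= 400

if __name__ == "__main__":
    service = "vogue-uk-frontend"  # placeholder service name
    assert available_builds(service) == [], "hypothesis violated: builds still listed"
    assert deploy_is_blocked(service), "hypothesis violated: deploy was accepted"
    print("Hypothesis holds: no builds shown and deployments are blocked.")
```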
  9. A story of a web platform outage and the importance of observability.
    - Lack of instrumentation in the app (Datadog APM + other)
    - Initial punt at a “hypothesis”
    - Enriching the context for all involved
    - Attention drawn to key artefacts (graphs)
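    On the instrumentation gap: adding custom trace spans with Datadog’s ddtrace library can look roughly like the sketch below. Service, span and function names are placeholders, and the exact API may differ between ddtrace versions, so treat this as an illustration rather than a drop-in snippet.

```python
from ddtrace import tracer  # Datadog APM tracing library

def fetch_hero_image(brand):
    # Placeholder for a real image-service call.
    return f"https://images.example.com/{brand}/hero.jpg"

@tracer.wrap(name="render.front_page", service="web-frontend")
def render_front_page(brand):
    # Add a custom child span so the image fetch shows up in the trace
    # flame graph instead of being invisible to responders.
    with tracer.trace("fetch.hero_image", service="web-frontend") as span:
        span.set_tag("brand", brand)
        hero = fetch_hero_image(brand)
    return f"<img src='{hero}'>"
```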
  10. Capturing data points to enrich the post-mortem! Another “signal” of something awry. Cascading failures. Stopping non-production/non-critical apps and processes. My favourite: engineers are drawn to incidents like moths to a flame!
  11. In light of more context and information, another hypothesis is given. Automated messages: timelines of events as useful “metadata” for post-incident analysis.
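    The idea of automated messages as post-incident metadata boils down to collecting timestamped events from every source into a single timeline. A minimal sketch, with the class name and event sources invented for illustration:

```python
from datetime import datetime, timezone
import json

class IncidentTimeline:
    """Hypothetical sketch: collect timestamped events (alerts, bot messages,
    human notes) during an incident so the post-incident review has an
    accurate timeline to work from."""

    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.events = []

    def record(self, source, message):
        self.events.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "source": source,   # e.g. "pagerduty", "slack-bot", "human"
            "message": message,
        })

    def export(self):
        return json.dumps({"incident": self.incident_id,
                           "events": self.events}, indent=2)

# Example usage with events drawn from the scenario above.
timeline = IncidentTimeline("INC-1234")
timeline.record("pagerduty", "Vogue UK 500s per minute > 100 for 5 minutes")
timeline.record("human", "Suspect missing hero image on the front page")
print(timeline.export())
```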
  12. Actions. Other sources for learning opportunities.
    Action 1
      Description: Gaps identified in architectural knowledge. Mary will do a two-week rotation to shadow and pair on team Orion.
      Artefacts: Whiteboard diagrams from the post-incident review
      Owner: Orion
    Action 2
      Description: The Incident Management process did not flow in the expected order and escalations were delayed. Schedule more role playing and Game Days.
      Artefacts: Game Day template; Incident Management process
      Owner: SRE
    Action 3
      Description: Too many graphs are being displayed in a single dashboard, and many are not easily discernible by product engineering. Zenith to work with the Orion and Hydra teams on a system metrics visualisation strategy.
      Artefacts: Datadog dashboard (timestamped to match incident timings)
      Owner: Zenith
  13. THANK YOU
    Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast @cfhirschorn