
Embedding a culture of experimentation and resilience at Condé Nast

This talk covers concepts from Chaos Engineering, Resilience Engineering, observability practices, and experimentation as part of the cultural norm throughout the entire software development and operations lifecycle. It also covers tooling for Chaos Engineering, architectural patterns to consider when deciding on a tooling strategy, and real-world examples of making Chaos Engineering part of the Condé Nast technology culture.

Crystal

January 28, 2020

Transcript

  1. Embedding a culture of experimentation and resilience at Condé Nast
    Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast (@cfhirschorn)
  2. BRAZIL AUSTRALIA ARABIA CHINA FRANCE GERMANY INDIA ITALY JAPAN KOREA LATIN AMERICA NETHERLANDS POLAND PORTUGAL SOUTH AFRICA SPAIN TAIWAN THAILAND TURKEY UK UKRAINE HUNGARY BULGARIA ICELAND ROMANIA CZECH REP SLOVAKIA MEXICO RUSSIA
  3. "Complexity doesn't allow us to think in linear, unidirectional terms

    along which progress or regress could be plotted."
  4. Common areas to test to verify systems’ robustness and people’s resilience to failures:
    - Redundancy and failover states
    - Scaling (auto and manual)
    - Load testing
    - Stateful applications
    - Stateless applications and services
    - “Unhappy paths” of code execution (ALFI, application-level fault injection; a minimal sketch follows after this list)
    - Request lifecycles and dependency maps
    - Authentication modes
    - Certificates
    - Vendor interfaces: what happens if a vendor “black box” does something unexpected?
    - Security: attack vectors (OWASP et al.)
    - Caching layers
    For further inspiration and reading: https://medium.com/@adhorn/the-chaos-engineering-collection-5e188d6a90e2
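As a concrete illustration of the ALFI item above, here is a minimal Python sketch of application-level fault injection. It is not the tooling used at Condé Nast; the decorator, failure rate and function names are hypothetical.

```python
import functools
import random


def inject_fault(failure_rate=0.1, exc_type=RuntimeError):
    """Hypothetical ALFI-style decorator: with probability `failure_rate`,
    the wrapped call raises `exc_type` instead of running, so the caller's
    "unhappy path" handling gets exercised."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc_type(f"injected fault in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_fault(failure_rate=0.2)
def fetch_front_page_image(url):
    # Stand-in for the real image-service call.
    return f"<img src='{url}'>"
```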
  5. Our first Game Day. Early 2018. Issue 0: TheoryCraft.
    An image has gone missing on the Vogue UK front page. Every time a customer loads the page, instead of the image they see a blank box. This may have caused an increase in 500s, which have been reported by the load balancer and triggered a page. Customer complaints, editors and engineers are all sure to notice in the next few minutes. The on-call agent has just received the following page: “Vogue UK 500s per minute > 100 for 5 minutes. See Rocket runbook.” What happens next?
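The page in this scenario encodes a simple threshold-over-window rule (500s per minute > 100 for 5 minutes). A sketch of how such a condition might be evaluated; the function name and the hard-coded counts are illustrative, not the deck's actual alerting setup.

```python
from collections import deque


def should_page(per_minute_500_counts, threshold=100, window_minutes=5):
    """Return True when the 500s-per-minute count has exceeded `threshold`
    for the last `window_minutes` consecutive minutes."""
    recent = deque(per_minute_500_counts, maxlen=window_minutes)
    return len(recent) == window_minutes and all(c > threshold for c in recent)


# Five consecutive minutes above 100 -> page the on-call engineer.
print(should_page([130, 145, 160, 152, 171]))  # True
```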
  6. Our first Game Day. Early 2018. Followed by other scenarios to test our internal response to incidents.
    Issue 3: ETCD de-sync.
    Test procedure: the ETCD process will be halted on all nodes, simulating a network-partition-type scenario.
    Expected outcome:
    - This should immediately escalate to the Cloud Platforms team (via PagerDuty) due to the number of alarms triggered and the scope of the outage.
    - The Cloud Platforms team should be able to identify that ETCD is no longer running, and will either replace the nodes or manually trigger a restart.
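One way the test procedure above might be scripted, assuming etcd runs as a systemd service reachable over SSH; the hostnames and commands are placeholders, not the actual Condé Nast setup.

```python
import subprocess

# Hypothetical etcd node hostnames; substitute the real cluster members.
ETCD_NODES = ["etcd-0.internal", "etcd-1.internal", "etcd-2.internal"]


def halt_etcd(node):
    """Stop the etcd process on a node to simulate a partition-like outage."""
    subprocess.run(["ssh", node, "sudo systemctl stop etcd"], check=True)


if __name__ == "__main__":
    for node in ETCD_NODES:
        halt_etcd(node)
        print(f"etcd halted on {node}; watch for alarms and a PagerDuty escalation")
```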
  7. Recent Game Days (Dec 2019 + Jan 2020). Scenario 1: Known Departures Issue.
    Test: the purpose of this scenario is to verify that people can marry up information in a runbook with incidents being reported by users. As such, we will trigger a Departures failure as an out-of-hours issue.
    Context/scene setting: it’s 9PM on a Friday. It is still US working hours and they’re rushing to get a change out for the weekend. (Let's say an editor is unhappy…) Service Operations have received a ticket and have no idea what it is related to; they have escalated it to us. The ticket reads: “US Engineering teams have reported that they cannot deploy their tracking pixel change that must be in place by 9AM on Monday, Departures is broken.”
  8. Recent Game Days (Dec 2019 + Jan 2020). Injecting chaos: the experiment conditions.
    - Sabre (the cluster scheduler) is killed.
    - Expectation is that Departures a) won’t show any available builds and b) will therefore block any deployments.
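The experiment follows the usual chaos-engineering shape: state a hypothesis, inject the failure, then verify the expectation. A minimal Python sketch of that shape; the host, endpoint and process names are hypothetical stand-ins, not the internal Departures or Sabre tooling.

```python
import json
import subprocess
import urllib.request

# Hypothetical endpoint listing builds that Departures offers for deployment.
DEPARTURES_BUILDS_URL = "http://departures.internal/api/builds"


def kill_scheduler():
    """Inject the failure: kill the cluster scheduler process on its host."""
    subprocess.run(["ssh", "scheduler.internal", "sudo pkill -f sabre"], check=True)


def available_builds():
    """Observe the system: fetch the builds Departures currently shows."""
    with urllib.request.urlopen(DEPARTURES_BUILDS_URL, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    kill_scheduler()
    builds = available_builds()
    # Hypothesis: with the scheduler down, no builds are available,
    # which in turn blocks deployments.
    assert builds == [], f"expected no available builds, got {builds}"
    print("Hypothesis held: Departures shows no builds while the scheduler is down")
```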
  9. A story of a web platform outage and the importance of observability.
    - Lack of instrumentation in the app (Datadog APM + other).
    - Initial punt at a “hypothesis”.
    - Enriching the context for all involved.
    - Attention drawn to key artefacts (graphs).
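The first point above, missing APM instrumentation, is often cheap to close. A hedged sketch of wrapping a hot code path in a Datadog APM span with the ddtrace library; the service, span and function names are illustrative, not the actual application.

```python
from ddtrace import tracer  # Datadog APM tracing client


def build_page(brand):
    # Stand-in for the real page-rendering logic.
    return f"<html><body>{brand} front page</body></html>"


def render_front_page(brand):
    # Wrap the hot path in a span so latency and errors show up in Datadog APM.
    with tracer.trace("render.front_page", service="web-frontend") as span:
        span.set_tag("brand", brand)
        return build_page(brand)


if __name__ == "__main__":
    print(render_front_page("vogue-uk"))
```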
  10. Capturing data points to enrich the post-mortem!
    - Another “signal” of something awry.
    - Cascading failures.
    - Stopping non-production/non-critical apps and processes.
    - My favourite: engineers are drawn to incidents like moths to a flame!
  11. In light of more context and information, another hypothesis is given.
    Automated messages: timelines of events as useful “metadata” for post-incident analysis.
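One lightweight way to capture such a timeline automatically, sketched with hypothetical event sources rather than the tooling behind the deck:

```python
from datetime import datetime, timezone

incident_timeline = []


def record_event(source, message):
    """Append a timestamped entry so the post-incident review gets an ordered timeline."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "message": message,
    }
    incident_timeline.append(entry)
    return entry


record_event("pagerduty", "Vogue UK 500s per minute > 100 for 5 minutes")
record_event("chat-bot", "Incident channel created; responders paged")
for event in incident_timeline:
    print(event["at"], event["source"], event["message"])
```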
  12. Actions. Other sources for learning opportunities.
    Action 1
    - Description: Gaps identified in architectural knowledge. Mary will do a two-week rotation to shadow and pair on team Orion.
    - Artefacts: Whiteboard diagrams from the post-incident review
    - Owner: Orion
    Action 2
    - Description: The Incident Management process did not flow in the expected order and escalations were delayed. Schedule more role playing and game days.
    - Artefacts: Game Day template; Incident Management Process
    - Owner: SRE
    Action 3
    - Description: Too many graphs are being displayed in a single dashboard, and many are not easily discernible by product engineering. Zenith to work with the Orion and Hydra teams on a system metrics visualisation strategy.
    - Artefacts: Datadog dashboard (timestamped to match incident timings)
    - Owner: Zenith
  13. [ASCII art banner: THANK YOU]
    Crystal Hirschorn, VP Engineering, Global Strategy & Operations, Condé Nast (@cfhirschorn)