How we Won the Super Bowl (Commercials)

Presented at devopsdays Philadelphia: https://devopsdays.org/events/2024-philadelphia/program

Matt Kuritz

May 17, 2024

Transcript

  1. The Farmer’s Dog: The ‘known knowns’

    Yes…
    • Scale backends horizontally
    • Scale Postgres DB vertically
    • Upgrade Postgres
    • Load tests

    No…
    • Add a cache layer
    • Read replicas
    • Finish in-progress decoupling
    • New decoupling (extract components from API)

    @_kuritz /in/kuritz
  2. “From the perspective of overcoming the risk of brittleness, a third use of the label resilience becomes the idea of graceful extensibility - how a system extends performance, or brings extra adaptive capacity to bear, when surprise events challenge its boundaries.”

    Woods, D. D. (2015). “Four concepts for resilience and the implications for the future of resilience engineering”. Reliability Engineering and System Safety. http://dx.doi.org/10.1016/j.ress.2015.03.018
  3. Rasmussen Model

    Rasmussen. “Risk management in a dynamic society: a modelling problem”. Saf. Sci., 27 (2–3) (1997)
  4. The workload boundary

    • 455 contributions in 28 days
    • 36% of 2023 contributions
    • Coding 7 of 8 weekend days
  5. Why it worked

    • Rested going in
    • Immediately took time off post-cleanup
    • Lots of time away over summer to recover
  6. “Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the ‘edge of the envelope’.”

    Cook (1998). How Complex Systems Fail. (Chicago: CtL). https://how.complexsystems.fail/
  7. Finding the edge

    [Figure labels: 125 req/s, 175 req/s, 1.2k req/s, 900 req/s]
  8. “Adaptive capacities exist before changes and disruptions call upon those capacities. Systems possess varieties of adaptive capacity, and Resilience Engineering seeks to understand how these are built, sustained, degraded, and lost.”

    Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I. (Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.
  9. Practices -> Adaptive Capacity

    • Continuous deployment
    • Progressive rollouts + dark launches
    • Trunk-based development + monorepos
      ◦ 1 version: main
      ◦ Everything is a parallel change
    • All engineers serve on call
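
One way to picture the progressive rollouts and dark launches listed on this slide is a deterministic percentage gate. The sketch below is illustrative only and is not the team's tooling: `bucketFor`, `isEnabled`, `handle`, and the `new-checkout` flag name are hypothetical, and the rolling hash is a deliberately simple stand-in for whatever a real feature-flag system would use.

```typescript
// Hypothetical helper: map a string to a stable bucket in the range 0..99.
// The rolling hash is illustrative only; real flag systems use stronger hashes.
function bucketFor(key: string): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    // Keep the value below 2^32 so the arithmetic stays exact.
    hash = (hash * 31 + key.charCodeAt(i)) % 4294967296;
  }
  return hash % 100;
}

// A user gets the new code path only if their bucket is below the rollout percentage.
// Raising the percentage only ever adds users, so a rollout widens progressively.
function isEnabled(flag: string, userId: string, rolloutPercent: number): boolean {
  if (rolloutPercent <= 0) return false;
  if (rolloutPercent >= 100) return true;
  return bucketFor(flag + ":" + userId) < rolloutPercent;
}

// Dark launch: run the new path for gated users, but always serve the old result.
function handle(userId: string, oldPath: () => string, newPath: () => string): string {
  if (isEnabled("new-checkout", userId, 10)) {
    try {
      newPath(); // result discarded; only its side effects (e.g. metrics) are observed
    } catch (_err) {
      // A failure in the dark path must never affect the user-facing response.
    }
  }
  return oldPath();
}
```

Because the bucket depends only on the flag and user id, a given user stays on the same side of the gate between requests, which keeps a partial rollout predictable while it is being widened.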
  10. Capacity to deploy

    [Figure: deploys/day, weekly avg (left); deploys/day, daily sum (right)]

    More on platform/product and practices: Arrested DevOps Podcast
  11. “Resilience engineering enhances the adaptive capacity needed for response to surprises. A system with adaptive capacity is poised to adapt. It has some readiness to change how it currently works - its models, plans, processes, behaviors.”

    D. D. Woods and J. Allspaw, "Revealing the critical role of human performance in software", ACM Queue, vol. 17, no. 6, pp. 1-13, 2019. https://queue.acm.org/detail.cfm?id=3380776
  12. Changing Process + Behaviors

    • Project led by group of Engineers
    • PMs focussed on stakeholders, execution
    • Cross-PAWD daily standup
    • Traded some reliability for speed
    • Ensemble Programming
  13. Changing Plans

    • Q1 Plans -> 🗑
    • Stakeholders still concerned about metric targets
    • Execs helped quickly resolve discrepancy
  14. “Effective organizations build reciprocity across roles and levels (Ostrom, 2003). Reciprocity [...] is commitment to mutual assistance [...] one unit donates from their limited resources now to help another in their role, so both achieve benefits for overarching goals, and trusts that when the roles are reversed, the other unit will come to its aid.”

    Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I. (Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.
  15. Company Value - Play Team

    “We prioritize the success of the entire company and mission”

    Some Behaviors:
    • We take full individual and collective ownership, and are reliable team members
    • We go beyond our own “job” and “department”
    • We share bad news when we have it, and ask for help when we need it
    • We compete aggressively externally, but never with one another
  16. Questionable Defaults

    • AWS JS SDK v2 defaulted to 50 sockets max.
    • Also did not set ‘keepAlive’ so no reuse.
    • Both changed in v3 of SDK.
    • Discovered other JS libs also did not reuse sockets by default.
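
To make the socket defaults on this slide concrete, here is a minimal sketch of an explicitly configured Node.js HTTPS agent. It uses only Node's built-in `https` module; the `maxSockets` value of 200 and the names `agentOptions` and `sharedAgent` are assumptions chosen for illustration, not the values the team used.

```typescript
import { Agent } from "https";

// Assumed values for illustration; tune maxSockets to your own workload.
const agentOptions = {
  keepAlive: true, // reuse TCP/TLS connections instead of opening one per request
  maxSockets: 200, // per-host cap on concurrent sockets for this agent
};

// One shared agent, so every client built on it draws from the same socket pool.
const sharedAgent = new Agent(agentOptions);

// With AWS SDK for JavaScript v2, an agent like this is supplied through the
// client's `httpOptions.agent` setting, for example:
//   new AWS.S3({ httpOptions: { agent: sharedAgent } })
```

Sharing one agent across clients is a design choice rather than a requirement; it keeps the total socket count easy to reason about, at the cost of clients competing for the same per-host cap.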
  17. Running out of IPs

    • k8s exhausted the IPs in its subnets
    • Our cluster is stateless, relatively easy to migrate
    • Launched new cluster in new subnets with more IPs
    • Platform team ensembled every day until fixed
    • <2 weeks to execute
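
The arithmetic behind subnet exhaustion can be sketched as follows. The slide does not give the actual subnet sizes, so the /24 and /20 prefixes and the count of three subnets are hypothetical; the default of 5 reserved addresses per subnet matches AWS VPC subnets and is an assumption about the environment.

```typescript
// Total IPv4 addresses in a CIDR block with the given prefix length.
function totalAddresses(prefixLength: number): number {
  if (Math.floor(prefixLength) !== prefixLength || prefixLength < 0 || prefixLength > 32) {
    throw new Error("prefix length must be an integer from 0 to 32");
  }
  return Math.pow(2, 32 - prefixLength);
}

// Addresses left for workloads after the provider's per-subnet reservation.
// reservedPerSubnet = 5 matches AWS VPC subnets; other networks differ.
function usableAddresses(prefixLength: number, reservedPerSubnet: number = 5): number {
  return Math.max(0, totalAddresses(prefixLength) - reservedPerSubnet);
}

// Hypothetical comparison: three /24 subnets versus three /20 subnets.
const oldCapacity = 3 * usableAddresses(24); // 3 * 251 = 753
const newCapacity = 3 * usableAddresses(20); // 3 * 4091 = 12273
```

When each pod takes its own address from the subnet, a capacity in the hundreds can be consumed quickly by horizontal scaling, which is why moving to larger subnets buys so much headroom.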
  18. Too Many Postgres Connections

    • Postgres struggles with a large number of connections and the overhead of creating conns.
    • Solution: connection pooler proxy service
    • https://hub.docker.com/r/edoburu/pgbouncer/

    pgbouncer improves postgresql performance
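
A rough way to see why a pooler helps is to compare worst-case connection counts with and without one. This is a back-of-the-envelope sketch under stated assumptions: the replica count, per-replica pool size, and pooler limit are hypothetical numbers, and `poolerServerLimit` only loosely corresponds to a pool-size setting such as pgbouncer's `default_pool_size`.

```typescript
interface BackendFleet {
  replicas: number;           // number of backend processes or pods
  poolSizePerReplica: number; // max connections each replica's client pool may open
}

// Worst-case connections Postgres sees when every replica connects directly.
function directConnections(fleet: BackendFleet): number {
  return fleet.replicas * fleet.poolSizePerReplica;
}

// Worst-case server-side connections when a pooler sits in between:
// clients connect to the pooler, which caps what it opens to Postgres.
function pooledConnections(fleet: BackendFleet, poolerServerLimit: number): number {
  return Math.min(directConnections(fleet), poolerServerLimit);
}

// Hypothetical numbers for illustration only.
const fleet: BackendFleet = { replicas: 40, poolSizePerReplica: 10 };
const direct = directConnections(fleet);     // 400
const pooled = pooledConnections(fleet, 50); // 50
```

The direct figure grows with every replica added, while the pooled figure stays bounded, so horizontal scaling of the backends stops translating into more Postgres connections.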
  19. GraphQL Eager Loading

    • GraphQL was architected to eager load user details (and related data).
    • Too deeply ingrained into system to rip out in time.
    • Solution: create a ‘blocklist’ of queries and mutations that didn’t depend on it, so they could bypass the eager load.
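
The bypass described on this slide can be sketched as a check made while building the per-request context. Everything named here is hypothetical (`SKIP_EAGER_LOAD`, `buildContext`, the operation names, and the shape of `UserDetails`); the slide does not show how the real list was implemented, and a real loader would normally be asynchronous.

```typescript
interface UserDetails { id: string; displayName: string }
interface RequestContext { user?: UserDetails }

// Hypothetical operation names that never read user details.
const SKIP_EAGER_LOAD: string[] = ["healthCheck", "publicPlans"];

function needsUserDetails(operationName: string): boolean {
  return SKIP_EAGER_LOAD.indexOf(operationName) === -1;
}

// Build the per-request context, paying for the user lookup only when required.
function buildContext(operationName: string, loadUser: () => UserDetails): RequestContext {
  if (!needsUserDetails(operationName)) {
    return {};
  }
  return { user: loadUser() };
}
```

Listing the operations that skip the load, rather than the ones that need it, keeps the default behavior unchanged: anything not on the list still gets user details exactly as before.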
  20. The ‘Audit Log’ Nested Txns

    • Used sequelize ‘findOrCreate’
    • Creates a nested transaction in postgres
  21. Nested Txn Problems

    Overview: Postgres.AI Post

    Related Incidents:
    • Sentry: Transaction ID Wraparound in Postgres
    • GitLab: Why we spent the last month eliminating PostgreSQL subtransactions
    • Migration lessons learned: Even Amazon can face mishaps with new tools
  22. Clearing the path

    5 min load test: 3000 req/s
  23. Lasting Impact

    • We experiment with our process more
    • Engineering has more capacity to work towards its goals
    • We are careful not to trade our resilience for robustness