How we Won the Super Bowl (Commercials)

Matt Kuritz

May 17, 2024

Transcript

  1. The Farmer’s Dog: The ‘known knowns’ (slide 6) @_kuritz /in/kuritz
    Yes…
    • Scale backends horizontally
    • Scale postgres DB vertically
    • Upgrade postgres
    • Load tests
    No…
    • Add a cache layer
    • Read Replicas
    • Finish in progress decoupling
    • New decoupling (extract components from API)
  2. Safety-I vs Safety-II (slide 10)
    Hollnagel E., Wears R.L. and Braithwaite J. From Safety-I to Safety-II: A White Paper. The Resilient Health Care Net: Published simultaneously by the University of Southern Denmark, University of Florida, USA, and Macquarie University, Australia.
  3. Safety-I vs Safety-II (slide 11)
    Hollnagel E., Wears R.L. and Braithwaite J. From Safety-I to Safety-II: A White Paper. The Resilient Health Care Net: Published simultaneously by the University of Southern Denmark, University of Florida, USA, and Macquarie University, Australia.
  4. Rasmussen Model (slide 13)
    Rasmussen. “Risk management in a dynamic society: a modelling problem”. Saf. Sci., 27 (2–3) (1997).
  5. Quote (slide 14)
    “From the perspective of overcoming the risk of brittleness, a third use of the label resilience becomes the idea of graceful extensibility - how a system extends performance, or brings extra adaptive capacity to bear, when surprise events challenge its boundaries.”
    Woods DD. “Four concepts for resilience and the implications for the future of resilience engineering”. Reliability Engineering and System Safety (2015), http://dx.doi.org/10.1016/j.ress.2015.03.018
  6. Pushing the workload boundary (slide 15)
    • 455 contributions in 28 days
    • 36% of 2023 contributions
    • Coding 7 of 8 weekend days
  7. Why it worked (slide 16)
    • Rested going in
    • Immediately took time off post cleanup
    • Lots of time away over summer to recover
  8. Finding the edge (slide 19)
    Chart labels: 125 req/s, 175 req/s, 1.2k req/s, 900 req/s
  9. Quote (slide 20)
    “Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the ‘edge of the envelope’.”
    Cook (1998). How Complex Systems Fail. (Chicago: CtL). https://how.complexsystems.fail/
  10. Quote (slide 23)
    “Adaptive capacities exist before changes and disruptions call upon those capacities. Systems possess varieties of adaptive capacity, and Resilience Engineering seeks to understand how these are built, sustained, degraded, and lost.”
    Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I. (Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.
  11. Practices -> Adaptive Capacity (slide 24)
    • Continuous deployment
    • Progressive rollouts + dark launches
    • Trunk based development + monorepos
      ◦ 1 version: main
      ◦ Everything is a parallel change
    • All engineers serve on call
  12. Capacity to deploy (slide 26)
    Chart axes: deploys/day weekly avg (left), deploys/day daily sum (right)
    More on platform/product and practices: Arrested DevOps Podcast
  13. Quote (slide 28)
    “Resilience engineering enhances the adaptive capacity needed for response to surprises. A system with adaptive capacity is poised to adapt. It has some readiness to change how it currently works - its models, plans, processes, behaviors”
    D. D. Woods and J. Allspaw, "Revealing the critical role of human performance in software", ACM Queue, vol. 17, no. 6, pp. 1-13, 2019. https://queue.acm.org/detail.cfm?id=3380776
  14. Changing Process + Behaviors (slide 30)
    • Project led by group of Engineers
    • PMs focussed on stakeholders, execution
    • Cross-PAWD daily standup
    • Traded some reliability for speed
    • Ensemble/Mob Programming
  15. Quote (slide 35)
    “Effective organizations build reciprocity across roles and levels (Ostrom, 2003). Reciprocity [...] is commitment to mutual assistance [...] one unit donates from their limited resources now to help another in their role, so both achieve benefits for overarching goals, and trusts that when the roles are reversed, the other unit will come to its aid.”
    Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I. (Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.
  16. Company Value - Play Team (slide 36)
    “We prioritize the success of the entire company and mission”
    Some Behaviors:
    • We take full individual and collective ownership, and are reliable team members
    • We go beyond our own “job” and “department”
    • We share bad news when we have it, and ask for help when we need it
    • We compete aggressively externally, but never with one another
  17. Questionable Defaults (slide 38)
    • AWS JS SDK v2 defaulted to 50 sockets max.
    • It also did not set ‘keepAlive’, so sockets were not reused.
    • Both changed in v3 of the SDK.
    • Discovered other JS libs also did not reuse sockets by default.
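The socket defaults described on this slide can be made explicit by sharing one keep-alive agent across requests. This is a minimal sketch using only Node's built-in `https` module. The helper name `makeKeepAliveAgent` and the limit of 200 sockets are illustrative assumptions, not values from the talk.

```typescript
import { Agent } from "node:https";

// Hypothetical helper: build one shared agent that reuses sockets.
// The default of 200 is an illustrative value, not a number from the talk.
function makeKeepAliveAgent(maxSockets = 200): Agent {
  return new Agent({
    keepAlive: true, // keep idle sockets open so later requests can reuse them
    maxSockets,      // upper bound on concurrent sockets per host
  });
}

// Create the agent once and share it; a new agent per request defeats reuse.
const sharedAgent = makeKeepAliveAgent();
```

With AWS SDK for JavaScript v2, an agent like this is typically supplied through the client's `httpOptions.agent` setting. The right socket limit depends on the workload and should be found by load testing.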
  18. Running out of IPs (slide 39)
    • k8s exhausted the IPs in its subnets
    • Our cluster is stateless, relatively easy to migrate
    • Launched new cluster in new subnets with more IPs
    • Platform team ensembled every day until fixed
    • <2 weeks to execute
  19. Too Many Postgres Connections (slide 40)
    • Postgres struggles with large number of connections and overhead of creating conns.
    • Solution: connection pooler proxy service
    • https://hub.docker.com/r/edoburu/pgbouncer/
    Image caption: pgbouncer improves postgresql performance
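The idea behind a connection pooler can be shown with a small in-process pool: many callers share a capped number of connections, and a released connection is handed to the next waiter instead of being recreated. This is only a sketch of the general technique, not how pgbouncer is implemented. The `Pool` class and its method names are hypothetical.

```typescript
// Illustrative pool: caps how many connections are ever created and
// reuses them across callers. All names here are hypothetical.
class Pool<T> {
  private idle: T[] = [];
  private waiters: Array<(conn: T) => void> = [];
  private created = 0;

  constructor(
    private readonly create: () => T, // stand-in for opening a DB connection
    private readonly max: number,     // hard cap on open connections
  ) {}

  // Resolve with an idle connection, a new one if under the cap,
  // or wait until another caller releases one.
  acquire(): Promise<T> {
    const conn = this.idle.pop();
    if (conn !== undefined) return Promise.resolve(conn);
    if (this.created < this.max) {
      this.created += 1;
      return Promise.resolve(this.create());
    }
    return new Promise<T>((resolve) => this.waiters.push(resolve));
  }

  // Hand the connection to the oldest waiter, or park it as idle.
  release(conn: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(conn);
    else this.idle.push(conn);
  }

  // Number of connections opened so far.
  get size(): number {
    return this.created;
  }
}
```

Five concurrent callers against a pool with `max` set to 2 open only two connections; the other three wait for a release. A proxy-style pooler applies the same cap between application processes and the database.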
  20. GraphQL Eager Loading (slide 42)
    • GraphQL was architected to eager load user details (and related data).
    • Too deeply ingrained into system to rip out in time.
    • Solution: create a ‘blocklist’ of queries and mutations that didn’t depend on the eager-loaded data, so they could bypass it.
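The bypass on this slide can be sketched as a lookup on the operation name before the request context is built. The operation names, the `Context` shape, and `buildContext` are all hypothetical. The talk does not show the real list or where the check was wired in.

```typescript
// Hypothetical operation names; the talk does not list the real ones.
const SKIP_EAGER_LOAD = new Set<string>(["HealthCheck", "SubmitLead"]);

interface User {
  id: string;
}

interface Context {
  user?: User;
}

// Stand-in for the expensive eager load of user details and related data.
type LoadUser = (userId: string) => User;

function buildContext(
  operationName: string | undefined,
  userId: string,
  loadUser: LoadUser,
): Context {
  // Only listed operations skip the load; unnamed or unknown operations
  // keep the original eager-loading behavior.
  if (operationName !== undefined && SKIP_EAGER_LOAD.has(operationName)) {
    return {};
  }
  return { user: loadUser(userId) };
}
```

Defaulting unknown operations to the original behavior keeps the change low risk: an operation only loses the eager load once someone has confirmed it does not need the data.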
  21. The ‘Audit Log’ Nested Txns (slide 43)
    • Used sequelize ‘findOrCreate’
    • Creates a nested transaction in postgres
  22. Nested Txn Problems (slide 44)
    Overview: Postgres.AI Post
    Related Incidents:
    • Sentry: Transaction ID Wraparound in Postgres
    • GitLab: Why we spent the last month eliminating PostgreSQL subtransactions
    • Migration lessons learned: Even Amazon can face mishaps with new tools
  23. Clearing the path (slide 45)
    5 min load test: 3000 req/s
  24. Lasting Impact (slide 48)
    • We experiment with our process more
    • Engineering has more capacity to work towards its goals
    • We are careful not to trade our resilience for robustness