R.L. and Braithwaite J. From Safety-I to Safety-II: A White Paper. The Resilient Health Care Net: Published simultaneously by the University of Southern Denmark, University of Florida, USA, and Macquarie University, Australia.
risk of brittleness, a third use of the label resilience becomes the idea of graceful extensibility - how a system extends performance, or brings extra adaptive capacity to bear, when surprise events challenge its boundaries.
14 Woods DD. “Four concepts for resilience and the implications for the future of resilience engineering”. Reliability Engineering and System Safety (2015), http://dx.doi.org/10.1016/j.ress.2015.03.018i
@_kuritz /in/kuritz
operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the “edge of the envelope”.
20 Cook (1998). How Complex Systems Fail. (Chicago: CtL). https://how.complexsystems.fail/
disruptions call upon those capacities. Systems possess varieties of adaptive capacity, and Resilience Engineering seeks to understand how these are built, sustained, degraded, and lost.
23 Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I. (Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.
dark launches
• Trunk based development + monorepos
  ◦ 1 version: main
  ◦ Everything is a parallel change
• All engineers serve on call
24 Practices -> Adaptive Capacity
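As an illustration of the "dark launches" practice named on this slide, here is a minimal sketch in TypeScript. It assumes a synchronous handler and a crude JSON-based comparison. `darkLaunch`, `oldPrice`, and `newPrice` are hypothetical names, not anything from the talk. The candidate code path runs on real inputs, but callers only ever see the result of the current path.

```typescript
type Handler<I, O> = (input: I) => O;

// Wrap the current implementation so a candidate runs "in the dark":
// its result is compared and reported, but never returned to the caller.
function darkLaunch<I, O>(
  current: Handler<I, O>,
  candidate: Handler<I, O>,
  onMismatch: (input: I, expected: O, actual: O | Error) => void,
): Handler<I, O> {
  return (input: I): O => {
    const expected = current(input);
    try {
      const actual = candidate(input);
      // Crude structural comparison; good enough for plain JSON-like values.
      if (JSON.stringify(actual) !== JSON.stringify(expected)) {
        onMismatch(input, expected, actual);
      }
    } catch (err) {
      // A crashing candidate must never affect the caller.
      onMismatch(input, expected, err instanceof Error ? err : new Error(String(err)));
    }
    return expected;
  };
}

// Hypothetical example: a new pricing rule shadowing the old one.
const oldPrice = (qty: number): number => qty * 10;
const newPrice = (qty: number): number => (qty > 5 ? qty * 9 : qty * 10);

const mismatchedInputs: number[] = [];
const price = darkLaunch(oldPrice, newPrice, (input) => {
  mismatchedInputs.push(input);
});
```

Calling `price(10)` still returns the old result while recording that the candidate disagreed for that input. The disagreement is what gets reviewed before the new path is switched on.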
needed for response to surprises. A system with adaptive capacity is poised to adapt. It has some readiness to change how it currently works - its models, plans, processes, behaviors
28 D. D. Woods and J. Allspaw, "Revealing the critical role of human performance in software", ACM Queue, vol. 17, no. 6, pp. 1-13, 2019. https://queue.acm.org/detail.cfm?id=3380776
and levels (Ostrom, 2003). Reciprocity [...] is commitment to mutual assistance [...] one unit donates from their limited resources now to help another in their role, so both achieve benefits for overarching goals, and trusts that when the roles are reversed, the other unit will come to its aid.
35 Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I. (Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.
company and mission”
Some Behaviors:
• We take full individual and collective ownership, and are reliable team members
• We go beyond our own “job” and “department”
• We share bad news when we have it, and ask for help when we need it
• We compete aggressively externally, but never with one another
36 Company Value - Play Team
50 sockets max.
• Also did not set ‘keepAlive’ so no reuse.
• Both changed in v3 of SDK.
• Discovered other JS libs also did not reuse sockets by default.
38 Questionable Defaults
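The socket-reuse point above can be sketched with Node.js's built-in `https.Agent`, whose `keepAlive` and `maxSockets` options control connection reuse and the per-host socket cap. The slide does not name the SDK or show its configuration, so this is only an assumption about how such a client might be configured. `makeKeepAliveAgent` is a hypothetical helper.

```typescript
import * as https from "node:https";

// Hypothetical helper: build an agent that reuses sockets instead of
// opening a new connection (and TLS handshake) for every request.
function makeKeepAliveAgent(maxSockets: number = 50): https.Agent {
  return new https.Agent({
    keepAlive: true, // keep idle sockets open so later requests can reuse them
    maxSockets,      // cap on concurrent sockets per host
  });
}

const agent = makeKeepAliveAgent();

// The agent would then be handed to whichever HTTP client or SDK is in use,
// for example: https.get(url, { agent }, callback)
```

Setting both options explicitly, rather than relying on a library's defaults, makes the reuse behavior visible in code review.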
subnets
• Our cluster is stateless, relatively easy to migrate
• Launched new cluster in new subnets with more IPs
• Platform team ensembled every day until fixed
• <2 weeks to execute
39 Running out of IPs
connections and overhead of creating conns.
• Solution: connection pooler proxy service
• https://hub.docker.com/r/edoburu/pgbouncer/
40 Too Many Postgres Connections
pgbouncer improves postgresql performance
user details (and related data).
• Too deeply ingrained in the system to rip out in time.
• Solution: create a ‘blocklist’ of queries and mutations that didn’t depend on it, so they could bypass the eager load.
42 GraphQL Eager Loading
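A minimal sketch of the blocklist idea, assuming the eager load happens while building a per-request context and that requests carry a GraphQL operation name. The operation names, `UserDetails`, `shouldEagerLoadUser`, and `buildContext` are all hypothetical and are not taken from the talk.

```typescript
interface UserDetails {
  id: string;
  name: string;
}

interface RequestContext {
  userDetails?: UserDetails;
}

// Hypothetical operations known not to need user details.
const EAGER_LOAD_BLOCKLIST: ReadonlySet<string> = new Set([
  "HealthCheck",
  "ListPublicArticles",
]);

// Only operations explicitly listed skip the eager load; anything
// unknown or unnamed keeps the existing behavior.
function shouldEagerLoadUser(operationName: string | undefined): boolean {
  if (operationName === undefined) return true;
  return !EAGER_LOAD_BLOCKLIST.has(operationName);
}

async function buildContext(
  operationName: string | undefined,
  loadUserDetails: () => Promise<UserDetails>,
): Promise<RequestContext> {
  if (!shouldEagerLoadUser(operationName)) {
    return {}; // bypass the expensive load entirely
  }
  return { userDetails: await loadUserDetails() };
}
```

Defaulting to the old behavior for anything not on the list keeps the change low-risk: a missing entry costs performance, not correctness.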
Transaction ID Wraparound in Postgres
• Gitlab: Why we spent the last month eliminating PostgreSQL subtransactions
• Migration lessons learned: Even Amazon can face mishaps with new tools
44 Nested Txn Problems
• Engineering has more capacity to work towards its goals
• We are careful not to trade our resilience for robustness
48 Lasting Impact