Building Resilient Services in Clojure

Mourjo Sen
December 05, 2020

How strategies like bulkheads, circuit breakers, and load shedding can make services written in Clojure battle-ready for production systems, and how to go beyond the present by constantly evolving those resiliency strategies. This was presented at re:Clojure 2020, https://reclojure.org.


Transcript

  1. (hello re-clojure) I am Mourjo Sen!
    ◎ Software Engineer at Helpshift
    ◎ 5 years with Clojure
    ◎ @mourjo_sen
  2. Stability of Systems
    Resiliency is the ability of a system to gracefully handle and recover from failures.
  3. Circuit Breakers: History of failures
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers (axis labels: User-centric, System-centric)
  4. Circuit Breakers: Circuit open
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers (axis labels: User-centric, System-centric)
  5. Circuit Breakers: Circuit closed
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers (axis labels: User-centric, System-centric)
  6. Circuit Breakers: Retry five times and then break the circuit
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers (axis labels: User-centric, System-centric)
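The "retry five times" step on the slide above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the `retry` function, its parameters, and the exponential backoff are assumptions, and Python is used only to keep the sketch language-neutral.

```python
import time


def retry(fn, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fn until it succeeds or `attempts` calls have failed.

    Hypothetical helper: the delay doubles after each failure
    (base_delay, 2 * base_delay, ...). The last error is re-raised
    once the attempts are exhausted, which is the point where a
    caller could decide to break the circuit.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as error:
            last_error = error
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.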
  7. Circuit Breaker Strategy
    ◎ Time since last failure
    ◎ Number or % of failures in last W seconds
    ◎ Number of slow ops
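Two of the signals listed on this slide (failures in the last W seconds, and time since the last failure) can be combined into a small circuit breaker. The sketch below is an assumption-laden illustration rather than the design shown in the talk: the class name, thresholds, and half-open behaviour are all hypothetical, and slow operations are not tracked.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=5, window_seconds=10.0,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures          # failures tolerated per window
        self.window_seconds = window_seconds      # the "W" from the slide
        self.cooldown_seconds = cooldown_seconds  # time to wait after opening
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None    # None means the circuit is closed

    def _prune(self, now):
        # Forget failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()

    def call(self, fn, *args, **kwargs):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.cooldown_seconds:
            raise CircuitOpenError("circuit is open; failing fast")
        # Either the circuit is closed, or the cooldown has elapsed and
        # this call acts as a trial (often called "half-open").
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures.append(now)
            self._prune(now)
            if self.opened_at is not None or len(self.failures) >= self.max_failures:
                self.opened_at = now  # open, or re-open after a failed trial
            raise
        if self.opened_at is not None:
            # A successful trial closes the circuit and clears history.
            self.opened_at = None
            self.failures.clear()
        return result
```

The clock is injectable so the time-based behaviour can be checked without sleeping.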
  8. Health Checks
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers, Health Checks (axis labels: User-centric, System-centric)
  9. System Configuration
    ◎ JVM options like Heap, GC
    ◎ Machine config like t2.micro, r3.2xlarge
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers, Health Checks, System Configs (axis labels: User-centric, System-centric)
  10. Instrumentation
    ◎ Monitoring / Alerting
    ◎ Clojure specific instrumentation
      ◦ https://github.com/metrics-clojure/metrics-clojure
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers, Health Checks, System Configs, Instrumentation (axis labels: User-centric, System-centric)
  11. Share System Resources by Pooling
    ◎ Resources are finite
    ◎ System components should respect finiteness
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers, Health Checks, System Configs, Instrumentation, Resource Pooling (axis labels: User-centric, System-centric)
  12. Share System Resources by Pooling
    ◎ Database driver pool
    ◎ HTTP connection pool
    ◎ Background task pool
    ◎ Web server pool
  13. Bulkheads
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers, Health Checks, System Configs, Instrumentation, Resource Pooling, Bulkheads (axis labels: User-centric, System-centric)
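A bulkhead gives each dependency its own bounded share of a finite resource, so that one saturated dependency cannot starve the others. The following is a minimal sketch under that reading of the slide; the `Bulkhead` class and the pool names in the usage note are hypothetical and do not come from the talk.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        # One semaphore per bulkhead: each compartment has its own limit.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

For example, a service might create `Bulkhead("db", 10)` and `Bulkhead("http", 20)`: if database calls hang and fill the first compartment, HTTP calls still find free slots in the second.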
  14. Load Shedding: Prevent Cascading Failures
    Diagram: Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers, Health Checks, System Configs, Instrumentation, Resource Pooling, Bulkheads, Load shedding (axis labels: User-centric, System-centric)
  15. When to Shed Load?
    ◎ Response Time = Queueing Time + Service Time
    ◎ Use queueing time as a signal for load
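Using queueing time as the load signal can be sketched as follows: each request is stamped when it enters the queue, and a request that has already waited too long is dropped instead of being served. This is an illustrative sketch only; the `LoadShedder` class, its return values, and the threshold are assumptions, not the mechanism presented in the talk.

```python
import time
from collections import deque


class LoadShedder:
    def __init__(self, max_queue_seconds, clock=time.monotonic):
        self.max_queue_seconds = max_queue_seconds  # hypothetical threshold
        self.clock = clock
        self._queue = deque()  # (enqueued_at, request) pairs

    def submit(self, request):
        # Stamp the request so its queueing time can be measured later.
        self._queue.append((self.clock(), request))

    def process_next(self, handler):
        """Return ("ok", result), ("shed", request), or None if the queue is empty."""
        if not self._queue:
            return None
        enqueued_at, request = self._queue.popleft()
        queueing_time = self.clock() - enqueued_at
        if queueing_time > self.max_queue_seconds:
            # The request waited too long: serving it now would only add
            # service time to an already slow response, so drop it.
            return ("shed", request)
        return ("ok", handler(request))
```

Shedding at dequeue time means the service time is only spent on requests that still have a chance of a timely response.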
  16. The Journey Thus Far
    Diagram: Business logic, Exception handling, Fallbacks, Timeouts, Retries, Circuit breakers, Health checks, System Configs, Instrumentation, Resource pooling, Bulkheads, Load shedding, ? (axis labels: User-centric, System-centric)
  17. Feedback in Resilience Engineering
    ◎ Constant uphill battle
    ◎ Incidents are lessons for the future
  18. Feedback in Resilience Engineering
    ◎ Constant uphill battle
    ◎ Incidents are lessons for the future
    Diagram: Product Engineering, Incident, Failure Discovery, Knowledge Dissemination, Incident Analysis
  19. Beyond Incidental Resilience
    ◎ How to ensure our resilience patterns are reliable?
    ◎ Few opportunities to learn failure patterns
  20. Beyond Incidental Resilience
    ◎ Chaos Engineering: Inject failures deliberately
      ◦ Confirmation of resilience
      ◦ Discovery of new failure patterns
    https://netflix.github.io/chaosmonkey/
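Deliberate failure injection can be illustrated at the smallest scale by wrapping a function so that a configurable fraction of calls fail. This is an in-process sketch under stated assumptions: the `with_chaos` helper and `InjectedFailure` exception are hypothetical, and this is not how Chaos Monkey operates (it terminates running instances).

```python
import random


class InjectedFailure(Exception):
    """Raised in place of the real call when a failure is injected."""


def with_chaos(fn, failure_rate, rng=None):
    """Wrap fn so that roughly `failure_rate` of calls raise InjectedFailure.

    `rng` is injectable so experiments can be made repeatable with a seed.
    """
    if rng is None:
        rng = random.Random()

    def wrapped(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise InjectedFailure("failure injected by chaos wrapper")
        return fn(*args, **kwargs)

    return wrapped
```

Wrapping a dependency call this way in a test environment is one way to check that fallbacks, retries, and circuit breakers actually engage when the dependency misbehaves.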
  21. Beyond Incidental Resilience
    Diagram: Product Engineering, Incident, Failure Discovery, Chaos Engineering, Knowledge Dissemination, Incident Analysis
  22. Conclusion: Tokyo was not Built in a Day
    Earthquake-prone zones are home to the safest buildings in the world.
    https://www.bbc.com/future/gallery/20190114-how-japans-skyscrapers-are-built-to-survive-earthquakes
    https://en.wikipedia.org/wiki/Tokyo
  23. Thanks! Any questions? @mourjo_sen
    Diagram: Business logic, Exception handling, Fallbacks, Timeouts, Retries, Circuit breakers, Health checks, System Configs, Instrumentation, Resource pooling, Bulkheads, Load shedding, ? (axis labels: User-centric, System-centric)
  24. References
    ◎ https://github.com/resilience4j/resilience4j
    ◎ https://github.com/ylgrgyq/resilience-for-clojure
    ◎ https://docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
    ◎ https://netflix.github.io/chaosmonkey/
    ◎ https://github.com/mourjo/procrustes
    ◎ https://github.com/dakrone/clj-http
    ◎ https://github.com/swaldman/c3p0
    ◎ https://github.com/seancorfield/next-jdbc
    ◎ https://github.com/ring-clojure/ring
    ◎ https://github.com/TheClimateCorporation/claypoole
    ◎ https://medium.com/helpshift-engineering/achieving-graceful-restarts-of-clojure-services-b3a3b9c1d60d
    ◎ https://medium.com/helpshift-engineering/load-shedding-in-clojure-d4857ce11588
    ◎ https://sre.google/sre-book/table-of-contents/