Slide 1

Slide 1 text

Building Resilient Services in Clojure

Slide 2

Slide 2 text

(hello re-clojure)
I am Mourjo Sen!
◎ Software Engineer at Helpshift
◎ 5 years with Clojure
◎ @mourjo_sen

Slide 3

Slide 3 text

Stability of Systems

Slide 4

Slide 4 text

Stability of Systems

Slide 5

Slide 5 text

Stability of Systems

Slide 6

Slide 6 text

Stability of Systems
Resiliency is the ability of a system to gracefully handle and recover from failures.

Slide 7

Slide 7 text

Business Logic

Slide 8

Slide 8 text

Business Logic

Slide 9

Slide 9 text

Business Logic

Slide 10

Slide 10 text

Business Logic
https://twitter.com/cfiesler/status/1330542730469052418

Slide 11

Slide 11 text

Business Logic
https://twitter.com/cfiesler/status/1330542730469052418

Slide 12

Slide 12 text

Journey to Resilience
Business logic
User-centric, System-centric

Slide 13

Slide 13 text

User-centric Resilience
Systems we depend on will fail. Design for the optimal user experience.

Slide 14

Slide 14 text

Exception Handling

Slide 15

Slide 15 text

Exception Handling

Slide 16

Slide 16 text

Exception Handling
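
A minimal sketch of exception handling at the service boundary. The function `fetch-profile` is a hypothetical stand-in for any downstream call, and the response shape and error message are assumptions for illustration, not taken from the talk.

```clojure
(defn fetch-profile
  "Hypothetical downstream call. It always throws here to simulate an outage."
  [user-id]
  (throw (ex-info "profile service unavailable" {:user-id user-id})))

(defn profile-response
  "Turns a downstream exception into a controlled error response instead of
  letting it reach the user as an unhandled failure."
  [user-id]
  (try
    {:status 200 :body (fetch-profile user-id)}
    (catch clojure.lang.ExceptionInfo e
      {:status 503
       :body   {:error   "Profile is temporarily unavailable"
                :user-id (:user-id (ex-data e))}})))
```

Catching the specific exception type keeps unrelated bugs visible rather than hiding them behind a generic error response.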

Slide 17

Slide 17 text

Fallbacks

Slide 18

Slide 18 text

Fallbacks

Slide 19

Slide 19 text

Fallbacks
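
One way a fallback could be expressed is as a higher-order function. In this sketch `with-fallback`, `personalised-recommendations` and `popular-items` are hypothetical names chosen for illustration; the idea is only that a degraded answer is served when the primary path throws.

```clojure
(defn with-fallback
  "Returns a function that calls f and, if f throws, calls fallback with the
  same arguments instead."
  [f fallback]
  (fn [& args]
    (try
      (apply f args)
      (catch Exception _
        (apply fallback args)))))

(defn personalised-recommendations
  "Hypothetical primary path; throws to simulate a failing recommender."
  [user-id]
  (throw (ex-info "recommender down" {:user-id user-id})))

(defn popular-items
  "Hypothetical degraded path: a static list that needs no remote call."
  [_user-id]
  ["item-1" "item-2" "item-3"])

(def recommendations
  (with-fallback personalised-recommendations popular-items))
```

Calling `(recommendations 7)` here returns the static list, because the primary path always throws in this sketch.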

Slide 20

Slide 20 text

Timeouts

Slide 21

Slide 21 text

Timeouts
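
A timeout can be sketched with nothing more than `future` and the timeout arity of `deref` from clojure.core. The helper name `call-with-timeout` and its argument order are assumptions made for this example.

```clojure
(defn call-with-timeout
  "Runs f on another thread and waits at most timeout-ms for its result.
  Returns timeout-val if f has not finished in time."
  [timeout-ms timeout-val f]
  (let [fut    (future (f))
        result (deref fut timeout-ms ::timeout)]
    (if (= ::timeout result)
      (do
        ;; Interrupts the worker thread. This only stops f if f is
        ;; interruptible (for example, blocked in sleep or I/O).
        (future-cancel fut)
        timeout-val)
      result)))
```

Note that the caller is released after `timeout-ms` even if the underlying work keeps running, so the work itself should also be bounded where possible.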

Slide 22

Slide 22 text

Retries

Slide 23

Slide 23 text

Retries
Transient failure

Slide 24

Slide 24 text

Retries

Slide 25

Slide 25 text

Retries

Slide 26

Slide 26 text

Retries

Slide 27

Slide 27 text

Retries
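
Retries suit transient failures, where a later attempt has a fair chance of succeeding. The sketch below assumes exponential backoff and a caller-supplied `retriable?` predicate; the function name, option names and defaults are illustrative, not from the talk.

```clojure
(defn retry
  "Calls f up to max-attempts times. After a failed attempt it sleeps for
  base-delay-ms * 2^(attempt - 1) before trying again. Rethrows the last
  exception when attempts run out or when retriable? returns false."
  [{:keys [max-attempts base-delay-ms retriable?]
    :or   {max-attempts 3 base-delay-ms 100 retriable? (constantly true)}}
   f]
  (loop [attempt 1]
    ;; recur is not allowed inside catch, so the outcome is captured as data.
    (let [outcome (try
                    {:value (f)}
                    (catch Exception e
                      (if (and (< attempt max-attempts) (retriable? e))
                        {:retry e}
                        (throw e))))]
      (if (contains? outcome :value)
        (:value outcome)
        (do
          (Thread/sleep (long (* base-delay-ms (Math/pow 2 (dec attempt)))))
          (recur (inc attempt)))))))
```

Retrying only makes sense for operations that are safe to repeat, and the attempt limit keeps a struggling dependency from being hammered indefinitely.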

Slide 28

Slide 28 text

Circuit Breakers

Slide 29

Slide 29 text

Circuit Breakers
History of failures

Slide 30

Slide 30 text

Circuit Breakers
Circuit open

Slide 31

Slide 31 text

Circuit Breakers
Circuit closed

Slide 32

Slide 32 text

Circuit Breakers
Retry five times and then break the circuit

Slide 33

Slide 33 text

Circuit Breakers

Slide 34

Slide 34 text

Circuit Breakers

Slide 35

Slide 35 text

Circuit Breaker Strategy
◎ Time since last failure
◎ Number or % of failures in last W seconds
◎ Number of slow ops
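
The simplest of these strategies can be sketched with an atom: count consecutive failures, open the circuit at a threshold, and let calls through again once enough time has passed. All names here (`circuit-breaker`, `open?`, `call-through`) are hypothetical; a production system would more likely use a library such as resilience4j.

```clojure
(defn circuit-breaker
  "Creates breaker state. The circuit opens after failure-threshold
  consecutive failures and stays open for reset-after-ms."
  [{:keys [failure-threshold reset-after-ms]}]
  (atom {:failures          0
         :opened-at         nil
         :failure-threshold failure-threshold
         :reset-after-ms    reset-after-ms}))

(defn open?
  "True while the circuit is open, i.e. calls should be skipped."
  [{:keys [opened-at reset-after-ms]} now-ms]
  (and (some? opened-at)
       (< (- now-ms opened-at) reset-after-ms)))

(defn call-through
  "Calls f unless the circuit is open. Uses fallback when the circuit is
  open or when f throws."
  [breaker f fallback]
  (if (open? @breaker (System/currentTimeMillis))
    (fallback)
    (try
      (let [result (f)]
        ;; A success closes the circuit and clears the failure history.
        (swap! breaker assoc :failures 0 :opened-at nil)
        result)
      (catch Exception _
        (swap! breaker
               (fn [{:keys [failures failure-threshold] :as state}]
                 (let [n (inc failures)]
                   (cond-> (assoc state :failures n)
                     (>= n failure-threshold)
                     (assoc :opened-at (System/currentTimeMillis))))))
        (fallback)))))
```

While the circuit is open the failing dependency is not called at all, which gives it room to recover and keeps callers from waiting on a call that is likely to fail.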

Slide 36

Slide 36 text

Health Checks

Slide 37

Slide 37 text

System-centric resilience
Design to prevent failures as far as possible.

Slide 38

Slide 38 text

System Configuration
◎ JVM options such as heap size and GC
◎ Machine configuration such as instance type (t2.micro, r3.2xlarge)

Slide 39

Slide 39 text

Instrumentation
◎ Monitoring / Alerting
◎ Clojure-specific instrumentation
  ○ https://github.com/metrics-clojure/metrics-clojure

Slide 40

Slide 40 text

Share System Resources by Pooling
◎ Resources are finite
◎ System components should respect finiteness

Slide 41

Slide 41 text

Share System Resources by Pooling

Slide 42

Slide 42 text

Share System Resources by Pooling
Figure labels: Incident timeline, JVM threads

Slide 43

Slide 43 text

Share System Resources by Pooling
◎ Database driver pool
◎ HTTP connection pool
◎ Background task pool
◎ Web server pool
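
A minimal sketch of a pool that respects finiteness, built on `java.util.concurrent.ThreadPoolExecutor` with a fixed number of threads and a bounded queue. The helper names `bounded-pool` and `submit!` are assumptions for this example; by default the executor throws `RejectedExecutionException` once both the threads and the queue are full.

```clojure
(import '(java.util.concurrent ArrayBlockingQueue Callable ExecutorService
                               ThreadPoolExecutor TimeUnit))

(defn bounded-pool
  "Creates a pool with exactly n-threads workers and room for at most
  queue-size waiting tasks."
  [n-threads queue-size]
  (ThreadPoolExecutor. (int n-threads) (int n-threads)
                       0 TimeUnit/MILLISECONDS
                       (ArrayBlockingQueue. (int queue-size))))

(defn submit!
  "Submits a zero-argument function and returns a java.util.concurrent.Future.
  The Callable hint is needed because Clojure functions implement both
  Runnable and Callable."
  [^ExecutorService pool f]
  (.submit pool ^Callable f))
```

Giving each kind of work (database access, HTTP calls, background tasks) its own bounded pool makes the limits explicit instead of letting threads grow until the JVM runs out of resources.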

Slide 44

Slide 44 text

Bulkheads

Slide 45

Slide 45 text

Bulkheads
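
A bulkhead isolates one part of the system so that trouble there cannot consume everything else. One possible sketch uses a `java.util.concurrent.Semaphore` to cap concurrent calls to a single dependency; the names `bulkhead` and `with-bulkhead` are hypothetical.

```clojure
(import '(java.util.concurrent Semaphore))

(defn bulkhead
  "Creates a bulkhead that admits at most max-concurrent calls at a time."
  [max-concurrent]
  (Semaphore. (int max-concurrent)))

(defn with-bulkhead
  "Runs f if a permit is available, otherwise calls on-full immediately
  instead of waiting. The permit is always returned when f finishes."
  [^Semaphore sem f on-full]
  (if (.tryAcquire sem)
    (try
      (f)
      (finally
        (.release sem)))
    (on-full)))
```

With one bulkhead per dependency, a slow database can fill only its own compartment while requests that do not need it keep flowing.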

Slide 46

Slide 46 text

Load Shedding: Prevent Cascading Failures

Slide 47

Slide 47 text

Load Shedding: Prevent Cascading Failures

Slide 48

Slide 48 text

Load Shedding: Prevent Cascading Failures

Slide 49

Slide 49 text

Load Shedding: Prevent Cascading Failures

Slide 50

Slide 50 text

Load Shedding: Prevent Cascading Failures

Slide 51

Slide 51 text

When to Shed Load?
◎ Response Time = Queueing Time + Service Time

Slide 52

Slide 52 text

When to Shed Load?
◎ Response Time = Queueing Time + Service Time
◎ Use queueing time as a signal for load
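
To make the split between queueing time and service time concrete, here is a small sketch that measures both for a task handed to a thread pool. The wrapper name `timed-task` and the result keys are assumptions for illustration.

```clojure
(defn timed-task
  "Wraps f so that, when a worker thread finally runs it, the result reports
  how long the task waited in the queue and how long f itself took.
  The clock starts when timed-task is called, i.e. at enqueue time."
  [f]
  (let [enqueued-at (System/nanoTime)]
    (fn []
      (let [started-at  (System/nanoTime)
            value       (f)
            finished-at (System/nanoTime)]
        {:value      value
         :queue-ms   (/ (- started-at enqueued-at) 1e6)
         :service-ms (/ (- finished-at started-at) 1e6)}))))
```

On a single-threaded executor, a fast task submitted behind a 100 ms task would show a small `:service-ms` but a `:queue-ms` of roughly 100. That growing wait is the load signal, since it rises as soon as work arrives faster than it can be served.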

Slide 53

Slide 53 text

Load Shedding Ring Middleware

Slide 54

Slide 54 text

Load Shedding Ring Middleware

Slide 55

Slide 55 text

Load Shedding Ring Middleware
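
The code shown on these slides is not part of the transcript, so the following is only a sketch of what a load-shedding Ring middleware might look like. It assumes the server has stamped each request with a hypothetical `:enqueued-at-ms` key when the request was accepted; that key, the option names and the 503 response are all assumptions, not the API of any library.

```clojure
(defn wrap-load-shedding
  "Ring middleware that rejects a request with a 503 when it has already
  waited longer than max-queue-ms before reaching the handler.
  now-fn is injectable so the behaviour can be tested without a real clock."
  [handler {:keys [max-queue-ms now-fn]
            :or   {now-fn #(System/currentTimeMillis)}}]
  (fn [request]
    (let [enqueued-at (:enqueued-at-ms request) ; hypothetical key
          queued-ms   (if enqueued-at (- (now-fn) enqueued-at) 0)]
      (if (> queued-ms max-queue-ms)
        {:status  503
         :headers {"Retry-After" "1"}
         :body    "Server overloaded, request shed"}
        (handler request)))))
```

Shedding a request that has already waited too long is cheap, and it frees the worker for requests whose callers are still likely to be waiting for an answer.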

Slide 56

Slide 56 text

Load Shedding vs Rate Limiting
◎ More dynamic

Slide 57

Slide 57 text

Load Shedding vs Rate Limiting
◎ More dynamic

Slide 58

Slide 58 text

Load Shedding vs Rate Limiting
◎ More dynamic

Slide 59

Slide 59 text

The Journey Thus Far
Business logic, Exception handling, Fallbacks, Timeouts, Retries, Circuit breakers, Health checks, System configs, Instrumentation, Resource pooling, Bulkheads, Load shedding, ?
User-centric, System-centric

Slide 60

Slide 60 text

Feedback in Resilience Engineering
◎ Constant uphill battle

Slide 61

Slide 61 text

Feedback in Resilience Engineering
◎ Constant uphill battle
◎ Incidents are lessons for the future

Slide 62

Slide 62 text

Feedback in Resilience Engineering
◎ Constant uphill battle
◎ Incidents are lessons for the future
Diagram labels: Product Engineering, Incident, Failure Discovery, Knowledge Dissemination, Incident Analysis

Slide 63

Slide 63 text

Beyond Incidental Resilience
◎ How to ensure our resilience patterns are reliable?

Slide 64

Slide 64 text

Beyond Incidental Resilience
◎ How to ensure our resilience patterns are reliable?
◎ Few opportunities to learn failure patterns

Slide 65

Slide 65 text

Beyond Incidental Resilience
◎ Chaos Engineering: Inject failures deliberately
  ○ Confirmation of resilience
  ○ Discovery of new failure patterns
https://netflix.github.io/chaosmonkey/
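
At the smallest scale, deliberate failure can be sketched as a wrapper that makes a function throw some fraction of the time. This is only an in-process illustration with hypothetical names (`wrap-chaos`, `:failure-rate`); tools such as Chaos Monkey work at the infrastructure level instead.

```clojure
(defn wrap-chaos
  "Returns a version of f that throws an injected failure with probability
  failure-rate (a number between 0.0 and 1.0) and otherwise calls f."
  [f {:keys [failure-rate] :or {failure-rate 0.0}}]
  (fn [& args]
    (if (< (rand) failure-rate)
      (throw (ex-info "chaos: injected failure" {:injected? true}))
      (apply f args))))
```

Wrapping a dependency call this way in a test environment lets fallbacks, retries and circuit breakers be exercised on purpose rather than only during real incidents.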

Slide 66

Slide 66 text

Beyond Incidental Resilience
Diagram labels: Product Engineering, Incident, Failure Discovery, Chaos Engineering, Knowledge Dissemination, Incident Analysis

Slide 67

Slide 67 text

Conclusion: Tokyo was not Built in a Day

Slide 68

Slide 68 text

Conclusion: Tokyo was not Built in a Day
Earthquake-prone zones are home to the safest buildings in the world.
https://www.bbc.com/future/gallery/20190114-how-japans-skyscrapers-are-built-to-survive-earthquakes
https://en.wikipedia.org/wiki/Tokyo

Slide 69

Slide 69 text

Thanks! Any questions?
@mourjo_sen

Slide 70

Slide 70 text

References
◎ https://github.com/resilience4j/resilience4j
◎ https://github.com/ylgrgyq/resilience-for-clojure
◎ https://docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
◎ https://netflix.github.io/chaosmonkey/
◎ https://github.com/mourjo/procrustes
◎ https://github.com/dakrone/clj-http
◎ https://github.com/swaldman/c3p0
◎ https://github.com/seancorfield/next-jdbc
◎ https://github.com/ring-clojure/ring
◎ https://github.com/TheClimateCorporation/claypoole
◎ https://medium.com/helpshift-engineering/achieving-graceful-restarts-of-clojure-services-b3a3b9c1d60d
◎ https://medium.com/helpshift-engineering/load-shedding-in-clojure-d4857ce11588
◎ https://sre.google/sre-book/table-of-contents/