Circuit Breakers
[Diagram: resilience layers covered so far (Business logic, Exception Handling, Fallbacks, Timeouts, Retries, Circuit Breakers), labelled User-centric and System-centric]
Retry five times and then break the circuit
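The "retry five times and then break the circuit" rule can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are hypothetical, and the 30-second reset period is an assumed value.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # assumed: seconds before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hitting the struggling dependency.
                raise CircuitOpenError("circuit is open; call rejected")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` and can fall back to a default response instead of waiting on a dependency that is known to be failing.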
Circuit Breaker Strategy
◎ Time since last failure
◎ Number or % of failures in the last W seconds
◎ Number of slow operations
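One of the strategies above, the percentage of failures in the last W seconds, might be tracked with a sliding window like the one below. This is a sketch under assumptions: `FailureWindow`, `max_failure_ratio`, and `min_calls` are hypothetical names, and the thresholds are illustrative values.

```python
import time
from collections import deque


class FailureWindow:
    def __init__(self, window_seconds=10.0, max_failure_ratio=0.5,
                 min_calls=4, clock=time.monotonic):
        self.window_seconds = window_seconds        # the "W" from the slide
        self.max_failure_ratio = max_failure_ratio  # assumed threshold
        self.min_calls = min_calls                  # avoid tripping on tiny samples
        self.clock = clock
        self.events = deque()                       # (timestamp, succeeded) pairs

    def record(self, succeeded):
        self.events.append((self.clock(), succeeded))

    def _evict_old(self):
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def should_open(self):
        """True when the failure ratio inside the window is too high."""
        self._evict_old()
        total = len(self.events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total >= self.max_failure_ratio
```

The same window could count slow operations instead of failures by recording whether each call finished within a latency budget.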
Health Checks
[Diagram: resilience layers so far, with Health Checks added]
System-centric resilience
Design to prevent failures as far as possible.
System Configuration
◎ JVM options such as heap size and GC settings
◎ Machine configuration such as instance type (t2.micro, r3.2xlarge)
[Diagram: resilience layers so far, with System Configs added]
Instrumentation
◎ Monitoring / Alerting
◎ Clojure-specific instrumentation
○ https://github.com/metrics-clojure/metrics-clojure
[Diagram: resilience layers so far, with Instrumentation added]
Share System Resources by Pooling
◎ Resources are finite
◎ System components should respect finiteness
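A pool is one way for components to respect finite resources: a fixed number of resources is created up front, and callers wait briefly or fail rather than creating more. The sketch below is illustrative only; `ResourcePool`, `PoolExhausted`, and `borrow` are hypothetical names, and the one-second wait is an assumed default.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no resource becomes free within the timeout."""


class ResourcePool:
    def __init__(self, factory, size):
        # Create exactly `size` resources; the pool never grows beyond this.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def borrow(self, timeout=1.0):
        try:
            resource = self._idle.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhausted("no resource became free in time") from None
        try:
            yield resource
        finally:
            # Always hand the resource back, even if the caller raised.
            self._idle.put(resource)
```

A caller would write `with pool.borrow() as conn: ...`, so that a slow consumer shows up as a `PoolExhausted` error at the edge instead of as unbounded resource growth.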
[Diagram: resilience layers so far, with Resource Pooling added]
[Figure: Incident timeline, JVM threads]
Database driver pool
HTTP connection pool
Background task pool
Web server pool
Bulkheads
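A bulkhead gives each dependency its own capacity limit, so that one saturated dependency cannot consume the capacity needed by the others. The following is a rough sketch under assumptions: the `Bulkhead` class, the compartment names, and the concurrency limits are all hypothetical.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def enter(self):
        # Reject immediately rather than queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


# Hypothetical compartments: each dependency gets its own limit.
bulkheads = {
    "database": Bulkhead("database", max_concurrent=2),
    "http": Bulkhead("http", max_concurrent=1),
}
```

With this layout, exhausting the `http` compartment leaves the `database` compartment untouched, which is the isolation property the pattern is named for.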
[Diagram: resilience layers so far, with Bulkheads added]
Load Shedding: Prevent Cascading Failures
[Diagram: resilience layers so far, with Load shedding added]
When to Shed Load?
◎ Response Time = Queueing Time + Service Time
◎ Use queueing time as a signal for load
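Using queueing time as the load signal could look like the sketch below: each request is stamped on arrival, and a worker drops requests that have already waited longer than a budget. The names `SheddingQueue`, `max_queue_seconds`, and `shed_count`, and the 0.5-second budget, are assumptions made for illustration.

```python
import time
from collections import deque


class SheddingQueue:
    def __init__(self, max_queue_seconds=0.5, clock=time.monotonic):
        self.max_queue_seconds = max_queue_seconds  # assumed queueing budget
        self.clock = clock
        self._items = deque()                       # (enqueued_at, request) pairs
        self.shed_count = 0

    def submit(self, request):
        # Stamp the request so queueing time can be measured later.
        self._items.append((self.clock(), request))

    def next_request(self):
        """Return the next request still worth serving, or None if empty."""
        while self._items:
            enqueued_at, request = self._items.popleft()
            queueing_time = self.clock() - enqueued_at
            if queueing_time <= self.max_queue_seconds:
                return request
            # Waited too long: shed it instead of spending service time on it.
            self.shed_count += 1
        return None
```

Because service time is spent only on requests that are still fresh, the system keeps doing useful work under overload rather than answering requests whose callers have likely given up.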
Load Shedding Ring Middleware
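The slide's code is not preserved in this text. As a rough Python analogue of a middleware (a function that takes a handler and returns a new handler), load shedding could be wrapped around request handling as below. This is not Ring's API: the `arrived_at` request key, the `wrap_load_shedding` name, and the 0.5-second budget are hypothetical, and the sketch assumes the server stamps each request when it is accepted.

```python
import time


def wrap_load_shedding(handler, max_queue_seconds=0.5, clock=time.monotonic):
    """Wrap `handler` so that requests which queued too long get a 503."""
    def wrapped(request):
        # Assumption: the server set request["arrived_at"] on accept.
        queueing_time = clock() - request["arrived_at"]
        if queueing_time > max_queue_seconds:
            return {
                "status": 503,
                "headers": {"Retry-After": "1"},
                "body": "Service overloaded",
            }
        return handler(request)
    return wrapped
```

Keeping the check in a wrapper means the business-logic handler stays unaware of overload protection, and the threshold can be tuned in one place.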
Load Shedding vs Rate Limiting
◎ More dynamic
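For contrast, a rate limiter typically enforces a quota chosen in advance, regardless of how loaded the system currently is. The fixed-window limiter below is a generic sketch of that idea, not something taken from the talk; `FixedWindowRateLimiter` and its parameters are hypothetical names.

```python
import time


class FixedWindowRateLimiter:
    def __init__(self, limit, window_seconds=1.0, clock=time.monotonic):
        self.limit = limit                    # static quota per window
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            # A new window begins: reset the counter.
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

The limit here stays the same whether the service is idle or struggling, whereas a load shedder driven by queueing time reacts to the conditions actually observed.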
The Journey Thus Far
Business logic
Exception handling
Fallbacks
Timeouts
Retries
Circuit breakers
Health checks
System configs
Instrumentation
Resource pooling
Bulkheads
Load shedding
?
User-centric System-centric
Feedback in Resilience Engineering
◎ Constant uphill battle
◎ Incidents are lessons for the future
[Diagram labels: Product, Engineering, Incident, Failure, Discovery, Knowledge Dissemination, Incident Analysis]
Beyond Incidental Resilience
◎ How do we ensure our resilience patterns are reliable?
◎ Few opportunities to learn failure patterns
◎ Chaos Engineering: Inject failures deliberately
○ Confirmation of resilience
○ Discovery of new failure patterns
https://netflix.github.io/chaosmonkey/
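Deliberate failure injection can be tried on a small scale inside a process, to confirm that a fallback really takes over. The sketch below is not Chaos Monkey's API (that tool terminates instances); `with_chaos`, `InjectedFailure`, and `call_with_fallback` are hypothetical helpers introduced for illustration.

```python
import random


class InjectedFailure(Exception):
    """Marks a failure that was injected on purpose."""


def with_chaos(fn, failure_rate, rng=None):
    """Wrap `fn` so that a fraction of calls fail deliberately."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # rng.random() is in [0, 1), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFailure("failure injected by chaos wrapper")
        return fn(*args, **kwargs)

    return wrapped


def call_with_fallback(fn, default):
    """The resilience pattern under test: fall back when the call fails."""
    try:
        return fn()
    except Exception:
        return default
```

Running the wrapped call with a failure rate of 1.0 confirms the fallback path; lower rates in a test environment can surface failure patterns that no incident has exposed yet.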
Conclusion: Tokyo was not Built in a Day
Earthquake-prone zones are home to the safest buildings in the world.
https://www.bbc.com/future/gallery/20190114-how-japans-skyscrapers-are-built-to-survive-earthquakes
https://en.wikipedia.org/wiki/Tokyo
Thanks!
Any questions?
@mourjo_sen