Resilient Microservices Architecture: A Pragmatic Approach to Dealing with Chaos

Slide 1

Slide 1 text

Resilient Software Architecture for Microservices: A Pragmatic Approach to Dealing with Chaos
Dr. Walter Kammergruber
MicroService Meetup Munich, 26.06.2019

Slide 2

Slide 2 text

Overview
1. Why is resilience (especially) important for microservices?
2. Best practices for designing resilient distributed architectures
3. Testing resilience in applications

Slide 3

Slide 3 text

Why Resilience?

Slide 4

Slide 4 text

Source: turnoff.us/geek/are-you-ready-for-microservices/

Slide 5

Slide 5 text

Murphy's law: “Anything that can possibly go wrong, does.”
Source: Sack, John. The Butcher: The Ascent of Yerupaja, epigraph (1952), reprinted in Shapiro, Fred R., ed., The Yale Book of Quotations 529 (2006)

Slide 6

Slide 6 text

‘Resilience is an imperative. Our software runs on the truly dismal computers we call data centers. Besides being heinously complex ... they are unreliable and prone to operator error.’
Marius Eriksen, former “peddler of abstraction” at Twitter

Slide 7

Slide 7 text

Slide 8

Slide 8 text

E. B. Walker Photography https://www.flickr.com/photos/premierehdr/

Slide 9

Slide 9 text

Potential problems with microservices in general
● Performance
● Correctness
● Monitoring
● Debugging
● Efficiency
● Security
● Operability
● Resilience

Slide 10

Slide 10 text

Resilience in microservices is hard
You depend on at least one of these:
● software you didn't write
● hardware you can't touch
● network you can't configure
Stuff
● breaks in new and surprising ways
● and your customers shouldn't notice

Slide 11

Slide 11 text

Best Practices

Slide 12

Slide 12 text

Very Simple Example
[Diagram: Invoice Service and Exchange Rate Service ($€£¥₽), with a "?"]

Slide 13

Slide 13 text

Defensive Programming

Slide 14

Slide 14 text

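# Deliberately "defensive" example: retry, backoff and caching all live in the business code.
# present? comes from ActiveSupport; call_exchange_rates, cache_rates, cached_rates and
# increase_wait_time are helpers that the slide does not show.
def retrieve_exchange_rates
  max_retries = 7
  retry_count = 0
  wait_time = 1 # seconds; the slide gives no initial value, so this is an assumption
  while retry_count < max_retries
    response = call_exchange_rates
    if response.present?
      cache_rates(response)
      return response
    end
    sleep(wait_time)
    wait_time = increase_wait_time(retry_count)
    retry_count += 1
  end
  # All retries failed: fall back to the last cached rates, if any.
  return cached_rates if cached_rates.present?
  raise Error.new('Could not fetch Exchange Rate')
end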

Slide 15

Slide 15 text

What's wrong?
Code smells: it is not the job of the business logic (your code) to know:
- Number of retries
- (Increased) wait period
- Caching
Not in this example, but commonly found:
- Strategies for dealing with request peaks (fallbacks, throttling, prioritization etc.)
- Timeout thresholds ((milli-)seconds)

Slide 16

Slide 16 text

What to do?
- Service objects, maybe as an extra library
- Job queues if possible
- Ambassador Pattern (see later)
- Use, for example, Finagle or Linkerd
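As a rough illustration of the "service object" option, the retry count and wait period could move into a small reusable object so that the business code no longer knows about them. This is only a sketch under assumptions: RetryPolicy, its parameters, the exponential backoff and the injectable sleeper are all hypothetical names and choices, not something shown in the talk.

```ruby
# Hypothetical service object; every name and default here is illustrative.
class RetryPolicy
  def initialize(max_retries: 7, base_wait: 0.1, sleeper: ->(seconds) { sleep(seconds) })
    @max_retries = max_retries
    @base_wait = base_wait
    @sleeper = sleeper # injectable so tests do not really sleep
  end

  # Runs the block until it returns a non-nil value or the attempts are used up.
  # Returns the block's value, or the fallback's value if every attempt fails.
  def call(fallback: nil)
    @max_retries.times do |attempt|
      begin
        result = yield
        return result unless result.nil?
      rescue StandardError
        # Treat an exception like an empty response and try again.
      end
      @sleeper.call(@base_wait * (2**attempt)) # exponential backoff
    end
    return fallback.call if fallback
    raise "all #{@max_retries} attempts failed"
  end
end

# The business logic only states what it wants, not how to retry.
def retrieve_exchange_rates(client, policy)
  policy.call { client.call }
end
```

The same policy object can then be shared by every outbound call, and tuning the retry behaviour no longer touches the invoice code.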

Slide 17

Slide 17 text

The Reactive Manifesto
“The system stays responsive in the face of failure. This applies not only to highly-available, mission-critical systems; any system that is not resilient will be unresponsive after a failure.”
Source: www.reactivemanifesto.org

Slide 18

Slide 18 text

The Reactive Manifesto, paradigms for resilience:
● Delegation
● Replication
● Containment
● Isolation
Source: www.reactivemanifesto.org

Slide 19

Slide 19 text

Delegation
“Delegating a task asynchronously to another component means that the execution of the task will take place in the context of that other component. This delegated context could entail running in a different error handling context, on a different thread, in a different process, or on a different network node, to name a few possibilities.”

Slide 20

Slide 20 text

Replication
“Executing a component simultaneously in different places is referred to as replication. This can mean executing on different threads or thread pools, processes, network nodes, or computing centers.”

Slide 21

Slide 21 text

Isolation (and Containment)
“Isolation can be defined in terms of decoupling, both in time and space. Decoupling in time means that the sender and receiver can have independent life-cycles: they do not need to be present at the same time for communication to be possible. It is enabled by adding asynchronous boundaries between the components, communicating through message-passing.”

Slide 22

Slide 22 text

General basic dos for preparing for resilience
● Logging
● Monitoring
● Integration Testing
● System Testing

Slide 23

Slide 23 text

Service Design
● Cluster services into use cases: What SLAs do you need for each?
● Make a list of your services: Which are most important for your business?
● What are the challenges?
  ○ Response time (e.g. online shop, UX)
  ○ Are there any potential peaks to expect?
  ○ Data-intensive services (network bandwidth, CPU, GPU, I/O writes)

Slide 24

Slide 24 text

Patterns
Source: www.reddit.com/r/ProgrammerHumor/comments/72fwhc/modern_application_architecture

Slide 25

Slide 25 text

Ambassador pattern (aka sidecar)
[Diagram: on one Host, the Application (main functionality) talks to the Ambassador, which proxies calls to the Remote Service]
The Ambassador is a proxy that handles:
● Retry
● Circuit breaking
● Monitoring
● Security
Source: docs.microsoft.com/en-us/azure/architecture/patterns/ambassador
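To make the "circuit breaking" duty of the ambassador more concrete, here is a minimal in-process sketch. It is not how Linkerd or any specific proxy implements it; the class name, the thresholds and the injectable clock are assumptions chosen for illustration.

```ruby
# Hypothetical circuit breaker; thresholds and names are illustrative assumptions.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, reset_after: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_after = reset_after # seconds the circuit stays open
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  # Runs the block unless the circuit is open; failures are counted and re-raised.
  def call
    raise OpenError, 'circuit open, failing fast' if open?
    begin
      result = yield
    rescue StandardError
      @failures += 1
      @opened_at = @clock.call if @failures >= @failure_threshold
      raise
    end
    # A success closes the circuit again.
    @failures = 0
    @opened_at = nil
    result
  end

  private

  # Open means: tripped, and the reset period has not yet elapsed.
  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @reset_after
  end
end
```

Once the reset period has passed, the next call is let through as a trial: if it succeeds the breaker closes, if it fails the breaker opens again.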

Slide 26

Slide 26 text

Anti-Corruption Layer pattern
[Diagram: Subsystem A (three microservices) ↔ Anti-Corruption Layer ↔ Subsystem B (aka legacy)]
Source: docs.microsoft.com/en-us/azure/architecture/patterns/anti-corruption-layer
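One way to picture the anti-corruption layer is as a translator that is the only code aware of the legacy data shape. The sketch below is purely illustrative: the Customer model and the legacy field names (CUST_NO, NAME_TXT, STATUS_CD) are invented for this example.

```ruby
# Domain model used by the new microservices (hypothetical).
Customer = Struct.new(:id, :name, :active, keyword_init: true)

# Anti-corruption layer: the only place that knows the legacy field names
# and encodings. All legacy names below are invented for illustration.
class LegacyCustomerTranslator
  def to_domain(legacy_row)
    Customer.new(
      id: Integer(legacy_row.fetch('CUST_NO'), 10), # zero-padded string to integer
      name: legacy_row.fetch('NAME_TXT').strip,     # legacy pads names with spaces
      active: legacy_row.fetch('STATUS_CD') == 'A'  # status code to boolean
    )
  end
end

translator = LegacyCustomerTranslator.new
customer = translator.to_domain(
  'CUST_NO' => '0042', 'NAME_TXT' => '  Ada Lovelace ', 'STATUS_CD' => 'A'
)
```

If the legacy format changes, only the translator has to change; the microservices keep working with Customer.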

Slide 27

Slide 27 text

Message Queues
[Diagram: Producer → Broker (Topic → Queue 1, Queue 2, Queue 3) → Consumers]
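The fan-out in the diagram can be imitated in a single process with Ruby's built-in Queue. This is only a toy stand-in for a real broker, and the Topic class and the consumer names are hypothetical.

```ruby
# Minimal in-process stand-in for a broker with one topic; a real system
# would use a dedicated message broker instead.
class Topic
  def initialize
    @queues = {}
  end

  # Each subscriber gets its own queue.
  def subscribe(name)
    @queues[name] = Queue.new
  end

  # Publishing copies the message into every subscriber's queue.
  def publish(message)
    @queues.each_value { |queue| queue << message }
  end
end

topic = Topic.new
results = Queue.new # thread-safe collection of what the consumers did

consumers = %w[billing email audit].map do |name|
  queue = topic.subscribe(name)
  Thread.new do
    # Each consumer works at its own pace until it sees the :done marker.
    while (message = queue.pop) != :done
      results << "#{name} handled #{message}"
    end
  end
end

topic.publish('invoice-created')
topic.publish(:done)
consumers.each(&:join)
```

The producer never waits for a consumer, which is the decoupling in time that makes queues useful for resilience.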

Slide 28

Slide 28 text

Compensating Transaction pattern: SaaS shop
Each step of the long-running transaction has a counter operation:
1. Register User → compensate: Delete User
2. Acquire Package → compensate: Delete Package association
3. Choose a Plan → compensate: Delete Plan association
4. Issue payment → compensate: Cancel payment / trigger refund
5. Send confirmation email → compensate: Send cancellation email
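The idea of a counter operation per step can be sketched as a small runner that undoes completed steps in reverse order when a later step fails. The step names follow the slide; the Step struct, the runner and the placeholder actions are assumptions for illustration.

```ruby
# Hypothetical saga runner; the actions below are placeholders that only log.
Step = Struct.new(:name, :action, :compensation)

def run_with_compensation(steps)
  completed = []
  steps.each do |step|
    step.action.call
    completed << step
  end
  :committed
rescue StandardError
  # A step failed: undo the completed steps in reverse order.
  completed.reverse_each { |step| step.compensation.call }
  :compensated
end

log = []
steps = [
  Step.new('Register User',   -> { log << 'user registered' },  -> { log << 'user deleted' }),
  Step.new('Acquire Package', -> { log << 'package acquired' }, -> { log << 'package association deleted' }),
  Step.new('Choose a Plan',   -> { log << 'plan chosen' },      -> { log << 'plan association deleted' }),
  Step.new('Issue payment',   -> { raise 'payment declined' },  -> { log << 'payment cancelled' })
]
outcome = run_with_compensation(steps)
```

The failed step itself is not compensated here because it never completed; a real implementation would also have to cope with compensations that fail.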

Slide 29

Slide 29 text

Microservice Antipatterns
● Microlith: Everything depends on everything else
● The More The Merrier: Too much of everything, i.e. too much communication, too much complexity, badly cut services
● Flying Blind: Too little logging & monitoring
● Pride and Wrath of Distributed Services: “Yeah, it’s going to work just fine”
Nice read: itnext.io/anti-patterns-of-microservices-6e802553bd46

Slide 30

Slide 30 text

General (Resiliency) Anti-patterns
Do not violate these programming principles:
● Separation of concerns → enables change / improvements
● Information hiding → enables change / improvements
● Loose coupling → enables change / improvements
● Fail-fast (but of course be resilient as a system) → detects failures / corrupt data
You might violate:
● “Don't repeat yourself”, “Cohesion”

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Testing with Chaos

Slide 33

Slide 33 text

A QA engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd.
@brenankeller, 2018-11-30, 1:21PM

Slide 34

Slide 34 text

First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone.
@brenankeller, 2018-11-30, 1:21PM

Slide 35

Slide 35 text

Principles of Chaos Engineering
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Source: principlesofchaos.org

Slide 36

Slide 36 text

Cynefin complexity framework
Cynefin (pronounced kun-EV-in) means haunt, habitat, acquainted, accustomed, or familiar.

Slide 37

Slide 37 text

Cynefin domains:
● Obvious: known knowns
● Complicated: known unknowns
● Complex: unknown unknowns
● Chaotic: unknowables
● Disorder
Source: https://learning.oreilly.com/library/view/the-agile-developers/9781787280205/

Slide 38

Slide 38 text

“Chaos doesn’t cause problems, it reveals them.”*
Nora Jones, Senior Software Engineer, Netflix
*(It is really hard to find the original source, but it is quoted very frequently, so it must somehow be a valid quote.)

Slide 39

Slide 39 text

Chaos engineering general approach
1. Formulate hypothesis: What might go wrong in the system?
2. Set up experiment: Can you recreate the failure without impacting users?
3. Minimize blast radius: Try the smallest experiment first to learn something.
4. Run the experiment: Monitor the results and the system behavior carefully.
5. Analyze: If the system did not work as expected, well, you found a bug. If everything worked as it should, increase the blast radius and repeat.
6. Fix it: If an error occurred, try to fix it, using your metrics to find the problem. Repeat the experiment.

Slide 40

Slide 40 text

[Diagram: cycle of Formulate hypothesis → Set up experiment → Run the experiment → Analyze → Fix it]
Do not forget the blast radius!!!

Slide 41

Slide 41 text

Failure Injection
1. Application level
2. Host failure
3. Resource attacks (RAM, CPU)
4. Network attacks (latency, dependencies)
5. Region attacks

Slide 42

Slide 42 text

Examples of what you can test
● Maxing out CPU cores on an Elasticsearch cluster.
● Time travel: Forcing system clocks out of sync with each other.
● Simulating the failure of an entire region or datacenter.
● Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.

Slide 43

Slide 43 text

Examples of what you can test (continued)
● Injecting latency between services for a select percentage of traffic over a predetermined period of time.
● Function-based chaos (runtime injection): Randomly causing functions to throw exceptions.
● Executing a routine in driver code emulating I/O errors.
Source: learning.oreilly.com/library/view/chaos-engineering/9781491988459
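Function-based chaos and latency injection can be tried out with a few lines of Ruby before reaching for a dedicated tool. The wrapper below is a sketch under assumptions: ChaosInjector, failure_rate, latency and the injected error class are hypothetical names, and a real experiment would also need a way to limit the blast radius.

```ruby
# Hypothetical fault injector; all names and parameters are assumptions.
class ChaosInjector
  class InjectedFailure < StandardError; end

  def initialize(failure_rate:, latency: 0.0, rng: Random.new,
                 sleeper: ->(seconds) { sleep(seconds) })
    @failure_rate = failure_rate # probability between 0.0 and 1.0
    @latency = latency           # extra delay in seconds before each call
    @rng = rng                   # injectable, so experiments can be seeded
    @sleeper = sleeper
  end

  # Returns a lambda that behaves like the callable, plus injected chaos.
  def wrap(callable)
    lambda do |*args|
      @sleeper.call(@latency) if @latency.positive?
      raise InjectedFailure, 'chaos: injected failure' if @rng.rand < @failure_rate
      callable.call(*args)
    end
  end
end

# Stand-in for a real service call; roughly one in five calls will fail.
fetch_rates = ->(currency) { { currency => 1.0 } }
chaotic_fetch = ChaosInjector.new(failure_rate: 0.2, rng: Random.new(42)).wrap(fetch_rates)
```

Calling chaotic_fetch in place of fetch_rates shows whether the callers cope with sporadic exceptions; seeding the random generator keeps a failing run reproducible.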

Slide 44

Slide 44 text

Tools for Chaos Engineering
1. Istio: istio.io
2. Chaos Toolkit: chaostoolkit.org
3. Kube Monkey: github.com/asobti/kube-monkey
4. Powerful Seal: github.com/bloomberg/powerfulseal
5. Gremlin: www.gremlin.com
6. Chaosmonkey: github.com/netflix/chaosmonkey
7. Some Python scripts ;-)

Slide 45

Slide 45 text

Takeaway: Be pragmatic
1. Know your most important services
2. Consider the end-to-end user experience (UX)
3. Be defensive against failure
4. Architect for change
5. Expect the worst and deal with it
6. What is the simplest solution that works for me?
7. Play around with chaos
8. Never forget about people: Ask other teams and experts, and bring them on board.

Slide 46

Slide 46 text

Dr. Walter Kammergruber, CTO
@woidda
xing.com/profile/WalterChristian_Kammergruber
linkedin.com/in/dr-walter-kammergruber-9b643130

Slide 47

Slide 47 text

Links
● Velocity Conf: conferences.oreilly.com/velocity/vl-eu-2018/public/schedule/full/public
● Chaos Engineering resources: github.com/dastergon/awesome-chaos-engineering
● Design patterns for microservices: azure.microsoft.com/en-us/blog/design-patterns-for-microservices/
  ○ Especially: docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
● O'Reilly stuff: www.oreilly.com/tags/microservices
  ○ Especially: learning.oreilly.com/library/view/release-it/9781680500264/