Resilient Microservices Architecture: A Pragmatic Approach to Dealing with Chaos

Resilient Software Architecture for Microservices: A Pragmatic Approach to Dealing with Chaos
Dr. Walter Kammergruber
MicroService Meetup Munich, 26.06.2019

Overview
1. Why is resilience (especially) important for microservices?
2. Best practices for designing resilient distributed architectures
3. Testing resilience in applications

Why Resilience?

Comic: turnoff.us/geek/are-you-ready-for-microservices/

Murphy's law
“Anything that can possibly go wrong, does.”
Sack, John. The Butcher: The Ascent of Yerupaja, epigraph (1952), reprinted in Shapiro, Fred R., ed., The Yale Book of Quotations 529 (2006)

‘Resilience is an imperative. Our software runs on the truly dismal computers we call data centers. Besides being heinously complex ... they are unreliable and prone to operator error.’
Marius Eriksen, former “peddler of abstraction” at Twitter

E. B. Walker Photography https://www.flickr.com/photos/premierehdr/

Potential problems with microservices in general
• Performance
• Correctness
• Monitoring
• Debugging
• Efficiency
• Security
• Operability
• Resilience

Resilience in microservices is hard
At least one of those:
• software you didn't write
• hardware you can't touch
• network you can't configure
Stuff
• breaks in new and surprising ways
• and your customers shouldn't notice

Best Practices

Very Simple Example
Diagram: Invoice Service → Exchange Rate Service ($€£¥₽) ?

Defensive Programming

    MAX_RETRIES = 7

    def retrieve_exchange_rates
      retry_count = 0
      # wait_time was never initialized on the slide; 1 second is an assumed starting value
      wait_time = 1
      while retry_count < MAX_RETRIES
        response = call_exchange_rates
        if response.present?
          # remember the latest good answer for the fallback below
          cache_rates(response)
          return response
        end
        sleep(wait_time)
        retry_count += 1
        wait_time = increase_wait_time(retry_count)
      end
      # all retries failed: fall back to the cache, otherwise give up
      return cached_rates if cached_rates.present?
      raise Error.new('Could not fetch Exchange Rate')
    end

What's wrong?
Code smells: it is not the job of the business logic / your code to know about:
- Number of retries
- (Increased) wait period
- Caching
Not in this example, but frequently found:
- Strategies for dealing with request peaks (fallbacks, throttling, prioritization etc.)
- Timeout thresholds ((milli-)seconds)

What to do?
- Service objects, maybe as an extra library
- Job queues if possible
- Ambassador Pattern (see later)
- Use for example Finagle or Linkerd
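
One way to read the "service objects" suggestion is sketched below: retry count, backoff and fallback move into a small policy object, so the exchange-rate client only states what it wants to fetch. `RetryPolicy`, `ExchangeRateClient`, the `fetcher` callable and all default values are hypothetical names and numbers chosen for illustration; they are not from the talk or from any library.

```ruby
# Hypothetical policy object: owns retries, backoff and fallback handling.
class RetryPolicy
  def initialize(max_retries: 3, base_wait: 0.1, sleeper: ->(seconds) { sleep(seconds) })
    @max_retries = max_retries
    @base_wait = base_wait
    @sleeper = sleeper # injectable so callers (and tests) control waiting
  end

  # Runs the block, retrying on StandardError with exponential backoff.
  # After the last failed attempt it calls `fallback` if given, else re-raises.
  def call(fallback: nil)
    attempts = 0
    begin
      attempts += 1
      yield
    rescue StandardError
      if attempts <= @max_retries
        @sleeper.call(@base_wait * (2**(attempts - 1)))
        retry
      end
      raise unless fallback
      fallback.call
    end
  end
end

# Hypothetical business-side client: no retry counters or wait times in here.
class ExchangeRateClient
  def initialize(fetcher:, policy: RetryPolicy.new)
    @fetcher = fetcher # any callable that returns rates or raises
    @policy = policy
    @last_known = nil
  end

  # Returns fresh rates, or the last known rates if the service stays down.
  def rates
    stale = -> { @last_known || raise('Could not fetch exchange rates') }
    @last_known = @policy.call(fallback: stale) { @fetcher.call }
  end
end
```

With this split, changing the retry budget or the backoff curve touches only the policy, and the same policy can wrap calls to other services.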

The reactive manifesto
“The system stays responsive in the face of failure. This applies not only to highly-available, mission-critical systems — any system that is not resilient will be unresponsive after a failure.”
www.reactivemanifesto.org

The reactive manifesto, paradigms for resilience:
• Delegation
• Replication
• Containment
• Isolation
www.reactivemanifesto.org

Delegation
“Delegating a task asynchronously to another component means that the execution of the task will take place in the context of that other component. This delegated context could entail running in a different error handling context, on a different thread, in a different process, or on a different network node, to name a few possibilities.”

Replication
“Executing a component simultaneously in different places is referred to as replication. This can mean executing on different threads or thread pools, processes, network nodes, or computing centers.”

Isolation (and Containment)
“Isolation can be defined in terms of decoupling, both in time and space. Decoupling in time means that the sender and receiver can have independent life-cycles—they do not need to be present at the same time for communication to be possible. It is enabled by adding asynchronous boundaries between the components, communicating through message-passing.”

General basic dos for preparing for resilience
• Logging
• Monitoring
• Integration Testing
• System Testing

Service Design
• Cluster services into use cases: What SLAs do you need for each?
• Make a list of your services: Which are most important for your business?
• What are the challenges?
  ◦ Response time (e.g. online shop, UX)
  ◦ Are there any potential peaks to expect?
  ◦ Data-intensive services (network bandwidth, CPU, GPU, I/O writes)

Patterns
Source: www.reddit.com/r/ProgrammerHumor/comments/72fwhc/modern_application_architecture

Ambassador pattern (aka sidecar)
Diagram: Host [Application (main functionality) → Ambassador] → Remote Service
Ambassador: proxy to handle:
• Retry
• Circuit breaking
• Monitoring
• Security
Source: docs.microsoft.com/en-us/azure/architecture/patterns/ambassador
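
Of the concerns an ambassador takes over, circuit breaking is probably the least obvious, so here is a minimal in-process sketch of the idea. After a number of consecutive failures the breaker opens and fails fast; after a timeout it lets one trial call through (half-open). `CircuitBreaker`, its thresholds and the injectable `clock` are assumptions made for illustration. A real ambassador would run as a separate proxy process rather than as a class inside the application.

```ruby
# Hypothetical circuit breaker; thresholds and names are illustrative only.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, reset_timeout: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout # seconds before a trial call is allowed
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  # :closed (normal), :open (fail fast) or :half_open (one trial call allowed)
  def state
    return :closed if @opened_at.nil?
    (@clock.call - @opened_at) >= @reset_timeout ? :half_open : :open
  end

  # Runs the block unless the circuit is open; tracks success and failure.
  def call
    raise OpenError, 'circuit open, failing fast' if state == :open
    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    reset
    result
  end

  private

  def record_failure
    @failures += 1
    # A failed trial call, or too many failures in a row, (re)opens the circuit.
    @opened_at = @clock.call if @opened_at || @failures >= @failure_threshold
  end

  def reset
    @failures = 0
    @opened_at = nil
  end
end
```

A caller would wrap each remote call, for example `breaker.call { http_get(url) }`, where `http_get` stands for whatever client the application already uses.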

Anti-Corruption Layer pattern
Diagram: Subsystem A (Microservice, Microservice, Microservice) ↔ Anti-Corruption Layer ↔ Subsystem B (aka legacy)
Source: docs.microsoft.com/en-us/azure/architecture/patterns/anti-corruption-layer

Message Queues
Diagram: Producer → Broker (Topic → Queue 1, Queue 2, Queue 3) → Consumers
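
The fan-out in the diagram can be imitated in a single process with Ruby's built-in thread-safe `Queue`. This is only a toy model of the idea, assuming that every subscriber queue receives a copy of each published message. `Topic` and `consume` are hypothetical names, and a real broker adds persistence, acknowledgements and network transport.

```ruby
# Toy in-process "broker": one topic fanning out to named queues.
class Topic
  def initialize
    @queues = {}
  end

  # Registers a queue for a consumer; subscribe before publishing starts.
  def subscribe(name)
    @queues[name] = Queue.new
  end

  # Every subscribed queue gets its own copy of the message reference.
  def publish(message)
    @queues.each_value { |queue| queue << message }
  end

  # Signals consumers that no more messages will arrive.
  def close
    @queues.each_value(&:close)
  end
end

# Starts a consumer thread that drains its queue at its own pace.
# Messages must be truthy: pop returns nil once the queue is closed and empty.
def consume(queue)
  Thread.new do
    received = []
    while (message = queue.pop)
      received << message
    end
    received # available to the caller via Thread#value
  end
end
```

The producer never waits for a consumer here; a slow or crashed consumer only delays its own queue, which is the decoupling the pattern is after.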

Compensating Transaction pattern: SaaS shop
1. Register User (compensate: Delete User)
2. Acquire Package (compensate: Delete Package association)
3. Choose a Plan (compensate: Delete Plan association)
4. Issue payment (compensate: Cancel payment / trigger refund)
5. Send confirmation email (compensate: Send cancelation email)
Counter operation in each step of the long running transaction
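
A rough sketch of the mechanics: every step carries its counter operation, and when a step fails, the steps that already completed are undone in reverse order. `Step` and `CompensatingTransaction` are hypothetical names. The sketch assumes compensations themselves succeed, which a production system cannot take for granted.

```ruby
# Each step pairs an action with its counter operation (both callables).
Step = Struct.new(:name, :action, :compensation)

class CompensatingTransaction
  def initialize(steps)
    @steps = steps
  end

  # Runs steps in order. If one raises, compensates the completed steps
  # in reverse order and re-raises the original error.
  def run(context)
    completed = []
    @steps.each do |step|
      step.action.call(context)
      completed << step
    end
    context
  rescue StandardError
    completed.reverse_each { |step| step.compensation.call(context) }
    raise
  end
end
```

For the SaaS shop, a failure in "Issue payment" would trigger "Delete Plan association", "Delete Package association" and "Delete User", in that order.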

Microservice Antipatterns
• Microlith: Everything depends on everything else
• The More The Merrier: Too much of everything, i.e. communication, too complex, bad service cuts
• Flying Blind: Too little logging & monitoring
• Pride and Wrath of Distributed Services: “Yeah, it’s going to work just fine”
Nice read: itnext.io/anti-patterns-of-microservices-6e802553bd46

General (Resiliency) Anti-patterns
Do not violate these programming principles:
• Separation of concerns → enable change / improvements
• Information hiding → enable change / improvements
• Loose coupling → enable change / improvements
• Fail-fast (but of course be resilient as a system) → detect failures / corrupt data
You might violate:
• “Don't repeat yourself”, “Cohesion”

Do violate: “If it ain't broke, don't fix it.”

Testing with Chaos

A QA engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd.
@brenankeller, 2018-11-30, 1:21PM

First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone.
@brenankeller, 2018-11-30, 1:21PM

Principles of Chaos Engineering
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
principlesofchaos.org

Cynefin complexity framework
Cynefin (pronounced kun-EV-in) means haunt, habitat, acquainted, accustomed, or familiar.

Cynefin domains:
• Obvious: Known knowns
• Complicated: Known unknowns
• Complex: Unknown unknowns
• Chaotic: Unknowables
• Disorder
Source: https://learning.oreilly.com/library/view/the-agile-developers/9781787280205/

“Chaos doesn’t cause problems it reveals them”*
Nora Jones, Senior Software Engineer, Netflix
*(really hard to find the original source, but quoted very frequently - must be somehow a valid quote)

Chaos engineering general approach
1. Formulate hypothesis: What might go wrong in the system?
2. Set up experiment: Can you recreate the failure without impacting users?
3. Minimize blast radius: Try the smallest experiment first to learn something.
4. Run the experiment: Monitor the results and the system behavior carefully.
5. Analyze: If the system did not work as expected, well, you found a bug. If everything worked as it should, increase the blast radius and repeat.
6. Fix it: If an error occurred, try to fix it, using your metrics to find the problem. Repeat the experiment.

Diagram (cycle): Formulate hypothesis → Set up experiment → Run the experiment → Analyze → Fix it
Do not forget the blast radius!!!
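
The cycle above could be driven by a loop along the following lines: start with the smallest blast radius, check the steady-state hypothesis after each injection, always roll back, and stop at the first radius where the hypothesis no longer holds. `Experiment`, `run_experiment` and the meaning of a "radius" (for example percent of traffic) are assumptions for illustration. Tools such as Chaos Toolkit have their own, richer experiment formats.

```ruby
# Hypothetical experiment description; every field except :hypothesis is a callable.
Experiment = Struct.new(:hypothesis, :inject, :steady_state, :rollback, keyword_init: true)

# Runs the experiment with growing blast radii (smallest first).
# Stops at the first radius where the steady-state check fails.
def run_experiment(experiment, blast_radii:)
  blast_radii.each do |radius|
    experiment.inject.call(radius)
    healthy = experiment.steady_state.call
    experiment.rollback.call # always undo the injection before deciding
    return { passed: false, failed_at: radius } unless healthy
  end
  { passed: true, failed_at: nil }
end
```

A failed run points at the radius to investigate and fix. A passed run only says the hypothesis held for the radii that were tried.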

Failure Injection
1. Application level
2. Host failure
3. Resource attacks (RAM, CPU)
4. Network attacks (latency, dependencies)
5. Region attacks

Examples for what you can test
• Maxing out CPU cores on an Elasticsearch cluster.
• Time travel: Forcing system clocks out of sync with each other.
• Simulating the failure of an entire region or datacenter.
• Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.

Examples for what you can test (continued)
• Injecting latency between services for a select percentage of traffic over a predetermined period of time.
• Function-based chaos (runtime injection): Randomly causing functions to throw exceptions.
• Executing a routine in driver code emulating I/O errors.
Source: learning.oreilly.com/library/view/chaos-engineering/9781491988459
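
Function-based chaos is easy to try in Ruby, because a method can be wrapped at runtime with `Module#prepend`. The sketch below makes a chosen method raise with a given probability and otherwise delegates to the original. `ChaosInjection`, `InjectedFailure` and the `rng` parameter are hypothetical names; the keyword forwarding assumes Ruby 3. This belongs in test or staging setups, not unguarded in production code.

```ruby
module ChaosInjection
  class InjectedFailure < StandardError; end

  # Wraps klass#method_name so that each call raises InjectedFailure with
  # the given probability (0.0..1.0); otherwise the original method runs.
  def self.inject(klass, method_name, probability:, rng: Random.new)
    wrapper = Module.new do
      define_method(method_name) do |*args, **kwargs, &block|
        raise InjectedFailure, "chaos: #{method_name} failed" if rng.rand < probability
        super(*args, **kwargs, &block) # explicit arguments are required inside define_method
      end
    end
    klass.prepend(wrapper) # the wrapper now sits in front of the original method
  end
end
```

Passing a seeded `Random.new(42)` as `rng` makes a chaos run repeatable, which helps when a failure needs to be reproduced.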

Tools for Chaos Engineering
1. Istio: istio.io
2. Chaos Toolkit: chaostoolkit.org
3. Kube Monkey: github.com/asobti/kube-monkey
4. Powerful Seal: github.com/bloomberg/powerfulseal
5. Gremlin: www.gremlin.com
6. Chaosmonkey: github.com/netflix/chaosmonkey
7. Some python scripts ;-)

Take away: Be pragmatic
1. Know your most important services
2. Consider the end-to-end user experience (UX)
3. Be defensive to failure
4. Architect for change
5. Expect the worst and deal with it
6. What is the simplest solution that works for me?
7. Play around with Chaos
8. Never forget about people: Ask other teams and experts, and bring them on board.

Dr. Walter Kammergruber
CTO
@woidda
xing.com/profile/WalterChristian_Kammergruber
linkedin.com/in/dr-walter-kammergruber-9b643130

Links
• Velocity Conf: conferences.oreilly.com/velocity/vl-eu-2018/public/schedule/full/public
• Chaos Engineering resources: github.com/dastergon/awesome-chaos-engineering
• Design patterns for microservices: azure.microsoft.com/en-us/blog/design-patterns-for-microservices/
  ◦ Especially: docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
• O'Reilly stuff: www.oreilly.com/tags/microservices
  ◦ Especially: learning.oreilly.com/library/view/release-it/9781680500264/
