Woidda
June 26, 2019

Resilient Microservices Architecture: A Pragmatic Approach to Dealing with Chaos

You need to keep your application running. Downtime costs you money, in some cases a lot, and in the worst case it can destroy a company.
Let us talk about architectural styles and other approaches that make your system more resilient. Failures are normal and everything will fail at some point: be prepared, even if you are not working for Netflix, Google or Amazon.
We will take a quick look at tools for setting up simple "chaos" tests, or better, resilience tests, to check that your stack keeps running.

This talk covers:

1) Why resilience is important, especially for microservices
2) Best practices for designing resilient distributed architectures
3) Testing resilience in applications

We are going to talk about general concepts rather than how to configure and set up tool x, y or z. The concepts should apply to a simple web application as well as to larger systems with tons of legacy code written in a rather exotic language that only a few retired people can understand.

Transcript

  1. Resilient Software Architecture for Microservices: A Pragmatic Approach to Dealing with Chaos
    Dr. Walter Kammergruber
    MicroService Meetup Munich, 26.06.2019
  2. Overview
    1. Why is resilience important, especially for microservices?
    2. Best practices for designing resilient distributed architectures
    3. Testing resilience in applications
  3. Murphy's law
    “Anything that can possibly go wrong, does.”
    Sack, John. The Butcher: The Ascent of Yerupaja, epigraph (1952), reprinted in Shapiro, Fred R., ed., The Yale Book of Quotations 529 (2006)
  4. ‘Resilience is an imperative. Our software runs on the truly dismal computers we call data centers. Besides being heinously complex ...they are unreliable and prone to operator error.’
    Marius Eriksen, former “peddler of abstraction” at Twitter
  5. 7

  6. Potential problems with microservices in general
    • Performance
    • Correctness
    • Monitoring
    • Debugging
    • Efficiency
    • Security
    • Operability
    • Resilience
  7. Resilience in microservices is hard
    At least one of those:
    • software you didn't write
    • hardware you can't touch
    • network you can't configure
    Stuff
    • breaks in new and surprising ways
    • and your customers shouldn't notice
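  8. Example (Ruby):

        # Retry, wait and cache handling mixed into the business logic.
        # present? comes from ActiveSupport; call_exchange_rates, cache_rates,
        # cached_rates and increase_wait_time are defined elsewhere.
        MAX_RETRIES = 7

        def retrieve_exchange_rates
          retry_count = 0
          wait_time = 1 # seconds; the slide gives no initial value, so this is assumed
          while retry_count < MAX_RETRIES
            response = call_exchange_rates
            if response.present?
              cache_rates(response)
              return response
            end
            sleep(wait_time)
            wait_time = increase_wait_time(retry_count)
            retry_count += 1
          end
          # All retries failed: fall back to the last cached rates if there are any.
          cached = cached_rates
          return cached if cached.present?
          raise 'Could not fetch Exchange Rate'
        end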
  9. What's wrong?
    Code smells: it is not the job of the business logic / your code to know about
    - Number of retries
    - (Increased) wait period
    - Caching
    Not in this example, but often found as well:
    - Strategies for dealing with request peaks (fallbacks, throttling, prioritization etc.)
    - Timeout thresholds ((milli-)seconds)
  10. What to do?
    - Service objects, maybe as an extra library
    - Job queues if possible
    - Ambassador Pattern (see later)
    - Use for example Finagle or Linkerd
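
One way to read the "service object" suggestion is to move the retry policy into its own small object, so the business logic only states what it wants. The following is a minimal sketch under that assumption; `RetryPolicy`, its parameters and the injectable `sleeper` are hypothetical names for illustration and do not come from the talk.

```ruby
# Hypothetical helper: the retry policy lives outside the business logic.
class RetryPolicy
  def initialize(max_retries: 3, base_wait: 0.1, sleeper: ->(seconds) { sleep(seconds) })
    @max_retries = max_retries
    @base_wait = base_wait
    @sleeper = sleeper # injectable so tests do not have to wait
  end

  # Runs the block and retries on StandardError with exponential backoff.
  # Re-raises the last error once the initial attempt and all retries failed.
  def call
    attempts = 0
    begin
      attempts += 1
      yield
    rescue StandardError
      raise if attempts > @max_retries
      @sleeper.call(@base_wait * (2**(attempts - 1)))
      retry
    end
  end
end

# The business logic no longer knows about retry counts or wait periods.
# client is any object responding to call (an assumption for this sketch).
def retrieve_exchange_rates(client, policy)
  policy.call { client.call }
end
```

Caching and fallbacks could be layered on in the same way, or the whole concern could move out of the process, which is what the ambassador pattern does.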
  11. The reactive manifesto
    “The system stays responsive in the face of failure. This applies not only to highly-available, mission-critical systems — any system that is not resilient will be unresponsive after a failure.”
    www.reactivemanifesto.org
  12. The reactive manifesto, paradigms for resilience:
    • Delegation
    • Replication
    • Containment
    • Isolation
    www.reactivemanifesto.org
  13. Delegation
    “Delegating a task asynchronously to another component means that the execution of the task will take place in the context of that other component. This delegated context could entail running in a different error handling context, on a different thread, in a different process, or on a different network node, to name a few possibilities.”
  14. Replication
    “Executing a component simultaneously in different places is referred to as replication. This can mean executing on different threads or thread pools, processes, network nodes, or computing centers.”
  15. Isolation (and Containment)
    “Isolation can be defined in terms of decoupling, both in time and space. Decoupling in time means that the sender and receiver can have independent life-cycles—they do not need to be present at the same time for communication to be possible. It is enabled by adding asynchronous boundaries between the components, communicating through message-passing.”
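
As a small in-process illustration of an asynchronous boundary, the sketch below puts a worker behind a queue: the sender only enqueues messages, and a failure while handling one message stays inside the worker. `start_worker` and its arguments are hypothetical names; in a distributed system a message broker would usually play the role of the queue.

```ruby
# Hypothetical worker isolated behind a message queue (Ruby's thread-safe Queue).
# The sender never calls the handler directly, so handler errors are contained.
def start_worker(inbox, results, errors, &handler)
  Thread.new do
    # pop blocks until a message arrives and returns nil once the queue
    # is closed and drained, which ends the loop.
    while (message = inbox.pop)
      begin
        results << handler.call(message)
      rescue StandardError => e
        # Containment: record the failure and keep serving later messages.
        errors << "#{message}: #{e.message}"
      end
    end
  end
end
```

A sender would push with `inbox << message` and call `inbox.close` when done. Note that in this sketch a `nil` or `false` message would also end the loop.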
  16. General basic do's to prepare for resilience
    • Logging
    • Monitoring
    • Integration Testing
    • System Testing
  17. Service Design
    • Cluster services into use cases: What SLAs do you need for that?
    • Make a list of your services: Which are most important for your business?
    • What are the challenges?
      ◦ Response time (e.g. online shop, UX)
      ◦ Are there any potential peaks to expect?
      ◦ Data-intensive services (network bandwidth, CPU, GPU, I/O writes)
  18. Ambassador pattern (aka sidecar)
    Host:
    • Application: main functionality
    • Ambassador: proxy to handle
      ◦ Retry
      ◦ Circuit breaking
      ◦ Monitoring
      ◦ Security
    Remote Service
    docs.microsoft.com/en-us/azure/architecture/patterns/ambassador
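
Circuit breaking is one of the jobs the slide assigns to the ambassador. The following is a minimal, single-threaded sketch of the idea, not the implementation of any particular proxy; `CircuitBreaker`, `failure_threshold`, `reset_timeout` and the injectable `clock` are hypothetical names chosen for illustration.

```ruby
# Hypothetical circuit breaker, as an ambassador might run it in front of a remote service.
# Not thread-safe; a real proxy would need locking or per-worker state.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, reset_timeout: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout # seconds the circuit stays open
    @clock = clock                 # injectable so tests can control time
    @failures = 0
    @opened_at = nil
  end

  # Passes the call through while closed, fails fast while open.
  def call
    raise OpenError, 'circuit open, failing fast' if open?

    begin
      result = yield
      @failures = 0
      @opened_at = nil
      result
    rescue StandardError
      @failures += 1
      @opened_at = @clock.call if @failures >= @failure_threshold
      raise
    end
  end

  private

  # After reset_timeout has passed, one trial call is let through (half-open).
  # If it fails, the circuit opens again; if it succeeds, the breaker resets.
  def open?
    return false if @opened_at.nil?

    @clock.call - @opened_at < @reset_timeout
  end
end
```

Failing fast while the circuit is open keeps callers from piling up on a remote service that is already struggling.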
  19. Compensating Transaction pattern: SaaS shop
    Counter operation in each step of the long running transaction:
    1. Register User → compensate: Delete User
    2. Acquire Package → compensate: Delete Package association
    3. Choose a Plan → compensate: Delete Plan association
    4. Issue payment → compensate: Cancel payment / trigger refund
    5. Send confirmation email → compensate: Send cancelation email
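
The pattern can be sketched as a list of steps, each paired with its counter operation: when a step fails, the steps that already completed are compensated in reverse order. `Step` and `run_with_compensation` are hypothetical names, and the sketch assumes the compensations themselves succeed; in practice they would need to be idempotent and retryable.

```ruby
# Hypothetical saga runner: each step pairs an action with its compensation.
Step = Struct.new(:name, :action, :compensation)

# Runs the steps in order. If one raises, the already completed steps are
# compensated in reverse order. Returns true on success, false after a rollback.
def run_with_compensation(steps, log)
  completed = []
  steps.each do |step|
    step.action.call
    log << "done: #{step.name}"
    completed << step
  end
  true
rescue StandardError => e
  log << "failed: #{e.message}"
  completed.reverse_each do |step|
    step.compensation.call
    log << "compensated: #{step.name}"
  end
  false
end
```

For the SaaS shop, a failed payment in step 4 would trigger "Delete Plan association", "Delete Package association" and "Delete User", in that order.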
  20. Microservice Antipatterns
    • Microlith: Everything depends on everything else
    • The More The Merrier: Too much of everything, i.e. communication, too complex, bad service cuts
    • Flying Blind: Too little logging & monitoring
    • Pride and Wrath of Distributed Services: “Yeah, it’s going to work just fine”
    Nice read: itnext.io/anti-patterns-of-microservices-6e802553bd46
  22. General (Resiliency) Anti-patterns
    Do not violate these programming principles:
    • Separation of concerns → enable change / improvements
    • Information hiding → enable change / improvements
    • Loose coupling → enable change / improvements
    • Fail-fast (but of course be resilient as a system) → detect failures / corrupt data
    You might violate:
    • “Don't repeat yourself”, “Cohesion”
    Do violate:
    • "If it ain't broke, don't fix it."
  23. A QA engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd.
    @brenankeller, 2018-11-30, 1:21PM
  24. First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone.
    @brenankeller, 2018-11-30, 1:21PM
  25. Principles of Chaos Engineering
    “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
    principlesofchaos.org
  26. [Figure: problem domains and what is known in each]
    • Obvious: known knowns
    • Complicated: known unknowns
    • Complex: unknown unknowns
    • Chaotic: unknowables
    • Disorder
    Source: https://learning.oreilly.com/library/view/the-agile-developers/9781787280205/
  27. “Chaos doesn’t cause problems it reveals them” *
    Nora Jones, Senior Software Engineer, Netflix
    * (really hard to find the original source, but quoted very frequently - must be somehow a valid quote)
  28. Chaos engineering general approach
    1. Formulate hypothesis: What might go wrong in the system?
    2. Set up experiment: Can you recreate the failure without impacting users?
    3. Minimize blast radius: Try the smallest experiment first to learn something.
    4. Run the experiment: Monitor the results and the system behavior carefully.
    5. Analyze: If the system did not work as expected, well, you found a bug. If everything worked as it should, increase the blast radius and repeat.
    6. Fix it: If an error occurred, try to fix it, using your metrics to find the problem. Repeat the experiment.
  29. Failure Injection
    1. Application level
    2. Host failure
    3. Resource attacks (RAM, CPU)
    4. Network attacks (latency, dependencies)
    5. Region attacks
  30. Examples for what you can test
    • Maxing out CPU cores on an Elasticsearch cluster.
    • Time travel: Forcing system clocks out of sync with each other.
    • Simulating the failure of an entire region or datacenter.
    • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
  31. Examples for what you can test (continued)
    • Injecting latency between services for a select percentage of traffic over a predetermined period of time.
    • Function-based chaos (runtime injection): Randomly causing functions to throw exceptions.
    • Executing a routine in driver code emulating I/O errors.
    learning.oreilly.com/library/view/chaos-engineering/9781491988459
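
Latency injection for a share of traffic and function-based chaos can both be tried at application level with a thin wrapper around a callable. This is a minimal sketch for experiments in a test environment, not the mechanism of any of the tools named in the talk; `ChaosWrapper` and its parameters are hypothetical.

```ruby
# Hypothetical chaos wrapper: injects latency or exceptions into a share of calls.
class ChaosWrapper
  class InjectedFailure < StandardError; end

  # failure_rate and latency_rate are probabilities between 0.0 and 1.0,
  # latency is the injected delay in seconds.
  def initialize(target, failure_rate: 0.0, latency_rate: 0.0, latency: 0.0,
                 rng: Random.new, sleeper: ->(seconds) { sleep(seconds) })
    @target = target
    @failure_rate = failure_rate
    @latency_rate = latency_rate
    @latency = latency
    @rng = rng         # pass Random.new(seed) for reproducible experiments
    @sleeper = sleeper # injectable so tests do not have to wait
  end

  def call(*args)
    # rand returns a float in [0, 1), so a rate of 0.0 never fires and 1.0 always does.
    @sleeper.call(@latency) if @rng.rand < @latency_rate
    raise InjectedFailure, 'injected by chaos experiment' if @rng.rand < @failure_rate

    @target.call(*args)
  end
end
```

Starting with a low rate on a single, non-critical call path is one way to keep the blast radius small.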
  32. Tools for Chaos Engineering
    1. Istio: istio.io
    2. Chaos Toolkit: chaostoolkit.org
    3. Kube Monkey: github.com/asobti/kube-monkey
    4. Powerful Seal: github.com/bloomberg/powerfulseal
    5. Gremlin: www.gremlin.com
    6. Chaosmonkey: github.com/netflix/chaosmonkey
    7. Some python scripts ;-)
  33. Take away: Be pragmatic
    1. Know your most important services
    2. Consider the End-to-End User Experience (UX)
    3. Be defensive to failure
    4. Architecture to change
    5. Expect the worst and deal with it
    6. What is the most simple solution that works for me?
    7. Play around with Chaos
    8. Never forget about people: Ask other teams and experts, and bring them on board.
  34. Links
    • Velocity Conf: conferences.oreilly.com/velocity/vl-eu-2018/public/schedule/full/public
    • Chaos Engineering resources: github.com/dastergon/awesome-chaos-engineering
    • Design patterns for microservices: azure.microsoft.com/en-us/blog/design-patterns-for-microservices/
      ◦ Especially: docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency
    • O'Reilly stuff: www.oreilly.com/tags/microservices
      ◦ Especially: learning.oreilly.com/library/view/release-it/9781680500264/