
Designing Resilient Microservice Architecture

Suat Köse

June 20, 2020

Transcript

  1. Designing Resilient Microservice Architecture
    Suat Köse
    Email: [email protected] • Blog: medium.com/@suadev • Code: github.com/suadev • Twitter: kose__suat • Pres.: speakerdeck.com/suadev
    • Backend Software Engineer
    • 9+ Years Experience
    • Banking, Telecommunication, Energy, Hospitality
    20th June, 2020 [Online]
  2. Agenda
    • What Is Resiliency & Fault Tolerance?
    • Resiliency in Microservice Architecture
    • How to Achieve a Resilient Architecture?
        • Independent Microservices
        • Applying Some Patterns
        • Chaos Engineering
  3. What Is Resiliency & Fault Tolerance?
    “In the fields of engineering and construction, Resiliency is the ability to absorb or avoid damage without suffering complete failure.” - Wikipedia
    >> A failure might be observed, but the rest of the system continues to run normally.
    “Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components.” - Wikipedia
    >> No impact is felt, apart from some delay during the failover.
  4. Resiliency in Microservice Architecture
    Resiliency is all about being Reactive to all kinds of failure scenarios.
    “Today's demands are simply not met by yesterday’s software architectures. We want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems.” - The Reactive Manifesto
  5. Resiliency in Microservice Architecture
    Splitting a monolith into microservices is also the first step toward resiliency. What’s next?
    • Infrastructure (high availability, redundancy, autoscaling, IaC, etc.)
    • Independent Microservices
    • Applying Some Patterns
    • Chaos Engineering
  6. Independent Microservices
    Why?
    • The more your services depend on each other, the harder it is to achieve resiliency.
    • If a service goes down, it should not affect other services; no service should be a single point of failure (SPoF).
    • SPoF is not only about service-to-service dependencies. (DB, Message Bus, Load Balancer, API Gateway, etc.)
    So, how do we build independent microservices?
  7. Independent Microservices
    Continuously refactor to improve the isolation level of your services.
    • Define service boundaries well.
    • Reduce synchronous communication. Consider async communication when it’s an option.
    • Keep a local copy of the data that a service needs. (event-driven or transaction log tailing)
    • Use a service-instance-per-Container / VM / Host deployment strategy.
    * Remember that all of these patterns have trade-offs.
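
As a rough illustration of the "local copy" idea, here is a minimal in-memory sketch. The `OrderService` class, the event shapes, and the event names (`CustomerCreated`, `CustomerRenamed`, `CustomerDeleted`) are assumptions made up for this example; a real system would receive the events from a message bus or a transaction log tailer.

```python
class OrderService:
    """Hypothetical service that keeps its own copy of customer names."""

    def __init__(self):
        # Local, read-only copy of the data owned by the customer service.
        self.customers = {}

    def on_customer_event(self, event):
        # Apply an event published by the (hypothetical) customer service.
        if event["type"] == "CustomerDeleted":
            self.customers.pop(event["id"], None)
        else:  # CustomerCreated or CustomerRenamed
            self.customers[event["id"]] = event["name"]

    def describe_order(self, order):
        # No synchronous call to the customer service is needed here.
        name = self.customers.get(order["customer_id"], "unknown customer")
        return f"Order {order['id']} for {name}"


orders = OrderService()
orders.on_customer_event({"type": "CustomerCreated", "id": 7, "name": "Ada"})
orders.on_customer_event({"type": "CustomerRenamed", "id": 7, "name": "Ada L."})
description = orders.describe_order({"id": 1, "customer_id": 7})
```

The trade-off is eventual consistency: the local copy can lag behind the owning service until the next event arrives.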
  8. Applying Some Patterns
    Having loosely-coupled, highly independent services is great, but we need more.
    • Bulkhead Pattern
    • Retry Pattern
    • Circuit Breaker Pattern
    • Fallback Pattern
    • Rate Limiting Pattern
    • Timeout Pattern
  9. Applying Some Patterns
    How can we apply these patterns?
    • One by one for each service. ❌
    • Isolate them from the services and manage them at a central point. ✅
        • API Gateway for external communication.
        • Service Mesh for inter-service communication.
    source: niis.org/blog/2020/1/20/interoperability-puzzle
  10. Applying Some Patterns: Bulkhead Pattern
    Isolate the compartments of a ship so that even if one compartment fills up with water, the ship stays afloat.
    In microservice architecture, a bulkhead means segregating resources.
    Use case: isolating a thread pool for each process in a microservice to prevent thread starvation.
    * Microservice architecture itself is also an implementation of the bulkhead pattern.
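
A minimal sketch of the thread-pool use case, using Python's standard `concurrent.futures`. The pool names (`payments`, `reports`) and the two functions are hypothetical; the point is only that each kind of work gets its own bounded pool.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# One bounded pool per kind of work: stuck "reports" calls can only exhaust
# their own pool, never the threads reserved for "payments".
pools = {
    "payments": ThreadPoolExecutor(max_workers=2, thread_name_prefix="payments"),
    "reports": ThreadPoolExecutor(max_workers=2, thread_name_prefix="reports"),
}


def slow_report(release: threading.Event) -> str:
    release.wait(timeout=5)  # simulates a hung downstream call
    return "report"


def quick_payment() -> str:
    return "paid"


release = threading.Event()
# Saturate the reports compartment with more stuck work than it has threads.
stuck = [pools["reports"].submit(slow_report, release) for _ in range(4)]
# Payments still complete because they run in a different pool.
payment_result = pools["payments"].submit(quick_payment).result(timeout=1)

release.set()  # let the simulated reports finish
report_results = [future.result(timeout=5) for future in stuck]
for pool in pools.values():
    pool.shutdown()
```

With a single shared pool of two threads, the payment would have queued behind the stuck reports instead.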
  11. Applying Some Patterns: Retry Pattern
    If a service detects a *transient* error while interacting with another service (or a database, broker, etc.), it can retry its request.
    What is a non-transient error? If the error is about business logic, authentication, authorization, or something similar, then it will probably repeat, so retrying does not help.
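
A minimal sketch of a retry helper with exponential backoff. `TransientError` is a hypothetical marker exception introduced for this example; in practice you would map timeouts, dropped connections, and similar errors onto it. Anything else is treated as non-transient and is raised immediately.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def retry(operation, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call operation(); retry only on TransientError, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky():
    # Fails twice with a transient error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary network glitch")
    return "ok"


result = retry(flaky)
```

Adding random jitter to the delay is a common refinement so that many clients do not retry at the same instant.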
  12. Applying Some Patterns: Circuit Breaker Pattern
    Stop sending additional requests to a service (or another component) that is already down. Fail fast. Give the service time to recover.
    • Retry -> Retries with the expectation that the operation will succeed.
    • Circuit Breaker -> Prevents an operation that is likely to fail.
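
A minimal, single-threaded sketch of a circuit breaker. The class name, the thresholds, and the injectable `clock` parameter are assumptions for illustration; production libraries add thread safety, metrics, and finer-grained half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast and give the dependency time to recover.
                raise CircuitOpenError("circuit is open")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: client.get(url))`, where `client` and `url` are placeholders.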
  13. Applying Some Patterns: Fallback Pattern
    In some cases, the request keeps failing no matter how often it is retried. The client service then accepts a default response so that it can complete its transaction. This default response is our Fallback. It’s a kind of plan B.
    [Diagram: Payment Service sends a request to Fraud Check Service, receives HTTP 500, retries ‘n’ times, and then falls back to the default response: no fraud.]
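
The payment and fraud-check scenario can be sketched as below. The function names and the use of `ConnectionError` to stand in for an HTTP 500 response are assumptions made for this example.

```python
def check_fraud_with_fallback(fraud_check, payment, retries=3, fallback=False):
    """Return the fraud verdict, or `fallback` ("no fraud") if every attempt fails."""
    for _ in range(retries):
        try:
            return fraud_check(payment)
        except ConnectionError:
            continue  # try again until the retry budget is used up
    return fallback  # plan B: a default response lets the payment complete


attempts = []


def failing_fraud_check(payment):
    # Simulates a fraud check service that answers every request with HTTP 500.
    attempts.append(payment)
    raise ConnectionError("HTTP 500")


is_fraud = check_fraud_with_fallback(failing_fraud_check, {"amount": 100})
```

Whether "no fraud" is an acceptable default is a business decision; for some transactions, rejecting or queueing the payment may be the safer plan B.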
  14. Applying Some Patterns: Rate Limiting
    Rate limiting helps us control the throughput of a service.
    Throughput: the number of messages transferred in a given amount of time.
    Why do we need to control it?
    • Security concerns: is the auth service being called tens of times a minute for the same user?
    • Preventing resource starvation: throttle lower-priority traffic to leave enough resources for critical transactions.
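
One common way to implement rate limiting is a token bucket; the slides do not name a specific algorithm, so this choice is an assumption. A minimal single-threaded sketch, with an injectable `clock` so the behaviour can be checked without waiting:

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or queue the request


fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0  # one second later, one token has been refilled
after_wait = [bucket.allow(), bucket.allow()]
```

For the per-user auth example, one bucket per user id (for instance in a dictionary) would limit each user independently.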
  15. Applying Some Patterns: Timeout Pattern
    In synchronous communication, one service should not wait for another service for an undetermined amount of time. Your job is to find the best timeout value. Try to keep it as short as possible.
    Benefits:
    • The service keeps working even when the service it depends on is not available. (Sync I/O)
    • No thread is blocked for a long time. (Async I/O)
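
A minimal sketch of bounding the wait on a dependency with the standard `concurrent.futures` module. `call_with_timeout` and `slow_dependency` are hypothetical names, and the timeout values are arbitrary.

```python
import concurrent.futures
import time


def call_with_timeout(executor, operation, timeout_s, default=None):
    """Wait at most timeout_s seconds for operation(); otherwise return default."""
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running keeps its worker
        # thread busy until it finishes; the caller just stops waiting.
        future.cancel()
        return default


def slow_dependency():
    time.sleep(0.3)  # simulates a dependency that answers too late
    return "late answer"


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    fast = call_with_timeout(executor, lambda: "on time", timeout_s=1.0)
    slow = call_with_timeout(executor, slow_dependency, timeout_s=0.05, default="timed out")
```

For network calls, passing a timeout to the client itself (for example the `timeout` argument of `urllib.request.urlopen`) is preferable, because it also releases the underlying connection.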
  16. Chaos Engineering
    Chaos Engineering aims to reveal weaknesses in a system before they cause an outage.
    Netflix Chaos Monkey
    While Netflix was migrating from physical infrastructure to AWS in 2010, they created a chaos engineering tool called ‘Chaos Monkey’. Chaos Monkey ran continuously in Netflix’s production environment and regularly shut down EC2 instances to ensure that services were resilient to instance failures.
  17. Chaos Engineering
    Netflix Simian Army
    They wanted to make sure that the loss of an instance would not affect the entire streaming experience. After the success of their first soldier, Chaos Monkey, they created the Simian Army:
    ➔ Latency Monkey
    ➔ Janitor Monkey
    ➔ Doctor Monkey
    ➔ Chaos Gorilla
    ➔ Chaos Kong, etc.
  18. Chaos Engineering
    • Tests need to be done in production.
    • It’s something like injecting a microbe into a body and seeing how the body fights it.
    Test steps: Form a theory --> Carry out the experiment --> See whether it validates the theory or not.
    Some real-world events: inject latency between services, randomly terminate VMs or containers, CPU load, memory load, fill up the disk, turn off networking, etc.
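
The three test steps can be sketched against a toy, in-memory cluster. Everything here (`Cluster`, `run_experiment`, the instance ids) is invented for illustration; it simulates the idea only and does not terminate real VMs or containers.

```python
import random


class Cluster:
    """Toy stand-in for a group of service instances."""

    def __init__(self, instance_ids):
        self.healthy = set(instance_ids)

    def terminate(self, instance_id):
        self.healthy.discard(instance_id)

    def handle_request(self):
        if not self.healthy:
            raise RuntimeError("no healthy instance left")
        return f"served by {min(self.healthy)}"


def run_experiment(cluster, rng):
    # 1. Theory: losing one random instance does not cause failed requests.
    victim = rng.choice(sorted(cluster.healthy))
    # 2. Experiment: terminate that instance.
    cluster.terminate(victim)
    # 3. Verification: does the cluster still serve a request?
    try:
        cluster.handle_request()
        return victim, True
    except RuntimeError:
        return victim, False


redundant = run_experiment(Cluster(["a", "b", "c"]), random.Random(0))
single = run_experiment(Cluster(["only"]), random.Random(0))
```

The redundant cluster confirms the theory, while the single-instance cluster refutes it, which is exactly the kind of weakness such an experiment is meant to surface before an outage does.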
  19. Conclusion
    The motivation to achieve resiliency should be present throughout the whole application life cycle: designing, coding, refactoring, testing, deploying, and maintaining.
    • How do you test your code before deploying to production?
    • Deployment strategy (backup, rollback, isolation, etc.)
    • Load testing
    • Resource consumption
    • Sync/Async communication
    Resiliency is not only about infrastructure or some patterns, but also about network, application, people, and culture.