Developer's Toolbox for Designing Resilient Microservices

Failures in microservice architectures are inevitable—but resilience is a choice. In this session, you’ll learn how to design systems that not only survive when things go wrong but continue to deliver value. We’ll walk through real-world examples and practical patterns to:
• Measure and multiply availability—why “three nines” vs. “four nines” matters
• Identify key failure modes: dependencies, internal bugs, network glitches, hardware faults, and cascading breakdowns
• Implement retries safely with back-off and jitter
• Apply fallback strategies: graceful degradation, caching, functional redundancy, and stubbed data
• Enforce timeouts and propagated deadlines to prevent resource exhaustion
• Leverage circuit breakers to isolate and contain repeated failures

By the end, you’ll have a developer’s toolbox for building microservices that stay calm under pressure.

Zikriye Ürkmez

May 26, 2025

Transcript

  1. HELLO! Backend developer since 2011. Engineering Manager at Navlungo and writer of the Codewarts Bülten newsletter. Focus areas: cross-functional team management, sustainable productivity, and building a great tech culture.
  2. Reality Check: failures are inevitable in any complex system. Trusting your infrastructure can be dangerous, and perfect code doesn’t mean a perfect system. A small human error can have massive consequences, dependencies amplify risks, and it’s impossible to eliminate failures.
  3. Beyond Uptime: Calculating Availability. Availability is more than just uptime. Three nines (99.9%) equates to 8.76 hours of downtime per year; four nines (99.99%) reduces downtime to just 52 minutes annually. The critical gap between 99.9% and 99.99% is the difference between hours and minutes of downtime. Reliability is defined by the weakest link, and a single failure can set off a domino effect.
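
A quick way to sanity-check those numbers is to compute yearly downtime straight from the availability figure. The sketch below is not from the talk; it is a minimal Go program (names are my own) that reproduces the three-nines and four-nines figures above.

```go
package main

import (
	"fmt"
	"time"
)

// downtimePerYear converts an availability figure (e.g. 0.999 for "three nines")
// into the expected downtime over one 365-day year.
func downtimePerYear(availability float64) time.Duration {
	year := 365 * 24 * time.Hour
	return time.Duration((1 - availability) * float64(year))
}

func main() {
	for _, a := range []float64{0.999, 0.9999} {
		fmt.Printf("%.4f -> %v per year\n", a, downtimePerYear(a).Round(time.Second))
	}
	// Output:
	// 0.9990 -> 8h45m36s per year   (≈ 8.76 hours)
	// 0.9999 -> 52m34s per year     (≈ 52 minutes)
}
```
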
  4. The Chain Reaction of Dependencies. Dependency chains increase vulnerability: availabilities multiply across the chain, so your system is only as strong as its weakest dependency. Reliability is collective, not individual; when one service falls, it pulls the whole system down.
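
To make the multiplication effect concrete: when a request has to pass through several dependencies in series, the end-to-end availability is roughly the product of the individual availabilities. A minimal sketch with made-up numbers (five dependencies, each at three nines):

```go
package main

import "fmt"

func main() {
	// Hypothetical serial call chain: five dependencies, each individually at 99.9%.
	deps := []float64{0.999, 0.999, 0.999, 0.999, 0.999}

	total := 1.0
	for _, a := range deps {
		total *= a // availabilities multiply along the chain
	}
	// Prints ≈ 0.9950: the chain as a whole is noticeably worse than any single link.
	fmt.Printf("end-to-end availability: %.4f\n", total)
}
```
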
  5. Balancing Risk and Cost. It’s important to be pragmatic: weigh the cost to design, build, deploy, and operate a defensive solution against the nature of your business and the expectations of your customers.
  6. What Can Go Wrong? Everything fails eventually, and one service’s fault becomes the user’s problem. Trace your dependency graph: every point of interaction between your service and another component indicates a possible area of failure. The main failure modes: dependency failures (“not my code!”), internal failures (the greatest threat often lives within), network failures (silence can be deadlier than errors), hardware failures (the cloud fails too), and cascading failures.
  7. Cascading Failures: one failure ignites a system-wide meltdown. Invisible until it’s too late, they spread as a domino effect driven by a positive feedback loop, in invisible chain reactions, until recovery becomes impossible.
  8. Solution Strategies Overview. How do we keep systems running when everything breaks? Four pillars of resilience: Retry (when should you press retry, and when should you step back?), Fallback (when you can’t fix it, how do you still deliver value?), Timeout (when waiting longer breaks the system), and Circuit Breaker (when repeated failures demand a pause).
  9. Retry: when should you press retry, and when should you step back? Retry idempotent operations only. If the failure is isolated and transient, a retry is a reasonable option; if the failure is persistent (for example, the capacity of the inventory service is reduced), subsequent calls may worsen the issue and further destabilize the system. Cap attempts with a retry budget, follow an exponential back-off curve, and add jitter to de-synchronize clients. [Chart: number of calls over time, requests vs. retries.]
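
As a rough illustration of those points (retry budget, exponential back-off, jitter), here is a minimal Go sketch. The operation, attempt cap, and base delay are illustrative assumptions rather than values from the talk, and it only makes sense for idempotent calls.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry runs op up to maxAttempts times, sleeping between failed attempts with
// exponential back-off and full jitter so that callers do not retry in lock-step.
func retry(ctx context.Context, maxAttempts int, baseDelay time.Duration, op func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil // success, stop retrying
		}
		// The back-off window doubles each attempt: base, 2x, 4x, ...
		window := baseDelay << attempt
		sleep := time.Duration(rand.Int63n(int64(window))) // full jitter: [0, window)
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err() // respect the caller's deadline
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Placeholder for an idempotent call to a flaky dependency.
	err := retry(ctx, 4, 50*time.Millisecond, func() error {
		return errors.New("inventory service unavailable")
	})
	fmt.Println(err)
}
```
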
  10. Fallbacks: when you can’t fix it, how do you still deliver value? Graceful degradation: when full functionality fails, give users something useful. Caching: when fresh data isn’t critical, serve from memory. Functional redundancy: when one source fails, switch to another. Stubbed data: when all else fails, fake it gracefully.
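
One way to express those fallbacks in code is a simple chain: try the live dependency first, fall back to a cached value, and finally return stubbed data. Everything here (the price lookup, the in-memory cache, the zero-value stub) is an illustrative assumption, not code from the talk.

```go
package main

import (
	"errors"
	"fmt"
)

// getPrice illustrates a fallback chain: live source, then cache, then stub.
func getPrice(sku string, live func(string) (float64, error), cache map[string]float64) float64 {
	if p, err := live(sku); err == nil {
		return p // happy path: fresh data
	}
	if p, ok := cache[sku]; ok {
		return p // degraded: possibly stale, still useful
	}
	return 0 // stubbed data: the caller should render "price unavailable"
}

func main() {
	cache := map[string]float64{"sku-42": 19.90}
	failingLive := func(string) (float64, error) { return 0, errors.New("pricing service down") }

	fmt.Println(getPrice("sku-42", failingLive, cache)) // 19.9 (served from cache)
	fmt.Println(getPrice("sku-99", failingLive, cache)) // 0    (stubbed)
}
```
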
  11. Timeouts & Deadlines: when waiting longer breaks the system, fail fast instead of waiting forever. Use individual timeouts and propagated deadlines to prevent resource exhaustion. Picking a deadline is difficult: if it’s too long, an unresponsive dependency can consume unnecessary resources in the calling service; if it’s too short, it can cause higher levels of failure for expensive requests.
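
In Go, both an individual timeout and a propagated deadline are usually expressed with context.Context: the handler sets a budget once, and every downstream call made with that context inherits it. The 300 ms budget and the callInventory helper below are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callInventory stands in for a downstream request; it honors the caller's deadline.
func callInventory(ctx context.Context) (string, error) {
	select {
	case <-time.After(500 * time.Millisecond): // simulated slow dependency
		return "42 items in stock", nil
	case <-ctx.Done():
		return "", ctx.Err() // fail fast instead of waiting forever
	}
}

func handleRequest(parent context.Context) {
	// The 300 ms budget is propagated to every downstream call made with this ctx.
	ctx, cancel := context.WithTimeout(parent, 300*time.Millisecond)
	defer cancel()

	if res, err := callInventory(ctx); err != nil {
		fmt.Println("inventory call failed:", err) // context.DeadlineExceeded
	} else {
		fmt.Println(res)
	}
}

func main() {
	handleRequest(context.Background())
}
```
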
  12. Circuit Breakers: when repeated failures demand a pause. Define a failure threshold and monitor the circuit state across three states: Closed (normal), where requests flow normally and failures are tracked; Open (failing), where requests are short-circuited and fallbacks are used; and Half-Open (testing), where trial requests check whether the service has recovered.
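
The sketch below is a minimal, single-threaded version of that state machine; the failure threshold and cool-down duration are illustrative assumptions, and a production breaker would also need locking and richer metrics.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type state int

const (
	closed   state = iota // requests flow normally, failures are tracked
	open                  // requests are short-circuited, fallbacks are used
	halfOpen              // trial requests check if the service has recovered
)

type breaker struct {
	st        state
	failures  int
	threshold int           // consecutive failures before tripping
	cooldown  time.Duration // how long to stay open before testing again
	openedAt  time.Time
}

func (b *breaker) Call(op func() error) error {
	if b.st == open {
		if time.Since(b.openedAt) < b.cooldown {
			return errors.New("circuit open: short-circuited, use fallback")
		}
		b.st = halfOpen // cool-down elapsed: allow a trial request
	}
	if err := op(); err != nil {
		b.failures++
		if b.st == halfOpen || b.failures >= b.threshold {
			b.st, b.openedAt = open, time.Now() // trip (or re-trip) the breaker
		}
		return err
	}
	b.st, b.failures = closed, 0 // success resets the breaker
	return nil
}

func main() {
	b := &breaker{threshold: 3, cooldown: 5 * time.Second}
	flaky := func() error { return errors.New("upstream error") }

	for i := 0; i < 5; i++ {
		fmt.Println(b.Call(flaky)) // trips to open after the third failure
	}
}
```
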
  13. Thanks! Keep thinking, keep building. Join the conversation at the Codewarts Newsletter: 🧠 insights on software culture, code quality, and leadership.