Developer's Toolbox for Designing Resilient Microservices

Failures in microservice architectures are inevitable—but resilience is a choice. In this session, you’ll learn how to design systems that not only survive when things go wrong but continue to deliver value. We’ll walk through real-world examples and practical patterns to:
• Measure and multiply availability—why “three nines” vs. “four nines” matters
• Identify key failure modes: dependencies, internal bugs, network glitches, hardware faults, and cascading breakdowns
• Implement retries safely with back-off and jitter
• Apply fallback strategies: graceful degradation, caching, functional redundancy, and stubbed data
• Enforce timeouts and propagated deadlines to prevent resource exhaustion
• Leverage circuit breakers to isolate and contain repeated failures

By the end, you’ll have a developer’s toolbox for building microservices that stay calm under pressure.

Zikriye Ürkmez

May 26, 2025

Transcript

  1. HELLO! Backend developer since 2011. Engineering Manager at Navlungo and writer of the Codewarts Bülten newsletter. Focus areas: cross-functional team management, sustainable productivity, and building a great tech culture.
  2. Reality Check: failures are inevitable in any complex system. Trusting your infrastructure can be dangerous, and perfect code doesn’t mean a perfect system. A small human error can have massive consequences, dependencies amplify risks, and it’s impossible to eliminate failures.
  3. Beyond Uptime: Calculating Availability. Availability is more than just uptime. Three nines (99.9%) equates to 8.76 hours of downtime per year; four nines (99.99%) reduces downtime to just 52 minutes annually. The critical gap between 99.9% and 99.99% is the difference between hours and minutes of downtime. Reliability is defined by the weakest link, and a single failure can set off a domino effect.
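
A quick way to sanity-check those numbers is to compute yearly downtime straight from the availability figure. The sketch below is not from the talk; it is a minimal Go program (names are my own) that reproduces the three-nines and four-nines figures above.

```go
package main

import (
	"fmt"
	"time"
)

// downtimePerYear converts an availability figure (e.g. 0.999 for "three nines")
// into the expected downtime over one 365-day year.
func downtimePerYear(availability float64) time.Duration {
	year := 365 * 24 * time.Hour
	return time.Duration((1 - availability) * float64(year))
}

func main() {
	for _, a := range []float64{0.999, 0.9999} {
		fmt.Printf("%.4f -> %v per year\n", a, downtimePerYear(a).Round(time.Second))
	}
	// Output:
	// 0.9990 -> 8h45m36s per year   (≈ 8.76 hours)
	// 0.9999 -> 52m34s per year     (≈ 52 minutes)
}
```
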
  4. The Chain Reaction of Dependencies. Dependency chains increase vulnerability: availabilities multiply across the chain, so your system is only as strong as its weakest dependency. Reliability is collective, not individual; when one service falls, it pulls the whole system down.
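
To make the multiplication effect concrete: when a request has to pass through several dependencies in series, the end-to-end availability is roughly the product of the individual availabilities. A minimal sketch with made-up numbers (five dependencies, each at three nines):

```go
package main

import "fmt"

func main() {
	// Hypothetical serial call chain: five dependencies, each individually at 99.9%.
	deps := []float64{0.999, 0.999, 0.999, 0.999, 0.999}

	total := 1.0
	for _, a := range deps {
		total *= a // availabilities multiply along the chain
	}
	// Prints ≈ 0.9950: the chain as a whole is noticeably worse than any single link.
	fmt.Printf("end-to-end availability: %.4f\n", total)
}
```
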
  5. Balancing Risk and Cost. It’s important to be pragmatic: weigh the cost to design, build, deploy, and operate a defensive solution against the nature of your business and the expectations of your customers.
  6. What Can Go Wrong? Everything fails eventually, and one service’s fault becomes the user’s problem. Trace your dependency graph: every point of interaction between your service and another component indicates a possible area of failure. The main failure modes: dependency failures (“not my code!”), internal failures (the greatest threat often lives within), network failures (silence can be deadlier than errors), hardware failures (the cloud fails too), and cascading failures.
  7. Cascading Failures: one failure ignites a system-wide meltdown. Invisible until it’s too late, they spread as a domino effect driven by a positive feedback loop, in invisible chain reactions, until recovery becomes impossible.
  8. Solution Strategies Overview. How do we keep systems running when everything breaks? Four pillars of resilience: Retry (when should you press retry, and when should you step back?), Fallback (when you can’t fix it, how do you still deliver value?), Timeout (when waiting longer breaks the system), and Circuit Breaker (when repeated failures demand a pause).
  9. Retry: when should you press retry, and when should you step back? Retry idempotent operations only. If the failure is isolated and transient, a retry is a reasonable option; if the failure is persistent (for example, the capacity of the inventory service is reduced), subsequent calls may worsen the issue and further destabilize the system. Cap attempts with a retry budget, follow an exponential back-off curve, and add jitter to de-synchronize clients. [Chart: number of calls over time, requests vs. retries.]
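
As a rough illustration of those points (retry budget, exponential back-off, jitter), here is a minimal Go sketch. The operation, attempt cap, and base delay are illustrative assumptions rather than values from the talk, and it only makes sense for idempotent calls.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry runs op up to maxAttempts times, sleeping between failed attempts with
// exponential back-off and full jitter so that callers do not retry in lock-step.
func retry(ctx context.Context, maxAttempts int, baseDelay time.Duration, op func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil // success, stop retrying
		}
		// The back-off window doubles each attempt: base, 2x, 4x, ...
		window := baseDelay << attempt
		sleep := time.Duration(rand.Int63n(int64(window))) // full jitter: [0, window)
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err() // respect the caller's deadline
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Placeholder for an idempotent call to a flaky dependency.
	err := retry(ctx, 4, 50*time.Millisecond, func() error {
		return errors.New("inventory service unavailable")
	})
	fmt.Println(err)
}
```
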
  10. Fallbacks: when you can’t fix it, how do you still deliver value? Graceful degradation: when full functionality fails, give users something useful. Caching: when fresh data isn’t critical, serve from memory. Functional redundancy: when one source fails, switch to another. Stubbed data: when all else fails, fake it gracefully.
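
One way to express those fallbacks in code is a simple chain: try the live dependency first, fall back to a cached value, and finally return stubbed data. Everything here (the price lookup, the in-memory cache, the zero-value stub) is an illustrative assumption, not code from the talk.

```go
package main

import (
	"errors"
	"fmt"
)

// getPrice illustrates a fallback chain: live source, then cache, then stub.
func getPrice(sku string, live func(string) (float64, error), cache map[string]float64) float64 {
	if p, err := live(sku); err == nil {
		return p // happy path: fresh data
	}
	if p, ok := cache[sku]; ok {
		return p // degraded: possibly stale, still useful
	}
	return 0 // stubbed data: the caller should render "price unavailable"
}

func main() {
	cache := map[string]float64{"sku-42": 19.90}
	failingLive := func(string) (float64, error) { return 0, errors.New("pricing service down") }

	fmt.Println(getPrice("sku-42", failingLive, cache)) // 19.9 (served from cache)
	fmt.Println(getPrice("sku-99", failingLive, cache)) // 0    (stubbed)
}
```
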
  11. Timeouts & Deadlines: when waiting longer breaks the system, fail fast instead of waiting forever. Use individual timeouts and propagated deadlines to prevent resource exhaustion. Picking a deadline is difficult: if it’s too long, an unresponsive dependency can consume unnecessary resources in the calling service; if it’s too short, it can cause higher levels of failure for expensive requests.
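
In Go, both an individual timeout and a propagated deadline are usually expressed with context.Context: the handler sets a budget once, and every downstream call made with that context inherits it. The 300 ms budget and the callInventory helper below are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callInventory stands in for a downstream request; it honors the caller's deadline.
func callInventory(ctx context.Context) (string, error) {
	select {
	case <-time.After(500 * time.Millisecond): // simulated slow dependency
		return "42 items in stock", nil
	case <-ctx.Done():
		return "", ctx.Err() // fail fast instead of waiting forever
	}
}

func handleRequest(parent context.Context) {
	// The 300 ms budget is propagated to every downstream call made with this ctx.
	ctx, cancel := context.WithTimeout(parent, 300*time.Millisecond)
	defer cancel()

	if res, err := callInventory(ctx); err != nil {
		fmt.Println("inventory call failed:", err) // context.DeadlineExceeded
	} else {
		fmt.Println(res)
	}
}

func main() {
	handleRequest(context.Background())
}
```
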
  12. Circuit Breakers: when repeated failures demand a pause. Define a failure threshold and monitor the circuit state across three states: Closed (normal), where requests flow normally and failures are tracked; Open (failing), where requests are short-circuited and fallbacks are used; and Half-Open (testing), where trial requests check whether the service has recovered.
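
The sketch below is a minimal, single-threaded version of that state machine; the failure threshold and cool-down duration are illustrative assumptions, and a production breaker would also need locking and richer metrics.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type state int

const (
	closed   state = iota // requests flow normally, failures are tracked
	open                  // requests are short-circuited, fallbacks are used
	halfOpen              // trial requests check if the service has recovered
)

type breaker struct {
	st        state
	failures  int
	threshold int           // consecutive failures before tripping
	cooldown  time.Duration // how long to stay open before testing again
	openedAt  time.Time
}

func (b *breaker) Call(op func() error) error {
	if b.st == open {
		if time.Since(b.openedAt) < b.cooldown {
			return errors.New("circuit open: short-circuited, use fallback")
		}
		b.st = halfOpen // cool-down elapsed: allow a trial request
	}
	if err := op(); err != nil {
		b.failures++
		if b.st == halfOpen || b.failures >= b.threshold {
			b.st, b.openedAt = open, time.Now() // trip (or re-trip) the breaker
		}
		return err
	}
	b.st, b.failures = closed, 0 // success resets the breaker
	return nil
}

func main() {
	b := &breaker{threshold: 3, cooldown: 5 * time.Second}
	flaky := func() error { return errors.New("upstream error") }

	for i := 0; i < 5; i++ {
		fmt.Println(b.Call(flaky)) // trips to open after the third failure
	}
}
```
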
  13. Thanks! Keep thinking, keep building. Join the conversation at the Codewarts Newsletter: 🧠 insights on software culture, code quality, and leadership.