When Dependencies Fail: Building Antifragile Applications in a Fragile World

SELÇUK USTA Head of Engineering /in/selcukusta selcukusta.com selcukusta ustasoglu selcukusta(at)gmail.com

AGENDA
> A Small Change, A Global Disruption
> Why Are Modern Applications So Fragile?
> Resilience ≠ Just Keeping Things Running
> Why Is It Often Overlooked?
> Storms We May Face
> Being Prepared for the Storm
> Planned Flexibility + Learning from Mistakes

A Small Change, A Global Disruption
On June 12, 2025, a small update created a very large impact. A policy control module had not been fully tested before the rollout, because it only failed under very specific conditions. Once triggered, it caused Google’s global infrastructure to return many ‘503’ errors. Thousands of users, business processes, and applications were affected, including production systems.

Why Are Modern Applications So Fragile?
Modern applications cannot stand on their own. Databases, message queues, third-party APIs… They are all part of a chain. But a chain is only as strong as its weakest link. And when that link breaks, the whole system breaks.

Resilience ≠ Just Keeping Things Running
> Expect Failure, Don’t Assume Stability: Every part of a system can fail. Don’t believe that databases, APIs, or queues will always work.
> Inject Chaos in a Controlled Way: Create test situations like network delays, database timeouts, or API errors before going to production.
> Observe, Measure, and Learn: Watch how the system reacts, collect data, and use it to improve resilience.
> Automate Recovery and Build for Self-Healing: Use tools like failover, retries with backoff, and circuit breakers to recover automatically.
> Balance Experiments with User Impact: Run chaos tests in safe environments so users are not harmed by experiments.
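As a minimal sketch of the "retries with backoff" and "circuit breaker" ideas named above: the class and function names, thresholds, and delays below are illustrative assumptions, not taken from the talk or from any specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Stops calling a dependency after repeated failures, then retries later."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry func, doubling the wait after each failed attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A call site might combine the two as `retry_with_backoff(lambda: breaker.call(query_database))`, where `query_database` is a hypothetical function that talks to the dependency. Production code would usually add jitter to the delays and catch narrower exception types.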

Why Is It Often Overlooked?
> Deadlines over Timeplans: In the triangle of quality, time, and cost, time usually becomes the main priority.
> Invisible Dependencies: These dependencies run in the background, so the risk is often overlooked.
> Testing Gap Between Staging and Production: Most resilience tests are done in staging, and the “real-world chaos” in production is missed.
> Short-Term Thinking: The “quick fix” approach often leads to bigger problems in the future.
> Comfort of Ready-to-Use Frameworks: We often believe the framework solves everything without simulating real scenarios.

Storms We May Face
DDD (Disaster-Driven Design) Principles
> Design with failures in mind.
> Discuss weak points early.
> Turn risks into acceptance criteria.
> Make resilience part of the design, not an afterthought.

BEING PREPARED FOR THE STORM

Toxiproxy
… lets you simulate real network issues in your test and CI pipelines. It combines controlled fault injection with chaotic scenarios, helping you prove your system has no single point of failure.

Planning
> App (Basic Web API)
> MongoDB (Database)
> Varnish (Reverse proxy with caching)
> Toxiproxy (Simulation proxy)
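One way the components above might be wired through Toxiproxy is sketched below. Toxiproxy's HTTP API accepts proxy definitions with `name`, `listen`, and `upstream` fields; the host names, ports, and the choice of which hops to proxy are assumptions for illustration and are not taken from the slides.

```python
import json


def proxy_definition(name, listen, upstream):
    """Build one Toxiproxy proxy definition (fields follow its HTTP API)."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}


# Assumed topology: the app reaches MongoDB through Toxiproxy, and Varnish
# reaches the app through Toxiproxy. All addresses are hypothetical.
PROXIES = [
    proxy_definition("mongodb", "0.0.0.0:27018", "mongodb:27017"),
    proxy_definition("app", "0.0.0.0:8081", "app:8080"),
]


def populate_body(proxies):
    """Serialize the list as a JSON request body, e.g. for POST /populate."""
    return json.dumps(proxies).encode("utf-8")
```

With this layout the app would connect to the proxy's `listen` address instead of MongoDB directly, so faults can be added to that hop without changing application code.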

Limit Data
Closes the connection when the transmitted data exceeds a limit.
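The effect of a data limit on a client can be approximated in-process, without a proxy. This is a rough sketch of the behaviour, not Toxiproxy's implementation; `limit_data` and `read_all` are hypothetical helper names.

```python
def limit_data(chunks, max_bytes):
    """Pass chunks through until max_bytes is used up, then fail like a closed connection."""
    sent = 0
    for chunk in chunks:
        remaining = max_bytes - sent
        if len(chunk) > remaining:
            if remaining > 0:
                yield chunk[:remaining]  # deliver the part that still fits
            raise ConnectionResetError("byte limit reached, connection closed")
        sent += len(chunk)
        yield chunk


def read_all(chunks):
    """Return (data received so far, True if the stream ended normally)."""
    received = bytearray()
    try:
        for chunk in chunks:
            received.extend(chunk)
    except ConnectionResetError:
        return bytes(received), False
    return bytes(received), True
```

The useful question for a test is what the application does with the partial payload: discard it, retry, or serve a cached copy.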

Network Timeouts
Stops all data from getting through, and closes the connection after a timeout.
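From the application's side, this fault looks like a peer that accepts data and never answers. The sketch below reproduces that with a local socket pair instead of a real proxy, to show why every network call needs its own deadline; the function name and the `ping` payload are illustrative assumptions.

```python
import socket


def call_silent_dependency(deadline_seconds):
    """Send a request to a peer that never replies; return None on timeout."""
    client, silent_peer = socket.socketpair()
    try:
        client.settimeout(deadline_seconds)
        client.sendall(b"ping\n")
        try:
            return client.recv(1024)
        except socket.timeout:
            # Without settimeout() above, recv() would block indefinitely.
            return None
    finally:
        client.close()
        silent_peer.close()
```

A caller can treat `None` as the signal to fall back, for example to a cached response, rather than letting the request hang.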

Latency
Adds a delay to all data going through the proxy.
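A latency fault can be added through Toxiproxy's HTTP API by posting a toxic to an existing proxy. The sketch below only builds the request; the admin address, proxy name, and toxic name are assumptions, and the delay values are examples rather than figures from the talk.

```python
import json
import urllib.request

TOXIPROXY_API = "http://localhost:8474"  # assumption: Toxiproxy's default admin address


def latency_toxic_request(proxy_name, latency_ms, jitter_ms=0):
    """Build a POST request that adds a latency toxic to the named proxy."""
    body = {
        "name": f"{proxy_name}_latency",  # hypothetical naming scheme
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,  # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return urllib.request.Request(
        f"{TOXIPROXY_API}/proxies/{proxy_name}/toxics",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Against a running Toxiproxy instance, `urllib.request.urlopen(latency_toxic_request("mongodb", 1000, 100))` would send it, assuming a proxy named `mongodb` already exists.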

Custom Faults
Network testing should not focus only on latency; even small protocol-level faults can lead to major issues.
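To make "protocol-level fault" concrete, here is one hypothetical example: a response whose body is cut short while its headers stay intact. Toxiproxy's own custom toxics are written in Go, so this is only an in-process Python illustration, and both function names are assumptions.

```python
def truncate_body(raw_response: bytes, drop_bytes: int) -> bytes:
    """Hypothetical fault: cut bytes off the end of a response, headers untouched."""
    return raw_response[:-drop_bytes] if drop_bytes > 0 else raw_response


def body_is_complete(raw_response: bytes) -> bool:
    """Check the body length against the Content-Length header."""
    head, _, body = raw_response.partition(b"\r\n\r\n")
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            return len(body) == int(value.strip())
    return False  # no Content-Length: this sketch treats the body as unverifiable
```

A client that skips this kind of check may accept the truncated payload as valid data, which is how a small fault turns into a larger one downstream.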

THANK YOU!
