Application Resilience Engineering & Operations at Netflix

Presented at Velocity 2013 Santa Clara

Distributed applications are complex systems full of latent failures (bugs), latency, and ever-changing behavior in the relationships between components. Systems easily “drift” away from a state of resilience, and failure can emerge from the relationships between components. Thus, applications (as components of a complex system) must be resilient to latency and failure in all of their system relationships and must not rely on infrastructure alone to implement this resilience.

This talk shares common resilience patterns used by Netflix in production, including:

- Bulkhead isolation using threads and semaphores
- Circuit breaker
- Fail Fast
- Fail Silent
- Static Fallback
- Stubbed Fallback
- Fallback via Network Cache
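
To make a few of these patterns concrete, here is a minimal sketch in Python that combines a semaphore bulkhead, a circuit breaker, fail-fast rejection, and a caller-supplied fallback around a single dependency. It is an illustration under stated assumptions, not the implementation used at Netflix: the class name `ResilientCall`, its parameters (`max_concurrent`, `failure_threshold`, `reset_after`, `clock`), and the simple consecutive-failure trip rule are all hypothetical choices made for this example.

```python
import threading
import time


class ResilientCall:
    """Hypothetical wrapper guarding one dependency (illustrative sketch only)."""

    def __init__(self, max_concurrent=2, failure_threshold=3,
                 reset_after=5.0, clock=time.monotonic):
        # Bulkhead: at most `max_concurrent` calls may be in flight at once.
        self._permits = threading.Semaphore(max_concurrent)
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # time the circuit last tripped, or None

    def _circuit_open(self):
        with self._lock:
            if self._opened_at is None:
                return False
            # Once `reset_after` seconds have passed, let trial calls through.
            return self._clock() - self._opened_at < self._reset_after

    def _record(self, ok):
        with self._lock:
            if ok:
                # A success closes the circuit and clears the failure count.
                self._failures = 0
                self._opened_at = None
            else:
                self._failures += 1
                if self._failures >= self._failure_threshold:
                    # Trip (or re-trip) the circuit.
                    self._opened_at = self._clock()

    def execute(self, primary, fallback):
        # Circuit breaker: while open, skip the dependency entirely.
        if self._circuit_open():
            return fallback()
        # Fail fast: never queue behind a saturated dependency.
        if not self._permits.acquire(blocking=False):
            return fallback()
        try:
            result = primary()
        except Exception:
            self._record(False)
            return fallback()
        else:
            self._record(True)
            return result
        finally:
            self._permits.release()


if __name__ == "__main__":
    guard = ResilientCall()

    def flaky_lookup():
        # Stand-in for a remote call that is currently failing.
        raise TimeoutError("dependency timed out")

    # Static fallback: return a fixed default instead of propagating the error.
    print(guard.execute(flaky_lookup, lambda: []))
```

In this sketch the fallback argument determines which pattern the caller gets: a function returning an empty value behaves like fail silent, a function returning a fixed default is a static fallback, and a function that reads from a cache would correspond to a fallback via network cache. Bulkhead rejections are deliberately not counted as failures here; that, too, is a design assumption rather than a rule.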

With these common patterns we can achieve resilience to failing system relationships. But systems are complex and always changing, so operating and maintaining a resilient system also means finding weaknesses and managing drift. Operating such systems at Netflix with these resilience patterns over the past 18 months has shown that implementing them in code is only half the battle: knowing how to deploy, configure, operate, and maintain resilience is a different body of knowledge.


Ben Christensen

June 19, 2013