Application Resilience Engineering & Operations at Netflix

by Ben Christensen

Published June 19, 2013 in Programming

Presented at Velocity 2013, Santa Clara: http://velocityconf.com/velocity2013/public/schedule/detail/28267

Distributed applications are complex systems full of latent failures (bugs), latency, and ever-changing behavior in the relationships between components. Systems easily “drift” away from a state of resilience, and failure can emerge from component relationships. Thus, applications (as components of a complex system) must be resilient to latency and failure in all of their system relationships and must not rely on infrastructure alone to implement this resilience.

The talk shares common resilience patterns that Netflix uses in production, including:

- Bulkhead isolation using threads and semaphores
- Circuit breaker
- Fail Fast
- Fail Silent
- Static Fallback
- Stubbed Fallback
- Fallback via Network Cache
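
To make a few of these patterns concrete, here is a minimal sketch that combines semaphore-based bulkhead isolation, a circuit breaker, fail-fast rejection, and a static fallback around a single dependency call. It is an illustration only, not Netflix's implementation: the class name `ResilientCommand`, its parameters, and the example dependency `flaky_recommendations` are all hypothetical, and the half-open behavior is deliberately simplified.

```python
import threading
import time


class ResilientCommand:
    """Hypothetical wrapper: semaphore bulkhead + circuit breaker + fallback."""

    def __init__(self, call, fallback, max_concurrent=10,
                 failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self._call = call
        self._fallback = fallback
        # Bulkhead: caps how many callers may be inside the dependency at once.
        self._permits = threading.BoundedSemaphore(max_concurrent)
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._lock = threading.Lock()
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    def _circuit_allows_request(self):
        with self._lock:
            if self._opened_at is None:
                return True
            # Simplification: once the cool-down has elapsed, requests pass
            # again until the next failure re-opens the circuit.
            return self._clock() - self._opened_at >= self._reset_after_s

    def _record_success(self):
        with self._lock:
            self._consecutive_failures = 0
            self._opened_at = None

    def _record_failure(self):
        with self._lock:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._failure_threshold:
                self._opened_at = self._clock()

    def execute(self, *args):
        if not self._circuit_allows_request():
            return self._fallback(*args)  # fail fast: circuit is open
        if not self._permits.acquire(blocking=False):
            return self._fallback(*args)  # bulkhead full: reject, do not queue
        try:
            result = self._call(*args)
        except Exception:
            self._record_failure()
            return self._fallback(*args)
        else:
            self._record_success()
            return result
        finally:
            self._permits.release()


def flaky_recommendations(user_id):
    # Stand-in for a remote dependency that is currently failing.
    raise ConnectionError("dependency unavailable")


command = ResilientCommand(
    flaky_recommendations,
    fallback=lambda user_id: [],  # static fallback: an empty list
    failure_threshold=2,
)
print(command.execute(42))
```

The fallback decides which pattern applies. Returning a constant, as here, is a static fallback; returning `None` or an empty result that the caller ignores is closer to fail silent; raising immediately instead of returning would be fail fast with no fallback at all.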

With these common patterns we can achieve resilience against failing system relationships. But systems are complex and always changing, so operating and maintaining a resilient system also means finding weaknesses and managing drift. Operating such systems at Netflix with resilience patterns over the past 18 months has shown that implementing them in code is only half the battle: knowing how to deploy, configure, operate and maintain resilience requires a different set of knowledge.