How Netflix does Failovers in 7 minutes
Amjith
May 12, 2018
Transcript
Whoops, something went wrong…
Netflix Streaming Error
We’re having trouble playing this title right now. Please try again later or select a different title.
Cloud Prod Engineering
Regional Failover in 7 minutes
AMJITH RAMANUJAM (@amjithr), Sr. Software Engineer, Traffic Team
Regional Failover
• Failover: standby system takes over when the main system fails
• Netflix Open Connect: all video delivery
• Amazon Web Services: everything else
Regional Failover
Christmas Eve 2012
ELB Outage: 7 hours
Content watched: 16,000 years in 1 day
[Timeline: ~14,000 BC (first colonization of America), 0 AD, 2018]
Content watched: 4,600 years in 7 hours
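The 7-hour figure follows from the daily figure by simple scaling. A quick check, assuming viewing is spread evenly across the day (a simplification the slide does not state):

```python
YEARS_WATCHED_PER_DAY = 16_000  # from the slide
OUTAGE_HOURS = 7                # from the slide


def years_watched_during(hours: float,
                         years_per_day: float = YEARS_WATCHED_PER_DAY) -> float:
    """Scale the daily viewing figure to a shorter window (uniform viewing assumed)."""
    return years_per_day * hours / 24


# Roughly 4,667 years, in line with the slide's rounded 4,600.
estimate = round(years_watched_during(OUTAGE_HOURS))
```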
Regional Failover
Active vs Passive
• Active - standby system is also serving traffic
• Passive - standby system is NOT serving traffic
Prerequisites
• Stateless services
• Regional replication of data
Failover Candidate
• Infrastructure problem isolated to one region
• Problem won’t follow if we move traffic
• Bad code deploy in a region
Regional Failover Process
• Detect the problem
• Scale the savior regions
• Shift traffic
Detect the problem
Is it working?
One metric to rule them all - Dumbledore
SPS: Stream Starts Per Second
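The slides name SPS as the single health metric but do not describe the detection logic. One plausible sketch: flag a region when observed SPS stays below a fraction of the expected value for several consecutive samples. The function name, the 0.8 ratio, and the 3-sample window are all hypothetical.

```python
from typing import Sequence


def sps_is_unhealthy(observed: Sequence[float],
                     expected: Sequence[float],
                     min_ratio: float = 0.8,
                     consecutive: int = 3) -> bool:
    """Return True if the last `consecutive` samples are all below min_ratio * expected.

    `observed` and `expected` are aligned time series of stream starts per second.
    The threshold and window size are illustrative, not Netflix's values.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(observed) < consecutive:
        return False  # not enough data to decide
    recent = zip(observed[-consecutive:], expected[-consecutive:])
    return all(o < min_ratio * e for o, e in recent)
```

Requiring several consecutive low samples is a simple way to avoid reacting to a single noisy data point.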
Scale Saviors
Scaling Pattern
• Linear Regression
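The slide names linear regression as the scaling pattern without giving the inputs. A minimal sketch, assuming cluster size is fitted against traffic from historical samples and then extrapolated to the traffic a savior region would absorb. The function names and sample numbers are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two aligned samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_instances(traffic: float, slope: float, intercept: float) -> int:
    """Round the fitted value up to a whole, non-negative instance count."""
    # round() trims floating-point noise before taking the ceiling.
    return max(0, math.ceil(round(slope * traffic + intercept, 6)))


# Hypothetical history: requests per second vs. instances in service.
history_traffic = [1000, 2000, 3000, 4000]
history_instances = [12, 22, 32, 42]
slope, intercept = fit_line(history_traffic, history_instances)
needed = predict_instances(5000, slope, intercept)
```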
Shift Traffic
Traffic Shift
• Proxy traffic
• Switch DNS
Regional Failover Process
• Detect the problem - 5 minutes
• Scale the savior regions - 35 minutes
• Shift traffic - 10 minutes
Total (scale + shift) = 45 mins
Nimble Goals
• Fast failover (<10 mins)
◦ Pre-scale
• Transparent to service owners
◦ No code changes for service owners
◦ No auto-scaling changes
[Diagram sequence, us-west: the clusters API, lolomo, nccp and Playready each gain a nimble_-prefixed counterpart (nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready), shown as "starting"]
[Diagram sequence, Failover: API, lolomo, nccp and Playready shown together with nimble_API, nimble_lolomo, nimble_nccp and nimble_Playready]
Regional Failover Process
• Detect the problem - 2 minutes
• Scale the savior regions - 4 minutes
• Shift traffic - 3 minutes
Total (scale + shift) = 7 mins
Architecture
Actions
• Periodic Tasks
• Triggered Tasks
Periodic Tasks
• Fetch historical data
• Predict cluster sizes
• Manage dark clusters
Triggered Tasks
• Ungate dark instances
• Transplant instances
• Traffic shift
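The slides list the triggered tasks but do not define them. The toy model below is one reading, assuming "ungate" means letting pre-started dark instances receive traffic and "transplant" means moving them from a dark cluster into the matching serving cluster. The classes, function names, and the order of steps (taken from the order of the list above) are assumptions, and no real cloud API is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Instance:
    instance_id: str
    gated: bool = True  # assumed meaning: running, but not taking traffic


@dataclass
class Cluster:
    name: str
    instances: List[Instance] = field(default_factory=list)


def ungate(dark: Cluster) -> None:
    """Mark every dark instance as eligible for traffic."""
    for inst in dark.instances:
        inst.gated = False


def transplant(dark: Cluster, serving: Cluster) -> int:
    """Move all instances from the dark cluster into the serving cluster."""
    moved = len(dark.instances)
    serving.instances.extend(dark.instances)
    dark.instances.clear()
    return moved


def fail_over(pairs: Sequence[Tuple[Cluster, Cluster]],
              shift_traffic: Callable[[], None]) -> None:
    """Ungate and transplant each (dark, serving) pair, then shift traffic."""
    for dark, serving in pairs:
        ungate(dark)
        transplant(dark, serving)
    shift_traffic()
```

In this model the serving cluster grows by adding instances that are already running, which matches the "Pre-scale" goal on the Nimble Goals slide.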
Nanoservices
• Python
• Flask
• RQ - Redis
Characteristics
• Anticipate Failure
• Eventually Consistent
• Garbage Collect
Anticipate Failure
• Multi-region
• Rebuild state from scratch
• Fallbacks
Fallbacks
• AWS State - EDDA
• Historical data - ATLAS
• Local Cache - Redis, Filesystem
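The slide lists fallback sources without saying how they are chained. A generic sketch of the idea: try each source in order and return the first answer. The source callables are hypothetical stand-ins (for example a primary lookup, then a Redis cache, then a file on disk); nothing here reflects the actual Nimble code.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_available(sources: Sequence[Callable[[], T]]) -> T:
    """Call each source in order and return the first result that does not raise."""
    errors: List[Exception] = []
    for source in sources:
        try:
            return source()
        except Exception as exc:  # a failed source just means "try the next one"
            errors.append(exc)
    last_error: Optional[Exception] = errors[-1] if errors else None
    raise RuntimeError(f"all {len(errors)} sources failed") from last_error


def primary_lookup() -> dict:
    """Hypothetical primary source that is currently unreachable."""
    raise ConnectionError("primary source unavailable")


def cached_lookup() -> dict:
    """Hypothetical local cache holding a slightly stale copy."""
    return {"cluster": "API", "instances": 42}


state = first_available([primary_lookup, cached_lookup])
```

Ordering the sources from freshest to stalest keeps the answer as current as possible while still returning something when the primary is down.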
Eventual Consistency
• AWS is eventually consistent
• Favor idempotent actions
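To illustrate why idempotent actions suit an eventually consistent backend, the sketch below contrasts a relative "add capacity" action with an absolute "ensure capacity" action. `FakeGroup` is a made-up in-memory stand-in, not an AWS client.

```python
class FakeGroup:
    """In-memory stand-in for a scaling group (illustrative only)."""

    def __init__(self, desired: int) -> None:
        self.desired = desired


def add_capacity(group: FakeGroup, extra: int) -> None:
    """NOT idempotent: a retry after an ambiguous failure adds capacity twice."""
    group.desired += extra


def ensure_capacity(group: FakeGroup, target: int) -> None:
    """Idempotent: repeating the call leaves the group in the same state."""
    if group.desired < target:
        group.desired = target


relative = FakeGroup(desired=10)
add_capacity(relative, 5)
add_capacity(relative, 5)       # retry: now 20 instead of the intended 15

absolute = FakeGroup(desired=10)
ensure_capacity(absolute, 15)
ensure_capacity(absolute, 15)   # retry: still 15
```

When a read may return stale state, a task cannot always tell whether its last write landed, so an action that is safe to repeat is the simpler choice.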
Orphan Cleaner
• Terminate detached instances
• Safety features
◦ Terminate slowly
◦ Don’t terminate large volume of instances
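The two safety features above can be sketched as a planning step that refuses oversized requests and splits the rest into small batches. The limits and the function name are hypothetical; a caller would pause between batches to terminate slowly.

```python
from typing import List, Sequence


def plan_terminations(orphans: Sequence[str],
                      batch_size: int = 5,
                      max_orphans: int = 50) -> List[List[str]]:
    """Split orphaned instance IDs into small batches, or refuse if there are too many.

    An unusually large orphan count more likely signals a bug or stale data than
    real garbage, so the cleaner stops instead of terminating in bulk.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if len(orphans) > max_orphans:
        raise RuntimeError(
            f"refusing to terminate {len(orphans)} instances (limit {max_orphans})"
        )
    return [list(orphans[i:i + batch_size])
            for i in range(0, len(orphans), batch_size)]


batches = plan_terminations([f"i-{n:04d}" for n in range(12)])
```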
FAQs
• How often do you fail over?
• Why not have dark clusters take traffic?
• How much did Nimble cost?
Suggestions
• Fallbacks, Fallbacks, Fallbacks
• Exercise it often
• Provide visibility
We’re hiring! @amjithr