How Netflix does Failovers in 7 minutes
Amjith
May 12, 2018
Transcript
Whoops, something went wrong…
Netflix Streaming Error
We’re having trouble playing this title right now. Please try again later or select a different title.
Cloud Prod Engineering
AMJITH RAMANUJAM (@amjithr), Sr. Software Engineer, Traffic Team
Regional Failover in 7 minutes
Regional Failover
• Standby system takes over when the main system fails
• Netflix Open Connect: All video delivery
• Amazon Web Services: Everything else
Regional Failover
Christmas Eve 2012
ELB Outage: 7 Hours
Content watched
• 16,000 years in 1 day
• 4,600 years in 7 hours
[Timeline: ~14,000 BC, first colonization of America; 0 AD; 2018]
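The two figures on this slide are roughly consistent with each other. A quick check, assuming viewing is spread evenly over 24 hours (an assumption that is not stated on the slide; `years_lost` is a hypothetical helper name):

```python
# Rough check of the slide's numbers, assuming viewing is spread
# evenly across the day (an assumption; real traffic has peaks).
YEARS_WATCHED_PER_DAY = 16_000  # from the slide
OUTAGE_HOURS = 7                # from the slide


def years_lost(outage_hours: float,
               years_per_day: float = YEARS_WATCHED_PER_DAY) -> float:
    """Viewing time that falls inside an outage of the given length."""
    return years_per_day * outage_hours / 24


print(round(years_lost(OUTAGE_HOURS)))  # 4667, close to the slide's 4,600
```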
Regional Failover
Active vs Passive
• Active - standby system is also serving traffic
• Passive - standby system is NOT serving traffic
Prerequisites
• Stateless services
• Regional replication of data
Failover Candidate
• Infrastructure problem isolated to one region
• Problem won’t follow if we move traffic
• Bad code deploy in a region
Regional Failover Process
• Detect the problem
• Scale the savior regions
• Shift traffic
Detect the problem
Is it working?
One metric to rule them all - Dumbledore
SPS: Stream Starts Per Second
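The slides name SPS as the single health metric but do not say how a drop is detected. As a minimal sketch, assuming detection means "observed SPS stays well below an expected value for several samples in a row", it could look like this. The function name `detect_sps_drop` and the parameters `threshold` and `min_consecutive` are hypothetical, not taken from the talk.

```python
from typing import Sequence


def detect_sps_drop(observed: Sequence[float],
                    expected: Sequence[float],
                    threshold: float = 0.8,
                    min_consecutive: int = 3) -> bool:
    """Return True if observed SPS stays below threshold * expected
    for at least min_consecutive samples in a row."""
    run = 0
    for obs, exp in zip(observed, expected):
        if obs < threshold * exp:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # a healthy sample resets the streak
    return False


expected = [100, 100, 100, 100, 100, 100]
healthy = [98, 101, 79, 99, 100, 97]   # a single blip, not a sustained drop
failing = [99, 75, 60, 40, 35, 30]     # sustained drop

print(detect_sps_drop(healthy, expected))  # False
print(detect_sps_drop(failing, expected))  # True
```

Requiring several consecutive low samples is one way to avoid failing over on a momentary dip; the right window depends on how noisy the metric is.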
Scale Saviors
Scaling Pattern: Linear Regression
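The slide names linear regression as the scaling pattern without saying which variables are fitted. A minimal sketch, assuming instance count is regressed against SPS and the fitted line is then used to size a savior region for the traffic it must absorb (the data points and the names `fit_line` and `predict_instances` are made up for illustration):

```python
import math
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_instances(sps: float, slope: float, intercept: float) -> int:
    """Round up: a fractional instance cannot be launched."""
    return math.ceil(slope * sps + intercept)


# Hypothetical history for one cluster: SPS served vs. instances running.
sps_history = [1000, 2000, 3000, 4000]
instance_history = [12, 22, 32, 42]

slope, intercept = fit_line(sps_history, instance_history)
# Size needed if this region must absorb traffic up to 6000 SPS.
print(predict_instances(6000, slope, intercept))  # 62
```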
Shift Traffic
Traffic Shift
• Proxy Traffic
• Switch DNS
Regional Failover Process
• Detect the problem - 5 minutes
• Scale the savior regions - 35 minutes
• Shift traffic - 10 minutes
Total = 45 mins
Nimble Goals
• Fast failover (<10mins)
  ◦ Pre-scale
• Transparent to service owners
  ◦ No code changes for service owners
  ◦ No auto-scaling changes
[Diagram, us-west: clusters API, lolomo, nccp, Playready; then nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready are added alongside them, with four "starting" labels]
[Diagram, Failover: API, lolomo, nccp, Playready with nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready]
Regional Failover Process
• Detect the problem - 2 minutes
• Scale the savior regions - 4 minutes
• Shift traffic - 3 minutes
Total = 7 mins
Architecture
Actions
• Periodic Tasks
• Triggered Tasks
Periodic Tasks
• Fetch historical data
• Predict cluster sizes
• Manage dark clusters
Triggered Tasks
• Ungate dark instances
• Transplant instances
• Traffic shift
Nanoservices
• Python
• Flask
• RQ - Redis
Characteristics
• Anticipate Failure
• Eventually Consistent
• Garbage Collect
Anticipate Failure
• Multi-region
• Rebuild state from scratch
• Fallbacks
Fallbacks
• AWS State - EDDA
• Historical data - ATLAS
• Local Cache - Redis, Filesystem
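The fallback idea on this slide can be sketched as a chain of data sources tried in order. This is an illustration only: the fetch functions below are stand-ins, not real EDDA, ATLAS, or Redis client calls, and the helper name `first_available` is hypothetical.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def first_available(sources: Iterable[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (name, fetch) pair in order; return the first that succeeds."""
    errors: List[str] = []
    for name, fetch in sources:
        try:
            return name, fetch()
        except Exception as exc:  # any failure means "try the next source"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Stand-in fetchers (hypothetical): primary service, then local caches.
def from_primary() -> dict:
    raise ConnectionError("primary unreachable")


def from_redis_cache() -> dict:
    raise TimeoutError("cache timed out")


def from_filesystem() -> dict:
    return {"api": 120}


source, sizes = first_available([
    ("primary", from_primary),
    ("redis", from_redis_cache),
    ("filesystem", from_filesystem),
])
print(source, sizes)  # filesystem {'api': 120}
```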
Eventual Consistency
• AWS is eventually consistent
• Favor idempotent actions
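To illustrate why idempotent actions help when state is eventually consistent and a task may be retried, here is a small sketch using an in-memory stand-in for a cluster. `FakeCluster`, `add_instances`, and `ensure_size` are hypothetical names, not AWS or Nimble APIs.

```python
from dataclasses import dataclass


@dataclass
class FakeCluster:
    """In-memory stand-in for a cluster; not a real cloud resource."""
    size: int


def add_instances(cluster: FakeCluster, count: int) -> None:
    """Not idempotent: every retry adds more capacity."""
    cluster.size += count


def ensure_size(cluster: FakeCluster, desired: int) -> None:
    """Idempotent: repeating the call leaves the same end state."""
    if cluster.size != desired:
        cluster.size = desired


relative = FakeCluster(size=10)
add_instances(relative, 5)
add_instances(relative, 5)   # retry after an ambiguous response
print(relative.size)         # 20, overshoots the intended 15

absolute = FakeCluster(size=10)
ensure_size(absolute, 15)
ensure_size(absolute, 15)    # retry is harmless
print(absolute.size)         # 15
```

Expressing an action as a desired end state rather than a delta means a retry triggered by a stale read does no harm.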
Orphan Cleaner
• Terminate detached instances
• Safety features
  ◦ Terminate slowly
  ◦ Don’t terminate large volume of instances
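The two safety features above can be sketched as a volume cap plus batched termination with a pause between batches. This is a sketch under assumptions: the limits, the function name `clean_orphans`, and the injected `terminate` callable are hypothetical, and no real cloud API is called.

```python
import time
from typing import Callable, List, Sequence


def clean_orphans(orphans: Sequence[str],
                  terminate: Callable[[str], None],
                  max_orphans: int = 50,
                  batch_size: int = 5,
                  pause_seconds: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Terminate detached instances in small batches, refusing large volumes."""
    # Safety: an unusually large orphan list more likely signals a bug
    # or stale data than real garbage, so refuse to act on it.
    if len(orphans) > max_orphans:
        raise RuntimeError(
            f"refusing to terminate {len(orphans)} instances (limit {max_orphans})")
    terminated: List[str] = []
    for start in range(0, len(orphans), batch_size):
        if start > 0:
            sleep(pause_seconds)  # safety: terminate slowly
        for instance_id in orphans[start:start + batch_size]:
            terminate(instance_id)
            terminated.append(instance_id)
    return terminated


# Usage with a fake terminate function and no real sleeping.
killed: List[str] = []
done = clean_orphans([f"i-{n:04d}" for n in range(7)], killed.append,
                     batch_size=3, sleep=lambda seconds: None)
print(len(done))  # 7
```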
FAQs
• How often do you failover?
• Why not have dark clusters take traffic?
• How much did Nimble cost?
Suggestions
• Fallbacks, Fallbacks, Fallbacks
• Exercise it often
• Provide visibility
We’re hiring! @amjithr