How Netflix does Failovers in 7 minutes
Amjith
May 12, 2018
Transcript
Whoops, something went wrong…
Netflix Streaming Error
We’re having trouble playing this title right now. Please try again later or select a different title.
Cloud Prod Engineering
AMJITH RAMANUJAM (@amjithr), Sr. Software Engineer, Traffic Team
Regional Failover in 7 minutes
Regional Failover
• Failover: standby system takes over when the main system fails
Amazon Web Services: everything else
Netflix Open Connect: all video delivery
Regional Failover
Christmas Eve 2012
ELB Outage 7 Hours
Content watched: 16,000 years in 1 day
[Timeline: ~14,000 BC (first colonization of America), 0 AD, 2018]
4,600 years in 7 hours
Regional Failover
Active vs Passive
• Active - standby system is also serving traffic
• Passive - standby system is NOT serving traffic
Prerequisites
• Stateless services
• Regional replication of data
Failover Candidate
• Infrastructure problem isolated to one region
• Problem won’t follow if we move traffic
• Bad code deploy in a region
Regional Failover Process
• Detect the problem
• Scale the savior regions
• Shift traffic
Detect the problem
Is it working?
One metric to rule them all - Dumbledore
SPS: Stream Starts Per Second
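A minimal sketch of how a single health metric such as SPS could drive detection. The talk does not describe the actual detection logic, so everything here is an assumption for illustration: the function name `sps_is_unhealthy`, the idea of comparing observed SPS to an expected baseline, the 20% drop threshold, and the requirement of three consecutive bad samples.

```python
from typing import Sequence


def sps_is_unhealthy(
    observed: Sequence[float],
    expected: Sequence[float],
    max_drop: float = 0.2,       # hypothetical: tolerate up to a 20% dip
    min_bad_samples: int = 3,    # hypothetical: require 3 bad samples in a row
) -> bool:
    """Return True when the most recent samples all fall well below baseline."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if len(observed) < min_bad_samples:
        return False  # not enough data to decide
    recent = list(zip(observed, expected))[-min_bad_samples:]
    # A sample is "bad" when it is more than max_drop below its expected value.
    return all(obs < exp * (1.0 - max_drop) for obs, exp in recent)
```

Requiring several consecutive bad samples is one way to avoid reacting to a single noisy data point; the right window depends on how quickly a failover must be triggered.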
Scale Saviors
Scaling Pattern: Linear Regression
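The slide names linear regression as the scaling pattern but gives no details. The sketch below assumes the fitted relationship is instance count versus traffic for one cluster; the function names, the sample numbers, and the choice of an ordinary least-squares fit are illustrative assumptions, not the method from the talk.

```python
import math
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_instances(
    traffic_history: Sequence[float],
    instance_history: Sequence[float],
    target_traffic: float,
) -> int:
    """Estimate how many instances a cluster needs at target_traffic."""
    slope, intercept = fit_line(traffic_history, instance_history)
    # Round up: running one instance too many is safer than one too few.
    return max(0, math.ceil(slope * target_traffic + intercept))
```

For example, with a hypothetical history of 10, 20 and 30 instances at 1000, 2000 and 3000 requests per second, `predict_instances` would size the cluster for the extra traffic a savior region must absorb.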
Shift Traffic
Traffic Shift
• Proxy traffic
• Switch DNS
Regional Failover Process
• Detect the problem - 5 minutes
• Scale the savior regions - 35 minutes
• Shift traffic - 10 minutes
Total = 45 mins
Nimble Goals
• Fast failover (<10mins)
  ◦ Pre-scale
• Transparent to service owners
  ◦ No code changes for service owners
  ◦ No auto-scaling changes
[Diagram: us-west runs the API, lolomo, nccp and Playready clusters; Nimble adds a dark counterpart for each (nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready), whose instances are shown starting.]
[Diagram: Failover, shown in several animation frames over the same clusters and their nimble_ counterparts.]
Regional Failover Process
• Detect the problem - 2 minutes
• Scale the savior regions - 4 minutes
• Shift traffic - 3 minutes
Total = 7 mins
Architecture
Actions
• Periodic Tasks
• Triggered Tasks
Periodic Tasks
• Fetch historical data
• Predict cluster sizes
• Manage dark clusters
Triggered Tasks
• Ungate dark instances
• Transplant instances
• Traffic shift
Nanoservices
• Python
• Flask
• RQ - Redis
Characteristics
• Anticipate Failure
• Eventually Consistent
• Garbage Collect
Anticipate Failure
• Multi-region
• Rebuild state from scratch
• Fallbacks
Fallbacks
• AWS State - EDDA
• Historical data - ATLAS
• Local Cache - Redis, Filesystem
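One way to read the fallback list above is as an ordered chain: ask the preferred source first and fall back to a local cache when it fails. The sketch below shows that pattern under stated assumptions; `first_available` and the source names passed to it are hypothetical, and no real EDDA, ATLAS or Redis client calls are used.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def first_available(sources: Iterable[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (name, fetch) pair in order; return the first successful result.

    Returns the name of the source that answered, so callers can log or
    display when they are running on fallback data.
    """
    errors: List[str] = []
    for name, fetch in sources:
        try:
            return name, fetch()
        except Exception as exc:  # a real service would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))
```

A caller might pass a live lookup first, then a Redis-backed cache, then a file on disk, so that a failover can still proceed with slightly stale cluster sizes when an upstream dependency is down.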
Eventual Consistency
• AWS is eventually consistent
• Favor idempotent actions
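To illustrate why idempotent actions suit an eventually consistent system: a task may be retried because an earlier attempt's effect was not yet visible, so "set capacity to N" is safe to repeat while "add N instances" is not. The sketch below uses a plain dictionary as a stand-in for cluster state; `ensure_capacity` and `add_capacity` are hypothetical names, not an AWS or Netflix API.

```python
from typing import Dict


def ensure_capacity(desired: Dict[str, int], cluster: str, target: int) -> bool:
    """Idempotent: set the cluster's desired capacity to target.

    Returns True if the state changed, False if it was already at target.
    Calling it any number of times leaves the same final state.
    """
    if desired.get(cluster) == target:
        return False
    desired[cluster] = target
    return True


def add_capacity(desired: Dict[str, int], cluster: str, extra: int) -> None:
    """Not idempotent: every retry adds more instances. Shown for contrast."""
    desired[cluster] = desired.get(cluster, 0) + extra
```

With `ensure_capacity`, a retried task converges on the intended size; with `add_capacity`, a duplicate delivery of the same task would over-provision the cluster.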
Orphan Cleaner
• Terminate detached instances
• Safety features
  ◦ Terminate slowly
  ◦ Don’t terminate large volume of instances
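The two safety features above can be sketched as a rate-limited cleaner with a volume guard. This is an illustration under stated assumptions, not the Orphan Cleaner's actual code: the function name, the batch size, the pause and the `max_orphans` limit are all hypothetical, and `terminate` is a caller-supplied callable rather than a real AWS call.

```python
import time
from typing import Callable, List, Sequence


def clean_orphans(
    orphans: Sequence[str],
    terminate: Callable[[str], None],
    max_orphans: int = 50,          # hypothetical volume guard
    batch_size: int = 5,            # hypothetical batch size
    pause_seconds: float = 30.0,    # hypothetical pause between batches
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Terminate detached instances slowly; refuse suspiciously large batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if len(orphans) > max_orphans:
        # Assumption: a very large orphan list more likely signals bad input
        # than real orphans, so terminate nothing and surface the problem.
        raise RuntimeError(
            f"refusing to terminate {len(orphans)} instances (limit {max_orphans})"
        )
    terminated: List[str] = []
    for start in range(0, len(orphans), batch_size):
        if start:
            sleep(pause_seconds)  # terminate slowly: pause between batches
        for instance_id in orphans[start:start + batch_size]:
            terminate(instance_id)
            terminated.append(instance_id)
    return terminated
```

Injecting `sleep` keeps the sketch testable without real delays, and the pause between batches leaves time for someone to notice and stop a bad run.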
FAQs
• How often do you failover?
• Why not have dark clusters take traffic?
• How much did Nimble cost?
Suggestions
• Fallbacks, Fallbacks, Fallbacks
• Exercise it often
• Provide visibility
We’re hiring! @amjithr