Slide 1
Slide 1 text
No content
Slide 2
Slide 2 text
No content
Slide 3
Slide 3 text
Whoops, something went wrong…
Netflix Streaming Error
We’re having trouble playing this title right now. Please try again later or select a different title.
Slide 4
Slide 4 text
No content
Slide 5
Slide 5 text
No content
Slide 6
Slide 6 text
Cloud Prod Engineering
AMJITH RAMANUJAM (@amjithr)
Sr. Software Engineer, Traffic Team
Slide 7
Slide 7 text
Regional Failover in 7 minutes
AMJITH RAMANUJAM (@amjithr)
Sr. Software Engineer, Traffic Team
Slide 8
Slide 8 text
Regional Failover
Standby system takes over when the main system fails
Slide 9
Slide 9 text
No content
Slide 10
Slide 10 text
Amazon Web Services: Everything else
Netflix Open Connect: All video delivery
Slide 11
Slide 11 text
Regional Failover
Slide 12
Slide 12 text
Christmas Eve 2012
Slide 13
Slide 13 text
No content
Slide 14
Slide 14 text
No content
Slide 15
Slide 15 text
No content
Slide 16
Slide 16 text
No content
Slide 17
Slide 17 text
ELB Outage 7 Hours
Slide 18
Slide 18 text
Content watched
16,000 years in 1 day
4,600 years in 7 hours
Timeline: ~14,000 BC (first colonization of America), 0 AD, 2018
Slide 19
Slide 19 text
Regional Failover
Slide 20
Slide 20 text
Active vs Passive
Active - standby system is also serving traffic
Passive - standby system is NOT serving traffic
Slide 21
Slide 21 text
Prerequisites
Stateless services
Regional replication of data
Slide 22
Slide 22 text
Failover Candidate
Infrastructure problem isolated to one region
Problem won’t follow if we move traffic
Bad code deploy in a region
Slide 23
Slide 23 text
Regional Failover Process
Detect the problem
Scale the savior regions
Shift traffic
Slide 24
Slide 24 text
No content
Slide 25
Slide 25 text
Detect the problem
Slide 26
Slide 26 text
Is it working?
Slide 27
Slide 27 text
No content
Slide 28
Slide 28 text
One metric to rule them all - Dumbledore
Slide 29
Slide 29 text
SPS
Stream Starts Per Second
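One way to turn a single metric like SPS into a failure signal is to compare each observed value with the value expected for that time of day and alert only on a sustained drop. The sketch below is illustrative: the talk does not describe the actual detection logic, and the `SpsMonitor` class, the 0.8 ratio, and the 3-sample window are all assumptions.

```python
from collections import deque


class SpsMonitor:
    """Flags a region when SPS stays well below expectation.

    The 0.8 ratio and 3-sample window are illustrative assumptions,
    not values from the talk.
    """

    def __init__(self, min_ratio=0.8, window=3):
        self.min_ratio = min_ratio
        self.recent = deque(maxlen=window)

    def observe(self, actual_sps, expected_sps):
        """Record one sample; return True once a full window of samples is low."""
        low = actual_sps < self.min_ratio * expected_sps
        self.recent.append(low)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


monitor = SpsMonitor()
# Hypothetical (actual, expected) SPS samples.
samples = [(980, 1000), (700, 1000), (650, 1000), (600, 1000)]
alerts = [monitor.observe(actual, expected) for actual, expected in samples]
print(alerts)
```

Requiring a full window of low samples trades a little detection latency for fewer false alarms from a single noisy data point.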
Slide 30
Slide 30 text
No content
Slide 31
Slide 31 text
Scale Saviors
Slide 32
Slide 32 text
No content
Slide 33
Slide 33 text
No content
Slide 34
Slide 34 text
No content
Slide 35
Slide 35 text
Scaling Pattern
Linear Regression
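A linear regression over historical data can estimate how many instances a cluster needs for a given traffic level. The following is a minimal sketch of that idea, assuming cluster size is modeled as a linear function of requests per second; the function names and the sample data are hypothetical and not taken from the talk.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_instances(slope, intercept, traffic):
    """Round up so the cluster is not undersized for the predicted load."""
    return math.ceil(slope * traffic + intercept)


# Hypothetical history: requests per second and instances in service.
rps = [1000, 2000, 3000, 4000]
instances = [12, 22, 32, 42]

slope, intercept = fit_line(rps, instances)
# Size needed if this region also absorbs traffic from a failed region.
needed = predict_instances(slope, intercept, 6500)
print(needed)
```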
Slide 36
Slide 36 text
Shift Traffic
Slide 37
Slide 37 text
Traffic Shift
Proxy Traffic
Switch DNS
Slide 38
Slide 38 text
No content
Slide 39
Slide 39 text
Regional Failover Process
Detect the problem - 5 minutes
Scale the savior regions - 35 minutes
Shift traffic - 10 minutes
Total = 45 mins
Slide 40
Slide 40 text
Nimble Goals
● Fast failover (<10 mins)
  ○ Pre-scale
● Transparent to service owners
  ○ No code changes for service owners
  ○ No auto-scaling changes
Slide 41
Slide 41 text
us-west
API, lolomo, nccp, Playready
Slide 42
Slide 42 text
us-west
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready
Slide 43
Slide 43 text
us-west
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready
Slide 44
Slide 44 text
us-west
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready (starting)
Slide 45
Slide 45 text
Failover
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready
Slide 46
Slide 46 text
Failover
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready
Slide 47
Slide 47 text
Failover
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready
Slide 48
Slide 48 text
Failover
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready
Slide 49
Slide 49 text
Regional Failover Process
Detect the problem - 2 minutes
Scale the savior regions - 4 minutes
Shift traffic - 3 minutes
Total = 7 mins
Slide 50
Slide 50 text
Architecture
Slide 51
Slide 51 text
Actions
Periodic Tasks
Triggered Tasks
Slide 52
Slide 52 text
Periodic Tasks
Fetch historical data
Predict cluster sizes
Manage dark clusters
Slide 53
Slide 53 text
Triggered Tasks
Ungate dark instances
Transplant instances
Traffic shift
Slide 54
Slide 54 text
Nanoservices
Python
Flask
RQ - Redis
Slide 55
Slide 55 text
Characteristics
Anticipate Failure
Eventually Consistent
Garbage Collect
Slide 56
Slide 56 text
Anticipate Failure
Multi-region
Rebuild state from scratch
Fallbacks
Slide 57
Slide 57 text
Fallbacks
AWS State - EDDA
Historical data - ATLAS
Local Cache - Redis, Filesystem
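A fallback chain like the one listed above can be sketched as an ordered list of data sources tried in turn until one succeeds. This is only an illustration of the pattern: `first_available` and the three stand-in source functions are hypothetical, and the real services would be called through their own clients.

```python
def first_available(sources):
    """Return (name, value) from the first source that succeeds.

    `sources` is an ordered list of (name, zero-argument callable) pairs.
    """
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch()
        except Exception as exc:  # a real service would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Hypothetical stand-ins for a live API call, a snapshot service, and a local cache.
def live_api():
    raise ConnectionError("API throttled")


def snapshot():
    raise TimeoutError("snapshot service unreachable")


def local_cache():
    return {"api": 40, "lolomo": 25}


source, sizes = first_available(
    [("live", live_api), ("snapshot", snapshot), ("cache", local_cache)]
)
print(source, sizes)
```

Returning the source name alongside the value lets the caller log or display which fallback was actually used.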
Slide 58
Slide 58 text
Eventual Consistency
AWS is eventually consistent
Favor idempotent actions
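The preference for idempotent actions can be illustrated by contrasting a relative change with an absolute target. In this sketch, `FakeCluster` is a hypothetical in-memory stand-in for an auto-scaling group, and the retry is simulated by applying each action twice, as might happen when a response is lost and the caller tries again.

```python
class FakeCluster:
    """In-memory stand-in for an auto-scaling group (hypothetical)."""

    def __init__(self, desired):
        self.desired = desired


def scale_up_by(cluster, delta):
    """Not idempotent: a retry adds the capacity a second time."""
    cluster.desired += delta


def ensure_at_least(cluster, target):
    """Idempotent: repeating the call leaves the cluster in the same state."""
    cluster.desired = max(cluster.desired, target)


relative = FakeCluster(desired=10)
absolute = FakeCluster(desired=10)

# Simulate a retry: each action is applied twice.
for _ in range(2):
    scale_up_by(relative, 5)
    ensure_at_least(absolute, 15)

print(relative.desired, absolute.desired)
```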
Slide 59
Slide 59 text
Orphan Cleaner
● Terminate detached instances
● Safety features
  ○ Terminate slowly
  ○ Don’t terminate large volume of instances
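The two safety features above could be combined in a planning step that refuses to act when the orphan count looks implausibly large and otherwise splits the work into small batches. This is a sketch under assumptions: `plan_terminations`, the 5% cap, and the batch size of 2 are hypothetical, and a real cleaner would pause between batches and call the cloud provider's API to terminate.

```python
def plan_terminations(orphans, fleet_size, max_fraction=0.05, batch_size=2):
    """Return batches of instance IDs to terminate, or [] if it looks unsafe.

    max_fraction and batch_size are illustrative assumptions.
    """
    if len(orphans) > max_fraction * fleet_size:
        # A large orphan count more likely means bad input data than real orphans.
        return []
    return [orphans[i:i + batch_size] for i in range(0, len(orphans), batch_size)]


small = plan_terminations(["i-1", "i-2", "i-3"], fleet_size=100)
large = plan_terminations([f"i-{n}" for n in range(50)], fleet_size=100)
print(small, large)
```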
Slide 60
Slide 60 text
FAQs
How often do you failover?
Why not have dark clusters take traffic?
How much did Nimble cost?
Slide 61
Slide 61 text
Suggestions
Fallbacks, Fallbacks, Fallbacks
Exercise it often
Provide visibility
Slide 62
Slide 62 text
We’re hiring!
@amjithr