Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Whoops, something went wrong…
Netflix Streaming Error
We’re having trouble playing this title right now. Please try again later or select a different title.

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Cloud Prod Engineering
AMJITH RAMANUJAM (@amjithr)
Sr. Software Engineer, Traffic Team

Slide 7

Slide 7 text

Regional Failover in 7 minutes
AMJITH RAMANUJAM (@amjithr)
Sr. Software Engineer, Traffic Team

Slide 8

Slide 8 text

Regional Failover
Standby system takes over when the main system fails

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Netflix Open Connect: All video delivery
Amazon Web Services: Everything else

Slide 11

Slide 11 text

Regional Failover

Slide 12

Slide 12 text

Christmas Eve 2012

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

ELB Outage
7 Hours

Slide 18

Slide 18 text

Content watched: 16,000 years in 1 day
4,600 years in 7 hours
Timeline: ~14,000 BC (first colonization of America), 0 AD, 2018

Slide 19

Slide 19 text

Regional Failover

Slide 20

Slide 20 text

Active vs Passive
Active - standby system is also serving traffic
Passive - standby system is NOT serving traffic

Slide 21

Slide 21 text

Prerequisites
Stateless services
Regional replication of data

Slide 22

Slide 22 text

Failover Candidate
Problem won’t follow if we move traffic
Infrastructure problem isolated to one region
Bad code deploy in a region

Slide 23

Slide 23 text

Regional Failover Process
Detect the problem
Scale the savior regions
Shift traffic

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Detect the problem

Slide 26

Slide 26 text

Is it working?

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

One metric to rule them all - Dumbledore

Slide 29

Slide 29 text

SPS
Stream Starts Per Second
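A minimal sketch of how a single metric such as SPS could drive detection. The talk does not describe the actual detection logic, so the expected-value series, the `max_drop` threshold, and the `min_bad_samples` window are all hypothetical.

```python
from typing import Sequence


def sps_is_unhealthy(
    observed: Sequence[float],
    expected: Sequence[float],
    max_drop: float = 0.2,      # hypothetical: tolerate up to a 20% shortfall
    min_bad_samples: int = 3,   # hypothetical: require a sustained drop
) -> bool:
    """Return True if the most recent samples are all well below expectation."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if len(observed) < min_bad_samples:
        return False  # not enough data to decide
    recent = list(zip(observed, expected))[-min_bad_samples:]
    # Every recent sample must fall short, so a single noisy dip is ignored.
    return all(obs < exp * (1.0 - max_drop) for obs, exp in recent)
```

Requiring several consecutive bad samples trades a little detection time for fewer false alarms.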

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Scale Saviors

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Scaling Pattern
Linear Regression
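A sketch of sizing a cluster with linear regression, using ordinary least squares on past observations. The slide does not say which variables are regressed, so fitting cluster size against requests per second is an assumption, and `fit_line` and `predict_cluster_size` are hypothetical names.

```python
from math import ceil
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_cluster_size(
    rps_history: Sequence[float],   # assumed input: past requests per second
    size_history: Sequence[float],  # assumed input: instance counts at those times
    expected_rps: float,            # traffic the savior region should absorb
) -> int:
    """Predict the instance count for the expected traffic, never below 1."""
    slope, intercept = fit_line(rps_history, size_history)
    return max(1, ceil(slope * expected_rps + intercept))
```

Rounding up errs on the side of spare capacity, which is the safer direction during a failover.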

Slide 36

Slide 36 text

Shift Traffic

Slide 37

Slide 37 text

Traffic Shift
Proxy Traffic
Switch DNS

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

Regional Failover Process
Detect the problem - 5 minutes
Scale the savior regions - 35 minutes
Shift traffic - 10 minutes
Total = 45 mins

Slide 40

Slide 40 text

Nimble Goals
● Fast failover (< 10 mins)
  ○ Pre-scale
● Transparent to service owners
  ○ No code changes for service owners
  ○ No auto-scaling changes

Slide 41

Slide 41 text

us-west
API, lolomo, nccp, Playready

Slide 42

Slide 42 text

us-west
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready

Slide 43

Slide 43 text

us-west
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready

Slide 44

Slide 44 text

us-west
API, lolomo, nccp, Playready
nimble_API (starting), nimble_lolomo (starting), nimble_nccp (starting), nimble_Playready (starting)

Slide 45

Slide 45 text

Failover
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready

Slide 46

Slide 46 text

Failover
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready

Slide 47

Slide 47 text

Failover
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready

Slide 48

Slide 48 text

Failover
API, lolomo, nccp, Playready
nimble_API, nimble_lolomo, nimble_nccp, nimble_Playready

Slide 49

Slide 49 text

Regional Failover Process
Detect the problem - 2 minutes
Scale the savior regions - 4 minutes
Shift traffic - 3 minutes
Total = 7 mins

Slide 50

Slide 50 text

Architecture

Slide 51

Slide 51 text

Actions
Periodic Tasks
Triggered Tasks

Slide 52

Slide 52 text

Periodic Tasks
Fetch historical data
Predict cluster sizes
Manage dark clusters

Slide 53

Slide 53 text

Triggered Tasks
Ungate dark instances
Transplant instances
Traffic shift
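A toy, in-memory sketch of the "transplant instances" step. It assumes that transplanting means moving pre-started instances out of a dark `nimble_`-prefixed cluster into the matching serving cluster; the talk does not spell out the mechanism, and the dictionary model and the `transplant` function are hypothetical stand-ins for real cloud API calls.

```python
from typing import Dict, List


def transplant(
    clusters: Dict[str, List[str]],  # hypothetical model: cluster name -> instance IDs
    service: str,
    dark_prefix: str = "nimble_",    # naming taken from the slides
) -> int:
    """Move every instance from the dark cluster into the serving cluster.

    Returns the number of instances moved. Calling it again is harmless:
    the dark cluster is already empty, so nothing more is moved.
    """
    dark = clusters.get(dark_prefix + service, [])
    serving = clusters.setdefault(service, [])
    moved = 0
    while dark:
        instance = dark.pop()
        if instance not in serving:  # guard against double attachment
            serving.append(instance)
            moved += 1
    return moved
```

Because the instances are already running, the serving cluster grows without waiting for new instances to boot.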

Slide 54

Slide 54 text

Nanoservices
Python
Flask
RQ - Redis

Slide 55

Slide 55 text

Characteristics
Anticipate Failure
Eventually Consistent
Garbage Collect

Slide 56

Slide 56 text

Anticipate Failure
Multi-region
Rebuild state from scratch
Fallbacks

Slide 57

Slide 57 text

Fallbacks
AWS State - EDDA
Historical data - ATLAS
Local Cache - Redis, Filesystem
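A sketch of a fallback chain: try each data source in order and use the first one that answers. The source names and fetch callables in the usage example are placeholders, not the real EDDA, ATLAS, or Redis client APIs.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def first_available(sources: Iterable[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Return (source_name, value) from the first source that does not raise."""
    errors: List[str] = []
    for name, fetch in sources:
        try:
            return name, fetch()
        except Exception as exc:  # any failure means "try the next source"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Usage with placeholder fetchers (hypothetical names and data):
def fetch_from_primary() -> dict:
    raise ConnectionError("primary unreachable")


def fetch_from_local_cache() -> dict:
    return {"API": 120}


source, sizes = first_available([
    ("primary", fetch_from_primary),
    ("local-cache", fetch_from_local_cache),
])
```

Returning the source name alongside the value makes it easy to report when the system is running on stale or cached data.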

Slide 58

Slide 58 text

Eventual Consistency
AWS is eventually consistent
Favor idempotent actions
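A small illustration of why idempotent actions are safer when state may be stale or a request may be retried. The cluster is modeled as a plain dictionary and both functions are hypothetical.

```python
from typing import Dict


def set_desired_capacity(cluster: Dict[str, int], target: int) -> None:
    """Idempotent: repeating the call leaves the same end state."""
    cluster["desired"] = target


def add_capacity(cluster: Dict[str, int], delta: int) -> None:
    """Not idempotent: a retry after a lost response doubles the change."""
    cluster["desired"] += delta


safe = {"desired": 10}
set_desired_capacity(safe, 30)
set_desired_capacity(safe, 30)   # retry: still 30

risky = {"desired": 10}
add_capacity(risky, 20)
add_capacity(risky, 20)          # retry: now 50, not the intended 30
```

Expressing actions as "make the state equal to X" rather than "change the state by X" means a retry can never overshoot.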

Slide 59

Slide 59 text

Orphan Cleaner
● Terminate detached instances
● Safety features
  ○ Terminate slowly
  ○ Don’t terminate large volume of instances
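A sketch of the two safety features applied to one cleanup pass: terminate only a small batch per run, and refuse to act when the orphan count is suspiciously large. The `batch_size` and `max_orphans` limits and the function name are hypothetical.

```python
from typing import List, Sequence


def pick_orphans_to_terminate(
    orphans: Sequence[str],
    batch_size: int = 5,     # hypothetical: "terminate slowly"
    max_orphans: int = 50,   # hypothetical: ceiling before refusing to act
) -> List[str]:
    """Return the instance IDs to terminate in this pass (possibly none)."""
    if len(orphans) > max_orphans:
        # An unusually large orphan count more likely signals a bug or bad
        # data than real garbage, so do nothing and leave it for a human.
        return []
    return list(orphans[:batch_size])
```

Run periodically, this drains a normal backlog a few instances at a time while leaving any abnormal situation untouched.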

Slide 60

Slide 60 text

FAQs
How often do you failover?
Why not have dark clusters take traffic?
How much did Nimble cost?

Slide 61

Slide 61 text

Suggestions
Fallbacks, Fallbacks, Fallbacks
Exercise it often
Provide visibility

Slide 62

Slide 62 text

We’re hiring!
@amjithr