How Netflix does Failovers in 7 minutes

Amjith

May 12, 2018

Transcript

  1. Whoops, something went wrong…
    Netflix Streaming Error
    We’re having trouble playing this title right now. Please try again later or select a different title.
  2. 16,000 years of content watched in 1 day
    • Timeline: ~14,000 BC (first colonization of America), 0 AD, 2018
    • 4,600 years in 7 hours
  3. Active vs Passive
    • Active - standby system is also serving traffic
    • Passive - standby system is NOT serving traffic
  4. Failover Candidate
    • Infrastructure problem isolated to one region
    • Bad code deploy in a region
    • Problem won’t follow if we move traffic
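The criteria on this slide can be read as a simple predicate: the problem is confined to one region and would not reappear once traffic is moved elsewhere. The sketch below is only an illustration of that reading. The `Incident` type, its fields, and the `kind` labels are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Incident:
    # Hypothetical description of an ongoing problem.
    kind: str                         # e.g. "infrastructure" or "bad_deploy" (illustrative labels)
    affected_regions: FrozenSet[str]  # regions currently showing the problem
    follows_traffic: bool             # True if the problem would reappear wherever traffic goes


def is_failover_candidate(incident: Incident) -> bool:
    """Return True when evacuating the region is expected to help."""
    isolated_to_one_region = len(incident.affected_regions) == 1
    return isolated_to_one_region and not incident.follows_traffic


# A bad deploy confined to one region is a candidate under this reading.
bad_deploy = Incident("bad_deploy", frozenset({"us-east-1"}), follows_traffic=False)
print(is_failover_candidate(bad_deploy))
```

Under this reading, a problem that travels with the traffic (for example, a request pattern that breaks every region it reaches) is not a candidate, because moving traffic would only move the failure.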
  5. Regional Failover Process
    • Detect the problem - 5 minutes
    • Scale the savior regions - 35 minutes
    • Shift traffic - 10 minutes
    • Total = 45 mins
  6. Nimble Goals
    • Fast failover (< 10 mins)
      ◦ Pre-scale
    • Transparent to service owners
      ◦ No code changes for service owners
      ◦ No auto-scaling changes
  7. Regional Failover Process
    • Detect the problem - 2 minutes
    • Scale the savior regions - 4 minutes
    • Shift traffic - 3 minutes
    • Total = 7 mins
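A small arithmetic check of the two Regional Failover Process slides (5 and 7). The stage durations are taken from the slides. The interpretation is an assumption: the stated totals (45 and 7 minutes) equal scale plus shift, which suggests detection time is counted separately. The names `BEFORE`, `NIMBLE`, `failover_minutes`, and `end_to_end_minutes` are made up for this sketch.

```python
from typing import Dict

# Stage durations in minutes, as listed on the slides.
BEFORE: Dict[str, int] = {"detect": 5, "scale": 35, "shift": 10}
NIMBLE: Dict[str, int] = {"detect": 2, "scale": 4, "shift": 3}


def failover_minutes(stages: Dict[str, int]) -> int:
    """Scale + shift only; this matches the totals shown on the slides."""
    return stages["scale"] + stages["shift"]


def end_to_end_minutes(stages: Dict[str, int]) -> int:
    """All three stages, including detection."""
    return sum(stages.values())


for name, stages in (("before", BEFORE), ("nimble", NIMBLE)):
    print(name, failover_minutes(stages), end_to_end_minutes(stages))
```

Either way of counting gives the same picture: the scaling stage shrinks from 35 minutes to 4, and it accounts for most of the improvement.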
  8. Orphan Cleaner
    • Terminate detached instances
    • Safety features
      ◦ Terminate slowly
      ◦ Don’t terminate large volume of instances
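The two safety features on this slide could be sketched as follows. This is a minimal illustration under stated assumptions, not Netflix's implementation: `clean_orphans`, its parameters (`max_total`, `batch_size`, `pause_seconds`), and the `terminate` callback are hypothetical names, and the default thresholds are made up.

```python
import time
from typing import Callable, Iterable, List


def clean_orphans(
    orphans: Iterable[str],
    terminate: Callable[[str], None],
    max_total: int = 50,          # assumed cap: refuse to act on a larger set
    batch_size: int = 5,          # assumed number of instances per batch
    pause_seconds: float = 30.0,  # assumed wait between batches
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Terminate detached instances slowly, refusing suspiciously large sets."""
    pending = list(orphans)

    # Safety feature: don't terminate a large volume of instances.
    # A large orphan set more likely signals a detection bug than real orphans.
    if len(pending) > max_total:
        raise RuntimeError(
            f"{len(pending)} orphans exceeds limit of {max_total}; refusing to terminate"
        )

    terminated: List[str] = []
    for start in range(0, len(pending), batch_size):
        for instance_id in pending[start:start + batch_size]:
            terminate(instance_id)
            terminated.append(instance_id)
        # Safety feature: terminate slowly by pausing between batches.
        if start + batch_size < len(pending):
            sleep(pause_seconds)
    return terminated
```

Passing `terminate` and `sleep` as callbacks keeps the sketch independent of any cloud SDK and makes it easy to exercise with fakes, for example `clean_orphans(ids, print, sleep=lambda s: None)`.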
  9. FAQs
    • How often do you failover?
    • Why not have dark clusters take traffic?
    • How much did Nimble cost?