Black Friday: Lessons in Resiliency and Incident Response at Shopify

Talk from DevOpsDays Toronto 2018 on building resilient systems that respond well to failure, and the story of two particular incidents and the lessons learned from them.

John Arthorne

May 31, 2018

Transcript

  1. Black Friday: Lessons in Resiliency and Incident Response at Shopify. John Arthorne (@jarthorne), Shopify Production Engineering
  2. Failure != Impact. Failures will always happen. Embrace failures as a way to learn about unknowns in your system. (Diagram labels: Failures, Resiliency)
  3. Black Friday / Cyber Monday. In retail, the holiday period is the make or break moment. If it all goes well, the red ink of loss turns into the black of profit. Shopify provides the platform for over half a million retailers. Half a million businesses rely on us to help them turn a profit for the year.
  4. Best laid plans... • All high risk work early in the year (“run towards the risk”) • Months of preparation including gameday drills, capacity build out, heavy load testing • Backup plans for backup plans • Code and infrastructure freeze
  5. Thursday November 23, 8:52 pm. Low power alarm for one rack in one data center. Likely cause: failure of one PDU (fancy power bar for servers). Panic level: Decision: assessment
  6. Reducing blast radius • Shop database is broken into shards • Most other persistent state is also shard-local, which we call “pods” • Even a catastrophic failure should have a blast radius of a single pod. (Diagram labels: Web, Jobs, Pod 1, Pod 2)
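The pod model on this slide is the core containment idea. Below is a minimal sketch in Python, assuming a simple shop-to-pod mapping; the names (PodResources, PODS, pod_for_shop) and the two-pod layout are hypothetical, not Shopify's actual code. Every shop maps to exactly one pod, and each pod carries its own MySQL shard, Redis, and job queue, so a catastrophic failure stays inside that pod.

```python
# Minimal sketch of pod-scoped routing, assuming a simple "shop -> pod" mapping.
# Names and layout are illustrative; Shopify's real implementation differs.
from dataclasses import dataclass


@dataclass
class PodResources:
    """Everything a pod needs to serve its shops: one MySQL shard, one Redis, one job queue."""
    mysql_dsn: str
    redis_url: str
    job_queue: str


PODS = {
    1: PodResources("mysql://pod1-primary/shopify", "redis://pod1-redis", "jobs-pod-1"),
    2: PodResources("mysql://pod2-primary/shopify", "redis://pod2-redis", "jobs-pod-2"),
}


def pod_for_shop(shop_id: int) -> int:
    """Map a shop to its pod (a static lookup in practice; hashing here for brevity)."""
    return (shop_id % len(PODS)) + 1


def resources_for_shop(shop_id: int) -> PodResources:
    # Web and job workers only ever touch their shop's own pod, so even a
    # catastrophic failure is contained to the shops on that pod.
    return PODS[pod_for_shop(shop_id)]
```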
  7. Thursday November 23, 10:16 pm. All servers still running. Four chassis affected: ▪ 11 web hosts ▪ 4 job hosts ▪ pod 36 MySQL standby ▪ pod 39 Redis failover. Panic level: Decision: leave it alone
  8. Thursday November 23, 11:02 pm. Second PDU fails, four chassis down. One pod without redundant Redis. One pod without standby MySQL. Panic level: Decision: Start incident response
  9. Thursday November 23, 11:39 pm. “If it is important, have more than one.” Every pod has a shadow copy in another data center. Decision: Failover two pods
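A rough sketch of the shadow-copy idea in Python: each pod has an active copy in one data center and a passive shadow in the other, and failover simply swaps which copy is active. The data layout and failover function are illustrative assumptions, not Shopify's actual failover tooling.

```python
# Minimal simulation of the "shadow pod" idea: each pod has an active copy in one
# data center and a passive shadow in the other; failover swaps which copy is active.
# The data layout and function are illustrative assumptions, not Shopify's tooling.

pods = {
    36: {"active_dc": "east", "shadow_dc": "west"},
    39: {"active_dc": "east", "shadow_dc": "west"},
}


def failover(pod_id: int) -> str:
    """Promote the shadow copy: the shadow DC becomes active, and the old active
    copy becomes the new shadow once it is healthy again."""
    pod = pods[pod_id]
    pod["active_dc"], pod["shadow_dc"] = pod["shadow_dc"], pod["active_dc"]
    return pod["active_dc"]


# Thursday night's decision in miniature: fail over the two affected pods.
for pod_id in (36, 39):
    print(f"pod {pod_id} now served from {failover(pod_id)}")
```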
  10. Thursday November 23, 11:53 pm. Failover complete. (Diagram: MySQL writer/reader/standby and Redis master/failover/standby roles across DC 1 and DC 2)
  11. (image-only slide)

  12. Friday November 24, 12:00 am - 4:00 pm. Top indicators that things are going well: 3. The worst problem is throttled emails. 2. Staring at dashboards showing minor signs of trouble.
  13. Friday November 24, 4:08 pm. Out of memory error for pod 0 Redis ▪ Failover Redis also out of memory ▪ Standby Redis available ▪ Shadow pod in other DC is fine ▪ Blast radius only one pod. Panic level: Decision: Failover pod 0 to West
  14. Friday November 24, 4:17 pm. ▪ Pod 0 recovering in West ▪ Job queues growing in East ▪ Theory: workers still stuck on connections to bad Redis. Panic level: Decision: Restart workers in East
  15. Friday November 24, 4:25 pm. All job workers in East enter a crash loop on startup. Panic level: Decision: Prepare complete evac to West, failover bad pod 0 Redis
  16. Friday November 24, 4:30 pm. Standby Redis active (one of many Redises). Worker crash loops stopped. Jobs still not running. Panic level:
  17. Friday November 24, 4:37 pm. ▪ Now at 10 minutes with checkouts not completing ▪ Jobs still not running and no idea why
  18. What should happen... ▪ If Redis fails its health check, automatically fail over to the secondary Redis ▪ If Redis is completely down, it should only affect one pod ▪ We have tests that simulate all kinds of Redis connection failures to ensure the system behaves
  19. What actually happened ▪ Health check does a read against Redis ▪ Reads don’t fail when Redis is OOM ▪ When a worker starts, it registers itself with each Redis instance, which requires a write ▪ Writes fail due to OOM, so the worker fails during startup
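The gap between “what should happen” and “what actually happened” comes down to which Redis command the health check exercises. Below is a hedged sketch in Python using the redis-py client; the probe key and both check functions are hypothetical, not Shopify's health check. A read-only probe keeps passing against an out-of-memory Redis, while a check that also writes hits the same OOM error that broke worker registration.

```python
# Hedged sketch of the health-check gap: a read-only probe stays green while an
# out-of-memory Redis rejects writes. Uses redis-py; the probe key is made up.
import redis


def read_only_health_check(client: redis.Redis) -> bool:
    # Roughly what the original check verified: Redis answers reads.
    # An OOM Redis (maxmemory reached, noeviction) still serves reads, so this passes.
    try:
        client.get("healthcheck:probe")
        return True
    except redis.RedisError:
        return False


def read_write_health_check(client: redis.Redis) -> bool:
    # Adding a tiny write probe exercises the same code path that worker
    # registration needs. When Redis is OOM, the SET fails ("OOM command not
    # allowed..."), so the check goes red and failover can happen before
    # workers crash-loop on startup.
    try:
        client.set("healthcheck:probe", "ok", ex=60)
        client.get("healthcheck:probe")
        return True
    except redis.RedisError:
        return False
```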
  20. Inefficient failure recovery code / Redis circuit breaker overload ▪ On start, workers remove reservations for any worker with no heartbeat in 5 minutes ▪ This was O(n²) in the failed worker count ▪ Had never recovered from such a massive simultaneous failure before ▪ Use circuit breaker pattern to prevent too many concurrent accesses to Redis ▪ Because of the heavy load many circuits tripped and went through backoff
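As a rough illustration of the circuit breaker mentioned on this slide, here is a minimal sketch in Python; the class, thresholds, and backoff behaviour are illustrative assumptions rather than Shopify's implementation. Each Redis access goes through the breaker, and once enough calls fail the circuit opens and callers are shed into backoff.

```python
# Minimal circuit breaker sketch around Redis calls. Thresholds, naming, and backoff
# behaviour are illustrative assumptions, not Shopify's implementation.
import time


class CircuitOpenError(Exception):
    """Raised while the circuit is open and calls are being shed."""


class CircuitBreaker:
    def __init__(self, error_threshold=3, reset_timeout=30.0):
        self.error_threshold = error_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout      # how long to back off once open
        self.failures = 0
        self.opened_at = None                   # monotonic timestamp when the circuit opened

    def call(self, func, *args, **kwargs):
        # While open, shed load instead of hammering the struggling Redis.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, backing off")
            # Half-open: allow one probe call through to see if the backend recovered.
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.error_threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0
        return result
```

Under the heavy load the slide describes, many breakers like this tripped at once and spent time in backoff, which is consistent with jobs staying stuck even after the standby Redis became active.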
  21. Summary: 1. If it can break, it will break 2. If it is important, have more than one 3. Redundancy at multiple levels of abstraction 4. Degraded is better than down. We had many failures with almost no impact for shoppers. The worst impact was a 10 minute delay in order notification emails, but no sales were lost. By embracing and learning from failures we can make systems that still fail, but with minimal impact.