Black Friday: Lessons in Resiliency and Incident Response at Shopify
Talk from DevOpsDays Toronto 2018 on building resilient systems that respond well to failure, and the story of two particular incidents and the lessons learned from them.
period is the make or break moment. It all goes well, the red ink of loss turns into the black of profit. Shopify provides the platform for over half a million retailers. Half a million businesses rely on us to help them turn a profit for the year.
the year (“run towards the risk”) • Months of preparation including gameday drills, capacity build out, heavy load testing • Backup plans for backup plans • Code and infrastructure freeze 5
shards • Most other persistent state is also shard-local, which we call “pods”. • Even a catastrophic failure should have a blast radius of a single pod Pod 1 Pod 2 Web Jobs
pod 0 Redis ▪ Failover Redis also out of memory ▪ Standby Redis available ▪ Shadow pod in other DC is fine ▪ Blast radius only one pod Panic level: Decision: Failover pod 0 to West
automatically failover to secondary Redis ▪ If Redis is completely down, it should only affect one pod ▪ We have tests that simulate all kinds of Redis connection failure to ensure system behaves
against Redis ▪ Reads don’t fail when Redis is OOM ▪ When a worker starts, it registers itself with each Redis instance, which requires a write ▪ Writes fail due to OOM, so worker fails during startup
On start, workers remove reservations for any worker with no heartbeat in 5 minutes ▪ This was O(n2) on failed worker count ▪ Had never recovered from such as massive simultaneous failure before ▪ Use circuit breaker pattern to prevent too many concurrent accesses to Redis ▪ Because of the heavy load many circuits tripped and went through backoff
If it is important, have more than one 3. Redundancy at multiple levels of abstraction 4. Degraded is better than down 33 We had many failures with almost no impact for shoppers. The worst impact was a 10 minute delay in order notification emails, but no sales were lost. By embracing and learning from failures we can make systems that still fail, but with minimal impact.