Black Friday: Lessons in Resiliency and Incident Response at Shopify

Talk from DevOpsDays Toronto 2018 on building resilient systems that respond well to failure, and the story of two particular incidents and the lessons learned from them.

John Arthorne

May 31, 2018

Transcript

  1. Black Friday: Lessons in Resiliency and Incident Response at Shopify

     John Arthorne @jarthorne, Shopify Production Engineering
  2. Failure != Impact

     Failures will always happen. Embrace failures as a way to learn about unknowns in your system.
  3. Black Friday / Cyber Monday

     In retail, the holiday period is the make or break moment. If all goes well, the red ink of loss turns into the black of profit. Shopify provides the platform for over half a million retailers. Half a million businesses rely on us to help them turn a profit for the year.
  4. Best laid plans...

     • All high risk work early in the year ("run towards the risk")
     • Months of preparation including gameday drills, capacity build out, heavy load testing
     • Backup plans for backup plans
     • Code and infrastructure freeze
  5. Thursday November 23, 8:52 pm

     Low power alarm for one rack in one data center. Likely cause: failure of one PDU (fancy power bar for servers).
     Panic level:
     Decision: assessment
  6. Reducing blast radius

     • Shop database is broken into shards
     • Most other persistent state is also shard-local, which we call "pods"
     • Even a catastrophic failure should have a blast radius of a single pod
     (Diagram labels: Pod 1, Pod 2, Web, Jobs)
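
     As an illustration of the pod model on this slide, here is a minimal sketch (hypothetical pod count and hostnames, not Shopify's actual routing code) of mapping a shop to its pod-local datastores, which is what keeps a datastore failure contained to a single pod:

        # Illustrative sketch only: route a shop to its pod so that a datastore
        # failure affects just the shops homed on that pod.
        POD_COUNT = 100  # hypothetical number of pods

        # Hypothetical per-pod connections; each pod has its own MySQL shard
        # and its own Redis, so persistent state never crosses pod boundaries.
        PODS = {
            pod_id: {
                "mysql_dsn": f"mysql://pod-{pod_id}-writer.internal/shopify",
                "redis_url": f"redis://pod-{pod_id}-redis.internal:6379/0",
            }
            for pod_id in range(POD_COUNT)
        }

        def pod_for_shop(shop_id: int) -> int:
            """Pick the pod that owns this shop's data (simple hash routing)."""
            return shop_id % POD_COUNT

        def datastores_for_shop(shop_id: int) -> dict:
            """Return only this shop's pod-local datastores."""
            return PODS[pod_for_shop(shop_id)]

        # A catastrophic failure of pod 36's hardware only impacts shops where
        # pod_for_shop(shop_id) == 36; all other pods keep serving traffic.
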
  7. Thursday November 23, 10:16 pm

     All servers still running. Four chassis affected:
     ▪ 11 web hosts
     ▪ 4 job hosts
     ▪ Pod 36 MySQL standby
     ▪ Pod 39 Redis failover
     Panic level:
     Decision: leave it alone
  8. Thursday November 23, 11:02 pm

     Second PDU fails, four chassis down. One pod without redundant Redis. One pod without standby MySQL.
     Panic level:
     Decision: start incident response
  9. Thursday November 23, 11:39 pm

     "If it is important, have more than one." Every pod has a shadow copy in another data center.
     Decision: fail over two pods
  10. Thursday November 23, 11:53 pm

     Failover complete.
     (Diagram: MySQL and Redis writer, reader, and standby roles across DC 1 and DC 2)
  11. (image-only slide)

  12. Friday November 24, 12:00 am - 4:00 pm

     Top indicators that things are going well:
     3. The worst problem is throttled emails
     2. Staring at dashboards showing minor signs of trouble
  13. Friday November 24, 4:08 pm

     Out of memory error for pod 0 Redis.
     ▪ Failover Redis also out of memory
     ▪ Standby Redis available
     ▪ Shadow pod in other DC is fine
     ▪ Blast radius only one pod
     Panic level:
     Decision: fail over pod 0 to West
  14. Friday November 24, 4:17 pm

     ▪ Pod 0 recovering in West
     ▪ Job queues growing in East
     ▪ Theory: workers still stuck on connections to bad Redis
     Panic level:
     Decision: restart workers in East
  15. Friday November 24, 4:25 pm

     All job workers in East enter crash loop on startup.
     Panic level:
     Decision: prepare complete evac to West, fail over bad pod 0 Redis
  16. Friday November 24, 4:30 pm

     Standby Redis active (one of many Redises). Worker crash loops stopped. Jobs still not running.
     Panic level:
  17. Friday November 24, 4:37 pm

     ▪ Now at 10 minutes with checkouts not completing
     ▪ Jobs still not running and no idea why
  18. What should happen...

     ▪ If Redis fails its health check, automatically fail over to the secondary Redis
     ▪ If Redis is completely down, it should only affect one pod
     ▪ We have tests that simulate all kinds of Redis connection failure to ensure the system behaves
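
     A minimal sketch of that intended behaviour, assuming a redis-py client and a hypothetical promote_standby hook (illustrative only, not Shopify's failover code):

        import time

        import redis

        def is_healthy(client: redis.Redis) -> bool:
            """Treat the primary as healthy if a simple probe succeeds."""
            try:
                client.ping()
                return True
            except redis.RedisError:
                return False

        def monitor(primary: redis.Redis, promote_standby) -> None:
            """Fail over automatically so a dead Redis costs only one pod, briefly."""
            while True:
                if not is_healthy(primary):
                    promote_standby()  # hypothetical hook that flips the pod to its standby Redis
                    return
                time.sleep(5)  # assumed check interval
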
  19. What actually happened

     ▪ Health check does a read against Redis
     ▪ Reads don't fail when Redis is OOM
     ▪ When a worker starts, it registers itself with each Redis instance, which requires a write
     ▪ Writes fail due to OOM, so the worker fails during startup
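
     A minimal sketch of that gap, assuming a pod-local Redis that has hit maxmemory with a non-evicting policy (illustrative code and key names, not Shopify's): the read-based health check still passes, while the registration write is rejected with an OOM error:

        import redis

        r = redis.Redis(host="localhost", port=6379)  # stand-in for a pod-local Redis

        def read_based_health_check() -> bool:
            """Passes even when Redis is out of memory, because a GET needs no new memory."""
            try:
                r.get("healthcheck")
                return True
            except redis.RedisError:
                return False

        def register_worker(worker_id: str) -> None:
            """Worker start-up registration needs a write, which fails when Redis is OOM."""
            # Raises redis.exceptions.ResponseError, e.g.
            # "OOM command not allowed when used memory > 'maxmemory'".
            r.hset("workers", worker_id, "alive")

     So the health check kept reporting a healthy Redis while every restarting worker died on its first write.
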
  20. Inefficient failure recovery code / Redis circuit breaker overload

     ▪ On start, workers remove reservations for any worker with no heartbeat in 5 minutes
     ▪ This was O(n²) on failed worker count
     ▪ Had never recovered from such a massive simultaneous failure before
     ▪ Circuit breaker pattern is used to prevent too many concurrent accesses to Redis
     ▪ Because of the heavy load, many circuits tripped and went through backoff
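
     A minimal circuit breaker sketch along the lines described here (thresholds and names are assumptions, not Shopify's implementation): after repeated Redis errors the circuit opens and callers back off, which protects Redis under heavy load but also means that many tripped circuits must wait out their backoff before jobs flow again:

        import time

        class CircuitBreaker:
            """Stop hammering a struggling dependency after repeated failures."""

            def __init__(self, failure_threshold: int = 5, backoff_seconds: float = 30.0):
                self.failure_threshold = failure_threshold
                self.backoff_seconds = backoff_seconds
                self.failures = 0
                self.opened_at = None  # None means the circuit is closed

            def call(self, operation):
                # While open, refuse calls until the backoff window has elapsed.
                if self.opened_at is not None:
                    if time.monotonic() - self.opened_at < self.backoff_seconds:
                        raise RuntimeError("circuit open: backing off from Redis")
                    self.opened_at = None  # half-open: allow one trial call

                try:
                    result = operation()
                except Exception:
                    self.failures += 1
                    if self.failures >= self.failure_threshold:
                        self.opened_at = time.monotonic()  # trip the circuit
                    raise
                else:
                    self.failures = 0  # success closes the circuit again
                    return result
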
  21. Summary

     1. If it can break, it will break
     2. If it is important, have more than one
     3. Redundancy at multiple levels of abstraction
     4. Degraded is better than down

     We had many failures with almost no impact for shoppers. The worst impact was a 10 minute delay in order notification emails, but no sales were lost. By embracing and learning from failures we can make systems that still fail, but with minimal impact.