Slide 1

Slide 1 text

Black Friday: Lessons in Resiliency and Incident Response at Shopify
John Arthorne @jarthorne
Shopify Production Engineering

Slide 2

Slide 2 text

Failure != Impact
Failures will always happen
Embrace failures as a way to learn about unknowns in your system
[Diagram labels: Resiliency, Failures]

Slide 3

Slide 3 text

Principle #1: If it can break, it will break

Slide 4

Slide 4 text

Black Friday / Cyber Monday
In retail, the holiday period is the make-or-break moment. If all goes well, the red ink of loss turns into the black of profit. Shopify provides the platform for over half a million retailers. Half a million businesses rely on us to help them turn a profit for the year.

Slide 5

Slide 5 text

Best laid plans...
● All high risk work early in the year (“run towards the risk”)
● Months of preparation, including gameday drills, capacity build out, heavy load testing
● Backup plans for backup plans
● Code and infrastructure freeze

Slide 6

Slide 6 text

SO IT BEGINS

Slide 7

Slide 7 text

Thursday November 23, 8:52 pm
Low power alarm for one rack in one data center
Likely cause: failure of one PDU (fancy power bar for servers)
Panic level:
Decision: assessment

Slide 8

Slide 8 text

Principle #2: If it is important, have more than one

Slide 9

Slide 9 text

Redundancy: Stateful Things
[Diagram: MySQL with writer, reader, and standby; Redis with master, failover, and standby]

Slide 10

Slide 10 text

Reducing blast radius
● The shop database is broken into shards
● Most other persistent state is also shard-local; a shard plus its local state is what we call a “pod”
● Even a catastrophic failure should have a blast radius of a single pod (see the sketch below)
[Diagram: web and job tiers routing to Pod 1 and Pod 2]
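The pod model can be pictured with a minimal sketch. The names here (`Pod`, `PODS`, `SHOP_TO_POD`, `pod_for_shop`) and the mapping are invented for illustration, not Shopify's actual routing code:

```python
# Hypothetical sketch of pod-local routing: every shop maps to exactly one
# pod, and all of its persistent state (MySQL shard, Redis, job queues)
# stays inside that pod, so a failure in one pod cannot spread to others.

from dataclasses import dataclass

@dataclass
class Pod:
    id: int
    mysql_dsn: str   # shard-local MySQL writer
    redis_url: str   # shard-local Redis

PODS = {
    36: Pod(36, "mysql://pod36-writer.east/shopify", "redis://pod36-redis.east:6379"),
    39: Pod(39, "mysql://pod39-writer.east/shopify", "redis://pod39-redis.east:6379"),
}

# Shop -> pod assignment; in practice this would be a persisted lookup table.
SHOP_TO_POD = {12345: 36, 67890: 39}

def pod_for_shop(shop_id: int) -> Pod:
    """Route all reads and writes for a shop to its own pod."""
    return PODS[SHOP_TO_POD[shop_id]]
```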

Slide 11

Slide 11 text

Thursday November 23, 10:16 pm
All servers still running
Four chassis affected:
▪ 11 web hosts
▪ 4 job hosts
▪ Pod 36 MySQL standby
▪ Pod 39 Redis failover
Panic level:
Decision: leave it alone

Slide 12

Slide 12 text

Thursday November 23, 11:02 pm
Second PDU fails, four chassis down
One pod without redundant Redis
One pod without standby MySQL
Panic level:
Decision: start incident response

Slide 13

Slide 13 text

Thursday November 23, 11:39 pm
“If it is important, have more than one”
Every pod has a shadow copy in another data center (a simplified sketch follows)
Decision: fail over two pods
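A pod failover can be thought of as flipping the active data center for that pod once its shadow copy is promoted. This is a simplified sketch with invented names (`ACTIVE_DC`, `failover_pod`); the real procedure also promotes the shadow MySQL/Redis copies and shifts traffic before routing changes:

```python
# Simplified sketch of failing a pod over to its shadow data center.

ACTIVE_DC = {0: "east", 36: "east", 39: "east"}   # pod id -> active data center

def failover_pod(pod_id: int, target_dc: str) -> None:
    if ACTIVE_DC[pod_id] == target_dc:
        return                        # already active in the target DC
    # Promote the shadow copy in target_dc, then point routing at it.
    ACTIVE_DC[pod_id] = target_dc

# The two pods behind the failing PDUs were failed over to the other DC:
failover_pod(36, "west")
failover_pod(39, "west")
```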

Slide 14

Slide 14 text

Thursday November 23, 11:53 pm
Failover complete
[Diagram: MySQL writer/reader/standby and Redis master/failover/standby topologies spanning DC 1 and DC 2 after failover]

Slide 15

Slide 15 text


Slide 16

Slide 16 text

Principle #3: Redundancy at multiple levels of abstraction

Slide 17

Slide 17 text

Friday November 24, 12:00 am - 4:00 pm
Top indicators that things are going well:
3. The worst problem is throttled emails
2. Staring at dashboards showing minor signs of trouble

Slide 18

Slide 18 text

1. War room chicken cam

Slide 19

Slide 19 text

Friday November 24, 4:08 pm
Out-of-memory error for Pod 0 Redis
▪ Failover Redis also out of memory
▪ Standby Redis available
▪ Shadow pod in other DC is fine
▪ Blast radius only one pod
Panic level:
Decision: fail over Pod 0 to West

Slide 20

Slide 20 text

Friday November 24, 4:17 pm
▪ Pod 0 recovering in West
▪ Job queues growing in East
▪ Theory: workers still stuck on connections to the bad Redis
Panic level:
Decision: restart workers in East

Slide 21

Slide 21 text

Friday November 24, 4:25 pm
All job workers in East enter a crash loop on startup
Panic level:
Decision: prepare complete evacuation to West; fail over the bad Pod 0 Redis

Slide 22

Slide 22 text

Friday November 24, 4:30 pm
Standby Redis active (one of many Redises)
Worker crash loops stopped
Jobs still not running
Panic level:

Slide 23

Slide 23 text

Principle #4: Degraded is better than down

Slide 24

Slide 24 text

Friday November 24, 4:37 pm
▪ Now at 10 minutes with checkouts not completing
▪ Jobs still not running, and no idea why

Slide 25

Slide 25 text

Friday November 24, 4:40 pm
▪ 100% recovery

Slide 26

Slide 26 text

Why did we run out of memory?

Slide 27

Slide 27 text

▪ Misconfigured API throttle
▪ Memory monitoring was at the host level rather than per Redis instance (a per-instance check is sketched below)
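A host-level memory graph can look healthy while a single Redis instance is already at its own maxmemory limit. This is a minimal sketch of per-instance monitoring using the standard Redis INFO command; the instance URLs and the 90% threshold are illustrative assumptions, not Shopify's actual alerting:

```python
# Per-instance Redis memory check: alert when any single instance is close
# to its own maxmemory limit, regardless of how the host looks overall.

import redis

INSTANCES = [
    "redis://pod0-redis-master.east:6379",
    "redis://pod0-redis-failover.east:6379",
]

def memory_alerts(threshold: float = 0.9) -> list[tuple[str, int, int]]:
    alerts = []
    for url in INSTANCES:
        info = redis.Redis.from_url(url).info("memory")
        limit = info.get("maxmemory") or 0
        if limit and info["used_memory"] / limit > threshold:
            alerts.append((url, info["used_memory"], limit))
    return alerts
```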

Slide 28

Slide 28 text

Why did all workers in the data center crash?

Slide 29

Slide 29 text

What should happen...
▪ If Redis fails its health check, automatically fail over to the secondary Redis
▪ If Redis is completely down, it should only affect one pod
▪ We have tests that simulate all kinds of Redis connection failures to ensure the system behaves

Slide 30

Slide 30 text

What actually happened
▪ The health check does a read against Redis
▪ Reads don’t fail when Redis is OOM
▪ When a worker starts, it registers itself with each Redis instance, which requires a write
▪ Writes fail due to OOM, so the worker fails during startup (see the sketch below)
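The gap is that a read-only health check keeps passing when Redis is out of memory, because Redis only rejects writes in that state ("OOM command not allowed"). A minimal sketch of a check that exercises both paths (the key name and client setup are assumptions, not the actual health check code):

```python
# Health check covering both the read and the write path against Redis.

import redis

def redis_healthy(client: redis.Redis) -> bool:
    try:
        client.get("healthcheck:probe")               # read path: OK even under OOM
        client.set("healthcheck:probe", "1", ex=60)   # write path: fails under OOM
        return True
    except redis.RedisError:
        return False
```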

Slide 31

Slide 31 text

Why was there a recovery delay?

Slide 32

Slide 32 text

Inefficient failure recovery code
▪ On start, workers remove reservations for any worker with no heartbeat in 5 minutes
▪ This was O(n²) in the failed worker count
▪ We had never recovered from such a massive simultaneous failure before

Redis circuit breaker overload
▪ We use the circuit breaker pattern to prevent too many concurrent accesses to Redis (sketched below)
▪ Because of the heavy load, many circuits tripped and went through backoff
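The circuit breaker behaviour can be sketched generically (this is not Shopify's actual implementation or configuration): after a run of consecutive failures the circuit opens and callers fail fast for a cooldown period instead of piling more load onto a struggling Redis.

```python
# Generic circuit breaker: trip after repeated failures, back off, then retry.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: backing off")   # fail fast
            self.opened_at = None            # cooldown elapsed: allow a retry
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the circuit
            raise
        self.failures = 0                    # success resets the failure count
        return result
```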

Slide 33

Slide 33 text

Summary
1. If it can break, it will break
2. If it is important, have more than one
3. Redundancy at multiple levels of abstraction
4. Degraded is better than down

We had many failures with almost no impact on shoppers. The worst impact was a 10-minute delay in order notification emails, but no sales were lost. By embracing and learning from failures we can build systems that still fail, but with minimal impact.

Slide 34

Slide 34 text

Thanks!
John Arthorne @jarthorne
github.com/jarthorn
Shopify Production Engineering
DevOps Days Toronto, May 31, 2018
Images courtesy of burst.shopify.com