GOTO Copenhagen 2017: Shopify’s Architecture to Handle 80K RPS Sales

Video: https://www.youtube.com/watch?v=N8NWDHgWA28

What do you do when some of the most ubiquitous celebrity personalities launch products on your platform, driving tens of thousands of requests per second? You roll up your sleeves and architect for it. Over the past decade, Shopify's infrastructure has evolved to serve some of the largest online sales on the planet. In this talk, we dive into the multi-tenant architecture that lets us fail over between regions with zero downtime, move shops between shards, minimize the blast radius of catastrophes, and throttle traffic and serve cache hits directly from the load balancers. We'll walk through how this architecture served us beautifully in minimizing risk during our ongoing, gradual migration to the cloud.

Simon Hørup Eskildsen

October 03, 2017

Transcript

  1. Simon Eskildsen: Shopify’s Architecture to Handle 80K RPS Sales

  2. Shopify is handling some of the largest sales in the world, from Kylie Jenner, Kanye, the Super Bowl, and others
  3. “We learned to absorb these shocks and become stronger as a result. [..] The school of hard knocks has taught us well.” (Tobi Lütke, CEO, in an internal essay on why we optimize for flash sales)
  4. 500K merchants powered; $5.8B processed in Q2 2017; 80K peak RPS; 40+ daily deploys; Rails: Ruby on Rails since 2006; 2000+ employees
  5. Diagram labels: Traffic; Application and Data (Region A); Application and Data (Region B)

  6. Diagram labels: Traffic; Application and Data (Region A); Application and Data (Region B)

  7. Traffic: • Global Routing • OpenResty • Bots • Cache hits • Checkout Throttling
  8. Diagram labels: ISPs between Region A and Region B; Region A: BGP ANNOUNCE 23.227.38.0/24; Region B: BGP ANNOUNCE 23.227.38.0/24; walrusser.myshopify.com 23.227.38.64
  9. OpenResty allows Lua scripting of your load balancers; it’s been one of the most impactful additions to our stack in recent memory. https://github.com/openresty/openresty Diagram labels: Nginx with OpenResty, Rule Banner, Kafka Logging, Edgecache, Checkout Throttle
  10. worker_processes 1;
      error_log logs/error.log;
      events {
          worker_connections 1024;
      }
      http {
          server {
              listen 8080;
              location / {
                  default_type text/html;
                  content_by_lua '
                      ngx.say("<p>hello, world</p>")
                  ';
              }
          }
      }
  11. Bot Squasher analyzes the Kafka stream of incoming requests to ban bots with a rule banner module. Diagram labels: Nginx with OpenResty, Rule Banner, Kafka Logger, Kafka, Bot Squasher, POST /checkout, BAN 23.227.38.178
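
Slide 11 does not say how Bot Squasher decides whom to ban. Purely as an illustration, the sketch below bans a client IP once it exceeds a request budget inside a sliding time window. The threshold, the window length, and every name here (`BotSquasher`, `observe`) are assumptions for this example, and a plain list of events stands in for the Kafka stream.

```python
from collections import defaultdict, deque


class BotSquasher:
    """Hypothetical sliding-window rate check; not Shopify's actual logic."""

    def __init__(self, max_requests=3, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)  # ip -> timestamps inside the window
        self.banned = set()

    def observe(self, ip, timestamp):
        """Record one request; return True if the IP is banned."""
        if ip in self.banned:
            return True
        window = self.recent[ip]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_requests:
            self.banned.add(ip)  # a real system would publish a ban rule here
            del self.recent[ip]
            return True
        return False


# A list of (ip, timestamp) tuples stands in for the Kafka request stream.
events = [("23.227.38.178", t / 10) for t in range(5)] + [("10.0.0.1", 0.2)]
squasher = BotSquasher()
for ip, ts in events:
    squasher.observe(ip, ts)
banned_ips = sorted(squasher.banned)
```

In the talk's setup the ban would then be enforced at the load balancer by the rule banner module; here the `banned` set plays that role.
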
  12. Edgecache can serve full page cache hits out of the load-balancers in microseconds. Diagram labels: Nginx with OpenResty, Edgecache, Memcached, Web Process, GET /collections/walruses, HIT, MISS, FILL
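
The HIT, MISS, and FILL labels on slide 12 suggest a cache-aside flow. A minimal sketch of that flow follows, assuming a dictionary in place of Memcached and a hypothetical `render_page` callback in place of the web process; key design, expiry, and invalidation are left out.

```python
class EdgeCache:
    """Hypothetical full-page cache sitting in front of the web process."""

    def __init__(self, render_page):
        self.store = {}                  # stands in for Memcached
        self.render_page = render_page

    def get(self, path):
        """Return (status, body), where status is 'HIT' or 'MISS'."""
        if path in self.store:
            return "HIT", self.store[path]
        body = self.render_page(path)    # MISS: ask the web process
        self.store[path] = body          # FILL: cache it for the next request
        return "MISS", body


render_calls = []


def render_page(path):
    """Stand-in for the web process; records how often it is invoked."""
    render_calls.append(path)
    return f"<html>{path}</html>"


cache = EdgeCache(render_page)
first = cache.get("/collections/walruses")
second = cache.get("/collections/walruses")
```

The second request never reaches `render_page`, which is the property that lets cache hits be served from the load balancer alone.
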
  13. Checkout Throttle throttles the number of customers in the processing-heavy checkout path. Diagram labels: Nginx with OpenResty, Checkout Throttle, GET /checkout, Queue, /wait_area, /checkout, Throttle
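
One way to read slide 13 is as admission control: a fixed number of customers may be in checkout at once, and everyone else is parked in a wait area until a slot frees up. The sketch below assumes a simple FIFO queue and a hard capacity; `CheckoutThrottle`, its methods, and the customer names are hypothetical.

```python
from collections import deque


class CheckoutThrottle:
    """Hypothetical admission control for the checkout path."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = set()      # customers currently in checkout
        self.waiting = deque()   # customers parked in the wait area, FIFO

    def request_checkout(self, customer):
        """Return the path the customer should be sent to."""
        if customer in self.active:
            return "/checkout"
        if len(self.active) < self.capacity and not self.waiting:
            self.active.add(customer)
            return "/checkout"
        if customer not in self.waiting:
            self.waiting.append(customer)
        return "/wait_area"

    def finish_checkout(self, customer):
        """Free a slot and admit the longest-waiting customer, if any."""
        self.active.discard(customer)
        if self.waiting and len(self.active) < self.capacity:
            self.active.add(self.waiting.popleft())


throttle = CheckoutThrottle(capacity=2)
paths = [throttle.request_checkout(c) for c in ("alice", "bob", "carol")]
throttle.finish_checkout("alice")
carol_after = throttle.request_checkout("carol")
```

With a capacity of two, the third customer is sent to `/wait_area` and is admitted once the first one finishes.
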
  14. Diagram labels: Traffic; Application and Data (Region A); Application and Data (Region B)

  15. Pod is an isolated unit of one or more shops

  16. Data in Region A. Diagram labels: Pod 14, Pod 2, Pod 7; shops shop1, shop4, shop9, shop17, shop72, shop3, shop72, shop92, shop18, shop64, shop22, shop88, shop0, shop52, shop23
  17. Each Pod in Region A. Diagram labels: Pod 14, Pod 2, Pod 7, each with MySQL, Redis, Memcache, and Cron
  18. Diagram labels: Pod 14, Pod 2, Pod 7, each with MySQL, Redis, Memcache, and Cron; Shared Workers
  19. Diagram labels: Pod 14, Pod 2, Pod 7, each with MySQL, Redis, Memcache, and Cron; Shared Load Balancing
  20. Genghis is our load-testing tool to test scale

  21. Pod Balancer balances shops between pods with minimal downtime to keep load and size even
  22. Diagram labels: Pod Balancer; Pod 14, Pod 2, Pod 7; shops shop1, shop4, shop9, shop17, shop72, shop3, shop72, shop92, shop18, shop64, shop22, shop88, shop0, shop52, shop23
  23. Diagram labels: same as slide 22
  24. Diagram labels: same as slide 22, plus shop98
  25. Diagram labels: same as slide 22, plus shop98, shop99, shop100
  26. Diagram labels: same as slide 22, plus shop98, shop99, shop100, and Pod 74
  27. Diagram labels: same as slide 22, plus shop98, shop99, shop100, and Pod 74
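
The slides show shops and a new pod appearing but not how Pod Balancer picks what to move. As a sketch only, the function below uses a simple greedy rule: repeatedly move the largest shop that still narrows the gap from the most loaded pod to the least loaded one. The rule, the `plan_moves` name, and the load numbers are all assumptions for illustration.

```python
def plan_moves(pods):
    """Greedily plan shop moves from the heaviest to the lightest pod.

    `pods` maps pod name -> {shop name: load}. Returns a list of
    (shop, source_pod, target_pod) tuples; the input is not modified.
    """
    pods = {pod: dict(shops) for pod, shops in pods.items()}
    moves = []
    while True:
        load = {pod: sum(shops.values()) for pod, shops in pods.items()}
        heavy = max(load, key=load.get)
        light = min(load, key=load.get)
        gap = load[heavy] - load[light]
        # Moving a shop only helps if its load is smaller than the gap.
        candidates = [s for s, l in pods[heavy].items() if 0 < l < gap]
        if not candidates:
            return moves
        shop = max(candidates, key=pods[heavy].get)
        pods[light][shop] = pods[heavy].pop(shop)
        moves.append((shop, heavy, light))


# Made-up loads; only the pod and shop names echo the slides.
moves = plan_moves({
    "pod14": {"shop1": 50, "shop4": 30, "shop9": 20},
    "pod2": {"shop3": 10},
    "pod7": {"shop18": 40},
})
```

Each planned move would then be carried out by the shop-moving procedure the following slides describe.
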
  28. COPY SHOP from Source Pod 9 (MySQL, Redis) to Target Pod 23 (MySQL, Redis): SELECT * FROM products WHERE shop_id = 38493; SELECT * FROM orders WHERE shop_id = 38493
  29. COPY SHOP from Source Pod 9 to Target Pod 23 (same queries as slide 28), with a NEW CHECKOUT arriving: INSERT INTO CHECKOUTS …
  30. COPY SHOP_ID 238 from Source Pod 9 (MySQL, Redis) to Target Pod 23 (MySQL, Redis): SELECT * FROM products WHERE shop_id = 238; SELECT * FROM orders WHERE shop_id = 238. Bin Log: REPLICATE SHOP_ID 238, CHECKOUT id: 383293
  31. Source Pod 9 (MySQL, Redis), Target Pod 23 (MySQL, Redis): LOCK SHOP_ID 238; Routing: UPDATE SHOP_ID 238 pod_id=23
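
Slides 28 to 31 outline a shop move as copy, replicate, lock, and update routing. The sketch below walks through those four phases with in-memory lists standing in for each pod's MySQL. The `move_shop` function, the row layout, and the unlock at the end are assumptions for illustration; the slides do not show what happens after the routing update.

```python
def move_shop(shop_id, source, target, routing, target_pod, writes_during_copy):
    """Hypothetical shop move; each pod's database is a list of row dicts."""
    # 1. COPY: bulk-copy the shop's existing rows to the target pod.
    snapshot = [row for row in source if row["shop_id"] == shop_id]
    target.extend(dict(row) for row in snapshot)

    # 2. REPLICATE: writes that reach the source during the copy (picked up
    #    from the bin log on slide 30) are replayed on the target.
    for row in writes_during_copy:
        source.append(row)
        if row["shop_id"] == shop_id:
            target.append(dict(row))

    # 3. LOCK: briefly block the shop so source and target stay identical.
    routing[shop_id]["locked"] = True
    # 4. UPDATE routing to the target pod, then unlock (assumed step).
    routing[shop_id]["pod_id"] = target_pod
    routing[shop_id]["locked"] = False


source = [
    {"table": "products", "shop_id": 238, "id": 1},
    {"table": "orders", "shop_id": 238, "id": 7},
    {"table": "products", "shop_id": 5, "id": 2},   # another shop stays put
]
target = []
routing = {238: {"pod_id": 9, "locked": False}}
new_checkout = {"table": "checkouts", "shop_id": 238, "id": 383293}
move_shop(238, source, target, routing, 23, [new_checkout])
```

The checkout created mid-copy ends up on the target pod too, which is the case slide 29 raises.
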
  32. Diagram labels: Traffic; Application and Data (Region A); Application and Data (Region B)

  33. Sorting Hat routes requests for a shop to the region the pod is active in
  34. Diagram labels: Traffic, Sorting Hat, Region A, Region B; Pod 2, Pod 7, and Pod 14, each marked Active or Inactive per region; GET /products Host: sneakershop.com; Routing: ROUTE sneakershop.com shop238 pod2:B
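
The routing example on slide 34 (`sneakershop.com`, `shop238`, `pod2:B`) can be read as a chain of lookups from host to shop to pod to active region. A minimal sketch, assuming three in-memory tables whose names are hypothetical:

```python
# Hypothetical routing tables; the entries mirror the example on slide 34.
SHOP_BY_HOST = {"sneakershop.com": "shop238"}
POD_BY_SHOP = {"shop238": "pod2"}
ACTIVE_REGION_BY_POD = {"pod2": "B"}


def route(host):
    """Resolve a request's Host header to (shop, pod, active region)."""
    shop = SHOP_BY_HOST[host]
    pod = POD_BY_SHOP[shop]
    return shop, pod, ACTIVE_REGION_BY_POD[pod]


decision = route("sneakershop.com")
```

Keeping the pod-to-region mapping in its own table means a pod can change region without touching any per-shop entry.
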
  35. Diagram labels: Traffic; Application and Data (Region A); Application and Data (Region B)

  36. Pod Mover moves pods between regions with minimal downtime

  37. Diagram labels: Traffic, Sorting Hat, Region A, Region B; Pod 2, Pod 7, and Pod 14, each marked Active or Inactive per region
  38. Diagram labels: Traffic, Sorting Hat, Region A, Region B; Pod 2, Pod 7, and Pod 14, each marked Active or Inactive per region
  39. • Update Routing for pod to target region (pod2:b -> pod2:a)
      • Sorting Hat routes requests to target region
      • Disable cron in both regions
      • Fail over MySQL to target region
      • Enable cron in both regions
      • Transfer jobs to target region
  40. What about errors while the database fails over?

  41. Pauser will pause requests in the middle of failovers to avoid serving errors. Diagram labels: Nginx with OpenResty, Pauser, Queue, Throttle, POST /checkout (during failover), HTTP 200 (seconds later)
  42. • Update Routing for pod to target region (pod2:b -> pod2:a)
      • Sorting Hat routes requests to target region and pauses requests
      • Disable cron in both regions
      • Fail over MySQL to target region
      • Enable cron in both regions
      • Resume requests
      • Transfer jobs to target region
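
The slide-42 sequence, with requests held rather than failed during the failover, could be sketched as below. `Pauser` and `move_pod` are hypothetical names, the infrastructure steps are stubbed out as log entries, and the timeout with a 503 fallback is an assumption the slides do not mention.

```python
import threading


class Pauser:
    """Hypothetical request gate: hold requests instead of failing them."""

    def __init__(self):
        self._open = threading.Event()
        self._open.set()              # requests flow by default

    def pause(self):
        self._open.clear()

    def resume(self):
        self._open.set()

    def handle(self, request, timeout=5.0):
        """Block while paused; return an HTTP status once released."""
        if not self._open.wait(timeout):
            return 503                # assumed fallback if the pause drags on
        return 200


def move_pod(pod, target_region, routing, pauser, steps):
    """Walk through the slide-42 order; real infrastructure calls are stubs."""
    routing[pod] = target_region      # Sorting Hat now routes to the target
    steps.append("update routing")
    pauser.pause()
    steps.append("pause requests")
    try:
        steps.append("disable cron in both regions")
        steps.append("fail over MySQL to target region")
        steps.append("enable cron in both regions")
    finally:
        pauser.resume()               # never leave requests paused
        steps.append("resume requests")
    steps.append("transfer jobs to target region")


routing = {"pod2": "b"}
pauser = Pauser()
steps = []
move_pod("pod2", "a", routing, pauser, steps)

# A request that arrives mid-failover waits, then succeeds after resume.
results = []
pauser.pause()
worker = threading.Thread(
    target=lambda: results.append(pauser.handle("POST /checkout")))
worker.start()
pauser.resume()
worker.join()
```

From the client's point of view the paused request simply takes longer, matching the "HTTP 200 (seconds later)" label on slide 41.
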
  43. (no text on this slide)
  44. Cloud Migration with the Pods Architecture

  45. Diagram labels: Region A; Cloud Region C; shops shop1, shop4, shop9, shop17, shop72, shop3, shop72, shop92, shop18, shop64, shop22, shop88, shop0, shop52, shop23
  46. (no text on this slide)
  47. Thanks! @Sirupsen