Full Stack Fest 2016: Shopify in Multiple Datacenters

How did Shopify evolve from a single-tenant platform supporting a single snowboard store into the massive beast it is today, boasting 300,000 stores? What infrastructure challenges did we have to solve along the way? This is the story of the evolution of Shopify's infrastructure from one store to 300,000, sprinkled with the hoops we jumped through for celebrities who bring devastating amounts of traffic.

Video: https://www.youtube.com/playlist?list=PLe9psSNJBf76DOOKMkDpyo_A5PfZk7JWc

Simon Hørup Eskildsen

September 05, 2016

Transcript

  1. Shopify in Multiple DCs, by @sirupsen

  2. None
  3. 300,000+ SHOPS, $20 BILLION+, 1500+ EMPLOYEES, 1000+ SERVERS, 2.5 DATACENTRES,
    RUBY ON RAILS, 10Ks PEAK RPS, 10,000 CHECKOUTS/M PEAK, 30+ DAILY DEPLOYS
  4. The Flash Sale Problem

  5. Kanye West

  6. Super Bowl Ads

  7. #KylieLipKit

  8. None

  9. scale

  10. Multi Tenant / Single Tenant. 2004 - 2016: Evolution of Shopify's Infrastructure
  11. Multi Tenant            | Single Tenant
      Large Capacity          | Low Capacity
      Good Utilization        | Poor Utilization
      Good for Flash Sales    | Poor for Flash Sales
      Cheaper                 | Expensive
      No Isolation            | Complete Isolation
      Poor Scalability        | Great Scalability
  12. Multi Tenant / Single Tenant. 2004: Snowdevil Launch

  13. None
  14. Multi Tenant / Single Tenant. 2006: Shopify Launches

  15. Multi Tenant / Single Tenant. Where is the golden middle?

  16. Multi Tenant / Single Tenant. 2006 - 2012: Vertical Scaling and Optimizations
  17. Vertical scaling, performance optimizations, caching, throttling

  18. Multi Tenant / Single Tenant. 2013: Database Sharding

  19. sharding: db, workers, lb1, lb2, lb3, redis, memcached

  20. sharding: shard0, shard1, shard2, shard3, shard4, workers, lb1, lb2, lb3, redis, memcached
  21. Multi Tenant / Single Tenant. 2014: Resiliency

  22. platform surface area = time × scale
    errors are proportional to surface area
  23. Resiliency Maturity Pyramid
    No resiliency effort
    Testing with mocks
    Toxiproxy tests and resiliency matrix
    Resiliency Patterns
    Production Practise Days (Games)
    Kill nodes
    Latency
    Application-Specific Fallbacks
    Kill DC
  24. Simulate TCP conditions with Toxiproxy
    https://github.com/shopify/toxiproxy

    Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
      Shop.first # this takes at least 1s
    end

    Toxiproxy[/redis/].down do
      session[:user_id] # this will throw an exception
    end
  25. Multi Tenant / Single Tenant. 2015: Multiple Datacenters

  26. datacenter 1: shard 1, shard 3, shard 5, shard 2, shard 4, shard 6
    datacenter 2: shard 1, shard 3, shard 5, shard 2, shard 4, shard 6
  27. bin/dc-failover

  28. datacenter 1: shard 1, shard 3, shard 5, shard 2, shard 4, shard 6
    datacenter 2: shard 1, shard 3, shard 5, shard 2, shard 4, shard 6
  29. Multi Tenant / Single Tenant. 2016: Pods

  30. pods: isolated group of shops
    goal: run pods in multiple datacenters
  31. previous shared architecture: shard0, shard1, shard2, shard3, shard4, workers, lb1, lb2, lb3, redis, memcached
  32. pod: worker, memcache, worker, redis, database shard1
    pod: worker, memcache, worker, redis, database shard2
  33. datacenter 1: pod 1, pod 3, pod 5, pod 2, pod 4, pod 6
    datacenter 2: pod 1, pod 3, pod 5, pod 2, pod 4, pod 6
  34. Multi Tenant            | Pods in Multiple Datacenters | Single Tenant
      Large Capacity          | Excellent Capacity           | Low Capacity
      Good Utilization        | Excellent Utilization        | Poor Utilization
      Good for Flash Sales    | Great for Flash Sales        | Poor for Flash Sales
      Cheaper                 | Cheap                        | Expensive
      No Isolation            | Pod Isolation                | Complete Isolation
      Poor Scalability        | Good Scalability             | Great Scalability
  35. flash sales: leveraging platform size to steal capacity from other pods when required
  36. sorting hat
    request:
      GET /products/beautiful-shoe HTTP/1.1
      Host: myshop.com
    pods: pod 1, pod 3, pod 2, pod 4 | pod 1, pod 3, pod 2, pod 4
  37. openresty = nginx + lua + ❤

  38. rule 1: any request must be annotated with a pod or shop
    rule 2: any request can only touch one pod
  39. POST /webhook?shop_id=12 HTTP/1.1
    Host: app.shopify.com

  40. count = 0
    with_each_shard do
      count += Shop.count
    end
    render "shops: #{count}"
  41. shitlist driven development

  42. if Shitlist.include?(klass)
      super
    else
      error = <<-EOE
        New usage of this API is deprecated. Please come talk to the
        Pods team in #pods and we'll help you out!
      EOE
      raise Shitlist::Error, error
    end
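
The guard on the slide above can be fleshed out into a small runnable sketch. Everything here beyond the `Shitlist.include?(klass)` / `super` / `raise` shape is an assumption made for illustration: the `Sharding` module, the `ShitlistGuard` module, the `ALLOWED` list and the two example classes are hypothetical names, not Shopify's code. The idea is that callers already on the list keep working, while any new caller fails loudly.

```ruby
# Hypothetical names throughout; only the allow-list idea comes from the slide.
module Shitlist
  class Error < StandardError; end

  # Existing callers that may keep using the deprecated API (assumed list).
  ALLOWED = ["LegacyReport"].freeze

  def self.include?(klass)
    ALLOWED.include?(klass.name)
  end
end

module Sharding
  # Stand-in for the deprecated cross-shard helper: yields each shard id.
  def self.with_each_shard(caller_class)
    [0, 1].each { |shard| yield shard }
  end
end

module ShitlistGuard
  # Runs before Sharding.with_each_shard because the module is prepended.
  def with_each_shard(caller_class, &block)
    if Shitlist.include?(caller_class)
      super # listed callers fall through to the original implementation
    else
      raise Shitlist::Error,
            "New usage of this API is deprecated. Please come talk to the Pods team."
    end
  end
end

Sharding.singleton_class.prepend(ShitlistGuard)

class LegacyReport; end # already on the list
class NewFeature; end   # not on the list, so it is rejected
```

With this in place, `Sharding.with_each_shard(LegacyReport) { |shard| ... }` still iterates over the shards, while the same call with `NewFeature` raises `Shitlist::Error`. Shrinking `ALLOWED` over time is what makes the list a migration tool rather than a permanent exemption.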
  43. post "/webhooks/amazon?shop_id=\d+", "webhooks#amazon"
    post "/signup", "signup#create"
    post "/check_availability", "signup#check_availability"
    post "/recover_password", "accounts#recover"
    get "*", "storefront#index"
  44. post "/webhooks/amazon?shop_id=\d+", "webhooks#amazon", routing_method: :shop_from_param
    post "/signup", "signup#create", routing_method: :signup
    post "/check_availability", "signup#check_availability", routing_method: :try_all_pods
    post "/recover_password", "accounts#recover", routing_method: :try_all_pods
    get "*", "storefront#index", routing_method: :shop_from_host
  45. Serialize Rails Routes to JSON
    Distribute to load balancer's Sorting Hat
  46. [
      { "pattern": "^/webhooks/amazon$", "http": ["POST"], "method": "shop_from_param" },
      { "pattern": "^/signup/create$", "http": ["POST"], "method": "signup" },
      { "pattern": "", "http": ["GET", "HEAD", "POST", ..], "method": "shop_from_host" }
    ]
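
To make the lookup concrete, here is a minimal sketch of how a router could use a table in this shape to choose a routing method. It is written in Ruby for consistency with the rest of the deck, whereas the slides say the Sorting Hat runs on OpenResty (nginx + Lua), so this is not the production code. The helper name `routing_method_for`, the first-match-wins rule, and the shortened verb list on the catch-all route (the slide elides it with `..`) are all assumptions.

```ruby
require "json"

# Route table in the shape shown on the slide. The verb list of the last
# entry is elided on the slide; three verbs are used here for illustration.
ROUTES_JSON = <<~EOS
  [
    { "pattern": "^/webhooks/amazon$", "http": ["POST"], "method": "shop_from_param" },
    { "pattern": "^/signup/create$",   "http": ["POST"], "method": "signup" },
    { "pattern": "",                   "http": ["GET", "HEAD", "POST"], "method": "shop_from_host" }
  ]
EOS

# Compile each pattern once so matching a request is a cheap scan.
ROUTES = JSON.parse(ROUTES_JSON).map do |route|
  { pattern: Regexp.new(route["pattern"]), verbs: route["http"], method: route["method"] }
end

# Returns the routing method of the first route matching verb and path,
# or nil when nothing matches (assumed first-match-wins semantics).
def routing_method_for(verb, path)
  route = ROUTES.find { |r| r[:verbs].include?(verb) && r[:pattern].match?(path) }
  route && route[:method]
end
```

For example, `routing_method_for("POST", "/webhooks/amazon")` returns `"shop_from_param"`, and `routing_method_for("GET", "/products/beautiful-shoe")` falls through to the empty catch-all pattern and returns `"shop_from_host"`. Under this assumption, route order matters: specific patterns must come before the catch-all.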
  47. sorting hat
    request:
      GET /products/beautiful-shoe HTTP/1.1
      Host: myshop.com
    pods: pod 1, pod 3, pod 2, pod 4 | pod 1, pod 3, pod 2, pod 4
  48. rule 1: any request must be annotated with a pod or shop
    rule 2: any request can only touch one pod
  49. resiliency through isolation
    resiliency through multi-dc
    multi-dc through isolation
    scalability through multi-dc, isolation and sharding
    simplification through isolation
    isolation through resiliency
  50. Multi Tenant / Single Tenant. 2017: Unknown

  51. Zero Downtime Pod Failovers: Pause Write Requests, Move Pod, Resume Requests
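
The three steps on this slide (pause write requests, move the pod, resume requests) can be sketched as a toy in-memory model. None of this is Shopify's tooling: the `PodRouter` class, its method names, the choice to queue paused writes and replay them afterwards, and the choice to keep serving reads during the move are assumptions made only to show why clients would see a delay rather than an error.

```ruby
# Hypothetical in-memory model; names are illustrative, not Shopify's tooling.
class PodRouter
  WRITE_VERBS = %w[POST PUT PATCH DELETE].freeze

  def initialize(locations)
    @locations = locations                  # pod id => datacenter name
    @paused = {}                            # pod id => true while writes are held
    @held = Hash.new { |h, k| h[k] = [] }   # pod id => queued write requests
  end

  # Route a request. Writes to a paused pod are queued instead of failing;
  # reads keep being served from the current location (an assumption).
  def handle(pod, verb, path)
    if @paused[pod] && WRITE_VERBS.include?(verb)
      @held[pod] << [verb, path]
      return :held
    end
    "#{@locations.fetch(pod)}: #{verb} #{path}"
  end

  # Pause writes, move the pod, resume, then replay the held writes
  # against the new location. The block simulates traffic during the move.
  def failover(pod, to:)
    @paused[pod] = true
    yield if block_given?
    @locations[pod] = to
    @paused.delete(pod)
    @held.delete(pod).to_a.map { |verb, path| handle(pod, verb, path) }
  end
end
```

A short usage example: after `router = PodRouter.new({ 1 => "dc1" })`, calling `router.failover(1, to: "dc2") { router.handle(1, "POST", "/cart") }` holds the write during the move and returns `["dc2: POST /cart"]`, meaning the request was completed in the new datacenter.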
  52. isolating shops further

  53. thank you simon eskildsen @sirupsen

  54. Icon credits (all from the Noun Project): Gear by Alexandr Cherkinsky; snowboard by Andrew Cooper; database by Creative Stall; Bomb by Arthur Shlain; Line Graph by Jules Dominic; people pattern by Eliricon; online shopping by creative outlet; magnify by Tim Smith; Lightning by Sergey Demushkin; lever by Nick Abrams; train tracks by Guilhem; JSON File by useiconic.com; Gandalf by Abel Tan; pause by Chad Pennings; Truck by Lemon Liu; Cargo container by Icon Fair