Scaling Rails for Black Friday / Cyber Monday at Shopify

Talk given at ConFoo 2015 on February 18th, 2015 and RailsConf 2015.

Christian Joudrey

February 18, 2015

Transcript

  1. scaling rails for Black Friday Cyber Monday

  2. @cjoudrey

  8. the stack

  9. nginx • unicorn • ruby 2.1 • rails 4 • mysql 5.6 (percona)
  10. 95 app servers • 1,800 unicorn workers • 18 job servers • 1,400 job workers
  11. scale?

  13. 400,000 reqs/min

  14. $3.7B annual GMV: that’s about $7,000 per min

  15. Black Friday Cyber Monday

  16. Black Friday Cyber FUNday

  18. ~ 600,000 reqs/min

  19. 1 request = 1 process

  20. scale++: ↓ resp. time • ↑ workers
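
The trade-off on the last two slides can be checked with simple arithmetic. The sketch below uses the worker count and peak request rate quoted in the talk, and assumes (this is not from the slides) that every request holds exactly one worker for its full duration and that load is spread evenly across workers.

```ruby
# Back-of-the-envelope capacity check. Assumption: one request occupies one
# unicorn worker for its whole duration, and load is evenly distributed.
PEAK_REQS_PER_MIN = 600_000 # peak rate quoted in the talk
UNICORN_WORKERS   = 1_800   # worker count quoted in the talk

# Longest average response time (in seconds) the fleet can sustain.
def max_avg_response_time(reqs_per_min, workers)
  reqs_per_sec = reqs_per_min / 60.0
  workers / reqs_per_sec
end

# Workers required to serve a given rate at a given average latency (in ms).
def workers_needed(reqs_per_min, avg_response_ms)
  (reqs_per_min * avg_response_ms / 60_000.0).ceil
end

budget = max_avg_response_time(PEAK_REQS_PER_MIN, UNICORN_WORKERS)
puts format("average response time budget: %.2f s", budget)
puts "workers needed at 100 ms: #{workers_needed(PEAK_REQS_PER_MIN, 100)}"
```

Either lever works: halving the average response time has the same effect on capacity as doubling the number of workers.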

  21. page caching

  24. shopify/cacheable

  25. generational caching

  26. gzip • etag + 304 not modified
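
As a rough illustration of the ETag and 304 handshake named on this slide, here is a minimal sketch in plain Ruby. The helper name `conditional_response` and the choice of an MD5 digest of the body as the ETag are assumptions made for the example, not details taken from the talk.

```ruby
require "digest"

# Hypothetical helper: returns [status, headers, body] for a cached page.
# If the client already holds the current version, the body is skipped.
def conditional_response(body, if_none_match)
  etag = %("#{Digest::MD5.hexdigest(body)}")
  if if_none_match == etag
    [304, { "ETag" => etag }, ""]   # client copy is still valid
  else
    [200, { "ETag" => etag }, body] # first visit, or the page changed
  end
end

page = "<h1>Posts</h1>"
status, headers, = conditional_response(page, nil)
status_again, = conditional_response(page, headers["ETag"])
puts status
puts status_again
```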

  27. class PostsController < ApplicationController
        def index
          response_cache do
            @posts = @shop.posts.paginate(params[:page])
            respond_with(@posts)
          end
        end

        def cache_key_data
          {
            shop_id: @shop.id,
            path: request.path,
            format: request.format,
            params: params.slice(:page),
            shop_version: @shop.version
          }
        end
      end
  28. GET /posts?page=2
      md5(
        {
          shop_id: 1,
          path: '/posts',
          format: 'text/html',
          params: { page: 2 },
          shop_version: 123
        }.to_s
      )
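
The key derivation above, combined with generational caching, can be sketched in plain Ruby. The in-memory `PAGE_CACHE` hash stands in for the real cache store and the `response_cache` method here is a toy written for this example; neither is the shopify/cacheable implementation.

```ruby
require "digest"

# In-memory stand-in for the cache store; an assumption for illustration.
PAGE_CACHE = {}

# Same idea as the slide: hash the key data and digest its string form.
def cache_key(data)
  Digest::MD5.hexdigest(data.to_s)
end

# Toy version of response_cache: render once per key, then serve the copy.
def response_cache(data)
  PAGE_CACHE[cache_key(data)] ||= yield
end

renders = 0
key_data = { shop_id: 1, path: "/posts", format: "text/html",
             params: { page: 2 }, shop_version: 123 }

2.times { response_cache(key_data) { renders += 1; "<html>v123</html>" } }

# Generational caching: bumping shop_version produces a new key, so stale
# entries are never read again and need no explicit deletion.
response_cache(key_data.merge(shop_version: 124)) { renders += 1; "<html>v124</html>" }

puts renders
```

Because the version is part of the key, a write to the shop only has to increment one counter to invalidate every cached page for that shop.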
  30. sale

  31. query caching

  32. shopify/identity_cache

  33. full model caching

  34. opt-in by design

  35. after_commit expiry

  36. class Product < ActiveRecord::Base
        include IdentityCache
        has_many :images
        cache_has_many :images, :embed => true
      end

      product = Product.fetch(id)
      images = product.fetch_images
  37. class Product < ActiveRecord::Base
        include IdentityCache
        cache_index :shop_id, :handle, :unique => true
      end

      Product.fetch_by_shop_id_and_handle(shop_id, handle)
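
The three properties listed earlier (full model caching, opt-in by design, after_commit expiry) can be sketched with a toy read-through cache. This is not the IdentityCache implementation; the class `ToyProductStore` and its in-memory hashes are hypothetical stand-ins for the model, MySQL and memcached.

```ruby
# Toy read-through cache illustrating opt-in reads and expiry on write.
class ToyProductStore
  attr_reader :db_reads

  def initialize(rows)
    @rows = rows   # stands in for MySQL: id => attributes
    @cache = {}    # stands in for memcached
    @db_reads = 0
  end

  # Opt-in cached read: callers choose fetch instead of find.
  def fetch(id)
    @cache.fetch(id) { @cache[id] = find(id) }
  end

  # Uncached read that always hits the "database".
  def find(id)
    @db_reads += 1
    @rows.fetch(id).dup
  end

  # Write, then expire the cached entry, mimicking an after_commit hook.
  def update(id, attrs)
    @rows[id] = @rows.fetch(id).merge(attrs)
    @cache.delete(id)
  end
end

store = ToyProductStore.new(1 => { title: "Shirt" })
store.fetch(1)
store.fetch(1)                   # served from the cache, no database read
store.update(1, title: "Hoodie") # expires the cached entry
puts store.fetch(1)[:title]
puts store.db_reads
```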
  39. sale

  40. background jobs

  41. webhooks • emails • fraud detection • payment processing

  43. priority queues: realtime • payment • default • low
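
One way to picture priority queues is a worker that always drains the most important non-empty queue first. The sketch below uses the queue names from the slide, but the ordering, the class `ToyJobQueues` and the job strings are assumptions made for the example.

```ruby
# Queue names come from the slide; the priority order is an assumption.
QUEUE_ORDER = %i[realtime payment default low].freeze

class ToyJobQueues
  def initialize
    @queues = QUEUE_ORDER.to_h { |name| [name, []] }
  end

  def enqueue(queue, job)
    @queues.fetch(queue) << job
  end

  # Returns the oldest job from the highest-priority non-empty queue,
  # or nil when every queue is empty.
  def reserve
    QUEUE_ORDER.each do |name|
      job = @queues[name].shift
      return job if job
    end
    nil
  end
end

queues = ToyJobQueues.new
queues.enqueue(:low, "send newsletter")
queues.enqueue(:default, "deliver webhook")
queues.enqueue(:payment, "capture payment")

order = []
while (job = queues.reserve)
  order << job
end
puts order.inspect
```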

  44. throttling

  45. now what?

  46. if it moves... measure it!

  47. statsd

  48. shopify/statsd-instrument

  49. Liquid::Template.extend StatsD::Instrument
      Liquid::Template.statsd_measure :render, 'Liquid.Template.render'

  50. PaymentProcessingJob.statsd_count :perform, 'PaymentProcessingJob.processed'
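
To show roughly what a declaration like `statsd_measure` does, here is a toy version that wraps a method and records its duration. It is not the statsd-instrument implementation: `ToyInstrument`, `toy_measure`, `ToyTemplate` and the in-memory `METRICS` hash are hypothetical, and a real client would send the timing to a StatsD server instead of storing it.

```ruby
# Toy instrumentation: wraps a method and records how long each call took.
module ToyInstrument
  METRICS = Hash.new { |hash, key| hash[key] = [] }

  # Prepends a wrapper module so the original method is reached via super.
  def toy_measure(method_name, metric)
    wrapper = Module.new do
      define_method(method_name) do |*args, **kwargs, &block|
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        begin
          super(*args, **kwargs, &block)
        ensure
          elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
          ToyInstrument::METRICS[metric] << elapsed * 1000.0 # milliseconds
        end
      end
    end
    prepend(wrapper)
  end
end

class ToyTemplate
  extend ToyInstrument

  def render(name)
    "Hello, #{name}"
  end

  toy_measure :render, "Liquid.Template.render"
end

result = ToyTemplate.new.render("shop")
puts result
puts ToyInstrument::METRICS["Liquid.Template.render"].size
```

The appeal of this style is that the measured class stays free of timing code; the metric is declared once, next to the method it covers.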

  53. load testing

  54. genghis khan

  55. simulate Black Friday Cyber Monday before it happens

  56. several times per week

  57. slow queries

  58. # User@Host: shopify[shopify] @ [127.0.0.1]
      # Thread_id: 264419969 Schema: shopify Last_errno: 0 Killed: 0
      # Query_time: 0.150491 Lock_time: 0.000057 Rows_sent: 1 Rows_examined: 147841 Rows_affected: 0 Rows_read: 147841
      # Bytes_sent: 1214 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0
      # InnoDB_trx_id: FF7021AAA
      # QC_Hit: No Full_scan: No Full_join: No Tmp_table: No Tmp_table_on_disk: No
      # Filesort: Yes Filesort_on_disk: No Merge_passes: 0
      # InnoDB_IO_r_ops: 0 InnoDB_IO_r_bytes: 0 InnoDB_IO_r_wait: 0.000000
      # InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000
      # InnoDB_pages_distinct: 475
      SET timestamp=1393385020;
      SELECT `discounts`.* FROM `discounts` WHERE `discounts`.`shop_id` = 1745470 AND `discounts`.`status` = 'enabled' ORDER BY ISNULL(ends_at) DESC, ends_at DESC LIMIT 1
  59. determining root cause

  60. step 1: nginx request_id header
      https://github.com/newobj/nginx-x-rid-header
      proxy_set_header X-Request-ID "$request_id";
      log_format main '... $request_id'
  61. step 2: log_process_action (ActionController::Instrumentation)
      https://gist.github.com/mnutt/566725
      Completed 200 OK in 100ms (Views: 60ms | ActiveRecord: 40ms | request_id=bc12813bce...)
  62. step 3: basecamp/marginalia
      https://github.com/basecamp/marginalia
      User Load (0.3ms) SELECT `users`.* FROM `users` WHERE `users`.`id` = 1 LIMIT 1 /*application:Shopify,controller:users,action:show,request_id:bc12813bce...*/
  63. profit!
      # User@Host: shopify[shopify] @ [127.0.0.1]
      # Thread_id: 264419969 Schema: shopify Last_errno: 0 Killed: 0
      # Query_time: 0.150491 Lock_time: 0.000057 Rows_sent: 1 Rows_examined: 147841 Rows_affected: 0 Rows_read: 147841
      # Bytes_sent: 1214 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0
      # InnoDB_trx_id: FF7021AAA
      # QC_Hit: No Full_scan: No Full_join: No Tmp_table: No Tmp_table_on_disk: No
      # Filesort: Yes Filesort_on_disk: No Merge_passes: 0
      # InnoDB_IO_r_ops: 0 InnoDB_IO_r_bytes: 0 InnoDB_IO_r_wait: 0.000000
      # InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000
      # InnoDB_pages_distinct: 475
      SET timestamp=1393385020;
      SELECT `discounts`.* FROM `discounts` WHERE `discounts`.`shop_id` = 1745470 AND `discounts`.`status` = 'enabled' ORDER BY ISNULL(ends_at) DESC, ends_at DESC LIMIT 1 /*application:Shopify,controller:orders,action:pay,request_id:bc12813bce...*/
  64. profit! (2): access.log • rails.log • slow_query.log
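
Once the same request ID appears in all three logs, tracing a slow query back to its request is a text search. The sketch below runs that search over in-memory sample lines; the log contents, the IDs `req-41` and `req-42`, and the `trace` helper are made up for illustration and only loosely follow the formats shown on the slides.

```ruby
# Hypothetical sample log lines, keyed by file name.
LOGS = {
  "access.log" => [
    "GET /orders/1/pay 200 req-41",
    "GET /users/1 200 req-42"
  ],
  "rails.log" => [
    "Completed 200 OK in 100ms (Views: 60ms | ActiveRecord: 40ms | request_id=req-41)",
    "Completed 200 OK in 20ms (Views: 15ms | ActiveRecord: 5ms | request_id=req-42)"
  ],
  "slow_query.log" => [
    "SELECT `discounts`.* FROM `discounts` LIMIT 1 /*controller:orders,action:pay,request_id:req-41*/"
  ]
}.freeze

# Returns, for each log, only the lines that mention the given request ID.
def trace(logs, request_id)
  pattern = /\b#{Regexp.escape(request_id)}\b/
  logs.transform_values { |lines| lines.grep(pattern) }
end

trace(LOGS, "req-41").each do |file, lines|
  puts "#{file}:"
  lines.each { |line| puts "  #{line}" }
end
```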

  65. bonus! background jobs

  66. schema migration with zero downtime

  67. soundcloud/lhm

  68. class NewIndexOnOrders < ActiveRecord::Migration
        def self.up
          Lhm.change_table :orders do |m|
            m.add_index [:shop_id, :customer_id]
          end
        end

        def self.down
          #
        end
      end
  69. orders 20140520_orders

  70. insert/delete/update triggers

  71. INSERT INTO ... SELECT ... insert/delete/update triggers

  72. async caveat

  73. resiliency

  74. A resilient system is one that functions with one or more components being unavailable or unacceptably slow.
  75. don’t let minor dependencies take you down

  76. shopify/toxiproxy ☢: your app • toxiproxy • redis, memcached, sessions

  77. def load_customer
        if customer_id = session[:customer_id]
          @customer = Customer.find_by_id(customer_id)
        end
      end
  78. def load_customer
        if customer_id = session[:customer_id]
          @customer = Customer.find_by_id(customer_id)
        end
      rescue Sessions::DataStoreUnavailable
        @customer = nil
      end
  79. def test_storefront_resilient_to_sessions_down
        Toxiproxy[:sessions_data_store].down do
          get '/'
          assert_response :success
        end
      end
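
The rescue-and-degrade pattern from the last three slides can be exercised without Rails or Toxiproxy. In this sketch `ToySessionStore` and its `down` method are hypothetical stand-ins that only imitate the role Toxiproxy plays in the test, and `CUSTOMERS` replaces the `Customer` model.

```ruby
module Sessions
  class DataStoreUnavailable < StandardError; end
end

# Stand-in for a session store that can be taken down for a block.
class ToySessionStore
  def initialize(data)
    @data = data
    @down = false
  end

  # Simulates an outage for the duration of the block.
  def down
    @down = true
    yield
  ensure
    @down = false
  end

  def [](key)
    raise Sessions::DataStoreUnavailable, "session store is down" if @down
    @data[key]
  end
end

CUSTOMERS = { 7 => "Ada" }.freeze # replaces the Customer model

# Same shape as the slide: treat an unavailable store as "not logged in".
def load_customer(session)
  customer_id = session[:customer_id]
  customer_id && CUSTOMERS[customer_id]
rescue Sessions::DataStoreUnavailable
  nil
end

session = ToySessionStore.new(customer_id: 7)
puts load_customer(session).inspect
session.down { puts load_customer(session).inspect }
```

The storefront stays up in both cases; during the outage it simply renders as if the visitor were logged out.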

  80. rinse & repeat http://www.shopify.com/technology/16906928-building-and-testing-resilient-ruby-on-rails-applications

  81. what about slow resources?

  82. shard 1 (shop 1, 2, 3) • shard 2 (shop 4, 5, 6) • shard 3 (shop 7, 8, 9)
  83. rails request: shard 1 (shop 1, 2, 3) • shard 2 (shop 4, 5, 6) • shard 3 (shop 7, 8, 9)
  84. rails request: shard 1 (shop 1, 2, 3) • shard 2 (shop 4, 5, 6) • shard 3 (shop 7, 8, 9)
  85. how can we fail fast?

  86. shopify/semian smart circuit-breaker

  87. shopify/semian
      Semian.register(:mysql_shard_1,
        tickets: 5,
        timeout: 0.5,
        error_threshold: 100,
        error_timeout: 10,
        success_threshold: 2)
  88. shopify/semian
      Semian[:mysql_shard_1].acquire do
        # Query the resource
      end
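
To make the circuit-breaker idea concrete, here is a simplified sketch that borrows the parameter names `error_threshold`, `error_timeout` and `success_threshold` from the slide. It is not Semian's implementation: `ToyCircuitBreaker` is hypothetical, it counts consecutive errors instead of errors within a time window, it omits the `tickets` and `timeout` limits, and the injectable `clock` exists only to make the example deterministic.

```ruby
class ToyCircuitBreaker
  class OpenCircuitError < StandardError; end

  attr_reader :state

  def initialize(error_threshold:, error_timeout:, success_threshold:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold     # errors before the circuit opens
    @error_timeout = error_timeout         # seconds to stay open
    @success_threshold = success_threshold # successes needed to close again
    @clock = clock
    @state = :closed
    @errors = 0
    @successes = 0
  end

  def acquire
    if @state == :open
      # Fail fast without touching the resource while the circuit is open.
      raise OpenCircuitError, "failing fast" if @clock.call - @opened_at < @error_timeout
      @state = :half_open # timeout elapsed: let trial requests through
      @successes = 0
    end

    begin
      result = yield
    rescue StandardError
      record_error
      raise
    end
    record_success
    result
  end

  private

  def record_error
    @errors += 1
    trip if @state == :half_open || @errors >= @error_threshold
  end

  def record_success
    if @state == :half_open
      @successes += 1
      close if @successes >= @success_threshold
    else
      @errors = 0
    end
  end

  def trip
    @state = :open
    @opened_at = @clock.call
  end

  def close
    @state = :closed
    @errors = 0
  end
end

now = 0.0 # fake clock so the example does not need to sleep
breaker = ToyCircuitBreaker.new(error_threshold: 2, error_timeout: 10,
                                success_threshold: 2, clock: -> { now })

2.times do
  begin
    breaker.acquire { raise IOError, "shard timed out" }
  rescue IOError
    # the caller still sees the original error
  end
end
puts breaker.state
```

With the circuit open, requests for shops on the unhealthy shard fail immediately instead of tying up workers, so requests for the other shards keep being served.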

  89. rails request: shard 1 (shop 1, 2, 3) • shard 2 (shop 4, 5, 6) • shard 3 (shop 7, 8, 9)
  90. what else can go wrong?

  91. Shopify: Shipping rate providers (FedEx, UPS, USPS, etc.) • Payment gateways (Stripe, PayPal, etc.) • Fulfillment services (Shipwire, etc.) • Internal services (MySQL, Memcached, etc.)
  92. manual circuit breakers around external dependencies

  93. thanks! :)