RailsConf 2017: 5 Years of Scaling Rails to 80K RPS

Video: http://confreaks.tv/videos/railsconf2017-5-years-of-rails-scaling-to-80k-rps

Shopify has taken Rails through some of the world's largest sales: the Super Bowl, celebrity launches, and Black Friday. In this talk, we will go through the evolution of the Shopify infrastructure: from re-architecting and caching in 2012, sharding in 2013, and reducing the blast radius of every point of failure in 2014, to 2016, when we succeeded in running our 325,000+ stores out of multiple datacenters. It'll be a whirlwind tour of the lessons learned scaling one of the world's largest Rails deployments for half a decade.

Simon Hørup Eskildsen

April 27, 2017

Transcript

  1. 5 YEARS OF SCALING RAILS SIMON ESKILDSEN @SIRUPSEN

  2. None
  3. 377,500+ SHOPS | $29 BILLION+ | 1900+ EMPLOYEES | 2 DATACENTRES | RUBY ON RAILS SINCE 2006 | 80K PEAK RPS | 40+ DAILY DEPLOYS | 20K-40K+ STEADY RPS

  4. STOREFRONT: HEAVY READS, CACHEABLE, AVAILABILITY, 80% TRAFFIC
    CHECKOUT: HEAVY WRITES, EXTERNALS, CONSISTENCY
    ADMIN: COMPLEX R/W, CONSISTENCY
    API: COMPLEX R/W, CONSISTENCY, FAST COMPUTERS

  5. FLASH SALES: SCHOOL OF HARD KNOCKS + =

  6. ORIGIN OF THE PLATFORM 2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

  7. 2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

  8. OPTIMIZATIONS
    OPTIMIZING THE HOT PATHS: Debug logs were printed to identify all the work going into requests
    BACKGROUNDING CHECKOUTS: Payment processing was pushed to background jobs
    INVENTORY OPTIMIZATIONS: MySQL lock contention too high with 1,000s of customers

  9. None
  10. LOAD-TESTING FEEDBACK LOOP: Are we actually improving?
    FULL PRODUCTION INTEGRATION TESTING: Execute full checkout flow, simulate real users.

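  11. IDENTITYCACHE
      class Product < ActiveRecord::Base
        include IdentityCache
        # Cache lookups by unique handle, and by vendor plus product type.
        cache_index :handle, unique: true
        cache_index :vendor, :product_type
      end

      # fetch_by_* reads from the cache and falls back to the database on a miss.
      product = Product.fetch_by_handle(handle)
      products = Product.fetch_by_vendor_and_product_type(vendor, product_type)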
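
The slide only shows IdentityCache being declared and called. As a rough illustration of the read-through idea behind a `fetch_by_*` call, here is a minimal sketch in plain Ruby. `FakeDatabase` and `ReadThroughCache` are hypothetical names invented for this example, a Hash stands in for memcached, and none of this is IdentityCache's actual implementation.

```ruby
# Hypothetical stand-ins for illustration; not part of IdentityCache or Rails.
class FakeDatabase
  attr_reader :queries

  def initialize(rows)
    @rows = rows
    @queries = 0
  end

  # Simulates a lookup by handle and counts how often it runs.
  def find_by_handle(handle)
    @queries += 1
    @rows[handle]
  end
end

class ReadThroughCache
  def initialize(database)
    @database = database
    @store = {} # stands in for memcached
  end

  # Serve from the cache when possible; otherwise read the database and
  # remember the result for the next caller.
  def fetch_by_handle(handle)
    key = cache_key(handle)
    return @store[key] if @store.key?(key)

    @store[key] = @database.find_by_handle(handle)
  end

  # Called after a write so the next read goes back to the database.
  def expire(handle)
    @store.delete(cache_key(handle))
  end

  private

  def cache_key(handle)
    "product:handle:#{handle}"
  end
end

db = FakeDatabase.new("beautiful-shoe" => { title: "Beautiful Shoe" })
cache = ReadThroughCache.new(db)
2.times { cache.fetch_by_handle("beautiful-shoe") }
puts db.queries # the second fetch is served from the cache
```

The part that matters under flash-sale load is that repeated reads of the same record stop reaching the database; expiry on write is what keeps the cached copy from going stale.
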
  12. 2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

  13. FAST INFLEXIBLE SLOW FLEXIBLE OPTIMIZATION FALLACY

  14. SHARDING 101
      # Run the query against the shard that holds this shop's data.
      Sharding.with_shard(shop.shard_id) do
        Product.find_by!(shop_id: shop.id, id: product_id)
      end

  15. class ProductController < ApplicationController
        around_action :with_shop

        def show
          @product = @shop.products.find(params[:id])
        end

        private

        # Look up the shop by request host, then run the action on its shard.
        def with_shop(&block)
          @shop = Shop.find_by_host(request.host)
          Sharding.with_shard(@shop.shard_id, &block)
        end
      end
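
The slides call `Sharding.with_shard` but do not show how it works. One plausible shape, sketched here purely as an assumption, is a block helper that records the selected shard for the current thread and restores the previous selection afterwards. The `NoShardSelected` error and `current_shard` reader are hypothetical additions for this example.

```ruby
# Hypothetical sketch: the talk shows Sharding.with_shard being called,
# not how it is implemented.
module Sharding
  class NoShardSelected < StandardError; end

  # Run the block with shard_id as the current shard, restoring the
  # previous selection afterwards, even if the block raises.
  def self.with_shard(shard_id)
    previous = Thread.current[:shard_id]
    Thread.current[:shard_id] = shard_id
    yield
  ensure
    Thread.current[:shard_id] = previous
  end

  # A connection layer would consult this to pick the right database.
  def self.current_shard
    Thread.current[:shard_id] || raise(NoShardSelected, "no shard selected")
  end
end

Sharding.with_shard(3) do
  Sharding.with_shard(7) { puts Sharding.current_shard } # inner selection
  puts Sharding.current_shard                            # outer selection restored
end
```

Raising when no shard is selected, rather than defaulting to one, makes a query that runs outside `with_shard` fail loudly instead of silently reading the wrong database.
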
  16. SHARDING
    DON’T SHARD (WHERE ARE YOU ON THE OPTIMIZATION SPECTRUM?): Sharding is hard, it took us a year!
    ARCHITECTURE DRAWBACKS: Common cases are easy, but edge cases can now violate fundamentals. For example, cross-database transactions are now impossible.
    APPLICATION-LEVEL SHARDING: Why did we choose it over a proxy or changing datastores?

  17. 2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

  18. SURFACE AREA

  19. Chart: availability (%, axis 70 to 100) against number of components (10, 50, 100, 500, 1000), with series labelled 99.95, 99.98, 99.99, and 99.999
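
The chart's labels point at the usual compounding argument: if a request needs every component to be up, overall availability is the per-component availability raised to the number of components. The sketch below assumes independent failures and that all components are required; neither assumption is stated on the slide, and `system_availability` is a name chosen for this example.

```ruby
# Availability of a system that needs all of its components to be up,
# assuming failures are independent (an assumption made for this sketch).
def system_availability(component_availability, components)
  component_availability**components
end

component_counts = [10, 50, 100, 500, 1000]

[0.9995, 0.9998, 0.9999, 0.99999].each do |availability|
  row = component_counts.map do |count|
    format("%6.2f%%", system_availability(availability, count) * 100)
  end
  puts "#{format('%.3f%%', availability * 100)} per component: #{row.join(' ')}"
end
```

Under these assumptions, 100 components at 99.99% each already bring the whole system down to roughly 99%, which is why the surface area of hard dependencies matters.
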
  20. None
  21. A single component failure should not be able to compromise the performance or availability of the entire system

  22. Resiliency Matrix
                            Checkout                Admin         Storefront
      MySQL Shard           Unavailable             Unavailable   Degraded
      MySQL Master          Available (if cached)   Unavailable   Available
      Kafka                 Available               Degraded      Available
      External HTTP API     Degraded                Available     Unavailable
      redis-sessions        Unavailable             Unavailable   Degraded
      logging (disk full)   Unavailable             Unavailable   Unavailable

  23. Simulate network problems with Toxiproxy
      https://github.com/shopify/toxiproxy

      # Add 1000 ms of latency to responses coming back from the MySQL master.
      Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
        Shop.first # this takes at least 1s
      end

      # Take every proxy whose name matches /redis/ down for the block.
      Toxiproxy[/redis/].down do
        session[:user_id] # this will throw an exception
      end

  24. Write a Toxiproxy test for each cell
      # test/integration/resiliency_matrix_test.rb
      def test_section_a_mq_a_down
        Toxiproxy[:message_queue_a].down do
          get '/section_a'
          assert_response :success
        end
      end

      def test_section_b_datastore_b
        Toxiproxy[:datastore_b].down do
          get '/section_b'
          assert_response 500
        end
      end
      # ... and every other cell

  25. SHOPIFY/SEMIAN
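
Semian is Shopify's resiliency library, and a circuit breaker is one of the patterns it provides. The sketch below is a simplified, hypothetical breaker written for illustration only. It does not use or mirror Semian's API, and the `error_threshold`, `reset_after`, and `clock` parameters are names made up for this example.

```ruby
# Hypothetical, simplified circuit breaker; not Semian's API.
class CircuitBreaker
  class OpenError < StandardError; end

  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(error_threshold:, reset_after:, clock: MONOTONIC_CLOCK)
    @error_threshold = error_threshold
    @reset_after = reset_after
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  # Runs the block unless the circuit is open, in which case it fails fast
  # without touching the struggling resource.
  def call
    raise OpenError, "circuit open, failing fast" if open?

    begin
      result = yield
      @failures = 0
      @opened_at = nil
      result
    rescue StandardError
      @failures += 1
      @opened_at = @clock.call if @failures >= @error_threshold
      raise
    end
  end

  private

  # After the cool-down, calls are let through again to probe the resource.
  def open?
    return false unless @opened_at

    @clock.call - @opened_at < @reset_after
  end
end

now = 0.0 # fake clock so the example does not need to sleep
breaker = CircuitBreaker.new(error_threshold: 2, reset_after: 10, clock: -> { now })

2.times do
  begin
    breaker.call { raise IOError, "timeout" }
  rescue IOError
    # the caller would fall back here
  end
end

begin
  breaker.call { :never_runs }
rescue CircuitBreaker::OpenError => e
  puts e.message
end

now = 11.0
puts breaker.call { :ok } # cool-down elapsed, the call goes through
```

Failing fast is what keeps one slow dependency from tying up every worker while they all wait on timeouts.
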

  26. Resiliency Maturity Pyramid: No resiliency effort; Testing with mocks; Toxiproxy tests and matrix; Resiliency Patterns; Production Practise Days (Games); Kill nodes; Latency; Application-Specific Fallbacks; Kill DC

  27. 2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

  28. bin/failover

  29. ACTIVEFAILOVER: 10-60S FAILOVERS
    1. FAILOVER TRAFFIC: Set flag on load balancers to redirect traffic to new datacenter
    2. READ-ONLY SHOPIFY: Traffic going to new datacenter, but is read-only (no checkouts, changes)
    3. FAILOVER DATABASE: Move the writer for all shards to the new primary datacenter
    4. TRANSFER JOBS: Queued and delayed jobs are transferred to the new primary DC

  30. 2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

  31. previous shared architecture (diagram): lb1 lb2 lb3, workers, redis, memcached, shard0 shard1 shard2 shard3 shard4

  32. pod: worker, memcache, worker, redis, database shard1; pod: worker, memcache, worker, redis, database shard2

  33. datacenter 1: pod 1, pod 2, pod 3, pod 4, pod 5, pod 6; datacenter 2: pod 1, pod 2, pod 3, pod 4, pod 5, pod 6

  34. sorting hat
      GET /products/beautiful-shoe HTTP/1.1
      Host: myshop.com
      pod 1 pod 2 pod 3 pod 4 | pod 1 pod 2 pod 3 pod 4

  35. rule 1: any request must be annotated with a pod or shop
    rule 2: any request can only touch one pod
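
The two rules can be sketched as a small router that maps a request's host to exactly one pod and then to the datacenter serving that pod. The talk does not show the sorting hat's code, so the `SortingHat` class, its lookup tables, and the `ensure_single_pod!` guard are all hypothetical.

```ruby
# Hypothetical data and names; the talk does not show the sorting hat's code.
class SortingHat
  class UnknownShop < StandardError; end
  class CrossPodAccess < StandardError; end

  def initialize(pod_for_host:, active_datacenter_for_pod:)
    @pod_for_host = pod_for_host
    @active_datacenter_for_pod = active_datacenter_for_pod
  end

  # Rule 1: annotate the request with exactly one pod, derived from the shop.
  # The pod then decides which datacenter serves the request.
  def route(host)
    pod = @pod_for_host.fetch(host) { raise UnknownShop, "no shop for #{host}" }
    { pod: pod, datacenter: @active_datacenter_for_pod.fetch(pod) }
  end

  # Rule 2: refuse work that would touch a pod other than the request's own.
  def ensure_single_pod!(routed, touched_pod)
    return if routed[:pod] == touched_pod

    raise CrossPodAccess, "request for pod #{routed[:pod]} touched pod #{touched_pod}"
  end
end

hat = SortingHat.new(
  pod_for_host: { "myshop.com" => 2, "othershop.com" => 5 },
  active_datacenter_for_pod: { 2 => "datacenter 1", 5 => "datacenter 2" }
)

routed = hat.route("myshop.com")
hat.ensure_single_pod!(routed, 2) # same pod, so this passes silently
puts "pod #{routed[:pod]} in #{routed[:datacenter]}"
```

Because every request resolves to a single pod, moving a pod between datacenters only requires changing the pod-to-datacenter table in this sketch.
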
  36. count = 0
      # Iterates over every shard within a single request.
      with_each_shard do
        count += Shop.count
      end
      render plain: "shops: #{count}"

  37. shitlist driven development

  38. if Shitlist.include?(klass)
        super
      else
        error = <<~EOE
          New usage of this API is deprecated. Please come talk
          to the Pods team in #pods and we'll help you out!
        EOE
        raise Shitlist::Error, error
      end
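
The slide shows only the guard itself. One way such a guard could be wired up, sketched here as an assumption, is a module prepended onto the class that owns the deprecated method, so existing callers on the list keep working while new callers get an error. `CrossShardGuard`, `ShardIterator`, `LegacyReport`, and `NewFeature` are hypothetical names for this example.

```ruby
# Hypothetical sketch of the pattern; class and method names are illustrative.
module Shitlist
  class Error < StandardError; end

  # Existing callers that may keep using the deprecated API for now.
  ALLOWED = %w[LegacyReport].freeze

  def self.include?(klass)
    ALLOWED.include?(klass.name)
  end
end

# Prepended in front of the deprecated method so it can veto new callers.
module CrossShardGuard
  def with_each_shard(caller_class)
    if Shitlist.include?(caller_class)
      super # forwards the argument and the block to the original method
    else
      raise Shitlist::Error, "New usage of this API is deprecated: #{caller_class.name}"
    end
  end
end

class ShardIterator
  def with_each_shard(_caller_class)
    [1, 2].each { |shard_id| yield shard_id }
  end
end
ShardIterator.prepend(CrossShardGuard)

class LegacyReport; end
class NewFeature; end

ShardIterator.new.with_each_shard(LegacyReport) { |id| puts "shard #{id}" }

begin
  ShardIterator.new.with_each_shard(NewFeature) { |id| puts "shard #{id}" }
rescue Shitlist::Error => e
  puts e.message
end
```

The list can only shrink: each caller removed from it is one less violation, and nothing new can be added without someone editing the list on purpose.
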
  39. 2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

  40. THANK YOU SIMON ESKILDSEN @SIRUPSEN FEEL FREE TO TWEET QUESTIONS AT ME!