Slide 1

Slide 1 text

5 YEARS OF SCALING RAILS SIMON ESKILDSEN @SIRUPSEN

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

377,500+ SHOPS
$29 BILLION+
1900+ EMPLOYEES
2 DATACENTRES
RUBY ON RAILS SINCE 2006
80K PEAK RPS
40+ DAILY DEPLOYS
20K-40K+ STEADY RPS

Slide 4

Slide 4 text

STOREFRONT: HEAVY READS, CACHEABLE, AVAILABILITY, 80% TRAFFIC
CHECKOUT: HEAVY WRITES, EXTERNALS, CONSISTENCY
ADMIN: COMPLEX R/W, CONSISTENCY
API: COMPLEX R/W, CONSISTENCY, FAST COMPUTERS

Slide 5

Slide 5 text

FLASH SALES: SCHOOL OF HARD KNOCKS

Slide 6

Slide 6 text

ORIGIN OF THE PLATFORM 2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

Slide 7

Slide 7 text

2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

Slide 8

Slide 8 text

OPTIMIZATIONS
OPTIMIZING THE HOT PATHS: Debug logs were printed to identify all the work going into requests.
BACKGROUNDING CHECKOUTS: Payment processing was pushed to background jobs.
INVENTORY OPTIMIZATIONS: MySQL lock contention was too high with 1,000s of customers.

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

LOAD-TESTING
FEEDBACK LOOP: Are we actually improving?
FULL PRODUCTION INTEGRATION TESTING: Execute the full checkout flow and simulate real users.

Slide 11

Slide 11 text

IDENTITYCACHE
class Product < ActiveRecord::Base
  include IdentityCache
  cache_index :handle, :unique => true
  cache_index :vendor, :product_type
end
product = Product.fetch_by_handle(handle)
products = Product.fetch_by_vendor_and_product_type(vendor, product_type)
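The fetch_by_* calls above are read-through cache lookups: return the cached record if present, otherwise load it and cache it. A minimal in-memory sketch of that idea follows. It is not IdentityCache's implementation; ReadThroughCache, PRODUCTS, CACHE, and fetch_product_by_handle are hypothetical names, and a Hash stands in for memcached and the database.

```ruby
# Hypothetical stand-in for a memcached-backed read-through cache.
class ReadThroughCache
  attr_reader :misses

  def initialize
    @store = {}  # stands in for memcached
    @misses = 0
  end

  # Return the cached value, or run the block (the "database query"),
  # cache its result, and return it.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @misses += 1
    @store[key] = yield
  end

  # Call when the underlying record changes so the next fetch reloads it.
  def expire(key)
    @store.delete(key)
  end
end

# Hypothetical "database" of products keyed by handle.
PRODUCTS = { "beautiful-shoe" => { title: "Beautiful Shoe", vendor: "acme" } }.freeze
CACHE = ReadThroughCache.new

def fetch_product_by_handle(handle)
  CACHE.fetch("product:handle:#{handle}") { PRODUCTS[handle] }
end

fetch_product_by_handle("beautiful-shoe") # miss: loads from PRODUCTS
fetch_product_by_handle("beautiful-shoe") # hit: served from the cache
```

The hard part in practice is calling expire on every write path; the sketch only shows the read side.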

Slide 12

Slide 12 text

2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

Slide 13

Slide 13 text

OPTIMIZATION FALLACY
[Spectrum: FAST, INFLEXIBLE to SLOW, FLEXIBLE]

Slide 14

Slide 14 text

SHARDING 101
Sharding.with_shard(shop.shard_id) do
  # find_by! takes attribute conditions; find only accepts primary keys
  Product.find_by!(shop_id: shop.id, id: product_id)
end

Slide 15

Slide 15 text

class ProductController < ApplicationController
  around_filter :with_shop

  def show
    @product = @shop.products.find(params[:id])
  end

  private

  def with_shop(&block)
    @shop = Shop.find_by_host(request.host)
    Sharding.with_shard(@shop.shard_id, &block)
  end
end
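The slides do not show how Sharding.with_shard works. One plausible shape, sketched here under that assumption, is a "current shard" that is set for the duration of a block and always restored afterwards. The module below is a hypothetical stand-in, not Shopify's code; a real version would also switch the database connection.

```ruby
# Hypothetical sketch of a block-scoped shard selector.
module Sharding
  KEY = :sharding_current_shard

  # Shard selected for the current thread (strictly, the current fiber).
  def self.current_shard
    Thread.current[KEY]
  end

  # Select shard_id for the duration of the block, then restore the
  # previous selection even if the block raises.
  def self.with_shard(shard_id)
    previous = current_shard
    Thread.current[KEY] = shard_id
    yield
  ensure
    Thread.current[KEY] = previous
  end
end

message = Sharding.with_shard(3) { "querying shard #{Sharding.current_shard}" }
puts message
```

Restoring in ensure is what makes nesting and error paths safe: code after the block never runs against a stale shard.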

Slide 16

Slide 16 text

SHARDING
DON’T SHARD (WHERE ARE YOU ON THE OPTIMIZATION SPECTRUM?): Sharding is hard; it took us a year!
ARCHITECTURE DRAWBACKS: Common cases are easy, but edge cases can now violate fundamentals. For example, cross-database transactions are now impossible.
APPLICATION-LEVEL SHARDING: Why did we choose it over a proxy or changing datastores?

Slide 17

Slide 17 text

2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

Slide 18

Slide 18 text

SURFACE AREA

Slide 19

Slide 19 text

[Chart: Availability (70 to 100) versus Components (10, 50, 100, 500, 1000), with series labelled 99.95, 99.98, 99.99, and 99.999]
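The arithmetic behind a chart like this can be sketched as follows, assuming each labelled series is a per-component availability, every request needs all components, and failures are independent. Under those assumptions overall availability is the per-component availability raised to the number of components. The function name is hypothetical.

```ruby
# Overall availability (percent) when a request depends on `components`
# independent components, each available `per_component` percent of the time.
def overall_availability(per_component, components)
  ((per_component / 100.0)**components) * 100.0
end

# Print one row per series and one column per component count.
[99.95, 99.98, 99.99, 99.999].each do |availability|
  row = [10, 50, 100, 500, 1000].map do |n|
    format("%.2f", overall_availability(availability, n))
  end
  puts "#{availability}%: #{row.join('  ')}"
end
```

Even very reliable components compound into a noticeably less reliable whole as the surface area grows.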

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

A single component failure should not be able to compromise the performance or availability of the entire system.

Slide 22

Slide 22 text

Resiliency Matrix
                      Checkout               Admin         Storefront
MySQL Shard           Unavailable            Unavailable   Degraded
MySQL Master          Available (if cached)  Unavailable   Available
Kafka                 Available              Degraded      Available
External HTTP API     Degraded               Available     Unavailable
redis-sessions        Unavailable            Unavailable   Degraded
logging (disk full)   Unavailable            Unavailable   Unavailable
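One way to keep a matrix like this checkable is to store it as data and enumerate its cells, so each cell can be paired with a test. The sketch below encodes the values from the matrix above; the constant name, symbol names, and matrix_cells helper are hypothetical.

```ruby
# Expected behaviour of each section (column) when a component (row) fails.
RESILIENCY_MATRIX = {
  "MySQL Shard"         => { checkout: :unavailable, admin: :unavailable, storefront: :degraded },
  "MySQL Master"        => { checkout: :available_if_cached, admin: :unavailable, storefront: :available },
  "Kafka"               => { checkout: :available, admin: :degraded, storefront: :available },
  "External HTTP API"   => { checkout: :degraded, admin: :available, storefront: :unavailable },
  "redis-sessions"      => { checkout: :unavailable, admin: :unavailable, storefront: :degraded },
  "logging (disk full)" => { checkout: :unavailable, admin: :unavailable, storefront: :unavailable }
}.freeze

# Flatten the matrix into [component, section, expected] triples, one per cell.
def matrix_cells(matrix)
  matrix.flat_map do |component, sections|
    sections.map { |section, expected| [component, section, expected] }
  end
end

matrix_cells(RESILIENCY_MATRIX).each do |component, section, expected|
  puts "#{section} with #{component} down: #{expected}"
end
```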

Slide 23

Slide 23 text

Simulate network problems with Toxiproxy
https://github.com/shopify/toxiproxy
Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
  Shop.first # this takes at least 1s
end
Toxiproxy[/redis/].down do
  session[:user_id] # this will throw an exception
end

Slide 24

Slide 24 text

Write a Toxiproxy test for each cell
# test/integration/resiliency_matrix_test.rb
def test_section_a_mq_a_down
  Toxiproxy[:message_queue_a].down do
    get '/section_a'
    assert_response :success
  end
end

def test_section_b_datastore_b
  Toxiproxy[:datastore_b].down do
    get '/section_b'
    assert_response 500
  end
end
# ... and every other cell

Slide 25

Slide 25 text

SHOPIFY/SEMIAN

Slide 26

Slide 26 text

Resiliency Maturity Pyramid
No resiliency effort
Testing with mocks
Toxiproxy tests and matrix
Resiliency Patterns
Production Practise Days (Games)
Kill nodes
Latency
Application-Specific Fallbacks
Kill DC

Slide 27

Slide 27 text

2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

Slide 28

Slide 28 text

bin/failover

Slide 29

Slide 29 text

ACTIVEFAILOVER: 10-60S FAILOVERS
1. FAILOVER TRAFFIC: Set flag on load balancers to redirect traffic to new datacenter
2. READ-ONLY SHOPIFY: Traffic going to new datacenter, but is read-only (no checkouts, changes)
3. FAILOVER DATABASE: Move the writer for all shards to the new primary datacenter
4. TRANSFER JOBS: Queued and delayed jobs are transferred to the new primary DC
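The ordering of the four steps matters: traffic moves first, the platform serves read-only while writers are still in the old datacenter, and only then do writers and jobs follow. A minimal sketch of a script that runs the steps strictly in that order is below. The Failover class and its method names are hypothetical; each step only records what a real script would do.

```ruby
# Hypothetical sketch: run the four failover steps in a fixed order.
class Failover
  STEPS = %i[failover_traffic serve_read_only failover_database transfer_jobs].freeze

  attr_reader :log

  def initialize(target_dc)
    @target_dc = target_dc
    @log = []
  end

  # Run every step in order and return the log of what happened.
  def run
    STEPS.each { |step| send(step) }
    log
  end

  private

  def failover_traffic
    log << "1. load balancers flagged to send traffic to #{@target_dc}"
  end

  def serve_read_only
    log << "2. #{@target_dc} serves traffic read-only"
  end

  def failover_database
    log << "3. writers for all shards moved to #{@target_dc}"
  end

  def transfer_jobs
    log << "4. queued and delayed jobs transferred to #{@target_dc}"
  end
end

puts Failover.new("dc2").run
```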

Slide 30

Slide 30 text

2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

Slide 31

Slide 31 text

[Diagram: previous shared architecture. Load balancers lb1, lb2, lb3; workers; redis; memcached; database shards shard0 to shard4]

Slide 32

Slide 32 text

[Diagram: two pods, each containing workers, memcache, redis, and one database shard (shard1, shard2)]

Slide 33

Slide 33 text

[Diagram: datacenter 1 and datacenter 2, each showing pods 1 to 6]

Slide 34

Slide 34 text

[Diagram: a "sorting hat" receives the request below and sits in front of two groups of pods 1 to 4]
GET /products/beautiful-shoe HTTP/1.1
Host: myshop.com

Slide 35

Slide 35 text

rule 1: any request must be annotated with a pod or shop rule 2: any request can only touch one pod
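The two rules can be sketched as a routing lookup plus a guard. This assumes the sorting hat maps a request's host to a pod, which the slides suggest but do not spell out. SHOP_PODS, sorting_hat, PodGuard, and both error classes are hypothetical names.

```ruby
# Hypothetical routing table: shop host -> pod id.
SHOP_PODS = { "myshop.com" => 2, "othershop.com" => 4 }.freeze

class UnroutableRequest < StandardError; end
class CrossPodAccess < StandardError; end

# Rule 1: every request must resolve to a pod (via its shop) or be rejected.
def sorting_hat(host)
  SHOP_PODS.fetch(host) { raise UnroutableRequest, "no pod for #{host}" }
end

# Rule 2: once a request is assigned a pod, touching any other pod is an error.
class PodGuard
  def initialize(pod_id)
    @pod_id = pod_id
  end

  def touch!(pod_id)
    unless pod_id == @pod_id
      raise CrossPodAccess, "request for pod #{@pod_id} touched pod #{pod_id}"
    end

    pod_id
  end
end

guard = PodGuard.new(sorting_hat("myshop.com"))
guard.touch!(2) # fine: same pod the request was routed to
```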

Slide 36

Slide 36 text

count = 0
with_each_shard do
  count += Shop.count
end
render plain: "shops: #{count}"

Slide 37

Slide 37 text

shitlist driven development

Slide 38

Slide 38 text

if Shitlist.include?(klass)
  super
else
  error = <<-EOE
    New usage of this API is deprecated. Please come talk
    to the Pods team in #pods and we'll help you out!
  EOE
  raise Shitlist::Error, error
end
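The snippet above only shows the guard's body. A self-contained sketch of how such a guard could wrap a deprecated API follows: existing callers on the list keep working, new callers get an error. Everything here is an assumption for illustration, including the use of Module#prepend and the ShardedJob, LegacyReport, and NewFeature classes.

```ruby
# Hypothetical allow-list of classes that may still use the deprecated API.
module Shitlist
  class Error < StandardError; end

  @allowed = []

  def self.allow(klass)
    @allowed << klass
  end

  def self.include?(klass)
    @allowed.include?(klass)
  end
end

# Prepended in front of the deprecated method so it runs first.
module ShitlistGuard
  def with_each_shard(&block)
    if Shitlist.include?(self.class)
      super # existing caller: fall through to the real implementation
    else
      raise Shitlist::Error, "New usage of this API is deprecated."
    end
  end
end

class ShardedJob
  prepend ShitlistGuard

  # The deprecated API: touches every shard (three fake shard ids here).
  def with_each_shard
    [0, 1, 2].map { |shard_id| yield shard_id }
  end
end

class LegacyReport < ShardedJob; end
class NewFeature < ShardedJob; end

Shitlist.allow(LegacyReport)
LegacyReport.new.with_each_shard { |shard_id| shard_id * 10 } # still works
```

The list can then only shrink: each fixed caller is removed from it, and nothing new can be added without a conversation.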

Slide 39

Slide 39 text

2012 OPTIMIZATION 2013 SHARDING 2014 RESILIENCY 2015 MULTI-DC 2016 ACTIVE:ACTIVE

Slide 40

Slide 40 text

THANK YOU SIMON ESKILDSEN @SIRUPSEN FEEL FREE TO TWEET QUESTIONS AT ME!