Slide 1

Slide 1 text

Ariel Caplan • @amcaplan • THE TRAIL TO SCALE WITHOUT FAIL: RAILS? RAILSCONF 2021

Slide 2

Slide 2 text

WHAT IS THE BIGGEST RAILS APP IN THE WORLD?

Slide 3

Slide 3 text

The Recipe for the World's Largest Rails Monolith, Akira Matsuda, Ruby on Ales 2015

Slide 6

Slide 6 text

5 Years of Scaling Rails to 80K RPS, Simon Hørup Eskildsen, RailsConf 2016

Slide 7

Slide 7 text

GitHub's journey towards microservices and more: 'We actually have our own version of Ruby that we maintain', The Register, 1 Dec 2020

Slide 8

Slide 8 text

NORMALIZED TO REQ/DAY
- 1.3 billion (2015)
- 2.6 billion (2016, likely 2x by now)
- 1 billion (2020, just API)
- 15 billion (2021) *Media requests

Slide 9

Slide 9 text

A TYPICAL RAILS APP

Slide 10

Slide 10 text

CLOUDINARY

Slide 12

Slide 12 text

WHAT DOES CLOUDINARY ACTUALLY DO?

Slide 13

Slide 13 text

1 IMAGE’S JOURNEY
- Upload to Cloudinary
- Stored in S3, GCS
- Apply transformations
- Deliver via CDN

Slide 23

Slide 23 text

https://res.cloudinary.com/caplan/image/upload/e_improve:indoor/w_1014,h_591,c_thumb,g_face,q_auto/e_pixelate_faces:15/l_cloudinary_logo_blue_0720_2x,g_north_east,x_20,y_20,w_260,o_50/ariel_thinking.jpg
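
The delivery URL on this slide chains several transformation steps: steps are separated by "/", and the parameters within a step by ",". As a rough illustration of that structure, here is a minimal sketch in plain Ruby. `parse_delivery_url` is a hypothetical helper written for this example, not part of any Cloudinary SDK, and the names it gives to the leading path segments are assumptions.

```ruby
require "uri"

# Hypothetical helper, not part of any Cloudinary SDK: splits a delivery URL
# into its chained transformation steps and the trailing asset name.
def parse_delivery_url(url)
  segments = URI.parse(url).path.split("/").reject(&:empty?)
  # Assumed layout: /<cloud>/<resource type>/<delivery type>/<steps...>/<asset>
  cloud_name, resource_type, delivery_type, *rest = segments
  public_id = rest.pop
  steps = rest.map { |step| step.split(",") }
  { cloud_name: cloud_name, resource_type: resource_type,
    delivery_type: delivery_type, steps: steps, public_id: public_id }
end

url = "https://res.cloudinary.com/caplan/image/upload/e_improve:indoor/" \
      "w_1014,h_591,c_thumb,g_face,q_auto/e_pixelate_faces:15/" \
      "l_cloudinary_logo_blue_0720_2x,g_north_east,x_20,y_20,w_260,o_50/ariel_thinking.jpg"

parsed = parse_delivery_url(url)
parsed[:steps].each_with_index do |params, index|
  puts "step #{index + 1}: #{params.join(' ')}"
end
```

Each step is applied in order, so this URL describes four chained steps on `ariel_thinking.jpg`.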

Slide 26

Slide 26 text

1 IMAGE’S JOURNEY
- Upload to Cloudinary
- Stored in S3, GCS
- Apply transformations
- Deliver via CDN

Slide 27

Slide 27 text

SOME OTHER STUFF…
- Media Library (GUI for media cataloguing and editing)
- Developer APIs
- Bulk operations
- All this ⬆ is 0.3% of total requests handled
- 99.7% is delivery: transforming/serving media
- P.S. We also do video!

Slide 28

Slide 28 text

ABOUT YOUR PRESENTER
- Hi, I’m Ariel Caplan!
- Working in Rails since 2015
- Cloudinary since 2018
- @amcaplan
- I ♥ conferences!

Slide 29

Slide 29 text

HOW CLOUDINARY SCALES

Slide 30

Slide 30 text

HOW CLOUDINARY SCALES
- Layers
- Sharding
- Location
- Deduplicating Work
- Not Scaling
- Human Factors

Slide 31

Slide 31 text

LAYERS

Slide 32

Slide 32 text

“LAYERS. ONIONS HAVE LAYERS. OGRES HAVE LAYERS. ONIONS HAVE LAYERS. YOU GET IT? WE BOTH HAVE LAYERS.”  SHREK

Slide 33

Slide 33 text

REQUEST LIFECYCLE CDN S3get IO CPU

Slide 34

Slide 34 text

REQUEST LIFECYCLE
- CDN: 15B/day
- S3get: 1B/day
- IO: 150M/day
- CPU: 125M/day
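
To make the funnel concrete, the share of traffic each layer answers by itself can be worked out from the four figures above. This is a small sketch, assuming the figures pair with the layers in the order listed and that every request a layer does not answer is passed to the next layer down. `absorbed_share` is a name introduced here for illustration.

```ruby
# Requests per day at each layer, as listed on the slide.
# Pairing each figure with a layer by position is an assumption.
REQUESTS_PER_DAY = {
  "CDN"   => 15_000_000_000,
  "S3get" => 1_000_000_000,
  "IO"    => 150_000_000,
  "CPU"   => 125_000_000
}.freeze

# Share of a layer's incoming requests that it answers itself,
# i.e. that never reach the next layer down.
def absorbed_share(incoming, passed_on)
  (1 - passed_on.to_f / incoming).round(3)
end

REQUESTS_PER_DAY.to_a.each_cons(2) do |(name, incoming), (_next_name, passed_on)|
  puts format("%-5s absorbs %.1f%%", name, absorbed_share(incoming, passed_on) * 100)
end
```

Under these assumptions the CDN absorbs about 93% of requests and S3get about 85% of what reaches it, which is in line with the "~95%" and "85%" figures quoted on the CDN and S3GET slides.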

Slide 36

Slide 36 text

CDN
- We clearly need one (the question is build vs. buy)
- Pro: ~95% of traffic isn’t our problem!
- Pro: Leverage best-in-class service/features
- Pro: Multi-CDN for reliability
- Con: Need to play by their rules
  - We write lots of custom rules (f_auto)
  - Every provider has their own invalidation system
  - Need to parse their logfiles to bill our customers

Slide 37

Slide 37 text

S3GET
- High-throughput, simple service written in Go
- Handles 85% of requests it receives
- Takes up ~10% of the computing resources vs IO
- Pro: It’s super fast
- Con: Need to duplicate some Ruby logic in Go

Slide 38

Slide 38 text

OTHER LAYER ADVANTAGES
- Layers can scale independently
- Horizontal slices also exist! (image vs. video)
- Security (CPU can’t access internet or DB)

Slide 39

Slide 39 text

SHARDING

Slide 40

Slide 40 text

“SHARDS OF GLASS CAN CUT AND WOUND OR MAGNIFY A VISION”  TERRY TEMPEST WILLIAMS

Slide 41

Slide 41 text

WHAT IS SHARDING?
- Vertical partitioning ≠ sharding
- Each database contains different tables
- Increases simultaneous writes/reads
- Supported in Rails 6.0

Slide 42

Slide 42 text

WHAT IS SHARDING?
- Horizontal partitioning = sharding
- 1 table, multiple databases
- Keeps table size under control
- Supported in Rails 6.1

Slide 43

Slide 43 text

SHARDING @ CLOUDINARY
- 1 main shard for app-wide data
- Every cloud “lives” 100% on 1 of several shards
(Diagram labels: 1 2 3 4)

Slide 44

Slide 44 text

SHARDING @ CLOUDINARY
- 1 main shard for app-wide data
- Every cloud “lives” 100% on 1 of several shards
- Code includes thousands of shard references
- Test environments must be sharded too!

    cloud = find_cloud_from_request_params
    cloud.on_shard do
      # do the work
    end

Slide 45

Slide 45 text

SHARDING @ CLOUDINARY
- Pro: Fast is good
- Pro: Flexibility
- Con: Error-prone

    cloud = find_cloud_from_request_params
    assets = []
    cloud.on_shard do
      assets = cloud.assets.where(tag: "duck")
    end
    render json: assets

Slide 46

Slide 46 text

SHARDING @ CLOUDINARY
- Pro: Fast is good
- Pro: Flexibility
- Con: Error-prone

    cloud = find_cloud_from_request_params
    assets = []
    cloud.on_shard do
      assets = cloud.assets.where(tag: "duck")  # ActiveRecord::Relation
    end
    render json: assets  # Queried outside shard block!

Slide 47

Slide 47 text

SHARDING @ CLOUDINARY
- Pro: Fast is good
- Pro: Flexibility
- Con: Error-prone

    cloud = find_cloud_from_request_params
    assets = []
    cloud.on_shard do
      assets = cloud.assets.where(tag: "duck").to_a  # Load eagerly
    end
    render json: assets

Slide 48

Slide 48 text

SHARDING @ CLOUDINARY
- Pro: Fast is good
- Pro: Flexibility
- Con: Error-prone

    people = ActiveRecord::Base.connected_to(role: :reading, shard: :shard_one) do
      Person.all
    end
    people #=> preloaded ActiveRecord::Relation from shard

Slide 49

Slide 49 text

SHARDING @ CLOUDINARY
- Pro: Fast is good
- Pro: Flexibility
- Con: Error-prone

    people = nil
    ActiveRecord::Base.connected_to(role: :reading, shard: :shard_one) do
      people = Person.all
      "some other code"
    end
    people #=> ActiveRecord::Relation is loaded from default shard
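
The pitfall on the last few slides is that a lazily evaluated relation runs its query wherever it is first enumerated, which may be outside the shard block. The sketch below reproduces that behaviour with toy classes in plain Ruby so it can run without Rails. `Shards` and `LazyRelation` are stand-ins invented for this example; they are not ActiveRecord or Cloudinary code.

```ruby
# Toy stand-ins for illustration only; none of this is ActiveRecord or
# Cloudinary code. The "query" runs on whatever shard is current at load time.
module Shards
  def self.current
    Thread.current[:shard] || :default
  end

  # Switches the current shard for the duration of the block only.
  def self.on_shard(name)
    previous = Thread.current[:shard]
    Thread.current[:shard] = name
    yield
  ensure
    Thread.current[:shard] = previous
  end
end

class LazyRelation
  def initialize(tag)
    @tag = tag
  end

  # Like a lazy relation, nothing is queried until the records are needed.
  def to_a
    @records ||= ["#{@tag} asset loaded from #{Shards.current}"]
  end
end

lazy = nil
Shards.on_shard(:shard_two) { lazy = LazyRelation.new("duck") }
puts lazy.to_a.first   # the query runs after the block has ended

eager = nil
Shards.on_shard(:shard_two) { eager = LazyRelation.new("duck").to_a }
puts eager.first       # the query ran inside the block
```

The first call reports the default shard and the second reports `shard_two`, which is the difference the `.to_a` on the "Load eagerly" slide makes.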

Slide 50

Slide 50 text

LOCATION

Slide 51

Slide 51 text

“LOCATION, LOCATION, LOCATION”  EVERY REAL ESTATE AGENT EVER

Slide 52

Slide 52 text

REGIONS @ CLOUDINARY
- 3 regions: US (default), EU, AP
- Premium customers choose closest to their users
- Dedicated shards per-region
- What about the primary DB?
  - Option A: Run 3 completely independent systems
  - Option B: EU and AP will be a little slower
  - Option C: Multi-Primary DB
  - Option D: NrtCache (Near RealTime)

Slide 53

Slide 53 text

NRTCACHE US

Slide 54

Slide 54 text

NRTCACHE US EU

Slide 55

Slide 55 text

NRTCACHE US EU Syncer

Slide 56

Slide 56 text

“WHAT COULD GO WRONG?”  FAMOUS LAST WORDS

Slide 58

Slide 58 text

THE BIG FAILURE
- Following a routine deploy, error rate spiked
- 15 min “high” error rate
- The cause: A problematic migration
- Nothing we could do

    Cloud.find_each do |cloud|
      # update some attributes
      cloud.save!
    end

Slide 59

Slide 59 text

THE BIG FAILURE
- Following a routine deploy, error rate spiked
- 15 min “high” error rate
- The cause: A problematic migration
- Nothing we could do
- The solutions:
  - Review EVERY code change for cloud updates
  - Migrate to new && improved NrtCache

Slide 60

Slide 60 text

LOCATION
- Pro: Multi-region is mostly great for customers/users
- Con: Hard to do right

Slide 61

Slide 61 text

DEDUPLICATING WORK

Slide 62

Slide 62 text

“THERE ARE ALL KINDS OF LOVE IN THIS WORLD, BUT NEVER THE SAME LOVE TWICE.”  F. SCOTT FITZGERALD

Slide 63

Slide 63 text

THE PROBLEM
(Diagram; surviving label fragments: "ubyShoes", "r-ubyShoes")

Slide 66

Slide 66 text

LOCKING
- Goals:
  1. Don’t repeat transformations
  2. Never block a job (2x > 0x)
- Implementation: Best-effort locking system
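
The two goals pull in different directions: a lock avoids repeating a transformation, but a broken lock must never stop the work from happening, because doing it twice beats not doing it at all. A minimal sketch of that "best-effort" idea follows. The slides do not show the real implementation, so `FlakyLockStore` and `with_best_effort_lock` are hypothetical names, and returning `:skipped` for a held lock is a simplification chosen for the example.

```ruby
# Hypothetical in-process lock store standing in for a remote lock service.
class FlakyLockStore
  def initialize(available: true)
    @available = available
    @held = {}
  end

  # Returns true if the lock was taken, false if someone else holds it.
  def acquire(key)
    raise IOError, "lock service unreachable" unless @available
    return false if @held[key]
    @held[key] = true
  end

  def release(key)
    @held.delete(key)
  end
end

# Best effort: skip the job only when the lock is known to be held elsewhere.
# If the lock service itself fails, run the job anyway (2x > 0x).
def with_best_effort_lock(store, key)
  locked = begin
    store.acquire(key)
  rescue IOError
    :unknown
  end
  return :skipped if locked == false

  begin
    yield
  ensure
    store.release(key) if locked == true
  end
end

healthy = FlakyLockStore.new
healthy.acquire("shoe.jpg/w_100")
p with_best_effort_lock(healthy, "shoe.jpg/w_100") { :transformed }  # lock held elsewhere

down = FlakyLockStore.new(available: false)
p with_best_effort_lock(down, "shoe.jpg/w_100") { :transformed }     # service down, job still runs
```

The design choice is that an unreachable lock service is treated the same as a granted lock, which trades occasional duplicate work for never blocking a job.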

Slide 67

Slide 67 text

LOPTR

Slide 68

Slide 68 text

LOPTR CDN S3get IO CPU

Slide 69

Slide 69 text

LOPTR CDN S3get IO CPU Loptr

Slide 70

Slide 70 text

LOPTR MAY I…
- Read lock on asset before working on it
  - No writing while the lock is held
- Write lock on derivation before generating
  - This process can write
  - Exclusive
- Depends on a well-behaved client
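
The read/write rules above can be sketched as a small in-memory lock table. This is an illustration in Ruby only: Loptr itself is a separate service whose protocol is not shown in the slides, so `LockTable`, its method names, and the key format are all assumptions.

```ruby
# Minimal in-memory read/write lock table, loosely modeled on the bullet
# points above. Hypothetical: not Loptr's actual code or protocol.
class LockTable
  def initialize
    @mutex = Mutex.new
    @readers = Hash.new(0)  # key => number of read locks held
    @writers = {}           # key => owner holding the write lock
  end

  # Shared lock: many readers at once, but not while a writer holds the key.
  def read_lock(key)
    @mutex.synchronize do
      next false if @writers.key?(key)
      @readers[key] += 1
      true
    end
  end

  # Exclusive lock: granted only if nobody is reading or writing the key.
  def write_lock(key, owner)
    @mutex.synchronize do
      next false if @writers.key?(key) || @readers[key] > 0
      @writers[key] = owner
      true
    end
  end

  def read_unlock(key)
    @mutex.synchronize { @readers[key] -= 1 if @readers[key] > 0 }
  end

  def write_unlock(key, owner)
    @mutex.synchronize { @writers.delete(key) if @writers[key] == owner }
  end
end

table = LockTable.new
p table.read_lock("asset:shoe.jpg")                      # readers share the asset
p table.write_lock("asset:shoe.jpg", "worker-1")         # refused while it is being read
p table.write_lock("derived:shoe.jpg/w_100", "worker-1") # first writer gets the derivation
p table.write_lock("derived:shoe.jpg/w_100", "worker-2") # second writer is refused
```

Nothing here forces a client to ask before writing or to unlock afterwards, which is why the scheme depends on well-behaved clients.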

Slide 71

Slide 71 text

LOPTR IMPLEMENTATION
- In-memory lock table for speed
- Written in Scala for high concurrency

Slide 72

Slide 72 text

LOPTR CONCERNS
- Failure to release
  - Timeout
- Downtime
  - Pretend every lock request succeeded
- Scaling Loptr
  - Cluster with request targeted by consistent hash
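
For the last point, routing by consistent hash means every request for the same key lands on the same node, so that node's in-memory table is the only one that needs to know about the lock. The following is a generic consistent-hash ring in Ruby, offered as a sketch of the technique rather than Loptr's implementation. The node names, the replica count, and the use of MD5 are all assumptions.

```ruby
require "digest"

# Hypothetical consistent-hash ring; node names and replica count are made up.
class HashRing
  def initialize(nodes, replicas: 100)
    @ring = {}
    nodes.each do |node|
      # Several points per node spread its share of the ring more evenly.
      replicas.times { |i| @ring[hash_of("#{node}:#{i}")] = node }
    end
    @points = @ring.keys.sort
  end

  # Walk clockwise to the first point at or after the key's hash,
  # wrapping around to the start of the ring.
  def node_for(key)
    h = hash_of(key)
    point = @points.bsearch { |candidate| candidate >= h } || @points.first
    @ring[point]
  end

  private

  def hash_of(value)
    Digest::MD5.hexdigest(value)[0, 8].to_i(16)
  end
end

ring = HashRing.new(%w[loptr-a loptr-b loptr-c])
puts ring.node_for("derived:shoe.jpg/w_100")
puts ring.node_for("derived:shoe.jpg/w_100")  # same key, same node
```

When a node is removed, only the keys that mapped to it move elsewhere; keys owned by the remaining nodes stay where they were.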

Slide 73

Slide 73 text

LOPTR
- Pro: Resiliency to traffic surges
- Con: Unreleased locks can cause timeouts
- Note: Not 100% reliable (but still net positive)

Slide 74

Slide 74 text

NOT SCALING

Slide 75

Slide 75 text

“WHAT YOU DON’T DO DETERMINES WHAT YOU CAN DO.”  TIM FERRISS

Slide 76

Slide 76 text

HOW NOT TO SCALE
- Limit individual customer impact
- Would you rather deal with:
  - 1 dissatisfied customer
  - Thousands of dissatisfied customers
- Rate limits
- Manage scarcity while eliminating scarcity
- Fair queueing
- Background jobs

Slide 77

Slide 77 text

RATE LIMITS
- Heavy API calls have a strict limit
- Locking effectively throttles non-rate-limited calls
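
The slides do not say how the strict limit is enforced, so the following is only a generic sketch of one common approach, a fixed-window counter per cloud. `RateLimiter`, the limit, and the window length are assumptions made for the example; the clock is injectable so the behaviour can be shown without waiting.

```ruby
# Hypothetical fixed-window rate limiter; the actual limits and algorithm
# used in production are not described in the slides.
class RateLimiter
  def initialize(limit:, window_seconds:, clock: -> { Time.now.to_f })
    @limit = limit
    @window = window_seconds
    @clock = clock
    @counts = Hash.new(0)  # [cloud, window index] => calls seen so far
  end

  # Returns true if the call is allowed, false once the cloud has used
  # up its allowance for the current window.
  def allow?(cloud)
    bucket = [cloud, (@clock.call / @window).floor]
    return false if @counts[bucket] >= @limit
    @counts[bucket] += 1
    true
  end
end

now = 0.0
limiter = RateLimiter.new(limit: 2, window_seconds: 60, clock: -> { now })
p limiter.allow?("cloud-a")  # first call in the window
p limiter.allow?("cloud-a")  # second call in the window
p limiter.allow?("cloud-a")  # over the limit
p limiter.allow?("cloud-b")  # other clouds are unaffected
now = 61.0
p limiter.allow?("cloud-a")  # a new window has started
```

A real limiter would also expire old window counters; this sketch keeps them forever for brevity.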

Slide 78

Slide 78 text

FAIR QUEUE ON CPU
- Jobs are assigned a number of slots
- “Queue of queues” mechanism allots slots to clouds

Slide 79

Slide 79 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 80

Slide 80 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 81

Slide 81 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 82

Slide 82 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 83

Slide 83 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 84

Slide 84 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 85

Slide 85 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 86

Slide 86 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 87

Slide 87 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 88

Slide 88 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 89

Slide 89 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 90

Slide 90 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 91

Slide 91 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 92

Slide 92 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 93

Slide 93 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 94

Slide 94 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 95

Slide 95 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 96

Slide 96 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 97

Slide 97 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 98

Slide 98 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 99

Slide 99 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 100

Slide 100 text

-Jobs are assigned a number of slots -“Queue of queues” mechanism allots slots to clouds FAIR QUEUE ON CPU

Slide 101

Slide 101 text

FAIR QUEUE ON CPU
- Jobs are assigned a number of slots
- “Queue of queues” mechanism allots slots to clouds
(Diagram labels: 1 3 2 4 2 3 1 2 3 1)

Slide 126

Slide 126 text

FAIR QUEUE ON CPU
- Jobs are assigned a number of slots
- “Queue of queues” mechanism allots slots to clouds
(Diagram labels: 1 3 2 4 2 3 1 2 3 1 4)

Slide 134

Slide 134 text

FAIR QUEUE ON CPU CDN S3get IO CPU Loptr

Slide 135

Slide 135 text

FAIR QUEUE ON CPU CDN S3get IO CPU Loptr Fair Queue

Slide 136

Slide 136 text

FAIR QUEUE ON CPU LAYER
- Jobs are assigned a number of slots
- “Queue of queues” mechanism allots slots to clouds
- Prefer sync to async requests
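
One way to picture a "queue of queues" is a separate FIFO per cloud plus a rotation that decides whose turn is next, so a cloud with a large backlog cannot starve the others. The sketch below shows that idea in plain Ruby. `FairQueue` is a hypothetical name, and the example leaves out slot counts and the preference for sync over async requests, since the slides do not detail them.

```ruby
# Hypothetical "queue of queues": one FIFO per cloud, plus a rotation of cloud
# names so that each waiting cloud gets a turn before any gets a second one.
class FairQueue
  def initialize
    @queues = Hash.new { |hash, cloud| hash[cloud] = [] }
    @rotation = []  # clouds that currently have waiting jobs, in turn order
  end

  def push(cloud, job)
    @rotation << cloud if @queues[cloud].empty?
    @queues[cloud] << job
  end

  # Takes one job from the cloud at the front of the rotation, then sends
  # that cloud to the back if it still has work waiting.
  def pop
    cloud = @rotation.shift
    return nil if cloud.nil?
    job = @queues[cloud].shift
    @rotation << cloud unless @queues[cloud].empty?
    job
  end
end

queue = FairQueue.new
5.times { |i| queue.push("big-cloud", "big-#{i}") }
queue.push("small-cloud", "small-0")

order = []
while (job = queue.pop)
  order << job
end
puts order.join(", ")
```

Even though `small-cloud` enqueued last, its job is served second, right after the first job from `big-cloud`.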

Slide 137

Slide 137 text

FAIR QUEUE FOR DB ACCESS
- “Queue of queues” throttles per-cloud concurrency
- One big monkeypatch
  - ActiveRecord::ConnectionAdapters::ConnectionPool

Slide 138

Slide 138 text

BACKGROUND JOBS
- Anything that can be done out-of-band, should be
- Examples:
  - CDN invalidations
  - Webhooks
  - Eager transformations

Slide 139

Slide 139 text

HUMAN FACTORS

Slide 140

Slide 140 text

“THE GOOD NEWS ABOUT COMPUTERS IS THAT THEY DO WHAT YOU TELL THEM TO DO. THE BAD NEWS IS THAT THEY DO WHAT YOU TELL THEM TO DO.”  TED NELSON

Slide 141

Slide 141 text

SCALING VIA HUMANS
- Education
  - Encourage practices like eager generation
- Relationships
  - Understand customer use cases
  - They inform us about changes in use patterns
- Look for win-wins!

Slide 142

Slide 142 text

CLOUDINARY ON RAILS

Slide 143

Slide 143 text

IT WAS NEVER ABOUT RAILS
- Most of our traffic never touches Rails
- The fastest parts of the system aren’t in Ruby
- The computation-heavy parts are low-level utils or APIs
- The database scaling is language-independent

Slide 144

Slide 144 text

RAILS IS GREAT!
- 2 developers had 4 incredibly productive years
- Ruby is actually great for creating interfaces
- Just needed to move a few things out of Ruby

Slide 145

Slide 145 text

RAILS IS CHALLENGING
- Upgrading Rails is hard
- Upgrading monkeypatched Rails is harder
- Specific to Israel:
  - Difficult to recruit Ruby devs
  - Difficult to recruit devs who want to learn Ruby

Slide 146

Slide 146 text

THE FUTURE OF RAILS @ CLOUDINARY
- Here to stay, but…
- Polyglot microservices
- Build the app our next employee wants to work on

Slide 147

Slide 147 text

THANKS!
Ariel Caplan • @amcaplan • amcaplan.ninja
Special thanks to advisors/reviewers:
- Itai Benari
- Max Rozenoer
- Vladimir Shteinman