The Trail to Scale Without Fail: Rails?

Let's be blunt: most web apps aren't very computation-heavy and won't hit scaling issues.

What if yours is the exception? Can Rails handle it?

Cue Exhibit A: Cloudinary, which serves billions of image and video requests daily, including on-the-fly edits, QUICKLY, running on Rails since Day 1. Case closed?

Not so fast. Beyond the app itself, we needed creative solutions to ensure that, as traffic rises and falls at the speed of the internet, we handle the load gracefully, and no customer overwhelms the system.

The real question isn't whether Rails is up to the challenge, but rather: Are you?

Ariel Caplan

April 12, 2021

Transcript

  1. Ariel Caplan • @amcaplan • THE TRAIL TO SCALE WITHOUT FAIL: RAILS? RAILSCONF 2021
  2. WHAT IS THE BIGGEST RAILS APP IN THE WORLD?

  3. The Recipe for the World's Largest Rails Monolith, Akira Matsuda, Ruby on Ales 2015
  6. 5 Years of Scaling Rails to 80K RPS, Simon Hørup Eskildsen, RailsConf 2016
  7. GitHub's journey towards microservices and more: 'We actually have our own version of Ruby that we maintain', The Register, 1 Dec 2020
  8. NORMALIZED TO REQ/DAY
    - 1.3 billion (2015)
    - 2.6 billion (2016, likely 2x by now)
    - 1 billion (2020, just API)
    - 15 billion (2021)
    *Media requests
  9. A TYPICAL RAILS APP

  10. CLOUDINARY

  12. WHAT DOES CLOUDINARY ACTUALLY DO?

  13. 1 IMAGE’S JOURNEY
    - Upload to Cloudinary
    - Stored in S3, GCS
    - Apply transformations
    - Deliver via CDN
  23. https://res.cloudinary.com/caplan/image/upload/e_improve:indoor/w_1014,h_591,c_thumb,g_face,q_auto/e_pixelate_faces:15/l_cloudinary_logo_blue_0720_2x,g_north_east,x_20,y_20,w_260,o_50/ariel_thinking.jpg

  27. SOME OTHER STUFF…
    - Media Library (GUI for media cataloguing and editing)
    - Developer APIs
    - Bulk operations
    - All this ⬆ is 0.3% of total requests handled
    - 99.7% is delivery: transforming/serving media
    - P.S. We also do video!
  28. ABOUT YOUR PRESENTER
    - Hi, I’m Ariel Caplan!
    - Working in Rails since 2015
    - Cloudinary since 2018
    - @amcaplan
    - I ♥ conferences!
  29. HOW CLOUDINARY SCALES

  30. HOW CLOUDINARY SCALES
    - Layers
    - Sharding
    - Location
    - Deduplicating Work
    - Not Scaling
    - Human Factors
  31. LAYERS

  32. “LAYERS. ONIONS HAVE LAYERS. OGRES HAVE LAYERS. ONIONS HAVE LAYERS. YOU GET IT? WE BOTH HAVE LAYERS.”  SHREK
  33. REQUEST LIFECYCLE: CDN → S3get → IO → CPU

  34. REQUEST LIFECYCLE: CDN (15B/day) → S3get (1B/day) → IO (150M/day) → CPU (125M/day)

  36. CDN
    - (We clearly need one; the question is build vs. buy)
    - Pro: ~95% of traffic isn’t our problem!
    - Pro: Leverage best-in-class service/features
    - Pro: Multi-CDN for reliability
    - Con: Need to play by their rules
    - We write lots of custom rules (f_auto)
    - Every provider has their own invalidation system
    - Need to parse their logfiles to bill our customers
  37. S3GET
    - High-throughput, simple service written in Go
    - Handles 85% of requests it receives
    - Takes up ~10% of the computing resources vs IO
    - Pro: It’s super fast
    - Con: Need to duplicate some Ruby logic in Go
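As a rough cross-check of the percentages quoted on the CDN and S3get slides, the per-layer daily counts from the request lifecycle slide can be turned into "share absorbed before the next layer". This is only arithmetic on the slide figures; the constant and method names are illustrative, and treating each layer as passing its remainder straight to the next one is an assumption.

```ruby
# Daily request counts per layer, as listed on the request lifecycle slide.
LAYERS = {
  "CDN"   => 15_000_000_000,
  "S3get" => 1_000_000_000,
  "IO"    => 150_000_000,
  "CPU"   => 125_000_000
}.freeze

# Percentage of a layer's requests that never reach the next layer down.
def absorbed_percent(received, passed_on)
  ((received - passed_on) * 100.0 / received).round(1)
end

# Pair each layer with the one below it and compute the absorbed share.
FUNNEL = LAYERS.each_cons(2).map { |(name, received), (_next_name, passed_on)|
  [name, absorbed_percent(received, passed_on)]
}.to_h

FUNNEL.each { |name, pct| puts "#{name} absorbs #{pct}% of what it receives" }
```

With these figures the CDN share comes out at about 93%, in the same range as the "~95%" on the slide, and the S3get share matches the quoted 85%.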
  38. OTHER LAYER ADVANTAGES
    - Layers can scale independently
    - Horizontal slices also exist! (image vs. video)
    - Security (CPU can’t access internet or DB)
  39. SHARDING

  40. “SHARDS OF GLASS CAN CUT AND WOUND OR MAGNIFY A VISION”  TERRY TEMPEST WILLIAMS
  41. WHAT IS SHARDING?
    - Vertical partitioning ≠ sharding
    - Each database contains different tables
    - Increases simultaneous writes/reads
    - Supported in Rails 6.0
  42. WHAT IS SHARDING?
    - Horizontal partitioning = sharding
    - 1 table, multiple databases
    - Keeps table size under control
    - Supported in Rails 6.1
  43. SHARDING @ CLOUDINARY
    - 1 main shard for app-wide data
    - Every cloud “lives” 100% on 1 of several shards
    (Diagram labels: 1 2 3 4)
  44. SHARDING @ CLOUDINARY
    - 1 main shard for app-wide data
    - Every cloud “lives” 100% on 1 of several shards
    - Code includes thousands of shard references
    - Test environments must be sharded too!

        cloud = find_cloud_from_request_params
        cloud.on_shard do
          # do the work
        end
  46. SHARDING @ CLOUDINARY
    - Pro: Fast is good
    - Pro: Flexibility
    - Con: Error-prone

        cloud = find_cloud_from_request_params
        assets = []
        cloud.on_shard do
          assets = cloud.assets.where(tag: "duck") # still a lazy ActiveRecord::Relation
        end
        render json: assets # Queried outside shard block!
  47. SHARDING @ CLOUDINARY
    - Pro: Fast is good
    - Pro: Flexibility
    - Con: Error-prone

        cloud = find_cloud_from_request_params
        assets = []
        cloud.on_shard do
          assets = cloud.assets.where(tag: "duck").to_a # Load eagerly
        end
        render json: assets
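The pitfall on the last two slides can be reproduced without Rails. The sketch below is a minimal simulation, not Cloudinary's code: `ShardContext` and `LazyRelation` are hypothetical stand-ins for the `on_shard` helper and a lazy `ActiveRecord::Relation`, and the shard names are made up. It shows a relation that escapes the block unevaluated running against the default shard, while calling `to_a` inside the block pins it to the intended one.

```ruby
# Hypothetical stand-in for a shard-switching helper (not a real library API).
module ShardContext
  DEFAULT = :main

  def self.current
    Thread.current[:shard] || DEFAULT
  end

  # Select a shard for the duration of the block, then restore the previous one.
  def self.on_shard(name)
    previous = Thread.current[:shard]
    Thread.current[:shard] = name
    yield
  ensure
    Thread.current[:shard] = previous
  end
end

# Hypothetical lazy relation: the "query" only runs when records are requested.
class LazyRelation
  def initialize(tag)
    @tag = tag
  end

  # Forces the query and reports which shard it actually ran against.
  def to_a
    ["#{@tag} asset from #{ShardContext.current}"]
  end
end

# Pitfall: the relation leaves the block unevaluated and runs later.
lazy = nil
ShardContext.on_shard(:shard_3) { lazy = LazyRelation.new("duck") }
LAZY_RESULT = lazy.to_a

# Fix: force evaluation while the shard is still selected.
eager = nil
ShardContext.on_shard(:shard_3) { eager = LazyRelation.new("duck").to_a }
EAGER_RESULT = eager

puts LAZY_RESULT.first
puts EAGER_RESULT.first
```

The first line printed names the default shard and the second names `shard_3`, which is the difference the "Load eagerly" annotation points at.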
  48. SHARDING @ CLOUDINARY
    - Pro: Fast is good
    - Pro: Flexibility
    - Con: Error-prone

        people = ActiveRecord::Base.connected_to(role: :reading, shard: :shard_one) do
          Person.all
        end
        people #=> preloaded ActiveRecord::Relation from shard
  49. SHARDING @ CLOUDINARY
    - Pro: Fast is good
    - Pro: Flexibility
    - Con: Error-prone

        people = nil
        ActiveRecord::Base.connected_to(role: :reading, shard: :shard_one) do
          people = Person.all
          "some other code"
        end
        people #=> ActiveRecord::Relation is loaded from default shard
  50. LOCATION

  51. “LOCATION, LOCATION, LOCATION”  EVERY REAL ESTATE AGENT EVER

  52. REGIONS @ CLOUDINARY
    - 3 regions: US (default), EU, AP
    - Premium customers choose closest to their users
    - Dedicated shards per-region
    - What about the primary DB?
    - Option A: Run 3 completely independent systems
    - Option B: EU and AP will be a little slower
    - Option C: Multi-Primary DB
    - Option D: NrtCache (Near RealTime)
  53. NRTCACHE US

  54. NRTCACHE US EU

  55. NRTCACHE US EU Syncer

  56. “WHAT COULD GO WRONG?”  FAMOUS LAST WORDS

  58. THE BIG FAILURE
    - Following a routine deploy, error rate spiked
    - 15 min “high” error rate
    - The cause: A problematic migration
    - Nothing we could do

        Cloud.find_each do |cloud|
          # update some attributes
          cloud.save!
        end
  59. THE BIG FAILURE
    - Following a routine deploy, error rate spiked
    - 15 min “high” error rate
    - The cause: A problematic migration
    - Nothing we could do
    - The solutions:
      - Review EVERY code change for cloud updates
      - Migrate to new && improved NrtCache
  60. LOCATION
    - Pro: Multi-region is mostly great for customers/users
    - Con: Hard to do right
  61. DEDUPLICATING WORK

  62. “THERE ARE ALL KINDS OF LOVE IN THIS WORLD, BUT NEVER THE SAME LOVE TWICE.”  F. SCOTT FITZGERALD
  63. THE PROBLEM ubyShoes r-ubyShoes

  66. LOCKING
    - Goals:
      1. Don’t repeat transformations
      2. Never block a job (2x > 0x)
    - Implementation: Best-effort locking system
  67. LOPTR

  68. LOPTR CDN S3get IO CPU

  69. LOPTR CDN S3get IO CPU Loptr

  70. LOPTR MAY I…
    - Read lock on asset before working on it
    - No writing while the lock is held
    - Write lock on derivation before generating
    - This process can write
    - Exclusive
    - Depends on a well-behaved client
  71. LOPTR IMPLEMENTATION
    - In-memory lock table for speed
    - Written in Scala for high concurrency
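To make the read-lock and write-lock rules concrete, here is a minimal in-memory lock table. It is written in Ruby purely for illustration (the slides say the real service is in Scala), and everything in it is an assumption: the `LockTable` class, the key format, and the expiry time, which is one way a table like this could recover from clients that never release.

```ruby
# Hypothetical in-memory lock table; not Loptr's actual implementation.
class LockTable
  Entry = Struct.new(:mode, :holders, :expires_at)

  def initialize(ttl: 30, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl       # seconds before an unreleased lock is treated as gone
    @clock = clock   # injectable so expiry can be demonstrated without sleeping
    @locks = {}
    @mutex = Mutex.new
  end

  # Shared lock: several readers may hold it, but not while a writer does.
  def read_lock(key, holder)
    acquire(key, holder, :read)
  end

  # Exclusive lock: granted only when nobody else holds the key.
  def write_lock(key, holder)
    acquire(key, holder, :write)
  end

  def release(key, holder)
    @mutex.synchronize do
      entry = @locks[key]
      next false unless entry&.holders&.delete(holder)
      @locks.delete(key) if entry.holders.empty?
      true
    end
  end

  private

  def acquire(key, holder, mode)
    @mutex.synchronize do
      now = @clock.call
      entry = @locks[key]
      entry = nil if entry && entry.expires_at <= now # expired counts as released
      if entry.nil?
        @locks[key] = Entry.new(mode, [holder], now + @ttl)
        true
      elsif mode == :read && entry.mode == :read
        entry.holders << holder
        entry.expires_at = now + @ttl
        true
      else
        false
      end
    end
  end
end

now = 0.0
table = LockTable.new(ttl: 10, clock: -> { now })
READ_A = table.read_lock("asset:1", "worker-a")
READ_B = table.read_lock("asset:1", "worker-b")
WRITE_WHILE_READ = table.write_lock("asset:1", "worker-c")
WRITE_FIRST = table.write_lock("derived:1/w_100", "worker-a")
WRITE_SECOND = table.write_lock("derived:1/w_100", "worker-b")
now = 11.0 # worker-a never released; its lock has now expired
WRITE_AFTER_EXPIRY = table.write_lock("derived:1/w_100", "worker-b")
```

Two readers share the asset, a writer is refused while they hold it, and only one worker at a time gets to generate a given derivation until the lock is released or expires.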
  72. LOPTR CONCERNS
    - Failure to release
      - Timeout
    - Downtime
      - Pretend every lock request succeeded
    - Scaling Loptr
      - Cluster with request targeted by consistent hash
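Two of the mitigations above, treating every lock request as granted during downtime and routing each request to a cluster node by consistent hash, can be sketched briefly. This is an illustrative sketch only: `HashRing`, `acquire_or_proceed`, the node names, and the choice of MD5 with 100 points per node are all assumptions, not details from the talk.

```ruby
require "digest/md5"

# Hypothetical consistent-hash ring: maps each lock key to one cluster node.
class HashRing
  def initialize(nodes, replicas: 100)
    # Each node gets several points on the ring to even out the key spread.
    @ring = nodes.flat_map { |node|
      (0...replicas).map { |i| [point("#{node}:#{i}"), node] }
    }.sort_by(&:first)
  end

  # The first ring point at or after the key's hash owns the key (wrapping around).
  def node_for(key)
    target = point(key)
    idx = @ring.bsearch_index { |pt, _node| pt >= target } || 0
    @ring[idx].last
  end

  private

  def point(value)
    Digest::MD5.hexdigest(value)[0, 8].to_i(16)
  end
end

# Best-effort wrapper: if the lock service cannot be reached, behave as if the
# lock was granted, since doing the work twice beats not doing it at all.
def acquire_or_proceed(key)
  yield(key)
rescue StandardError
  true
end

RING = HashRing.new(%w[loptr-1 loptr-2 loptr-3])
SAME_NODE = RING.node_for("derived:1/w_100") == RING.node_for("derived:1/w_100")
GRANTED_DURING_OUTAGE = acquire_or_proceed("derived:1/w_100") { |_key|
  raise IOError, "lock service down"
}
```

A useful property of this routing is that removing one node only reassigns the keys that node owned; every other key keeps its owner, so most lock state stays where clients expect it.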
  73. LOPTR
    - Pro: Resiliency to traffic surges
    - Con: Unreleased locks can cause timeouts
    - Note: Not 100% reliable (but still net positive)
  74. NOT SCALING

  75. “WHAT YOU DON’T DO DETERMINES WHAT YOU CAN DO.”  TIM FERRISS
  76. HOW NOT TO SCALE
    - Limit individual customer impact
    - Would you rather deal with:
      - 1 dissatisfied customer
      - Thousands of dissatisfied customers
    - Rate limits
    - Manage scarcity while eliminating scarcity
    - Fair queueing
    - Background jobs
  77. RATE LIMITS
    - Heavy API calls have a strict limit
    - Locking effectively throttles non-rate-limited calls
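The slide does not say how the strict limit is enforced, so the following is only one common way to do it: a fixed-window counter per cloud. The `RateLimiter` class, the limit of 2 calls, and the 60-second window are all made-up values for illustration.

```ruby
# Hypothetical fixed-window rate limiter keyed by cloud.
class RateLimiter
  def initialize(limit:, window:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @limit = limit     # calls allowed per window
    @window = window   # window length in seconds
    @clock = clock     # injectable so the example does not need to sleep
    @counters = {}     # cloud => [window_id, calls seen in that window]
  end

  # Returns true if the call may proceed, false if the cloud is over its limit.
  def allow?(cloud)
    window_id = (@clock.call / @window).floor
    id, count = @counters[cloud]
    count = 0 unless id == window_id # a new window starts from zero
    return false if count >= @limit

    @counters[cloud] = [window_id, count + 1]
    true
  end
end

now = 0.0
limiter = RateLimiter.new(limit: 2, window: 60, clock: -> { now })
FIRST_CALLS = [limiter.allow?("cloud-a"), limiter.allow?("cloud-a")]
THIRD_CALL = limiter.allow?("cloud-a")
OTHER_CLOUD = limiter.allow?("cloud-b")
now = 61.0
NEXT_WINDOW = limiter.allow?("cloud-a")
```

Because the counter is per cloud, one customer hitting the limit does not affect another, which is the "limit individual customer impact" goal from the previous slide.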
  78. FAIR QUEUE ON CPU
    - Jobs are assigned a number of slots
    - “Queue of queues” mechanism allots slots to clouds
  134. FAIR QUEUE ON CPU CDN S3get IO CPU Loptr

  135. FAIR QUEUE ON CPU CDN S3get IO CPU Loptr Fair Queue
  136. FAIR QUEUE ON CPU LAYER
    - Jobs are assigned a number of slots
    - “Queue of queues” mechanism allots slots to clouds
    - Prefer sync to async requests
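One way to picture a "queue of queues" is an outer queue of clouds, each with its own inner job queue, served round-robin. The sketch below assumes exactly that and nothing more: `FairQueue` and the cloud names are hypothetical, every job is treated as one slot, and the sync-over-async preference and thread safety are left out.

```ruby
# Hypothetical "queue of queues": an outer queue of clouds, each with its own
# FIFO of jobs, so one busy cloud cannot starve the others.
class FairQueue
  def initialize
    @jobs_by_cloud = Hash.new { |hash, cloud| hash[cloud] = [] }
    @cloud_order = [] # outer queue: clouds that currently have pending jobs
  end

  def push(cloud, job)
    @cloud_order << cloud if @jobs_by_cloud[cloud].empty?
    @jobs_by_cloud[cloud] << job
  end

  # Take one job from the cloud at the head, then send that cloud to the back.
  def pop
    cloud = @cloud_order.shift
    return nil if cloud.nil?

    job = @jobs_by_cloud[cloud].shift
    if @jobs_by_cloud[cloud].empty?
      @jobs_by_cloud.delete(cloud)
    else
      @cloud_order << cloud
    end
    [cloud, job]
  end
end

queue = FairQueue.new
4.times { |i| queue.push("busy", "job-#{i + 1}") }
queue.push("small-a", "job-1")
queue.push("small-b", "job-1")
ORDER = Array.new(6) { queue.pop }
EMPTY_POP = queue.pop
ORDER.each { |cloud, job| puts "#{cloud}: #{job}" }
```

Even though the busy cloud enqueued four jobs first, the two small clouds are served after its first job rather than after its last.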
  137. FAIR QUEUE FOR DB ACCESS
    - “Queue of queues” throttles per-cloud concurrency
    - One big monkeypatch
      - ActiveRecord::ConnectionAdapters::ConnectionPool
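The idea of capping how many database operations a single cloud can run at once can be shown without touching ActiveRecord. This sketch does not patch `ConnectionPool`; `CloudConcurrencyLimiter` is a hypothetical stand-alone class, and it rejects over-limit work immediately, whereas a queue-based version like the one the slide describes would presumably make it wait.

```ruby
# Hypothetical per-cloud concurrency cap; rejects instead of queueing.
class CloudConcurrencyLimiter
  def initialize(max_per_cloud)
    @max = max_per_cloud
    @in_use = Hash.new(0) # cloud => slots currently held
    @mutex = Mutex.new
  end

  # Runs the block if the cloud is under its cap; returns :throttled otherwise.
  def with_slot(cloud)
    return :throttled unless reserve(cloud)

    begin
      yield
    ensure
      release(cloud)
    end
  end

  def in_use(cloud)
    @mutex.synchronize { @in_use[cloud] }
  end

  private

  def reserve(cloud)
    @mutex.synchronize do
      next false if @in_use[cloud] >= @max
      @in_use[cloud] += 1
      true
    end
  end

  def release(cloud)
    @mutex.synchronize do
      @in_use[cloud] -= 1
      @in_use.delete(cloud) if @in_use[cloud].zero?
    end
  end
end

LIMITER = CloudConcurrencyLimiter.new(1)
# A second concurrent operation for the same cloud is refused...
NESTED_SAME = LIMITER.with_slot("cloud-a") { LIMITER.with_slot("cloud-a") { :ran } }
# ...while another cloud is unaffected.
NESTED_OTHER = LIMITER.with_slot("cloud-a") { LIMITER.with_slot("cloud-b") { :ran } }
```

The nested calls stand in for two operations in flight at the same time; slots are always returned in the `ensure` clause, so an exception inside the block does not leak capacity.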
  138. BACKGROUND JOBS
    - Anything that can be done out-of-band, should be
    - Examples:
      - CDN invalidations
      - Webhooks
      - Eager transformations
  139. HUMAN FACTORS

  140. “THE GOOD NEWS ABOUT COMPUTERS IS THAT THEY DO WHAT YOU TELL THEM TO DO. THE BAD NEWS IS THAT THEY DO WHAT YOU TELL THEM TO DO.”  TED NELSON
  141. SCALING VIA HUMANS
    - Education
      - Encourage practices like eager generation
    - Relationships
      - Understand customer use cases
      - They inform us about changes in use patterns
      - Look for win-wins!
  142. CLOUDINARY ON RAILS

  143. IT WAS NEVER ABOUT RAILS
    - Most of our traffic never touches Rails
    - The fastest parts of the system aren’t in Ruby
    - The computation-heavy parts are low-level utils or APIs
    - The database scaling is language-independent
  144. RAILS IS GREAT!
    - 2 developers had 4 incredibly productive years
    - Ruby is actually [emoji] for creating interfaces
    - Just needed to move a few things out of Ruby
  145. RAILS IS CHALLENGING
    - Upgrading Rails is hard
    - Upgrading monkeypatched Rails is harder
    - Specific to Israel:
      - Difficult to recruit Ruby devs
      - Difficult to recruit devs who want to learn Ruby
  146. THE FUTURE OF RAILS @ CLOUDINARY
    - Here to stay, but…
    - Polyglot microservices
    - Build the app our next employee wants to work on
  147. THANKS!
    Ariel Caplan • @amcaplan • amcaplan.ninja
    Special thanks to advisors/reviewers:
    - Itai Benari
    - Max Rozenoer
    - Vladimir Shteinman