Slide 1

Slide 1 text

Scaling Ruby @ GitHub @jhawthorn

Slide 2

Slide 2 text

John Hawthorn
GitHub: @jhawthorn
Twitter: @jhawthorn
Bluesky: @jhawthorn.com
Mastodon: @[email protected]
Staff Software Engineer @ GitHub
Ruby committer
Rails core team

Slide 3

Slide 3 text

• Joined GitHub in 2018
• "Ruby Architecture" Team
• Developing & upgrading Ruby
• Developing & upgrading Rails
• Integrating Ruby/Rails into GitHub
• Some database stuff

Slide 4

Slide 4 text

Scaling Ruby @ GitHub @jhawthorn

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Scaling users Scaling teams Scaling

Slide 7

Slide 7 text

Scaling users: Customers, Traffic, Servers
Scaling teams: Developers, Lines of code, Features

Slide 8

Slide 8 text

GitHub's beginning

Slide 9

Slide 9 text

Ruby upgrades Ruby 1.8 Ruby 2.3 2.0 (fork) 2.1 (fork)

Slide 10

Slide 10 text

Rails upgrades Rails 1.2 Rails 2.3 2.3 (fork) 3.0 (fork) Rails 4.2 Rails 5.2 4.2

Slide 11

Slide 11 text

Today GitHub is running Rails 8.1.0.alpha in production!

Slide 12

Slide 12 text

We upgrade Rails most weeks

Slide 13

Slide 13 text

Scheduled GitHub action update Rails from main branch Open pull request

Slide 14

Slide 14 text

Today GitHub is running Ruby 3.4.1

Slide 15

Slide 15 text

GitHub runs with YJIT ~15-20% performance boost Since Ruby 3.3

Slide 16

Slide 16 text

We upgrade Ruby most weeks (in CI only, production runs stable Ruby)

Slide 17

Slide 17 text

Upgrading weekly
• Smaller changes
• Stay closer to upstream
• Encourages contributing upstream

Slide 18

Slide 18 text

It's easier to see the language and framework as part of our application

Slide 19

Slide 19 text

A mostly normal Ruby/Rails application "The Monolith"

Slide 20

Slide 20 text

"The Monolith" ~3.8 million lines of code ~2 million lines of tests >1000 developers 2500 pull requests / month

Slide 21

Slide 21 text

Problems:
• I can't read 3.8 million lines of code
• I can't review 2500 pull requests / month

Slide 22

Slide 22 text

CODEOWNERS
https://github.blog/engineering/architecture-optimization/how-we-organize-and-get-things-done-with-serviceowners/

  # CODEOWNERS
  app/models/issues*.rb @github/issues-team
  app/models/user.rb    @github/users-team
  app/models/profile.rb @github/users-team

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Error Attribution

  ActiveRecord::StatementInvalid
    active_record/lib/active_record/query.rb:123 in `query`
    app/models/user.rb:124:in 'Array#each'
    app/models/user.rb:123:in 'find_user'
    app/models/issue.rb:123:in 'some_user'
    app/controllers/issues_controller.rb:123:in 'show'
    lib/middleware/cache_middleware.rb:123:in 'call'

Slide 25

Slide 25 text

Error Attribution

  ActiveRecord::StatementInvalid
    active_record/lib/active_record/query.rb:123 in `query`
    app/models/user.rb:124:in 'Array#each'
    app/models/user.rb:123:in 'find_user'
    app/models/issue.rb:123:in 'some_user'
    app/controllers/issues_controller.rb:123:in 'show'
    lib/middleware/cache_middleware.rb:123:in 'call'

Slide 26

Slide 26 text

Error Attribution

  ActiveRecord::StatementInvalid
    active_record/lib/active_record/query.rb:123 in `query`
    app/models/user.rb:124:in 'Array#each'
    app/models/user.rb:123:in 'find_user'
    app/models/issue.rb:123:in 'some_user'
    app/controllers/issues_controller.rb:123:in 'show'
    lib/middleware/cache_middleware.rb:123:in 'call'

user.rb: "users" service, @github/users-team

Slide 27

Slide 27 text

Error Attribution

  ActiveRecord::StatementInvalid
    active_record/lib/active_record/query.rb:123 in `query`
    app/models/user.rb:124:in 'Array#each'
    app/models/user.rb:123:in 'find_user'
    app/models/issue.rb:123:in 'some_user'
    app/controllers/issues_controller.rb:123:in 'show'
    lib/middleware/cache_middleware.rb:123:in 'call'

issues_controller.rb: "issues" service, @github/issues-team

Slide 28

Slide 28 text

Error Attribution

  ActiveRecord::StatementInvalid
    active_record/lib/active_record/query.rb:123 in `query`
    app/models/user.rb:124:in 'Array#each'
    app/models/user.rb:123:in 'find_user'
    app/models/issue.rb:123:in 'some_user'
    app/controllers/issues_controller.rb:123:in 'show'
    lib/middleware/cache_middleware.rb:123:in 'call'

service: users (@github/users-team)
cause service: issues (@github/issues-team)
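The attribution idea on these slides can be sketched in a few lines of Ruby. This is only an illustration, not GitHub's implementation: the `OWNERS` table, `owner_for`, and `attribute` are hypothetical names, and the patterns are made up to match the example backtrace.

```ruby
# Hypothetical ownership table (glob pattern => owning service and team).
OWNERS = {
  "app/models/user.rb"                  => { service: "users",  team: "@github/users-team" },
  "app/models/issue.rb"                 => { service: "issues", team: "@github/issues-team" },
  "app/controllers/issues_controller.rb" => { service: "issues", team: "@github/issues-team" },
}.freeze

# Returns the owner hash for a file path, or nil when nobody owns it.
def owner_for(path)
  OWNERS.each { |glob, owner| return owner if File.fnmatch?(glob, path) }
  nil
end

# Walks a backtrace (innermost frame first) and returns the distinct
# owners of the frames that have one, in the order they appear.
def attribute(backtrace)
  backtrace.filter_map do |frame|
    path = frame[/\A(.+?):\d+/, 1] # text before the first ":<line number>"
    owner_for(path) if path
  end.uniq
end

backtrace = [
  "active_record/lib/active_record/query.rb:123 in `query`",
  "app/models/user.rb:124:in 'Array#each'",
  "app/models/user.rb:123:in 'find_user'",
  "app/models/issue.rb:123:in 'some_user'",
  "app/controllers/issues_controller.rb:123:in 'show'",
  "lib/middleware/cache_middleware.rb:123:in 'call'",
]

attribute(backtrace).each do |owner|
  puts "service: #{owner[:service]} (#{owner[:team]})"
end
```

Framework and middleware frames have no owner in this table, so they are skipped; the first owned frame is the one closest to where the error was raised.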

Slide 29

Slide 29 text

Not everything can be divided by file

Slide 30

Slide 30 text

Remove OpenStruct Goal:

Slide 31

Slide 31 text

Remove OpenStruct

  # test/linters/openstruct_test.rb
  openstruct_count = grep(/OpenStruct/).count
  assert_equal 0, openstruct_count

  # CODEOWNERS
  test/linters/openstruct_test.rb @github/ruby-architecture

Our team has to approve any new OpenStruct
Current count: 50

Slide 32

Slide 32 text

"Ratchet" technique 🔧

  # test/linters/openstruct_test.rb
  openstruct_count = grep(/OpenStruct/).count
  if openstruct_count > 50
    flunk "OpenStruct has performance issues. " \
          "Please use Struct or a plain Ruby class instead."
  elsif openstruct_count < 50
    flunk "You removed an OpenStruct. Thank you! " \
          "Please update the counter in #{__FILE__}"
  end

Slide 33

Slide 33 text

commit: "Remove OpenStruct everywhere" +1243 -1243
😬 Hard to review
Likely to have merge conflicts

Slide 34

Slide 34 text

commit: "Remove OpenStruct everywhere" +1243 -1243
script/create-prs-by-codeowner
commit: "Remove OpenStruct from issues" +30 -30
commit: "Remove OpenStruct from users" +20 -20
...
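The slides do not show what script/create-prs-by-codeowner does internally. As a rough sketch of the core step, assuming the script starts from a list of changed paths, the files can be grouped by owning team so that each group becomes its own small pull request. `RULES`, `team_for`, and `group_by_owner` are hypothetical names.

```ruby
# Hypothetical CODEOWNERS-style rules: [glob pattern, owning team].
RULES = [
  ["app/models/issue*.rb",  "@github/issues-team"],
  ["app/models/user.rb",    "@github/users-team"],
  ["app/models/profile.rb", "@github/users-team"],
].freeze

# In a CODEOWNERS file the last matching pattern takes precedence,
# so search the rules from the bottom up.
def team_for(path)
  match = RULES.reverse.find { |glob, _team| File.fnmatch?(glob, path) }
  match ? match.last : "unowned"
end

# Splits one large change into per-team batches of file paths.
def group_by_owner(paths)
  paths.group_by { |path| team_for(path) }
end

changed = ["app/models/issue.rb", "app/models/user.rb", "app/models/profile.rb", "README.md"]
group_by_owner(changed).each do |team, files|
  puts "#{team}: #{files.join(', ')}"
end
```

Each batch can then be committed and opened as a separate pull request, which keeps every review small and assigned to the team that knows the code.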

Slide 35

Slide 35 text

Deploying https://github.blog/engineering/engineering-principles/how-github-uses-merge-queue-to-ship-hundreds-of-changes-every-day/ ~2500 pull requests / month ~20 deploys a day

Slide 36

Slide 36 text

Branch deploys
https://github.blog/engineering/engineering-principles/deploying-branches-to-github-com/
1. git branch some_branch
2. Commit and push work on some_branch
3. Deploy some_branch to production
4. Merge some_branch to main
"main is always production-ready"

Slide 37

Slide 37 text

Branch deploys jhawthorn .deploy my_branch to github hubot is already deploying

Slide 38

Slide 38 text

Deploy queue jhawthorn .queue me to deploy my_branch hubot There are two PRs ahead of you

Slide 39

Slide 39 text

Merge queue https://github.blog/engineering/engineering-principles/how-github-uses-merge-queue-to-ship-hundreds-of-changes-every-day/

Slide 40

Slide 40 text

If one PR has a bug, the whole merge group fails. The next group is delayed (waiting for the rollback, waiting for CI, etc.)

Slide 41

Slide 41 text

As your application grows, the cost of a bug grows: users impacted, developers impacted

Slide 42

Slide 42 text

Feature flags

Slide 43

Slide 43 text

Feature flags

  if Flipper.enabled?(:some_feature)

Gates
• 100% enabled: true
• Enabled 50% of the time: rand() < 0.5
• 0% enabled ("dark shipping"): false

Slide 44

Slide 44 text

Feature flags

  if Flipper.enabled?(:some_feature, current_user)

The second argument is the actor. The actor could be any type:
• User: @jhawthorn
• Organization: @ruby
• Repository: @ruby/ruby

Slide 45

Slide 45 text

Feature flags

  if Flipper.enabled?(:some_feature, current_user)

Gates (evaluated against the actor)
• Enable for 10% of users: crc(actor.id) % 10 == 0
• Enable "staff" group: actor.staff?
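The two actor gates above can be sketched in plain Ruby without any feature-flag library. This is an illustration of the idea, not Flipper's internals: `Actor`, `enabled_for_percentage?`, and `enabled?` are hypothetical, and mixing the feature name into the hash is an added assumption so that different features pick different subsets of users.

```ruby
require "zlib"

# Hypothetical actor: anything with an id and a staff? predicate.
Actor = Struct.new(:id, :staff) do
  def staff?
    staff
  end
end

# Percentage-of-actors gate: hash the actor into one of 100 buckets.
# The same actor always lands in the same bucket, so the answer is
# stable across requests (unlike rand()).
def enabled_for_percentage?(feature, actor, percentage)
  bucket = Zlib.crc32("#{feature}:#{actor.id}") % 100
  bucket < percentage
end

# Combines the "staff" group gate with the percentage gate.
def enabled?(feature, actor, percentage: 0, staff: false)
  return true if staff && actor.staff?
  enabled_for_percentage?(feature, actor, percentage)
end

user = Actor.new(123, false)
puts enabled?(:some_feature, user, percentage: 10)
```

Because the bucket is derived from the actor, raising the percentage only ever adds actors: anyone enabled at 10% stays enabled at 50%.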

Slide 46

Slide 46 text

ENV "Feature flags"

  if ENV["ENABLE_YJIT"] == "1"
    RubyVM::YJIT.enable
  end

Slide 47

Slide 47 text

ENV "Feature flags"

  if rand() < ENV["ENABLE_YJIT_PCT"].to_f / 100.0
    RubyVM::YJIT.enable
  end

Slide 48

Slide 48 text

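ENV "Feature flags"

  require "socket"
  require "zlib"

  # Map the hostname to a stable number in [0, 1) so a given host
  # always makes the same decision.
  hostname = Socket.gethostname
  hostname_rand = Zlib.crc32(hostname) / 2.0 ** 32
  if hostname_rand < ENV["ENABLE_YJIT_PCT"].to_f / 100.0
    RubyVM::YJIT.enable
  end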

Slide 49

Slide 49 text

Scientist
https://github.com/github/scientist
Compare two implementations in production

Slide 50

Slide 50 text

Scientist

  users_orig = original_code(args)
  return users_orig

Slide 51

Slide 51 text

Scientist

  users_orig = original_code(args)
  users_new = new_code(args)
  return users_orig

Slide 52

Slide 52 text

Scientist

  users_orig = original_code(args)
  users_new = new_code(args)
  unless users_new == users_orig
    report_error "users didn't match"
  end
  return users_orig

Slide 53

Slide 53 text

Scientist
https://github.com/github/scientist

  science "users" do |e|
    e.use { original_code(args) }  # Runs every time. Result returned
    e.try { new_code(args) }       # Controlled by feature flag. Result compared to "use" block
  end

Reports metrics of both! Mismatches recorded
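To make the behavior described on this slide concrete, here is a small hand-rolled sketch of the pattern. It does not use the scientist gem and is not how the gem is implemented: `MiniExperiment` and its `enabled:` switch (standing in for the feature flag) are hypothetical.

```ruby
# A minimal stand-in for the experiment pattern: always return the
# original result, optionally run the new code, and record differences.
class MiniExperiment
  attr_reader :mismatches, :timings

  def initialize(name, enabled: true)
    @name = name
    @enabled = enabled # stands in for the feature flag
    @mismatches = []
    @timings = {}
  end

  def use(&block)
    @control = block
  end

  def try(&block)
    @candidate = block
  end

  def run
    control_value, @timings[:control] = measure(@control)

    if @enabled && @candidate
      begin
        candidate_value, @timings[:candidate] = measure(@candidate)
        unless candidate_value == control_value
          @mismatches << { name: @name, control: control_value, candidate: candidate_value }
        end
      rescue StandardError => e
        # A broken candidate must never affect the caller.
        @mismatches << { name: @name, control: control_value, error: e }
      end
    end

    control_value
  end

  private

  # Returns [value, elapsed milliseconds] for the given block.
  def measure(block)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    value = block.call
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
    [value, elapsed_ms]
  end
end

experiment = MiniExperiment.new("users")
experiment.use { [1, 2, 3] }
experiment.try { [1, 2] }
puts experiment.run.inspect
puts experiment.mismatches.size
```

Errors in the original code still propagate as usual; only the candidate is isolated, which is what makes it safe to try the new implementation on production traffic.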

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Similar open source tool: rack-mini-profiler

Slide 57

Slide 57 text

Profilers Features SQL Vernier stackprof allocation_tracer

Slide 58

Slide 58 text

Trilogy: MySQL-compatible database client

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

Trilogy
• MySQL-compatible database client
• Fast!
• No dependency on libmysqlclient/libmariadb
• Automatic casting
• Embedding friendly
https://github.blog/open-source/maintainers/introducing-trilogy-a-new-database-adapter-for-ruby-on-rails/

Slide 61

Slide 61 text

Database Scaling
https://github.blog/engineering/infrastructure/partitioning-githubs-relational-databases-scale/
[Diagram: app servers (x 1000) all connecting to a single DB 🔥]

Slide 62

Slide 62 text

DB Scaling - Replication

Slide 63

Slide 63 text

DB Scaling - Replication Primary Replicas read-only

Slide 64

Slide 64 text

DB Scaling - Replication Primary Replicas SELECT * SELECT * SELECT * INSERT ... read-only

Slide 65

Slide 65 text

DB Scaling - Replication Primary Replicas read-only May have stale data delay: < 1s when healthy

Slide 66

Slide 66 text

DB Scaling - Replication

  class ApplicationRecord < ActiveRecord::Base
    self.abstract_class = true

    connects_to database: {
      writing: :primary,
      reading: :primary_replica
    }
  end

Slide 67

Slide 67 text

DB Scaling - Replication

  ActiveRecord::Base.connected_to(role: :reading) do
    User.find(123)
  end

Queries the replica: SELECT * FROM users WHERE id = 123
Available since Rails 6.0.0! https://github.com/rails/rails/pull/34052 by @eileencodes

Slide 68

Slide 68 text

DB Scaling - Replication Primary Replicas anon logged in

Slide 69

Slide 69 text

DB Scaling - Replication Primary Replicas anon logged in

Slide 70

Slide 70 text

DB Scaling - Replication Primary Replicas anon logged in 50% 50% 50% 50%

Slide 71

Slide 71 text

DB Scaling - Replication Primary Replicas 2x scale 50% 50%

Slide 72

Slide 72 text

Can we do better? Primary Replicas logged in

Slide 73

Slide 73 text

Primary Replicas logged in < 1s delay Write in last 1s? yes no "Read your own write"
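The "read your own write" decision on this slide can be sketched as a small router that remembers when each user last wrote. This is an illustration only: `ReplicaRouter`, its in-memory hash, and the injectable clock are assumptions, and a real application would keep the timestamp somewhere shared such as the session.

```ruby
# Chooses which database role a read should use, based on how recently
# the same user wrote. Recent writers read from the primary so they
# always see their own changes; everyone else reads from a replica.
class ReplicaRouter
  def initialize(delay: 1.0, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @delay = delay         # assumed upper bound on replication lag, in seconds
    @clock = clock         # injectable so the logic can be tested without sleeping
    @last_write_at = {}    # user id => time of last write
  end

  def record_write(user_id)
    @last_write_at[user_id] = @clock.call
  end

  # Returns :writing (primary) or :reading (replica).
  def role_for_read(user_id)
    last_write = @last_write_at[user_id]
    return :reading if last_write.nil?

    (@clock.call - last_write) < @delay ? :writing : :reading
  end
end

router = ReplicaRouter.new(delay: 1.0)
router.record_write(123)
puts router.role_for_read(123) # the user just wrote, so read from the primary
puts router.role_for_read(456) # no recent write, so a replica is fine
```

The returned role could then be passed to `ActiveRecord::Base.connected_to(role: ...)` as shown on the earlier replication slide.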

Slide 74

Slide 74 text

❯ rails g active_record:multi_db
      create  config/initializers/multi_db.rb

  # config/initializers/multi_db.rb
  Rails.application.configure do
    config.active_record.database_selector = { delay: 2.seconds }
  end

Slide 75

Slide 75 text

https://github.blog/engineering/infrastructure/mitigating-replication-lag-and-reducing-read-load-with-freno/
https://github.com/github/freno-client
https://github.com/github/freno

Slide 76

Slide 76 text

Primary Replicas logged in Is data replicated? yes no "Read your own write" w/ freno 98% 2% 50x scale!

Slide 77

Slide 77 text

There's much more: sharding, proxies, more queries to replicas, etc.

Slide 78

Slide 78 text

Database Retries

Slide 79

Slide 79 text

Replicas retry!

Slide 80

Slide 80 text

(mostly) Default in Rails 7.1+

  # Hack: assume all queries are retryable
  def raw_execute(*args, **kwargs)
    super(*args, **kwargs.merge(allow_retry: true))
  end

Slide 81

Slide 81 text

  # database.yml
  production:
    adapter: trilogy
    connection_retries: 2  # 2 retries (default 1)
    retry_deadline: 2      # Stop retrying after 2 seconds
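The two settings above bound retries by count and by elapsed time. A generic sketch of that policy in plain Ruby follows; it is not Rails' implementation, and `with_retries`, `monotonic_now`, and the default `retry_on` list are hypothetical.

```ruby
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Runs the block, retrying on the listed errors until either the retry
# budget or the time budget is used up, then re-raises the last error.
def with_retries(retries: 2, deadline: 2.0, retry_on: [IOError])
  started = monotonic_now
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *retry_on
    out_of_retries = attempts > retries
    out_of_time = (monotonic_now - started) >= deadline
    raise if out_of_retries || out_of_time
    retry
  end
end

calls = 0
result = with_retries(retries: 2, deadline: 2.0) do
  calls += 1
  raise IOError, "connection lost" if calls < 3 # fail twice, then succeed
  :ok
end
puts "#{result} after #{calls} attempts"
```

With `retries: 2` the block runs at most three times. Errors not listed in `retry_on` are raised immediately, which matters because only operations known to be safe to repeat should be retried.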

Slide 82

Slide 82 text

Caution!
⚠ Retries can hide unhealthy systems
⚠ Retries + scale + problem = self DoS

Slide 83

Slide 83 text

Circuit breakers: errors more often, but faster

Slide 84

Slide 84 text

  circuit_breaker = CircuitBreaker.get("key")
  if circuit_breaker.allow_request?
    begin
      # do something expensive
      circuit_breaker.success
    rescue => e
      circuit_breaker.failure
      # do fallback
    end
  else
    # do fallback
  end

After a certain number of failures, disallow requests

Slide 85

Slide 85 text

  class ResilientTrilogy
    def query(...)
      if @circuit_breaker.allow_request?
        begin
          ret = @trilogy.query(...)
          @circuit_breaker.success
          ret
        rescue ConnectionError => e
          @circuit_breaker.failure
          raise
        end
      else
        raise CircuitOpenError
      end
    end
  end

Slide 86

Slide 86 text

  class ResilientTrilogy
    def query(...)
      if @circuit_breaker.allow_request?
        begin
          ret = @trilogy.query(...)
          @circuit_breaker.success
          ret
        rescue ConnectionError => e
          @circuit_breaker.failure
          raise
        end
      else
        raise CircuitOpenError
      end
    end
  end

No fallback, but allows systems to recover
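The slides use a `CircuitBreaker` object without showing it. Below is one minimal way such a class could look, matching the `get`, `allow_request?`, `success`, and `failure` calls used above. The thresholds, the cooldown behavior, and the injectable clock are assumptions, and this sketch is not thread-safe.

```ruby
# Opens after a number of consecutive failures and rejects requests
# until a cooldown has passed, giving the failing system time to recover.
class CircuitBreaker
  @registry = {}

  # Returns the shared breaker for a key, creating it on first use.
  def self.get(key, **options)
    @registry[key] ||= new(**options)
  end

  def initialize(failure_threshold: 5, cooldown: 30.0,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @cooldown = cooldown
    @clock = clock
    @failures = 0
    @opened_at = nil # nil means the circuit is closed (healthy)
  end

  def allow_request?
    return true if @opened_at.nil?

    # Once the cooldown has elapsed, let requests through again as a probe.
    (@clock.call - @opened_at) >= @cooldown
  end

  def success
    @failures = 0
    @opened_at = nil
  end

  def failure
    @failures += 1
    @opened_at = @clock.call if @failures >= @failure_threshold
  end
end

breaker = CircuitBreaker.get("mysql-primary", failure_threshold: 2, cooldown: 10.0)
2.times { breaker.failure }
puts breaker.allow_request? # the circuit is open, so requests fail fast
```

Failing fast while the circuit is open is what turns slow, piled-up timeouts into quick errors, which is the trade described on the "Circuit breakers" slide.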

Slide 87

Slide 87 text

A sample of Ruby at scale

Slide 88

Slide 88 text

Language, framework, and gems are a part of our application

Slide 89

Slide 89 text

Let's scale Ruby together

Slide 90

Slide 90 text

Thank you