Hacking and Profiling Ruby for Performance - RubyKaigi 2023

A RubyKaigi 2023 talk

osyoyu

May 12, 2023

Transcript

  1. pp @osyoyu • Daisuke Aritomo (osyoyu, おしょーゆ, おしょうゆ) • "osyoyu" is pronounced as "oh show you" • #rubykaigiNOC (Venue Wi-Fi Team) • Software Engineer at Cookpad Inc.
  2. Get a free drink from Cookpad's fridge! Cookpad provides RUBY-POWERED FRIDGES to RubyKaigi 2023! Let's make everyday cooking fun together! https://cookpad.careers/
  3. Cookpad is doing a look-back RubyKaigi event! 5/18 @ Tokyo • Look back on RubyKaigi with hands-on sessions! • Let's make the next RubyKaigi together • Come to the Cookpad booth for details https://cookpad.connpass.com/event/282436/
  4. About this Talk • How to profile and tune a Ruby webapp • ... you know nothing about, • ... within 8 hours, • ... as part of a performance tuning competition, • for fun!
  5. ISUCON: the performance challenge • Contestants are given a VM and a super-slow webapp • Contestants may request a 60-second benchmark during the contest • Scores and standings are decided based on benchmark results • The goal is to get the best score • [Diagram: Benchmarker (scorer) → Contestant VM (running a Ruby webapp); benchmark requests sent at an intensive rate for 60 seconds]
  6. The Initial State • 3 VMs (Computing Instances) • Ruby • Sinatra • Puma • Nginx • MySQL • Implementations in other languages • Python, Go, Node.js, ... • Initial performance is designed to be roughly the same across languages
  7. Will talks/Won't talks • Wills: • Why Ruby is a great language to compete in ISUCON • How to track down and profile slow code on the CRuby level • What future Rubies need to shine in ISUCON • Won'ts: • Configuring Nginx, Linux, etc. • Monitoring on the system level • Itamae recipes we made for ISUCON
  8. (Almost) Everything is permitted • Add effective RDBMS indexes • Kill N+1 SQL queries • Before: user_ids.each {|id| query("select * from users where id = ?", id) } • After: query("select * from users where id in (?)", user_ids) • Replace suboptimal algorithms • Utilize server resources (CPU, memory) to the last drop • Add Puma threads/processes • Cache • Upgrade to ruby/ruby master • (Adding VMs and scaling VMs up are prohibited)
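
The N+1 rewrite on the slide above can be sketched without a database. This is a minimal illustration, not the contest code: `FakeDB`, `names_n_plus_one` and `names_batched` are hypothetical names, and `FakeDB` is an in-memory stand-in that only counts how many queries a real MySQL client would have sent.

```ruby
# Hypothetical in-memory stand-in for a MySQL client; it only counts queries.
class FakeDB
  attr_reader :query_count

  def initialize(users)
    @users = users
    @query_count = 0
  end

  # Stand-in for "select * from users where id = ?"
  def find_user(id)
    @query_count += 1
    @users.find { |user| user[:id] == id }
  end

  # Stand-in for "select * from users where id in (...)"
  def find_users(ids)
    @query_count += 1
    @users.select { |user| ids.include?(user[:id]) }
  end
end

# N+1 style: one round trip per id.
def names_n_plus_one(db, user_ids)
  user_ids.map { |id| db.find_user(id)[:name] }
end

# Batched style: a single round trip, then an in-memory lookup by id.
def names_batched(db, user_ids)
  users_by_id = db.find_users(user_ids).to_h { |user| [user[:id], user] }
  user_ids.map { |id| users_by_id[id][:name] }
end

users = [{ id: 1, name: "alice" }, { id: 2, name: "bob" }, { id: 3, name: "carol" }]
slow_db = FakeDB.new(users)
fast_db = FakeDB.new(users)
names_n_plus_one(slow_db, [1, 2, 3])
names_batched(fast_db, [1, 2, 3])
puts "N+1: #{slow_db.query_count} queries, batched: #{fast_db.query_count} query"
```

Both methods return the same names; only the number of round trips differs, which is what matters when the benchmarker sends requests at an intensive rate.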
  9. (Almost) Everything is permitted • Add effective RDBMS indexes • Kill N+1 SQL queries • Before: user_ids.each {|id| query("select * from users where id = ?", id) } • After: query("select * from users where id in (?)", user_ids) • Replace suboptimal algorithms • Utilize server resources (CPU, memory) to the last drop • Add Puma threads/processes • Cache • Upgrade to ruby/ruby master • (Adding VMs and scaling VMs up are prohibited) • But where should we start?
  10. Hack around with Ruby code • Run a benchmark, get your score • See profiling results, think about what to do next
  11. Dancing with Ruby • You'll be given 500+ lines of Sinatra code • and need to make it really fast within 8 hours • This is where Ruby really shines
  12. The mighty Array, Hash, Enumerable • #map • #each_with_object • #sort_by! • Almost anything is possible • Comes in really handy when tackling N+1 queries (without ActiveRecord)
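
As a small sketch of how these methods replace an ORM-style join: rows fetched with two queries are combined in memory. The row shapes and the `attach_authors` helper are assumptions made up for this example, not code from the talk.

```ruby
# Attach each post's author without issuing one query per post.
# `posts` and `users` stand in for rows already fetched with two queries.
def attach_authors(posts, users)
  # Index users by id once, so each lookup below is a hash access.
  users_by_id = users.each_with_object({}) { |user, index| index[user[:id]] = user }

  result = posts.map { |post| post.merge(author: users_by_id[post[:user_id]]) }
  # Newest first; sort_by! reorders the freshly built array in place.
  result.sort_by! { |post| -post[:created_at] }
  result
end

posts = [
  { id: 10, user_id: 2, created_at: 100 },
  { id: 11, user_id: 1, created_at: 300 },
  { id: 12, user_id: 2, created_at: 200 },
]
users = [{ id: 1, name: "alice" }, { id: 2, name: "bob" }]

attach_authors(posts, users).each do |post|
  puts "#{post[:id]} by #{post[:author][:name]}"
end
```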
  13. Monkey patching • Need something in the Standard Library? Build it on site! • binding.irb/pry in production • Debugging workflow: • Stop real benchmark requests • Write code in binding.irb/pry using real requests • Confirm it works • Copy to editor and save • (Note: I don't do this at work - just a competition technique 😁)
  14. • This... happens • It's fixable, no worries • Let's hope RBS, TypeProf and other projects solve this problem
  15. Performance Challenges in Ruby • Profiling the accurate bottlenecks • Utilizing 100% CPU in Ruby • Achieving high concurrency in Ruby
  16. Don't guess, measure! • Random improvements will take you nowhere • Spending time fixing non-problems is the last thing you want to do in an 8-hour timeframe • Accurate profiling is the key
  17. Profiler choices • Tracing profilers • Track everything, but with a huge performance impact • ruby-prof, based on TracePoint • Sampling profilers • Collect samples every 10-100ms, with a small performance impact • Stackprof (cpu, wall, memory) • based on the rb_profile_frames() API • rbspy (wall) • runs as a separate process and reads Ruby memory (process_vm_readv(2))
  18. Let's profile! • [Flamegraph: benchmark window (60 seconds); time spent in a particular handler, POST /api/condition/... (space for optimization?)]
  19. Flamegraph source: rb_profile_frames() • Stackprof utilizes the rb_profile_frames() API • Returns the call stack that was running when rb_profile_frames() was called • Stackprof calls rb_profile_frames() on SIGPROF (cpu mode) or SIGALRM (wall mode) timers • [Diagram: timeline of a(), b() and c() calls; 📸 rb_profile_frames() records call stack [a(), b()]]
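
To make the link between samples and a flamegraph concrete, here is a minimal sketch of the aggregation step. The sample stacks are invented for illustration (a real profiler would obtain them from rb_profile_frames()), and `fold_stacks` and `inclusive_counts` are hypothetical helper names, not Stackprof functions.

```ruby
# Each sample is the call stack seen at one timer tick, outermost frame first.
# These stacks are made up for illustration.
SAMPLES = [
  %w[a],
  %w[a b],
  %w[a b],
  %w[a b c],
  %w[a],
].freeze

# Count identical stacks; a flamegraph draws each stack with a width
# proportional to its count.
def fold_stacks(samples)
  samples.map { |frames| frames.join(";") }.tally
end

# Number of samples in which a frame appears anywhere on the stack.
def inclusive_counts(samples)
  samples.flat_map(&:uniq).tally
end

fold_stacks(SAMPLES).each { |stack, count| puts "#{stack} #{count}" }
```

Because the picture is built only from what was sampled, a stack that is never captured simply does not appear in it.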
  20. • rb_profile_frames (Stackprof) is inaccurate when I/O comes into action • [Flamegraph: burn_cpu, Thread#join; no io!?]
  21. Multithreading in Ruby / GVL • Only 1 Thread can use the CPU at a time • due to the Global VM Lock (GVL) • I/O (file reads/writes, network access, ...) can be performed in the background • [Diagram: Threads 1-3 taking turns: Acquire GVL, Use CPU, Release GVL (Do I/O), Wait]
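
The behaviour on this slide can be observed with a short timing sketch. It assumes CRuby; `elapsed_with_threads` is a hypothetical helper, and `sleep` stands in for real I/O because it also releases the GVL.

```ruby
# Run `work` in `count` threads and return the elapsed wall-clock seconds.
def elapsed_with_threads(count, &work)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Array.new(count) { Thread.new(&work) }.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# sleep stands in for I/O: like a blocking read, it releases the GVL.
io_time = elapsed_with_threads(4) { sleep 0.2 }

# A pure-Ruby loop holds the GVL while it runs.
burn = -> { 2_000_000.times { |i| i * i } }
cpu_one  = elapsed_with_threads(1, &burn)
cpu_four = elapsed_with_threads(4, &burn)

puts format("4 threads sleeping 0.2s each: %.2fs", io_time)
puts format("CPU work, 1 thread: %.2fs, 4 threads: %.2fs", cpu_one, cpu_four)
```

On CRuby the four sleeping threads should finish in roughly 0.2s rather than 0.8s, while the four CPU-bound threads should take several times as long as one. Exact numbers depend on the machine, so none are shown here.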
  22. Issues in rb_profile_frames() • The current implementation of rb_profile_frames() returns information about the last active Thread (the one that held the GVL) • Threads doing I/O have a low chance of being sampled • Statistics (flamegraphs) built from repeated rb_profile_frames() calls may be inaccurate, especially when many Threads are doing I/O • (even in wall mode!) • [Diagram: Threads 1-3 alternating Acquire GVL and Release GVL (Do I/O); 💥 = stack frame peeked by rb_profile_frames()]
  23. Issues in rb_profile_frames() • Proposal: Add rb_thread_profile_frames() API, a per-thread version of rb_profile_frames() • Accepts VALUE thread as arg • https://github.com/ruby/ruby/pull/7784 • [Diagram: Threads 1-3 alternating Acquire GVL and Release GVL (Do I/O); 💥 = stack frame peeked by rb_profile_frames()]
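
The idea of sampling each thread separately can be sketched in pure Ruby. This is only an analogy: it uses Thread#backtrace_locations rather than the proposed C API, and `wait_for_io`, `burn_cpu` and `sample_threads` are hypothetical names for this example.

```ruby
# Pure-Ruby sketch of per-thread sampling. Thread#backtrace_locations is a
# Ruby-level stand-in here, not the C API proposed in the talk.
def wait_for_io(started)
  started << :ready # tell the sampler we are inside this method
  sleep             # stands in for a blocking read; releases the GVL
end

def burn_cpu(started)
  started << :ready
  loop { 1 + 1 }
end

# Return { thread_name => [method names, innermost first] } for each thread.
def sample_threads(threads)
  threads.to_h do |thread|
    frames = thread.backtrace_locations || [] # nil once a thread has finished
    [thread.name, frames.map(&:base_label)]
  end
end

started = Queue.new
io_thread  = Thread.new { wait_for_io(started) }
cpu_thread = Thread.new { burn_cpu(started) }
io_thread.name = "io"
cpu_thread.name = "cpu"
2.times { started.pop } # wait until both threads are inside their method

sample_threads([io_thread, cpu_thread]).each do |name, labels|
  puts "#{name}: #{labels.join(' < ')}"
end

[io_thread, cpu_thread].each(&:kill)
```

Asking every thread for its stack means the thread blocked in `wait_for_io` shows up in each sample even though it does not hold the GVL.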
  24. Performance Challenges in Ruby • Profiling the accurate bottlenecks • Utilizing 100% CPU in Ruby • Achieving high concurrency in Ruby
  25. Utilizing 100% CPU in Ruby • Only 1 Thread can be active at a time due to the GVL • CPU tasks and I/O can run simultaneously, but conditions are not always ideal • Resources are very limited in ISUCON; we want to squeeze everything out! • [Screenshots: 4 Ruby processes with 32 threads aren't enough to burn all the CPU; what we want to see is 1 process with 8 threads in Go]
  26. Reducing GVL wait • Fewer Threads = fewer races for the GVL • OK, let's reduce Threads! • In the ISUCON webapp case, Puma is creating the Threads • We can create more processes in place of Threads to keep the total number of workers the same
  27. Tuning Puma for 100% CPU • Adding server (Puma) processes is effective, but consumes more memory • Memory is precious in ISUCON as MySQL lives in the same VM • [Diagram: a spectrum from More Processes (better CPU utilization but more memory) to More Threads (less memory consumption but suboptimal CPU utilization)]
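
The trade-off above can be sketched as a search over process/thread splits that keep the total worker count fixed. The memory figures are made-up placeholders and both helpers are hypothetical; in Puma the two resulting numbers would correspond to the `workers` and `threads` settings.

```ruby
# All (processes, threads-per-process) splits that keep `total` workers.
def worker_splits(total)
  (1..total)
    .select { |processes| (total % processes).zero? }
    .map { |processes| { processes: processes, threads: total / processes } }
end

# Keep only the splits whose estimated memory fits the budget.
# per_process_mb and budget_mb are made-up inputs; measure the real app.
def splits_within_budget(total, per_process_mb:, budget_mb:)
  worker_splits(total).select { |split| split[:processes] * per_process_mb <= budget_mb }
end

splits_within_budget(32, per_process_mb: 150, budget_mb: 1500).each do |split|
  puts "#{split[:processes]} processes x #{split[:threads]} threads"
end
```

Among the splits that fit in memory, the one with the most processes has the fewest threads competing for each GVL, which makes it a reasonable first candidate to benchmark.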
  28. wip: Finding the process/thread balance using GVL stats • GVL event hooks were added in Ruby 3.2 • RUBY_INTERNAL_THREAD_EVENT_* • Shopify/gvltools, ivoanjo/gvl-tracing • I integrated this into profiler results • Usable for tuning # of Puma threads? • [Flowchart: Try adding Puma threads → Check GVL wait → Low enough: Perfect! / Somewhat high: Reduce threads, add processes]
  29. Performance Challenges in Ruby • Profiling the accurate bottlenecks • Utilizing 100% CPU in Ruby • Achieving high concurrency in Ruby
  30. Higher concurrency? • Falcon (socketry/async series) • Event-loop based async I/O for Ruby! • Sadly, we couldn't rewrite everything in 8 hours • TruffleRuby • ISUCON VMs didn't have enough memory for TruffleRuby 😔 • Arming Ractors for fewer GVL waits • Maybe my next challenge!
  31. Ruby vs. Go • Go: Goroutines • A kind of lightweight thread - this system makes high concurrency very easy • Go: Concurrency deeply embedded in the ecosystem • Contexts: Ability to cancel MySQL queries that are no longer needed • Timeouts: Same
  32. Wrapping up • It's fun to write Ruby 😉 • Profiling is important! • But it's more important to check if those profiles are accurate • Let's do ISUCON!