Making Ruby Fast(er)

Blue Ridge Ruby 2023

Ruby has always been a developer-friendly language, but that has often come at the expense of performance. The past several years have seen massive performance gains in CRuby, JRuby, and TruffleRuby. Long-held assumptions about Ruby’s performance ceiling have been challenged and up-ended. In this talk, we’ll look at improvements in the virtual machine and the introduction of JIT compilers to better understand what’s making your Ruby code run fast(er).

Kevin Menard

June 08, 2023

Transcript

  1. Overview

    How CPUs run software
    Crash course in computer architecture
    What compilers are fundamentally doing
    Say “hello” to assembly
    How JIT compilers can make Ruby code run as fast as C
    Tying it all together
  2. Instructions

    Tell the CPU what to do
    Loaded into memory when a program is executed
    Operate on data
    Values coming from and going to the outside world
  3. x86_64 MUL (Unsigned Multiply)

    Forms

    Operand Size   Source 1   Source 2   Destination
    Byte           AL         r/m8       AX
    Word           AX         r/m16      DX:AX
    Doubleword     EAX        r/m32      EDX:EAX
    Quadword       RAX        r/m64      RDX:RAX

    Flags Affected

    The OF and CF flags are set to 0 if the upper half of the result is 0; otherwise, they are set to 1. The SF, ZF, AF, and PF flags are undefined.
  4. x86_64 MUL (Unsigned Multiply)

        IF (Byte operation)
          THEN
            AX := AL ∗ SRC;
          ELSE (* Word or doubleword operation *)
            IF OperandSize = 16
              THEN
                DX:AX := AX ∗ SRC;
              ELSE IF OperandSize = 32
                THEN
                  EDX:EAX := EAX ∗ SRC; FI;
              ELSE (* OperandSize = 64 *)
                RDX:RAX := RAX ∗ SRC;
            FI;
        FI;
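The pseudocode above can be modelled in a few lines of Ruby. This is only a sketch: `unsigned_mul` is a hypothetical helper, not anything a CPU or Ruby provides, and it returns the two halves of the double-width product (for example DX and AX) along with the shared OF/CF flag value.

```ruby
# Hypothetical model of x86_64 MUL. operand_bits is 8, 16, 32, or 64.
def unsigned_mul(accumulator, source, operand_bits)
  mask = (1 << operand_bits) - 1
  product = (accumulator & mask) * (source & mask)

  low  = product & mask                     # lower half (e.g. AL, AX, EAX, RAX)
  high = (product >> operand_bits) & mask   # upper half (e.g. AH, DX, EDX, RDX)

  # OF and CF are 0 when the upper half is 0; otherwise they are 1.
  { high: high, low: low, overflow_carry: high.zero? ? 0 : 1 }
end

p unsigned_mul(10, 20, 8)   # product fits in the lower byte
p unsigned_mul(200, 2, 8)   # product spills into the upper byte
```

Ruby integers are arbitrary precision, so the masking is what stands in for fixed-width registers here.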
  5. CISC vs RISC

    CISC: Complex Instruction Set Computer
      Intel x86, AMD64
    RISC: Reduced Instruction Set Computer
      Think Apple Silicon, Graviton, RISC-V, ARM, and most mobile CPUs
      Most new ISAs are RISC
    Key differences
      Register number and use
      Scope of instructions
      How we address data in instructions
        Register, memory address, immediate (constant) value, etc.
  6. Special Registers

    PC - Program Counter
      Sometimes called: IP - Instruction Pointer
      Holds address of next instruction to execute
    SP - Stack Pointer
      Holds address of the top of the stack
      Efficiently allows for storing and removing values in RAM
    FP - Frame Pointer
      Called the BP - Base Pointer on x86_64
      Holds address for the start of the stack frame
      Allows functions to quickly clean up after themselves
  7. Machine Code

    Sometimes called native code
    Binary representation of instructions
      Encoded using the ISA’s opcode table
      That’s why applications are called binaries
    You could hand-write this if you wanted
      It’s really tedious
      Sometimes necessary for microcontrollers
  8. Assembly Language (ASM)

    Low-level text-based programming language
    Fairly simple by virtue of having a limited set of operations
    Maps to ISA instructions
    Assembler: turns ASM into machine code
    Disassembler: decodes machine code back into ASM
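To make the assembler/disassembler round trip concrete, here is a minimal sketch for a toy instruction set. The opcode table is made up for illustration and has nothing to do with real x86_64 or ARM64 encodings; `assemble` and `disassemble` are hypothetical helpers.

```ruby
# Made-up opcode table for a toy ISA (an assumption, not a real encoding).
OPCODES = { "push" => 0x01, "add" => 0x02, "ret" => 0x03 }.freeze
MNEMONICS = OPCODES.invert.freeze

# Assembler: text in, bytes out. Only "push" takes a one-byte operand.
def assemble(source)
  source.lines.flat_map do |line|
    mnemonic, operand = line.split
    next [] if mnemonic.nil?   # skip blank lines

    bytes = [OPCODES.fetch(mnemonic)]
    bytes << Integer(operand) if mnemonic == "push"
    bytes
  end
end

# Disassembler: bytes in, text out.
def disassemble(bytes)
  listing = []
  index = 0
  while index < bytes.length
    mnemonic = MNEMONICS.fetch(bytes[index])
    if mnemonic == "push"
      listing << "push #{bytes[index + 1]}"
      index += 2
    else
      listing << mnemonic
      index += 1
    end
  end
  listing.join("\n")
end

machine_code = assemble("push 10\npush 20\nadd\nret")
p machine_code
puts disassemble(machine_code)
```

Both directions are driven by the same table, which is the sense in which assembly maps directly onto ISA instructions.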
  9. Compilers

    Change code in one language to another
    Sometimes split as:
      Transpiler: language -> language
      Compiler: language -> machine code
    We’ll focus on machine code generation
  10. Example: Addition (x86_64)

        add:
          push rbp
          mov rbp, rsp
          mov DWORD PTR [rbp-4], edi
          mov DWORD PTR [rbp-8], esi
          mov edx, DWORD PTR [rbp-4]
          mov eax, DWORD PTR [rbp-8]
          add eax, edx
          pop rbp
          ret
  11. Example: Addition

    x86-64 GCC 13.1 (Linux) ARM64 GCC 13.1 (Linux)

        add:
          push rbp
          mov rbp, rsp
          mov DWORD PTR [rbp-4], edi
          mov DWORD PTR [rbp-8], esi
          mov edx, DWORD PTR [rbp-4]
          mov eax, DWORD PTR [rbp-8]
          add eax, edx
          pop rbp
          ret

        def add(edi, esi)
          @stack.push(@rbp)
          @rbp = @rsp
          @stack.push(edi)
          @stack.push(esi)
          eax = @stack.pop
          edx = @stack.pop
          eax = eax + edx
          @rbp = @stack.pop
          return eax
        end
  15. Application Binary Interface (ABI)

    Platform-specific protocol for coordinating with a debugger
      Keep track of stack frames
      How to step through functions
      How to read function arguments
    Platform-specific protocol for laying out functions in ASM
      Also called calling convention
      How arguments are passed
      Where return value ends up
      Which registers can be used
        caller-saved
        callee-saved
        scratch
  17. Example: Addition (ARM64)

        add:
          sub sp, sp, #16
          str w0, [sp, 12]
          str w1, [sp, 8]
          ldr w1, [sp, 12]
          ldr w0, [sp, 8]
          add w0, w1, w0
          add sp, sp, 16
          ret
  18. Optimization: Addition

    x86-64 GCC 13.1 (Linux)

        add:
          lea eax, [rdi+rsi]
          ret

    ARM64 GCC 13.1 (Linux)

        add:
          add w0, w0, w1
          ret
  19. Virtual Machine

    Ruby code runs in a virtual machine (VM)
    An abstract computer that hides details about underlying system
      Hides details about memory layout, IO access, register sizes, etc.
      We can run the same program on any platform with VM
    Instead of executing machine code, we interpret VM code
      We call that part of the VM the interpreter
  20. Interpreter

    Parser turns source code into a structure the interpreter can process
      Removes comments, white space, punctuation, etc.
    Most common representations
      Abstract Syntax Tree (AST)
        AST interpreter
      Byte Code (BC) (e.g., CRuby’s YARV)
        BC interpreter
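A byte code interpreter is, at its core, a loop over instructions that manipulates a stack. The sketch below is a toy, not CRuby's implementation: `run_bytecode` is a hypothetical helper, and the opcode names are only loosely modelled on YARV.

```ruby
# Toy stack-based byte code interpreter (illustrative only).
# Each instruction is an array: [opcode, *operands].
def run_bytecode(instructions, locals = {})
  stack = []
  instructions.each do |opcode, *operands|
    case opcode
    when :getlocal
      stack.push(locals.fetch(operands[0]))
    when :putobject
      stack.push(operands[0])
    when :send
      method_name, argc = operands
      args = stack.pop(argc)      # arguments sit on top of the receiver
      receiver = stack.pop
      stack.push(receiver.public_send(method_name, *args))
    when :leave
      return stack.pop
    else
      raise ArgumentError, "unknown opcode: #{opcode}"
    end
  end
end

# Hand-written byte code for the body of `def add(a, b); a + b; end`.
add_body = [[:getlocal, :a], [:getlocal, :b], [:send, :+, 1], [:leave]]
p run_bytecode(add_body, { a: 10, b: 20 })
p run_bytecode(add_body, { a: "Hello ", b: "friend" })
```

Every `:send` goes through a full dynamic dispatch, which is the overhead the later caching and JIT techniques try to remove.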
  21. YARV

    CRuby’s instruction set
    You can see the generated YARV byte code with: ruby --dump=insns_without_opt

        > ruby --dump=insns_without_opt -e 'def add(a, b); a + b; end'
        == disasm:
        0000 definemethod :add, add ( 1)[Li]
        0003 putobject :add
        0005 leave

        == disasm:
        0000 getlocal a@0, 0 ( 1)[LiCa]
        0003 getlocal b@1, 0
        0006 send <calldata!mid:+, argc:1, ARGS_SIMPLE>, nil
        0009 leave [Re]
  23. YARV Optimization

    CRuby will apply some optimizations to the byte code
    You can see the generated YARV byte code with: ruby --dump=insns
    Sometimes take form of special instructions, like opt_plus

        > ruby --dump=insns -e 'def add(a, b); a + b; end'
        == disasm:
        0000 definemethod :add, add ( 1)[Li]
        0003 putobject :add
        0005 leave

        == disasm:
        0000 getlocal_WC_0 a@0 ( 1)[LiCa]
        0002 getlocal_WC_0 b@1
        0004 opt_plus <calldata!mid:+, argc:1, ARGS_SIMPLE>[CcCr]
        0006 leave [Re]
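The same kind of listing can also be obtained from inside a running program. A minimal sketch, assuming CRuby: `RubyVM::InstructionSequence` is CRuby-specific, so the code checks for it first, and the exact listing varies between Ruby versions.

```ruby
source = "def add(a, b); a + b; end"

if defined?(RubyVM::InstructionSequence)
  # Compile the source to YARV byte code and print a human-readable listing.
  listing = RubyVM::InstructionSequence.compile(source).disasm
  puts listing
  puts "uses opt_plus: #{listing.include?('opt_plus')}"
else
  puts "RubyVM::InstructionSequence is not available on this Ruby implementation"
end
```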
  25. VM Profiler

    Monitors control and data flow
      How execution proceeds in your application
      How and what data moves through
    Measures how frequently functions are called
    Measures how frequently loops iterate
    Uses heuristics to determine when code is hot and should be compiled
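The simplest hotness heuristic is a call counter with a threshold. The sketch below is an assumption about the general shape, not how any particular Ruby VM does it: `HotnessProfiler` is hypothetical and the threshold of 3 is arbitrary.

```ruby
# Hypothetical call-count profiler.
class HotnessProfiler
  def initialize(threshold: 3)
    @threshold = threshold
    @call_counts = Hash.new(0)
  end

  # Record one call. Returns true exactly once: when the method becomes hot.
  def record_call(method_name)
    @call_counts[method_name] += 1
    @call_counts[method_name] == @threshold
  end

  def hot?(method_name)
    @call_counts[method_name] >= @threshold
  end
end

profiler = HotnessProfiler.new(threshold: 3)
5.times do |i|
  puts "call #{i + 1}: compile now? #{profiler.record_call(:add)}"
end
```

A real profiler would also track loop iterations and the types flowing through each call site, as the slide notes.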
  26. JIT Compiler

    Compiles a fragment of code (rather than whole application)
      Stores it in a region we call a code cache
    Common scopes
      Basic Block
        Fancy way of saying straight-line code
      Method
        A Ruby-level method (composed of basic blocks)
      Trace
        A flow of execution through multiple methods
    Once compiled, updates interpreter to jump to compiled code instead of interpreting that fragment
  27. Speculative Optimization

    What does this function do?

        def add(a, b)
          a + b
        end

    Add integers? add(10, 20)
    Concatenate strings? add('Hello ', 'friend')
    Append arrays? add([1, 2], [3, 4])
  28. Speculative Optimization

    Since the profiler knows the control and data flow, it can guess how your program will continue to operate
    The VM can rewrite its internal representation (IR) based on those guesses
      Called speculative optimization
      Sets up a fail safe for when that guess is wrong
  29. Method Lookup

    Nearly everything in Ruby is a method call
    To call a method, we need a reference to it
    We need to look up the method in a method table
    Methods can change at runtime
      New methods added at runtime
      Existing methods redefined or removed
      Inheritance hierarchy changes
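Ruby's reflection API is enough to imitate a full method lookup: walk the ancestor chain and check each method table in turn. `find_method_owner` is a hypothetical helper written for illustration; the VM does this internally in C.

```ruby
# Walk the ancestor chain the slow way and return the first class or module
# whose own method table contains the method, or nil if none does.
def find_method_owner(klass, method_name)
  klass.ancestors.find do |mod|
    mod.instance_methods(false).include?(method_name) ||
      mod.private_instance_methods(false).include?(method_name)
  end
end

p find_method_owner(Integer, :+)          # defined directly on Integer
p find_method_owner(Integer, :between?)   # found further up, in Comparable

# Methods can appear at runtime, so the answer can change between calls.
class Greeter; end
p find_method_owner(Greeter, :hello)      # nothing defines it yet
Greeter.define_method(:hello) { "hi" }
p find_method_owner(Greeter, :hello)      # now Greeter does
```

Doing this walk on every call is expensive, and the fact that the result can change at runtime is what makes caching it tricky.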
  30. Global Method Cache

    [class, method_name]   Function Pointer
    [Integer, :+]          <0x1234abcd>
    [String, :+]           <0xdef04321>
    [Array, :+]            <0x24680975>
    [Integer, :to_s]       <0x13579864>
    …                      …
  31. Cache miss requires full method lookup

    Must be careful to invalidate entries when necessary
    Practical considerations limit size
      LRU cache eviction policy
      May thrash if cache too small or many methods called
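A size-limited cache with LRU eviction can be sketched with a plain Hash, since Ruby hashes preserve insertion order. `GlobalMethodCache` here is a hypothetical illustration, with `instance_method` standing in for the full lookup; it is not CRuby's actual data structure.

```ruby
# Hypothetical global method cache keyed by [class, method_name].
class GlobalMethodCache
  attr_reader :misses

  def initialize(capacity)
    @capacity = capacity
    @entries = {}   # oldest (least recently used) entry comes first
    @misses = 0
  end

  def lookup(klass, method_name)
    key = [klass, method_name]
    if @entries.key?(key)
      found = @entries.delete(key)                    # hit: re-inserted below as newest
    else
      @misses += 1
      found = klass.instance_method(method_name)      # miss: full method lookup
      @entries.shift if @entries.size >= @capacity    # evict least recently used
    end
    @entries[key] = found
  end

  # Must be called when a method is redefined or removed.
  def invalidate(klass, method_name)
    @entries.delete([klass, method_name])
  end
end

cache = GlobalMethodCache.new(2)
cache.lookup(Integer, :+)   # miss
cache.lookup(Integer, :+)   # hit
cache.lookup(String, :+)    # miss
cache.lookup(Array, :+)     # miss, evicts [Integer, :+]
cache.lookup(Integer, :+)   # miss again: the cache is thrashing
puts "misses: #{cache.misses}"
puts cache.lookup(Integer, :+).bind_call(10, 20)
```

With a capacity of 2 and three call targets in rotation, almost every lookup misses, which is the thrashing the slide warns about.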
  32. Inline Cache (IC)

    VM modifies method body based on observed values
    Cache is scoped to a call site
    Registers a “cheap” predicate to check if cache can be used
      AKA a guard function
    If guard passes, use the cache
    Otherwise, transition the cache state
  33. Inline Cache States

    Uninitialized, Monomorphic, Polymorphic, Megamorphic

    Uninitialized
    Monomorphic
      One cache entry
    Polymorphic
      Multiple cache entries
    Megamorphic
      Remove cache because it’s not advantageous
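The state transitions can be sketched as a small class that represents one call site. This is an illustrative assumption rather than any VM's real design: `InlineCache` and its limit of two entries are made up, and invalidation on method redefinition is left out to keep it short.

```ruby
# Hypothetical inline cache for a single call site.
class InlineCache
  POLYMORPHIC_LIMIT = 2   # arbitrary; beyond this the cache gives up

  attr_reader :state

  def initialize(method_name)
    @method_name = method_name
    @entries = {}            # receiver class => UnboundMethod
    @state = :uninitialized
  end

  def call(receiver, *args)
    # Megamorphic: no cache, always dispatch the slow way.
    return receiver.public_send(@method_name, *args) if @state == :megamorphic

    # Guard: is the receiver's class one we have already seen?
    target = @entries[receiver.class] || record_miss(receiver.class)
    return receiver.public_send(@method_name, *args) if target.nil?

    target.bind_call(receiver, *args)
  end

  private

  # Transition the cache state after a guard failure.
  def record_miss(klass)
    if @entries.size >= POLYMORPHIC_LIMIT
      @entries.clear
      @state = :megamorphic
      return nil
    end
    @entries[klass] = klass.instance_method(@method_name)
    @state = @entries.size == 1 ? :monomorphic : :polymorphic
    @entries[klass]
  end
end

plus_site = InlineCache.new(:+)
[[10, 20], ["hello ", "good people"], [[1, 2], [3, 4]]].each do |a, b|
  result = plus_site.call(a, b)
  puts "#{result.inspect} (cache is now #{plus_site.state})"
end
```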
  34. Monomorphic Inline Cache

        def type_ok?(obj, klass)
          obj.class == klass && !VM.has_changed?(klass)
        end

        def add_monomorphic(a, b)
          if type_ok?(a, Integer) && type_ok?(b, Integer)
            m = Integer.instance_method(:+)
            m.bind_call(a, b)
          else
            handle_miss!
          end
        end
  35. Polymorphic Inline Cache (PIC)

        def add(a, b)
          a + b
        end

        add(10, 20)
        add('hello ', 'good people')
  36. Polymorphic Inline Cache (PIC)

        def add_polymorphic(a, b)
          if type_ok?(a, Integer) && type_ok?(b, Integer)
            m = Integer.instance_method(:+)
            m.bind_call(a, b)

          elsif type_ok?(a, String) && type_ok?(b, String)
            m = String.instance_method(:+)
            m.bind_call(a, b)

          else
            handle_miss!
          end
        end
  37. Megamorphic

        def add(a, b)
          a + b
        end

        add(10, 20)
        add('hello ', 'good people')
        add([1, 2], [3, 4])
        add(10, 20.0)
  38. Megamorphic

        def add_megamorphic(a, b)
          # Look up method the slow way.
          # The VM may update the Global Method Cache.
          m = VM.lookup_method([a.class, :+])

          m.bind_call(a, b)
        end
  39. JIT Compile Inline Cache

    Take that internal VM state and turn it into machine code
    Speculative optimization that a and b types are stable & the method isn’t redefined
    The machine code can optimize for the specialized operation

        add:
          cmp [rdi + 0x20], 0xfe826359 ; Check if a.class is Integer
          jne 0x12344321               ; Deoptimize if not an Integer

          cmp [rsi + 0x20], 0xfe826359 ; Check if b.class is Integer
          jne 0x12344321               ; Deoptimize if not an Integer

          mov rax, rdi                 ; Copy `a` into RAX for addition
          add rax, rsi                 ; Perform `a + b`

          jo 0x67899876                ; Handle potential overflow

          ret
  41. Deoptimization

    Recovers from bad guesses
    Throws away compiled code fragment
    Updates interpreter to resume interpreting that code
    Resets the profiler to start profiling again
    Optionally makes note about bad optimization decisions to avoid repeating deopt loops
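The recovery steps above can be sketched with a toy VM in which a lambda stands in for compiled machine code. Everything here is hypothetical (`ToyVM`, `DeoptimizationRequired`, the counters); real deoptimization also has to rebuild interpreter frames, which this sketch ignores.

```ruby
# Hypothetical toy VM: a lambda in the code cache plays the role of JIT output.
class ToyVM
  DeoptimizationRequired = Class.new(StandardError)

  attr_reader :code_cache, :deopt_count

  def initialize
    @code_cache = {}
    @deopt_count = 0
  end

  # "Compile" add, speculating that both arguments are Integers.
  def compile_add
    @code_cache[:add] = lambda do |a, b|
      raise DeoptimizationRequired unless a.is_a?(Integer) && b.is_a?(Integer)
      a + b
    end
  end

  # Generic slow path that works for any receiver.
  def interpret_add(a, b)
    a.public_send(:+, b)
  end

  def add(a, b)
    compiled = @code_cache[:add]
    return interpret_add(a, b) unless compiled

    compiled.call(a, b)
  rescue DeoptimizationRequired
    @code_cache.delete(:add)   # throw away the compiled fragment
    @deopt_count += 1          # note the bad guess to avoid deopt loops
    interpret_add(a, b)        # resume in the interpreter
  end
end

vm = ToyVM.new
vm.compile_add
p vm.add(10, 20)               # guard passes, fast path
p vm.add("Hello ", "friend")   # guard fails, deoptimizes
p vm.code_cache.key?(:add)     # compiled fragment is gone
```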
  42. Method Inlining

        # Real implementation of empty? in TruffleRuby.
        class Array
          def empty?
            size == 0
          end
        end

        # Our method before inlining.
        def blank?(o)
          o.nil? || o.empty?
        end

        # Our method after inlining.
        def blank_after_inlining?(o)
          o.nil? || o.size == 0
        end
  43. Escape Analysis

    The array never escapes
      It stays within min?
      No references to it appear anywhere else
    The JIT compiler could eliminate the array allocation

        def min?(value)
          [value, 1000].min == value
        end
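To see what eliminating the allocation buys, it helps to write out by hand what the method reduces to. This is a sketch of the idea for Integer inputs only; `min_without_allocation?` is a hypothetical name, and a real JIT would still need guards in case `Array#min` or `Integer#<=` were redefined.

```ruby
def min?(value)
  [value, 1000].min == value   # allocates a two-element array on every call
end

# Conceptually what remains once the temporary array is optimized away,
# assuming value is an Integer.
def min_without_allocation?(value)
  value <= 1000
end

[-5, 999, 1000, 1001].each do |value|
  puts "#{value}: #{min?(value)} / #{min_without_allocation?(value)}"
end
```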
  44. Eliminate Metaprogramming Overhead

    send
      "abc".send(:size) is the same as "abc".size
    method_missing
      Implicitly call define_method so calls are fast
    respond_to?
      Can be made constant with inline cache
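The method_missing item describes something the VM can do behind the scenes, but the same trick can be written by hand, which may make the idea clearer. In this sketch `Settings` is a made-up class: the first call to an unknown method goes through `method_missing`, which defines a real method so later calls dispatch normally.

```ruby
# Hypothetical example class, not part of any library.
class Settings
  def initialize(values)
    @values = values
  end

  def method_missing(name, *args)
    return super unless @values.key?(name)

    # Define a real method on this object so later calls skip method_missing.
    define_singleton_method(name) { @values.fetch(name) }
    public_send(name)
  end

  def respond_to_missing?(name, include_private = false)
    @values.key?(name) || super
  end
end

settings = Settings.new(theme: "dark")
p settings.singleton_methods   # nothing defined yet
p settings.theme               # first call goes through method_missing
p settings.singleton_methods   # now a real method exists
p "abc".send(:size) == "abc".size
```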
  45. JIT Compilers Recap

    Conceptually simple
      Take your Ruby code and transform it to optimized machine code
    Faster than interpreting
      But incur a warm-up cost before hitting peak performance
    Optimize for the values flowing through your program
      Speculative optimizations could be faster than AOT