Slide 1

Slide 1 text

Making Ruby Fast(er) Kevin Menard 2023-06-08 [email protected]

Slide 2

Slide 2 text

Flag Me Down  [email protected]  @nirvdrum  @[email protected]  @nirvdrum

Slide 3

Slide 3 text

Motivation

Slide 4

Slide 4 text

Ruby Implementations with JIT Compilers CRuby (MRI) JRuby TruffleRuby

Slide 5

Slide 5 text

Overview How CPUs run software Crash course in computer architecture What compilers are fundamentally doing Say “hello” to assembly How JIT compilers can make Ruby code run as fast as C Tying it all together

Slide 6

Slide 6 text

1. How CPUs Run Software

Slide 7

Slide 7 text

Image © FutureLearn. Available at https://www.futurelearn.com/info/courses/how-computers-work/0/steps/49283 Basic Computer Architecture

Slide 8

Slide 8 text

Instructions Tell the CPU what to do Loaded into memory when a program is executed Operate on data Values coming from and going to the outside world

Slide 9

Slide 9 text

Instruction Set Architecture (ISA)

Slide 10

Slide 10 text

x86_64 32-bit Registers

Slide 11

Slide 11 text

x86_64 Numeric Data Types

Slide 12

Slide 12 text

x86_64 MUL (Unsigned Multiply) Overview

Slide 13

Slide 13 text

x86_64 MUL (Unsigned Multiply) Forms

Operand Size   Source 1   Source 2   Destination
Byte           AL         r/m8       AX
Word           AX         r/m16      DX:AX
Doubleword     EAX        r/m32      EDX:EAX
Quadword       RAX        r/m64      RDX:RAX

Flags Affected
The OF and CF flags are set to 0 if the upper half of the result is 0; otherwise, they are set to 1. The SF, ZF, AF, and PF flags are undefined.

Slide 14

Slide 14 text

x86_64 MUL (Unsigned Multiply)

IF (Byte operation)
    THEN
        AX := AL ∗ SRC;
    ELSE (* Word or doubleword operation *)
        IF OperandSize = 16
            THEN
                DX:AX := AX ∗ SRC;
            ELSE IF OperandSize = 32
                THEN EDX:EAX := EAX ∗ SRC; FI;
            ELSE (* OperandSize = 64 *)
                RDX:RAX := RAX ∗ SRC;
        FI;
FI;
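The MUL semantics above can be modeled in a few lines of Ruby. This is only a sketch: the method name `mul` and the hash it returns are made up for illustration, and the registers are represented as plain integers rather than real CPU state.

```ruby
# Model of unsigned MUL: multiply the accumulator by a source operand and
# split the double-width product into a high half and a low half.
# `mul` and its return shape are illustrative, not a real API.
def mul(accumulator, source, operand_size)
  mask = (1 << operand_size) - 1 # e.g. 0xFFFFFFFF for 32 bits
  product = (accumulator & mask) * (source & mask)

  low  = product & mask                   # lands in AL / AX / EAX / RAX
  high = (product >> operand_size) & mask # lands in AH / DX / EDX / RDX

  # CF and OF are cleared when the upper half is zero, set otherwise.
  overflow = high != 0

  { high: high, low: low, cf: overflow, of: overflow }
end

p mul(0xFFFF_FFFF, 2, 32)
```

A 32-bit multiply of 0xFFFFFFFF by 2 does not fit in 32 bits, so the high half (EDX on real hardware) is non-zero and both flags are set.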

Slide 15

Slide 15 text

x86_64 MUL (Unsigned Multiply) Opcode

Slide 16

Slide 16 text

ISA Classification: CISC vs RISC

Slide 17

Slide 17 text

CISC vs RISC
CISC: Complex Instruction Set Computer
  Intel x86, AMD64
RISC: Reduced Instruction Set Computer
  Think Apple Silicon, Graviton, RISC-V, ARM, and most mobile CPUs
  Most new ISAs are RISC
Key differences
  Register number and use
  Scope of instructions
  How we address data in instructions
    Register, memory address, immediate (constant) value, etc.

Slide 18

Slide 18 text

Special Registers
PC - Program Counter
  Sometimes called: IP - Instruction Pointer
  Holds address of next instruction to execute
SP - Stack Pointer
  Holds address of the top of the stack
  Efficiently allows for storing and removing values in RAM
FP - Frame Pointer
  Called the BP - Base Pointer on x86_64
  Holds address for the start of the stack frame
  Allows functions to quickly clean up after themselves

Slide 19

Slide 19 text

2. What Compilers are Fundamentally Doing

Slide 20

Slide 20 text

Machine Code Sometimes called native code Binary representation of instructions Encoded using the ISA’s opcode table That’s why applications are called binaries You could hand-write this if you wanted It’s really tedious Sometimes necessary for microcontrollers

Slide 21

Slide 21 text

Assembly Language (ASM) Low-level text-based programming language Fairly simple by virtue of having a limited set of operations Maps to ISA instructions Assembler: turns ASM into machine code Disassembler: decodes machine code back into ASM
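To make the assembler/disassembler relationship concrete, here is a toy sketch in Ruby. The four one-byte opcodes belong to an imaginary ISA invented for this example; real opcode tables (such as x86_64's) are much larger and use variable-length encodings.

```ruby
# Hypothetical 1-byte opcodes for an imaginary ISA.
OPCODES = { "push" => 0x01, "pop" => 0x02, "add" => 0x03, "ret" => 0x04 }.freeze
MNEMONICS = OPCODES.invert.freeze

# Assembler: text mnemonics in, binary machine code out.
def assemble(source)
  source.lines.map(&:strip).reject(&:empty?).map { |m| OPCODES.fetch(m) }.pack("C*")
end

# Disassembler: binary machine code in, text mnemonics out.
def disassemble(machine_code)
  machine_code.bytes.map { |byte| MNEMONICS.fetch(byte) }.join("\n")
end

binary = assemble("push\npush\nadd\nret\n")
p binary.bytes
puts disassemble(binary)
```

Because the mapping is one-to-one in both directions, disassembling the output of the assembler gives back the original mnemonics.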

Slide 22

Slide 22 text

Compilers Change code in one language to another Sometimes split as: Transpiler: language -> language Compiler: language -> machine code We’ll focus on machine code generation

Slide 23

Slide 23 text

Time for Code

Slide 24

Slide 24 text

Example: Addition int add(int a, int b) { return a + b; }

Slide 25

Slide 25 text

Example: Addition (x86_64)

add:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov DWORD PTR [rbp-8], esi
  mov edx, DWORD PTR [rbp-4]
  mov eax, DWORD PTR [rbp-8]
  add eax, edx
  pop rbp
  ret

Slide 26

Slide 26 text

Example: Addition
x86-64 GCC 13.1 (Linux) ARM64 GCC 13.1 (Linux)

add:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov DWORD PTR [rbp-8], esi
  mov edx, DWORD PTR [rbp-4]
  mov eax, DWORD PTR [rbp-8]
  add eax, edx
  pop rbp
  ret

def add(edi, esi)
  @stack.push(@rbp)
  @rbp = @rsp
  @stack.push(edi)
  @stack.push(esi)
  eax = @stack.pop
  edx = @stack.pop
  eax = eax + edx
  @rbp = @stack.pop
  return eax
end

Slide 30

Slide 30 text

Application Binary Interface (ABI)
Platform-specific protocol for coordinating with a debugger
  Keep track of stack frames
  How to step through functions
  How to read function arguments
Platform-specific protocol for laying out functions in ASM
  Also called calling convention
  How arguments are passed
  Where return value ends up
  Which registers can be used
    caller-saved
    callee-saved
    scratch
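As one concrete calling convention, the System V AMD64 ABI used on Linux passes the first six integer arguments in RDI, RSI, RDX, RCX, R8, and R9 and returns integer results in RAX, which is why the addition example reads `edi` and `esi` and leaves its result in `eax`. The Ruby sketch below only models that assignment; `place_arguments` is a hypothetical helper, and floating-point and struct arguments are ignored.

```ruby
# Integer argument registers, in order, under the System V AMD64 calling
# convention. The return value goes in RAX.
ARGUMENT_REGISTERS = %i[rdi rsi rdx rcx r8 r9].freeze
RETURN_REGISTER = :rax

# Hypothetical helper: describes where each integer argument would be placed.
# Arguments beyond the sixth are passed on the stack.
def place_arguments(args)
  in_registers = ARGUMENT_REGISTERS.zip(args).first(args.size).to_h
  on_stack = args.drop(ARGUMENT_REGISTERS.size)
  { registers: in_registers, stack: on_stack }
end

p place_arguments([10, 20])
```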

Slide 31

Slide 31 text

Example: Addition (x86_64)

add:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov DWORD PTR [rbp-8], esi
  mov edx, DWORD PTR [rbp-4]
  mov eax, DWORD PTR [rbp-8]
  add eax, edx
  pop rbp
  ret

Slide 32

Slide 32 text

Example: Addition (ARM64)

add:
  sub sp, sp, #16
  str w0, [sp, 12]
  str w1, [sp, 8]
  ldr w1, [sp, 12]
  ldr w0, [sp, 8]
  add w0, w1, w0
  add sp, sp, 16
  ret

Slide 33

Slide 33 text

Example: Addition

Slide 34

Slide 34 text

Optimization

Slide 35

Slide 35 text

Optimization: Addition

x86-64 GCC 13.1 (Linux)
add:
  lea eax, [rdi+rsi]
  ret

ARM64 GCC 13.1 (Linux)
add:
  add w0, w0, w1
  ret

Slide 36

Slide 36 text

https://godbolt.org/

Slide 37

Slide 37 text

3. JIT Compilers and Ruby

Slide 38

Slide 38 text

AOT vs JIT Compilation

Slide 39

Slide 39 text

Virtual Machine Ruby code runs in a virtual machine (VM) An abstract computer that hides details about the underlying system Hides details about memory layout, IO access, register sizes, etc. We can run the same program on any platform with a VM Instead of executing machine code, we interpret VM code We call that part of the VM the interpreter

Slide 40

Slide 40 text

Interpreter Parser turns source code into a structure the interpreter can process Removes comments, white space, punctuation, etc. Most common representations Abstract Syntax Tree (AST) AST interpreter Byte Code (BC) (e.g., CRuby’s YARV) BC interpreter
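The two representations can be compared with a toy example. The sketch below evaluates `a + b` once as a tree and once as a flat instruction list for a stack machine. The node and instruction names are invented for this example and are far simpler than what CRuby actually uses.

```ruby
# AST form of `a + b`: nested arrays of [node_type, *children].
AST = [:add, [:local, :a], [:local, :b]].freeze

# AST interpreter: recursively walks the tree.
def eval_ast(node, locals)
  type, *children = node
  case type
  when :local then locals.fetch(children[0])
  when :add   then eval_ast(children[0], locals) + eval_ast(children[1], locals)
  else raise ArgumentError, "unknown node: #{type}"
  end
end

# Byte-code form of the same expression for a stack machine.
BYTECODE = [[:getlocal, :a], [:getlocal, :b], [:plus], [:leave]].freeze

# Byte-code interpreter: a loop over instructions that pushes and pops values.
def run_bytecode(instructions, locals)
  stack = []
  instructions.each do |op, operand|
    case op
    when :getlocal then stack.push(locals.fetch(operand))
    when :plus
      right = stack.pop
      left = stack.pop
      stack.push(left + right)
    when :leave then return stack.pop
    else raise ArgumentError, "unknown instruction: #{op}"
    end
  end
end

p eval_ast(AST, { a: 10, b: 20 })
p run_bytecode(BYTECODE, { a: 10, b: 20 })
```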

Slide 41

Slide 41 text

Sample AST

Slide 42

Slide 42 text

YARV
CRuby’s instruction set
You can see the generated YARV byte code with: ruby --dump=insns_without_opt

> ruby --dump=insns_without_opt -e 'def add(a, b); a + b; end'
== disasm:
0000 definemethod :add, add ( 1)[Li]
0003 putobject :add
0005 leave

== disasm:
0000 getlocal a@0, 0 ( 1)[LiCa]
0003 getlocal b@1, 0
0006 send , nil
0009 leave [Re]
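The same kind of dump is also available from inside a running program through `RubyVM::InstructionSequence`. This is a CRuby-specific API, so treat the snippet as a sketch that may not work on other Ruby implementations; the exact output also varies between CRuby versions.

```ruby
# CRuby only: compile a snippet to YARV byte code and print the disassembly.
iseq = RubyVM::InstructionSequence.compile("def add(a, b); a + b; end")
puts iseq.disasm
```

With default compile options this corresponds to the optimized byte code rather than the `insns_without_opt` form.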

Slide 44

Slide 44 text

YARV Optimization
CRuby will apply some optimizations to the byte code
You can see the generated YARV byte code with: ruby --dump=insns
Sometimes take form of special instructions, like opt_plus

> ruby --dump=insns -e 'def add(a, b); a + b; end'
== disasm:
0000 definemethod :add, add ( 1)[Li]
0003 putobject :add
0005 leave

== disasm:
0000 getlocal_WC_0 a@0 ( 1)[LiCa]
0002 getlocal_WC_0 b@1
0004 opt_plus [CcCr]
0006 leave [Re]

Slide 46

Slide 46 text

VM Profiler Monitors control and data flow How execution proceeds in your application How and what data moves through Measures how frequently functions are called Measures how frequently loops iterate Uses heuristics to determine when code is hot and should be compiled
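A call-count heuristic is the simplest way to picture how a profiler decides that code is hot. The class below is a hypothetical sketch: `CallProfiler` and its threshold of 3 are invented for illustration, and real VMs use much larger, carefully tuned thresholds and track far more than call counts.

```ruby
class CallProfiler
  HOT_THRESHOLD = 3 # hypothetical value; real VMs tune this

  def initialize
    @call_counts = Hash.new(0)
  end

  # Called by the interpreter each time a method is invoked.
  def record_call(method_name)
    @call_counts[method_name] += 1
  end

  # A method is "hot" once it has been called often enough to be worth compiling.
  def hot?(method_name)
    @call_counts[method_name] >= HOT_THRESHOLD
  end
end

profiler = CallProfiler.new
3.times { profiler.record_call(:add) }
p profiler.hot?(:add)
```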

Slide 47

Slide 47 text

JIT Compiler
Compiles a fragment of code (rather than whole application)
Stores it in a region we call a code cache
Common scopes
  Basic Block
    Fancy way of saying straight-line code
  Method
    A Ruby-level method (composed of basic blocks)
  Trace
    A flow of execution through multiple methods
Once compiled, updates interpreter to jump to compiled code instead of interpreting that fragment
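The hand-off between interpreter and compiled code can be sketched as a lookup in the code cache before falling back to interpretation. Everything here is a stand-in: `CodeCache` is a hypothetical class, and the "compiled code" is just a Ruby block rather than machine code.

```ruby
class CodeCache
  def initialize
    @compiled = {} # method name => "compiled" code (a block in this sketch)
  end

  # Stand-in for real compilation: store a block as the compiled fragment.
  def install(name, &fast_path)
    @compiled[name] = fast_path
  end

  # Jump to compiled code if present, otherwise run the interpreter path
  # given as a block. Returns which path ran along with the result.
  def call(name, *args)
    compiled = @compiled[name]
    return [:compiled, compiled.call(*args)] if compiled

    [:interpreted, yield(*args)]
  end
end

cache = CodeCache.new
p cache.call(:add, 1, 2) { |a, b| a + b }
cache.install(:add) { |a, b| a + b }
p cache.call(:add, 1, 2) { |a, b| a + b }
```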

Slide 48

Slide 48 text

Speculative Optimization
What does this function do?

def add(a, b)
  a + b
end

Add integers? add(10, 20)
Concatenate strings? add('Hello ', 'friend')
Append arrays? add([1, 2], [3, 4])

Slide 49

Slide 49 text

Speculative Optimization Since the profiler knows the control and data flow, it can guess how your program will continue to operate The VM can rewrite its internal representation (IR) based on those guesses Called speculative optimization Sets up a fail safe for when that guess is wrong

Slide 50

Slide 50 text

Critical Optimization: Method Lookup

Slide 51

Slide 51 text

Method Lookup Nearly everything in Ruby is a method call To call a method, we need a reference to it We need to look up the method in a method table Methods can change at runtime New methods added at runtime Existing methods redefined or removed Inheritance hierarchy changes
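Ruby's reflection API is enough to approximate the lookup from user code. The helper below, `find_method_owner`, is a hypothetical name; it walks the ancestor chain the way a full lookup conceptually does, but it only considers public and protected methods and is not how the VM implements lookup internally.

```ruby
# Walks the ancestor chain and returns the first class or module that
# defines `name`, or nil if none does.
def find_method_owner(klass, name)
  klass.ancestors.find { |mod| mod.instance_methods(false).include?(name) }
end

p find_method_owner(Integer, :+)        # defined directly on Integer
p find_method_owner(Integer, :between?) # inherited from further up the chain

# Methods can appear at runtime, so the answer can change between calls.
class Greeter; end
p find_method_owner(Greeter, :greet)
Greeter.define_method(:greet) { "hi" }
p find_method_owner(Greeter, :greet)
```

The last two calls return different answers for the same class and name, which is why any cached lookup result needs an invalidation story.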

Slide 52

Slide 52 text

Caching: The Agony and the Ecstasy

Slide 53

Slide 53 text

Global Method Cache

[class, method_name]   Function Pointer
[Integer, :+]          <0x1234abcd>
[String, :+]           <0xdef04321>
[Array, :+]            <0x24680975>
[Integer, :to_s]       <0x13579864>
…                      …

Slide 54

Slide 54 text

Cache miss requires full method lookup Must be careful to invalidate entries when necessary Practical considerations limit size LRU cache eviction policy May thrash if cache too small or many methods called
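A size-limited cache with LRU eviction can be sketched with a plain Ruby Hash, which preserves insertion order. The class name and capacity are hypothetical, `instance_method` stands in for the full method lookup, and invalidation on method redefinition is deliberately left out.

```ruby
class GlobalMethodCache
  def initialize(capacity)
    @capacity = capacity
    @entries = {} # Ruby hashes keep insertion order: oldest entry first
  end

  def fetch(klass, method_name)
    key = [klass, method_name]
    if @entries.key?(key)
      found = @entries.delete(key) # hit: re-insert below to mark as most recent
    else
      found = klass.instance_method(method_name)   # miss: full method lookup
      @entries.shift if @entries.size >= @capacity # evict least recently used
    end
    @entries[key] = found
  end

  def cached?(klass, method_name)
    @entries.key?([klass, method_name])
  end
end

cache = GlobalMethodCache.new(2)
cache.fetch(Integer, :+)
cache.fetch(String, :+)
cache.fetch(Integer, :+) # refreshes the Integer entry
cache.fetch(Array, :+)   # cache is full, so the String entry is evicted
p cache.cached?(String, :+)
```

With a capacity this small, a call site that alternates between three receiver classes would evict on almost every call, which is the thrashing the slide warns about.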

Slide 55

Slide 55 text

When in Doubt, Add Another Level

Slide 56

Slide 56 text

Inline Cache (IC) VM modifies method body based on observed values Cache is scoped to a call site Registers a “cheap” predicate to check if cache can be used AKA a guard function If guard passes, use the cache Otherwise, transition the cache state

Slide 57

Slide 57 text

Inline Cache States
Uninitialized
Monomorphic
  One cache entry
Polymorphic
  Multiple cache entries
Megamorphic
  Remove cache because it’s not advantageous
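The state transitions can be sketched as a small per-call-site object. `InlineCache` and its limit of two entries are hypothetical, the guard only checks the receiver's class, and invalidation when a method is redefined is not modeled.

```ruby
class InlineCache
  MAX_ENTRIES = 2 # hypothetical polymorphic limit

  attr_reader :state

  def initialize(method_name)
    @method_name = method_name
    @entries = {} # receiver class => UnboundMethod
    @state = :uninitialized
  end

  def call(receiver, *args)
    target =
      if @state == :megamorphic
        receiver.class.instance_method(@method_name) # no cache: full lookup
      else
        @entries[receiver.class] || handle_miss(receiver.class)
      end
    target.bind_call(receiver, *args)
  end

  private

  # On a miss, either grow the cache or give up and go megamorphic.
  def handle_miss(klass)
    target = klass.instance_method(@method_name)
    if @entries.size >= MAX_ENTRIES
      @entries.clear
      @state = :megamorphic
    else
      @entries[klass] = target
      @state = @entries.size == 1 ? :monomorphic : :polymorphic
    end
    target
  end
end

site = InlineCache.new(:+)
site.call(10, 20)
p site.state
site.call("hello ", "good people")
p site.state
site.call([1, 2], [3, 4])
p site.state
```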

Slide 58

Slide 58 text

Monomorphic Inline Cache

def add(a, b)
  a + b
end

add(10, 20)

Slide 59

Slide 59 text

Monomorphic Inline Cache

def type_ok?(obj, klass)
  obj.class == klass && !VM.has_changed?(klass)
end

def add_monomorphic(a, b)
  if type_ok?(a, Integer) && type_ok?(b, Integer)
    m = Integer.instance_method(:+)
    m.bind_call(a, b)
  else
    handle_miss!
  end
end

Slide 60

Slide 60 text

Polymorphic Inline Cache (PIC)

def add(a, b)
  a + b
end

add(10, 20)
add('hello ', 'good people')

Slide 61

Slide 61 text

Polymorphic Inline Cache (PIC)

def add_polymorphic(a, b)
  if type_ok?(a, Integer) && type_ok?(b, Integer)
    m = Integer.instance_method(:+)
    m.bind_call(a, b)

  elsif type_ok?(a, String) && type_ok?(b, String)
    m = String.instance_method(:+)
    m.bind_call(a, b)

  else
    handle_miss!
  end
end

Slide 62

Slide 62 text

Megamorphic

def add(a, b)
  a + b
end

add(10, 20)
add('hello ', 'good people')
add([1, 2], [3, 4])
add(10, 20.0)

Slide 63

Slide 63 text

Megamorphic

def add_megamorphic(a, b)
  # Look up method the slow way.
  # The VM may update the Global Method Cache.
  m = VM.lookup_method([a.class, :+])

  m.bind_call(a, b)
end

Slide 64

Slide 64 text

JIT Compile Inline Cache
Take that internal VM state and turn it into machine code
Speculative optimization that a and b types are stable & the method isn’t redefined
The machine code can optimize for the specialized operation

add:
  cmp [rdi + 0x20], 0xfe826359 ; Check if a.class is Integer
  jne 0x12344321               ; Deoptimize if not an Integer

  cmp [rsi + 0x20], 0xfe826359 ; Check if b.class is Integer
  jne 0x12344321               ; Deoptimize if not an Integer

  mov eax, rdi                 ; Copy `a` into EAX for addition
  add eax, rsi                 ; Perform `a + b`

  jo 0x67899876                ; Handle potential overflow

  ret

Slide 66

Slide 66 text

Deoptimization Recovers from bad guesses Throws away compiled code fragment Updates interpreter to resume interpreting that code Resets the profiler to start profiling again Optionally makes note about bad optimization decisions to avoid repeating deopt loops
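The recovery path can be modeled in plain Ruby with an exception standing in for the jump out of compiled code. All names here (`Deoptimize`, `SpeculativeAdd`, `compile_for_integers`) are hypothetical, and the "compiled" fragment is just a lambda with a type guard.

```ruby
class Deoptimize < StandardError; end

class SpeculativeAdd
  attr_reader :deopt_count

  def initialize
    @compiled = nil
    @deopt_count = 0
  end

  # Stand-in for JIT compilation specialized on Integer arguments.
  def compile_for_integers
    @compiled = lambda do |a, b|
      raise Deoptimize unless a.is_a?(Integer) && b.is_a?(Integer) # guard
      a + b
    end
  end

  def compiled?
    !@compiled.nil?
  end

  def call(a, b)
    return interpret(a, b) unless @compiled

    @compiled.call(a, b)
  rescue Deoptimize
    @compiled = nil   # throw away the compiled fragment
    @deopt_count += 1 # make a note of the bad guess
    interpret(a, b)   # resume in the interpreter
  end

  private

  # Generic path: ordinary dynamic dispatch on `+`.
  def interpret(a, b)
    a + b
  end
end

adder = SpeculativeAdd.new
adder.compile_for_integers
p adder.call(1, 2)
p adder.call("hello ", "friend")
p adder.compiled?
```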

Slide 67

Slide 67 text

Other Ruby JIT Optimizations

Slide 68

Slide 68 text

Method Inlining

# Real implementation of empty? in TruffleRuby.
class Array
  def empty?
    size == 0
  end
end

# Our method before inlining.
def blank?(o)
  o.nil? || o.empty?
end

# Our method after inlining.
def blank_after_inlining?(o)
  o.nil? || o.size == 0
end

Slide 69

Slide 69 text

Escape Analysis
The array never escapes
  It stays within min?
  No references to it appear anywhere else
The JIT compiler could eliminate the array allocation

def min?(value)
  [value, 1000].min == value
end
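To see what eliminating the allocation buys, compare `min?` with a hand-written equivalent that never builds the array. The second method is an assumption about what the optimized code effectively computes for Integer arguments, not the output of any particular JIT.

```ruby
def min?(value)
  [value, 1000].min == value
end

# What the optimized code can boil down to once the array is gone,
# assuming `value` is an Integer.
def min_without_allocation?(value)
  value <= 1000
end

samples = [-5, 0, 999, 1000, 1001, 5000]
p samples.all? { |v| min?(v) == min_without_allocation?(v) }
```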

Slide 70

Slide 70 text

Eliminate Metaprogramming Overhead send "abc".send(:size) is the same as "abc".size method_missing Implicitly call define_method so calls are fast respond_to? Can be made constant with inline cache
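The `method_missing` idea has a familiar user-level analogue: define a real method the first time a dynamic call arrives so later calls take the normal dispatch path. The `Settings` class below is a hypothetical illustration of that pattern written by hand; a JIT would do the equivalent internally without any change to your code.

```ruby
class Settings
  def initialize(values)
    @values = values
  end

  def method_missing(name, *args)
    return super unless @values.key?(name)

    # Define a real method so later calls skip method_missing entirely.
    self.class.define_method(name) { @values.fetch(name) }
    public_send(name)
  end

  def respond_to_missing?(name, include_private = false)
    @values.key?(name) || super
  end
end

settings = Settings.new(theme: "dark")
p Settings.method_defined?(:theme) # not defined until the first call
p settings.theme
p Settings.method_defined?(:theme)

# send with a constant method name is equivalent to a direct call.
p "abc".send(:size) == "abc".size
```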

Slide 71

Slide 71 text

instance_variable_{get|set} Turn into simple field accesses

Slide 72

Slide 72 text

Plenty of Room for Other Optimizations!

Slide 73

Slide 73 text

JIT Compilers Recap Conceptually simple Take your Ruby code and transform it to optimized machine code Faster than interpreting But incur a warm-up cost before hitting peak performance Optimize for the values flowing through your program Speculative optimizations could be faster than AOT

Slide 74

Slide 74 text

Work best with idiomatic Ruby Native extensions & clever hacks present barriers to JIT optimization

Slide 75

Slide 75 text

Resources

Slide 76

Slide 76 text

Thank you for your time  [email protected]  @nirvdrum  @[email protected]  @nirvdrum
