Slide 1

(de)optimizing ruby
Urabe, Shyouhei

Slide 2

@shyouhei
• Long-time ruby-core committer since the 1.8 era.
• Maintained ruby 1.8.5-1.8.7 (now EOL).
• Made the ruby/ruby repo mirror @ GitHub.
• Now a full-time ruby dev @ Money Forward, Inc.

Slide 3

No content

Slide 4

tl;dr
• Implemented deoptimization over Ruby 2.4.
• Boosts Ruby execution by up to 400+ times.
• Makes plenty of room for other optimizations.

Slide 5

Ruby is slow

Slide 6

No content

Slide 7

Why is it?
• Because we have GC?
• Because we have GVL?
• Because Ruby is dynamic?

Slide 8

Why is it?
• Because we have GC?
• Because we have GVL?
• Because Ruby is dynamic?

Slide 9

Why is it?
• Because it is not optimized.

Slide 10

Not optimized

== disasm: #<ISeq:...>================================
0000 putobject_OP_INT2FIX_O_1_C_                              (   1)
0001 putobject        2
0003 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0006 leave

This is how `1 + 2` is evaluated.

Slide 11

Not optimized

== disasm: #<ISeq:...>================================
0000 putobject        3                                       (   1)
0002 leave

This is what `1 + 2` should be (but is not).

Slide 12

Why is it?
• Because `Integer#+` can be redefined (as the example below shows)
  • on-the-fly,
  • dynamically,
  • globally,
  • maybe from within other threads.
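
A minimal illustration of the point above, in plain Ruby (nothing here is from the patch): the redefinition is legal, takes effect immediately, and may come from anywhere.

class Integer
  def +(other)
    42        # evil, but perfectly legal -- maybe even from another thread
  end
end
1 + 2         # => 42, so `1 + 2` cannot be blindly folded to 3 ahead of time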

Slide 13

But redefinition rarely happens
• Redefinitions must work, but do they really have to be fast?
• Which is better: everything runs slowly, or 99% of code runs fast and redefinition takes, say, 1,000x more time?
• → Introducing deoptimization.

Slide 14

Deoptimization
• “Just forget about redefinitions and go as far as you can. If things get changed, discard the optimized bits and fall back to the vanilla interpreter.”
• A technique originally introduced in SELF (a Smalltalk variant), later applied to many other languages, notably the JVM.
• JRuby and Rubinius both have their own deoptimization engines, hence both run faster than MRI.

Slide 15

Our strategy
• No JIT compilation to machine native code.
• Just transform VM instruction sequences and let the VM execute them (see the disasm example below).
• Furthermore, we restrict ourselves to “patching” a sequence in place; we neither shrink nor grow it.
• Fill with nops where needed. The nop instruction is expected to run adequately fast.
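
The sequences being patched are observable from Ruby itself; for instance (output from a vanilla build, abbreviated; exact offsets and operand details vary by version):

puts RubyVM::InstructionSequence.compile("1 + 2").disasm
# 0000 trace            1
# 0002 putobject_OP_INT2FIX_O_1_C_
# 0003 putobject        2
# 0005 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
# 0008 leave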

Slide 16

[Diagram: struct rb_iseq_constant_body holds `unsigned int iseq_size` and `VALUE *iseq_encoded`, an array of iseq_size instruction words; the VM program counter points into this array.]

Slide 17

[Diagram: before patching, the original contents of iseq_encoded are memcpy-ed into a backup buffer `VALUE *iseq_deoptimize`, stamped with `rb_serial_t created_at`.]

Slide 18

[Diagram: iseq_encoded is then rewritten in place with optimized instructions, while the vanilla copy is kept in iseq_deoptimize.]

Slide 19

[Diagram: on deoptimization, the saved vanilla instructions are memcpy-ed back from iseq_deoptimize over iseq_encoded.]

Slide 20

void
iseq_deoptimize(const rb_iseq_t *restrict i)
{
    body_t *b = i->body;
    const target_t *d = b->deoptimize;
    const void *orig = d->ptr;

    /* copy the saved vanilla sequence back over the patched one */
    memcpy((void *)b->iseq_encoded, orig, b->iseq_size * sizeof(VALUE));
    ISEQ_RESET_ORIGINAL_ISEQ(i);
}

Slide 21

What is good
• Done in pure C. No portability issues.
• The program counter is not affected by the operation.
• Hence no need to scan the VM stack.
• The saved vanilla sequence can be reused multiple times; the preparation is needed only once.

Slide 22

The VM timestamp
• In order to detect evil activities like method redefinition, a per-VM global timestamp counter is introduced.
• This counter is an unsigned integer that is atomically incremented when any of the following activities happen:
  • assignments to constants,
  • (re-)definition of methods,
  • inclusion of modules.

Slide 23

diff --git a/vm.c b/vm.c
index c3e7bb3..148020d 100644
--- a/vm.c
+++ b/vm.c
@@ -196,6 +196,7 @@ vm_invoke_proc(rb_thread_t *th, rb_proc_t *proc, VALUE self,
 		int argc, const VALUE *argv, const rb_block_t *blockptr);
+static rb_serial_t ruby_vm_global_timestamp = 1;
 static rb_serial_t ruby_vm_global_method_state = 1;
 static rb_serial_t ruby_vm_global_constant_state = 1;
 static rb_serial_t ruby_vm_class_serial = 1;
@@ -213,6 +214,7 @@
 rb_serial_t
 rb_next_class_serial(void)
 {
+    ATOMIC_INC(ruby_vm_global_timestamp);
     return NEXT_CLASS_SERIAL();
 }
diff --git a/vm_method.c b/vm_method.c
index 69f98c4..c771a5f 100644
--- a/vm_method.c
+++ b/vm_method.c
@@ -89,6 +89,7 @@
 rb_clear_cache(void)
 void
 rb_clear_constant_cache(void)

Slide 24

diff --git a/vm_insnhelper.h b/vm_insnhelper.h
index 98844dc..27cd18a 100644
--- a/vm_insnhelper.h
+++ b/vm_insnhelper.h
@@ -123,6 +123,7 @@ enum vm_regan_acttype {
 #define CALL_METHOD(calling, ci, cc) do { \
     VALUE v = (*(cc)->call)(th, GET_CFP(), (calling), (ci), (cc)); \
+    iseq_deoptimize_if_needed(GET_ISEQ(), ruby_vm_global_timestamp); \
     if (v == Qundef) { \
         RESTORE_REGS(); \
         NEXT_INSN(); \
--
static inline void
iseq_deoptimize_if_needed(const rb_iseq_t *restrict i, rb_serial_t t)
{
    if (t != i->body->created_at) {
        iseq_deoptimize(i);
    }
}

Slide 25

diff --git a/vm_insnhelper.c b/vm_insnhelper.c
index 3841801..a46028c 100644
--- a/vm_insnhelper.c
+++ b/vm_insnhelper.c
@@ -152,20 +172,21 @@
 static inline rb_control_frame_t *
 vm_push_frame(rb_thread_t *th, const rb_iseq_t *iseq, VALUE type, VALUE self,
               VALUE specval, VALUE cref_or_me, const VALUE *pc, VALUE *sp,
               int local_size, int stack_max)
 {
     rb_control_frame_t *const cfp = th->cfp - 1;
     int i;

     vm_check_frame(type, specval, cref_or_me);
     VM_ASSERT(local_size >= 1);
+    iseq_deoptimize_if_needed(iseq, ruby_vm_global_timestamp);

     /* check stack overflow */
     CHECK_VM_STACK_OVERFLOW0(cfp, sp, local_size + stack_max);

Slide 26

Almost no overheads

class C
  def method_missing mid
  end
end
obj = C.new
i = 0
while i < 6_000_000 # benchmark loop 2
  i += 1
  obj.m; obj.m; obj.m; obj.m; obj.m; obj.m; obj.m; obj.m;
end

[Chart: vm2_method_missing* execution time in seconds (axis 0 to 3), trunk and ours: 2.441 and 2.412; nearly identical.]

Slide 27

Deoptimization
• We made a deoptimization engine for Ruby.
• Its main characteristic is that it keeps VM state, such as the program counter, consistent.
• Very lightweight.

Slide 28

Optimize on it
• Various optimizations can be thought of:
  • eliminating send variants,
  • constant folding,
  • eliminating unused variables.

Slide 29

Folding constants

--- /dev/shm/1wqj345	2016-08-17 14:23:36.000000000 +0900
+++ /dev/shm/978zae	2016-08-17 14:23:36.000000000 +0900
@@ -20,9 +20,13 @@ local table (size: 2, argc: 0 [opts: 0,
 |------------------------------------------------------------------------
 0000 trace            256                                     (   8)
 0002 trace            1                                       (   9)
-0004 getinlinecache   13, <is:0>
-0007 getconstant      :RubyVM
-0009 getconstant      :InstructionSequence
-0011 setinlinecache   <is:0>
+0004 putobject        RubyVM::InstructionSequence
+0006 nop
+0007 nop
+0008 nop
+0009 nop
+0010 nop
+0011 nop
+0012 nop
 0013 trace            512                                     (  10)
 0015 leave                                                    (   9)

Slide 30

Folding constants
• Constants are already inline-cached.
• Just replace the getinlinecache in question with putobject, and fill the rest of the sequence with nops.

Slide 31

diff --git a/insns.def b/insns.def index 0c71b32..cf7f009 100644 --- a/insns.def +++ b/insns.def @@ -1319,6 +1319,10 @@ getinlinecache { if (ic->ic_serial == GET_GLOBAL_CONSTANT_STATE() && (ic->ic_cref == NULL || ic->ic_cref == rb_vm_get_cref(GET_EP()))) { + const rb_iseq_t *i = GET_ISEQ(); + const VALUE *p = GET_PC(); + + iseq_const_fold(i, p, OPN_OF_CURRENT_INSN + 1, dst, ic->ic_value.value); val = ic->ic_value.value; JUMP(dst); }

Slide 32

void
iseq_const_fold(const rb_iseq_t *restrict i, const VALUE *pc,
                int n, long m, VALUE konst)
{
    VALUE *buf = (VALUE *)&pc[-n];
    int len = n + m;

    /* wipeout_pattern is "nop nop nop ..." */
    memcpy(buf, wipeout_pattern, len * sizeof(VALUE));
    buf[0] = putobject;
    buf[1] = konst;
}

Slide 33

Folding 1+2

--- /dev/shm/xj79gt	2016-08-17 17:09:31.000000000 +0900
+++ /dev/shm/1gaaeo	2016-08-17 17:09:31.000000000 +0900
@@ -20,8 +20,10 @@ local table (size: 2, argc: 0 [opts: 0,
 |------------------------------------------------------------------------
 0000 trace            256                                     (   7)
 0002 trace            1                                       (   8)
-0004 putobject_OP_INT2FIX_O_1_C_
-0005 putobject        2
-0007 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
+0004 putobject        3
+0006 nop
+0007 nop
+0008 nop
+0009 nop
 0010 trace            512                                     (   9)
 0012 leave                                                    (   8)

Slide 34

diff --git a/insns.def b/insns.def index cf7f009..9bf6025 100644 --- a/insns.def +++ b/insns.def @@ -1459,23 +1458,28 @@ opt_plus #else val = LONG2NUM(FIX2LONG(recv) + FIX2LONG(obj)); #endif + TRY_CONSTFOLD(val); } else if (FLONUM_2_P(recv, obj) && BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) { val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj)); + TRY_CONSTFOLD(val); } else if (!SPECIAL_CONST_P(recv) && !SPECIAL_CONST_P(obj)) { if (RBASIC_CLASS(recv) == rb_cFloat && RBASIC_CLASS(obj) == rb_cFloat && BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) { val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj)); + TRY_CONSTFOLD(val); } else if (RBASIC_CLASS(recv) == rb_cString && RBASIC_CLASS(obj) == rb_cString && BASIC_OP_UNREDEFINED_P(BOP_PLUS, STRING_REDEFINED_OP_FLAG)) { val = rb_str_plus(recv, obj); + TRY_CONSTFOLD(val); } else if (RBASIC_CLASS(recv) == rb_cArray && BASIC_OP_UNREDEFINED_P(BOP_PLUS, ARRAY_REDEFINED_OP_FLAG)) { val = rb_ary_plus(recv, obj); + TRY_CONSTFOLD(val); } else { goto INSN_LABEL(normal_dispatch); @@ -1508,15 +1512,18 @@ opt_minus

Slide 35

Elimination of send

--- /dev/shm/179yavr	2016-08-12 19:41:44.000000000 +0900
+++ /dev/shm/uma7jr	2016-08-12 19:41:44.000000000 +0900
@@ -15,9 +15,12 @@
 |------------------------------------------------------------------------
 0000 trace            256                                     (   8)
 0002 trace            1                                       (   9)
-0004 putself
-0005 opt_send_without_block <callinfo!...>, <callcache>
-0008 adjuststack      1
+0004 nop
+0005 nop
+0006 nop
+0007 nop
+0008 nop
+0009 nop
 0010 trace            1                                       (  10)
 0012 nop
 0013 nop

Slide 36

Method purity
• A method eligible to be skipped is considered “pure”.
• A method is marked not pure if …
  • it writes to variables other than local ones,
  • it yields,
  • it is not written in Ruby,
  • it calls other methods that are not pure.

Slide 37

Methods that are not pure

def m
  Time.now
end

def m
  @foo = self
end

def m
  yield
end

def m
  { foo: :bar }
end

rb_define_method(rb_cTCPServer, "sysaccept", tcp_sysaccept, 0);

Slide 38

Methods that are pure

def m(x)
  y = i = 0
  while i < x
    z = i % 2 == 0 ? 1 : -1
    y += z / (2 * i + 1.0)
    i += 1
  end
  return 4 * y
end

def m(x, y, z = ' ')
  n = y - x.length
  while n > 0 do
    n -= z.length
    x = z + x
  end
  return x
end

Slide 39

Method purity
• “A method is either pure (optimizable) or not” is, in fact, an oversimplification.
• There is a third state: indeterministic.
• For instance, one cannot say whether a method is pure when it calls something that is not defined, resulting in a call to method_missing; see the sketch below.
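
A sketch of such an indeterministic case (`foo` is deliberately left undefined):

class C
  def m
    foo              # not defined anywhere: resolved via method_missing at run
  end                # time, so the purity of C#m cannot be decided statically
  def method_missing(mid, *args)
    @last = mid      # writes an instance variable: not pure
  end
end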

Slide 40

Method purity
• So a method’s purity is determined on the fly.
• Each method starts with its purity unpredicted.
• While the method runs, we collect usage information to detect its purity.
• Once a method’s purity is determined, that info propagates to its callers.

Slide 41

Method purity

--- /dev/shm/179yavr	2016-08-12 19:41:44.000000000 +0900
+++ /dev/shm/uma7jr	2016-08-12 19:41:44.000000000 +0900
@@ -15,9 +15,12 @@
 |------------------------------------------------------------------------
 0000 trace            256                                     (   8)
 0002 trace            1                                       (   9)
-0004 putself
-0005 opt_send_without_block <callinfo!...>, <callcache>
-0008 adjuststack      1
+0004 nop
+0005 nop
+0006 nop
+0007 nop
+0008 nop
+0009 nop
 0010 trace            1                                       (  10)
 0012 nop
 0013 nop

[Callout “This” points at the eliminated send above.]

Slide 42

enum insn_purity
purity_of_cc(const struct rb_call_cache *cc)
{
    const rb_iseq_t *i;

    if (! cc->me) {
        return insn_is_unpredictable; /* method missing */
    }
    else if (! (i = iseq_of_me(cc->me))) {
        return insn_is_not_pure;      /* not written in ruby. */
    }
    else if (! i->body->attributes) {
        /* Note, we do not recursively analyze.  That can lead to infinite
         * recursion on mutually recursive calls and detecting that is too
         * expensive in this hot path. */
        return insn_is_unpredictable;
    }
    else {
        return purity_of_VALUE(RB_ISEQ_ANNOTATED_P(i, core::purity));
    }
}

Slide 43

enum insn_purity
purity_of_sendish(const VALUE *argv)
{
    enum ruby_vminsn_type insn = argv[0];
    const char *ops = insn_op_types(insn);
    enum insn_purity purity = insn_is_pure;

    for (int j = 0; j < insn_len(insn); j++) {
        if (ops[j] == TS_CALLCACHE) {
            struct rb_call_cache *cc = (void *)argv[j + 1];
            purity += purity_of_cc(cc);
        }
    }
    return purity;
}

Slide 44

Eliminating send-ish instructions
• “Method calls whose return values are discarded” are subject to elimination.
• A method call just records the callee’s purity; later, if the immediately following instruction discards its return value, that preceding method call can be eliminated.
• The actual elimination happens in pop, not in send.

Slide 45

diff --git a/insns.def b/insns.def index c9d7204..2b877ff 100644 --- a/insns.def +++ b/insns.def @@ -711,9 +722,17 @@ DEFINE_INSN adjuststack (rb_num_t n) (...) (...) // inc -= n { DEC_SP(n); + /* If the immediately precedent instruction was send (or its + * variant), and here we are in adjuststack instruction, this + * means the return value of the method call is silently + * discarded. Then why not just avoid the whole method calling? + * This is possible when the callee method was marked pure. Note + * however that even on such case, evaluation of method arguments + * cannot be skipped, because they can have their own side + * effects. + */ + vm_eliminate_insn(GET_CFP(), GET_PC(), OPN_OF_CURRENT_INSN + 1, n); }

Slide 46

void
iseq_eliminate_insn(const rb_iseq_t *restrict i,
                    struct cfp_last_insn *restrict p, int n, rb_num_t m)
{
    VALUE *buf = (VALUE *)&i->body->iseq_encoded[p->pc];
    int len = p->len + n;
    int argc = p->argc + m;

    /* wipeout_pattern is "nop nop nop ..." */
    memcpy(buf, wipeout_pattern, len * sizeof(VALUE));
    if (argc != 0) {
        /* keep an adjuststack, in case arguments have side effects */
        buf[0] = adjuststack;
        buf[1] = argc;
    }
    ISEQ_RESET_ORIGINAL_ISEQ(i);
    FL_SET(i, ISEQ_NEEDS_ANALYZE);
}

Slide 47

Example of argument side-effect

--- /dev/shm/165rrgd	2016-08-17 10:44:10.000000000 +0900
+++ /dev/shm/jd0rcj	2016-08-17 10:44:10.000000000 +0900
@@ -23,8 +23,10 @@ local table (size: 2, argc: 0 [opts: 0,
 0004 putself                                ← suppose we can't optimize
 0005 putself                                ← suppose we can't optimize
 0006 opt_send_without_block <callinfo!...>, <callcache>   ← suppose we can't optimize
-0009 opt_send_without_block <callinfo!...>, <callcache>
-0012 adjuststack      1
+0009 adjuststack      2                     ← need to clear stack top
+0011 nop
+0012 nop
+0013 nop
 0014 trace            1                                   (  16)
 0016 putnil
 0017 trace            512                                 (  17)

Slide 48

Elimination of variables

--- /dev/shm/ea2lud	2016-08-19 10:40:28.000000000 +0900
+++ /dev/shm/1v4irx0	2016-08-19 10:40:28.000000000 +0900
@@ -17,8 +17,10 @@ local table (size: 3, argc: 1 [opts: 0,
 [ 3] i          [ 2] x
 0000 trace            256                                     (   4)
 0002 trace            1                                       (   5)
-0004 putobject        :foo
-0006 setlocal_OP__WC__0 2
+0004 nop
+0005 nop
+0006 nop
+0007 nop
 0008 trace            1                                       (   6)
 0010 putnil
 0011 trace            512                                     (   7)

Slide 49

(… is a bit hard though)

Slide 50

Elimination of variables
• We eliminate variables that are assigned but never read afterwards (write-only).
• Only methods that are pure can be considered.
• Methods with side effects might access bindings.
• Blocks might share local variables, so write-only-ness has to consider all reachable blocks; see the sketches below.
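
A sketch of both cases (method names are illustrative, not from the patch):

def write_only
  x = :foo               # assigned but never read again: can be wiped to nops
  42
end

def not_write_only
  x = 0
  counter = proc { x += 1 }  # the block shares `x`, so the assignment must stay
  counter.call
  x
end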

Slide 51

Elimination of variables
• There might also be other kinds of variables that are safe to eliminate, but detecting such variables precisely on the fly is very difficult; see the example below.
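
For example, the dead store below is safe to drop in principle, but proving that on the fly requires knowing that the overwritten value's producer has no side effects (`compute_a` and `compute_b` are hypothetical):

def m
  x = compute_a          # overwritten before any read: dead in principle...
  x = compute_b          # ...but dropping it is only safe if compute_a is pure
  x
end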

Slide 52

Optimizations
• Fairly basic optimizations are implemented.
• All optimizations run on the fly and preserve VM state such as exception tables.
• There is room for other optimization techniques, like subexpression elimination.

Slide 53

Benchmarks
• CAUTION: YMMV
• `make benchmark` results on my machine.
• Not a brand-new box; its /proc/cpuinfo says “Intel(R) Core(TM)2 Duo CPU T7700”.
• The following results show the average of 7 executions.

Slide 54

[Chart: speedup ratio versus trunk across the whole `make benchmark` suite, log scale from 0.1 to 100 (greater = faster).]

Slide 55

[Chart: execution time in seconds, trunk vs. ours, for vm1_simplereturn*, vm2_defined_method*, vm2_method*, vm2_poly_method*, vm2_super* and vm2_zsuper* (data labels: 0.210, 0.196, 0.873, 0.467, 0.477, 0.230, 0.836, 0.768, 3.949, 1.920, 4.174, 1.256). Callout: “Resulted in identical instruction sequences”.]

Slide 56

[Chart: execution time in seconds, trunk vs. ours, for app_pentomino, hash_aref_dsym_long, so_binary_trees, vm2_eval* and vm2_raise2* (data labels: 12.138, 40.683, 11.801, 10.517, 24.847, 11.237, 37.976, 10.439, 10.26, 23.407). Callout: “deoptimization overhead”.]

Slide 57

[Chart: execution time in seconds, trunk vs. ours, for vm1_attr_ivar*, vm1_attr_ivar_set*, vm1_block*, vm1_ivar*, vm1_ivar_set* and vm2_method_with_block* (data labels: 2.642, 0.865, 0.739, 3.782, 2.379, 2.125, 2.039, 0.654, 0.625, 2.572, 1.807, 1.755).]

Slide 58

[Chart: execution time in seconds, trunk vs. ours, for vm1_gc_short_lived*, vm2_array*, vm2_bigarray* and vm2_string_literal* (data labels: 0.025, 0.027, 0.032, 1.691, 0.275, 11.483, 0.943, 8.904).]

Slide 59

[Chart: the speedup-ratio-versus-trunk chart from slide 54, shown again (log scale from 0.1 to 100; greater = faster).]

Slide 60

Benchmarks
• Most benchmarks show the same performance.
• The optimizations work dramatically for several benchmarks.
• There do exist cases of slowdown, but IMHO the overhead is marginal.

Slide 61

Conclusion
• Implemented deoptimization over Ruby 2.4.
• Boosts Ruby execution by up to 400+ times.
• Makes plenty of room for other optimizations.

Slide 62

Future works
• Other optimizations can be thought of, such as:
  • subexpression elimination;
  • variable liveness & escape analysis;
  • and more.
• Allowing the program counter to be modified would make more room for further optimizations.

Slide 63

FAQs
• Q: Where is the patch?
• A: https://github.com/ruby/ruby/pull/1419
• Q: Does this speed up Rails?
• A: Not really.
• Q: Does this work out Ruby 3x3?
• A: It depends (the 3x3 goal is vague), but I believe I’m on the right path.