Optimizing Ruby

A presentation at RubyKaigi 2016, Kyoto.

Urabe Shyouhei

September 10, 2016

Transcript

  1. (de)optimizing
    ruby
    Urabe, Shyouhei

  2. @shyouhei
    • Long-time ruby-core committer, since the 1.8 era.
    • Maintained ruby 1.8.5–1.8.7 (now EOL).
    • Created the ruby/ruby repository mirror on GitHub.
    • Now a full-time ruby dev @ Money Forward, Inc.

  3. [image-only slide]

  4. tl;dr
    • Implemented deoptimization on top of Ruby 2.4.
    • Boosts Ruby execution by up to 400+ times.
    • Opens up room for many other optimizations.

  5. Ruby is slow

  6. [image-only slide]

  7. Why is it?
    • Because we have GC?
    • Because we have GVL?
    • Because Ruby is dynamic?

  8. Why is it?
    • Because we have GC?
    • Because we have GVL?
    • Because Ruby is dynamic?

  9. Why is it?
    • Because it is not optimized.

  10. Not optimized
    == disasm: #<ISeq:<compiled>@<compiled>>================================
    0000 putobject_OP_INT2FIX_O_1_C_                                  (   1)
    0001 putobject        2
    0003 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
    0006 leave
    This is how `1 + 2` is evaluated

  11. Not optimized
    == disasm: #<ISeq:<compiled>@<compiled>>================================
    0000 putobject        3                                           (   1)
    0002 leave
    This is what `1 + 2` should be
    (but is not)
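    You can see these sequences for yourself with the standard disassembler API:

    # Print the instruction words the VM executes for `1 + 2`.
    # On an unmodified Ruby this shows an opt_plus dispatch,
    # not a folded `putobject 3`.
    puts RubyVM::InstructionSequence.compile("1 + 2").disasm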

  12. Why is it?
    • Because `Integer#+` can be redefined
    • on-the-fly,
    • dynamically,
    • globally,
    • maybe from within other threads (see the snippet below).
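    A minimal demonstration of the problem, in plain Ruby 2.4 (where Fixnum has been unified into Integer):

    # Evil but perfectly legal: redefine Integer#+ on the fly.
    class Integer
      def +(other)
        42
      end
    end

    p 1 + 2  #=> 42 (so `1 + 2` cannot be blindly folded to 3)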

  13. But redefinition rarely happens
    • Redefinition must keep working, but does it really have to be as fast as possible?
    • Which is better: everything runs slowly, or 99% of code runs fast and redefinition takes, say, 1,000× longer?
    • → Enter deoptimization.

  14. Deoptimization
    • “Just forget about redefinitions and go as far as you can. If things change, discard the optimized bits and fall back to the vanilla interpreter.”
    • A technique originally introduced in SELF (a Smalltalk variant), later applied to many other languages, notably on the JVM.
    • JRuby and Rubinius both have their own deoptimization engines, hence both run faster than MRI.

  15. Our strategy
    • No JIT compilation to native machine code.
    • Just transform VM instruction sequences and let the VM execute them.
    • Furthermore, we restrict ourselves to “patching” a sequence in place; we neither shrink nor grow it.
    • Fill with nops where needed. The nop instruction is expected to run adequately fast.

  16. [Diagram: struct rb_iseq_constant_body. Its VALUE *iseq_encoded points to the instruction words (insn insn insn …) that the program counter walks; unsigned int iseq_size gives the length: ← iseq_size words →.]

  17. [Diagram: the vanilla instruction words in iseq_encoded are memcpy-ed into a side buffer; VALUE *iseq_deoptimize points at the copy, and rb_serial_t created_at records the VM timestamp when it was taken.]

  18. [Diagram: iseq_encoded is then patched in place (insn words become opt words) while iseq_deoptimize still holds the untouched vanilla copy.]

  19. [Diagram: to deoptimize, the saved vanilla words are memcpy-ed back from iseq_deoptimize over the optimized iseq_encoded.]

  20. void
    iseq_deoptimize(const rb_iseq_t *restrict i)
    {
        body_t *b = i->body;
        const target_t *d = b->deoptimize;  /* the saved vanilla sequence */
        const void *orig = d->ptr;
        /* copy the vanilla instruction words back over the optimized ones */
        memcpy((void *)b->iseq_encoded, orig, b->iseq_size * sizeof(VALUE));
        ISEQ_RESET_ORIGINAL_ISEQ(i);
    }

  21. What is good
    • Done in pure C. No portability issues.
    • The program counter is not affected by the operation.
    • Hence no need to scan the VM stack.
    • The saved vanilla sequence can be reused multiple times; the preparation is needed only once.

  22. The VM timestamp
    • In order to detect evil activities like method redefinition, a per-VM global timestamp counter is introduced.
    • This counter is an unsigned integer that is atomically incremented whenever any of the following happens:
    • Assignment to a constant,
    • (Re-)definition of a method,
    • Inclusion of a module.
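    MRI already exposes closely related per-VM counters via RubyVM.stat (the new ruby_vm_global_timestamp itself stays internal). A quick sketch of how exactly these kinds of activity bump the existing counters:

    before = RubyVM.stat   # {:global_method_state=>..., :global_constant_state=>..., :class_serial=>...}
    def some_method; end   # (re-)definition of a method on Object
    SOME_CONST = 1         # assignment to a constant
    after = RubyVM.stat

    p after[:global_method_state]   > before[:global_method_state]    #=> true
    p after[:global_constant_state] > before[:global_constant_state]  #=> true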

  23. diff --git a/vm.c b/vm.c
    index c3e7bb3..148020d 100644
    --- a/vm.c
    +++ b/vm.c
    @@ -196,6 +196,7 @@
    vm_invoke_proc(rb_thread_t *th, rb_proc_t *proc, VALUE self,
    int argc, const VALUE *argv, const rb_block_t *blockptr);
    +static rb_serial_t ruby_vm_global_timestamp = 1;
    static rb_serial_t ruby_vm_global_method_state = 1;
    static rb_serial_t ruby_vm_global_constant_state = 1;
    static rb_serial_t ruby_vm_class_serial = 1;
    @@ -213,6 +214,7 @@
    rb_serial_t
    rb_next_class_serial(void)
    {
    + ATOMIC_INC(ruby_vm_global_timestamp);
    return NEXT_CLASS_SERIAL();
    }
    diff --git a/vm_method.c b/vm_method.c
    index 69f98c4..c771a5f 100644
    --- a/vm_method.c
    +++ b/vm_method.c
    @@ -89,6 +89,7 @@ rb_clear_cache(void)
    void
    rb_clear_constant_cache(void)

  24. diff --git a/vm_insnhelper.h b/vm_insnhelper.h
    index 98844dc..27cd18a 100644
    --- a/vm_insnhelper.h
    +++ b/vm_insnhelper.h
    @@ -123,6 +123,7 @@ enum vm_regan_acttype {
    #define CALL_METHOD(calling, ci, cc) do { \
    VALUE v = (*(cc)->call)(th, GET_CFP(), (calling), (ci), (cc)); \
    + iseq_deoptimize_if_needed(GET_ISEQ(), ruby_vm_global_timestamp); \
    if (v == Qundef) { \
    RESTORE_REGS(); \
    NEXT_INSN(); \
    --
    static inline void
    iseq_deoptimize_if_needed(
    const rb_iseq_t *restrict i,
    rb_serial_t t)
    {
    if (t != i->body->created_at) {
    iseq_deoptimize(i);
    }
    }

  25. diff --git a/vm_insnhelper.c b/vm_insnhelper.c
    index 3841801..a46028c 100644
    --- a/vm_insnhelper.c
    +++ b/vm_insnhelper.c
    @@ -152,20 +172,21 @@
    static inline rb_control_frame_t *
    vm_push_frame(rb_thread_t *th,
    const rb_iseq_t *iseq,
    VALUE type,
    VALUE self,
    VALUE specval,
    VALUE cref_or_me,
    const VALUE *pc,
    VALUE *sp,
    int local_size,
    int stack_max)
    {
    rb_control_frame_t *const cfp = th->cfp - 1;
    int i;
    vm_check_frame(type, specval, cref_or_me);
    VM_ASSERT(local_size >= 1);
    + iseq_deoptimize_if_needed(iseq, ruby_vm_global_timestamp);
    /* check stack overflow */
    CHECK_VM_STACK_OVERFLOW0(cfp, sp, local_size + stack_max);

  26. Almost no overheads
    class C
      def method_missing mid
      end
    end
    obj = C.new
    i = 0
    while i < 6_000_000 # benchmark loop 2
      i += 1
      obj.m; obj.m; obj.m; obj.m; obj.m; obj.m; obj.m; obj.m;
    end
    [Bar chart: vm2_method_missing* execution time in seconds, trunk vs ours: trunk 2.441, ours 2.412]

  27. Deoptimization
    • We made a deoptimization engine of ruby.
    • Its main characteristics include consistency of VM
    states such as program counter.
    • Very lightweight.

  28. Optimize on it
    • Various optimizations can be thought of:
    • Eliminating send variants,
    • Constant folding,
    • Eliminating unused variables.

  29. Folding constants
    --- /dev/shm/1wqj345 2016-08-17 14:23:36.000000000 +0900
    +++ /dev/shm/978zae 2016-08-17 14:23:36.000000000 +0900
    @@ -20,9 +20,13 @@ local table (size: 2, argc: 0 [opts: 0,
    |------------------------------------------------------------------------
    0000 trace 256 ( 8)
    0002 trace 1 ( 9)
    -0004 getinlinecache 13,
    -0007 getconstant :RubyVM
    -0009 getconstant :InstructionSequence
    -0011 setinlinecache
    +0004 putobject RubyVM::InstructionSequence
    +0006 nop
    +0007 nop
    +0008 nop
    +0009 nop
    +0010 nop
    +0011 nop
    +0012 nop
    0013 trace 512 ( 10)
    0015 leave ( 9)

  30. Folding constants
    • Constants are already inline-cached.
    • Just replace the getinlinecache in question with putobject, and fill the rest of the sequence with nops (a scenario sketch follows).
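    A sketch of the scenario this folding has to survive (A and m are made-up names; reassigning a constant warns but is legal Ruby):

    A = 1
    def m
      A      # compiled to getinlinecache/getconstant/…; foldable to `putobject 1`
    end
    m        # after this call the sequence can be patched to return the cached 1

    A = 2    # warning: already initialized constant; bumps the VM timestamp
    m        #=> 2: the folded sequence must be deoptimized, not keep returning 1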

  31. diff --git a/insns.def b/insns.def
    index 0c71b32..cf7f009 100644
    --- a/insns.def
    +++ b/insns.def
    @@ -1319,6 +1319,10 @@ getinlinecache
    {
    if (ic->ic_serial == GET_GLOBAL_CONSTANT_STATE() &&
    (ic->ic_cref == NULL || ic->ic_cref == rb_vm_get_cref(GET_EP()))) {
    + const rb_iseq_t *i = GET_ISEQ();
    + const VALUE *p = GET_PC();
    +
    + iseq_const_fold(i, p, OPN_OF_CURRENT_INSN + 1, dst, ic->ic_value.value);
    val = ic->ic_value.value;
    JUMP(dst);
    }

  32. void
    iseq_const_fold(
        const rb_iseq_t *restrict i,
        const VALUE *pc,
        int n,
        long m,
        VALUE konst)
    {
        VALUE *buf = (VALUE *)&pc[-n];
        int len = n + m;
        memcpy(buf, wipeout_pattern, len * sizeof(VALUE));  /* “nop nop nop …” */
        buf[0] = putobject;
        buf[1] = konst;
    }

  33. Folding 1+2
    --- /dev/shm/xj79gt 2016-08-17 17:09:31.000000000 +0900
    +++ /dev/shm/1gaaeo 2016-08-17 17:09:31.000000000 +0900
    @@ -20,8 +20,10 @@ local table (size: 2, argc: 0 [opts: 0,
    |------------------------------------------------------------------------
    0000 trace 256 ( 7)
    0002 trace 1 ( 8)
    -0004 putobject_OP_INT2FIX_O_1_C_
    -0005 putobject 2
    -0007 opt_plus ,
    +0004 putobject 3
    +0006 nop
    +0007 nop
    +0008 nop
    +0009 nop
    0010 trace 512 ( 9)
    0012 leave ( 8)

  34. diff --git a/insns.def b/insns.def
    index cf7f009..9bf6025 100644
    --- a/insns.def
    +++ b/insns.def
    @@ -1459,23 +1458,28 @@ opt_plus
    #else
    val = LONG2NUM(FIX2LONG(recv) + FIX2LONG(obj));
    #endif
    + TRY_CONSTFOLD(val);
    }
    else if (FLONUM_2_P(recv, obj) &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) {
    val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj));
    + TRY_CONSTFOLD(val);
    }
    else if (!SPECIAL_CONST_P(recv) && !SPECIAL_CONST_P(obj)) {
    if (RBASIC_CLASS(recv) == rb_cFloat && RBASIC_CLASS(obj) == rb_cFloat &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) {
    val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj));
    + TRY_CONSTFOLD(val);
    }
    else if (RBASIC_CLASS(recv) == rb_cString && RBASIC_CLASS(obj) == rb_cString &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, STRING_REDEFINED_OP_FLAG)) {
    val = rb_str_plus(recv, obj);
    + TRY_CONSTFOLD(val);
    }
    else if (RBASIC_CLASS(recv) == rb_cArray &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, ARRAY_REDEFINED_OP_FLAG)) {
    val = rb_ary_plus(recv, obj);
    + TRY_CONSTFOLD(val);
    }
    else {
    goto INSN_LABEL(normal_dispatch);
    @@ -1508,15 +1512,18 @@ opt_minus

  35. Elimination of send
    --- /dev/shm/179yavr 2016-08-12 19:41:44.000000000 +0900
    +++ /dev/shm/uma7jr 2016-08-12 19:41:44.000000000 +0900
    @@ -15,9 +15,12 @@
    |------------------------------------------------------------------------
    0000 trace 256 ( 8)
    0002 trace 1 ( 9)
    -0004 putself
    -0005 opt_send_without_block ,
    -0008 adjuststack 1
    +0004 nop
    +0005 nop
    +0006 nop
    +0007 nop
    +0008 nop
    +0009 nop
    0010 trace 1 ( 10)
    0012 nop
    0013 nop

  36. Method purity
    • A method eligible to be skipped is considered “pure”.
    • A method is marked not pure if …
    • It writes to variables other than local ones.
    • It yields.
    • It is not written in Ruby.
    • It calls other methods that are not pure.

  37. Methods that are not pure
    def m
      Time.now
    end

    def m
      @foo = self
    end

    def m
      yield
    end

    def m
      { foo: :bar }
    end

    rb_define_method(rb_cTCPServer, "sysaccept", tcp_sysaccept, 0);

  38. Methods that are pure
    def m(x)
      y = i = 0
      while i < x
        z = i % 2 == 0 ? 1 : -1
        y += z / (2 * i + 1.0)
        i += 1
      end
      return 4 * y
    end

    def m(x, y, z = ' ')
      n = y - x.length
      while n > 0 do
        n -= z.length
        x = z + x
      end
      return x
    end

  39. Method purity
    • “A method is either pure (optimizable) or not” is, in fact, an oversimplification.
    • There is a third state: indeterminate.
    • For instance, one cannot tell whether a method is pure when it calls something that is not defined, resulting in a call to method_missing.

  40. Method purity
    • So a method’s purity is determined on the fly.
    • Each method starts out with its purity unpredicted.
    • While the method runs, we collect usage information to detect its purity.
    • Once a method’s purity is determined, that information propagates to its callers (illustrated below).
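    A made-up example of that propagation:

    def leaf(x)
      x + 1         # touches nothing but locals: once run, observed to be pure
    end

    def trunk(x)
      leaf(x) * 2   # starts out unpredicted; once leaf is known to be pure,
                    # that fact propagates here and trunk becomes pure as well
    end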

  41. Method purity
    --- /dev/shm/179yavr 2016-08-12 19:41:44.000000000 +0900
    +++ /dev/shm/uma7jr 2016-08-12 19:41:44.000000000 +0900
    @@ -15,9 +15,12 @@
    |------------------------------------------------------------------------
    0000 trace 256 ( 8)
    0002 trace 1 ( 9)
    -0004 putself
    -0005 opt_send_without_block ,
    -0008 adjuststack 1
    +0004 nop
    +0005 nop
    +0006 nop
    +0007 nop
    +0008 nop
    +0009 nop
    0010 trace 1 ( 10)
    0012 nop
    0013 nop
    (slide annotation: “This”)

  42. enum insn_purity
    purity_of_cc(const struct rb_call_cache *cc)
    {
    const rb_iseq_t *i;
    if (! cc->me) {
    return insn_is_unpredictable; /* method missing */
    }
    else if (! (i = iseq_of_me(cc->me))) {
    return insn_is_not_pure; /* not written in ruby. */
    }
    else if (! i->body->attributes) {
    /* Note, we do not recursively analyze. That can lead to infinite
    * recursion on mutually recursive calls and detecting that is too
    * expensive in this hot path.*/
    return insn_is_unpredictable;
    }
    else {
    return purity_of_VALUE(RB_ISEQ_ANNOTATED_P(i, core::purity));
    }
    }

  43. enum insn_purity
    purity_of_sendish(const VALUE *argv)
    {
    enum ruby_vminsn_type insn = argv[0];
    const char *ops = insn_op_types(insn);
    enum insn_purity purity = insn_is_pure;
    for (int j = 0; j < insn_len(insn); j++) {
    if (ops[j] == TS_CALLCACHE) {
    struct rb_call_cache *cc = (void *)argv[j + 1];
    purity += purity_of_cc(cc);
    }
    }
    return purity;
    }

  44. Eliminating send-ish instructions
    • “Method calls whose return values are discarded” are subject to elimination.
    • A send itself just checks the called method’s purity; if the immediately following instruction then discards the return value, that preceding method call can be eliminated (source-level illustration below).
    • The actual elimination happens in pop, not in send.
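    In source terms the target looks like this (illustrative names only; pure_m stands for any method already known to be pure):

    def pure_m(x)
      x * 2        # pure: only reads its argument and locals
    end

    def caller_m
      pure_m(21)   # return value is discarded (a pop/adjuststack follows),
                   # so with pure_m known pure the whole send can be nop-ed out
      :done
    end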

  45. diff --git a/insns.def b/insns.def
    index c9d7204..2b877ff 100644
    --- a/insns.def
    +++ b/insns.def
    @@ -711,9 +722,17 @@ DEFINE_INSN
    adjuststack
    (rb_num_t n)
    (...)
    (...) // inc -= n
    {
    DEC_SP(n);
    + /* If the immediately preceding instruction was send (or one of
    + * its variants), and we are now in an adjuststack instruction,
    + * the return value of the method call is being silently
    + * discarded. Then why not avoid the whole method call?
    + * This is possible when the callee method is marked pure. Note
    + * however that even in such a case, evaluation of the method
    + * arguments cannot be skipped, because they can have side
    + * effects of their own.
    + */
    + vm_eliminate_insn(GET_CFP(), GET_PC(), OPN_OF_CURRENT_INSN + 1, n);
    }

  46. void
    iseq_eliminate_insn(
        const rb_iseq_t *restrict i,
        struct cfp_last_insn *restrict p,
        int n,
        rb_num_t m)
    {
        VALUE *buf = (VALUE *)&i->body->iseq_encoded[p->pc];
        int len = p->len + n;
        int argc = p->argc + m;
        memcpy(buf, wipeout_pattern, len * sizeof(VALUE));  /* “nop nop nop …” */
        if (argc != 0) {
            /* keep an adjuststack, in case arguments have side effects */
            buf[0] = adjuststack;
            buf[1] = argc;
        }
        ISEQ_RESET_ORIGINAL_ISEQ(i);
        FL_SET(i, ISEQ_NEEDS_ANALYZE);
    }

  47. Example of argument side-effect
    --- /dev/shm/165rrgd 2016-08-17 10:44:10.000000000 +0900
    +++ /dev/shm/jd0rcj 2016-08-17 10:44:10.000000000 +0900
    @@ -23,8 +23,10 @@ local table (size: 2, argc: 0 [opts: 0,
    0004 putself
    0005 putself
    0006 opt_send_without_block ,
    -0009 opt_send_without_block ,
    -0012 adjuststack 1
    +0009 adjuststack 2
    +0011 nop
    +0012 nop
    +0013 nop
    0014 trace 1 ( 16)
    0016 putnil
    0017 trace 512 ( 17)
    (slide annotations: “suppose we can’t optimize” ×3, on the surviving putself / putself / opt_send_without_block; “need clear stack top”, on the new adjuststack 2)

  48. Elimination of variables
    --- /dev/shm/ea2lud 2016-08-19 10:40:28.000000000 +0900
    +++ /dev/shm/1v4irx0 2016-08-19 10:40:28.000000000 +0900
    @@ -17,8 +17,10 @@ local table (size: 3, argc: 1 [opts: 0,
    [ 3] i [ 2] x
    0000 trace 256 ( 4)
    0002 trace 1 ( 5)
    -0004 putobject :foo
    -0006 setlocal_OP__WC__0 2
    +0004 nop
    +0005 nop
    +0006 nop
    +0007 nop
    0008 trace 1 ( 6)
    0010 putnil
    0011 trace 512 ( 7)
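    A plausible source for this diff, reconstructed from the local table [ 3 ] i [ 2 ] x (not shown in the deck):

    def m(i)
      x = :foo   # x is assigned but never read afterwards: write-only
      nil
    end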

  49. (… is a bit hard though)

  50. Elimination of variables
    • We eliminate variables that are assigned but never used afterwards (write-only).
    • Only methods that are pure can be considered.
    • Methods with side effects might access bindings.
    • Blocks can share local variables, so write-only-ness has to consider all reachable blocks (see the sketch below).
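    The block caveat in code (made-up example):

    def truly_write_only
      x = :foo               # never read again: eliminable
      nil
    end

    def looks_write_only
      x = :foo               # no read in this body...
      callback = proc { x }  # ...but the block captures x and may read it later,
      callback               # so the assignment must survive
    end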

  51. Elimination of variables
    • There might also be other kinds of variables that are safe to eliminate, but detecting them precisely on the fly is very difficult.

  52. Optimizations
    • Fairly basic optimizations are implemented.
    • All optimizations run on the fly and preserve VM state such as exception tables.
    • There is room for other optimization techniques, like subexpression elimination.

  53. Benchmarks
    • CAUTION: YMMV.
    • `make benchmark` results on my machine.
    • Not a brand-new box; its /proc/cpuinfo says “Intel(R) Core(TM)2 Duo CPU T7700”.
    • The following results show the average of 7 executions.

  54. Speedup ratio versus trunk (greater = faster)
    [Log-scale chart, 0.1–100, of per-benchmark speedup ratio over the full `make benchmark` suite (app_*, hash_*, io_*, loop_*, marshal_*, require*, so_*, vm1_*, vm2_*, vm3_*, vm_thread_*); annotations: Slower / Faster.]

  55. [Bar chart: execution time in seconds (axis 0–5), trunk vs ours, for vm1_simplereturn*, vm2_defined_method*, vm2_method*, vm2_poly_method*, vm2_super*, vm2_zsuper*; annotation: “Resulted in identical instruction sequences”; direction label: Faster. Data labels: 0.210, 0.196, 0.873, 0.467, 0.477, 0.230, 0.836, 0.768, 3.949, 1.920, 4.174, 1.256.]

  56. [Bar chart: execution time in seconds (axis 0–50), trunk vs ours, for app_pentomino, hash_aref_dsym_long, so_binary_trees, vm2_eval*, vm2_raise2*; annotation: “deoptimization overhead”; direction label: Faster. Data labels: 12.138, 40.683, 11.801, 10.517, 24.847, 11.237, 37.976, 10.439, 10.26, 23.407.]

  57. [Bar chart: execution time in seconds (axis 0–4), trunk vs ours, for vm1_attr_ivar*, vm1_attr_ivar_set*, vm1_block*, vm1_ivar*, vm1_ivar_set*, vm2_method_with_block*; direction label: Faster. Data labels: 2.642, 0.865, 0.739, 3.782, 2.379, 2.125, 2.039, 0.654, 0.625, 2.572, 1.807, 1.755.]

  58. [Bar chart: execution time in seconds (axis 0–12), trunk vs ours, for vm1_gc_short_lived*, vm2_array*, vm2_bigarray*, vm2_string_literal*; direction label: Faster. Data labels: 0.025, 0.027, 0.032, 1.691, 0.275, 11.483, 0.943, 8.904.]

  59. Speedup ratio versus trunk (greater = faster)
    [The same log-scale per-benchmark speedup chart as slide 54, shown again; annotations: Slower / Faster.]

  60. Benchmarks
    • Most benchmarks show the same performance.
    • The optimizations work dramatically well on several benchmarks.
    • There do exist cases of slowdown, but IMHO the overhead is marginal.

  61. Conclusion
    • Implemented deoptimization on top of Ruby 2.4.
    • Boosts Ruby execution by up to 400+ times.
    • Opens up room for many other optimizations.

  62. Future work
    • Other optimizations can be thought of, such as:
    • Subexpression elimination;
    • Variable liveness & escape analysis;
    • and more.
    • Allowing the program counter to be modified would open up room for further optimizations.

  63. FAQs
    • Q: Where is the patch?
    • A: https://github.com/ruby/ruby/pull/1419
    • Q: Does this speed up Rails?
    • A: Not really.
    • Q: Does this achieve Ruby 3x3?
    • A: It depends (the 3x3 goal is vague), but I believe I’m on the right path.
