Optimizing Ruby

A presentation at RubyKaigi 2016, Kyoto.

Urabe Shyouhei

September 10, 2016

Transcript

  1. (de)optimizing
    ruby
    Urabe, Shyouhei

  2. @shyouhei
    • Long-time ruby-core committer, since the 1.8 era.
    • Maintained ruby 1.8.5–1.8.7 (now EOL).
    • Created the ruby/ruby repository mirror on GitHub.
    • Now a full-time ruby dev @ Money Forward, Inc.

  3. [image-only slide]

  4. tl;dr
    • Implemented deoptimization on top of Ruby 2.4.
    • Boosts Ruby execution by up to 400+ times.
    • Opens up room for many other optimizations.

  5. Ruby is slow

  6. [image-only slide]

  7. Why is it?
    • Because we have GC?
    • Because we have GVL?
    • Because Ruby is dynamic?

  8. Why is it?
    • Because we have GC?
    • Because we have GVL?
    • Because Ruby is dynamic?

  9. Why is it?
    • Because it is not optimized.

  10. Not optimized
    == disasm: #<ISeq:<compiled>@<compiled>>================================
    0000 putobject_OP_INT2FIX_O_1_C_                                  (   1)
    0001 putobject        2
    0003 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
    0006 leave
    This is how `1 + 2` is evaluated

  11. Not optimized
    == disasm: #<ISeq:<compiled>@<compiled>>================================
    0000 putobject        3                                           (   1)
    0002 leave
    This is what `1 + 2` should be
    (but is not)
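    You can see these sequences for yourself with the standard disassembler API:

    # Print the instruction words the VM executes for `1 + 2`.
    # On an unmodified Ruby this shows an opt_plus dispatch,
    # not a folded `putobject 3`.
    puts RubyVM::InstructionSequence.compile("1 + 2").disasm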

  12. Why is it?
    • Because `Integer#+` can be redefined
    • on-the-fly,
    • dynamically,
    • globally,
    • maybe from within other threads (see the snippet below).
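    A minimal demonstration of the problem, in plain Ruby 2.4 (where Fixnum has been unified into Integer):

    # Evil but perfectly legal: redefine Integer#+ on the fly.
    class Integer
      def +(other)
        42
      end
    end

    p 1 + 2  #=> 42 (so `1 + 2` cannot be blindly folded to 3)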

  13. But redefinition rarely happens
    • Redefinition must keep working, but does it really have to be as fast as possible?
    • Which is better: everything runs slowly, or 99% of code runs fast and redefinition takes, say, 1,000× longer?
    • → Enter deoptimization.

  14. Deoptimization
    • “Just forget about redefinitions and go as far as you can. If things change, discard the optimized bits and fall back to the vanilla interpreter.”
    • A technique originally introduced in SELF (a Smalltalk variant), later applied to many other languages, notably on the JVM.
    • JRuby and Rubinius both have their own deoptimization engines, hence both run faster than MRI.

  15. Our strategy
    • No JIT compilation to native machine code.
    • Just transform VM instruction sequences and let the VM execute them.
    • Furthermore, we restrict ourselves to “patching” a sequence in place; we neither shrink nor grow it.
    • Fill with nops where needed. The nop instruction is expected to run adequately fast.

  16. [Diagram: struct rb_iseq_constant_body. Its VALUE *iseq_encoded points to the instruction words (insn insn insn …) that the program counter walks; unsigned int iseq_size gives the length: ← iseq_size words →.]

  17. [Diagram: the vanilla instruction words in iseq_encoded are memcpy-ed into a side buffer; VALUE *iseq_deoptimize points at the copy, and rb_serial_t created_at records the VM timestamp when it was taken.]

  18. [Diagram: iseq_encoded is then patched in place (insn words become opt words) while iseq_deoptimize still holds the untouched vanilla copy.]

  19. [Diagram: to deoptimize, the saved vanilla words are memcpy-ed back from iseq_deoptimize over the optimized iseq_encoded.]

  20. void
    iseq_deoptimize(const rb_iseq_t *restrict i)
    {
        body_t *b = i->body;
        const target_t *d = b->deoptimize;  /* the saved vanilla sequence */
        const void *orig = d->ptr;
        /* copy the vanilla instruction words back over the optimized ones */
        memcpy((void *)b->iseq_encoded, orig, b->iseq_size * sizeof(VALUE));
        ISEQ_RESET_ORIGINAL_ISEQ(i);
    }

  21. What is good
    • Done in pure C. No portability issues.
    • The program counter is not affected by the operation.
    • Hence no need to scan the VM stack.
    • The saved vanilla sequence can be reused multiple times; the preparation is needed only once.

  22. The VM timestamp
    • In order to detect evil activities like method redefinition, a per-VM global timestamp counter is introduced.
    • This counter is an unsigned integer that is atomically incremented whenever any of the following happens:
    • Assignment to a constant,
    • (Re-)definition of a method,
    • Inclusion of a module.
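    MRI already exposes closely related per-VM counters via RubyVM.stat (the new ruby_vm_global_timestamp itself stays internal). A quick sketch of how exactly these kinds of activity bump the existing counters:

    before = RubyVM.stat   # {:global_method_state=>..., :global_constant_state=>..., :class_serial=>...}
    def some_method; end   # (re-)definition of a method on Object
    SOME_CONST = 1         # assignment to a constant
    after = RubyVM.stat

    p after[:global_method_state]   > before[:global_method_state]    #=> true
    p after[:global_constant_state] > before[:global_constant_state]  #=> true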

  23. diff --git a/vm.c b/vm.c
    index c3e7bb3..148020d 100644
    --- a/vm.c
    +++ b/vm.c
    @@ -196,6 +196,7 @@
    vm_invoke_proc(rb_thread_t *th, rb_proc_t *proc, VALUE self,
    int argc, const VALUE *argv, const rb_block_t *blockptr);
    +static rb_serial_t ruby_vm_global_timestamp = 1;
    static rb_serial_t ruby_vm_global_method_state = 1;
    static rb_serial_t ruby_vm_global_constant_state = 1;
    static rb_serial_t ruby_vm_class_serial = 1;
    @@ -213,6 +214,7 @@
    rb_serial_t
    rb_next_class_serial(void)
    {
    + ATOMIC_INC(ruby_vm_global_timestamp);
    return NEXT_CLASS_SERIAL();
    }
    diff --git a/vm_method.c b/vm_method.c
    index 69f98c4..c771a5f 100644
    --- a/vm_method.c
    +++ b/vm_method.c
    @@ -89,6 +89,7 @@ rb_clear_cache(void)
    void
    rb_clear_constant_cache(void)

  24. diff --git a/vm_insnhelper.h b/vm_insnhelper.h
    index 98844dc..27cd18a 100644
    --- a/vm_insnhelper.h
    +++ b/vm_insnhelper.h
    @@ -123,6 +123,7 @@ enum vm_regan_acttype {
    #define CALL_METHOD(calling, ci, cc) do { \
    VALUE v = (*(cc)->call)(th, GET_CFP(), (calling), (ci), (cc)); \
    + iseq_deoptimize_if_needed(GET_ISEQ(), ruby_vm_global_timestamp); \
    if (v == Qundef) { \
    RESTORE_REGS(); \
    NEXT_INSN(); \
    --
    static inline void
    iseq_deoptimize_if_needed(
    const rb_iseq_t *restrict i,
    rb_serial_t t)
    {
    if (t != i->body->created_at) {
    iseq_deoptimize(i);
    }
    }

  25. diff --git a/vm_insnhelper.c b/vm_insnhelper.c
    index 3841801..a46028c 100644
    --- a/vm_insnhelper.c
    +++ b/vm_insnhelper.c
    @@ -152,20 +172,21 @@
    static inline rb_control_frame_t *
    vm_push_frame(rb_thread_t *th,
    const rb_iseq_t *iseq,
    VALUE type,
    VALUE self,
    VALUE specval,
    VALUE cref_or_me,
    const VALUE *pc,
    VALUE *sp,
    int local_size,
    int stack_max)
    {
    rb_control_frame_t *const cfp = th->cfp - 1;
    int i;
    vm_check_frame(type, specval, cref_or_me);
    VM_ASSERT(local_size >= 1);
    + iseq_deoptimize_if_needed(iseq, ruby_vm_global_timestamp);
    /* check stack overflow */
    CHECK_VM_STACK_OVERFLOW0(cfp, sp, local_size + stack_max);

  26. Almost no overheads
    class C
      def method_missing mid
      end
    end
    obj = C.new
    i = 0
    while i < 6_000_000 # benchmark loop 2
      i += 1
      obj.m; obj.m; obj.m; obj.m; obj.m; obj.m; obj.m; obj.m;
    end
    [Bar chart: vm2_method_missing* execution time in seconds, trunk vs ours: trunk 2.441, ours 2.412]

  27. Deoptimization
    • We made a deoptimization engine of ruby.
    • Its main characteristics include consistency of VM
    states such as program counter.
    • Very lightweight.

  28. Optimize on it
    • Various optimizations can be thought of:
    • Eliminating send variants,
    • Constant folding,
    • Eliminating unused variables.

  29. Folding constants
    --- /dev/shm/1wqj345 2016-08-17 14:23:36.000000000 +0900
    +++ /dev/shm/978zae 2016-08-17 14:23:36.000000000 +0900
    @@ -20,9 +20,13 @@ local table (size: 2, argc: 0 [opts: 0,
    |------------------------------------------------------------------------
    0000 trace 256 ( 8)
    0002 trace 1 ( 9)
    -0004 getinlinecache 13,
    -0007 getconstant :RubyVM
    -0009 getconstant :InstructionSequence
    -0011 setinlinecache
    +0004 putobject RubyVM::InstructionSequence
    +0006 nop
    +0007 nop
    +0008 nop
    +0009 nop
    +0010 nop
    +0011 nop
    +0012 nop
    0013 trace 512 ( 10)
    0015 leave ( 9)

  30. Folding constants
    • Constants are already inline-cached.
    • Just replace the getinlinecache in question with putobject, and fill the rest of the sequence with nops (a scenario sketch follows).
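    A sketch of the scenario this folding has to survive (A and m are made-up names; reassigning a constant warns but is legal Ruby):

    A = 1
    def m
      A      # compiled to getinlinecache/getconstant/…; foldable to `putobject 1`
    end
    m        # after this call the sequence can be patched to return the cached 1

    A = 2    # warning: already initialized constant; bumps the VM timestamp
    m        #=> 2: the folded sequence must be deoptimized, not keep returning 1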

  31. diff --git a/insns.def b/insns.def
    index 0c71b32..cf7f009 100644
    --- a/insns.def
    +++ b/insns.def
    @@ -1319,6 +1319,10 @@ getinlinecache
    {
    if (ic->ic_serial == GET_GLOBAL_CONSTANT_STATE() &&
    (ic->ic_cref == NULL || ic->ic_cref == rb_vm_get_cref(GET_EP()))) {
    + const rb_iseq_t *i = GET_ISEQ();
    + const VALUE *p = GET_PC();
    +
    + iseq_const_fold(i, p, OPN_OF_CURRENT_INSN + 1, dst, ic->ic_value.value);
    val = ic->ic_value.value;
    JUMP(dst);
    }

  32. void
    iseq_const_fold(
        const rb_iseq_t *restrict i,
        const VALUE *pc,
        int n,
        long m,
        VALUE konst)
    {
        VALUE *buf = (VALUE *)&pc[-n];
        int len = n + m;
        memcpy(buf, wipeout_pattern, len * sizeof(VALUE));  /* “nop nop nop …” */
        buf[0] = putobject;
        buf[1] = konst;
    }

  33. Folding 1+2
    --- /dev/shm/xj79gt 2016-08-17 17:09:31.000000000 +0900
    +++ /dev/shm/1gaaeo 2016-08-17 17:09:31.000000000 +0900
    @@ -20,8 +20,10 @@ local table (size: 2, argc: 0 [opts: 0,
    |------------------------------------------------------------------------
    0000 trace 256 ( 7)
    0002 trace 1 ( 8)
    -0004 putobject_OP_INT2FIX_O_1_C_
    -0005 putobject 2
    -0007 opt_plus ,
    +0004 putobject 3
    +0006 nop
    +0007 nop
    +0008 nop
    +0009 nop
    0010 trace 512 ( 9)
    0012 leave ( 8)

  34. diff --git a/insns.def b/insns.def
    index cf7f009..9bf6025 100644
    --- a/insns.def
    +++ b/insns.def
    @@ -1459,23 +1458,28 @@ opt_plus
    #else
    val = LONG2NUM(FIX2LONG(recv) + FIX2LONG(obj));
    #endif
    + TRY_CONSTFOLD(val);
    }
    else if (FLONUM_2_P(recv, obj) &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) {
    val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj));
    + TRY_CONSTFOLD(val);
    }
    else if (!SPECIAL_CONST_P(recv) && !SPECIAL_CONST_P(obj)) {
    if (RBASIC_CLASS(recv) == rb_cFloat && RBASIC_CLASS(obj) == rb_cFloat &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) {
    val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj));
    + TRY_CONSTFOLD(val);
    }
    else if (RBASIC_CLASS(recv) == rb_cString && RBASIC_CLASS(obj) == rb_cString &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, STRING_REDEFINED_OP_FLAG)) {
    val = rb_str_plus(recv, obj);
    + TRY_CONSTFOLD(val);
    }
    else if (RBASIC_CLASS(recv) == rb_cArray &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, ARRAY_REDEFINED_OP_FLAG)) {
    val = rb_ary_plus(recv, obj);
    + TRY_CONSTFOLD(val);
    }
    else {
    goto INSN_LABEL(normal_dispatch);
    @@ -1508,15 +1512,18 @@ opt_minus

  35. Elimination of send
    --- /dev/shm/179yavr 2016-08-12 19:41:44.000000000 +0900
    +++ /dev/shm/uma7jr 2016-08-12 19:41:44.000000000 +0900
    @@ -15,9 +15,12 @@
    |------------------------------------------------------------------------
    0000 trace 256 ( 8)
    0002 trace 1 ( 9)
    -0004 putself
    -0005 opt_send_without_block ,
    -0008 adjuststack 1
    +0004 nop
    +0005 nop
    +0006 nop
    +0007 nop
    +0008 nop
    +0009 nop
    0010 trace 1 ( 10)
    0012 nop
    0013 nop

  36. Method purity
    • A method eligible to be skipped is considered “pure”.
    • A method is marked not pure if …
    • It writes to variables other than local ones.
    • It yields.
    • It is not written in Ruby.
    • It calls other methods that are not pure.

  37. Methods that are not pure
    def m
      Time.now
    end

    def m
      @foo = self
    end

    def m
      yield
    end

    def m
      { foo: :bar }
    end

    rb_define_method(rb_cTCPServer, "sysaccept", tcp_sysaccept, 0);

  38. Methods that are pure
    def m(x)
      y = i = 0
      while i < x
        z = i % 2 == 0 ? 1 : -1
        y += z / (2 * i + 1.0)
        i += 1
      end
      return 4 * y
    end

    def m(x, y, z = ' ')
      n = y - x.length
      while n > 0 do
        n -= z.length
        x = z + x
      end
      return x
    end

  39. Method purity
    • “A method is either pure (optimizable) or not” is, in fact, an oversimplification.
    • There is a third state: indeterminate.
    • For instance, one cannot tell whether a method is pure when it calls something that is not defined, resulting in a call to method_missing.

  40. Method purity
    • So a method’s purity is determined on the fly.
    • Each method starts out with its purity unpredicted.
    • While the method runs, we collect usage information to detect its purity.
    • Once a method’s purity is determined, that information propagates to its callers (illustrated below).
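    A made-up example of that propagation:

    def leaf(x)
      x + 1         # touches nothing but locals: once run, observed to be pure
    end

    def trunk(x)
      leaf(x) * 2   # starts out unpredicted; once leaf is known to be pure,
                    # that fact propagates here and trunk becomes pure as well
    end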

  41. Method purity
    --- /dev/shm/179yavr 2016-08-12 19:41:44.000000000 +0900
    +++ /dev/shm/uma7jr 2016-08-12 19:41:44.000000000 +0900
    @@ -15,9 +15,12 @@
    |------------------------------------------------------------------------
    0000 trace 256 ( 8)
    0002 trace 1 ( 9)
    -0004 putself
    -0005 opt_send_without_block ,
    -0008 adjuststack 1
    +0004 nop
    +0005 nop
    +0006 nop
    +0007 nop
    +0008 nop
    +0009 nop
    0010 trace 1 ( 10)
    0012 nop
    0013 nop
    (slide annotation: “This”)

  42. enum insn_purity
    purity_of_cc(const struct rb_call_cache *cc)
    {
    const rb_iseq_t *i;
    if (! cc->me) {
    return insn_is_unpredictable; /* method missing */
    }
    else if (! (i = iseq_of_me(cc->me))) {
    return insn_is_not_pure; /* not written in ruby. */
    }
    else if (! i->body->attributes) {
    /* Note, we do not recursively analyze. That can lead to infinite
    * recursion on mutually recursive calls and detecting that is too
    * expensive in this hot path.*/
    return insn_is_unpredictable;
    }
    else {
    return purity_of_VALUE(RB_ISEQ_ANNOTATED_P(i, core::purity));
    }
    }

  43. enum insn_purity
    purity_of_sendish(const VALUE *argv)
    {
    enum ruby_vminsn_type insn = argv[0];
    const char *ops = insn_op_types(insn);
    enum insn_purity purity = insn_is_pure;
    for (int j = 0; j < insn_len(insn); j++) {
    if (ops[j] == TS_CALLCACHE) {
    struct rb_call_cache *cc = (void *)argv[j + 1];
    purity += purity_of_cc(cc);
    }
    }
    return purity;
    }

  44. Eliminating send-ish instructions
    • “Method calls whose return values are discarded” are subject to elimination.
    • A send itself just checks the called method’s purity; if the immediately following instruction then discards the return value, that preceding method call can be eliminated (source-level illustration below).
    • The actual elimination happens in pop, not in send.
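    In source terms the target looks like this (illustrative names only; pure_m stands for any method already known to be pure):

    def pure_m(x)
      x * 2        # pure: only reads its argument and locals
    end

    def caller_m
      pure_m(21)   # return value is discarded (a pop/adjuststack follows),
                   # so with pure_m known pure the whole send can be nop-ed out
      :done
    end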

  45. diff --git a/insns.def b/insns.def
    index c9d7204..2b877ff 100644
    --- a/insns.def
    +++ b/insns.def
    @@ -711,9 +722,17 @@ DEFINE_INSN
    adjuststack
    (rb_num_t n)
    (...)
    (...) // inc -= n
    {
    DEC_SP(n);
    + /* If the immediately preceding instruction was send (or one of
    + * its variants), and we are now in an adjuststack instruction,
    + * the return value of the method call is being silently
    + * discarded. Then why not avoid the whole method call?
    + * This is possible when the callee method is marked pure. Note
    + * however that even in such a case, evaluation of the method
    + * arguments cannot be skipped, because they can have side
    + * effects of their own.
    + */
    + vm_eliminate_insn(GET_CFP(), GET_PC(), OPN_OF_CURRENT_INSN + 1, n);
    }

  46. void
    iseq_eliminate_insn(
        const rb_iseq_t *restrict i,
        struct cfp_last_insn *restrict p,
        int n,
        rb_num_t m)
    {
        VALUE *buf = (VALUE *)&i->body->iseq_encoded[p->pc];
        int len = p->len + n;
        int argc = p->argc + m;
        memcpy(buf, wipeout_pattern, len * sizeof(VALUE));  /* “nop nop nop …” */
        if (argc != 0) {
            /* keep an adjuststack, in case arguments have side effects */
            buf[0] = adjuststack;
            buf[1] = argc;
        }
        ISEQ_RESET_ORIGINAL_ISEQ(i);
        FL_SET(i, ISEQ_NEEDS_ANALYZE);
    }

  47. Example of argument side-effect
    --- /dev/shm/165rrgd 2016-08-17 10:44:10.000000000 +0900
    +++ /dev/shm/jd0rcj 2016-08-17 10:44:10.000000000 +0900
    @@ -23,8 +23,10 @@ local table (size: 2, argc: 0 [opts: 0,
    0004 putself
    0005 putself
    0006 opt_send_without_block ,
    -0009 opt_send_without_block ,
    -0012 adjuststack 1
    +0009 adjuststack 2
    +0011 nop
    +0012 nop
    +0013 nop
    0014 trace 1 ( 16)
    0016 putnil
    0017 trace 512 ( 17)
    (slide annotations: “suppose we can’t optimize” ×3, on the surviving putself / putself / opt_send_without_block; “need clear stack top”, on the new adjuststack 2)

  48. Elimination of variables
    --- /dev/shm/ea2lud 2016-08-19 10:40:28.000000000 +0900
    +++ /dev/shm/1v4irx0 2016-08-19 10:40:28.000000000 +0900
    @@ -17,8 +17,10 @@ local table (size: 3, argc: 1 [opts: 0,
    [ 3] i [ 2] x
    0000 trace 256 ( 4)
    0002 trace 1 ( 5)
    -0004 putobject :foo
    -0006 setlocal_OP__WC__0 2
    +0004 nop
    +0005 nop
    +0006 nop
    +0007 nop
    0008 trace 1 ( 6)
    0010 putnil
    0011 trace 512 ( 7)
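    A plausible source for this diff, reconstructed from the local table [ 3 ] i [ 2 ] x (not shown in the deck):

    def m(i)
      x = :foo   # x is assigned but never read afterwards: write-only
      nil
    end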

  49. (… is a bit hard though)

  50. Elimination of variables
    • We eliminate variables that are assigned but never used afterwards (write-only).
    • Only methods that are pure can be considered.
    • Methods with side effects might access bindings.
    • Blocks can share local variables, so write-only-ness has to consider all reachable blocks (see the sketch below).
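    The block caveat in code (made-up example):

    def truly_write_only
      x = :foo               # never read again: eliminable
      nil
    end

    def looks_write_only
      x = :foo               # no read in this body...
      callback = proc { x }  # ...but the block captures x and may read it later,
      callback               # so the assignment must survive
    end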

  51. Elimination of variables
    • There might also be other kinds of variables that are safe to eliminate, but detecting them precisely on the fly is very difficult.

  52. Optimizations
    • Fairly basic optimizations are implemented.
    • All optimizations run on the fly and preserve VM state such as exception tables.
    • There is room for other optimization techniques, like subexpression elimination.

  53. Benchmarks
    • CAUTION: YMMV.
    • `make benchmark` results on my machine.
    • Not a brand-new box; its /proc/cpuinfo says “Intel(R) Core(TM)2 Duo CPU T7700”.
    • The following results show the average of 7 executions.

  54. Speedup ratio versus trunk (greater = faster)
    [Log-scale chart, 0.1–100, of per-benchmark speedup ratio over the full `make benchmark` suite (app_*, hash_*, io_*, loop_*, marshal_*, require*, so_*, vm1_*, vm2_*, vm3_*, vm_thread_*); annotations: Slower / Faster.]

  55. [Bar chart: execution time in seconds (axis 0–5), trunk vs ours, for vm1_simplereturn*, vm2_defined_method*, vm2_method*, vm2_poly_method*, vm2_super*, vm2_zsuper*; annotation: “Resulted in identical instruction sequences”; direction label: Faster. Data labels: 0.210, 0.196, 0.873, 0.467, 0.477, 0.230, 0.836, 0.768, 3.949, 1.920, 4.174, 1.256.]

  56. [Bar chart: execution time in seconds (axis 0–50), trunk vs ours, for app_pentomino, hash_aref_dsym_long, so_binary_trees, vm2_eval*, vm2_raise2*; annotation: “deoptimization overhead”; direction label: Faster. Data labels: 12.138, 40.683, 11.801, 10.517, 24.847, 11.237, 37.976, 10.439, 10.26, 23.407.]

  57. [Bar chart: execution time in seconds (axis 0–4), trunk vs ours, for vm1_attr_ivar*, vm1_attr_ivar_set*, vm1_block*, vm1_ivar*, vm1_ivar_set*, vm2_method_with_block*; direction label: Faster. Data labels: 2.642, 0.865, 0.739, 3.782, 2.379, 2.125, 2.039, 0.654, 0.625, 2.572, 1.807, 1.755.]

  58. [Bar chart: execution time in seconds (axis 0–12), trunk vs ours, for vm1_gc_short_lived*, vm2_array*, vm2_bigarray*, vm2_string_literal*; direction label: Faster. Data labels: 0.025, 0.027, 0.032, 1.691, 0.275, 11.483, 0.943, 8.904.]

  59. Speedup ratio versus trunk (greater = faster)
    [The same log-scale per-benchmark speedup chart as slide 54, shown again; annotations: Slower / Faster.]

  60. Benchmarks
    • Most benchmarks show the same performance.
    • The optimizations work dramatically well on several benchmarks.
    • There do exist cases of slowdown, but IMHO the overhead is marginal.

  61. Conclusion
    • Implemented deoptimization on top of Ruby 2.4.
    • Boosts Ruby execution by up to 400+ times.
    • Opens up room for many other optimizations.

  62. Future work
    • Other optimizations can be thought of, such as:
    • Subexpression elimination;
    • Variable liveness & escape analysis;
    • and more.
    • Allowing the program counter to be modified would open up room for further optimizations.

  63. FAQs
    • Q: Where is the patch?
    • A: https://github.com/ruby/ruby/pull/1419
    • Q: Does this speed up Rails?
    • A: Not really.
    • Q: Does this achieve Ruby 3x3?
    • A: It depends (the 3x3 goal is vague), but I believe I’m on the right path.
