The send-pop optimisation

The send-pop optimisation Urabe, Shyouhei Photo by Tomoyuki Kengaku

In a nutshell, this talk is about…
• The “send-pop” sequence we focus on in this talk is a pattern that appears very frequently in Ruby programs.
• We propose detecting such sequences automatically and letting the interpreter optimise them.
• This optimisation improves some benchmark results, including Rails.

Motivations

    def foo
      something
      something_another
      return something_else
    end

    == disasm: #<ISeq:foo@<compiled>:1 (1,0)-(5,3)> (catch: FALSE)
    0000 putself
    0001 send <callinfo!mid:something, argc:0,
    0005 pop
    0006 putself
    0007 send <callinfo!mid:something_another,
    0011 pop
    0012 putself
    0013 send <callinfo!mid:something_else, ar
    0017 leave

The “send-pop” sequence
• Calling a method, then immediately discarding its return value.
• Note that every method in Ruby has return value(s).
• The returned value(s), however, do not have to be used.
• Even when a method does not expect its caller to use its return value, it has to return something “just in case” that expectation is broken.
• This is a waste of both time and memory.
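To see the pattern on your own interpreter, a minimal sketch follows. It assumes CRuby, where `RubyVM::InstructionSequence` is available. The exact listing differs between Ruby versions (for example, `send` may appear as `opt_send_without_block`), so the sketch only counts `pop` instructions rather than comparing the whole output.

```ruby
# Compile the method from the slide and look for `pop` instructions
# that discard the return values of `something` and `something_another`.
src = <<~RUBY
  def foo
    something
    something_another
    return something_else
  end
RUBY

disasm = RubyVM::InstructionSequence.compile(src).disasm

# Count instruction lines whose opcode is exactly `pop`.
pop_count = disasm.lines.count { |line| line =~ /^\d{4}\s+pop\b/ }

puts disasm
puts "pop instructions found: #{pop_count}"
```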

But how often?
• By taking 2-grams of a mame/optcarrot execution…

    % LANG=C sort 2gram.txt | uniq -c | sort -nr | head -n 10
    69065813 getinstancevariable -> getinstancevariable
    65600442 putself -> getinstancevariable
    59624140 getinstancevariable -> branchunless
    59116388 branchunless -> getinstancevariable
    52828407 leave -> pop
    50434175 getinstancevariable -> putobject
    30368815 pop -> putself
    27717161 setinstancevariable -> getinstancevariable
    25661090 branchunless -> putself
    25165032 getinstancevariable -> branchif

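But how often?
• By taking 2-grams of a mame/optcarrot execution, the sequence in question is the fifth most frequent. (It shows up as `leave -> pop`, since the callee’s `leave` is the instruction executed right before the caller’s `pop`.)
• This is definitely worth consideration.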
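The counting itself can be sketched in a few lines of Ruby. This is an illustration only: the `trace` array below is a made-up stand-in for a recorded instruction trace, and `bigram_counts` is a hypothetical helper, not the tooling used for the talk.

```ruby
# Count adjacent instruction pairs (2-grams) in an execution trace,
# most frequent first.
def bigram_counts(trace)
  trace.each_cons(2)
       .map { |a, b| "#{a} -> #{b}" }
       .tally
       .sort_by { |_pair, count| -count }
end

# Hypothetical trace: three calls, two of whose results are popped.
trace = %w[
  putself send leave pop
  putself send leave pop
  putself send leave leave
]

counts = bigram_counts(trace)
counts.each { |pair, count| puts format("%8d %s", count, pair) }
```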

Relax them

First step: allow arbitrary return values
• We cannot entirely eliminate return values.
• In the wild, there already are methods written in C.
• They cannot be modified, and they already return something.
• The best we can do is to allow methods to return arbitrary values when those values are not used by their callers.
• Let each method decide what to return. We can auto-optimise pure-Ruby methods later.

Pass a 1-bit flag to each method
• Every time a method is called, some flags are already passed to it.
• Why not add another one that describes whether its return value(s) are used?

    diff --git a/vm_core.h b/vm_core.h
    index 574837dea0..513b8b85c1 100644
    --- a/vm_core.h
    +++ b/vm_core.h
    @@ -1132,11 +1133,11 @@ typedef rb_control_frame_t *
     enum {
         /* Frame/Environment flag bits:
    -     *   MMMM MMMM MMMM MMMM ____ __FF FFFF EEEX (LSB)
    +     *   MMMM MMMM MMMM MMMM ____ _FFF FFFF EEEX (LSB)
          *
          * X   : tag for GC marking (It seems as Fixnum)
          * EEE : 3 bits Env flags
    -     * FF..: 6 bits Frame flags
    +     * FF..: 7 bits Frame flags
          * MM..: 15 bits frame magic (to check frame corruption)
          */
    @@ -1160,6 +1161,7 @@ enum {
         VM_FRAME_FLAG_CFRAME = 0x0080,
         VM_FRAME_FLAG_LAMBDA = 0x0100,
         VM_FRAME_FLAG_MODIFIED_BLOCK_PARAM = 0x0200,
    +    VM_FRAME_FLAG_POPPED = 0x0400,
         /* env flag */
         VM_ENV_FLAG_LOCAL = 0x0002,
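As a rough model of what the extra bit means, here is a Ruby sketch of the flag word. The constant values are the ones shown in the `vm_core.h` diff. The helpers `frame_flags` and `popped?` are hypothetical names used only for illustration and do not exist in CRuby.

```ruby
# Flag values copied from the vm_core.h diff.
VM_ENV_FLAG_LOCAL                  = 0x0002
VM_FRAME_FLAG_CFRAME               = 0x0080
VM_FRAME_FLAG_LAMBDA               = 0x0100
VM_FRAME_FLAG_MODIFIED_BLOCK_PARAM = 0x0200
VM_FRAME_FLAG_POPPED               = 0x0400 # new: "return value is not used"

# Hypothetical helper: the flag word a caller would hand to the callee.
def frame_flags(return_value_used:)
  flags = VM_ENV_FLAG_LOCAL
  flags |= VM_FRAME_FLAG_POPPED unless return_value_used
  flags
end

# Hypothetical helper: what the callee would test.
def popped?(flags)
  (flags & VM_FRAME_FLAG_POPPED) != 0
end

puts popped?(frame_flags(return_value_used: true))
puts popped?(frame_flags(return_value_used: false))
```

The new flag occupies bit 10, which is why the frame-flag field in the comment grows from 6 to 7 bits.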

    diff --git a/vm_insnhelper.c b/vm_insnhelper.c
    index a2f7433029..b024b29fc6 100644
    --- a/vm_insnhelper.c
    +++ b/vm_insnhelper.c
    @@ -1767,12 +1767,13 @@ static inline VALUE
     vm_call_iseq_setup_normal(rb_execution_context_t *ec, rb_control_frame_t *cfp, struct rb_calling_in
                               int opt_pc, int param_size, int local_size)
     {
    +    int popped = calling->popped;
         const rb_iseq_t *iseq = def_iseq_ptr(me->def);
         VALUE *argv = cfp->sp - calling->argc;
         VALUE *sp = argv + param_size;
         cfp->sp = argv - 1 /* recv */;
    -    vm_push_frame(ec, iseq, VM_FRAME_MAGIC_METHOD | VM_ENV_FLAG_LOCAL, calling->recv,
    +    vm_push_frame(ec, iseq, VM_FRAME_MAGIC_METHOD | VM_ENV_FLAG_LOCAL | popped, calling->recv,
                       calling->block_handler, (VALUE)me,
                       iseq->body->iseq_encoded + opt_pc, sp,
                       local_size - param_size,
    @@ -1791,6 +1792,7 @@ vm_call_iseq_setup_tailcall(rb_execution_context_t *ec, rb_control_frame_t *cf
         VALUE *src_argv = argv;
         VALUE *sp_orig, *sp;
         VALUE finish_flag = VM_FRAME_FINISHED_P(cfp) ? VM_FRAME_FLAG_FINISH : 0;
    +    unsigned long popped = VM_ENV_FLAGS(cfp->ep, VM_FRAME_FLAG_POPPED);
         if (VM_BH_FROM_CFP_P(calling->block_handler, cfp)) {
             struct rb_captured_block *dst_captured = VM_CFP_TO_CAPTURED_BLOCK(RUBY_VM_PREVIOUS_CONTROL_
    @@ -1818,7 +1820,7 @@ vm_call_iseq_setup_tailcall(rb_execution_context_t *ec, rb_control_frame_t *cf
             *sp++ = src_argv[i];

Use the flag

Let pure-Ruby methods check that flag
• We can make pure-Ruby methods check that flag automatically, so that they can skip their trailing instructions.
• For instance, when we have:

    def foo(x)
      y = bar(x)
      return y
    end

    == disasm: #<ISeq:foo@<compiled>:1 (1,2)-(4,5)> (catch: FALSE)
    local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block:
    [ 2] x@0<Arg> [ 1] y@1
    0000 putself
    0001 getlocal x@0, 0
    0004 send <callinfo!mid:bar, argc:1, FCALL
    0008 setlocal y@1, 0
    0011 getlocal y@1, 0
    0014 leave

Waste of time if the value returned is not used.

    argc: 1 [opts: 0, rest: -1, post: 0, block:
    [ 2] x@0<Arg> [ 1] y@1
    0000 putself
    0001 getlocal x@0, 0
    0004 send <callinfo!mid:bar, argc:1, FCALL
    0008 opt_bailout 1
    0010 setlocal y@1, 0
    0013 getlocal y@1, 0 ( 3)
    0016 leave

    +/* This instruction is no-op unless the instruction sequence is called
    + * with VM_FRAME_FLAG_POPPED.  With that flag on, it immediately
    + * leaves the current stack frame with scratching the topmost n stack
    + * values.  The return value of the iseq for that case is always
    + * nil. */
    +DEFINE_INSN
    +opt_bailout
    +(rb_num_t n)
    +()
    +()
    +{
    +#ifdef MJIT_HEADER
    +    /* :FIXME: don't know how to make it work with JIT... */
    +#else
    +    if (VM_ENV_FLAGS(GET_EP(), VM_FRAME_FLAG_POPPED) &&
    +        CURRENT_INSN_IS(opt_bailout) /* <- rule out trace instruction */ ) {
    +        POPN(n);
    +        PUSH(Qnil);
    +        DISPATCH_ORIGINAL_INSN(leave);
    +    }
    +#endif
    +}
    +
     /**********************************************************/
     /* deal with control flow 3: exception                    */
     /**********************************************************/
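The behaviour described in the comment can be modelled with a toy stack machine in Ruby. This is a sketch for intuition only: the `run` function and its `:push`, `:setlocal`, `:getlocal` opcodes are hypothetical simplifications, with `[:push, 42]` standing in for the `send` that leaves `bar`'s result on the stack.

```ruby
# Toy interpreter: returns [return_value, list_of_executed_opcodes].
def run(insns, popped:)
  stack = []
  locals = {}
  executed = []
  insns.each do |op, arg|
    executed << op
    case op
    when :push     then stack.push(arg)
    when :setlocal then locals[arg] = stack.pop
    when :getlocal then stack.push(locals[arg])
    when :opt_bailout
      next unless popped       # no-op when the return value is used
      stack.pop(arg)           # scratch the topmost n values
      stack.push(nil)          # the frame returns nil in this case
      return [stack.pop, executed]
    when :leave
      return [stack.pop, executed]
    end
  end
end

body = [
  [:push, 42],        # stand-in for `send :bar`
  [:opt_bailout, 1],
  [:setlocal, :y],
  [:getlocal, :y],
  [:leave, nil]
]

used_value, used_path = run(body, popped: false)
skipped_value, skipped_path = run(body, popped: true)
puts "used: #{used_value.inspect} after #{used_path.size} instructions"
puts "popped: #{skipped_value.inspect} after #{skipped_path.size} instructions"
```

With the flag set, the two local-variable instructions and the final `leave` are never dispatched in this model, which is the saving the instruction is after.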

Automatic insertion of it

Make the insertion automatic
• What operations are safe to skip when a return value is not used?
• Obviously, not every operation is.
• That concept should be identical to what we call “pure” operations, proposed at RubyKaigi 2016.

Automatic bail out of a method
• Instead of classifying a whole method as pure or not, we are going to focus on the trailing part of each method that is pure.
• Such a part, if any, does no useful work when the return value of the method is discarded.

    argc: 1 [opts: 0, rest: -1, post: 0, block:
    [ 2] x@0<Arg> [ 1] y@1
    0000 putself                                 pure
    0001 getlocal x@0, 0                         pure
    0004 send <callinfo!mid:bar, argc:1, FCALL   not pure
    0008 setlocal y@1, 0                         pure
    0011 getlocal y@1, 0                         pure
    0014 leave                                   pure

    argc: 1 [opts: 0, rest: -1, post: 0, block:
    [ 2] x@0<Arg> [ 1] y@1
    0000 putself
    0001 getlocal x@0, 0
    0004 send <callinfo!mid:bar, argc:1, FCALL
    0008 opt_bailout 1
    0010 setlocal y@1, 0
    0013 getlocal y@1, 0 ( 3)
    0016 leave
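One way to picture the insertion pass is sketched below in Ruby. It is not the compiler code from the patch: the `Step` struct, its `sp_inc` field (net stack effect) and the rule of skipping methods whose pure tail is only `leave` are assumptions made for this illustration.

```ruby
# name: mnemonic, pure: safe to skip, sp_inc: net change in stack depth.
Step = Struct.new(:name, :pure, :sp_inc)

# Insert `opt_bailout n` right before the trailing run of pure
# instructions, where n is the stack depth at that point.
def insert_bailout(steps)
  last_impure = steps.rindex { |s| !s.pure }
  cut = last_impure ? last_impure + 1 : 0
  tail = steps[cut..]
  return steps.dup if tail.size <= 1   # nothing to skip except `leave`

  depth = steps.first(cut).sum(&:sp_inc)
  steps.first(cut) + [Step.new("opt_bailout #{depth}", true, 0)] + tail
end

foo = [
  Step.new("putself",     true,  +1),
  Step.new("getlocal x",  true,  +1),
  Step.new("send bar",    false, -1),  # pops receiver and argument, pushes result
  Step.new("setlocal y",  true,  -1),
  Step.new("getlocal y",  true,  +1),
  Step.new("leave",       true,  -1)
]

rewritten = insert_bailout(foo).map(&:name)
puts rewritten
```

In this model the operand comes out as 1 for `foo`, matching the `opt_bailout 1` in the listing: only the result of `bar` is on the stack at that point.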

C API (nit-picky)

Can we also optimize C methods?
• We cannot auto-skip a part of a C method.
• But the `VM_FRAME_FLAG_POPPED` flag is always set, regardless of whether the called method is written in Ruby or not.
• Why not make it visible from C, so that future methods can look at it?

    diff --git a/vm.c b/vm.c
    index c5beed64c0..d33ff98619 100644
    --- a/vm.c
    +++ b/vm.c
    @@ -3544,4 +3544,14 @@ vm_collect_usage_register(int reg, int isset)
     #endif /* #ifndef MJIT_HEADER */
    +int
    +rb_whether_the_return_value_is_used_p(void)
    +{
    +    const struct rb_execution_context_struct *ec = GET_EC();
    +    const struct rb_control_frame_struct *reg_cfp = ec->cfp;
    +    const VALUE *ep = GET_EP();
    +
    +    return ! VM_ENV_FLAGS(ep, VM_FRAME_FLAG_POPPED);
    +}
    +
     #include "vm_call_iseq_optimized.inc" /* required from vm_insnhelper.c */

Practical applications
• `StringScanner#scan` scans the receiver, advances its internal pointer, then returns the matched string. Building the “matched string” can be skipped by checking the flag.
• Exactly the same argument applies to `String#slice!`.
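The kind of call site that would benefit looks like the following Ruby snippet. It only shows today's behaviour of the two methods when called for their side effects. Whether they would skip building the unused string depends on the proposed C API being adopted, which is an assumption here.

```ruby
require "strscan"

scanner = StringScanner.new("foo bar")
scanner.scan(/\w+/)   # matched string "foo" is built, then discarded
scanner.scan(/\s+/)   # same for the whitespace; only the pointer matters
position = scanner.pos

text = +"hello world"
text.slice!(0, 6)     # removed part "hello " is built, then discarded

puts position
puts text
```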

Eliminating `pop`s

Recap:

    def foo
      something
      something_another
      return something_else
    end

    <callinfo!mid:something, argc:0,
    0005 pop
    0006 putself
    0007 send <callinfo!mid:something_another,
    0011 pop
    0012 putself
    0013 send <callinfo!mid:something_else, ar
    0017 leave

We would like to eliminate those `pop`s.

The same sequence with those `pop`s eliminated:

    <callinfo!mid:something, argc:0,
    0005 putself
    0006 send <callinfo!mid:something_another,
    0010 putself
    0011 send <callinfo!mid:something_else, ar
    0015 leave

Note however, that:
• The elimination is not always possible.
• That `pop` can be a jump destination.
• For an (illustrative) example:

    def foo
      self&.x
      nil
    end

    == disasm: #<ISeq:foo@<compiled>:1 (1,0)-(4,3)> (catch: FALSE)
    0000 putself
    0001 dup
    0002 branchnil 7
    0004 opt_send_without_block <callinfo!mid:x, argc:0, ARG
    0007 pop
    0008 putnil
    0009 leave

This `pop` is not optimizable: it is the destination of `branchnil 7`.
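The eligibility rule can be sketched as a small Ruby check over an instruction list. The `Insn` struct and `removable_pops` are hypothetical names for this illustration. The two listings are transcribed from the disassemblies in the slides, with operands other than jump targets dropped.

```ruby
# pc: program counter, name: mnemonic, jump_to: branch target pc (or nil).
Insn = Struct.new(:pc, :name, :jump_to)

# A `pop` can go only if it directly follows a send-like instruction
# and no branch jumps to it.
def removable_pops(insns)
  targets = insns.map(&:jump_to).compact
  insns.each_cons(2)
       .select { |prev, cur|
         cur.name == "pop" && prev.name.include?("send") && !targets.include?(cur.pc)
       }
       .map { |_prev, cur| cur.pc }
end

plain = [
  Insn.new(0, "putself"), Insn.new(1, "send"), Insn.new(5, "pop"),
  Insn.new(6, "putself"), Insn.new(7, "send"), Insn.new(11, "pop"),
  Insn.new(12, "putself"), Insn.new(13, "send"), Insn.new(17, "leave")
]

safe_navigation = [
  Insn.new(0, "putself"), Insn.new(1, "dup"), Insn.new(2, "branchnil", 7),
  Insn.new(4, "opt_send_without_block"), Insn.new(7, "pop"),
  Insn.new(8, "putnil"), Insn.new(9, "leave")
]

plain_result = removable_pops(plain)
safe_nav_result = removable_pops(safe_navigation)
puts plain_result.inspect
puts safe_nav_result.inspect
```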

Let us add another frame flag
• It is called `VM_FRAME_FLAG_POPIT`.
• This flag denotes that the `pop` instruction in the caller was optimised out of the sequence.
• Hence, when the flag is set, it is the callee’s duty, not the caller’s, to skip pushing the return value(s).

    diff --git a/vm_core.h b/vm_core.h
    index 0b3f3e06ba..932c70a734 100644
    --- a/vm_core.h
    +++ b/vm_core.h
    @@ -1134,11 +1136,11 @@ typedef rb_control_frame_t *
     enum {
         /* Frame/Environment flag bits:
    -     *   MMMM MMMM MMMM MMMM ____ _FFF FFFF EEEX (LSB)
    +     *   MMMM MMMM MMMM MMMM ____ FFFF FFFF EEEX (LSB)
          *
          * X   : tag for GC marking (It seems as Fixnum)
          * EEE : 3 bits Env flags
    -     * FF..: 7 bits Frame flags
    +     * FF..: 8 bits Frame flags
          * MM..: 15 bits frame magic (to check frame corruption)
          */
    @@ -1163,6 +1165,7 @@ enum {
         VM_FRAME_FLAG_LAMBDA = 0x0100,
         VM_FRAME_FLAG_MODIFIED_BLOCK_PARAM = 0x0200,
         VM_FRAME_FLAG_POPPED = 0x0400,
    +    VM_FRAME_FLAG_POPIT = 0x0800,
         /* env flag */
         VM_ENV_FLAG_LOCAL = 0x0002,

    <callinfo!mid:something, argc:0,
    0005 pop
    0006 putself
    0007 send <callinfo!mid:something_another,
    0011 pop
    0012 putself
    0013 send <callinfo!mid:something_else, ar
    0017 leave

    <callinfo!mid:something, argc:0,
    0005 putself
    0006 send <callinfo!mid:something_another,
    0010 putself
    0011 send <callinfo!mid:something_else, ar
    0015 leave

    == disasm: #<ISeq:foo@<compiled>:1 (1,0)-(5,3)> (catch: FALSE)
    0000 putself                                                                                  ( 2)[LiCa]
    0001 send <callinfo!mid:something, argc:0, FCALL|VCALL|ARGS_SIMPLE, [POPIT]>, <callcache>, nil
    0005 putself                                                                                  ( 3)[Li]
    0006 send <callinfo!mid:something_another, argc:0, FCALL|VCALL|ARGS_SIMPLE, [POPIT]>, <callcache>, nil
    0010 putself                                                                                  ( 4)[Li]
    0011 send <callinfo!mid:something_else, argc:0, FCALL|VCALL|ARGS_SIMPLE>, <callcache>, nil
    0015 leave                                                                                    ( 5)[Re]

Avoid pushing return values

In order to properly avoid pushing…
• We have to consider 3 (!) distinct situations.
  • Returning from a method written in C.
  • Returning from a method written in Ruby.
  • Returning from inside of a block.

C method return values
• C methods return values using C’s return semantics. Just discarding them should suffice.

    VALUE
    foo(VALUE x)
    {
        VALUE y = complex_calculation(x);
        return y;
    }

    diff --git a/tool/ruby_vm/views/_insn_entry.erb b/tool/ruby_vm/views/_insn_entry.erb
    index cdadd93abc..bbfe539fd2 100644
    --- a/tool/ruby_vm/views/_insn_entry.erb
    +++ b/tool/ruby_vm/views/_insn_entry.erb
    @@ -56,7 +58,18 @@ INSN_ENTRY(<%= insn.name %>)
         /* ### Instruction trailers. ### */
         CHECK_VM_STACK_OVERFLOW_FOR_INSN(VM_REG_CFP, INSN_ATTR(retn));
     <%= insn.handle_canary "CHECK_CANARY()" -%>
    -% if insn.handles_sp?
    +% if insn.sendish? # Then we can safely assume there is only one return value.
    +%   if insn.handles_sp?
    +    if (! (ci->compiled_frame_bits & VM_FRAME_FLAG_POPIT)) {
    +        PUSH(<%= insn.cast_to_VALUE insn.rets.first %>);
    +    }
    +%   else
    +    INC_SP(INSN_ATTR(sp_inc));
    +    if (! (ci->compiled_frame_bits & VM_FRAME_FLAG_POPIT)) {
    +        TOPN(0) = <%= insn.cast_to_VALUE insn.rets.first %>;
    +    }
    +%   end
    +% elsif insn.handles_sp?
     %   insn.rets.reverse_each do |ret|
         PUSH(<%= insn.cast_to_VALUE ret %>);
     %   end

Ruby method return values
• Ruby methods (normally) return values using the `leave` instruction.

    def foo(x)
      return x + 1
    end

    argc: 1 [opts: 0, rest: -1, post: 0, block:
    [ 1] x@0<Arg>
    0000 getlocal x@0, 0
    0003 putobject 1
    0005 send <callinfo!mid:+, argc:1, ARG
    0009 leave

    diff --git a/insns.def b/insns.def
    index a38dc30168..68e7eabfae 100644
    --- a/insns.def
    +++ b/insns.def
    @@ -927,7 +911,7 @@ DEFINE_INSN
     leave
     ()
     (VALUE val)
    -(VALUE val)
    +(...)
     /* This is super surprising but when leaving from a frame, we check
      * for interrupts.  If any, that should be executed on top of the
      * current execution context.  This is a method call. */
    @@ -939,7 +923,10 @@ leave
     // attr enum rb_insn_purity purity = rb_insn_is_pure;
     /* And this instruction handles SP by nature. */
     // attr bool handles_sp = true;
    +// attr rb_snum_t sp_inc = 0;
     {
    +    bool popit = VM_ENV_FLAGS(GET_EP(), VM_FRAME_FLAG_POPIT);
    +
         if (OPT_CHECKED_RUN) {
             const VALUE *const bp = vm_base_ptr(reg_cfp);
             if (reg_cfp->sp != bp) {
    @@ -959,6 +946,9 @@ leave
             }
             else {
                 RESTORE_REGS();
    +            if (! popit) {
    +                PUSH(val);
    +            }
             }
         }
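The contract that both the C-method trailer and `leave` implement can be modelled in a few lines of Ruby. This is a toy, not VM code: `leave_frame` is a hypothetical function, and only the flag value is taken from the `vm_core.h` diff.

```ruby
VM_FRAME_FLAG_POPIT = 0x0800 # value from the vm_core.h diff

# Models the end of `leave`: once the caller's registers are restored,
# the return value is pushed onto the caller's stack only if the
# caller did not have its `pop` optimised out.
def leave_frame(caller_stack, val, frame_flags)
  popit = (frame_flags & VM_FRAME_FLAG_POPIT) != 0
  caller_stack.push(val) unless popit
  caller_stack
end

used      = leave_frame([], 43, 0)
discarded = leave_frame([], 43, VM_FRAME_FLAG_POPIT)
puts used.inspect
puts discarded.inspect
```

Either way the caller's stack ends up in the state its remaining instructions expect, which is what makes removing the `pop` safe.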

So far so good… but,
• It immediately gets complicated when a block has a return statement.

    def foo(x)
      x.times do |i|
        return i
      end
    end
    p foo(42) # => 0

So far so good… but,
• It immediately gets complicated when a block has a return statement.

    def foo(x)
      x.times &-> (i) do
        return i
      end
    end
    p foo(42) # => 42
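The two examples differ only in what `return` leaves. A runnable side-by-side version is sketched below. The method names are made up for this illustration, and the lambda is passed with explicit parentheses to avoid any parsing ambiguity around `&`.

```ruby
# `return` inside a plain block leaves the enclosing method.
def foo_with_block(x)
  x.times do |i|
    return i
  end
end

# `return` inside a lambda leaves only the lambda, so `times` runs
# to completion and its own result (the receiver) is returned.
def foo_with_lambda(x)
  x.times(&->(i) { return i })
end

block_result  = foo_with_block(42)
lambda_result = foo_with_lambda(42)
p block_result
p lambda_result
```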

What “return-inside-of-a-block” does:
1. Look for the exact place where execution should resume.
2. Rewind the stack.
3. Push the return value onto the stack.
4. Continue executing.

Step 3 has to be cancelled when the flag is set. However, by that point the flag has already been squashed.

    diff --git a/vm.c b/vm.c
    index 807a20ee5a..057863e5e3 100644
    --- a/vm.c
    +++ b/vm.c
    @@ -1926,6 +1926,7 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
                              VALUE errinfo, VALUE *initial)
     {
         struct vm_throw_data *err = (struct vm_throw_data *)errinfo;
    +    bool popit = false;
         for (;;) {
             unsigned int i;
    @@ -1950,6 +1951,7 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
                                 rb_vm_frame_method_entry(ec->cfp)->owner,
                                 rb_vm_frame_method_entry(ec->cfp)->def->original_id);
                     }
    +                popit = VM_ENV_FLAGS(ec->cfp->ep, VM_FRAME_FLAG_POPIT);
                     rb_vm_pop_frame(ec);
                 }
    @@ -1983,6 +1985,7 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
                             ec->errinfo = Qnil;
                             THROW_DATA_CATCH_FRAME_SET(err, cfp + 1);
                             hook_before_rewind(ec, ec->cfp, TRUE, state, err);
    +                        popit = VM_ENV_FLAGS(ec->cfp->ep, VM_FRAME_FLAG_POPIT);
                             rb_vm_pop_frame(ec);
                             return THROW_DATA_VAL(err);
                         }
    @@ -1994,7 +1997,9 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
     #if OPT_STACK_CACHING
                             *initial = THROW_DATA_VAL(err);
     #else
    -                        *ec->cfp->sp++ = THROW_DATA_VAL(err);
    +                        if (! popit) {
    +                            *ec->cfp->sp++ = THROW_DATA_VAL(err);
    +                        }
     #endif
                             ec->errinfo = Qnil;
                             return Qundef;
    @@ -2128,12 +2133,14 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
                 hook_before_rewind(ec, ec->cfp, FALSE, state, err);
                 if (VM_FRAME_FINISHED_P(ec->cfp)) {
    +                popit = VM_ENV_FLAGS(ec->cfp->ep, VM_FRAME_FLAG_POPIT);
                     rb_vm_pop_frame(ec);
                     ec->errinfo = (VALUE)err;
                     ec->tag = ec->tag->prev;
                     EC_JUMP_TAG(ec, state);
                 }
                 else {
    +                popit = VM_ENV_FLAGS(ec->cfp->ep, VM_FRAME_FLAG_POPIT);
                     rb_vm_pop_frame(ec);
                 }
             }

Benchmarks

Several benchmarks were exercised.
• Caution: YMMV.
• All benchmarks were run on the very machine I am projecting this presentation from: a 6th gen. ThinkPad X1 Carbon.
• They all compare trunk (2.7.0 revision 67168) with ours (the proposed patch applied on top of trunk).

The `make benchmark` results
• This set of benchmarks is considered micro: it consists of many small Ruby scripts. Each of them tends to shed light on one specific part of the VM.

[Figure: speedup ratio versus trunk (greater = faster) for so_ackermann, so_array, so_binary_trees, so_concatenate, so_exception, so_fannkuch, so_fasta, so_lists, so_mandelbrot, so_matrix, so_meteor_contest, so_nbody, so_nested_loop, so_nsieve, so_nsieve_bits, so_object, so_partial_sums, so_pidigits, so_random, so_sieve, so_spectralnorm]

[Figure: speedup ratio versus trunk (greater = faster) for vm1_attr_ivar_set, vm1_gc_wb_obj, vm2_method, vm2_method_with_block, vm2_send, vm2_struct_small_aref]

The `make benchmark` results
• The majority of the results are almost the same. Whether slower or faster, they differ only very slightly.
• There are a few notable benchmarks where our proposal clearly outperforms trunk.
• On the other hand, no benchmark seems to show a clear slowdown for our proposal.
• This tendency is roughly the same as what we saw in 2016.

Mid-sized benchmarks
• We tested `time make rdoc`, which has historically been considered a benchmark that reflects real-world use cases.
• We also tested mame/optcarrot, which was made for benchmarking various Ruby implementations.

[Figure: `time make rdoc` [sec] (greater = slower): trunk 23.13, ours 23.58]

[Figure: Optcarrot Lan_Master.nes [fps] (greater = faster): trunk 42.554, ours 43.276]

Mid-sized benchmarks
• We have to say they are almost the same.
• Rdoc got slower, optcarrot got faster. We see these results consistently.
• There might be reasons behind them but, … well, isn’t it enough to say that we see no significant changes?

Rails application
• discourse/discourse comes with a benchmark script, so we tested our changeset against it.
• The benchmark is basically a series of `ab(1)` runs.
• Discourse is a field-proven, real-world Rails application. The benchmark shows how the proposed changeset behaves in the wild.
• On the other hand, it has the most lines of code of all our benchmarks.

[Figure: Discourse benchmark results [msec] (greater = slower) at the 50th, 75th, 90th and 99th percentiles, ours versus trunk, for categories, home, topic, categories_admin, home_admin, topic_admin]

[Figure: Discourse home [msec] (greater = slower) by percentile, trunk versus ours. Plotted values at the 50th/75th/90th/99th percentiles: 50, 56, 69, 115 and 51, 62, 69, 119]

[Figure: Discourse categories_admin [msec] (greater = slower) by percentile, trunk versus ours. Plotted values at the 50th/75th/90th/99th percentiles: 80, 87, 100, 157 and 83, 96, 102, 169]

[Figure: Discourse timing loading rails [msec] (greater = slower): trunk 3707, ours 3963]
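To put the single-number results in proportion, the relative differences can be worked out as below. The figures are the ones printed on the charts. Reading the first number of each pair as trunk and the second as ours is an assumption, though it agrees with the statements that rdoc got slower, optcarrot got faster, and boot-up time was sacrificed. `percent_change` is a helper made up for this sketch.

```ruby
# Relative change of `ours` against `trunk`, in percent, one decimal.
def percent_change(trunk, ours)
  ((ours / trunk - 1.0) * 100).round(1)
end

rdoc_change      = percent_change(23.13, 23.58)    # seconds, greater = slower
optcarrot_change = percent_change(42.554, 43.276)  # fps, greater = faster
boot_change      = percent_change(3707.0, 3963.0)  # msec, greater = slower

puts "make rdoc wall time: #{rdoc_change}%"
puts "optcarrot fps:       #{optcarrot_change}%"
puts "Rails load time:     #{boot_change}%"
```

Under that reading, the two mid-sized results move by about 2% in opposite directions, while loading Rails takes roughly 7% longer.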

Conclusions

Conclusions
• Additional method-calling ABIs are introduced to tell each method whether its return value is used or not. Unused return values are then optimised out from the VM’s value stack.
• Our proposal sacrifices process bootup time to yield better runtime performance.
• Not only small benchmarks, but also Rails applications can benefit from it.

Future works

(more) Aggressive compilation
• The automatic insertion of `opt_bailout` proposed in this presentation works, but we can think of more.
• For instance, let us consider:

    1.times {|i| x, y = self, i }

(more) Aggressive compilation
• We can think of better compilation so that the `newarray` instruction can be eliminated.
• That should decrease GC pressure.

Tail call flag propagation
• Think of a method foo, which calls another method bar:

    def bar
      something
      return something_else
    end

    def foo
      return bar
    end

    foo; nil

This must be optimisable.

Tail call flag propagation
• We have tried optimising this scenario, but it turned out to slow things down.
• The overhead added by asking “is this a tail call?” every time a method is called turned out to be too heavy.
• Some static analysis could be possible. That might avoid the runtime overhead.

Other future works
• The introduced C API could be applied to our core classes, to gain bonus speed-ups.
• Values of blocks called from C (for which we still cannot say whether they are used or not) should also be considered.
