The send-pop optimisation

Slide 1

Slide 1 text

The send-pop optimisation Urabe, Shyouhei Photo by Tomoyuki Kengaku

Slide 2

Slide 2 text

In a nutshell, this talk is about…
• The “send-pop” sequence we focus on in this talk is a pattern that appears very frequently in Ruby programs.
• We propose detecting such sequences automatically and letting the interpreter optimise them.
• This optimisation improves some benchmark results, including Rails.

Slide 3

Slide 3 text

Motivations

Slide 4

Slide 4 text

def foo
  something
  something_another
  return something_else
end

Slide 5

Slide 5 text

== disasm: #:1 (1,0)-(5,3)> (catch: FALSE)
0000 putself
0001 send

Slide 6

Slide 6 text

The “send-pop” sequence
• Calling a method, then immediately discarding its return value.
• Note that every method in Ruby has return value(s).
• The value(s) returned, however, do not have to be used.
• Even when a method does not expect its caller to take its return values, it has to return something “just in case” that expectation breaks.
• A waste of both time and memory.
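The pattern can be observed directly. The following is a minimal sketch and not part of the talk: it assumes a CRuby interpreter (where `RubyVM::InstructionSequence` is available), and the helper name `discarded_call_count` is made up for illustration. It compiles the running example and counts the `pop` instructions that discard a call's return value.

```ruby
# Assumes CRuby; RubyVM::InstructionSequence is implementation-specific.
SOURCE = <<~RUBY
  def foo
    something
    something_another
    return something_else
  end
RUBY

# Hypothetical helper: counts `pop` instructions in the disassembly.
# The source is only compiled, never executed.
def discarded_call_count(source)
  disasm = RubyVM::InstructionSequence.compile(source).disasm
  disasm.each_line.count { |line| line =~ /\A\d{4} pop\b/ }
end

puts RubyVM::InstructionSequence.compile(SOURCE).disasm
puts "pop instructions: #{discarded_call_count(SOURCE)}"
```

The exact disassembly differs between Ruby versions, so the listing printed here will not match the slides byte for byte.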

Slide 7

Slide 7 text

But how often? • By taking 2-grams of a mame/optcarrot execution…

Slide 8

Slide 8 text

% LANG=C sort 2gram.txt | uniq -c | sort -nr | head -n 10
69065813 getinstancevariable -> getinstancevariable
65600442 putself -> getinstancevariable
59624140 getinstancevariable -> branchunless
59116388 branchunless -> getinstancevariable
52828407 leave -> pop
50434175 getinstancevariable -> putobject
30368815 pop -> putself
27717161 setinstancevariable -> getinstancevariable
25661090 branchunless -> putself
25165032 getinstancevariable -> branchif

Slide 9

Slide 9 text

But how often?
• By taking 2-grams of a mame/optcarrot execution, the sequence in question (seen in the trace as `leave -> pop`, a return immediately followed by a discard) is the 5th most frequent.
• This is definitely worth consideration.
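For readers who want to reproduce this kind of count without the shell pipeline, here is a rough sketch. It is not the tooling used for the talk: `bigram_counts` is a hypothetical helper and `TRACE` is a made-up toy trace, not optcarrot data.

```ruby
# Hypothetical helper: counts adjacent instruction pairs in an execution trace.
def bigram_counts(trace)
  counts = Hash.new(0)
  trace.each_cons(2) { |a, b| counts["#{a} -> #{b}"] += 1 }
  counts.sort_by { |pair, n| [-n, pair] } # most frequent first, ties by name
end

# Made-up toy trace: a caller sends, the callee runs `putnil; leave`,
# and the caller then pops the unused result.
TRACE = %w[putself send putnil leave pop
           putself send putnil leave pop
           putnil leave].freeze

bigram_counts(TRACE).each { |pair, n| puts format("%8d %s", n, pair) }
```

Because the trace records executed instructions, a discarded call shows up as the callee's `leave` followed by the caller's `pop`, not as `send -> pop`.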

Slide 10

Slide 10 text

Relax them

Slide 11

Slide 11 text

First step: allow arbitrary return values
• We cannot entirely eliminate return values.
• In the wild, there already are methods written in C.
• They cannot be modified, and they already return something.
• The best we can do is to allow methods to return arbitrary values when those values are not used by their callers.
• Let each method decide what to return. We can auto-optimise pure-Ruby methods later.

Slide 12

Slide 12 text

Pass a 1-bit flag to each method
• Every time a method is called, some flags are already passed to it.
• Why not add another one that describes whether its return value(s) are used?

Slide 13

Slide 13 text

diff --git a/vm_core.h b/vm_core.h
index 574837dea0..513b8b85c1 100644
--- a/vm_core.h
+++ b/vm_core.h
@@ -1132,11 +1133,11 @@ typedef rb_control_frame_t *
 enum {
     /* Frame/Environment flag bits:
-     *   MMMM MMMM MMMM MMMM ____ __FF FFFF EEEX (LSB)
+     *   MMMM MMMM MMMM MMMM ____ _FFF FFFF EEEX (LSB)
      *
      * X   : tag for GC marking (It seems as Fixnum)
      * EEE : 3 bits Env flags
-     * FF..: 6 bits Frame flags
+     * FF..: 7 bits Frame flags
      * MM..: 15 bits frame magic (to check frame corruption)
      */
@@ -1160,6 +1161,7 @@ enum {
     VM_FRAME_FLAG_CFRAME    = 0x0080,
     VM_FRAME_FLAG_LAMBDA    = 0x0100,
     VM_FRAME_FLAG_MODIFIED_BLOCK_PARAM = 0x0200,
+    VM_FRAME_FLAG_POPPED    = 0x0400,
     /* env flag */
     VM_ENV_FLAG_LOCAL       = 0x0002,
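Why the comment changes from 6 to 7 frame-flag bits can be checked with a little arithmetic. The sketch below is illustrative only: `VM_FRAME_FLAG_POPPED` is copied from the diff, while `ENV_BITS` and `frame_flag_mask` are names invented here, derived from the layout comment (one `X` bit plus three `E` bits sit below the frame flags).

```ruby
VM_FRAME_FLAG_POPPED = 0x0400 # value copied from the diff
ENV_BITS = 4                  # X (1 bit) + EEE (3 bits), per the layout comment

# Hypothetical helper: mask covering `nbits` frame-flag bits above the env bits.
def frame_flag_mask(nbits)
  ((1 << nbits) - 1) << ENV_BITS
end

printf("6-bit mask: 0x%04x\n", frame_flag_mask(6))
printf("7-bit mask: 0x%04x\n", frame_flag_mask(7))
puts "covered by 6 bits? #{(VM_FRAME_FLAG_POPPED & frame_flag_mask(6)) != 0}"
puts "covered by 7 bits? #{(VM_FRAME_FLAG_POPPED & frame_flag_mask(7)) != 0}"
```

The new flag occupies the first bit above the old 6-bit range, which is why the frame-flag field has to grow by one bit.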

Slide 14

Slide 14 text

diff --git a/vm_insnhelper.c b/vm_insnhelper.c
index a2f7433029..b024b29fc6 100644
--- a/vm_insnhelper.c
+++ b/vm_insnhelper.c
@@ -1767,12 +1767,13 @@ static inline VALUE
 vm_call_iseq_setup_normal(rb_execution_context_t *ec, rb_control_frame_t *cfp, struct rb_calling_in
                           int opt_pc, int param_size, int local_size)
 {
+    int popped = calling->popped;
     const rb_iseq_t *iseq = def_iseq_ptr(me->def);
     VALUE *argv = cfp->sp - calling->argc;
     VALUE *sp = argv + param_size;
     cfp->sp = argv - 1 /* recv */;
-    vm_push_frame(ec, iseq, VM_FRAME_MAGIC_METHOD | VM_ENV_FLAG_LOCAL, calling->recv,
+    vm_push_frame(ec, iseq, VM_FRAME_MAGIC_METHOD | VM_ENV_FLAG_LOCAL | popped, calling->recv,
                   calling->block_handler, (VALUE)me,
                   iseq->body->iseq_encoded + opt_pc, sp,
                   local_size - param_size,
@@ -1791,6 +1792,7 @@ vm_call_iseq_setup_tailcall(rb_execution_context_t *ec, rb_control_frame_t *cf
     VALUE *src_argv = argv;
     VALUE *sp_orig, *sp;
     VALUE finish_flag = VM_FRAME_FINISHED_P(cfp) ? VM_FRAME_FLAG_FINISH : 0;
+    unsigned long popped = VM_ENV_FLAGS(cfp->ep, VM_FRAME_FLAG_POPPED);
     if (VM_BH_FROM_CFP_P(calling->block_handler, cfp)) {
         struct rb_captured_block *dst_captured = VM_CFP_TO_CAPTURED_BLOCK(RUBY_VM_PREVIOUS_CONTROL_
@@ -1818,7 +1820,7 @@ vm_call_iseq_setup_tailcall(rb_execution_context_t *ec, rb_control_frame_t *cf
         *sp++ = src_argv[i];

Slide 15

Slide 15 text

Use the flag

Slide 16

Slide 16 text

Let pure-Ruby methods check that flag
• We can make pure-Ruby methods check that flag automatically, so that they can skip their rearmost instructions.
• For instance, when we have:
def foo(x)
  y = bar(x)
  return y
end

Slide 17

Slide 17 text

== disasm: #:1 (1,2)-(4,5)> (catch: FALSE)
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block:
[ 2] x@0 [ 1] y@1
0000 putself
0001 getlocal x@0, 0
0004 send

Slide 18

Slide 18 text

== disasm: #:1 (1,2)-(4,5)> (catch: FALSE)
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block:
[ 2] x@0 [ 1] y@1
0000 putself
0001 getlocal x@0, 0
0004 send

Slide 19

Slide 19 text

+/* This instruction is no-op unless the instruction sequence is called
+ * with VM_FRAME_FLAG_POPPED.  With that flag on, it immediately
+ * leaves the current stack frame with scratching the topmost n stack
+ * values.  The return value of the iseq for that case is always
+ * nil. */
+DEFINE_INSN
+opt_bailout
+(rb_num_t n)
+()
+()
+{
+#ifdef MJIT_HEADER
+    /* :FIXME: don't know how to make it work with JIT... */
+#else
+    if (VM_ENV_FLAGS(GET_EP(), VM_FRAME_FLAG_POPPED) &&
+        CURRENT_INSN_IS(opt_bailout) /* <- rule out trace instruction */ ) {
+        POPN(n);
+        PUSH(Qnil);
+        DISPATCH_ORIGINAL_INSN(leave);
+    }
+#endif
+}
+
 /**********************************************************/
 /* deal with control flow 3: exception                    */
 /**********************************************************/
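To make the behaviour of `opt_bailout` concrete, here is a toy model in Ruby. It is only a sketch: the tiny stack machine `run`, its instruction encoding, and the constants `CALLS` and `FOO` are all invented for illustration and have nothing to do with CRuby's real VM structures. The `popped:` argument plays the role of `VM_FRAME_FLAG_POPPED`.

```ruby
# Toy model only: a tiny stack machine, not CRuby's real VM.
# `popped` plays the role of VM_FRAME_FLAG_POPPED on the callee's frame.
def run(insns, popped:)
  stack = []
  locals = {}
  insns.each do |op, arg|
    case op
    when :call     then stack.push(arg.call)   # stands in for `send`
    when :setlocal then locals[arg] = stack.pop
    when :getlocal then stack.push(locals[arg])
    when :opt_bailout
      next unless popped                       # no-op when the value is used
      stack.pop(arg)                           # scratch the topmost n values
      return nil                               # leave immediately with nil
    when :leave    then return stack.pop
    end
  end
end

CALLS = []
# def foo(x); y = bar(x); return y; end, with a bailout before the pure tail.
FOO = [
  [:call, -> { CALLS << :bar; 42 }], # bar(x): may have side effects, must run
  [:opt_bailout, 1],
  [:setlocal, :y],
  [:getlocal, :y],
  [:leave]
].freeze

p run(FOO, popped: false)
p run(FOO, popped: true)
p CALLS
```

In both runs `bar` is still called; only the local-variable shuffle after it is skipped when the caller discards the result.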

Slide 20

Slide 20 text

Automatic insertion of it

Slide 21

Slide 21 text

Make the insertion automatic
• What operations are safe to skip when a return value is not used?
• Obviously not everything is.
• That concept should be identical to what we call “pure” operations, proposed at RubyKaigi 2016.

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Recap

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Automatic bail out of a method
• Instead of thinking of a method as being entirely pure or not, we are going to focus on the rearmost part of each method that is pure.
• Such a part, if any, is pointless to execute when the return value of the method is discarded.
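One way to picture the search for that rearmost pure part is the sketch below. It is a simplification made up for this write-up, not the talk's compiler pass: instructions are reduced to bare opcode symbols, control flow is ignored, and the `PURE_INSNS` list is an assumption about which operations count as pure.

```ruby
# Assumption for illustration: which toy instructions count as pure.
PURE_INSNS = %i[putself putnil putobject getlocal setlocal dup pop leave].freeze

# Hypothetical helper: index at which the pure tail of a method starts,
# i.e. where a bailout instruction could be inserted.
def pure_tail_start(insns)
  last_impure = insns.rindex { |op| !PURE_INSNS.include?(op) }
  last_impure ? last_impure + 1 : 0
end

# def foo(x); y = bar(x); return y; end, reduced to opcodes.
INSNS = %i[putself getlocal send setlocal getlocal leave].freeze

start = pure_tail_start(INSNS)
puts "insert opt_bailout at index #{start}, before #{INSNS.drop(start).inspect}"
```

Everything after the last `send` can be skipped here, because a method call is the only operation in this example that may have side effects.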

Slide 28

Slide 28 text

== disasm: #:1 (1,2)-(4,5)> (catch: FALSE)
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block:
[ 2] x@0 [ 1] y@1
0000 putself
0001 getlocal x@0, 0
0004 send

Slide 29

Slide 29 text

== disasm: #:1 (1,2)-(4,5)> (catch: FALSE)
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block:
[ 2] x@0 [ 1] y@1
0000 putself
0001 getlocal x@0, 0
0004 send

Slide 30

Slide 30 text

C API (nit-picky)

Slide 31

Slide 31 text

Can we also optimize C methods?
• We cannot auto-skip a part of a C method.
• But the `VM_FRAME_FLAG_POPPED` flag is always set, no matter whether the called method is written in Ruby or not.
• Why not make it visible from C, so that future methods can look at it?

Slide 32

Slide 32 text

diff --git a/vm.c b/vm.c
index c5beed64c0..d33ff98619 100644
--- a/vm.c
+++ b/vm.c
@@ -3544,4 +3544,14 @@ vm_collect_usage_register(int reg, int isset)
 #endif /* #ifndef MJIT_HEADER */
+int
+rb_whether_the_return_value_is_used_p(void)
+{
+    const struct rb_execution_context_struct *ec = GET_EC();
+    const struct rb_control_frame_struct *reg_cfp = ec->cfp;
+    const VALUE *ep = GET_EP();
+
+    return ! VM_ENV_FLAGS(ep, VM_FRAME_FLAG_POPPED);
+}
+
 #include "vm_call_iseq_optimized.inc" /* required from vm_insnhelper.c */

Slide 33

Slide 33 text

Practical applications
• `StringScanner#scan` scans the receiver, advances its internal pointer, then returns the matched string. Creating the “matched string” can be skipped by leveraging the flag.
• The exact same discussion applies to `String#slice!`.
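The call sites in question look like the ones below. This sketch only shows today's Ruby-level usage with the standard `strscan` library; the helper names `read_word` and `drop_prefix!` are invented here, and nothing in it depends on the proposed flag.

```ruby
require "strscan"

# Hypothetical helper: skips leading whitespace, then reads one word.
def read_word(input)
  scanner = StringScanner.new(input)
  scanner.scan(/\s+/) # result discarded: only the pointer advance matters
  scanner.scan(/\w+/) # result used
end

# Hypothetical helper: removes the first `length` characters in place.
def drop_prefix!(line, length)
  line.slice!(0, length) # result discarded: only the mutation matters
  line
end

p read_word("  hello")
p drop_prefix!(+"key=value", 4)
```

In both helpers the first call allocates a string that nobody looks at, which is the allocation a C implementation could skip once it can see that its return value is unused.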

Slide 34

Slide 34 text

Eliminating `pop`s

Slide 35

Slide 35 text

Recap:
def foo
  something
  something_another
  return something_else
end

Slide 36

Slide 36 text

== disasm: #:1 (1,0)-(5,3)> (catch: FALSE)
0000 putself
0001 send

Slide 37

Slide 37 text

== disasm: #:1 (1,0)-(5,3)> (catch: FALSE)
0000 putself
0001 send

Slide 38

Slide 38 text

Note however, that:
• The elimination is not always possible.
• That `pop` can be a jump destination.
• For an (illustrative) example:
def foo
  self&.x
  nil
end
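The restriction can be expressed as a small check over an instruction list. The following is a toy sketch, not the talk's implementation: the `[opcode, operand]` encoding, the use of array indexes as branch targets, and the helper `removable_pops` are all assumptions made for illustration.

```ruby
# Toy encoding (an assumption): each instruction is [opcode, operand],
# and branch operands are indexes into the same array.
def removable_pops(insns)
  targets = insns.select { |op, _| op.to_s.start_with?("branch") || op == :jump }
                 .map { |_, target| target }
  insns.each_index.select do |i|
    i.positive? &&
      insns[i][0] == :pop &&
      insns[i - 1][0] == :send &&
      !targets.include?(i) # a pop that is jumped to must stay
  end
end

# something; nil
STRAIGHT = [[:putself], [:send, :something], [:pop], [:putnil], [:leave]].freeze
# self&.x; nil  (branchnil jumps straight to the pop)
SAFE_NAV = [[:putself], [:dup], [:branchnil, 4], [:send, :x], [:pop],
            [:putnil], [:leave]].freeze

p removable_pops(STRAIGHT)
p removable_pops(SAFE_NAV)
```

In the safe-navigation case the `pop` also has to discard the duplicated receiver when the branch is taken, so it cannot simply be folded into the preceding `send`.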

Slide 39

Slide 39 text

== disasm: #:1 (1,0)-(4,3)> (catch: FALSE)
0000 putself
0001 dup
0002 branchnil 7
0004 opt_send_without_block

Slide 40

Slide 40 text

Let us add another frame flag
• Called `VM_FRAME_FLAG_POPIT`.
• This flag denotes that the `pop` instruction in the caller was optimised out of the sequence.
• Hence when the flag is set, it is the callee’s duty to properly skip pushing return value(s), not its caller’s.
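The shift of responsibility can be modelled in a few lines. This is a toy sketch only: the flag value is the one given in the talk's `vm_core.h` diff, while `return_to_caller` and the array standing in for the caller's value stack are invented for illustration.

```ruby
VM_FRAME_FLAG_POPIT = 0x0800 # value from the talk's vm_core.h diff

# Toy model: what the callee does with its return value on the way out.
# `caller_stack` stands in for the caller's value stack.
def return_to_caller(caller_stack, frame_flags, value)
  caller_stack.push(value) if (frame_flags & VM_FRAME_FLAG_POPIT).zero?
  caller_stack
end

p return_to_caller([], 0, :result)                   # caller still has its pop
p return_to_caller([], VM_FRAME_FLAG_POPIT, :result) # caller's pop was removed
```

With the flag set, nothing is pushed, so the caller's stack is already in the state that the removed `pop` would have produced.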

Slide 41

Slide 41 text

diff --git a/vm_core.h b/vm_core.h
index 0b3f3e06ba..932c70a734 100644
--- a/vm_core.h
+++ b/vm_core.h
@@ -1134,11 +1136,11 @@ typedef rb_control_frame_t *
 enum {
     /* Frame/Environment flag bits:
-     *   MMMM MMMM MMMM MMMM ____ _FFF FFFF EEEX (LSB)
+     *   MMMM MMMM MMMM MMMM ____ FFFF FFFF EEEX (LSB)
      *
      * X   : tag for GC marking (It seems as Fixnum)
      * EEE : 3 bits Env flags
-     * FF..: 7 bits Frame flags
+     * FF..: 8 bits Frame flags
      * MM..: 15 bits frame magic (to check frame corruption)
      */
@@ -1163,6 +1165,7 @@ enum {
     VM_FRAME_FLAG_LAMBDA    = 0x0100,
     VM_FRAME_FLAG_MODIFIED_BLOCK_PARAM = 0x0200,
     VM_FRAME_FLAG_POPPED    = 0x0400,
+    VM_FRAME_FLAG_POPIT     = 0x0800,
     /* env flag */
     VM_ENV_FLAG_LOCAL       = 0x0002,

Slide 42

Slide 42 text

== disasm: #:1 (1,0)-(5,3)> (catch: FALSE)
0000 putself
0001 send

Slide 43

Slide 43 text

== disasm: #:1 (1,0)-(5,3)> (catch: FALSE)
0000 putself
0001 send

Slide 44

Slide 44 text

== disasm: #:1 (1,0)-(5,3)> (catch: FALSE)
0000 putself ( 2)[LiCa]
0001 send , , nil
0005 putself ( 3)[Li]
0006 send , , nil
0010 putself ( 4)[Li]
0011 send , , nil
0015 leave ( 5)[Re]

Slide 45

Slide 45 text

Avoid pushing return values

Slide 46

Slide 46 text

In order to properly avoid pushing…
• We have to consider 3 (!) distinct situations.
• Returning from a method written in C.
• Returning from a method written in Ruby.
• Returning from inside of a block.

Slide 47

Slide 47 text

C method return values
• C methods return values using C’s return semantics. Just discarding them should suffice.
VALUE
foo(VALUE x)
{
    VALUE y = complex_calculation(x);
    return y;
}

Slide 48

Slide 48 text

diff --git a/tool/ruby_vm/views/_insn_entry.erb b/tool/ruby_vm/views/_insn_entry.erb
index cdadd93abc..bbfe539fd2 100644
--- a/tool/ruby_vm/views/_insn_entry.erb
+++ b/tool/ruby_vm/views/_insn_entry.erb
@@ -56,7 +58,18 @@ INSN_ENTRY(<%= insn.name %>)
     /* ### Instruction trailers. ### */
     CHECK_VM_STACK_OVERFLOW_FOR_INSN(VM_REG_CFP, INSN_ATTR(retn));
 <%= insn.handle_canary "CHECK_CANARY()" -%>
-% if insn.handles_sp?
+% if insn.sendish? # Then we can safely assume there is only one return value.
+%   if insn.handles_sp?
+    if (! (ci->compiled_frame_bits & VM_FRAME_FLAG_POPIT)) {
+        PUSH(<%= insn.cast_to_VALUE insn.rets.first %>);
+    }
+%   else
+    INC_SP(INSN_ATTR(sp_inc));
+    if (! (ci->compiled_frame_bits & VM_FRAME_FLAG_POPIT)) {
+        TOPN(0) = <%= insn.cast_to_VALUE insn.rets.first %>;
+    }
+%   end
+% elsif insn.handles_sp?
 %   insn.rets.reverse_each do |ret|
     PUSH(<%= insn.cast_to_VALUE ret %>);
 %   end

Slide 49

Slide 49 text

Ruby method return values
• Ruby methods (normally) return values using the `leave` instruction.
def foo(x)
  return x + 1
end

Slide 50

Slide 50 text

== disasm: #:1 (1,0)-(3,3)> (catch: FALSE)
local table (size: 1, argc: 1 [opts: 0, rest: -1, post: 0, block:
[ 1] x@0
0000 getlocal x@0, 0
0003 putobject 1
0005 send

Slide 51

Slide 51 text

diff --git a/insns.def b/insns.def
index a38dc30168..68e7eabfae 100644
--- a/insns.def
+++ b/insns.def
@@ -927,7 +911,7 @@ DEFINE_INSN
 leave
 ()
 (VALUE val)
-(VALUE val)
+(...)
 /* This is super surprising but when leaving from a frame, we check
  * for interrupts.  If any, that should be executed on top of the
  * current execution context.  This is a method call. */
@@ -939,7 +923,10 @@ leave
 // attr enum rb_insn_purity purity = rb_insn_is_pure;
 /* And this instruction handles SP by nature. */
 // attr bool handles_sp = true;
+// attr rb_snum_t sp_inc = 0;
 {
+    bool popit = VM_ENV_FLAGS(GET_EP(), VM_FRAME_FLAG_POPIT);
+
     if (OPT_CHECKED_RUN) {
         const VALUE *const bp = vm_base_ptr(reg_cfp);
         if (reg_cfp->sp != bp) {
@@ -959,6 +946,9 @@ leave
     }
     else {
         RESTORE_REGS();
+        if (! popit) {
+            PUSH(val);
+        }
     }
 }

Slide 52

Slide 52 text

So far so good… but,
• It immediately gets complicated when a block has a return statement.
def foo(x)
  x.times do |i|
    return i
  end
end
p foo(42) # => 0

Slide 53

Slide 53 text

So far so good… but,
• It immediately gets complicated when a block has a return statement.
def foo(x)
  x.times &-> (i) do
    return i
  end
end
p foo(42) # => 42
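The two variants can be run side by side. This is just the two slide examples combined into one script; the method names `foo_block` and `foo_lambda` are chosen here to tell them apart.

```ruby
def foo_block(x)
  x.times do |i|
    return i # proc semantics: returns from foo_block itself
  end
end

def foo_lambda(x)
  x.times(&->(i) { return i }) # lambda semantics: returns from the lambda only
end

p foo_block(42)  # the first iteration returns 0 from the method
p foo_lambda(42) # every iteration returns to `times`, which then returns 42
```

Only the first form is a non-local return: it has to unwind the `times` frame and then deliver a value to the caller of `foo_block`, which is where the flag handling gets complicated.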

Slide 54

Slide 54 text

What “return-inside-of-a-block” does:
1. Look for the exact place where execution should proceed.
2. Rewind the stack.
3. Push the return value onto the stack.
4. Continue executing.
Step 3 has to be cancelled when the return value is unused. However, by then the flag has already been squashed: the frame that carried it was popped while rewinding the stack.

Slide 55

Slide 55 text

diff --git a/vm.c b/vm.c
index 807a20ee5a..057863e5e3 100644
--- a/vm.c
+++ b/vm.c
@@ -1926,6 +1926,7 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
                          VALUE errinfo, VALUE *initial)
 {
     struct vm_throw_data *err = (struct vm_throw_data *)errinfo;
+    bool popit = false;
     for (;;) {
         unsigned int i;
@@ -1950,6 +1951,7 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
                                                 rb_vm_frame_method_entry(ec->cfp)->owner,
                                                 rb_vm_frame_method_entry(ec->cfp)->def->original_id);
                 }
+                popit = VM_ENV_FLAGS(ec->cfp->ep, VM_FRAME_FLAG_POPIT);
                 rb_vm_pop_frame(ec);
             }
@@ -1983,6 +1985,7 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
                     ec->errinfo = Qnil;
                     THROW_DATA_CATCH_FRAME_SET(err, cfp + 1);
                     hook_before_rewind(ec, ec->cfp, TRUE, state, err);
+                    popit = VM_ENV_FLAGS(ec->cfp->ep, VM_FRAME_FLAG_POPIT);
                     rb_vm_pop_frame(ec);
                     return THROW_DATA_VAL(err);
                 }
@@ -1994,7 +1997,9 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
 #if OPT_STACK_CACHING
                 *initial = THROW_DATA_VAL(err);
 #else
-                *ec->cfp->sp++ = THROW_DATA_VAL(err);
+                if (! popit) {
+                    *ec->cfp->sp++ = THROW_DATA_VAL(err);
+                }
 #endif
                 ec->errinfo = Qnil;
                 return Qundef;
@@ -2128,12 +2133,14 @@ vm_exec_handle_exception(rb_execution_context_t *ec, enum ruby_tag_type state,
             hook_before_rewind(ec, ec->cfp, FALSE, state, err);
             if (VM_FRAME_FINISHED_P(ec->cfp)) {
+                popit = VM_ENV_FLAGS(ec->cfp->ep, VM_FRAME_FLAG_POPIT);
                 rb_vm_pop_frame(ec);
                 ec->errinfo = (VALUE)err;
                 ec->tag = ec->tag->prev;
                 EC_JUMP_TAG(ec, state);
             }
             else {
+                popit = VM_ENV_FLAGS(ec->cfp->ep, VM_FRAME_FLAG_POPIT);
                 rb_vm_pop_frame(ec);
             }
         }
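The idea behind the patch, reading the flag just before each frame is popped so that it survives the unwinding, can be modelled as follows. This is a toy sketch under stated assumptions: `Frame`, `unwind_to` and `sample_frames` are invented for illustration, frames are plain structs, and the target frame is assumed to exist on the stack.

```ruby
# Toy frame: a name, its POPIT flag, and its own value stack.
Frame = Struct.new(:name, :popit, :stack)

# Pops frames up to and including `target`, then delivers `value`
# to the frame below, unless the last popped frame carried POPIT.
def unwind_to(frames, target, value)
  popit = false
  loop do
    frame = frames.pop
    popit = frame.popit # read the flag before the frame is gone
    break if frame.name == target
  end
  frames.last.stack.push(value) unless popit
  frames
end

# Hypothetical call chain for `foo(42)` with a `return` inside the block.
def sample_frames(popit)
  [
    Frame.new(:main, false, []),
    Frame.new(:foo, popit, []),
    Frame.new(:times, false, []),
    Frame.new(:block, false, [])
  ]
end

p unwind_to(sample_frames(false), :foo, 0).last.stack
p unwind_to(sample_frames(true), :foo, 0).last.stack
```

Without the saved copy, the flag of the `foo` frame would no longer be reachable at the point where the value is about to be pushed.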

Slide 56

Slide 56 text

Benchmarks

Slide 57

Slide 57 text

Several benchmarks were exercised.
• Caution: YMMV
• All benchmarks were run on the exact machine I am projecting this presentation from: a 6th gen. ThinkPad X1 Carbon.
• They all compare trunk (2.7.0 revision 67168) against ours (the proposed patch applied to trunk).

Slide 58

Slide 58 text

The `make benchmark` results
• This set of benchmarks is considered micro: it consists of many small Ruby scripts. They tend to shed some light on specific parts of the VM.

Slide 59

Slide 59 text

faster

Slide 60

Slide 60 text

[Figure: Speedup ratio versus trunk (greater = faster) for so_ackermann, so_array, so_binary_trees, so_concatenate, so_exception, so_fannkuch, so_fasta, so_lists, so_mandelbrot, so_matrix, so_meteor_contest, so_nbody, so_nested_loop, so_nsieve, so_nsieve_bits, so_object, so_partial_sums, so_pidigits, so_random, so_sieve, so_spectralnorm]

Slide 61

Slide 61 text

[Figure: Speedup ratio versus trunk (greater = faster) for vm1_attr_ivar_set, vm1_gc_wb_obj, vm2_method, vm2_method_with_block, vm2_send, vm2_struct_small_aref]

Slide 62

Slide 62 text

faster

Slide 63

Slide 63 text

The `make benchmark` results
• The majority of the results are almost the same. Whether slower or faster, they differ only slightly.
• There are a few notable benchmark instances where our proposal clearly outperforms trunk.
• On the other hand, no instance seems to show a clear slowdown for our proposal.
• This tendency is roughly the same as what we saw in 2016.

Slide 64

Slide 64 text

Mid-sized benchmarks
• We tested `time make rdoc`, which has historically been considered a benchmark that reflects real-world use cases.
• We also tested mame/optcarrot, which was made for benchmarking various Ruby implementations.

Slide 65

Slide 65 text

[Figure: `time make rdoc` [sec] (greater = slower): trunk 23.13, ours 23.58]

Slide 66

Slide 66 text

[Figure: Optcarrot Lan_Master.nes [fps] (greater = faster): trunk 42.554, ours 43.276]

Slide 67

Slide 67 text

Mid-sized benchmarks
• We have to say they are almost the same.
• Rdoc got slower, optcarrot got faster. We see these results consistently.
• There might be reasons behind them but, … well, isn’t it enough to say that we see no significant changes?
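To put “almost the same” into numbers, the relative change can be computed from the figures on the two preceding slides. The helper `percent_change` is made up for this note, and the trunk/ours assignment of the values follows the statement that rdoc got slower and optcarrot got faster.

```ruby
# Hypothetical helper: relative change of `ours` against `trunk`, in percent.
def percent_change(trunk, ours)
  (ours - trunk) / trunk.to_f * 100
end

puts format("make rdoc: %+.1f%% wall time", percent_change(23.13, 23.58))
puts format("optcarrot: %+.1f%% fps", percent_change(42.554, 43.276))
```

Both differences come out at under two percent, in opposite directions.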

Slide 68

Slide 68 text

Rails application
• discourse/discourse comes with a benchmark script, so we tested our changeset against it.
• The benchmark is basically a series of `ab(1)` runs.
• Discourse is a field-proven real-world Rails application. The benchmark shows how the proposed changeset behaves in the wild.
• On the other hand, it has the most lines of code of all the benchmarks.

Slide 69

Slide 69 text

[Figure: Discourse benchmark results [msec] (greater = slower) at percentiles 50, 75, 90 and 99, comparing ours and trunk for categories, home, topic, categories_admin, home_admin and topic_admin]

Slide 70

Slide 70 text

[Figure: Discourse home [msec] (greater = slower) by percentile (50, 75, 90, 99), trunk versus ours. Data labels as extracted: 50, 56, 69, 115 and 51, 62, 69, 119]

Slide 71

Slide 71 text

[Figure: Discourse categories_admin [msec] (greater = slower) by percentile (50, 75, 90, 99), trunk versus ours. Data labels as extracted: 80, 87, 100, 157 and 83, 96, 102, 169]

Slide 72

Slide 72 text

[Figure: Discourse timing, loading rails [msec] (greater = slower): trunk 3707, ours 3963]

Slide 73

Slide 73 text

Conclusions

Slide 74

Slide 74 text

Conclusions
• Additional method-calling ABIs are introduced to tell each method whether its return value is used or not. Unused return values are then optimised out of the VM’s value stack.
• Our proposal sacrifices process bootup time to yield better runtime performance.
• Not only small benchmarks, but also Rails applications can benefit from it.

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

Future works

Slide 77

Slide 77 text

(more) Aggressive compilation
• The automatic insertion of `opt_bailout` proposed in this presentation works, but we can think of more.
• For instance, let us consider:
1.times {|i| x, y = self, i }

Slide 78

Slide 78 text

Slide 79

Slide 79 text

| | catch type: next st: 0001 ed: 0014 sp: 0000 cont: 0014
| |---------------------------------------------------------------
| local table (size: 3, argc: 1 [opts: 0, rest: -1, post: 0, block
| [ 3] i@0
| [ 2] x@1
| [ 1] y@2
| 0000 nop
| 0001 putself [Li]
| 0002 getlocal_WC_0 i@0
| 0004 newarray 2
| 0006 dup
| 0007 expandarray 2, 0
| 0010 setlocal_WC_0 x@1
| 0012 setlocal_WC_0 y@2
| 0014 nop
| 0015 leave
|-----------------------------------------------------------------
0000 putobject_INT2FIX_1_
0001 send ,

Slide 80

Slide 80 text

Slide 81

Slide 81 text

(more) Aggressive compilation
• We can think of better compilation so that the `newarray` instruction can be eliminated.
• That should reduce GC pressure.
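Whether a given snippet compiles to a `newarray` can be inspected on CRuby. The sketch below is not from the talk: it assumes `RubyVM::InstructionSequence` is available, the helper `allocates_array?` is a made-up name, and the result may vary between Ruby versions. It compares the block from the slide with a variant whose assignment value is explicitly discarded.

```ruby
# Assumes CRuby; the source is only compiled, never executed.
def allocates_array?(source)
  RubyVM::InstructionSequence.compile(source).disasm.include?("newarray")
end

VALUE_USED      = "1.times {|i| x, y = self, i }"      # assignment is the block's value
VALUE_DISCARDED = "1.times {|i| x, y = self, i; nil }" # assignment value is dropped

puts "value used:      newarray? #{allocates_array?(VALUE_USED)}"
puts "value discarded: newarray? #{allocates_array?(VALUE_DISCARDED)}"
```

The point of the slide is that the first form could be compiled like the second whenever the block's own value turns out to be unused.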

Slide 82

Slide 82 text

Tail call flag propagation
• Think of a method foo, which calls another method bar:
def bar
  something
  return something_else
end
def foo
  return bar
end
foo; nil
This must be optimisable

Slide 83

Slide 83 text

Tail call flag propagation
• We tried optimising this scenario, but it turned out to slow things down.
• The overhead of asking “is this a tail call?” every time a method is called turned out to be too heavy.
• Some static analysis could be possible. That might avoid the runtime overhead.

Slide 84

Slide 84 text

Other future works
• The introduced C API could be applied to our core classes, to gain bonus speed-ups.
• Values of blocks called from C (for which we still cannot tell whether they are used or not) should also be considered.

Slide 85

Slide 85 text

No content