The Method JIT Compiler

RubyKaigi 2018
http://rubykaigi.org/2018

Takashi Kokubun

June 02, 2018

Transcript

  1. The Method JIT Compiler for Ruby 2.6
    Takashi Kokubun / @k0kubun (Treasure Data)
    RubyKaigi 2018
  2. @k0kubun (Treasure Data)
    Maintainer of ERB, Haml
    Developing JIT compiler for Ruby 2.6
  3. None
  4. My past talks about Ruby's JIT
    • 2017 Sep: LLVM JIT (EN)
    • 2017 Nov: YARV MJIT (EN)
    • 2017 Dec: YARV MJIT (JA)
    • 2018 Feb: ERB generation (JA)
    • 2018 Apr: Preview2 optimizations (EN)
    https://speakerdeck.com/k0kubun
  5. Today's talk
    1. Current Status
    2. JIT on Rails
    3. Dive Into Native Code
    4. Method Inlining
  6. 1. CURRENT STATUS

  7. Current status
    • JIT in 2.6.0-preview2 is not production ready yet
    • Fixing bugs caused by a race condition
    • I'll introduce the current status of:
      • Implementation
      • Portability
      • Performance
  8. IMPLEMENTATION

  9. MJIT: Ruby's JIT architecture [diagram: a Ruby process with a Ruby VM thread and an MJIT worker thread; memory and disk]
    Ruby VM thread: interprets the bytecode of Method #1
  10. MJIT: Ruby's JIT architecture [diagram, continued]
    Ruby VM thread: interprets the bytecode of Method #1 and requests JIT-ing #1
  11. MJIT: Ruby's JIT architecture [diagram, continued]
    MJIT worker thread: generates C code for Method #1
  12. MJIT: Ruby's JIT architecture [diagram, continued]
    MJIT worker thread: runs the C compiler on the C code to produce an SO file for Method #1
  13. MJIT: Ruby's JIT architecture [diagram, continued]
    MJIT worker thread: loads the SO file as native code for Method #1
  14. MJIT: Ruby's JIT architecture [diagram, continued]
    Ruby VM thread: calls the native code of Method #1
  15. How is this implemented? [same diagram as slide 14]
    Bytecode → (Generate) C code → (Run C compiler) SO file → (Load) native code → (Call) from the Ruby VM thread
  16. How is this implemented?
    Ruby code: def three; 1 + 2; end
    Bytecode: putobject 1; putobject 2; opt_plus; leave
  17. How is this implemented? (Ruby code and bytecode as on slide 16)
    C code: three() { VALUE stack[2]; /* putobject 1 */ stack[0] = 1; }
  18. How is this implemented? (Ruby code and bytecode as on slide 16)
    C code: three() { VALUE stack[2]; /* putobject 1 */ stack[0] = 1; /* putobject 2 */ stack[1] = 2; }
  19. How is this implemented? (Ruby code and bytecode as on slide 16)
    C code: three() { VALUE stack[2]; /* putobject 1 */ stack[0] = 1; /* putobject 2 */ stack[1] = 2; /* opt_plus */ stack[0] = opt_plus(stack[0], stack[1]); }
  20. How is this implemented? (Ruby code and bytecode as on slide 16)
    C code:
        three() {
          VALUE stack[2];
          /* putobject 1 */ stack[0] = 1;
          /* putobject 2 */ stack[1] = 2;
          /* opt_plus */ stack[0] = opt_plus(stack[0], stack[1]);
          /* leave */ return stack[0];
        }
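
The translation on slides 16 to 20 can be sketched as a toy generator. This is only an illustration under simplifying assumptions, not MJIT's actual generator: `compile_to_c` is a hypothetical helper, it handles just the three instructions shown on the slide, and the emitted C assumes an `opt_plus` function is defined elsewhere.

```ruby
# Toy translator from a tiny instruction list to C source text.
# Hypothetical helper for illustration; instruction names mirror the slide.
def compile_to_c(name, insns)
  sp = 0        # stack depth tracked at compile time, so no runtime stack pointer is needed
  max_depth = 0
  body = []
  insns.each do |op, arg|
    case op
    when :putobject
      body << "  /* putobject #{arg} */ stack[#{sp}] = #{arg};"
      sp += 1
    when :opt_plus
      # Pops two operands and pushes the result into the lower slot.
      body << "  /* opt_plus */ stack[#{sp - 2}] = opt_plus(stack[#{sp - 2}], stack[#{sp - 1}]);"
      sp -= 1
    when :leave
      body << "  /* leave */ return stack[#{sp - 1}];"
    else
      raise ArgumentError, "unsupported instruction: #{op}"
    end
    max_depth = [max_depth, sp].max
  end
  ["VALUE #{name}() {", "  VALUE stack[#{max_depth}];", *body, "}"].join("\n")
end

puts compile_to_c("three", [[:putobject, 1], [:putobject, 2], [:opt_plus], [:leave]])
```

Because the stack depth is known while generating code, every stack slot becomes a fixed array index, which is what lets the C compiler treat the slots like ordinary local variables.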
  21. C code generator
    ERB template: mjit_compile.inc.erb
  22. C code generator
    ERB template: mjit_compile.inc.erb
    VM instructions: insns.def
  23. C code generator
    VM instructions: insns.def → (Copy definition) → ERB template: mjit_compile.inc.erb → (Render) → C code generator: mjit_compile.inc
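
The "copy definition, then render" flow can be sketched with Ruby's standard ERB library. This is a minimal sketch, not the real mjit_compile.inc.erb: the instruction table, the `INSN_` case labels, and `render_generator` are hypothetical stand-ins for insns.def and the real template.

```ruby
require 'erb'

# Hypothetical stand-in for insns.def: instruction name => C body.
INSN_BODIES = {
  "putobject" => "stack[sp++] = operand; break;",
  "leave"     => "return stack[sp - 1];",
}

# Hypothetical template: one C case per instruction definition.
GENERATOR_TEMPLATE = <<~'ERB_SOURCE'
  switch (insn) {
  <% insns.each do |name, body| -%>
    case INSN_<%= name %>: <%= body %>
  <% end -%>
  }
ERB_SOURCE

# Renders the template into C source, copying each instruction body into it.
def render_generator(insns)
  ERB.new(GENERATOR_TEMPLATE, trim_mode: "-").result_with_hash(insns: insns)
end

puts render_generator(INSN_BODIES)
```

Adding an entry to the table changes the rendered output without touching the template, which is the property the slide relies on to keep up with VM changes.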
  24. Ruby's JIT design
    • Based on Ruby 2.5's Ruby VM
    • If JIT is disabled, everything must work in 2.6
    • JIT implementation is automatically generated
    • To keep up with frequent Ruby VM changes
  25. PORTABILITY

  26. C compiler supports
                     GCC    Clang   Visual C++   Intel C++ Compiler
    MJIT worker      ◦      ◦       ◦            ◦
    JIT header       ◦      ◦       ×            ◦
    CLI support      ◦      ◦       ◦            ×
    Support plan     Done   Done    Next         Later
    Now MJIT worker (native thread, dynamic loading) runs on Windows and UNIX
  27. Platform supports with GCC
                     Linux   MinGW   Solaris   NetBSD   FreeBSD
    JIT header       ◦       △       ◦         ◦        ◦
    test_jit.rb      ◦       ◦       ◦         ?        ×
    MinGW header is not minimized and thus compilation is slow.
    I guess NetBSD works, but we don't have NetBSD RubyCI.
    GCC on FreeBSD is crashing.
  28. Platform supports with Clang
                     Linux   macOS   OpenBSD
    JIT header       ◦       ◦       ◦
    test_jit.rb      ◦       ◦       ?
    I guess OpenBSD works, but we don't have OpenBSD RubyCI.
  29. PERFORMANCE

  30. RubyKaigi LT: LLVM JIT

  31. RubyKaigi LT: LLVM JIT

  32. RubyKaigi: This talk https://benchmark-driver.github.io/benchmarks/mjit/commits.html

  33. RubyKaigi: This talk https://benchmark-driver.github.io/benchmarks/mjit/commits.html
    2.6.0 Preview1 → 2.6.0 Preview2: 5.7x faster
  34. Optcarrot (fps) [bar chart: Ruby 2.0, trunk, trunk+JIT]
    1.49x → 2.03x
    https://gist.github.com/k0kubun/95c81358af6f34b4d0a71425da871178
  35. Rails (Discourse)

  36. Rails (Discourse)

  37. 2. JIT ON RAILS

  38. Why Rails becomes slow with JIT
    • Generated code should be faster in general
    • What's different from Optcarrot?
  39. My hypothesis
    1. longjmp by exception is slow
    2. Profiling method calls has overhead
    3. JIT-ed call is canceled too often
    4. JIT compilation has overhead
    5. Calling JIT-ed code has overhead
  40. longjmp by exception is slow
    • When a method returns from inside its child block, it calls longjmp(3)
    • The VM implements this with just a return statement and may be faster in that case
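
For readers unfamiliar with the case on slide 40, this is the Ruby pattern in question: a `return` written inside a block that must leave the enclosing method. The method name `first_even` is a made-up example, not code from the talk.

```ruby
# `return` inside the block exits first_even itself, not just the block,
# so control has to unwind through Array#each on the way out.
def first_even(numbers)
  numbers.each do |n|
    return n if n.even?
  end
  nil
end

p first_even([1, 3, 4, 5])
p first_even([1, 3])
```

The unwinding through `each` is the part that the slide describes as costing a longjmp(3) in JIT-ed code.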
  41. Let's check if longjmp is called
    • Fortunately, longjmp was not used in this Discourse endpoint
  42. Profiling method calls has overhead
    • With JIT enabled, MJIT counts method calls to decide which methods to compile
    • This was suspected in [Bug #14490]
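
The call counting on slide 42 can be pictured as follows. This is a conceptual sketch only: `CallCounter`, its threshold parameter, and the queue are hypothetical names, and the real counting happens inside the VM in C rather than in a Ruby hash.

```ruby
# Counts calls per method and enqueues a method for compilation
# once it reaches the threshold (hypothetical model of the profiling step).
class CallCounter
  attr_reader :queue

  def initialize(min_calls)
    @min_calls = min_calls
    @counts = Hash.new(0)
    @queue = []
  end

  def record(method_name)
    @counts[method_name] += 1
    # Enqueue exactly once, at the moment the threshold is reached.
    @queue << method_name if @counts[method_name] == @min_calls
  end
end

counter = CallCounter.new(3)
5.times { counter.record(:three) }
counter.record(:six)
p counter.queue
```

The increment runs on every call, including calls to methods that never get compiled, which is why it was a suspect for overhead.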
  43. Let's count it even if JIT is disabled

  44. No big difference by profiling method calls
                     trunk, no options   modified, no options   trunk, --jit
    JIT              ×                   ×                      ◦
    Profiling        ×                   ◦                      ◦
    GET / 50%        58.4ms              58.5ms                 66.3ms
    GET / 75%        65.4ms              64.6ms                 72.3ms
    GET / 90%        67.9ms              67.8ms                 77.0ms
    GET / 99%        131.1ms             127.3ms                133.3ms
    `ruby script/simple_bench.rb 1000` with: https://github.com/k0kubun/discourse/tree/20fc03558f16aff94c6c017347783374cf4a0ca8
  45. JIT-ed call is cancelled too often
    • MJIT has a kind of de-optimization that falls back to VM interpretation when any assumption is not met
    • e.g. method redefinition
    • Such fallback might be an overhead
  46. Let's log all JIT cancellation

  47. The ratio of JIT cancellation
                                 JIT-ed calls   Cancel by opt_xxx     Cancel by call cache
    Optcarrot                    49,171,765     786,842 (1.60%)       0 (0.00%)
    Discourse, 1,000 requests    168,925,050    19,394,792 (11.5%)    10,092,254 (5.97%)
    JIT cancel reasons:
    • opt_xxx: Non-core class is given to +, -, *, /, #[], etc.
    • call cache: Method redefinition, receiver class is changed
  48. Why JIT cancel happens so often
    • Current JIT doesn't discard any JIT-ed code whose assumption is not met
    • opt_xxx performs badly when the receiver is not a core class like Integer, Float, String, Array, Hash
  49. Why JIT cancel happens so often
    • Current JIT doesn't discard any JIT-ed code whose assumption is not met
    • opt_xxx performs badly when the receiver is not a core class like Integer, Float, String, Array, Hash
    There are many #[] calls on non Hash/Array classes in Rails
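
A small example of the situation slide 49 describes. The `Params` class and `lookup` method are hypothetical; the point is only that the same `container[key]` call site, which CRuby compiles to the specialized `opt_aref` instruction, sees both core and non-core receivers.

```ruby
# A non-core class that defines #[], similar in spirit to the
# hash-like wrapper objects that are common in Rails applications.
class Params
  def initialize(values)
    @values = values
  end

  def [](key)
    @values[key.to_s]
  end
end

# One call site, many receiver classes. The fast path only covers
# core classes such as Hash and Array.
def lookup(container, key)
  container[key]
end

p lookup({ a: 1 }, :a)
p lookup([10, 20], 1)
p lookup(Params.new("a" => 2), :a)
```

The third call is the kind that, per the slide, made JIT-ed code fall back to the VM.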
  50. I fixed this issue for [] (r…)

  51. opt_xxx cancel is decreased much
                        JIT-ed calls   Cancel by opt_xxx     Cancel by call cache
    Discourse Before    168,925,050    19,394,792 (11.5%)    10,092,254 (5.97%)
    Discourse After     75,150,482     2,849,825 (3.79%)     3,072,673 (4.09%)
    #[] has a major impact on Rails. Others are to be improved...
  52. JIT compilation has overhead
    • Appending a method to the JIT queue may have overhead
    • GCC or Clang may use the same CPU core, or transferring data to another core may be costly
  53. Prepare RubyVM::MJIT.stop and…
    Stop JIT compilation
  54. JIT enabled vs JIT stopped [diagram]
    JIT enabled: 1000 requests, then JIT enabled: 1000 requests (Measure)
    JIT enabled: 1000 requests, then RubyVM::MJIT.stop, then JIT stopped: 1000 requests (Measure)
  55. JIT compilation had overhead
                        No options   --jit → Stop   --jit
    Code is JIT-ed      ×            ◦              ◦
    JIT is going on     ×            ×              ◦
    GET / 50%           60.4ms       65.1ms         68.4ms
    GET / 75%           66.9ms       72.4ms         74.8ms
    GET / 90%           69.6ms       75.8ms         80.0ms
    GET / 99%           125.4ms      145.6ms        137.2ms
    But this overhead is excluded from [Bug #14490] degradation…
  56. Calling JIT-ed code has overhead
    • JIT-ed code is slower only on an exception or JIT cancellation, but they weren't the culprit
    • JIT compilation does not dominate the slowness
    • Then, calling native code has overhead…?
  57. Let's check JIT-ed call overhead

  58. Let's check JIT-ed call overhead

  59. Calling JIT-ed code was slow
                 JIT disabled   JIT enabled
    Duration     2.17s          2.45s
  60. Let's profile with perf

  61. Guard for jit wait takes time

  62. Skip jit wait check in main branch (r63480)

  63. Improved a little
                 JIT disabled   JIT enabled
    Duration     2.17s          2.31s (-0.14s)
  64. Remaining …s was for: additional memory access here [annotated code]
    …But it wasn't a big deal in Rails
  65. What if there are a lot of methods?

  66. Calling many different methods is slow
    Called methods   1 method   15 methods
    JIT disabled     3.69s      3.71s
    JIT enabled      3.79s      5.34s
    Duration with the same total calls
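
A benchmark of the shape slide 66 describes could look like the sketch below. It is not the script used in the talk: `build_benchmark`, the generated method names, and the call counts are assumptions, chosen only so that the total number of calls stays the same while the number of distinct methods varies.

```ruby
require 'benchmark'

# Builds an object with `method_count` trivial methods plus a `run` method
# that calls each of them once per round (hypothetical benchmark shape).
def build_benchmark(method_count)
  defs  = (0...method_count).map { |i| "def m#{i}; #{i}; end" }
  calls = (0...method_count).map { |i| "    sum += m#{i}" }
  source = [
    *defs,
    "def run(rounds)",
    "  sum = 0",
    "  i = 0",
    "  while i < rounds",
    *calls,
    "    i += 1",
    "  end",
    "  sum",
    "end",
  ].join("\n")
  klass = Class.new
  klass.class_eval(source)
  klass.new
end

TOTAL_CALLS = 3_000_000
[1, 15].each do |method_count|
  target = build_benchmark(method_count)
  # Same total number of calls, spread over a different number of methods.
  seconds = Benchmark.realtime { target.run(TOTAL_CALLS / method_count) }
  puts format("%2d methods: %.2fs", method_count, seconds)
end
```

Running it once with and once without a JIT option, and raising `TOTAL_CALLS` until timings are stable, gives the four cells of the table.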
  67. Calling many different methods is slow [line chart: y-axis 0 to 6, x-axis number of called methods from 1 to 39; series: VM, JIT]
  68. Calling many different methods is slow [same chart, annotated at 6, 12 and 19 methods]
  69. Hot methods of Optcarrot: Top 6 methods dominate 50%

  70. Hot methods of Discourse: Top 6 methods are only 18%
    They are not so hot.
  71. Why does this happen? [same chart as slide 68: VM vs JIT, annotated at 6, 12 and 19 methods]
  72. Let's see "perf stat": 6 methods vs 40 methods

  73. "insn per cycle" is very different: 6 methods vs 40 methods

  74. …x cycles for almost the same insns: 6 methods vs 40 methods

  75. 6 methods

  76. 40 methods

  77. Each method (so file) is using one page (… MB)

  78. 40 methods w/ the same so file (PoC)

  79. Reasons for Rails slowdown on JIT
    • Ongoing JIT compilation may have overhead
    • JIT cancel is happening frequently (to be fixed)
    • It stalls when loading many different methods (to be fixed)
  80. 3. DIVE INTO NATIVE CODE

  81. Example
    Ruby code: def three; 1 + 2; end
    Bytecode: putobject 1; putobject 2; opt_plus; leave
  82. Example (Ruby code and bytecode as on slide 81)
    C code:
        three() {
          VALUE stack[2];
          /* putobject 1 */ stack[0] = 1;
          /* putobject 2 */ stack[1] = 2;
          /* opt_plus */ stack[0] = opt_plus(stack[0], stack[1]);
          /* leave */ return stack[0];
        }
  83. More detailed definition before inlining opt_plus

  84. opt_plus inlined

  85. Native code generated by GCC (output of perf)

  86. Integer#+ redefinition check

  87. Integer#+ redefinition check; JIT cancel handler (let's ignore this)

  88. Integer#+ redefinition check

  89. Integer#+ redefinition check; SET_SP: VM's behavior which can be removed

  90. Integer#+ redefinition check; SET_SP: VM's behavior which can be removed; check interrupts like SIGINT, another thread; interruption handler
  91. Integer#+ redefinition check; SET_SP: VM's behavior which can be removed; check interrupts like SIGINT, another thread; interruption handler; pop VM call frame
  92. Integer#+ redefinition check; SET_SP: VM's behavior which can be removed; check interrupts like SIGINT, another thread; interruption handler; pop VM call frame; return 3: FIX2INT(0x7) == 3
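
The `FIX2INT(0x7) == 3` annotation follows from how CRuby tags small integers: the value is shifted left by one bit and the lowest bit is set. A short worked check in Ruby, where `int2fix` and `fix2int` are hypothetical helper names that mirror the C macros:

```ruby
# Tagged representation of a small integer: (n << 1) | 1.
def int2fix(n)
  (n << 1) | 1
end

# Inverse: drop the tag bit with an arithmetic right shift.
def fix2int(v)
  v >> 1
end

puts format("3 is stored as 0x%x", int2fix(3))
puts format("6 is stored as 0x%x", int2fix(6))
puts fix2int(0x7)
```

So a native function that returns the constant 0x7 is returning the Ruby integer 3, and the 0xd seen later in the deck is the Ruby integer 6.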
  93. So what?

  94. [annotated interpreter code: four "Instruction dispatch" markers]
    1. Instruction dispatch cost is removed
  95. [annotated interpreter code: four "Program counter motion" markers]
    2. No program counter motion
  96. [annotated interpreter code: three "Stack pointer motion" markers, one labeled "Forgot to remove this"]
    3. Stack pointer motion is reduced
  97. And also...

  98. 4. This optimization is delegated to GCC

  99. What if it's more complex?
    def six; 4 + 8 - 3 * 4 / 2; end
  100. None
  101. [annotated native code] Fixnum#+ redefinition check; Fixnum#* redefinition check; Fixnum#/ redefinition check; Fixnum#- redefinition check; return 6: FIX2INT(0xd) == 6
  102. Last example: while loop
        def while_loop
          i = 0
          while i < 1000000
            i += 1
          end
        end

        i = 0
        while i < 2000
          while_loop
          i += 1
        end
  103. Last example: while loop
    VM: 22.9s, JIT: 2.8s (8.18x faster)
    Why does it become so much faster?
  104. Let's see the native code of while_loop
    i = 0; while i < 1000000; i += 1; end
  105. None
  106. They are JIT cancel handlers. Let's ignore them.

  107. None
  108. They are interruption handlers. Let's ignore them.

  109. None
  110. They are write-barrier-related slow paths. Let's ignore them.

  111. None
  112. It is the Bignum promotion handler. Let's ignore it.

  113. None
  114. None
  115. Again, let's see the native code of:
    i = 0; while i < 1000000; i += 1; end
  116. [annotated native code] i = 0
  117. [annotated native code] i = 0; check interrupts
  118. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<
  119. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?
  120. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000
  121. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000; check interrupts
  122. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000; check interrupts; Fixnum#+ redefined?
  123. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000; check interrupts; Fixnum#+ redefined?; i + 1
  124. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000; check interrupts; Fixnum#+ redefined?; i + 1; Int overflow?
  125. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000; check interrupts; Fixnum#+ redefined?; i + 1; Int overflow?; can't optimize #+ ?
  126. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000; check interrupts; Fixnum#+ redefined?; i + 1; Int overflow?; can't optimize #+ ?; set i for VM + check WB
  127. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000; check interrupts; Fixnum#+ redefined?; i + 1; Int overflow?; can't optimize #+ ?; set i for VM + check WB; set i for JIT
  128. [annotated native code] i = 0; check interrupts; Fixnum?(i) for #<; Fixnum#< redefined?; i < 1000000; check interrupts; Fixnum#+ redefined?; i + 1; Int overflow?; can't optimize #+ ?; set i for VM + check WB; set i for JIT
  129. Why while loop becomes faster
    • #+ and #< are performed not on the VM stack but in registers
    • #+ and #< share some instructions to check redefinition
    • Unnecessary type checks are omitted from the loop
  130. 4. METHOD INLINING

  131. Let C compiler work hard
    • Many optimizations are possible because the C compiler can know definitions
    • If we could inline methods, the C compiler would be able to optimize more
  132. When is method inlining possible?
    1. x
    2. x
    3. x
  133. When is method inlining possible?
    1. JIT compiler can know definitions
    2. JIT compiler can modify code to call a method
    3. Inlined code can be invalidated
  134. When is method inlining possible?
    1. JIT compiler can know definitions
    2. JIT compiler can modify code to call a method
    3. Inlined code can be invalidated
  135. When is method inlining possible?
    1. JIT compiler can know definitions
    2. JIT compiler can modify code to call a method
    3. Inlined code can be invalidated
  136. Major inline targets
    • Ruby method: called by Ruby method; called by C method
    • Ruby block: yield-ed by Ruby method; called by C method
    • C method: called by Ruby method; called by C method
  137. Major inline targets
    • Ruby method: called by Ruby method => easy; called by C method
    • Ruby block: yield-ed by Ruby method; called by C method
    • C method: called by Ruby method; called by C method
    Easy because: JIT compiler can deal with bytecode easily. Method cache can be used for invalidation.
  138. Major inline targets
    • Ruby method: called by Ruby method => easy; called by C method
    • Ruby block: yield-ed by Ruby method => medium; called by C method
    • C method: called by Ruby method => medium; called by C method
    Medium because: yield doesn't have cache. Sometimes it's hard to know definitions.
  139. Major inline targets
    • Ruby method: called by Ruby method => easy; called by C method => hard
    • Ruby block: yield-ed by Ruby method => medium; called by C method => hard
    • C method: called by Ruby method => medium; called by C method => hard
    Hard because: There is no cache key for invalidation. How to modify C code?
  140. Ruby → C → Ruby inlining problem
        ret = 0
        1000000.times do |i|
          ret += i
        end
        ret
  141. Ruby → C → Ruby inlining problem (code as on slide 140)
    Ruby -> C method call: medium (Integer#times is defined with C)
  142. Ruby → C → Ruby inlining problem (code as on slide 140)
    Ruby -> C method call: medium (Integer#times is defined with C)
    C -> Ruby block call: hard
  143. What if... Ruby can be faster than C?

  144. What if... Ruby can be faster than C?

  145. Let's define Integer#times with Ruby https://github.com/rubinius/rubinius/blob/master/core/integer.rb
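
A pure-Ruby version of the iteration might look like the following sketch. It is written from scratch for illustration and is not copied from the Rubinius file linked on the slide; the method is given the hypothetical name `times_in_ruby` so that the built-in `Integer#times` stays untouched.

```ruby
class Integer
  # Hypothetical pure-Ruby counterpart of Integer#times.
  def times_in_ruby
    return to_enum(:times_in_ruby) unless block_given?
    i = 0
    while i < self
      yield i   # a Ruby -> Ruby block call, which a JIT can see through
      i += 1
    end
    self
  end
end

# Same shape as the benchmark body on slide 140.
def sum_below(limit)
  ret = 0
  limit.times_in_ruby { |i| ret += i }
  ret
end

p sum_below(5)
```

With the loop written in Ruby, both the method call and the `yield` become Ruby-to-Ruby transitions, which are the "easy" and "medium" rows of the inline-target list rather than the "hard" ones.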

  146. I implemented a prototype to inline this
    • Ruby method: called by Ruby method => easy; called by C method => hard
    • Ruby block: yield-ed by Ruby method => medium; called by C method => hard
    • C method: called by Ruby method => medium; called by C method => hard
    https://github.com/k0kubun/ruby/commits/mjit-inline-send-yield
  147. Time to benchmark

  148. Integer#times benchmark results
           Integer#times in C   Integer#times in Ruby
    VM     145.44s (1.00x)      156.38s (0.93x)
    JIT
    time ruby --disable-gems times_loop.rb
  149. Integer#times benchmark results
           Integer#times in C   Integer#times in Ruby
    VM     145.44s (1.00x)      156.38s (0.93x)
    JIT    104.80s (1.39x)
    time ruby --disable-gems times_loop.rb
  150. Integer#times benchmark results
           Integer#times in C   Integer#times in Ruby
    VM     145.44s (1.00x)      156.38s (0.93x)
    JIT    104.80s (1.39x)      56.46s (2.56x)
    time ruby --disable-gems times_loop.rb
  151. C language is dead

  152. Conclusion
    • Rails performance is going to be improved
    • JIT can eliminate many instructions
    • C language will be useless in the future