High Performance Ruby - Golden Gate RubyConf

headius
September 14, 2012

A talk on the road to "high performance" Ruby in JRuby, given at GoGaRuCo 2012

Transcript

  1. Hiya • Charles Oliver Nutter • [email protected] • @headius • JVM language guy at Red Hat (JBoss) Friday, September 14, 12
  2. Performance? • Writing code • Man hours more expensive than CPU hours • Developer contentedness • Running code • Straight line
  3. High Performance? • Faster than... • ...other Ruby impls? • ...other language runtimes? • ...unmanaged languages, like C? • ...you need it to be?
  4. “Fast Enough” • 1.8.7 was fast enough • 1.9.3 is fast enough • Unless it’s not fast enough • Does it matter?
  5. Performance Wall • Move to a different runtime • Move to a different language • ...in whole or part
  6. If you’re not writing perf-sensitive code in Ruby, you’re giving up too easily.
  7. Native Extensions • Not universally bad • Just bad in MRI • Invasive • Pointers • Few guarantees
  8. What We Want • Faster execution • Better GC • Parallel execution • Big data
  9. What We Can’t Have • Faster execution • Better GC • Parallel execution • Big data
  10. Different Approach • Build our own runtime? • YARV, Rubinius, MacRuby • Use an existing runtime? • JRuby, MagLev, MacRuby, IronRuby
  11. Build or Buy • Making a new VM is “easy” • Making it competitive is really hard • I mean really, really, really hard
  12. JVM • 15+ years of engineering by whole teams • FOSS • Fastest VM available • Best GCs available • Full parallel threading with guarantees • Broad platform support
  13. But Java is Slow! • Java is very, very fast • Literally, C fast in many cases • Java applications can be slow • Oh hey, just like Ruby? • The way you write code is more important than the language you use.
  14. JRuby • Java (and Ruby) impl of Ruby on JVM • Same memory, threading model • JRuby JITs to JVM bytecode • End of story, right?
  15. Long, Hard Road • Interpreter optimization • JVM bytecode compiler • Optimizing core class methods • Lather, rinse, and repeat
  16. Align with JVM • Individual arguments on call stack • JVM local variables • Avoid artificial framing • Avoid inter-call goo • Eliminate unnecessary work
  17. Unnecessary Work • Modules are maps • Name to method • Name to constant • Name to class var • Instance variables as maps • Wasted cycles without caching
  18. Method Lookup • Inside a class/module • Current class’s methods (a map) • Methods retrieved from class + ancestors • Serial or switch indicates staleness • Weak list of child classes • Class mutation cascades down hierarchy
  19. [Diagram: classes Thing, Person, Place, Rubyist, Other; labels obj.to_s and to_s] • Method lookups go up-hierarchy • Lookup target caches result
  21. [Diagram: classes Thing, Person, Place, Rubyist, Other; labels obj.to_s and to_s] • Method lookups go up-hierarchy • Lookup target caches result • Modification cascades down
  22. [Diagram: classes Thing, Person, Place, Rubyist, Other; labels obj.to_s, to_s, and a second to_s] • Method lookups go up-hierarchy • Lookup target caches result • Modification cascades down
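The lookup-and-invalidate cycle from the method lookup slides can be sketched in plain Ruby. This is a toy model under stated assumptions, not JRuby's implementation: `ToyClass`, `find_method`, and the integer `serial` are hypothetical names, and the child list is an ordinary array rather than the weak list the slide mentions.

```ruby
# Toy model of per-class method caching. A lookup walks up the hierarchy once,
# caches the result on the class where the lookup started, and a mutation bumps
# a serial number on the mutated class and all of its descendants.
class ToyClass
  attr_reader :name, :superclass, :lookups

  def initialize(name, superclass = nil)
    @name = name
    @superclass = superclass
    @methods = {}    # methods defined directly on this class (a map)
    @cache = {}      # method name => { method:, serial: } from earlier lookups
    @children = []   # subclasses to notify on mutation (strong refs in this sketch)
    @serial = 0      # changes whenever a cached entry may be stale
    @lookups = 0     # counts uncached walks up the hierarchy
    superclass.add_child(self) if superclass
  end

  def add_child(child)
    @children << child
  end

  # Defining a method mutates the class, so invalidate it and its descendants.
  def define(method_name, &body)
    @methods[method_name] = body
    invalidate
  end

  def invalidate
    @serial += 1
    @children.each(&:invalidate)
  end

  # Return the cached method if its serial is current, otherwise look it up again.
  def find_method(method_name)
    entry = @cache[method_name]
    return entry[:method] if entry && entry[:serial] == @serial

    @lookups += 1
    found = search_hierarchy(method_name)
    @cache[method_name] = { method: found, serial: @serial }
    found
  end

  protected

  def search_hierarchy(method_name)
    @methods[method_name] || (superclass && superclass.search_hierarchy(method_name))
  end
end

thing = ToyClass.new("Thing")
person = ToyClass.new("Person", thing)
rubyist = ToyClass.new("Rubyist", person)

thing.define(:to_s) { "a thing" }
first = rubyist.find_method(:to_s).call    # walks up to Thing, then caches
second = rubyist.find_method(:to_s).call   # served from Rubyist's cache
person.define(:to_s) { "a person" }        # modification cascades down to Rubyist
third = rubyist.find_method(:to_s).call    # stale entry, so the walk happens again
```

In this sketch the second call never leaves `Rubyist`, and the redefinition on `Person` is what forces the third call back onto the slow path.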
  23. Constant Lookup • Cache at lookup site • Global serial/switch indicates staleness • Complexities of lookup, etc • Joy of Ruby interfering with Joy of Opto • Modifying constants triggers invalidation
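A rough sketch of site-level constant caching with one global serial, as the constant lookup slide describes it. `ConstantTable` and `ConstantSite` are hypothetical names invented for this illustration, and real constant lookup (lexical scope, ancestors, autoload) is far more involved than a single hash.

```ruby
# One table of constants with a single global serial number.
class ConstantTable
  attr_reader :serial

  def initialize
    @values = {}
    @serial = 0
  end

  # Any constant definition or modification invalidates every cached site at once.
  def set(name, value)
    @values[name] = value
    @serial += 1
  end

  # Stand-in for the full, expensive constant lookup.
  def slow_lookup(name)
    @values.fetch(name)
  end
end

# One of these per place in the code that reads a constant.
class ConstantSite
  attr_reader :misses

  def initialize(table, name)
    @table = table
    @name = name
    @cached_serial = nil
    @cached_value = nil
    @misses = 0
  end

  def value
    # Guard: reuse the cached value only while the global serial is unchanged.
    if @cached_serial != @table.serial
      @misses += 1
      @cached_value = @table.slow_lookup(@name)
      @cached_serial = @table.serial
    end
    @cached_value
  end
end

table = ConstantTable.new
table.set(:LIMIT, 10)
site = ConstantSite.new(table, :LIMIT)
site.value             # miss: performs the slow lookup
site.value             # hit: serial unchanged
table.set(:OTHER, 1)   # unrelated constant, but the global serial still moves
site.value             # miss again, even though LIMIT did not change
```

The last call shows the trade-off of a global serial: the check is a single comparison, but any constant modification flushes every site.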
  24. Instance Vars • Class holds a table of offsets • Object holds array of values • Call site caches offset plus class ID • Same class, no lookup cost • Can be polymorphically chained
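The offset-table layout above can be modeled in a few lines of Ruby. All names here (`ToyShape`, `ToyObject`, `IvarSite`) are assumptions made for the illustration, the "class ID" is simply object identity of the shape, and the site is monomorphic rather than polymorphically chained.

```ruby
# Class side: maps instance variable names to offsets.
class ToyShape
  def initialize
    @offsets = {}
  end

  # Return the existing offset for a name, or assign the next free one.
  def offset_for(name)
    @offsets[name] ||= @offsets.size
  end
end

# Object side: just a flat array of values, indexed by offset.
class ToyObject
  attr_reader :shape, :values

  def initialize(shape)
    @shape = shape
    @values = []
  end
end

# One access site per "@bar" in the source code.
class IvarSite
  attr_reader :slow_lookups

  def initialize(name)
    @name = name
    @cached_shape = nil
    @cached_offset = nil
    @slow_lookups = 0
  end

  def get(object)
    object.values[offset_in(object)]
  end

  def set(object, value)
    object.values[offset_in(object)] = value
  end

  private

  # Same class as last time: reuse the offset. Otherwise consult the table.
  def offset_in(object)
    unless @cached_shape.equal?(object.shape)
      @slow_lookups += 1
      @cached_offset = object.shape.offset_for(@name)
      @cached_shape = object.shape
    end
    @cached_offset
  end
end

shape = ToyShape.new
shape.offset_for("@foo")          # "@foo" takes offset 0
bar_site = IvarSite.new("@bar")   # "@bar" will take offset 1

a = ToyObject.new(shape)
b = ToyObject.new(shape)
bar_site.set(a, 1)                # slow path once, offset cached
bar_site.set(b, 2)                # same class, no lookup
```

After the first access, every read or write through `bar_site` on an object of the same class is a plain array index.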
  25. Optimizing Ruby • Make calls fast • Make constants free • Make instance variables cheap • Make closures lightweight • TODO
  26. Dynamic? • Dynamic typing is a common reason, but there are many others
  27. JVM 101 • 200 opcodes • Invocation: invokevirtual, invokeinterface, invokestatic, invokespecial • Field Access: getfield, putfield, getstatic, putstatic • Ten (or 16) “data endpoints”
  28. JVM 101 • 200 opcodes • Invocation: invokevirtual, invokeinterface, invokestatic, invokespecial • Field Access: getfield, putfield, getstatic, putstatic • Array Access: *aload, *astore (b, s, c, i, l, d, f, a) • Ten (or 16) “data endpoints”
  29. JVM 101 • 200 opcodes • Invocation: invokevirtual, invokeinterface, invokestatic, invokespecial • Field Access: getfield, putfield, getstatic, putstatic • Array Access: *aload, *astore (b, s, c, i, l, d, f, a) • Ten (or 16) “data endpoints” • All Java code revolves around these endpoints • Remaining ops are stack, local vars, flow control, allocation, and math/boolean/bit operations
  30. JVM Opcodes • Invocation: invokevirtual, invokeinterface, invokestatic, invokespecial • Field Access: getfield, putfield, getstatic, putstatic • Array Access: *aload, *astore (b, s, c, i, l, d, f, a)
  31. JVM Opcodes • Invocation: invokevirtual, invokeinterface, invokestatic, invokespecial • Field Access: getfield, putfield, getstatic, putstatic • Array Access: *aload, *astore (b, s, c, i, l, d, f, a) • Stack • Local Vars • Flow Control • Allocation • Boolean and Numeric
  34. In Detail • JRuby generates code with indy calls • JVM at first call asks JRuby what to do • JRuby provides function pointers to code • Pointers include guards, invalidation logic • JRuby and JVM cooperate on optimizing
  35. Dynamic Invocation • [Diagram: Target Object, associated with a Method Table (def foo ..., def bar ...); obj.foo(); JVM]
  36. Dynamic Invocation • [Diagram: Target Object, associated with a Method Table (def foo ..., def bar ...); obj.foo(); JVM; Call Site] • VM Operations
  38. Dynamic Invocation • [Diagram: Target Object, associated with a Method Table (def foo ..., def bar ...); obj.foo(); JVM; Call Site holding def foo ...] • VM Operations: Method Lookup
  39. Dynamic Invocation • [Diagram: Target Object, associated with a Method Table (def foo ..., def bar ...); obj.foo(); JVM; Call Site holding def foo ...] • VM Operations: Method Lookup, Branch
  40. Dynamic Invocation • [Diagram: Target Object, associated with a Method Table (def foo ..., def bar ...); obj.foo(); JVM; Call Site holding def foo ...] • VM Operations: Method Lookup, Branch, Method Cache
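The call site in these diagrams combines a method lookup, a branch (the guard), and a cache. A minimal Ruby sketch of that shape follows, assuming a hypothetical `ToyCallSite` class. It shows only the guard and relink steps: invalidation when a method is later redefined is left out, and singleton methods are ignored, so it is not a faithful model of what JRuby binds through invokedynamic.

```ruby
# A monomorphic call site: remembers the last receiver class and the method
# that was found for it, and re-links only when the guard fails.
class ToyCallSite
  attr_reader :relinks

  def initialize(method_name)
    @method_name = method_name
    @cached_class = nil
    @cached_target = nil
    @relinks = 0
  end

  def call(receiver, *args)
    # Guard (the "branch"): the cached target is valid only for the class it was linked for.
    relink(receiver.class) unless receiver.class.equal?(@cached_class)
    @cached_target.bind(receiver).call(*args)
  end

  private

  # Slow path (the "method lookup"): find the method and remember it (the "method cache").
  def relink(klass)
    @relinks += 1
    @cached_target = klass.instance_method(@method_name)
    @cached_class = klass
  end
end

site = ToyCallSite.new(:upcase)
r1 = site.call("foo")   # first call links the site to String#upcase
r2 = site.call("bar")   # guard passes, no lookup
r3 = site.call(:baz)    # receiver is a Symbol, guard fails, site relinks
```

On the JVM the guard and target are visible to the JIT, which is what lets it inline straight through the call site.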
  41. Instance Variables • [Diagram: Target Object, associated with an Offset Table (“@foo” => 0, “@bar” => 1); @bar; JVM]
  42. Instance Variables • [Diagram: Target Object, associated with an Offset Table (“@foo” => 0, “@bar” => 1); @bar; JVM; Access Site] • VM Operations
  43. Instance Variables • [Diagram: Target Object, associated with an Offset Table (“@foo” => 0, “@bar” => 1); @bar; JVM; Access Site] • VM Operations: Instance Var Lookup
  44. Instance Variables • [Diagram: Target Object, associated with an Offset Table (“@foo” => 0, “@bar” => 1); @bar; JVM; Access Site holding 1] • VM Operations: Instance Var Lookup, Offset Cache
  45. Instance Variables • [Diagram: Target Object, associated with an Offset Table (“@foo” => 0, “@bar” => 1); @bar; JVM; Access Site holding 1] • VM Operations: Instance Var Lookup, Offset Cache, Access Object
  47. How Do We Know We’ve Succeeded? • Benchmarking • Monitoring • User reports
  48. Benchmarking is Hard • Runtimes may improve over time • Optimizer may eliminate useless code • Small systems are completely different • Know how your runtime optimizes!
  49. bench_empty_method

        def foo; self; end

        i = 0
        while i < 10_000_000
          foo; foo; foo; foo; foo
          i += 1
        end
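Because a runtime may improve over time, a single timing of this loop says little. One hedged way to see warmup is to time the same loop several times in one process. The helper name `time_passes` and the reduced iteration count are assumptions for this sketch; the slide's benchmark uses 10_000_000 iterations.

```ruby
require "benchmark"

def foo; self; end

# Run the same loop `passes` times and return the wall-clock time of each pass.
def time_passes(passes, iterations)
  Array.new(passes) do
    Benchmark.realtime do
      i = 0
      while i < iterations
        foo; foo; foo; foo; foo
        i += 1
      end
    end
  end
end

times = time_passes(5, 100_000)
times.each_with_index do |seconds, pass|
  puts format("pass %d: %.4fs", pass + 1, seconds)
end
```

On a JIT-compiling runtime the later passes are the ones worth comparing; the early ones mostly measure the interpreter and the compiler itself.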
  50. [Bar chart, time axis 0s to 4s: Ruby 1.9.3, JRuby, JRuby + indy] • ZOMG 40X FASTER!
  51. [Bar chart, time axis 0s to 4s: Ruby 1.9.3, JRuby, JRuby + indy]
  52. JVM Opto 101 • JITs code bodies after 10k calls • No 10k calls, no JIT (generally) • Inlines up to two targets • Optimistic • Early decisions may be wrong • Small code looks drastically different
  53. Inlining • Call site in method A and method B match • JVM treats them as though B lived in A • No call overhead • Variables visible across call boundary • More complete view for optimization
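As a source-level picture of what inlining buys, the two definitions below compute the same result. The method names are made up for this illustration, and the JVM performs the transformation on compiled code, not on Ruby source.

```ruby
def b(x)
  x + 1
end

# Method A contains a call site whose target is b.
def a(x)
  b(x) * 2
end

# Roughly what the optimizer gets to work with once b is treated as living in a:
# no call, and x is visible to the whole expression.
def a_inlined(x)
  (x + 1) * 2
end

original = a(3)
inlined = a_inlined(3)
```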
  54. Optimistic • Say we have a system... • The only method dynamically called is “foo” • All logic for dyncall revolves around “foo” • Hotspot thinks all dyncalls will be “foo”
  55. bench_empty_method2

        def foo; self; end
        def bar1; self; end
        def bar2; self; end

        i = 0
        while i < 10_000_000
          bar1; bar1; bar1; bar1; bar1
          bar2; bar2; bar2; bar2; bar2
          i += 1
        end
        ...
  56. [Bar chart, time axis 0s to 0.7s: bench1, bench2, bench1 + indy, bench2 + indy]
  57. [Bar chart, time axis 0s to 0.4s: bench1 + rbx, bench2 + rbx, bench1 + indy, bench2 + indy]
  58. What Happened? • An unrelated change slowed our bench? • Not really unrelated • Hotspot optimizes early loop first • Later loop is different...calls “foo” • Assumptions change, perf looks different
  59. Benchmarking is Not Enough • Need to monitor runtime optimization • JIT compilation • Inlining • Eventual native code (x86 ASM) • Fun?
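One way to watch JIT compilation, inlining, and native code is through HotSpot's logging flags, passed through JRuby with the `-J` prefix. The sketch below only builds the command line. The helper name and the script file name `bench_empty_method.rb` are assumptions, the flags are standard HotSpot options of that era (behavior on newer JVMs may differ), and `PrintAssembly` additionally needs the hsdis disassembler plugin installed.

```ruby
# Build a jruby command line that turns on HotSpot's JIT logging.
def jit_logging_command(script, assembly: false)
  flags = [
    "-Xcompile.invokedynamic=true",       # JRuby: compile calls as invokedynamic sites
    "-J-XX:+PrintCompilation",            # log each method as it gets JIT-compiled
    "-J-XX:+UnlockDiagnosticVMOptions",   # required for the diagnostic flags below
    "-J-XX:+PrintInlining"                # log which calls were inlined
  ]
  # Dump the generated native code; only works with the hsdis plugin present.
  flags << "-J-XX:+PrintAssembly" if assembly
  ["jruby", *flags, script]
end

command = jit_logging_command("bench_empty_method.rb")
puts command.join(" ")
# system(*command)   # uncomment to run; requires jruby on the PATH
```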
  60. [JIT inlining log]

        1711 4 % bench_empty_method::block_0$RUBY$__file__ @ 56 (171 bytes)
        @ 59 java.lang.invoke.MethodHandle::invokeExact (33 bytes) inline (hot)
        @ 5 java.lang.invoke.MethodHandle::invokeExact (20 bytes) inline (hot)
        @ 2 java.lang.invoke.MethodHandle::invokeExact (9 bytes) inline (hot)
        @ 2 java.lang.invoke.MutableCallSite::getTarget (5 bytes) inline (hot)
        @ 16 java.lang.invoke.MethodHandle::invokeExact (5 bytes) inline (hot)
        @ 1 sun.invoke.util.ValueConversions::identity (2 bytes) inline (hot)
        @ 12 java.lang.invoke.MethodHandle::invokeExact (10 bytes) inline (hot)
        @ 29 java.lang.invoke.MethodHandle::invokeExact (35 bytes) inline (hot)
        @ 5 java.lang.invoke.MethodHandle::invokeExact (7 bytes) inline (hot)
        @ 3 org.jruby.runtime.invokedynamic.InvocationLinker::testMetaclass (17 bytes) inline (hot)
        @ 5 org.jruby.RubyBasicObject::getMetaClass (5 bytes) inline (hot)
        @ 14 java.lang.invoke.MethodHandle::invokeExact (10 bytes) inline (hot)
        @ 31 java.lang.invoke.MethodHandle::invokeExact (10 bytes) inline (hot)
        @ 6 bench_empty_method::method__0$RUBY$foo (2 bytes) inline (hot)
        @ 68 java.lang.invoke.MethodHandle::invokeExact (33 bytes) inline (hot)
        @ 5 java.lang.invoke.MethodHandle::invokeExact (20 bytes) inline (hot)
        @ 2 java.lang.invoke.MethodHandle::invokeExact (9 bytes) inline (hot)
        @ 2 java.lang.invoke.MutableCallSite::getTarget (5 bytes) inline (hot)
  63. [Native code for foo]

        Decoding compiled method 0x000000010549d7d0:
        Code:
        [Entry Point]
        [Verified Entry Point]
        [Constants]
          # {method} 'method__0$RUBY$foo' '(Lbench_empty_method;Lorg/jruby/runtime/ThreadContext;Lorg/jruby/runtime/builtin/IRubyObject;Lorg/jruby/runtime/Block;)Lorg/jruby/runtime/builtin/IRubyObject;' in 'bench_empty_method'
          # parm0: rsi:rsi = 'bench_empty_method'
          # parm1: rdx:rdx = 'org/jruby/runtime/ThreadContext'
          # parm2: rcx:rcx = 'org/jruby/runtime/builtin/IRubyObject'
          # parm3: r8:r8 = 'org/jruby/runtime/Block'
          # [sp+0x20] (sp of caller)
          0x000000010549d900: sub $0x18,%rsp
          0x000000010549d907: mov %rbp,0x10(%rsp) ;*synchronization entry
                                                  ; - bench_empty_method::method__0$RUBY$foo@-1 (line 3)
          0x000000010549d90c: mov %rcx,%rax
          0x000000010549d90f: add $0x10,%rsp
          0x000000010549d913: pop %rbp
          0x000000010549d914: test %eax,-0xe9f91a(%rip) # 0x00000001045fe000
                                                  ; {poll_return}
          0x000000010549d91a: retq
  66. bench_empty_method3

        def invoker1
          i = 0
          while i < 1000
            foo; foo; foo; foo; foo
            i+=1
          end
        end
        ...
        i = 0
        while i < 10000
          invoker1
          i+=1
        end
  67. [Bar chart, time axis 0s to 0.15s: bench1 + indy, bench2 + indy, bench3 + indy]
  68. Moral • Benchmarks are synthetic • Every system is different • Do your own testing
  69. bench_red_black • Pure-Ruby red/black tree impl • Build a 100k tree of rand(999_999) • Delete all nodes • Build it again • Search for elements • In-order walks, min, max
  70. [Bar chart: bench_red_black, time axis 0s to 5s: Ruby 1.9.3, JRuby - indy, JRuby + indy]
  71. bench_fractal, bench_flipflop_fractal • Mandelbrot generator • Integer loops • Floating-point math • Julia generator using flip-flops • I don’t really understand it.
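For a sense of the workload, here is a small Mandelbrot sketch that mixes integer loops with floating-point math. It is not the bench_fractal source from the talk; the function names, image size, and iteration limit are all chosen for this illustration.

```ruby
# Count how many iterations of z = z*z + c stay within |z| <= 2,
# for c = cr + ci*i, up to max_iterations.
def mandelbrot_iterations(cr, ci, max_iterations)
  zr = 0.0
  zi = 0.0
  i = 0
  while i < max_iterations && (zr * zr + zi * zi) <= 4.0
    zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
    i += 1
  end
  i
end

# Render the set as rows of text: "#" for points that never escaped.
def render(width, height, max_iterations)
  Array.new(height) do |row|
    ci = (row.to_f / height) * 2.0 - 1.0     # imaginary axis, -1.0 up to 1.0
    Array.new(width) do |col|
      cr = (col.to_f / width) * 3.0 - 2.0    # real axis, -2.0 up to 1.0
      mandelbrot_iterations(cr, ci, max_iterations) == max_iterations ? "#" : " "
    end.join
  end
end

rows = render(60, 24, 50)
puts rows
```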
  72. [Source of the flip-flop fractal benchmark]

        def fractal_flipflop
          w, h = 44, 54
          c = 7 + 42 * w
          a = [0] * w * h
          g = d = 0
          f = proc do |n|
            a[c] += 1
            o = a.map {|z| " :#"[z, 1] * 2 }.join.scan(/.{#{w * 2}}/)
            puts "\f" + o.map {|l| l.rstrip }.join("\n")
            d += 1 - 2 * ((g ^= 1 << n) >> n)
            c += [1, w, -1, -w][d %= 4]
          end
          1024.times do
            !!(!!(!!(!!(!!(!!(!!(!!(!!(true...
              f[0])...f[1])...f[2])...
              f[3])...f[4])...f[5])...
              f[6])...f[7])...f[8])
          end
        end
  74. [Bar chart: bench_fractal, time axis 0s to 1.5s: Ruby 1.9.3, JRuby - indy, JRuby + indy]
  75. Rails Perf • Mixed bag right now...some fast some slow • JVM JIT limits need to be bumped up • Significant gains for some folks • Long warmup times for so much code • Work continues!
  76. Expand Opto • Mixed-arity (ADD SLIDES ABOUT WHAT WE OPTIMIZE TODAY) • Super calls • Much, much lighter-weight closures • Then what?
  77. Wacky Stuff • define_method methods? • method_missing call-throughs? • respond_to??? • proc tables? • All possible...but worth it?
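For readers less familiar with the constructs named on this slide, the sketch below shows what each looks like in ordinary Ruby. The `Wacky` class and its `get_` naming scheme are invented for this example; it says nothing about how JRuby would optimize them.

```ruby
class Wacky
  # define_method: the method body is a closure rather than a plain def.
  define_method(:greet) { |name| "hello, #{name}" }

  def initialize
    @data = { color: "red" }
  end

  # method_missing call-through: unknown get_* calls are forwarded to a hash lookup.
  def method_missing(name, *args)
    key = name.to_s.sub(/\Aget_/, "").to_sym
    return @data[key] if name.to_s.start_with?("get_") && @data.key?(key)
    super
  end

  # Keeps respond_to? honest about the methods handled by method_missing.
  def respond_to_missing?(name, include_private = false)
    key = name.to_s.sub(/\Aget_/, "").to_sym
    (name.to_s.start_with?("get_") && @data.key?(key)) || super
  end
end

w = Wacky.new
greeting = w.greet("GoGaRuCo")
color = w.get_color
known = w.respond_to?(:get_color)
unknown = w.respond_to?(:get_size)
```

Each of these paths goes through extra dispatch machinery that a plain `def` avoids, which is why they show up as separate optimization questions.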
  78. The Future • JRuby will continue to get faster • Indy improvements at VM-level • Compiler improvements at Ruby level • If you can’t compete with JVM... • Still FOSS from top to bottom • Don’t be afraid!