Efficient Hosted Interpreter for Dynamic Languages

PhD defense in 06/2015

Wei Zhang

June 04, 2015

Transcript

  1. Efficient Hosted Interpreter for Dynamic Languages

    PhD Defense by Wei Zhang
    Committee: Prof. Michael Franz, Prof. Kwei-Jay Lin, Prof. Guoqing Xu
  2. A Modern Web Service

    • Client side: JS, Coffee…
    • Server side: Python, PHP, Ruby…
    (image courtesy of flaticon.com)
  3. Dynamic languages are no longer “scripting languages”

    • No longer simply used to accomplish small tasks
    • Ubiquitous in multiple domains
    • Appealing to programmers; offer higher “productivity”
    • Suffer from suboptimal performance
  4. Brief history of dynamic language VMs

    1. 1980s: early academic work on Smalltalk & SELF as full-custom VMs
    2. Early 90s: interpreters written in C (Python, Ruby)
    3. Late 90s: more powerful and popular VMs for statically typed OO languages, such as the JVM and CLR (Java & C#)
    4. Early 00s: hosted dynamic language VMs (Rhino, Jython, JRuby)
    5. Late 00s: second coming of full-custom VMs for dynamic languages (V8)
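  5. Architectural choice of dynamic language VMs

    [Diagram: three architectures]
    • Full-custom: the target language runs on a VM with its own JIT and GC
    • Interpreter-based: the target language runs on a VM built around an interpreter
    • Hosted: the target language runs on a hosted VM, which runs on a hosting VM that provides the JIT and GC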
  6. Hosted VM / interpreter for dynamic languages

    • Full-custom VMs are costly to build and maintain
    • Existing VMs offer mature and powerful components (JIT, GC)
    • Interpreters are more cost-effective
    • Existing hosted VMs do not offer competitive performance
  7. ZipPy is a hosted interpreter for Python3

    • Built atop the Truffle framework
    • Supports the common features of the language
    • Open sourced at https://bitbucket.org/ssllab/zippy
    Truffle is a multi-language framework
    • Facilitates AST interpreter construction
    • Streamlines type specialization via AST node rewriting
    • Bridges the guest interpreter with the underlying JIT compiler
  8. ZipPy on Truffle

    [Diagram] Node legend: U = Uninitialized, F = Float, I = Integer
    Pipeline: Python program → (parse) → Python AST with uninitialized nodes → (interpretation with specialization) → type-specialized Python AST → (compilation) → machine code
    Software stack: ZipPy on Truffle on the JVM
  9. Agenda

    • Trufflization
    ★ Generator optimizations
    ★ Efficient object model for Python
    (★ our contributions)
  10. A for range loop example in Python

        def sum(n):
          ttl = 0
          for i in range(n):    # for range loop
            ttl += i            # addition
          return ttl
        print(sum(1000))
  11. Numeric types in Python

    [Diagram] Type coercion chain: bool → int → float → complex
    int has arbitrary precision
  12. Numeric types in ZipPy

    Python numeric type   boxed representation   unboxed representation
    bool                  PBool                  boolean
    int                   PInt                   int, BigInteger
    float                 PFloat                 double
    complex               PComplex               PComplex

    Type coercion applies at each level. int (and hence PInt) has arbitrary precision.
  13. Type specialization for addition

        abstract class AddNode extends BinaryArithmeticNode {
          @Specialization
          int doBoolean(boolean left, boolean right) {
            final int leftInt = left ? 1 : 0;
            final int rightInt = right ? 1 : 0;
            return leftInt + rightInt;
          }
          @Specialization(rewriteOn = ArithmeticException.class)
          int doInteger(int left, int right) {
            return ExactMath.addExact(left, right);
          }
          @Specialization
          BigInteger doBigInteger(BigInteger left, BigInteger right) {
            return left.add(right);
          }
          @Specialization
          double doDouble(double left, double right) {
            return left + right;
          }
          @Specialization
          PComplex doComplex(PComplex left, PComplex right) {
            return left.add(right);
          }
          @Specialization
          String doString(String left, String right) {
            return left + right;
          }
          //...
        }
  14. AddNode derivatives

    AddNode: AddUninitializedNode, AddBooleanNode, AddIntegerNode, AddBigIntegerNode, AddDoubleNode, AddPComplexNode, AddStringNode, AddGenericNode
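
    The rewriting between these node variants can be sketched in plain Python. This is a minimal illustration, not ZipPy or Truffle code: the Slot class, the execute signature and the 32-bit range check are assumptions made for the example, and the sketch falls back to a generic node on overflow where the slide's AddNode moves to a BigInteger specialization.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # assumed machine-int range, mirroring Java int


class AddGeneric:
    """Slow path: handles any operand types."""

    def execute(self, slot, left, right):
        return left + right


class AddInteger:
    """Fast path for machine-sized ints; rewrites itself when its guard fails."""

    def execute(self, slot, left, right):
        if type(left) is int and type(right) is int:
            result = left + right
            if INT_MIN <= result <= INT_MAX:
                return result
        slot.node = AddGeneric()  # guard failed or overflow: rewrite
        return slot.node.execute(slot, left, right)


class AddDouble:
    """Fast path for floats; rewrites itself when its guard fails."""

    def execute(self, slot, left, right):
        if type(left) is float and type(right) is float:
            return left + right
        slot.node = AddGeneric()
        return slot.node.execute(slot, left, right)


class AddUninitialized:
    """Picks a specialization from the first operand types it observes."""

    def execute(self, slot, left, right):
        if type(left) is int and type(right) is int:
            slot.node = AddInteger()
        elif type(left) is float and type(right) is float:
            slot.node = AddDouble()
        else:
            slot.node = AddGeneric()
        return slot.node.execute(slot, left, right)


class Slot:
    """Hypothetical parent link that holds the current version of the child node."""

    def __init__(self):
        self.node = AddUninitialized()

    def add(self, left, right):
        return self.node.execute(self, left, right)
```

    After the first call with two small ints the slot holds an AddInteger node; an overflowing addition replaces it with AddGeneric, which stays in place for later calls.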
  15. for-range loop in Python

        def sum(n):
          ttl = 0
          for i in range(n):
            ttl += i
          return ttl

    ForNode specialization for range iterator

        class ForNode extends LoopNode {
          @Specialization
          public Object doPRange(VirtualFrame frame,
                                 PRangeIterator range) {
            int start = range.getStart();
            int stop = range.getStop();
            int step = range.getStep();
            for (int i = start; i < stop; i += step) {
              ((WriteNode) target).executeWrite(frame, i);
              body.executeVoid(frame);
            }

            return PNone.NONE;
          }
        }

    [Stack diagram: Python program, ZipPy, Truffle, JVM]
  16. for-range loop in Python

        def sum(n):
          ttl = 0
          for i in range(n):
            ttl += i
          return ttl

    optimized for-range loop

        public int sum(int n) {
          int ttl = 0;
          for (int i = 0; i < n; i++) {
            ttl += i;
          }
          return ttl;
        }

    [Stack diagram: Python program, ZipPy, Truffle, JVM]
  17. for-range loop in Python

        def sum(n):
          ttl = 0
          for i in range(n):
            ttl += i
          return ttl

    JIT compiled for range loop

            jmp L7
        L6: mov ecx, edx
            add ecx, ebp
            jo L8
            mov edx, ebp
            incl edx
            mov esi, ebp
            mov ebp, edx
            mov edx, ecx
        L7: cmp eax, ebp
            jle L9
            jmp L6
        L8: call deoptimize()
        L9:

    [Stack diagram: Python program, ZipPy, Truffle, JVM]
  18. Speedups

    [Bar chart, y-axis 0 to 100. Benchmarks: binarytrees, fannkuchredux, fasta, mandelbrot, meteor, nbody, pidigits, spectralnorm, float, richards, chaos, deltablue, go, mean. Series: CPython3, CPython, Jython, PyPy, PyPy3, ZipPy.]
  19. Speedups relative to CPython3

    benchmark       CPython3   CPython   Jython     PyPy    PyPy3    ZipPy
    binarytrees         1.00      0.94     1.99     2.60     2.70     7.31
    fannkuchredux       1.00      0.97     0.51    44.53    47.29    87.50
    fasta               1.00      1.04     1.55    11.73    11.24    15.57
    mandelbrot          1.00      1.08     0.34    10.91    10.82    11.69
    meteor              1.00      1.02     0.77     2.64     2.62     2.13
    nbody               1.00      0.97     0.73    12.13    12.06     6.17
    pidigits            1.00      1.00     0.62     0.98     0.95     0.60
    spectralnorm        1.00      1.33     1.89   127.33   127.25   128.10
    float               1.00      0.95     1.05     8.64     8.67    17.71
    richards            1.00      0.94     1.21    29.53    29.25    50.13
    chaos               1.00      1.17     1.55    40.88    25.69    68.28
    deltablue           1.00      0.85     1.33    30.08    29.14    23.46
    go                  1.00      1.08     1.99     6.79     6.66    15.41
    mean                1.00      1.02     1.05    12.15    11.68    15.34

    ZipPy is competitive with PyPy and is a fast Python3 on the JVM
  20. Python iterators

    [Diagram] The client calls __next__() on the iterator; the iterator returns the next value and advances
    • Iterators are ubiquitous
    • Implement the iterator protocol
    • Built-in iterators
    • User-defined iterators
    • Generators are user-defined iterators using a special control-flow construct (yield)
    • Generators exist in other languages too, like C#, PHP,…
  21. Python generators

    generator function

        def fib(n):
          a, b = 0, 1
          for i in range(n):
            a, b = b, a+b
            yield a
        # 1, 1, 2, 3, 5, 8..

    consumer loop

        l = []
        for i in fib(10):
          if i % 2 == 0:
            l.append(i)
        # [2, 8, ..]
  22. Python generators

    • We surveyed the use of generators in Python programs
    • 90% of the top 50 Python projects on PyPI and GitHub use generators
    • Given their popularity, the performance of generators is critical to Python programs
    Projects shown: Django, LXML, Jinja2, Flask, pip, Fabric, Pandas, Requests, Reddit
  23. Python generators are slow…

  24. Generator Execution

    1. The implicit call to __next__ and resume execution

    consumer loop

        l = []
        for i in fib(10):
          if i % 2 == 0:
            l.append(i)

    generator body

        def fib(n):
          a, b = 0, 1
          for i in range(n):
            a, b = b, a+b
            yield a
  25. Generator Execution

    1. The implicit call to __next__ and resume execution
    2. Evaluate the next value in generator body

    consumer loop

        l = []
        for i in fib(10):
          if i % 2 == 0:
            l.append(i)

    generator body

        def fib(n):
          a, b = 0, 1
          for i in range(n):
            a, b = b, a+b
            yield a
  26. Generator Execution

    1. The implicit call to __next__ and resume execution
    2. Evaluate the next value in generator body
    3. Suspend execution and return to the caller

    consumer loop

        l = []
        for i in fib(10):
          if i % 2 == 0:
            l.append(i)

    generator body

        def fib(n):
          a, b = 0, 1
          for i in range(n):
            a, b = b, a+b
            yield a
  27. Generator Execution

    1. The implicit call to __next__ and resume execution
    2. Evaluate the next value in generator body
    3. Suspend execution and return to the caller
    4. Consume the generated value

    consumer loop

        l = []
        for i in fib(10):
          if i % 2 == 0:
            l.append(i)

    generator body

        def fib(n):
          a, b = 0, 1
          for i in range(n):
            a, b = b, a+b
            yield a
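
    The four steps can be made explicit by desugaring the consumer loop into direct use of the iterator protocol. The sketch below uses standard Python only; the function names consume_with_for and consume_desugared are made up for the illustration.

```python
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
        yield a


def consume_with_for(n):
    evens = []
    for i in fib(n):  # the for statement hides the calls to __next__
        if i % 2 == 0:
            evens.append(i)
    return evens


def consume_desugared(n):
    evens = []
    g = iter(fib(n))
    while True:
        try:
            # Steps 1 to 3: call __next__, resume the generator body,
            # run to the next yield, suspend and return the value.
            i = next(g)
        except StopIteration:
            break
        if i % 2 == 0:  # step 4: consume the generated value
            evens.append(i)
    return evens
```

    Both functions return the same list; the desugared form only makes visible the call, resume and suspend work that every iteration pays for.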
  28. Generator Overheads

    • Only steps 2 and 4 do the real work
    • Python calls are expensive
    • Resume and suspend add costs and prevent frame optimizations

    consumer loop

        l = []
        for i in fib(10):
          if i % 2 == 0:
            l.append(i)

    generator body (resume and suspend handling shown as pseudocode)

        def fib(n):
          resume to last yield
          a, b = 0, 1
          for i in range(n):
            a, b = b, a+b
            yield a
            suspend execution
  29. Naive Inlining

    • Desugar the consumer loop and inline __next__ directly
    • The suspend and resume handling still persists

    consumer loop with the generator body inlined (pseudocode)

        l = []
        g = fib(10)
        while True:
          resume to last yield
          a, b = 0, 1
          for i in range(n):
            a, b = b, a+b
            yield a
            suspend execution
          i = a
          if i % 2 == 0:
            l.append(i)
        except StopIter:

    generator frame: 0: n, 1: a, 2: b, 3: i
    caller frame: 0: l, 1: i
  30. Generator Peeling

    • Specialize the loop over generator at runtime
    • Merge yield with consumer loop body
  31. Generator Peeling

    • Specialize the loop over generator at runtime
    • Remove suspend and resume handling

    generator body merged with the consumer loop body

        l = []
        n = 10
        a, b = 0, 1
        for i in range(n):
          a, b = b, a+b
          i = a
          if i % 2 == 0:
            l.append(i)
  32. Generator Peeling

    • Frames can be optimized during compilation

    generator body merged with the consumer loop body

        l = []
        n = 10
        a, b = 0, 1
        for i in range(n):
          a, b = b, a+b
          i = a
          if i % 2 == 0:
            l.append(i)

    generator frame: 0: n, 1: a, 2: b, 3: i
    caller frame: 0: l, 1: i
  33. Before: [AST diagram with ForNode and YieldNode]

  34. After: [AST diagram with PeeledLoopNode and FrameTransferNode]

  35. The End Result

    • Caller frame and generator frame can be optimized
    • Peeling inlines the call to __next__
    • No suspend and resume handling
    • AST level transformation, independent from compilation
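
    The effect of peeling on the fib example can be checked by hand in Python. This is a source-level sketch only: ZipPy applies the transformation to the AST at run time, whereas the function names with_generator and peeled and the renamed loop counter below are choices made for the illustration.

```python
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
        yield a


def with_generator(n):
    """Original consumer loop: each iteration calls __next__ on the generator."""
    evens = []
    for i in fib(n):
        if i % 2 == 0:
            evens.append(i)
    return evens


def peeled(n):
    """Hand-peeled equivalent: generator body and consumer body share one loop."""
    evens = []
    a, b = 0, 1
    for _ in range(n):        # the generator's own loop
        a, b = b, a + b
        i = a                 # 'yield a' becomes a plain value transfer
        if i % 2 == 0:        # consumer loop body, merged in place
            evens.append(i)
    return evens
```

    The peeled version has no call, no suspend and no resume, so all of its variables are ordinary locals of a single function.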
  36. Speedups of Generator Peeling

    Measuring peak performance of ZipPy with and without Generator Peeling (bar chart, y-axis 0 to 30)

    benchmark       speedup
    nqueens            4.53
    euler11           13.19
    euler31            2.82
    eratos             1.14
    lyndon            22.69
    partitions         4.32
    pymaging           2.76
    python-graph       1.79
    simplejson         3.67
    sympy              1.31
    whoosh             2.79
    geomean            3.58
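
    As a worked check of the summary figure, the geometric mean of the eleven per-benchmark speedups on this slide can be recomputed. The sketch assumes that 3.58 is the geomean entry and that the other eleven values are the individual benchmarks; the result does not depend on which benchmark each value belongs to.

```python
import math

# The eleven per-benchmark speedups from the slide (order is irrelevant here).
speedups = [2.79, 1.31, 3.67, 1.79, 2.76, 4.32, 22.69, 1.14, 2.82, 13.19, 4.53]


def geomean(values):
    """Geometric mean: exponential of the average of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


print(round(geomean(speedups), 2))
```

    A geometric mean is the usual choice for averaging speedup ratios, since one large outlier such as 22.69 affects it far less than it would an arithmetic mean.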
  37. The Performance of ZipPy

    Measuring peak performance of ZipPy with Generator Peeling (bar chart, logarithmic y-axis from 1 to 1000, normalized to CPython 3.4)

    benchmark       CPython 3.4   Jython 2.7   PyPy 2.3   ZipPy Baseline   ZipPy + Peeling
    nqueens                   1         2.14         12             4.53             29.05
    euler11                   1         0.75          6            13.19             57.43
    euler31                   1         0.64          8             2.82             13.09
    eratos                    1         1.68          1             1.14              3.32
    lyndon                    1         2.37         24            22.69            162.88
    partitions                1         1.72         25             4.32             40.29
    pymaging                  1         1.1          65             2.76             95.96
    python-graph              1         0.54          3             1.79              3.16
    simplejson                1         1.23         12             3.67             14.58
    sympy                     1         0.71          7             1.31              2.37
    whoosh                    1         1.39         22             2.79             56.53
    geomean                   1         1.16         11             3.58             20.59
  38. Generator peeling conclusions

    • We present a dynamic program transformation that optimizes generators for optimizing AST interpreters
    • Not restricted to ZipPy or Python
    • As a result, programmers are free to enjoy generators’ upsides
  39. Object model for dynamic languages

    Multiple data representations for built-in and custom types
    [Diagram] Representations: Java primitive, Java object (via boxing), PythonObject
    1: built-in numeric types
    2: built-in immutable types
    3: custom mutable types
  40. Modeling mutable object in Python

    HashMap based approach: a PythonObject holds a hashmap with the entries 'egg' : 42, 'ham' : 2, 'spam' : 0. The HashMap itself carries table, size, mod and loadFactor fields, with table slots 0, 1, 2.
    Hidden class approach: a PythonObject holds a layout reference and object storage (0: 42, 1: 2, spill array). The object layout (ObjectLayout) maps 'ham' : loc 0 and 'egg' : loc 1.
  41. Implementation of object storage class

        class FixedPythonObjectStorage extends PythonObject {
            static final int INT_LOCATIONS_COUNT = 5;
            protected int primitiveInt0;
            protected int primitiveInt1;
            protected int primitiveInt2;
            protected int primitiveInt3;
            protected int primitiveInt4;

            static final int DOUBLE_LOCATIONS_COUNT = 5;
            protected double primitiveDouble0;
            protected double primitiveDouble1;
            protected double primitiveDouble2;
            protected double primitiveDouble3;
            protected double primitiveDouble4;

            static final int OBJECT_LOCATIONS_COUNT = 5;
            protected Object fieldObject0;
            protected Object fieldObject1;
            protected Object fieldObject2;
            protected Object fieldObject3;
            protected Object fieldObject4;

            protected Object[] objectsArray = null;

            public FixedPythonObjectStorage(PythonClass pythonClass) {
                super(pythonClass);
            }
        }
  42. Implementation of object storage class

    [Diagram: four states of one PythonObject and its ObjectLayout]
    1. Initial: storage 0: 42, 1: (empty), spill array (empty). ObjectLayout: 'ham' : loc 0
    2. After add 'egg': storage 0: 42, 1: 2. ObjectLayout: 'ham' : loc 0, 'egg' : loc 1
    3. After add 'spam': storage 0: 42, 1: 2, spill array 0: 404. ObjectLayout: 'ham' : loc 0, 'egg' : loc 1, 'spam' : arr 0
    4. After delete 'egg': storage 0: 42, 1: 404, spill array (empty). ObjectLayout: 'ham' : loc 0, 'spam' : loc 1
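
    The add and delete transitions can be reproduced with a small Python model. It is a sketch under simplifying assumptions, not ZipPy's implementation: the two in-object slots match the slide's example, every layout change rebuilds the layout from scratch, and layouts are not shared between objects as they would be with real hidden classes. The method names are made up for the illustration.

```python
class ObjectLayout:
    """Maps attribute names to ('loc', i) in-object slots or ('arr', i) spill slots."""

    def __init__(self, names, field_count):
        self.locations = {}
        for position, name in enumerate(names):
            if position < field_count:
                self.locations[name] = ("loc", position)
            else:
                self.locations[name] = ("arr", position - field_count)


class PythonObject:
    FIELD_COUNT = 2  # assumed number of in-object slots, as in the slide's example

    def __init__(self):
        self.layout = ObjectLayout([], self.FIELD_COUNT)
        self.fields = [None] * self.FIELD_COUNT
        self.spill = []

    def get(self, name):
        kind, index = self.layout.locations[name]
        return self.fields[index] if kind == "loc" else self.spill[index]

    def _store(self, name, value):
        kind, index = self.layout.locations[name]
        if kind == "loc":
            self.fields[index] = value
        else:
            self.spill[index] = value

    def _items(self):
        return [(name, self.get(name)) for name in self.layout.locations]

    def _rebuild(self, items):
        # A layout change: compute a fresh layout and move every value over.
        self.layout = ObjectLayout([n for n, _ in items], self.FIELD_COUNT)
        self.fields = [None] * self.FIELD_COUNT
        self.spill = [None] * max(0, len(items) - self.FIELD_COUNT)
        for name, value in items:
            self._store(name, value)

    def set(self, name, value):
        if name in self.layout.locations:
            self._store(name, value)  # existing attribute: layout unchanged
        else:
            self._rebuild(self._items() + [(name, value)])

    def delete(self, name):
        self._rebuild([(n, v) for n, v in self._items() if n != name])
```

    Setting 'ham', 'egg' and 'spam' in that order places 'spam' in the spill array; deleting 'egg' then moves 'spam' into the freed in-object slot, as in state 4 above.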
  43. Inline caching for object accesses

    dispatch chain: a GetAttributeNode has a primary child (PNode) and an attribute child (LinkedDispatchNode). Each LinkedDispatchNode has a check child (LayoutCheckNode), a read child (AttributeReadNode) and a next child; the chain ends in an UninitDispatchNode.

    JIT compiled dispatch node

        cmp $0xe830f77b,r11d     ; ObjectLayout
        jne 0x00000001102e8ee9   ; next dispatch
        mov rdi,0x640(%rsp)
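
    A dispatch chain of this shape can be modeled in a few lines of Python. The class names echo the slide, but the code is an illustrative sketch rather than ZipPy's: Layout and Obj are hypothetical stand-ins, the layout check is an identity comparison, and new cache entries are prepended to the chain for simplicity.

```python
class Layout:
    """Hypothetical shared layout: attribute name -> slot index."""

    def __init__(self, slots):
        self.slots = slots


class Obj:
    def __init__(self, layout, values):
        self.layout = layout
        self.values = values


class UninitDispatch:
    """End of the chain: does the full lookup and installs a cached entry."""

    def __init__(self, name):
        self.name = name

    def read(self, site, obj):
        index = obj.layout.slots[self.name]
        site.head = LinkedDispatch(obj.layout, index, site.head)
        return obj.values[index]


class LinkedDispatch:
    """One cache entry: a layout check guarding a direct slot read."""

    def __init__(self, layout, index, next_node):
        self.layout = layout
        self.index = index
        self.next = next_node

    def read(self, site, obj):
        if obj.layout is self.layout:   # cheap identity check
            return obj.values[self.index]
        return self.next.read(site, obj)  # miss: try the next entry


class GetAttribute:
    """Attribute access site that owns the dispatch chain."""

    def __init__(self, name):
        self.head = UninitDispatch(name)

    def execute(self, obj):
        return self.head.read(self, obj)
```

    Once an entry is cached, every later access through the same layout costs one comparison and one indexed read, which corresponds to the cmp/jne/mov sequence in the compiled dispatch node.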
  44. Flexible storage class generation

    Python class Point

        class Point:
          def __init__(self, x, y):
            self.x = x
            self.y = y
        p = Point(1.2, 0.3)
        # p.x == 1.2; p.y == 0.3

    generated storage class for Point

        class Point extends FlexiblePythonObjectStorage {
            protected double x;
            protected double y;
            protected Object[] objectsArray = null;

            public Point(PythonClass pythonClass) {
                super(pythonClass);
            }
        }
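
    A rough Python analogue of generating a storage class is to build a class with one dedicated slot per known attribute. The sketch assumes the attribute names are already known when the class is generated; the function generate_storage_class and the objects_array slot name are hypothetical, and typed fields such as double have no direct counterpart in Python.

```python
def generate_storage_class(class_name, attribute_names):
    """Build a storage class with one slot per attribute plus a spill array."""
    slots = tuple(attribute_names) + ("objects_array",)

    def __init__(self):
        self.objects_array = None  # spill array for attributes added later

    return type(
        class_name + "Storage",
        (object,),
        {"__slots__": slots, "__init__": __init__},
    )


PointStorage = generate_storage_class("Point", ["x", "y"])
p = PointStorage()
p.x = 1.2
p.y = 0.3
```

    Because the generated class declares __slots__, its instances carry no per-object dictionary; attributes outside the generated slots would have to go through the spill array.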
  45. Python object layout change

    Python class Point

        class Point:
          def __init__(self, x, y):
            self.x = x
            self.y = y
          def addNeighbor(self, n):
            self.neighbors = n

    client code

        n = []
        for i in range(5):
          p = Point(i*1.0, i*0.5)
          p.addNeighbor(n)
          n.append(p)
  46. Continuous storage class generation

    [Diagram: arrows mark object layout changes within a storage class and storage class generation from one class to the next]
    1. Fixed storage (fields 0:, 1:, spill array). ObjectLayouts: {'a' : loc 0}; {'a' : loc 0, 'b' : loc 1}; {'a' : loc 0, 'b' : loc 1, 'c' : arr 0}
    2. Flexible 0 (field a:, spill array). ObjectLayouts: {'a' : loc a}; {'a' : loc a, 'b' : arr 0}; {'a' : loc a, 'b' : arr 0, 'c' : arr 1}
    3. Flexible 1 (fields a:, b:, spill array). ObjectLayouts: {'a' : loc a, 'b' : loc b}; {'a' : loc a, 'b' : loc b, 'c' : arr 0}
    4. Flexible 2 (fields a:, b:, c:, spill array). ObjectLayout: {'a' : loc 0, 'b' : loc 1, 'c' : arr 0}
  47. Performance of different object storage configurations

    (bar chart, y-axis 0.60 to 1.20)

    benchmark    fixed 5   flexible   flexible w/ continuous generation
    float           1.00       1.06       1.04
    richards        1.00       0.89       0.88
    chaos           1.00       1.14       1.14
    deltablue       1.00       1.05       1.05
    go              1.00       1.01       0.98
    mean            1.00       1.03       1.02
  48. Memory usage of fixed object storages normalized to flexible object storage

    (bar chart, y-axis 1.00 to 4.00)

    benchmark    fixed 1   fixed 3   fixed 5
    float           1.43      2.00      2.57
    richards        1.67      2.33      3.00
    chaos           1.43      2.00      2.57
    deltablue       2.20      2.80      3.60
    go              1.45      2.03      2.61
    mean            1.63      2.23      2.87
  49. Slowdown of fixed object storages normalized to flexible object storage

    (bar chart, y-axis: slowdown, 0.70 to 1.30)

    benchmark    fixed 1   fixed 3   fixed 5
    float           1.15      1.05      1.04
    richards        1.03      0.93      0.88
    chaos           1.20      1.15      1.14
    deltablue       1.06      1.03      1.05
    go              1.05      1.06      0.98
    mean            1.10      1.04      1.02
  50. Flexible object storage conclusions

    • There is always a trade-off when using fixed object storage
    • Fixed object storage leads to up to a 20% performance loss or up to 3.6x more memory usage
    • Flexible object storage always optimizes for the current state of the target Python class
    • The coexistence of multiple storage classes can introduce overhead
  51. Our contributions

    • Generator peeling: a runtime optimization targeting hosted interpreters
      • It is not restricted to Python or the implementation of ZipPy/Truffle
    • Flexible object storage: a space-efficient object model technique for class-based dynamic languages
      • Can be reused by other languages hosted on the JVM
  52. Publications

    • Wei Zhang, Per Larsen, Stefan Brunthaler, Michael Franz. Accelerating Iterators in Optimizing AST Interpreters. In Proceedings of the 29th ACM SIGPLAN Conference on Object Oriented Programming: Systems, Languages, and Applications (OOPSLA '14), Portland, OR, USA, October 20-24, 2014.
    • Gülfem Savrun-Yeniçeri, Wei Zhang, Huahan Zhang, Eric Seckler, Chen Li, Stefan Brunthaler, Per Larsen, Michael Franz. Efficient Hosted Interpreters on the JVM. In ACM Transactions on Architecture and Code Optimization, volume 11(1), pages 9:1–9:24, 2014.
    • Gülfem Savrun-Yeniçeri, Wei Zhang, Huahan Zhang, Chen Li, Stefan Brunthaler, Per Larsen, Michael Franz. Efficient Interpreter Optimizations for the JVM. In Proceedings of the 10th International Conference on Principles and Practice of Programming in Java (PPPJ '13), Stuttgart, Germany, September 11-13, 2013.
  53. Questions, please?