Slide 1

Slide 1 text

Efficient Hosted Interpreter for Dynamic Languages
PhD Defense by Wei Zhang
Committee: Prof. Michael Franz, Prof. Kwei-Jay Lin, Prof. Guoqing Xu

Slide 2

Slide 2 text

A Modern Web Service
• Client side: JS, Coffee…
• Server side: Python, PHP, Ruby…
(Image courtesy of flaticon.com)

Slide 3

Slide 3 text

Dynamic languages are no longer “scripting languages”
• No longer simply used to accomplish small tasks
• Ubiquitous in multiple domains
• Appealing to programmers; offer higher “productivity”
• Suffer from suboptimal performance

Slide 4

Slide 4 text

Brief history of dynamic language VMs
1. 1980s: early academic work on Smalltalk & SELF as full-custom VMs
2. Early 90s: interpreters written in C (Python, Ruby)
3. Late 90s: more powerful and popular VMs for statically typed OO languages, such as the JVM and CLR (Java & C#)
4. Early 00s: hosted dynamic language VMs (Rhino, Jython, JRuby)
5. Late 00s: second coming of full-custom VMs for dynamic languages (V8)

Slide 5

Slide 5 text

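Architectural choice of dynamic language VMs
[Figure: three VM architectures]
• Full-custom: a VM with its own JIT and GC runs the target language
• Interpreter-based: a VM with an interpreter runs the target language
• Hosted: a hosted VM runs the target language on top of a hosting VM that provides the JIT and GC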

Slide 6

Slide 6 text

Hosted VM / interpreter for dynamic languages
• Full-custom VMs are costly to build and maintain
• Existing VMs offer mature and powerful components (JIT, GC)
• Interpreters are more cost-effective
• Existing hosted VMs do not offer competitive performance

Slide 7

Slide 7 text

ZipPy is a hosted interpreter for Python3
• Built atop the Truffle framework
• Supports the common features of the language
• Open sourced at https://bitbucket.org/ssllab/zippy

Truffle is a multi-language framework
• Facilitates AST interpreter construction
• Streamlines type specialization via AST node rewriting
• Bridges the guest interpreter with the underlying JIT compiler

Slide 8

Slide 8 text

ZipPy on Truffle
[Figure: ZipPy parses the Python program into a Python AST of uninitialized nodes; interpretation with specialization on Truffle turns it into a type-specialized Python AST; compilation on the JVM produces machine code.]
Node legend: U: Uninitialized, F: Float, I: Integer

Slide 9

Slide 9 text

Agenda
• Trufflization
★ Generator optimizations
★ Efficient object model for Python
(★ marks our contributions)

Slide 10

Slide 10 text

A for-range loop example in Python

    def sum(n):
        ttl = 0
        for i in range(n):    # for-range loop
            ttl += i          # addition
        return ttl

    print(sum(1000))

Slide 11

Slide 11 text

Numeric types in Python
• Types: bool, int, float, complex
• Type coercion: bool -> int -> float -> complex
• int has arbitrary precision

Slide 12

Slide 12 text

Numeric types in ZipPy

    Python numeric type   Boxed representation   Unboxed representation
    bool                  PBool                  boolean
    int                   PInt                   int, BigInteger
    float                 PFloat                 double
    complex               PComplex               PComplex

• Type coercion applies in each representation
• int and PInt have arbitrary precision

Slide 13

Slide 13 text

Type specialization for addition

    abstract class AddNode extends BinaryArithmeticNode {

      @Specialization
      int doBoolean(boolean left, boolean right) {
        final int leftInt = left ? 1 : 0;
        final int rightInt = right ? 1 : 0;
        return leftInt + rightInt;
      }

      @Specialization(rewriteOn = ArithmeticException.class)
      int doInteger(int left, int right) {
        return ExactMath.addExact(left, right);
      }

      @Specialization
      BigInteger doBigInteger(BigInteger left, BigInteger right) {
        return left.add(right);
      }

      @Specialization
      double doDouble(double left, double right) {
        return left + right;
      }

      @Specialization
      PComplex doComplex(PComplex left, PComplex right) {
        return left.add(right);
      }

      @Specialization
      String doString(String left, String right) {
        return left + right;
      }

      //...
    }
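The node-rewriting idea behind these specializations can be sketched in plain Python. This is only an illustration under stated assumptions, not ZipPy or Truffle code: the class below, its `state` and `history` fields, and the `fits_int32` helper are hypothetical names, and a 32-bit range check stands in for the overflow exception that triggers `rewriteOn`.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fits_int32(value):
    """True for a plain int (not bool) that fits a 32-bit machine integer."""
    return type(value) is int and INT32_MIN <= value <= INT32_MAX


class AddNode:
    """Hypothetical self-specializing addition node (illustration only)."""

    def __init__(self):
        self.state = "uninitialized"
        self.history = [self.state]  # every state this node has been in

    def _rewrite(self, new_state):
        self.state = new_state
        self.history.append(new_state)

    def execute(self, left, right):
        if self.state == "int" and fits_int32(left) and fits_int32(right):
            result = left + right
            if not fits_int32(result):
                # Stands in for rewriteOn = ArithmeticException.class
                self._rewrite("bigint")
            return result
        if self.state == "bigint" and type(left) is int and type(right) is int:
            return left + right
        if self.state == "double" and type(left) is float and type(right) is float:
            return left + right
        if self.state == "generic":
            return left + right
        # First execution, or a type guard failed: pick a new state.
        self._rewrite(self._choose(left, right))
        return left + right

    def _choose(self, left, right):
        both_int = type(left) is int and type(right) is int
        if self.state == "uninitialized":
            if fits_int32(left) and fits_int32(right):
                return "int"
            if type(left) is float and type(right) is float:
                return "double"
        if self.state in ("uninitialized", "int") and both_int:
            return "bigint"
        return "generic"


node = AddNode()
node.execute(1, 2)          # uninitialized -> int
node.execute(INT32_MAX, 1)  # overflow: int -> bigint
node.execute(1.5, 2.5)      # guard fails: bigint -> generic
print(node.history)
```

The sketch only ever moves a node toward more general states, which mirrors the intent of the slide: stay on the cheapest specialization until the observed operands prove it too narrow.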

Slide 14

Slide 14 text

AddNode derivatives
[Figure: node classes derived from AddNode]
• AddUninitializedNode
• AddBooleanNode
• AddIntegerNode
• AddBigIntegerNode
• AddDoubleNode
• AddPComplexNode
• AddStringNode
• AddGenericNode

Slide 15

Slide 15 text

for-range loop in Python

    def sum(n):
        ttl = 0
        for i in range(n):
            ttl += i
        return ttl

ForNode specialization for range iterator

    class ForNode extends LoopNode {

      @Specialization
      public Object doPRange(VirtualFrame frame,
                             PRangeIterator range) {
        int start = range.getStart();
        int stop = range.getStop();
        int step = range.getStep();

        for (int i = start; i < stop; i += step) {
          ((WriteNode) target).executeWrite(frame, i);
          body.executeVoid(frame);
        }

        return PNone.NONE;
      }
    }

[Figure: software stack: Python program, ZipPy, Truffle, JVM]

Slide 16

Slide 16 text

for-range loop in Python

    def sum(n):
        ttl = 0
        for i in range(n):
            ttl += i
        return ttl

Optimized for-range loop

    public int sum(int n) {
      int ttl = 0;
      for (int i = 0; i < n; i++) {
        ttl += i;
      }
      return ttl;
    }

[Figure: software stack: Python program, ZipPy, Truffle, JVM]

Slide 17

Slide 17 text

for-range loop in Python

    def sum(n):
        ttl = 0
        for i in range(n):
            ttl += i
        return ttl

JIT compiled for-range loop

        jmp  L7
    L6: mov  ecx, edx
        add  ecx, ebp
        jo   L8
        mov  edx, ebp
        incl edx
        mov  esi, ebp
        mov  ebp, edx
        mov  edx, ecx
    L7: cmp  eax, ebp
        jle  L9
        jmp  L6
    L8: call deoptimize()
    L9:

[Figure: software stack: Python program, ZipPy, Truffle, JVM]

Slide 18

Slide 18 text

Speedups
[Figure: bar chart of speedups, axis from 0 to 100. Benchmarks: binarytrees, fannkuchredux, fasta, mandelbrot, meteor, nbody, pidigits, spectralnorm, float, richards, chaos, deltablue, go, mean. Series: CPython3, CPython, Jython, PyPy, PyPy3, ZipPy.]

Slide 19

Slide 19 text

    benchmark       CPython3   CPython   Jython     PyPy    PyPy3    ZipPy
    binarytrees         1.00      0.94     1.99     2.60     2.70     7.31
    fannkuchredux       1.00      0.97     0.51    44.53    47.29    87.50
    fasta               1.00      1.04     1.55    11.73    11.24    15.57
    mandelbrot          1.00      1.08     0.34    10.91    10.82    11.69
    meteor              1.00      1.02     0.77     2.64     2.62     2.13
    nbody               1.00      0.97     0.73    12.13    12.06     6.17
    pidigits            1.00      1.00     0.62     0.98     0.95     0.60
    spectralnorm        1.00      1.33     1.89   127.33   127.25   128.10
    float               1.00      0.95     1.05     8.64     8.67    17.71
    richards            1.00      0.94     1.21    29.53    29.25    50.13
    chaos               1.00      1.17     1.55    40.88    25.69    68.28
    deltablue           1.00      0.85     1.33    30.08    29.14    23.46
    go                  1.00      1.08     1.99     6.79     6.66    15.41
    mean                1.00      1.02     1.05    12.15    11.68    15.34

ZipPy is competitive with PyPy and is a fast Python3 on the JVM

Slide 20

Slide 20 text

Python iterators
• Iterators are ubiquitous
• Implement the iterator protocol
• Built-in iterators
• User-defined iterators
• Generators are user-defined iterators that use a special control-flow construct (yield)
• Generators exist in other languages too, such as C# and PHP
[Figure: a client calls __next__() on an iterator, which returns the next value and advances]
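As a small illustration of the protocol described above, the sketch below writes the same sequence twice: once as a user-defined iterator class and once as a generator. The names `Countdown` and `countdown` are made up for this example.

```python
class Countdown:
    """User-defined iterator: produces n, n-1, ..., 1."""

    def __init__(self, n):
        self.current = n

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals exhaustion to the client
        value = self.current
        self.current -= 1        # advance
        return value             # next value


def countdown(n):
    """The same sequence written as a generator using yield."""
    while n > 0:
        yield n
        n -= 1


print(list(Countdown(3)))
print(list(countdown(3)))
```

Both forms are driven by the client in the same way, through repeated calls to `__next__`, which is why a `for` loop can consume either one.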

Slide 21

Slide 21 text

Python generators

Generator function:

    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a+b
            yield a
    # 1, 1, 2, 3, 5, 8..

Consumer loop:

    l = []
    for i in fib(10):
        if i % 2 == 0:
            l.append(i)
    # [2, 8, ..]

Slide 22

Slide 22 text

Python generators
• We surveyed the use of generators in Python programs
• 90% of the top 50 Python projects on PyPI and GitHub use generators
• Given their popularity, the performance of generators is critical to Python programs
Projects shown: Django, LXML, Jinja2, Flask, pip, Fabric, Pandas, Requests, Reddit

Slide 23

Slide 23 text

Python generators are slow…

Slide 24

Slide 24 text

Generator Execution
1. The implicit call to __next__ and resume execution

Consumer loop:

    l = []
    for i in fib(10):        # (1)
        if i % 2 == 0:
            l.append(i)

Generator body:

    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a+b
            yield a

Slide 25

Slide 25 text

Generator Execution
1. The implicit call to __next__ and resume execution
2. Evaluate the next value in generator body

Consumer loop:

    l = []
    for i in fib(10):        # (1)
        if i % 2 == 0:
            l.append(i)

Generator body:

    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a+b    # (2)
            yield a

Slide 26

Slide 26 text

Generator Execution
1. The implicit call to __next__ and resume execution
2. Evaluate the next value in generator body
3. Suspend execution and return to the caller

Consumer loop:

    l = []
    for i in fib(10):        # (1)
        if i % 2 == 0:
            l.append(i)

Generator body:

    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a+b    # (2)
            yield a          # (3)

Slide 27

Slide 27 text

Generator Execution
1. The implicit call to __next__ and resume execution
2. Evaluate the next value in generator body
3. Suspend execution and return to the caller
4. Consume the generated value

Consumer loop:

    l = []
    for i in fib(10):        # (1)
        if i % 2 == 0:       # (4)
            l.append(i)

Generator body:

    def fib(n):
        a, b = 0, 1
        for i in range(n):
            a, b = b, a+b    # (2)
            yield a          # (3)
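The four steps can be made visible by writing the consumer loop without `for`, so that the implicit `__next__` call appears explicitly. This is a hand-desugared sketch for illustration; `consume` is a hypothetical helper, not part of ZipPy.

```python
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b   # step 2: evaluate the next value
        yield a           # step 3: suspend and return to the caller


def consume(gen):
    """Desugared form of: for i in gen: if i % 2 == 0: result.append(i)."""
    result = []
    while True:
        try:
            i = next(gen)      # step 1: call __next__ and resume the generator
        except StopIteration:  # the generator body has finished
            break
        if i % 2 == 0:         # step 4: consume the generated value
            result.append(i)
    return result


print(consume(fib(10)))
```

Every iteration pays for one call plus one resume and one suspend, even though only steps 2 and 4 compute anything the program asked for.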

Slide 28

Slide 28 text

Generator Overheads
• Only steps 2 and 4 do the real work
• A Python call is expensive
• Resume and suspend add additional costs and prevent frame optimizations

Consumer loop:

    l = []
    for i in fib(10):        # (1)
        if i % 2 == 0:       # (4)
            l.append(i)

Generator body:

    def fib(n):
        # resume to last yield
        a, b = 0, 1
        for i in range(n):
            a, b = b, a+b    # (2)
            yield a          # (3)
            # suspend execution

Slide 29

Slide 29 text

Naive Inlining
• Desugar the consumer loop and inline __next__ directly
• The suspend and resume handling still persists

Pseudocode after inlining:

    l = []
    g = fib(10)
    while True:
        # generator body
        resume to last yield
        a, b = 0, 1
        for i in range(n):
            a, b = b, a+b
            yield a
            suspend execution
        # consumer loop
        i = a
        if i % 2 == 0:
            l.append(i)
    except StopIter:

Generator frame: 0: n, 1: a, 2: b, 3: i
Caller frame: 0: l, 1: i

Slide 30

Slide 30 text

Generator Peeling
• Specialize the loop over generator at runtime
• Merge yield with consumer loop body

Slide 31

Slide 31 text

Generator Peeling
• Specialize the loop over generator at runtime
• Remove suspend and resume handling

    l = []
    n = 10
    a, b = 0, 1
    for i in range(n):
        # generator body
        a, b = b, a+b
        i = a
        # loop body
        if i % 2 == 0:
            l.append(i)
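The effect of peeling can be checked at the source level by comparing the original generator-based loop with a hand-peeled version. ZipPy performs this transformation on the AST at runtime; the two functions below are only a manual illustration, and their names are made up for this example.

```python
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
        yield a


def evens_with_generator(n):
    """Original form: the consumer loop drives the generator."""
    result = []
    for i in fib(n):
        if i % 2 == 0:
            result.append(i)
    return result


def evens_peeled(n):
    """Hand-peeled form: generator body and loop body merged."""
    result = []
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b   # generator body
        i = a             # the yield becomes a plain assignment
        if i % 2 == 0:    # consumer loop body
            result.append(i)
    return result


print(evens_with_generator(10) == evens_peeled(10))
```

In the peeled form there is no call, no suspend and no resume, so all variables live in one frame that a compiler can optimize.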

Slide 32

Slide 32 text

Generator Peeling
• Frames can be optimized during compilation

    l = []
    n = 10
    a, b = 0, 1
    for i in range(n):
        # generator body
        a, b = b, a+b
        i = a
        # loop body
        if i % 2 == 0:
            l.append(i)

Generator frame: 0: n, 1: a, 2: b, 3: i
Caller frame: 0: l, 1: i

Slide 33

Slide 33 text

Before
[Figure: AST before peeling, containing ForNode and YieldNode]

Slide 34

Slide 34 text

After
[Figure: AST after peeling, containing PeeledLoopNode and FrameTransferNode]

Slide 35

Slide 35 text

The End Result
• Caller frame and generator frame can be optimized
• Peeling inlines the call to __next__
• No suspend and resume handling
• AST level transformation, independent from compilation

Slide 36

Slide 36 text

Speedups of Generator Peeling
Measuring peak performance of ZipPy with and without Generator Peeling

    benchmark       speedup
    nqueens            4.53
    euler11           13.19
    euler31            2.82
    eratos             1.14
    lyndon            22.69
    partitions         4.32
    pymaging           2.76
    python-graph       1.79
    simplejson         3.67
    sympy              1.31
    whoosh             2.79
    geomean            3.58

Slide 37

Slide 37 text

The Performance of ZipPy
Measuring peak performance of ZipPy with Generator Peeling
[Figure: bar chart on a logarithmic axis up to 1000. Benchmarks: nqueens, euler11, euler31, eratos, lyndon, partitions, pymaging, python-graph, simplejson, sympy, whoosh, geomean. Series: CPython 3.4, Jython 2.7, PyPy 2.3, ZipPy Baseline, ZipPy + Peeling.]

Slide 38

Slide 38 text

Generator peeling conclusions
• We present a dynamic program transformation that optimizes generators for optimizing AST interpreters
• Not restricted to ZipPy or Python
• As a result, programmers are free to enjoy generators’ upsides

Slide 39

Slide 39 text

Object model for dynamic languages
Multiple data representations for built-in and custom types
1. Built-in numeric types: Java primitive (with boxing)
2. Built-in immutable types: Java object
3. Custom mutable types: PythonObject

Slide 40

Slide 40 text

Modeling mutable object in Python

HashMap based approach
• PythonObject holds a hashmap (fields: table, size, mod, loadFactor)
• Table entries: 0: 'egg' : 42, 1: 'ham' : 2, 2: 'spam' : 0

Hidden class approach
• PythonObject (object storage): layout, 0: 42, 1: 2, spill array
• ObjectLayout (object layout): 'ham' : loc 0, 'egg' : loc 1

Slide 41

Slide 41 text

Implementation of object storage class

    class FixedPythonObjectStorage extends PythonObject {

        static final int INT_LOCATIONS_COUNT = 5;
        protected int primitiveInt0;
        protected int primitiveInt1;
        protected int primitiveInt2;
        protected int primitiveInt3;
        protected int primitiveInt4;

        static final int DOUBLE_LOCATIONS_COUNT = 5;
        protected double primitiveDouble0;
        protected double primitiveDouble1;
        protected double primitiveDouble2;
        protected double primitiveDouble3;
        protected double primitiveDouble4;

        static final int OBJECT_LOCATIONS_COUNT = 5;
        protected Object fieldObject0;
        protected Object fieldObject1;
        protected Object fieldObject2;
        protected Object fieldObject3;
        protected Object fieldObject4;

        protected Object[] objectsArray = null;

        public FixedPythonObjectStorage(PythonClass pythonClass) {
            super(pythonClass);
        }
    }

Slide 42

Slide 42 text

Implementation of object storage class
[Figure: four states of a PythonObject and its ObjectLayout]
1. PythonObject: 0: 42, 1: (empty), spill array (empty). ObjectLayout: 'ham' : loc 0
2. After add: PythonObject: 0: 42, 1: 2, spill array (empty). ObjectLayout: 'ham' : loc 0, 'egg' : loc 1
3. After add: PythonObject: 0: 42, 1: 2, spill array 0: 404. ObjectLayout: 'ham' : loc 0, 'egg' : loc 1, 'spam' : arr 0
4. After delete: PythonObject: 0: 42, 1: 404, spill array (empty). ObjectLayout: 'ham' : loc 0, 'spam' : loc 1
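The four states in this figure can be replayed with a small Python model. It is a sketch under simplifying assumptions, not ZipPy's implementation: it uses two in-object slots, creates a fresh layout on every change instead of sharing cached layouts, and all class and method names here are hypothetical.

```python
class ObjectLayout:
    """Maps attribute names to ('loc', i) in-object slots or ('arr', i) spill slots."""

    def __init__(self, names=(), inline_slots=2):
        self.names = tuple(names)
        self.inline_slots = inline_slots
        self.locations = {}
        for position, name in enumerate(self.names):
            if position < inline_slots:
                self.locations[name] = ("loc", position)
            else:
                self.locations[name] = ("arr", position - inline_slots)

    def with_attribute(self, name):
        return ObjectLayout(self.names + (name,), self.inline_slots)

    def without_attribute(self, name):
        kept = [n for n in self.names if n != name]
        return ObjectLayout(kept, self.inline_slots)


class PythonObject:
    """Object storage: fixed slots plus a spill array, described by a layout."""

    def __init__(self, inline_slots=2):
        self.layout = ObjectLayout(inline_slots=inline_slots)
        self.slots = [None] * inline_slots
        self.spill = []

    def _write(self, name, value):
        kind, index = self.layout.locations[name]
        if kind == "loc":
            self.slots[index] = value
        else:
            while len(self.spill) <= index:
                self.spill.append(None)
            self.spill[index] = value

    def get(self, name):
        kind, index = self.layout.locations[name]
        return self.slots[index] if kind == "loc" else self.spill[index]

    def set(self, name, value):
        if name not in self.layout.locations:
            self.layout = self.layout.with_attribute(name)  # layout change
        self._write(name, value)

    def delete(self, name):
        kept = {n: self.get(n) for n in self.layout.names if n != name}
        self.layout = self.layout.without_attribute(name)
        self.slots = [None] * self.layout.inline_slots
        self.spill = []
        for kept_name, value in kept.items():  # compact into the new layout
            self._write(kept_name, value)


obj = PythonObject()
obj.set("ham", 42)    # state 1
obj.set("egg", 2)     # state 2
obj.set("spam", 404)  # state 3: no free slot, so 'spam' spills
obj.delete("egg")     # state 4: 'spam' moves into the freed slot
print(obj.layout.locations, obj.slots, obj.spill)
```

The point of the model is that attribute names live only in the layout, while each object carries just the values.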

Slide 43

Slide 43 text

Inline caching for object accesses

Dispatch chain:
[Figure: GetAttributeNode (a PNode) with a primary PNode and an attribute dispatch chain: LinkedDispatchNode (LayoutCheckNode check, AttributeReadNode read, next) -> LinkedDispatchNode (LayoutCheckNode check, AttributeReadNode read, next) -> UninitDispatchNode]

JIT compiled dispatch node:

    cmp $0xe830f77b,r11d          ; ObjectLayout
    jne 0x00000001102e8ee9        ; next dispatch
    mov rdi,0x640(%rsp)
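A dispatch chain of this kind can be modeled in a few lines of Python. The sketch below is an illustration only: `Layout`, `Obj`, and the `misses` counter are hypothetical, and the list of cached entries stands in for the linked dispatch nodes named in the figure.

```python
class Layout:
    """Hypothetical shared layout: attribute name -> slot index."""

    def __init__(self, *names):
        self.index = {name: i for i, name in enumerate(names)}


class Obj:
    def __init__(self, layout, *values):
        self.layout = layout
        self.slots = list(values)


class GetAttributeNode:
    """Attribute read with an inline cache keyed on the object's layout."""

    def __init__(self, name):
        self.name = name
        self.chain = []   # cached (layout, slot) pairs, one per linked dispatch
        self.misses = 0   # how often the uninitialized dispatch ran

    def execute(self, obj):
        for layout, slot in self.chain:
            if obj.layout is layout:       # layout check
                return obj.slots[slot]     # attribute read at a known slot
        # Uninitialized dispatch: full lookup, then extend the chain.
        self.misses += 1
        slot = obj.layout.index[self.name]
        self.chain.append((obj.layout, slot))
        return obj.slots[slot]


point_layout = Layout("x", "y")
p1 = Obj(point_layout, 1.2, 0.3)
p2 = Obj(point_layout, 5.0, 6.0)

read_y = GetAttributeNode("y")
print(read_y.execute(p1), read_y.execute(p2), read_y.misses)
```

Objects that share a layout hit the same cache entry, so the name lookup happens once per layout rather than once per access; in compiled code the check reduces to a pointer comparison like the `cmp`/`jne` pair above.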

Slide 44

Slide 44 text

Flexible storage class generation

Python class Point:

    class Point:
      def __init__(self, x, y):
        self.x = x
        self.y = y

    p = Point(1.2, 0.3)
    # p.x == 1.2; p.y == 0.3

Generated storage class for Point:

    class Point extends FlexiblePythonObjectStorage {

        protected double x;
        protected double y;

        protected Object[] objectsArray = null;

        public Point(PythonClass pythonClass) {
            super(pythonClass);
        }
    }

Slide 45

Slide 45 text

Python object layout change

Python class Point:

    class Point:
      def __init__(self, x, y):
        self.x = x
        self.y = y

      def addNeighbor(self, n):
        self.neighbors = n

Client code:

    n = []

    for i in range(5):
      p = Point(i*1.0, i*0.5)
      p.addNeighbor(n)
      n.append(p)

Slide 46

Slide 46 text

Continuous storage class generation
[Figure: storage classes are generated from left to right; each column shows the object layout changes as attributes 'a', 'b', 'c' are added]
1. Fixed layout (slots 0:, 1:, spill array): 'a' : loc 0; then 'b' : loc 1; then 'c' : arr 0
2. Flexible 0 layout (field a:, spill array): 'a' : loc a; then 'b' : arr 0; then 'c' : arr 1
3. Flexible 1 layout (fields a:, b:, spill array): 'a' : loc a, 'b' : loc b; then 'c' : arr 0
4. Flexible 2 layout (fields a:, b:, c:, spill array): 'a' : loc 0, 'b' : loc 1, 'c' : arr 0

Slide 47

Slide 47 text

Performance of different object storage configurations
[Figure: bar chart, axis from 0.60 to 1.20. Benchmarks: float, richards, chaos, deltablue, go, mean. Series: fixed 5, flexible, flexible w/ continuous generation.]

Slide 48

Slide 48 text

Memory usage of fixed object storages normalized to flexible object storage
[Figure: bar chart, axis from 1.00 to 4.00. Benchmarks: float, richards, chaos, deltablue, go, mean. Series: fixed 1, fixed 3, fixed 5.]

Slide 49

Slide 49 text

Slowdown of fixed object storages normalized to flexible object storage
[Figure: bar chart of slowdown, axis from 0.70 to 1.30. Benchmarks: float, richards, chaos, deltablue, go, mean. Series: fixed 1, fixed 3, fixed 5.]

Slide 50

Slide 50 text

Flexible object storage conclusions
• There is always a trade-off when using fixed object storage
• Fixed object storage leads to up to a 20% performance loss or up to 3.6x more memory usage
• Flexible object storage always optimizes for the current state of the target Python class
• The coexistence of multiple storage classes can introduce overhead

Slide 51

Slide 51 text

Our contributions
• Generator peeling: a runtime optimization targeting hosted interpreters
• It is not restricted to Python or the implementation of ZipPy/Truffle
• Flexible object storage: a space-efficient object model technique for class-based dynamic languages
• Can be reused by other languages hosted on the JVM

Slide 52

Slide 52 text

Publications
• Wei Zhang, Per Larsen, Stefan Brunthaler, Michael Franz. Accelerating Iterators in Optimizing AST Interpreters. In Proceedings of the 29th ACM SIGPLAN Conference on Object Oriented Programming: Systems, Languages, and Applications (OOPSLA '14), Portland, OR, USA, October 20-24, 2014.
• Gülfem Savrun-Yeniçeri, Wei Zhang, Huahan Zhang, Eric Seckler, Chen Li, Stefan Brunthaler, Per Larsen, Michael Franz. Efficient Hosted Interpreters on the JVM. In ACM Transactions on Architecture and Code Optimization, volume 11(1), pages 9:1–9:24, 2014.
• Gülfem Savrun-Yeniçeri, Wei Zhang, Huahan Zhang, Chen Li, Stefan Brunthaler, Per Larsen, Michael Franz. Efficient Interpreter Optimizations for the JVM. In Proceedings of the 10th International Conference on Principles and Practice of Programming in Java (PPPJ '13), Stuttgart, Germany, September 11-13, 2013.

Slide 53

Slide 53 text

Questions, please?