Modular Virtual Machine Architecture on a Meta-Circular JVM

Slide 1

Slide 1 text

Modular Virtual Machine Architecture on a Meta-Circular JVM
Wei Zhang

Slide 2

Slide 2 text

Wei Zhang
Department: EECS
Program: Computer Systems & Software
Committee:
• Professor Michael Franz, Chair
• Professor Pai Chou
• Professor Rainer Doemer
• Professor Kwei-Jay Lin
• Professor Harry Xu

Slide 3

Slide 3 text

Motivation
Virtual machines on mobile devices: Android Dalvik, V8, ActionScript VM

Slide 4

Slide 4 text

Motivation
Duplicated modules between VMs
[Figure: Dalvik, V8, and ActionScript VM each carry their own copies of modules such as the parser, interpreter, GC, and JIT]

Slide 5

Slide 5 text

Motivation
Drawbacks:
- Large footprint for multiple language implementations
- Expensive investment for new VMs

Slide 6

Slide 6 text

Motivation
Objectives:
+ Smaller footprint for multiple language implementations
+ Simpler implementation of new VMs
[Figure: JavaScript, Python, and Ruby running on a shared host VM]

Slide 7

Slide 7 text

Outline
• Interpreter
• Targeting a Host VM
• Meta-Circular VM
• Early Results

Slide 8

Slide 8 text

Interpreter
+ Easy to implement and maintain
+ Fast edit-compile-run cycle
+ More portable
+ More memory efficient

Slide 9

Slide 9 text

Interpreter
- But it's slow...

Slide 10

Slide 10 text

Interpreter
Performance slowdown [1]:

    Execution technique        Slowdown
    Inefficient interpreter    1000X
    Efficient interpreter      10X
    Optimizing compiler        1X

[1]: M. Anton Ertl and D. Gregg, Journal of Instruction-Level Parallelism, 2003

Slide 11

Slide 11 text

Cost of interpretation [1]
• Instruction dispatch
• Operand access
• Performing the computation
[Figure: overlapping processor pipeline diagram with stages IF, ID, EX, ME, WB]
[1]: D. Gregg et al., The Case for Virtual Register Machines, IVME 2003

Slide 12

Slide 12 text

Instruction Dispatch: Switch-based dispatch

    for (;;)
        switch (program[ip++]) {
        /* ... */
        case add:
            sp[1] = sp[0] + sp[1];
            sp++;
            break;
        /* ... */
        }

[Figure labels: instruction stream, dispatching loop, instruction implementations]
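The same dispatching loop can be sketched in Java, the language of the baseline interpreter discussed later in the talk. This is a minimal illustration, not code from the talk: the opcode names (`PUSH`, `ADD`, `MUL`, `HALT`), the fixed stack size, and the class name are assumptions made for the example.

```java
final class SwitchInterpreter {
    // Hypothetical opcodes for this sketch only.
    static final int PUSH = 0, ADD = 1, MUL = 2, HALT = 3;

    /** Runs the program and returns the value left on top of the stack. */
    static int run(int[] program) {
        int[] stack = new int[64]; // operand stack (fixed size for brevity)
        int sp = 0;                // index of the next free stack slot
        int ip = 0;                // instruction pointer
        for (;;) {
            // One switch per executed instruction: this is the dispatch cost.
            switch (program[ip++]) {
                case PUSH:
                    stack[sp++] = program[ip++]; // operand follows the opcode
                    break;
                case ADD: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a + b;
                    break;
                }
                case MUL: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a * b;
                    break;
                }
                case HALT:
                    return stack[sp - 1];
                default:
                    throw new IllegalStateException("unknown opcode");
            }
        }
    }
}
```

Running the program `PUSH 1, PUSH 2, PUSH 4, MUL, ADD, HALT` evaluates 1 + 2 * 4. Every instruction passes through the same `switch`, so all dispatches share one indirect branch.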

Slide 13

Slide 13 text

Instruction Dispatch: Direct Threading [1]

    /* GNU C labels-as-values: &&label yields the address of a label */
    void *thread[] = {&&add, &&pop /* ... */};
    void **ip = thread;
    goto **ip++;
    add:
        sp[1] = sp[0] + sp[1];
        sp++;
        goto **ip++;
    /* ... */

[Figure labels: threaded code, instruction implementations]
[1]: James R. Bell, Threaded Code, CACM, 1973

Slide 14

Slide 14 text

Instruction Dispatch: Indirect Threading

Direct threading:
    thread: &get &a &get &b &add

Indirect threading:
    thread: &i_get_a &i_get_b &i_add

    Indirection:
        i_get_a: &get &a
        i_get_b: &get &b
        i_add:   &add

    Operation routines:
        get: sp[0]=*(*ip+1); sp++; goto *(*ip++);
        add: sp[0]=sp[1]+sp[0]; goto *(*ip++);

Slide 15

Slide 15 text

Instruction Dispatch: Subroutine Threading

    thread:
        call get_a
        call get_b
        call add

[Figure labels: threaded code, instruction implementations]
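Standard Java has no computed goto, so the threaded-code schemes above cannot be written directly in it. One common substitute is call threading: the instruction stream becomes an array of handler objects and dispatch is a method call on each element. The sketch below assumes that approach and is not taken from the talk; the `State` and `Instruction` types and the `push` helper are hypothetical names.

```java
final class CallThreadedInterpreter {
    /** Mutable machine state shared by all instruction handlers. */
    static final class State {
        final int[] stack = new int[64];
        int sp = 0; // index of the next free stack slot
    }

    /** One threaded-code entry: the instruction implementation itself. */
    interface Instruction {
        void execute(State s);
    }

    /** Builds a push instruction with its operand bound in advance. */
    static Instruction push(int value) {
        return s -> s.stack[s.sp++] = value;
    }

    static final Instruction ADD = s -> {
        int b = s.stack[--s.sp];
        int a = s.stack[--s.sp];
        s.stack[s.sp++] = a + b;
    };

    static final Instruction MUL = s -> {
        int b = s.stack[--s.sp];
        int a = s.stack[--s.sp];
        s.stack[s.sp++] = a * b;
    };

    /** Executes straight-line threaded code and returns the top of stack. */
    static int run(Instruction[] thread) {
        State s = new State();
        for (Instruction insn : thread) {
            insn.execute(s); // dispatch is a call, not a switch
        }
        return s.stack[s.sp - 1];
    }
}
```

Here the thread `push(1), push(2), push(4), MUL, ADD` computes 1 + 2 * 4. Decoding happens once, when the array is built, so the loop body contains no opcode switch; control flow instructions would need an explicit program counter, which this sketch omits.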

Slide 16

Slide 16 text

Operand Access
To perform: a = 1+2*4

    Bytecode      Stack     Stack ops
    1: push 1     1         1 push
    2: push 2     1 2       1 push
    3: push 4     1 2 4     1 push
    4: mul        1 8       2 pop, 1 push
    5: add        9         2 pop, 1 push
    6: set a                1 pop

Total: 10 stack ops
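The count of 10 stack operations can be checked with a small instrumented stack machine. This is a sketch under assumed names (`CountingStackMachine`, the opcodes, and the `result` field standing in for variable `a`); it only counts pushes and pops on the in-memory operand stack.

```java
final class CountingStackMachine {
    // Hypothetical opcodes for this sketch only.
    static final int PUSH = 0, ADD = 1, MUL = 2, SET = 3;

    private final int[] stack = new int[16];
    private int sp = 0;  // index of the next free stack slot
    int stackOps = 0;    // number of pushes and pops performed
    int result;          // stands in for the variable a

    private void push(int v) {
        stack[sp++] = v;
        stackOps++;
    }

    private int pop() {
        stackOps++;
        return stack[--sp];
    }

    void run(int[] code) {
        int ip = 0;
        while (ip < code.length) {
            switch (code[ip++]) {
                case PUSH: push(code[ip++]); break;   // 1 push
                case MUL:  push(pop() * pop()); break; // 2 pops, 1 push
                case ADD:  push(pop() + pop()); break; // 2 pops, 1 push
                case SET:  result = pop(); break;      // 1 pop
                default: throw new IllegalStateException("unknown opcode");
            }
        }
    }
}
```

For the bytecode `push 1, push 2, push 4, mul, add, set a`, the machine stores 9 in `result` and records 3 + 3 + 3 + 1 = 10 stack operations, matching the table.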

Slide 17

Slide 17 text

Operand Access
Stack caching [1]: keep top-of-stack in registers

    Bytecode      Stack     Stack ops
    1: push 1     1         0
    2: push 2     1 2       0
    3: push 4     1 2 4     1 push
    4: mul        1 8       1 pop
    5: add        9         0
    6: set a                0

Total: 2 stack ops

[1]: M. Anton Ertl, Stack Caching for Interpreters, PLDI 1995
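The reduction to 2 stack operations is consistent with caching the top two stack elements in registers and refilling eagerly after a pop. The sketch below models that policy; the two-register cache size and the eager refill are assumptions inferred from the per-instruction counts, and Java fields only stand in for machine registers.

```java
import java.util.ArrayDeque;

final class CachedStackMachine {
    // Hypothetical opcodes for this sketch only.
    static final int PUSH = 0, ADD = 1, MUL = 2, SET = 3;

    private final ArrayDeque<Integer> memory = new ArrayDeque<>(); // in-memory part of the stack
    private int r0, r1;     // r0 caches the top of stack, r1 the element below it
    private int cached = 0; // how many of r0 and r1 currently hold values
    int memoryOps = 0;      // pushes and pops that touch the in-memory stack
    int result;             // stands in for the variable a

    private void push(int v) {
        if (cached == 2) {  // both registers full: spill the older value
            memory.push(r1);
            memoryOps++;
        } else {
            cached++;
        }
        r1 = r0;
        r0 = v;
    }

    private int pop() {
        if (cached == 0) {
            throw new IllegalStateException("stack underflow");
        }
        int v = r0;
        r0 = r1;
        if (!memory.isEmpty()) { // refill r1 eagerly from memory
            r1 = memory.pop();
            memoryOps++;
        } else {
            cached--;
        }
        return v;
    }

    void run(int[] code) {
        int ip = 0;
        while (ip < code.length) {
            switch (code[ip++]) {
                case PUSH: push(code[ip++]); break;
                case MUL:  push(pop() * pop()); break;
                case ADD:  push(pop() + pop()); break;
                case SET:  result = pop(); break;
                default: throw new IllegalStateException("unknown opcode");
            }
        }
    }
}
```

For `a = 1+2*4`, only `push 4` spills a value to memory and only `mul` reads one back, so the machine records 2 memory stack operations instead of 10.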

Slide 18

Slide 18 text

Operand Access
Stack-based vs. register-based architecture

    Stack-based     Register-based
    1: get a        1: add c a b
    2: get b
    3: add
    4: set c

With register-based architecture:
-35% native instruction count [1]
+45% code size [1]

[1]: Yunhe Shi et al., Virtual Machine Showdown: Stack versus Registers, TACO 2008

Slide 19

Slide 19 text

Runtime Overhead of Dynamic Typing
Possibilities of a + b:

    1 + 2     -> 3
    'a' + 'b' -> 'ab'
    1 + 'a'   -> '1a'
    '1' + 2   -> '12'
    ...

Slide 20

Slide 20 text

Runtime Overhead of Dynamic Typing
Implementation of add:

    if (Num && Num) {
        return a + b;
    } else if (String && String) {
        return a.concat(b);
    } else if (Num && String) {
        return a.toString().concat(b);
    } else if (String && Num) {
        return a.concat(b.toString());
    }
    ...

Slide 21

Slide 21 text

Performing the Computation: Quickening [1]

    Instruction stream    Quickened instruction stream
    get a                 get a
    get b                 get b
    add                   nadd

• add is the generic instruction (generic add)
• nadd is the specialized instruction (number add), protected by a guard that falls back to the generic add

[1]: S. Brunthaler, Inline Caching Meets Quickening, ECOOP 2010
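A minimal sketch of the rewrite step, assuming instructions are objects stored in an array that they are allowed to patch. The names (`QuickeningSketch`, `GENERIC_ADD`, `NUMBER_ADD`) are hypothetical, and only integers and strings are handled, so this illustrates the idea rather than full JavaScript semantics or the scheme from the cited paper.

```java
final class QuickeningSketch {
    /** One instruction; it may rewrite its own slot in the instruction stream. */
    interface Instruction {
        Object execute(Instruction[] code, int pc, Object a, Object b);
    }

    static final Instruction GENERIC_ADD = QuickeningSketch::genericAdd;
    static final Instruction NUMBER_ADD = QuickeningSketch::numberAdd;

    /** Generic add: checks operand types, and quickens when both are numbers. */
    static Object genericAdd(Instruction[] code, int pc, Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer) {
            code[pc] = NUMBER_ADD; // rewrite this slot to the specialized form
            return (Integer) a + (Integer) b;
        }
        // Any other combination: concatenate the string forms.
        return String.valueOf(a) + String.valueOf(b);
    }

    /** Specialized number add with a guard that falls back to the generic add. */
    static Object numberAdd(Instruction[] code, int pc, Object a, Object b) {
        if (a instanceof Integer && b instanceof Integer) { // guard
            return (Integer) a + (Integer) b;
        }
        code[pc] = GENERIC_ADD; // guard failed: undo the quickening
        return genericAdd(code, pc, a, b);
    }
}
```

After the first execution with two integers, the slot holds `NUMBER_ADD`, and later executions pay for a single guard instead of the full type dispatch. If a string operand shows up, the guard fails, the slot reverts to `GENERIC_ADD`, and the result is still correct.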

Slide 22

Slide 22 text

Targeting a Host VM
• A VM offers benefits for hosted applications
• Can we utilize those benefits while building our VM?

Slide 23

Slide 23 text

Targeting a Host VM (JVM/CLR)
+ Cross platform
+ Automatic memory management
+ Libraries
+ Better IDE support

Slide 24

Slide 24 text

Targeting JVM: Option #1
Compile to Java bytecode:
- Same complexity as writing a compiler
- Need to emulate language semantics that do not map well to the JVM
[Figure: x.js -> x.class -> JVM]

Slide 25

Slide 25 text

Targeting JVM: Option #2
Interpreter running on the JVM:
- Lack of low-level machine control
- Overhead of double interpretation
[Figure: x.js -> Interpreter -> JVM]

Slide 26

Slide 26 text

Problem
Restricted by the well-defined JVM interface
[Figure: Application on top of Guest VM on top of JVM]

Slide 27

Slide 27 text

Meta-Circular VM
• A meta-circular virtual machine is written in the same language it implements
• Original idea: meta-circular evaluator in LISP [1]
• Smalltalk: the Blue Book reference implementation [2]
[1]: John McCarthy, LISP 1.5 Programmer's Manual, 1961
[2]: A. Goldberg & D. Robson, Smalltalk-80: The Language and Its Implementation, 1983

Slide 28

Slide 28 text

Meta-Circular JVM
Conventional JVM vs. Maxine VM [1]
[Figure: two software stacks, each with Application, JDK, JVM, native library, and OS layers; the conventional JVM is native code, while the Maxine JVM is Java code. Courtesy Bernd Mathiske]
[1]: B. Titzer et al., VEE 2010

Slide 29

Slide 29 text

Meta-Circular JVM
[Figure: a guest VM on Maxine, alongside Maxine's compiler, garbage collector, and runtime system]

Slide 30

Slide 30 text

Meta-Circular JVM
Jump to an address using Maxine internals:

    Word target = ArrayAccess.getWord(threadedCode, index);
    Intrinsics.jump(target.asAddress());
    ...

Slide 31

Slide 31 text

Modular VM
[Figure: guest VMs for Ruby, Python, and JavaScript, each with its own Parser and Execution modules, on a shared host VM that provides the Runtime, JIT, and GC]

Slide 32

Slide 32 text

Current Progress
• MBS JavaScript VM
• Parser generated using ANTLR
• Two interpreters
• No regex yet
• 19 of 26 SunSpider benchmarks run

Slide 33

Slide 33 text

Managed Bytecode Script VM
[Figure: MBS components: JavaScript input, Parser (ANTLR), AST walker, Bytecode, Baseline interpreter, Optimized interpreter, Runtime, Output]

Slide 34

Slide 34 text

Managed Bytecode Script VM
• Baseline Interpreter
    Standard Java
    Switch-based dispatching
• Optimized Interpreter
    Direct call threading: +30%
    Quickening: +8%

Slide 35

Slide 35 text

Comparator: Rhino JavaScript VM
• Open source JavaScript VM written in Java, from the Mozilla Foundation
• Compiles JavaScript source to Java class files
• An interpretation mode is included (AST)
[Figure: MBS running on Maxine VM; Rhino running on Maxine VM]

Slide 36

Slide 36 text

Performance: MBS vs. Rhino on Maxine
[Figure: bar chart, vertical axis from 0% to 160%, one bar per benchmark: 3d_cube.js, 3d_morph.js, 3d_raytrace.js, access_binary_trees.js, access_fannkuch.js, access_nbody.js, access_nsieve.js, bitops_3bit_bits_in_byte.js, bitops_bits_in_byte.js, bitops_bitwise_and.js, bitops_nsieve_bits.js, controlflow_recursive.js, crypto_md5.js, crypto_sha1.js, math_cordic.js, math_partial_sums.js, math_spectral_norm.js, string_base64.js, string_fasta.js, plus arithmetic mean and geometric mean]
Geometric mean: -25%
Arithmetic mean: -18%

Slide 37

Slide 37 text

MBS versus Rhino
• MBS is 10X smaller than Rhino (jar file size): about 400 KB versus about 4 MB
• MBS was written in a very short period
• MBS's performance is comparable with that of Rhino

Slide 38

Slide 38 text

Future Work
• Threaded Code Dispatch
    Direct threading / Subroutine threading
    Jython / JRuby
• Quickening
    Dynamic derivative generation
    Automation

Slide 39

Slide 39 text

Thanks

Slide 40

Slide 40 text

MBS versus Rhino: LOC

    LOC                     MBS     Rhino
    Front-end               20k     7k
    Interpreter             5.6k    3k
    Runtime                 3.4k    ~21k
    Classfile generation    -       ~22k
    Other                   -       ~20k
    Total                   29k     73k
    Total w/o front-end     9k      66k

Slide 41

Slide 41 text

Modular VM: Related Work
• LLVM [1]
    Framework for building optimizing compilers
    Persistent low-level intermediate representation
• Tracing PyPy's interpreter [2]
    Running the Python interpreter on PyPy's tracing JIT
    Trace hot code in the workload
[1]: Chris Lattner and Vikram Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, CGO 2004
[2]: C.F. Bolz et al., Tracing the Meta-Level: PyPy's Tracing JIT Compiler, ICOOOLPS

Slide 42

Slide 42 text

Memory Optimization on Maxine
• Customize object layout for dynamic languages
    Customizable scheme for object layout
    Efficient resizable object layout [1]
[1]: C. Chambers et al., An Efficient Implementation of Self, a Dynamically-Typed Object-Oriented Language Based on Prototypes, OOPSLA 1989