Efficient VM with JIT in Go

Slide 1

Slide 1 text

Efficient VM with JIT in Go quasilyte @ GoWayFest 4.0 (2020)

Slide 2

Slide 2 text

Part 0/7: Backstory
> 0 - Backstory
  1 - go-jdk overview
  2 - Making the code run fast
  3 - GC-friendly slots
  4 - Interop / FFI
  5 - Object layout / mem alloc
  6 - Challenges & limitations
  7 - Closing words

Slide 3

Slide 3 text

Once upon a time: “Can we use Lucene from Go?”

Slide 4

Slide 4 text

Once upon a time: “Can we use Lucene from Go?” Sure...

Slide 6

Slide 6 text

How to use Java from Go?
● JNI (with CGo blessing)
● Pass arguments through serialization
https://github.com/timob/jnigi

Slide 7

Slide 7 text

Go→Java interop with JNI overview
Go → CGo → JNI → JVM (Java)
Slow!

Slide 10

Slide 10 text

Why JNI is not good for Go?
● Locked OS thread for JVM goroutines
● Every JNI call has CGo call overhead
● Expensive Go↔JNI values conversion

Slide 11

Slide 11 text

Two active runtimes in one application
Application: Go Runtime, JVM
OS signals

Slide 15

Slide 15 text

Long story short... We’re now using Lucene from our Go application, but it bothers me how inefficient it is. Can we do better?

Slide 16

Slide 16 text

Part 1/7: go-jdk overview

Slide 17

Slide 17 text

“Let’s try to build an efficient JVM that can be easily embedded into Go applications.”
Me (just now)

Slide 21

Slide 21 text

So, what exactly do we want?
● Cheap Go↔Java calls (and no CGo)
● Optimized machine code (no interpretation)
● Efficient object layout and allocation
DO IT

Slide 22

Slide 22 text

go-jdk interop
Go ↔ go-jdk (direct connection)
Fast!

Slide 23

Slide 23 text

go-jdk project
● Java class file loader
● JIT compiler (non-tracing)
● Runtime and interop primitives
● Utility tools like “javap”
https://github.com/quasilyte/go-jdk

Slide 24

Slide 24 text

go-jdk inputs
Source code → Class file → JVM

Slide 25

Slide 25 text

go-jdk inputs
Source code → Class file → JVM
go-jdk uses class files as its input

Slide 26

Slide 26 text

How class is loaded
1. Decode class file
2. Bytecode→IR
3. Optimize IR
4. IR→Machine code
Result: Loaded class data
Note: For every class method

Slide 27

Slide 27 text

How class is loaded
1. Decode class file
2. Bytecode→IR
3. Optimize IR
4. IR→Machine code
Result: Loaded class data
Note: We eagerly emit the machine code

Slide 28

Slide 28 text

How class is loaded
1. Decode class file
2. Bytecode→IR
3. Optimize IR
4. IR→Machine code
Result: Loaded class data
Note: Used in Go application

Slide 29

Slide 29 text

Part 2/7: Making the code run fast

Slide 30

Slide 30 text

Convert Java class file into… what?
Java class file (Metadata, Bytecode, Consts) → ?

Slide 31

Slide 31 text

Idea 1: direct bytecode to machine code translation
Bytecode → amd64 code

Slide 32

Slide 32 text

Our example class and example static method
class Example {
    public static int add1(int x) {
        return x + 1;
    }
}

Slide 34

Slide 34 text

bytecode→amd64
iload_0:
    MOVQ local_0(CX), AX
    MOVQ AX, (CX)
    ADDQ $8, CX
iconst_1:
    MOVQ $1, (CX)
    ADDQ $8, CX
iadd:
    MOVQ -16(CX), AX
    ADDQ -8(CX), AX
    MOVQ AX, -16(CX)
    SUBQ $8, CX
ireturn:
    RET
Note: Stack bookkeeping

Slide 36

Slide 36 text

Stack vs Register architecture
Suggested reading: VM Showdown: Stack Versus Registers
We can’t change the input bytecode format, but we can add an intermediate representation.

Slide 37

Slide 37 text

Idea 2: add intermediate representation
Bytecode → IR → x86-64 code

Slide 38

Slide 38 text

bytecode→IR
Bytecode:
    iload_0
    iconst_1
    iadd
    ireturn
IR:
    r1 = iadd r0 1
    iret r1
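
One way to get from stack bytecode to register-style IR is to simulate the operand stack symbolically while compiling. The sketch below does that for the four opcodes of add1. The function name lower, the textual IR format and the temporary register numbering are assumptions made for illustration, not go-jdk code.

```go
package main

import (
	"fmt"
	"strings"
)

// lower converts a tiny subset of stack bytecode into register-style IR
// by tracking, at compile time, which register or constant sits in each
// operand stack position. Only iload_N, iconst_N, iadd and ireturn are
// handled; everything about the IR text format is an assumption.
func lower(code []string, numLocals int) []string {
	var stack []string // symbolic operand stack: register names or constants
	var ir []string
	next := numLocals // first register index that is free for temporaries
	for _, op := range code {
		switch {
		case strings.HasPrefix(op, "iload_"):
			// No code emitted: just remember that the local is on the stack.
			stack = append(stack, "r"+strings.TrimPrefix(op, "iload_"))
		case strings.HasPrefix(op, "iconst_"):
			stack = append(stack, strings.TrimPrefix(op, "iconst_"))
		case op == "iadd":
			x, y := stack[len(stack)-2], stack[len(stack)-1]
			dst := fmt.Sprintf("r%d", next)
			next++
			stack = append(stack[:len(stack)-2], dst)
			ir = append(ir, fmt.Sprintf("%s = iadd %s %s", dst, x, y))
		case op == "ireturn":
			ir = append(ir, "iret "+stack[len(stack)-1])
			stack = stack[:len(stack)-1]
		}
	}
	return ir
}

func main() {
	for _, inst := range lower([]string{"iload_0", "iconst_1", "iadd", "ireturn"}, 1) {
		fmt.Println(inst)
	}
}
```

The loads and constants produce no IR of their own, which is where the stack bookkeeping of the direct translation disappears.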

Slide 40

Slide 40 text

IR→amd64
r1 = iadd r0 1:
    MOVQ local_0(CX), AX
    ADDQ $1, AX
    MOVQ AX, local_1(CX)
iret r1:
    MOVQ local_1(CX), AX
    RET
Note: the store to local_1 and the reload from it can be optimized out

Slide 41

Slide 41 text

IR→amd64
ret = iadd r0 1:
    MOVQ local_0(CX), AX
    ADDQ $1, AX
iret ret:
    RET
Note: ret is mapped to AX (or X0)

Slide 42

Slide 42 text

Part 3/7: GC-friendly slots

Slide 43

Slide 43 text

Run time data lives inside the run time stack
Run time stack (function frames): f1()

Slide 44

Slide 44 text

Run time data lives inside the run time stack
f1() calls f2()
Run time stack: f1(), f2()

Slide 45

Slide 45 text

Run time data lives inside the run time stack
f1() calls f2()
f2() calls f3()
Run time stack: f1(), f2(), f3()

Slide 46

Slide 46 text

Run time data lives inside the run time stack
f1() calls f2()
f2() calls f3()
f3() returns
Run time stack: f1(), f2()

Slide 47

Slide 47 text

Run time data lives inside the run time stack
f1() calls f2()
f2() calls f3()
f3() returns
f2() returns
Run time stack: f1()

Slide 48

Slide 48 text

Function frame model (abstract)
Function frame slots: Args | Locals | Temps

Slide 49

Slide 49 text

Function frame model (concrete)
Function frame: r0 (arg) | r1 (arg) | r2 (local) | r3 (tmp)
[]slot{r0,r1,r2,r3}

Slide 50

Slide 50 text

What do we store inside a slot?
slot: int, long, ..., Object

Slide 51

Slide 51 text

What do we store inside a slot?
slot: int, long, ... (scalars); Object (pointers)

Slide 52

Slide 52 text

What do we store inside a slot?
int, long, ... (scalars); Object (pointers)
Seems like everything fits in 64-bit slots

Slide 53

Slide 53 text

Uint64 slots
Function frame: r0 | r1 | r2 | ...
type slot struct {
    value uint64
}

Slide 54

Slide 54 text

Uint64 slots
Function frame: r0 | r1 | r2 | ...
type slot struct {
    value uint64
}
Not safe to store pointers there!

Slide 55

Slide 55 text

Uintptr slots
Function frame: r0 | r1 | r2 | ...
type slot struct {
    value uintptr
}
uintptr does not retain pointers either

Slide 56

Slide 56 text

Pointer slots
Function frame: r0 | r1 | r2 | ...
type slot struct {
    value unsafe.Pointer
}
Not safe to store scalars there!

Slide 57

Slide 57 text

{uint64,pointer} slots
Function frame: r0 | r1 | r2 | ...
type slot struct {
    scalar uint64
    ptr    *Object
}
Paired {scalar, ptr} slots are a safe fix

Slide 58

Slide 58 text

Memory reclaim
Function frame: r0 {scalar, ptr} | r1 {scalar, ptr} | r2 {scalar, ptr} | ...
Set every slot.ptr to nil
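
A minimal sketch of paired slots and the reclaim step, assuming a frame is just a slice of slots. The frame type, the release method and the name field on Object are hypothetical helpers added for this example.

```go
package main

import "fmt"

// Object is a placeholder for a VM heap object (hypothetical fields).
type Object struct{ name string }

// slot pairs a scalar with a typed pointer, so the Go GC always sees
// object references as real pointers and never as plain integers.
type slot struct {
	scalar uint64
	ptr    *Object
}

// frame is a window of slots: args first, then locals, then temporaries.
type frame []slot

// release clears every pointer so the objects they referenced can be
// collected once nothing else keeps them reachable. Scalars are left
// alone because the GC does not look at them.
func (f frame) release() {
	for i := range f {
		f[i].ptr = nil
	}
}

func main() {
	f := make(frame, 4)
	f[0].scalar = 42                // r0 holds an int
	f[1].ptr = &Object{name: "bar"} // r1 holds an object reference
	fmt.Println(f[0].scalar, f[1].ptr.name)

	f.release()
	fmt.Println(f[1].ptr == nil)
}
```

The cost of this layout is that every slot is twice as wide as a bare uint64, which is what the two-frame alternative avoids.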

Slide 59

Slide 59 text

Uint64 + second frame for pointers (alt solution)
Function frame: r0 | r1 | r2 | ...
type slot struct {
    value uint64
}
Pointer frame: p0 | p1 | p2 | ... (keeps a pointer alive)

Slide 60

Slide 60 text

Part 4/7: Interop / FFI

Slide 61

Slide 61 text

Calling Java from Go
● Mark machine code buf as executable
● Call as func or do JMP in asm
Simple, boring.

Slide 65

Slide 65 text

Calling Go from Java
This is more involved (take a breath):
● Obtain Go function address (simple)
● Follow the Go calling convention (normal)
● Deal with fatal error issues (hard)

Slide 66

Slide 66 text

How to get a Go function code address?
func funcAddr(fn interface{}) uintptr {
    type eface struct {
        typ   uintptr
        value *uintptr
    }
    e := (*eface)(unsafe.Pointer(&fn))
    return *e.value
}
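
A quick way to sanity-check funcAddr is to compare it with reflect.Value.Pointer, which for a func value is documented to return the underlying code pointer. The sketch below assumes the current gc toolchain layout of interface and func values, which is an implementation detail and not guaranteed by the language. The helper sameAsReflect is a hypothetical name added for the check.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// funcAddr reads the data word of the interface value (a pointer to the
// func value) and then loads its first word, the code address.
// This depends on runtime internals and may break in a future Go release.
func funcAddr(fn interface{}) uintptr {
	type eface struct {
		typ   uintptr
		value *uintptr
	}
	e := (*eface)(unsafe.Pointer(&fn))
	return *e.value
}

// sameAsReflect reports whether funcAddr agrees with the reflect package.
func sameAsReflect(fn interface{}) bool {
	return funcAddr(fn) == reflect.ValueOf(fn).Pointer()
}

func add1(x int) int { return x + 1 }

func main() {
	fmt.Printf("add1 code address: %#x\n", funcAddr(add1))
	fmt.Println("matches reflect:", sameAsReflect(add1))
}
```

When reflection overhead at bind time is acceptable, reflect.ValueOf(fn).Pointer() is the supported way to get the same value.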

Slide 67

Slide 67 text

Go calling convention (source)

Slide 68

Slide 68 text

Assembling Java→Go call
1. Push arguments to the stack
2. CALL $func_addr (use funcAddr to get that)
3. Move results to local slots
The exact actions depend on the current Go calling convention.

Slide 69

Slide 69 text

Let’s try it! ...

Slide 70

Slide 70 text

Go runtime is not impressed!
● “Unknown caller PC”
● “Unknown return PC”
● “Missing stackmap”

Slide 71

Slide 71 text

Calling Go directly from the JIT’ed code
Go call stack: main() → JIT code → foo()
Bad!

Slide 72

Slide 72 text

“You've run into a really hairy area of asm code. My first suggestion is not try to call from assembler into Go.”
Ian Lance Taylor

Slide 75

Slide 75 text

How to fix these fatals?
Add a Go→Java calls proxy. Java→Go calls via trampoline.
● Provides a stackmap for Java→Go calls
● Provides a known caller/return PC

Slide 76

Slide 76 text

Calling Go via proxy
Go call stack: main() → callJava() → foo()
callJava() → JIT code: entry or gocall return
JIT code → callJava(): JMP gocall

Slide 78

Slide 78 text

Go→Java call proxy (simplified)
// callJava(e *Env, code *byte)
TEXT ·callJava(SB), 0, $96-16
    NO_LOCAL_POINTERS
    JMP code+8(FP)
    RET
gocall:
    CALL CX
    JMP -8(BP)
Note: NO_LOCAL_POINTERS is the stackmap fix

Slide 79

Slide 79 text

NO_LOCAL_POINTERS macro
It’s safe for us, as long as:
● We never rely on the addresses of Go stack values
● Our heap values are reachable elsewhere

Slide 80

Slide 80 text

Go→Java call proxy (simplified)
// callJava(e *Env, code *byte)
TEXT ·callJava(SB), 0, $96-16
    NO_LOCAL_POINTERS
    JMP code+8(FP)
    RET
gocall:
    CALL CX
    JMP -8(BP)
Note: Caller PC fix

Slide 81

Slide 81 text

Go→Java call proxy (fixing return PC)
// callJava(e *Env, code *byte)
TEXT ·callJava(SB), 0, $96-16
    NO_LOCAL_POINTERS
    MOVQ code+8(FP), CX
    JCALL(CX)
    RET
gocall:
    CALL CX
    JMP -8(BP)
Note: Return PC fix

Slide 82

Slide 82 text

Go→Java call proxy (fixing return PC)
// callJava(e *Env, code *byte)
TEXT ·callJava(SB), 0, $96-16
    NO_LOCAL_POINTERS
    MOVQ code+8(FP), CX
    JCALL(CX)
    RET
gocall:
    CALL CX
    JMP -8(BP)
Note: JCALL saves the address of the following RET instruction and jumps to CX (see next slide)

Slide 83

Slide 83 text

JCALL macro
// Encoding `lea rax, [rip+N]` with BYTE
// since Go has no real RIP-relative
// addressing mode.
#define JCALL(fnreg) \
    BYTE $0x48; … 8d0509000000 \ // Lea
    MOVQ AX, (SI) \              // Store RET addr
    ADDQ $16, SI \               // Move to next slot
    JMP fnreg                    // Run JIT code

Slide 86

Slide 86 text

Java native methods
// In Java file:
public class Foo {
    public static native void printInt(int x);
}

// In Go file:
func fooPrintInt(x int32) {
    fmt.Println(x)
}

// Before loading Foo class:
vm.Bind("Foo.printInt", fooPrintInt)
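
Why the binding has to happen before the class is loaded can be shown with a toy registry: since machine code is emitted eagerly at load time, the loader needs the Go function when it meets a native method. Everything below (toyVM, resolveNative, the map layout) is hypothetical and only mirrors the shape of the vm.Bind call above; it is not the go-jdk implementation.

```go
package main

import "fmt"

// toyVM is a hypothetical stand-in for the real VM type.
type toyVM struct {
	natives map[string]interface{}
}

// Bind registers a Go function under a "Class.method" name.
func (vm *toyVM) Bind(name string, fn interface{}) {
	if vm.natives == nil {
		vm.natives = make(map[string]interface{})
	}
	vm.natives[name] = fn
}

// resolveNative is what a class loader could call when it meets a method
// declared native: the Go function must already be bound at that point.
func (vm *toyVM) resolveNative(name string) (interface{}, error) {
	fn, ok := vm.natives[name]
	if !ok {
		return nil, fmt.Errorf("unbound native method %s", name)
	}
	return fn, nil
}

func fooPrintInt(x int32) {
	fmt.Println(x)
}

func main() {
	vm := &toyVM{}
	vm.Bind("Foo.printInt", fooPrintInt)

	fn, err := vm.resolveNative("Foo.printInt")
	if err != nil {
		panic(err)
	}
	fn.(func(int32))(7) // a real JIT would emit a call to the function address instead

	_, err = vm.resolveNative("Foo.missing")
	fmt.Println(err)
}
```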

Slide 87

Slide 87 text

Why do we need fast Java→Go?
If calls to Go are fast, we can:
● Implement runtime funcs as Go funcs
● Re-use Go code easily in our Java code

Slide 88

Slide 88 text

Part 5/7: Object layout and memory allocation

Slide 90

Slide 90 text

Foo class
public class Foo {
    public int x;   // scalar 1
    public int y;   // scalar 2
    public Bar bar; // pointer field
}

type Foo struct {
    X   int32
    Y   int32
    Bar *Bar
}
Perfect, but impossible

Slide 91

Slide 91 text

Foo class values (naive version)
// Object is a slice of interface{} fields.
// Pointer slot gets a slice pointer.
foo = []interface{}{x, y, bar}
slot.ptr = &foo

Slide 92

Slide 92 text

Read x:int field from Foo
getfield foo.x:
1. Deref a slot.ptr: *[]interface{} → []interface{}
2. Get element: []interface{} → interface{}
3. Deref eface.value: interface{} → int
Slow!

Slide 93

Slide 93 text

Proposed object layout
type Object struct {
    Class   *ClassInfo
    Ptrdata **Object
}

type Object64 struct {
    Object
    Data [8]byte
}

Slide 94

Slide 94 text

Proposed object layout (same code)
Object: common object header

Slide 95

Slide 95 text

Proposed object layout (same code)
Ptrdata: all object pointer fields are stored here

Slide 96

Slide 96 text

Proposed object layout (same code)
Object64: object with 8-byte storage for scalar fields, Object<64>

Slide 97

Slide 97 text

Proposed object layout (same code)
Data: X and Y fields can be stored here

Slide 98

Slide 98 text

Conversion between Object<N> and Object
Object<N> ↔ *Object: unsafe cast
Violates “unsafe” package rules (but it’s still OK)

Slide 99

Slide 99 text

Abstract Object layout
info *ClassInfo
ptrdata **Object → *Object[0], *Object[...] (each entry is an *Object)
scalar[0], scalar[...]

Slide 100

Slide 100 text

Abstract Object layout
info *ClassInfo
ptrdata **Object → *Object[0], *Object[...] (reachable for GC)
scalar[0], scalar[...]

Slide 101

Slide 101 text

Abstract Object layout
info *ClassInfo
ptrdata **Object → *Object[0], *Object[...]
scalar[0], scalar[...] (data, no ptr)

Slide 102

Slide 102 text

Foo layout in memory
info *ClassInfo
ptrdata **Object → bar:*Object
x:int, y:int

Slide 103

Slide 103 text

Read x:int field from Foo
getfield foo.x:
1. Deref a slot.ptr at a proper offset: *Object → int

Slide 104

Slide 104 text

Read bar:Bar field from Foo
getfield foo.bar:
1. Get a ptrdata: *Object → **Object
2. Deref ptrdata at a proper offset: **Object → *Object
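
The two field reads can be tried out in plain Go. This is a sketch under assumptions: Ptrdata is a slice here instead of the raw **Object from the slides (to keep the example easy to check), and the helpers int32Field and newFoo as well as the Name field are hypothetical.

```go
package main

import (
	"fmt"
	"unsafe"
)

// ClassInfo is a placeholder for class metadata (hypothetical field).
type ClassInfo struct{ Name string }

// Object is the common header. Ptrdata is simplified to a slice.
type Object struct {
	Class   *ClassInfo
	Ptrdata []*Object
}

// Object64 adds 8 bytes of pointer-free storage for scalar fields.
type Object64 struct {
	Object
	Data [8]byte
}

// int32Field returns the address of an int32 scalar stored at the given
// byte offset past the header. The caller must guarantee that o was
// allocated as an Object64 (or larger) and that offset is in range and
// 4-byte aligned; nothing here checks that.
func int32Field(o *Object, offset uintptr) *int32 {
	return (*int32)(unsafe.Pointer(uintptr(unsafe.Pointer(o)) + unsafe.Sizeof(Object{}) + offset))
}

// newFoo allocates the Foo example: two int scalars and one pointer field.
func newFoo(class *ClassInfo) *Object {
	foo := &Object64{}
	foo.Class = class
	foo.Ptrdata = make([]*Object, 1) // slot 0 is bar
	return &foo.Object               // interior pointer keeps the whole Object64 alive
}

func main() {
	bar := &Object{Class: &ClassInfo{Name: "Bar"}}
	foo := newFoo(&ClassInfo{Name: "Foo"})

	*int32Field(foo, 0) = 10 // putfield foo.x
	*int32Field(foo, 4) = 20 // putfield foo.y
	foo.Ptrdata[0] = bar     // putfield foo.bar

	fmt.Println(*int32Field(foo, 0), *int32Field(foo, 4), foo.Ptrdata[0].Class.Name)
}
```

A scalar read is one load at a fixed offset, and a pointer read is two loads, compared with three dependent steps in the naive []interface{} version.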

Slide 105

Slide 105 text

Can we use []byte allocations? No, Go GC will not track any pointers that are stored inside that memory.

Slide 106

Slide 106 text

So, how to allocate?
● Choose the closest Object<N>
● Allocate Object<N>
● Return as *Object
May want to adjust sizes to the Go memory allocator size classes.

Slide 107

Slide 107 text

How many Object types do we need?
Object [64]byte → Object64
Object [128]byte → Object128
Object [256]byte → Object256
...
For *huge* objects we can use a less efficient fallback
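
The allocation steps can be sketched as a size-class lookup followed by a typed allocation. The concrete sizes below follow the Object64 definition from the layout slides (N counted in bits, so 8, 16 and 32 bytes); that reading, the sizeClass and alloc helpers, the slice-typed Ptrdata and the error used for the fallback case are all assumptions of this example.

```go
package main

import (
	"errors"
	"fmt"
)

type ClassInfo struct{ Name string }

type Object struct {
	Class   *ClassInfo
	Ptrdata []*Object // simplified; the slides use **Object
}

type Object64 struct {
	Object
	Data [8]byte
}

type Object128 struct {
	Object
	Data [16]byte
}

type Object256 struct {
	Object
	Data [32]byte
}

var errTooLarge = errors.New("object needs the fallback allocation path")

// sizeClass returns the scalar capacity in bytes of the smallest variant
// that fits, or 0 when no fixed-size variant is big enough.
func sizeClass(scalarBytes int) int {
	switch {
	case scalarBytes <= 8:
		return 8
	case scalarBytes <= 16:
		return 16
	case scalarBytes <= 32:
		return 32
	}
	return 0
}

// alloc allocates the closest variant as a typed Go value, so the GC
// knows which words are pointers, and returns it as *Object.
func alloc(class *ClassInfo, scalarBytes, numPtrs int) (*Object, error) {
	var o *Object
	switch sizeClass(scalarBytes) {
	case 8:
		o = &new(Object64).Object
	case 16:
		o = &new(Object128).Object
	case 32:
		o = &new(Object256).Object
	default:
		return nil, errTooLarge
	}
	o.Class = class
	o.Ptrdata = make([]*Object, numPtrs)
	return o, nil
}

func main() {
	// Foo: x and y take 2 × 4 scalar bytes, bar takes one pointer slot.
	foo, err := alloc(&ClassInfo{Name: "Foo"}, 8, 1)
	fmt.Println(foo.Class.Name, len(foo.Ptrdata), err)

	_, err = alloc(&ClassInfo{Name: "Huge"}, 1000, 0)
	fmt.Println(err)
}
```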

Slide 108

Slide 108 text

Part 6/7: Challenges and limitations

Slide 111

Slide 111 text

Null pointer check / explicit
var p *int  // p is nil
nilcheck(p) // Inserted by a compiler
println(*p)
Simple, but not very efficient

Slide 112

Slide 112 text

Null pointer check / signals
Hardware exceptions and interrupts + OS signals handling
More: https://stackoverflow.com/a/36955888/4017439
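
Both strategies can be seen side by side in ordinary Go code. The sketch below is only an analogy: loadImplicit works because the Go runtime turns a nil dereference in Go code into a recoverable panic (on typical platforms by handling the fault signal), while faults inside JIT'ed code are exactly the case that is harder. The function names and the error value are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

var errNullPointer = errors.New("NullPointerException")

// loadExplicit pays for a compare and branch on every access.
func loadExplicit(p *int32) (int32, error) {
	if p == nil {
		return 0, errNullPointer
	}
	return *p, nil
}

// loadImplicit has no check on the fast path. A nil dereference surfaces
// as a runtime panic, which is translated into the error here. Note that
// runtime.Error also covers other runtime faults; a real VM would need to
// tell them apart.
func loadImplicit(p *int32) (v int32, err error) {
	defer func() {
		if r := recover(); r != nil {
			if _, ok := r.(runtime.Error); !ok {
				panic(r) // not a runtime fault: re-raise
			}
			err = errNullPointer
		}
	}()
	return *p, nil
}

func main() {
	x := int32(5)
	fmt.Println(loadExplicit(&x))
	fmt.Println(loadExplicit(nil))
	fmt.Println(loadImplicit(&x))
	fmt.Println(loadImplicit(nil))
}
```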

Slide 113

Slide 113 text

Remember this picture? Go Runtime go-jdk OS signals

Slide 115

Slide 115 text

Limitation: bytecode patching
For some reason, it’s quite common in the Java world to modify the bytecode that is being loaded...
Since we convert bytecode into machine code, we have a problem...

Slide 117

Slide 117 text

Challenge: method re-load
If a method changes and we can’t fit its code into the old executable buffer, the method address will change...
This requires re-linking all method callers. If calls were inlined, it’s even harder.

Slide 118

Slide 118 text

Part 7/7: Closing words

Slide 119

Slide 119 text

Testing
import testutil.T;

class Test {
    public static void run(int x) {
        T.println(x + 5);
    }
}
T.println is System.out.println in OpenJDK, fmt.Println in go-jdk

Slide 120

Slide 120 text

N-body benchmark results
OpenJDK           3.9s
go-jdk            4.8s
OpenJDK (no JIT)  ~11s
go-jdk (no JIT)   ~22s

Slide 123

Slide 123 text

Resources
● go-jdk repository
● VM Showdown: Stack Versus Registers
● Calling Go funcs from asm (ru)
● Go calling convention
● JNI bindings for Go
