Efficient VM with JIT in Go
quasilyte @ GoWayFest 4.0 (2020)
Slide 2
Slide 2 text
Part 0/7
Backstory
> 0 - Backstory
1 - go-jdk overview
2 - Making the code run fast
3 - GC-friendly slots
4 - Interop / FFI
5 - Object layout / mem alloc
6 - Challenges & limitations
7 - Closing words
Slide 4
Slide 4 text
Once upon a time:
“Can we use Lucene from Go?”
Sure...
Slide 6
Slide 6 text
How to use Java from Go?
● JNI (with CGo blessing)
● Pass arguments through serialization
https://github.com/timob/jnigi
Slide 7
Slide 7 text
Go→Java interop with JNI overview
[Diagram: Go reaches Java through CGo and JNI into the JVM. Slow!]
Slide 10
Slide 10 text
Why is JNI not good for Go?
● Locked OS thread for JVM goroutines
● Every JNI call has CGo call overhead
● Expensive Go↔JNI values conversion
Slide 11
Slide 11 text
Two active runtimes in one application
[Diagram labels: Application, Go Runtime, JVM, OS signals]
Slide 15
Slide 15 text
Long story short...
We’re now using Lucene from our
Go application, but
it bothers me how inefficient it is.
Can we do better?
Slide 16
Slide 16 text
Part 1/7
go-jdk overview
Slide 17
Slide 17 text
Let’s try to build an efficient JVM
that can be easily embedded into
Go applications.
Me (just now)
Quote
Slide 21
Slide 21 text
So, what exactly do we want?
● Cheap Go↔Java calls (and no CGo)
● Optimized machine code (no interpretation)
● Efficient object layout and allocation
DO IT
Slide 22
Slide 22 text
go-jdk interop
[Diagram: Go and go-jdk, direct connection. Fast!]
Slide 23
Slide 23 text
go-jdk project
● Java class file loader
● JIT compiler (non-tracing)
● Runtime and interop primitives
● Utility tools like “javap”
https://github.com/quasilyte/go-jdk
Slide 25
Slide 25 text
go-jdk inputs
Class file
JVM
go-jdk uses class
files as its input
Source code
Slide 26
Slide 26 text
How class is loaded
Decode class file
Bytecode→IR
Optimize IR
IR→Machine code
Loaded
class data
For every
class method
Slide 27
Slide 27 text
How class is loaded
Decode class file
Bytecode→IR
Optimize IR
IR→Machine code
Loaded
class data
We eagerly emit the
machine code
Slide 28
Slide 28 text
How class is loaded
Decode class file
Bytecode→IR
Optimize IR
IR→Machine code
Loaded
class data
Used in Go
application
Slide 29
Slide 29 text
Part 2/7
Making the code run fast
Slide 30
Slide 30 text
Convert Java class file into… what?
Java class
file
Metadata
Bytecode
Consts
?
Slide 31
Slide 31 text
Idea 1: direct bytecode to machine code translation
Bytecode → amd64 code
Slide 32
Slide 32 text
Our example class and example static method
class Example {
public static int add1(int x) {
return x + 1;
}
}
bytecode→amd64
iload_0
iconst_1
iadd
ireturn
MOVQ local_0(CX), AX
MOVQ AX, (CX)
ADDQ $8, CX
MOVQ $1, (CX)
ADDQ $8, CX
MOVQ -16(CX), AX
ADDQ -8(CX), AX
MOVQ AX, -16(CX)
SUBQ $8, CX
RET
Hard to
analyze and
optimize
Slide 36
Slide 36 text
Stack vs Register architecture
Suggested reading:
VM Showdown: Stack Versus Registers
We can’t change the input bytecode format,
but we can add intermediate representation.
Slide 37
Slide 37 text
Idea 2: add intermediate representation
Bytecode → IR → x86-64 code
IR→amd64
r1 = iadd r0 1
iret r1
MOVQ local_0(CX), AX
ADDQ $1, AX
MOVQ AX, local_1(CX)
MOVQ local_1(CX), AX
RET
The store to local_1 and the reload from it can be optimized out
Slide 41
Slide 41 text
IR→amd64
ret = iadd r0 1
iret ret
MOVQ local_0(CX), AX
ADDQ $1, AX
RET
Mapped to
AX (or X0)
Slide 42
Slide 42 text
Part 3/7
GC-friendly slots
Slide 43
Slide 43 text
Run time data lives inside the run time stack
f1()
Run time stack
Function frames
Slide 44
Slide 44 text
Run time data lives inside the run time stack
f1()
Run time stack
f1() calls f2()
f2()
Slide 45
Slide 45 text
Run time data lives inside the run time stack
f1()
f2()
f3()
Run time stack
f1() calls f2()
f2() calls f3()
Slide 46
Slide 46 text
Run time data lives inside the run time stack
f1()
Run time stack
f1() calls f2()
f2() calls f3()
f3() returns
f2()
Slide 47
Slide 47 text
Run time data lives inside the run time stack
f1()
Run time stack
f1() calls f2()
f2() calls f3()
f3() returns
f2() returns
Slide 48
Slide 48 text
Function frame model (abstract)
Args
Locals
Temps
Function frame
Slots
Slide 49
Slide 49 text
Function frame model (concrete)
r0 (arg)
r2 (local)
Function frame
r1 (arg)
r3 (tmp)
[]slot{r0,r1,r2,r3}
Slide 51
Slide 51 text
What do we store inside a slot?
slot int
long
...
Object
Scalars
Pointers
Slide 52
Slide 52 text
What do we store inside a slot?
int
long
...
Object
Scalars
Pointers
Seems like
everything fits in
64-bit slots
Slide 54
Slide 54 text
Uint64 slots
Not safe to store pointers there!
Function frame
type slot struct {
value uint64
}
r0 r1 r2 ...
Slide 55
Slide 55 text
Uintptr slots
uintptr does not retain pointers either
Function frame
type slot struct {
value uintptr
}
r0 r1 r2 ...
Slide 56
Slide 56 text
Pointer slots
type slot struct {
value unsafe.Pointer
}
Not safe to store scalars there!
Function frame
r0 r1 r2 ...
Slide 57
Slide 57 text
{uint64,pointer} slots
Paired {scalar, ptr} slots are a safe fix
Function frame
type slot struct {
scalar uint64
ptr *Object
}
r0 r1 r2 ...
Slide 58
Slide 58 text
Memory reclaim
Set every slot.ptr to nil
Function frame
r0 r1 r2 ...
scalar
ptr
scalar
ptr
scalar
ptr
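The paired-slot idea and the reclaim step can be sketched in a few lines of Go. The `frame` type, its `release` method, and the placeholder `Object` are assumptions made for this example; only the `slot` struct follows the slides.

```go
package main

import "fmt"

// Object is a placeholder for a VM heap object in this sketch.
type Object struct{ name string }

// slot pairs a scalar with a typed pointer, so the Go GC always
// knows which half may reference heap memory.
type slot struct {
	scalar uint64
	ptr    *Object
}

// frame is a hypothetical function frame: args, locals and temps
// all live in one slice of slots.
type frame struct{ slots []slot }

// release clears every pointer half so that objects referenced only
// from this frame become unreachable and can be collected.
func (f *frame) release() {
	for i := range f.slots {
		f.slots[i].ptr = nil
	}
}

func main() {
	f := frame{slots: make([]slot, 4)}
	f.slots[0].scalar = 42                // an int argument
	f.slots[2].ptr = &Object{name: "bar"} // an object local
	fmt.Println(f.slots[0].scalar, f.slots[2].ptr.name)

	f.release()
	fmt.Println(f.slots[2].ptr == nil)
}
```

The cost of this layout is that every slot is twice as wide as it strictly needs to be, which is the trade-off the alternative two-frame design tries to avoid.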
Slide 59
Slide 59 text
Uint64 + second frame for pointers (alt solution)
r0 r1 r2 ...
Function frame
type slot struct {
value uint64
}
p0 p1 p2 ...
Keeps a
pointer
alive
Slide 60
Slide 60 text
Part 4/7
Interop / FFI
Slide 61
Slide 61 text
Calling Java from Go
● Mark the machine code buffer as executable
● Call it as a func or JMP to it from asm
Simple, boring.
Slide 65 text
Calling Go from Java
This is more involved (take a breath):
● Obtain Go function address (simple)
● Follow the Go calling convention (normal)
● Deal with fatal error issues (hard)
Slide 66
Slide 66 text
How to get a Go function code address?
func funcAddr(fn interface{}) uintptr {
type eface struct {
typ uintptr
value *uintptr
}
e := (*eface)(unsafe.Pointer(&fn))
return *e.value
}
Slide 67
Slide 67 text
Go calling convention (source)
Slide 68
Slide 68 text
Assembling Java→Go call
1. Push arguments to the stack
2. CALL $func_addr (use funcAddr to get the address)
3. Move results to local slots
The exact actions depend on the current Go
calling convention.
Slide 69
Slide 69 text
Let’s try it!
...
Slide 70
Slide 70 text
Go runtime is not impressed!
● “Unknown caller PC”
● “Unknown return PC”
● “Missing stackmap”
Slide 71
Slide 71 text
Calling Go directly from the JIT’ed code
main()
> JIT code
foo()
Go call stack
Bad!
Slide 72
Slide 72 text
You've run into a really hairy area
of asm code.
My first suggestion is not try to
call from assembler into Go.
Ian Lance Taylor
Quote
Slide 75
Slide 75 text
How to fix these fatals?
Add a proxy for Go→Java calls.
Make Java→Go calls via a trampoline.
● Provides a stackmap for Java→Go calls
● Provides a known caller/return PC
Slide 76
Slide 76 text
Calling Go via proxy
main()
callJava()
foo()
Go call stack
JIT code
JMP gocall
Entry or gocall return
NO_LOCAL_POINTERS macro
It’s safe for us, as long as:
● We never rely on the addresses of Go stack values
● Our heap values are reachable elsewhere
Slide 80
Slide 80 text
Go→Java call proxy (simplified)
// callJava(e *Env, code *byte)
TEXT ·callJava(SB), 0, $96-16
NO_LOCAL_POINTERS
JMP code+8(FP)
RET
gocall:
CALL CX
JMP -8(BP)
Caller PC fix
Slide 81
Slide 81 text
Go→Java call proxy (fixing return PC)
// callJava(e *Env, code *byte)
TEXT ·callJava(SB), 0, $96-16
NO_LOCAL_POINTERS
MOVQ code+8(FP), CX
JCALL(CX)
RET
gocall:
CALL CX
JMP -8(BP)
Return PC fix
Slide 82
Slide 82 text
Go→Java call proxy (fixing return PC)
// callJava(e *Env, code *byte)
TEXT ·callJava(SB), 0, $96-16
NO_LOCAL_POINTERS
MOVQ code+8(FP), CX
JCALL(CX)
RET
gocall:
CALL CX
JMP -8(BP)
Saves the address of the following RET
instruction and jumps to CX
(see next slide)
Slide 83
Slide 83 text
JCALL macro
// Encoding `lea rax, [rip+N]` with BYTE
// since Go has no real RIP-relative
// addressing mode.
#define JCALL(fnreg) \
BYTE $0x48; … 8d0509000000 \ // Lea
MOVQ AX, (SI) \ // Store RET addr
ADDQ $16, SI \ // Move to next slot
JMP fnreg // Run JIT code
Slide 86
Slide 86 text
Java native methods
// In Java file:
public class Foo {
public static native void printInt(int x);
}
// In Go file:
func fooPrintInt(x int32) {
fmt.Println(x)
}
// Before loading Foo class:
vm.Bind("Foo.printInt", fooPrintInt)
Slide 87
Slide 87 text
Why do we need fast Java→Go?
If calls to Go are fast, we can:
● Implement runtime funcs as Go funcs
● Re-use Go code easily in our Java code
Slide 88
Slide 88 text
Part 5/7
Object layout and memory allocation
Slide 90
Slide 90 text
Foo class
public class Foo {
public int x; // scalar 1
public int y; // scalar 2
public Bar bar; // pointer field
}
type Foo struct {
X int32
Y int32
Bar *Bar
}
Perfect, but impossible
Slide 91
Slide 91 text
Foo class values (naive version)
// Object is a slice of interface{} fields.
// Pointer slot gets a slice pointer.
foo = []interface{}{x, y, bar}
slot.ptr = &foo
Slide 92
Slide 92 text
Read x:int field from Foo
*[ ]interface{}
interface{}
[ ]interface{}
int
Deref a slot.ptr
Get element
Deref eface.value
getfield
foo.x
Slow!
Slide 93
Slide 93 text
Proposed object layout
type Object struct {
Class *ClassInfo
Ptrdata **Object
}
type Object64 struct {
Object
Data [8]byte
}
Slide 94
Slide 94 text
Proposed object layout
type Object struct {
Class *ClassInfo
Ptrdata **Object
}
type Object64 struct {
Object
Data [8]byte
}
Common object
header
Slide 95
Slide 95 text
Proposed object layout
type Object struct {
Class *ClassInfo
Ptrdata **Object
}
type Object64 struct {
Object
Data [8]byte
}
All object
pointer fields
are stored here
Slide 96
Slide 96 text
Proposed object layout
type Object struct {
Class *ClassInfo
Ptrdata **Object
}
type Object64 struct {
Object
Data [8]byte
}
Object with
8-byte storage
for scalar fields,
Object<64>
Slide 97
Slide 97 text
Proposed object layout
type Object struct {
Class *ClassInfo
Ptrdata **Object
}
type Object64 struct {
Object
Data [8]byte
}
X and Y fields
can be stored
here
Slide 98
Slide 98 text
Conversion between Object and Object<N>
Object
*Object
Unsafe cast
Violates “unsafe”
package rules
(but it’s still OK)
Abstract Object layout
ptrdata **Object *Object[0]
*Object[...]
scalar[0]
scalar[...]
info *ClassInfo
Reachable for GC
Slide 101
Slide 101 text
Abstract Object layout
ptrdata **Object *Object[0]
*Object[...]
scalar[0]
scalar[...]
info *ClassInfo
Data (no ptr)
Slide 102
Slide 102 text
Foo layout in memory
ptrdata **Object bar:*Object
x:int
y:int
info *ClassInfo
Slide 103
Slide 103 text
Read x:int field from Foo
*Object
int
Deref a slot.ptr
At a proper offset
getfield
foo.x
Slide 104
Slide 104 text
Read bar:Bar field from Foo
*Object
**Object
*Object
Get a ptrdata
Deref ptrdata at a
proper offset
getfield
foo.bar
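The two getfield paths can be sketched in Go on top of the proposed layout. The `Object` and `Object64` types follow the slides; `ClassInfo`'s contents, the helper names `scalarInt32`, `ptrField` and `newFoo`, and the field offsets are assumptions made for illustration. Scalars live right after the header inside the same allocation, while pointer fields live in a separate GC-visible array reached through Ptrdata.

```go
package main

import (
	"fmt"
	"unsafe"
)

type ClassInfo struct{ Name string } // placeholder contents

type Object struct {
	Class   *ClassInfo
	Ptrdata **Object // first element of the pointer-field array
}

type Object64 struct {
	Object
	Data [8]byte // room for two int32 scalars
}

const headerSize = unsafe.Sizeof(Object{})

// scalarInt32 addresses an int32 scalar at byte offset off
// inside the data area that follows the object header.
func scalarInt32(o *Object, off uintptr) *int32 {
	return (*int32)(unsafe.Add(unsafe.Pointer(o), headerSize+off))
}

// ptrField addresses the i-th pointer field via Ptrdata.
func ptrField(o *Object, i uintptr) **Object {
	return (**Object)(unsafe.Add(unsafe.Pointer(o.Ptrdata), i*unsafe.Sizeof(o)))
}

// newFoo builds a Foo{x, y, bar} instance (hypothetical helper).
func newFoo(class *ClassInfo, x, y int32, bar *Object) *Object {
	o64 := &Object64{}
	ptrs := make([]*Object, 1) // one pointer field: bar
	o64.Class = class
	o64.Ptrdata = &ptrs[0]
	obj := (*Object)(unsafe.Pointer(o64)) // the unsafe cast from the slides
	*scalarInt32(obj, 0) = x
	*scalarInt32(obj, 4) = y
	*ptrField(obj, 0) = bar
	return obj
}

func main() {
	bar := &Object{Class: &ClassInfo{Name: "Bar"}}
	foo := newFoo(&ClassInfo{Name: "Foo"}, 10, 20, bar)
	fmt.Println(*scalarInt32(foo, 0), *scalarInt32(foo, 4))
	fmt.Println((*ptrField(foo, 0)).Class.Name)
}
```

Reading a scalar is a single load at a fixed offset, and reading a pointer field costs one extra dereference, which is much cheaper than unpacking an interface{} per field.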
Slide 105
Slide 105 text
Can we use []byte allocations?
No, Go GC will not track any pointers that are
stored inside that memory.
Slide 106
Slide 106 text
So, how to allocate?
● Choose the closest Object<N>
● Allocate Object<N>
● Return as *Object
May want to adjust sizes to the Go memory
allocator size classes.
Slide 107
Slide 107 text
How many Object types do we need?
Object [64]byte
Object [128]byte
Object [256]byte
Object64
Object128
Object256
For *huge* objects we
can use a less
efficient fallback
...
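The allocation steps can be sketched as a size-class lookup. The data sizes below follow the earlier Object64 definition with an 8-byte data area, so Object128 and Object256 are assumed here to carry 16 and 32 bytes; the real size classes, the `allocObject` helper, and the handling of huge objects are all assumptions for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

type ClassInfo struct{ Name string } // placeholder contents

type Object struct {
	Class   *ClassInfo
	Ptrdata **Object
}

// Fixed-size variants: a common header plus a scalar data area.
type Object64 struct {
	Object
	Data [8]byte
}
type Object128 struct {
	Object
	Data [16]byte
}
type Object256 struct {
	Object
	Data [32]byte
}

// allocObject picks the smallest variant whose data area fits
// scalarBytes, allocates it, and returns it as *Object together with
// the capacity of the chosen data area. It returns (nil, 0) for
// objects too large for any variant; a fallback is not shown here.
func allocObject(scalarBytes int) (*Object, int) {
	switch {
	case scalarBytes <= 8:
		return (*Object)(unsafe.Pointer(new(Object64))), 8
	case scalarBytes <= 16:
		return (*Object)(unsafe.Pointer(new(Object128))), 16
	case scalarBytes <= 32:
		return (*Object)(unsafe.Pointer(new(Object256))), 32
	default:
		return nil, 0
	}
}

func main() {
	for _, n := range []int{5, 9, 32, 33} {
		obj, capacity := allocObject(n)
		fmt.Printf("need %d bytes: allocated=%v capacity=%d\n", n, obj != nil, capacity)
	}
}
```

Because each variant is an ordinary Go struct type, the Go allocator and GC see a typed allocation with a known pointer map, which is what a raw []byte buffer cannot provide.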
Slide 108
Slide 108 text
Part 6/7
Challenges and limitations
Slide 109
Slide 109 text
Null pointer check / explicit
var p *int // p is nil
println(*p)
Slide 111
Slide 111 text
Null pointer check / explicit
var p *int // p is nil
nilcheck(p) // Inserted by a compiler
println(*p)
Simple, but not very efficient
Slide 112
Slide 112 text
Null pointer check / signals
Hardware exceptions and interrupts
+
OS signals handling
More: https://stackoverflow.com/a/36955888/4017439
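The two strategies can be contrasted in plain Go. The first function performs the check explicitly before the load, the way a compiler-inserted nilcheck would. The second performs the load unconditionally and relies on the Go runtime, which on most platforms turns the resulting hardware fault into a recoverable panic. The function names are made up for this sketch, and it shows the behaviour of Go code, not of JIT-compiled code.

```go
package main

import (
	"errors"
	"fmt"
)

// loadChecked tests the pointer before every dereference:
// simple, but it costs a compare and branch each time.
func loadChecked(p *int) (int, error) {
	if p == nil {
		return 0, errors.New("null pointer")
	}
	return *p, nil
}

// loadTrapped dereferences without a check. A nil pointer faults,
// and the Go runtime reports the fault as a panic that can be
// recovered here.
func loadTrapped(p *int) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return *p, nil
}

func main() {
	x := 7
	fmt.Println(loadChecked(&x))
	fmt.Println(loadTrapped(&x))

	_, err := loadChecked(nil)
	fmt.Println(err != nil)
	_, err = loadTrapped(nil)
	fmt.Println(err != nil)
}
```

With the trap-based approach the non-nil path pays nothing extra, but the fault has to be routed to the right handler, which is where sharing OS signals with the Go runtime becomes relevant.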
Slide 113
Slide 113 text
Remember this picture?
Go
Runtime
go-jdk
OS
signals
Slide 115
Slide 115 text
Limitation: bytecode patching
For some reason, it’s quite common in the Java
world to modify the bytecode that is being
loaded...
Since we convert bytecode into machine
code, we have a problem...
Slide 117
Slide 117 text
Challenge: method re-load
If a method changes and we can’t fit its code
into the old executable buffer, the method address
will change...
This requires re-linking all method callers.
If calls were inlined, it’s even harder.
Slide 118
Slide 118 text
Part 7/7
Closing words
Slide 119
Slide 119 text
Testing
import testutil.T;
class Test {
public static void run(int x) {
T.println(x + 5);
}
}
System.out.println in OpenJDK,
fmt.Println in go-jdk
Slide 120
Slide 120 text
N-body benchmark results
OpenJDK 3.9s
go-jdk 4.8s
OpenJDK (no JIT) ~11s
go-jdk (no JIT) ~22s
Slide 123
Slide 123 text
Resources
● go-jdk repository
● VM Showdown: Stack Versus Registers
● Calling Go funcs from asm (ru)
● Go calling convention
● JNI bindings for Go