
# I Wanna Go Fast!

GopherCon 2017

Lightning talk about the Go inliner and why "assembly" doesn't always mean "fast".

Example code: https://github.com/gtank/gophercon2017-examples

July 15, 2017

## Transcript

2. ### Let’s make integer multiplication fast

• Crypto means “multiplying really big numbers together”
• 64 bits x 64 bits will produce 128-bit output
• We don't have 128-bit registers on 64-bit machines
• Results have to be stored in multiple registers (ignoring vector instructions)
• Processors provide "widening multiply" instructions that do this
• Some languages provide uint128 and handle this for you
• Go does not
  ◦ https://github.com/golang/go/issues/9455
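
To make "widening multiply" concrete: Go 1.12, released after this talk, added `math/bits.Mul64`, which returns the 128-bit product as two 64-bit halves. The sketch below only illustrates the hi/lo split; the wrapper name `widen` is made up for this example and is not part of the talk's code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// widen returns the full 128-bit product of a and b as two 64-bit halves.
// The name is illustrative; it only wraps bits.Mul64 (high half first).
func widen(a, b uint64) (hi, lo uint64) {
	return bits.Mul64(a, b)
}

func main() {
	maxU64 := ^uint64(0) // 2^64 - 1
	hi, lo := widen(maxU64, maxU64)
	// (2^64 - 1)^2 = 2^128 - 2^65 + 1, so the product needs both halves.
	fmt.Printf("hi=%#x lo=%#x\n", hi, lo) // prints hi=0xfffffffffffffffe lo=0x1
}
```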
3. ### A multiplier

Basically, splits large numbers into smaller components that we can handle more easily.

    result = hi << 64 + lo

    func multiply1(a, b uint64) []uint64 {
        // Split each operand into 32-bit halves.
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        // Accumulate the partial products, tracking the carries.
        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32

        hi := ah*bh + c1 + c2
        lo := (a * b) // wraps, leaving the low 64 bits
        return []uint64{lo, hi}
    }
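
The split on this slide is schoolbook multiplication in base $2^{32}$. As a sketch of why the carries line up (the remainders $r_0$ and $r_2$ are symbols introduced here for the discarded low halves; they do not appear in the talk's code):

$$
\begin{aligned}
a &= a_h\,2^{32} + a_l, \qquad b = b_h\,2^{32} + b_l \\
a b &= a_h b_h\,2^{64} + (a_h b_l + a_l b_h)\,2^{32} + a_l b_l \\
a_l b_l &= c_0\,2^{32} + r_0 \\
t_1 &= a_h b_l + c_0 = c_1\,2^{32} + t_{1,lo} \\
t_2 &= a_l b_h + t_{1,lo} = c_2\,2^{32} + r_2 \\
a b &= (a_h b_h + c_1 + c_2)\,2^{64} + \left(r_2\,2^{32} + r_0\right)
\end{aligned}
$$

Every remainder is below $2^{32}$, so the last bracket is below $2^{64}$ and is exactly the low word that the wrapping `a * b` produces, while `hi` is $a_h b_h + c_1 + c_2$. Neither $t_1$ nor $t_2$ can overflow 64 bits, since $(2^{32}-1)^2 + (2^{32}-1) = 2^{64} - 2^{32}$.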
5. ### Benchmarks

Computers go fast, right?

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiply1(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = multiply1(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench Multiply1
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply1-2    100000000    13.3 ns/op
    PASS

7. ### Multiplier #2

Uses two return values instead of an array

    func multiply2(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32

        hi := ah*bh + c1 + c2
        lo := (a * b)
        return lo, hi
    }
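
One way to gain confidence in the split multiplier is to compare it against arbitrary-precision arithmetic. The sketch below copies `multiply2` from the slide and checks it against `math/big` for a handful of edge-case inputs. The `reference` helper and the input list are assumptions made for this example, not part of the talk's repository.

```go
package main

import (
	"fmt"
	"math/big"
)

// multiply2 is copied from the slide above.
func multiply2(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := (a * b)
	return lo, hi
}

// reference computes the same product with math/big and splits it into
// low and high 64-bit halves. Hypothetical helper for this example only.
func reference(a, b uint64) (lo, hi uint64) {
	p := new(big.Int).Mul(new(big.Int).SetUint64(a), new(big.Int).SetUint64(b))
	mask := new(big.Int).SetUint64(^uint64(0))
	lo = new(big.Int).And(p, mask).Uint64()
	hi = new(big.Int).Rsh(p, 64).Uint64()
	return lo, hi
}

func main() {
	inputs := []uint64{0, 1, 0xFFFFFFFF, 1 << 32, 3*(1<<52) + 7*(1<<51), ^uint64(0)}
	for _, a := range inputs {
		for _, b := range inputs {
			lo, hi := multiply2(a, b)
			wantLo, wantHi := reference(a, b)
			if lo != wantLo || hi != wantHi {
				fmt.Printf("mismatch for %d * %d\n", a, b)
				return
			}
		}
	}
	fmt.Println("all products match")
}
```

A spot check like this is not a proof, but it exercises the carry paths where both 32-bit halves are saturated.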
8. ### Benchmarks

Better, but still not great.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiply2(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // Two return values need two blank identifiers to compile.
            _, _ = multiply2(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench Multiply2
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply2-2    200000000    9.89 ns/op
    PASS
9. ### I heard assembly is fast

The amd64 instruction we want is called mulq.

    mulq X

Unsigned full multiply of %rax by X. Result stored in %rdx:%rax.

The -q suffix means "quadword" - an eight-byte (64-bit) value.
10. ### Multiplier in asm

Uses MULQ. Nothing clever. How much do we love Plan9?

    // +build amd64

    // func mulq(x, y uint64) (lo uint64, hi uint64)
    TEXT ·mulq(SB),4,$0
        MOVQ x+0(FP), AX
        MOVQ y+8(FP), CX
        MULQ CX
        MOVQ AX, ret+16(FP) // result low bits
        MOVQ DX, ret+24(FP) // result high bits
        RET
11. ### Benchmarks

Take my word for it, this is still REALLY slow.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiplyAsm(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // mulq returns (lo, hi), so discard both values.
            _, _ = mulq(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench MultiplyAsm
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiplyAsm-2    300000000    5.59 ns/op
    PASS
12. ### Function call overhead

A function call takes multiple nanoseconds, which completely dominates the runtime of small arithmetic functions like this multiplier.

pprof again! This mode is called “weblist”: https://github.com/google/pprof
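
A rough way to see call overhead in isolation is to benchmark the same tiny function twice, once inlinable and once with the `//go:noinline` directive forcing a real call. This is a minimal sketch, not code from the talk: the function names and the `sink` variable are made up for the example, and the absolute numbers will vary by machine and Go version.

```go
package main

import (
	"fmt"
	"testing"
)

// addInlined is small enough that the compiler is expected to inline it.
func addInlined(a, b uint64) uint64 { return a + b }

// addCalled has the same body, but the directive below forces a real call.
//
//go:noinline
func addCalled(a, b uint64) uint64 { return a + b }

// sink keeps the compiler from discarding the results as dead code.
var sink uint64

func main() {
	inlined := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += addInlined(uint64(i), 3)
		}
	})
	called := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += addCalled(uint64(i), 3)
		}
	})
	fmt.Println("inlined:", inlined)
	fmt.Println("called: ", called)
}
```

`testing.Benchmark` lets this run as an ordinary program; inside a test package the same loops would be written as `BenchmarkXxx` functions like the ones on the earlier slides.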
13. ### Inlining

To avoid this problem, compilers try to move the code of small/simple functions directly into the caller. This is called inlining.

The Go inliner lives here: https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/inl.go

It walks the instruction tree of each function, calculating a “cost” that roughly reflects the complexity of the function. If the cost exceeds the hard-coded max budget, the function will not be inlined.
14. ### Inlining

Functions accrue +1 cost for each node in the instruction tree.

Some instructions are more equal than others:
• Slice ops are +2
• 3-arg slice ops are +3
• Direct function calls (OCALLFUNC/OCALLMETH) are only OK if the target is inlinable and we have budget for it

Some things are hard stops:
• Most other types of calls: interface calls, type conversions
• Panic/recover
• Certain runtime funcs and all non-intrinsic assembly [#17373]
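
The "+1 per node" idea can be imitated outside the compiler. The sketch below is a toy model only: it counts `go/ast` syntax nodes in each function body, which is not the same tree the compiler's inliner walks, so the numbers will not match the real costs and none of the special cases above are modeled. The `nodeCounts` helper is made up for this example; the two function bodies are `multiply2` and `multiply3` from this talk.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package p

func multiply2(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := (a * b)
	return lo, hi
}

func multiply3(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	t1 := ah*bl + ((al * bl) >> 32)
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi := ah*bh + t1>>32 + t2>>32
	lo := (a * b)
	return lo, hi
}
`

// nodeCounts parses Go source and returns, for each top-level function,
// the number of syntax-tree nodes in its body (a toy stand-in for "cost").
func nodeCounts(source string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", source, 0)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		n := 0
		ast.Inspect(fn.Body, func(node ast.Node) bool {
			if node != nil {
				n++ // +1 for every node visited
			}
			return true
		})
		counts[fn.Name.Name] = n
	}
	return counts, nil
}

func main() {
	counts, err := nodeCounts(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, name := range []string{"multiply2", "multiply3"} {
		fmt.Printf("%s: %d nodes\n", name, counts[name])
	}
}
```

Even in this crude model, folding single-use temporaries into their expressions lowers the count, which is the effect the later slides rely on.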
15. ### Multiplier #2

Does not inline

    func multiply2(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32

        hi := ah*bh + c1 + c2
        lo := (a * b)
        return lo, hi
    }
16. ### Multiplier #3

Inlines!

Propagated some assignments. Shifts cost less than shift + assign if you only use them once.

But how do we know?

    func multiply3(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        t1 := ah*bl + ((al * bl) >> 32)
        t2 := al*bh + (t1 & 0xFFFFFFFF)

        hi := ah*bh + t1>>32 + t2>>32
        lo := (a * b)
        return lo, hi
    }
17. ### Compiler Hack

Prints the inliner cost of each function during builds. The magic number is 80.

    $ go test -bench . -gcflags -m
    # github.com/gtank/gophercon2017-examples
    [inl] func multiply1 costs 99
    [inl] func multiply2 costs 97
    [inl] func multiply3 costs 77
    ./multiply.go:50:6: can inline multiply3
18. ### Benchmarks

Computers *are* fast

    $ go test -bench Multiply3
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply3-2    2000000000    1.18 ns/op
    PASS

That’s roughly ⅕ of the time the assembly function took (1.18 ns vs. 5.59 ns).
19. ### More is NOT better

This only costs 55 points. But don’t write code like this - it’s a threshold effect. Once a function is under the budget it gets inlined, and shaving off more cost buys nothing: the result is the same 1.18 ns/op as multiply3.

    func multiply6(a, b uint64) (uint64, uint64) {
        t1 := (a>>32)*(b&0xFFFFFFFF) + ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF) >> 32)
        t2 := (a&0xFFFFFFFF)*(b>>32) + (t1 & 0xFFFFFFFF)
        return (a * b), (a>>32)*(b>>32) + t1>>32 + t2>>32
    }

    $ go test -bench Multiply6
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply6-2    2000000000    1.18 ns/op
    PASS
20. ### Questions?

No time for questions! Annoy me on Twitter: George Tankersley @gtank__ (two underscores)

https://github.com/gtank/gophercon2017-examples