# I Wanna Go Fast!

GopherCon 2017

Lightning talk about the Go inliner and why "assembly" doesn't always mean "fast".

Example code: https://github.com/gtank/gophercon2017-examples

July 15, 2017

## Transcript

1. I wanna Go fast!
George Tankersley
@gtank__

2. Let’s make integer multiplication fast
● Crypto means “multiplying really big numbers together”
● A 64-bit x 64-bit multiply produces a 128-bit result
● We don't have 128-bit registers on 64-bit machines
● Results have to be stored in multiple registers (ignoring vector instructions)
● Processors provide "widening multiply" instructions that do this
● Some languages provide uint128 and handle this for you
● Go does not
○ https://github.com/golang/go/issues/9455
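
For readers on a newer toolchain: the standard library has since gained a widening multiply. The following is a minimal sketch, assuming Go 1.12 or later, where `math/bits.Mul64` returns both halves of the 128-bit product. It postdates the talk and is not part of the original slides; `widen` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math/bits"
)

// widen is a hypothetical helper: it wraps bits.Mul64 and returns the
// halves in the (lo, hi) order used on the slides.
func widen(a, b uint64) (lo, hi uint64) {
	hi, lo = bits.Mul64(a, b) // Mul64 returns (hi, lo), in that order
	return lo, hi
}

func main() {
	// Same arbitrary operand as the talk's benchmarks.
	const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)
	lo, hi := widen(aLargeNumber, aLargeNumber)
	fmt.Printf("hi=%#x lo=%#x\n", hi, lo)
}
```

Note the swapped return order compared with the slides, which is an easy source of bugs when porting code.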

3. A multiplier
Basically, splits large numbers into smaller components that we can handle more easily
result = (hi << 64) + lo
func multiply1(a, b uint64) []uint64 {
	// Split each operand into 32-bit halves.
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	// Combine the partial products, carrying into the high word.
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := (a * b) // the low 64 bits wrap around naturally
	return []uint64{lo, hi}
}
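
A multiplier like this is easy to get subtly wrong, so it helps to check it against arbitrary-precision arithmetic. The sketch below is not from the talk: `mul64` is a hypothetical condensed copy of the slide's algorithm, and `agreesWithBig` is an illustrative helper that rebuilds `hi<<64 + lo` with `math/big` and compares it to the exact product.

```go
package main

import (
	"fmt"
	"math/big"
)

// mul64 is a hypothetical condensed version of the slide's multiplier.
// It returns the low and high 64 bits of the 128-bit product a*b.
func mul64(a, b uint64) (lo, hi uint64) {
	al, ah := a&0xFFFFFFFF, a>>32
	bl, bh := b&0xFFFFFFFF, b>>32
	t1 := ah*bl + ((al * bl) >> 32)
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi = ah*bh + (t1 >> 32) + (t2 >> 32)
	lo = a * b
	return lo, hi
}

// agreesWithBig reports whether mul64 matches math/big for a*b.
func agreesWithBig(a, b uint64) bool {
	lo, hi := mul64(a, b)
	got := new(big.Int).SetUint64(hi)
	got.Lsh(got, 64)
	got.Add(got, new(big.Int).SetUint64(lo)) // got = hi<<64 + lo
	want := new(big.Int).Mul(
		new(big.Int).SetUint64(a),
		new(big.Int).SetUint64(b),
	)
	return got.Cmp(want) == 0
}

func main() {
	// A few edge cases plus the operand used in the talk's benchmarks.
	inputs := []uint64{0, 1, 0xFFFFFFFF, 1 << 32, 13 << 51, ^uint64(0)}
	for _, a := range inputs {
		for _, b := range inputs {
			if !agreesWithBig(a, b) {
				fmt.Printf("mismatch for %d * %d\n", a, b)
				return
			}
		}
	}
	fmt.Println("all products match math/big")
}
```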

5. Benchmarks
Computers go fast, right?
// Arbitrary value. aLargeNumber^2 is 110 bits
const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)
func BenchmarkMultiply1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = multiply1(aLargeNumber, aLargeNumber)
	}
}
$ go test -bench Multiply1
goos: linux
goarch: amd64
pkg: github.com/gtank/gophercon2017-examples
BenchmarkMultiply1-2 100000000 13.3 ns/op
PASS

6. Profiling
Which part is slow?
Returning that array!

7. Multiplier #2
Uses two return values
func multiply2(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := (a * b)
	return lo, hi
}

8. Benchmarks
Better, but still not great.
// Arbitrary value. aLargeNumber^2 is 110 bits
const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)
func BenchmarkMultiply2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, _ = multiply2(aLargeNumber, aLargeNumber)
	}
}
$ go test -bench Multiply2
goos: linux
goarch: amd64
pkg: github.com/gtank/gophercon2017-examples
BenchmarkMultiply2-2 200000000 9.89 ns/op
PASS

9. I heard assembly is fast
The amd64 instruction we want is called mulq
mulq X
Unsigned full multiply of %rax by X
Result stored in %rdx:%rax
The -q suffix means "quadword" - an eight-byte (64-bit) value.

10. Multiplier in asm
Uses MULQ
Nothing clever
How much do we love Plan9?
// +build amd64

// func mulq(x, y uint64) (lo uint64, hi uint64)
TEXT ·mulq(SB),4,$0
	MOVQ x+0(FP), AX
	MOVQ y+8(FP), CX
	MULQ CX            // DX:AX = AX * CX
	MOVQ AX, lo+16(FP) // result low bits
	MOVQ DX, hi+24(FP) // result high bits
	RET

11. Benchmarks
Take my word for it, this is still REALLY slow.
// Arbitrary value. aLargeNumber^2 is 110 bits
const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)
func BenchmarkMultiplyAsm(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, _ = mulq(aLargeNumber, aLargeNumber)
	}
}
$ go test -bench MultiplyAsm
goos: linux
goarch: amd64
pkg: github.com/gtank/gophercon2017-examples
BenchmarkMultiplyAsm-2 300000000 5.59 ns/op
PASS

A function call takes multiple nanoseconds, which completely dominates the
runtime of small arithmetic functions like this multiplier.
pprof again! This mode is called “weblist”: https://github.com/google/pprof
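
The call-overhead claim can be checked directly. Here is a minimal sketch, assuming the `//go:noinline` compiler directive to force a real call and `testing.Benchmark` to run outside `go test`. The names `add`, `addNoInline`, and `sink` are hypothetical and not from the talk, and the numbers depend heavily on the machine and Go version, so no output is shown.

```go
package main

import (
	"fmt"
	"testing"
)

// add is small enough that the compiler is expected to inline it.
func add(a, b uint64) uint64 { return a + b }

// addNoInline has the same body, but the directive below forbids inlining,
// so every use pays for a real function call.
//
//go:noinline
func addNoInline(a, b uint64) uint64 { return a + b }

// sink keeps the results live so the loops are not optimized away.
var sink uint64

func benchInlined(b *testing.B) {
	var acc uint64
	for i := 0; i < b.N; i++ {
		acc = add(acc, uint64(i))
	}
	sink = acc
}

func benchCalled(b *testing.B) {
	var acc uint64
	for i := 0; i < b.N; i++ {
		acc = addNoInline(acc, uint64(i))
	}
	sink = acc
}

func main() {
	testing.Init() // registers testing flags when not running under "go test"
	fmt.Println("inlinable:", testing.Benchmark(benchInlined))
	fmt.Println("noinline: ", testing.Benchmark(benchCalled))
}
```

The difference between the two lines is a rough estimate of the per-call cost for a function this small.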

13. Inlining
To avoid this problem, compilers try to move the code of small/simple functions
directly into the caller. This is called inlining. The Go inliner lives here:
https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/inl.go
It walks the instruction tree of each function calculating a “cost” that roughly
reflects the complexity of the function.
If the cost exceeds the hard-coded max budget, the function will not be inlined.

14. Inlining
Functions accrue +1 cost for each node in the instruction tree
Some instructions are more equal than others:
● Slice ops are +2
● 3-arg slice ops are +3
● Direct function calls (OCALLFUNC/OCALLMETH) are only OK if the target is inlinable and we have budget for it
Some things are hard stops:
● Most other types of calls: interface calls, type conversions
● Panic/recover
● Certain runtime funcs and all non-intrinsic assembly [#17373]

15. Multiplier #2
Does not inline
func multiply2(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := (a * b)
	return lo, hi
}

16. Multiplier #3
Inlines!
Propagated some assignments
Shifts cost less than shift + assign if you only use them once
But how do we know?
func multiply3(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	t1 := ah*bl + ((al * bl) >> 32)
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi := ah*bh + t1>>32 + t2>>32
	lo := (a * b)
	return lo, hi
}

17. Compiler Hack
Prints the inliner cost of each function during builds
The magic number is 80
$ go test -bench . -gcflags -m
# github.com/gtank/gophercon2017-examples
[inl] func multiply1 costs 99
[inl] func multiply2 costs 97
[inl] func multiply3 costs 77
./multiply.go:50:6: can inline multiply3

18. Benchmarks
Computers *are* fast
$ go test -bench Multiply3
goos: linux
goarch: amd64
pkg: github.com/gtank/gophercon2017-examples
BenchmarkMultiply3-2 2000000000 1.18 ns/op
PASS
That’s about ⅕ of the time the assembly function took (1.18 ns vs. 5.59 ns).

19. More is NOT better
This only costs 55 points. But don’t write code like this. Inlining is a threshold effect: once a function fits under the budget, cutting its cost further buys nothing, and this version runs no faster than multiply3.
func multiply6(a, b uint64) (uint64, uint64) {
	t1 := (a>>32)*(b&0xFFFFFFFF) + ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF) >> 32)
	t2 := (a&0xFFFFFFFF)*(b>>32) + (t1 & 0xFFFFFFFF)
	return (a * b), (a>>32)*(b>>32) + t1>>32 + t2>>32
}
$ go test -bench Multiply6
goos: linux
goarch: amd64
pkg: github.com/gtank/gophercon2017-examples
BenchmarkMultiply6-2 2000000000 1.18 ns/op
PASS

20. Questions?
No time for questions! Annoy me on Twitter:
George Tankersley
@gtank__
(two underscores)
https://github.com/gtank/gophercon2017-examples