I Wanna Go Fast!

GopherCon 2017

Lightning talk about the Go inliner and why "assembly" doesn't always mean "fast".

Example code: https://github.com/gtank/gophercon2017-examples

George Tankersley

July 15, 2017

Transcript

  1. 2.

    Let’s make integer multiplication fast

    • Crypto means “multiplying really big numbers together”
    • 64 bits x 64 bits will produce 128-bit output
    • We don't have 128-bit registers on 64-bit machines
    • Results have to be stored in multiple registers (ignoring vector instructions)
    • Processors provide "widening multiply" instructions that do this
    • Some languages provide uint128 and handle this for you
    • Go does not
      ◦ https://github.com/golang/go/issues/9455
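
For readers on a newer toolchain: Go 1.12, released after this talk, added a widening multiply to the standard library as math/bits.Mul64. The sketch below is not from the slides. The wrapper name widen is made up for illustration, and it only swaps the result order to match the (lo, hi) convention used in the rest of the deck.

```go
package main

import (
	"fmt"
	"math/bits"
)

// widen (hypothetical name) returns the 128-bit product of a and b as
// (lo, hi), the order used on the slides. bits.Mul64 returns (hi, lo).
func widen(a, b uint64) (lo, hi uint64) {
	hi, lo = bits.Mul64(a, b)
	return lo, hi
}

func main() {
	// Same arbitrary operand as the benchmarks in the talk.
	const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)
	lo, hi := widen(aLargeNumber, aLargeNumber)
	fmt.Printf("hi=%#x lo=%#x\n", hi, lo)
}
```

This needs Go 1.12 or later. The talk itself predates that release, so everything that follows works without it.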
  2. 3.

    A multiplier

    Basically, splits large numbers into smaller components that we can handle more easily

    result = hi << 64 + lo

    func multiply1(a, b uint64) [2]uint64 {
        // Split each operand into 32-bit halves.
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        // Combine the partial products, carrying the upper 32 bits forward.
        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32

        hi := ah*bh + c1 + c2
        lo := (a * b) // low 64 bits are the ordinary wrapping product
        return [2]uint64{lo, hi}
    }
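
One way to convince yourself the split-and-carry arithmetic is right is to compare it against math/big. The following is a minimal sketch, not code from the talk's repository: mulParts is a compact restatement of the slide's multiplier, and mulBig is a hypothetical reference helper.

```go
package main

import (
	"fmt"
	"math/big"
)

// mulParts restates the slide's multiplier: split each operand into 32-bit
// halves, combine the partial products, and return (lo, hi).
func mulParts(a, b uint64) (lo, hi uint64) {
	al, ah := a&0xFFFFFFFF, a>>32
	bl, bh := b&0xFFFFFFFF, b>>32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi = ah*bh + t1>>32 + t2>>32
	lo = a * b
	return lo, hi
}

// mulBig computes the same 128-bit product with math/big as a reference.
func mulBig(a, b uint64) (lo, hi uint64) {
	p := new(big.Int).Mul(new(big.Int).SetUint64(a), new(big.Int).SetUint64(b))
	mask := new(big.Int).SetUint64(^uint64(0))
	lo = new(big.Int).And(p, mask).Uint64()
	hi = new(big.Int).Rsh(p, 64).Uint64()
	return lo, hi
}

func main() {
	const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)
	lo, hi := mulParts(aLargeNumber, aLargeNumber)
	refLo, refHi := mulBig(aLargeNumber, aLargeNumber)
	// Reports whether the hand-rolled result matches the reference.
	fmt.Println(lo == refLo && hi == refHi)
}
```

math/big is far too slow for the hot path, but it makes a convenient oracle in tests.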
  3. 5.

    Benchmarks

    Computers go fast, right?

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiply1(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = multiply1(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench Multiply1
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply1-2    100000000    13.3 ns/op
    PASS
  4. 6.

    Profiling

    Which part is slow? Returning that array!

    btw pprof is awesome
    https://github.com/google/pprof
  5. 7.

    Multiplier #2

    Uses two return values instead of an array

    func multiply2(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32

        hi := ah*bh + c1 + c2
        lo := (a * b)
        return lo, hi
    }
  6. 8.

    Benchmarks

    Better, but still not great.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiply2(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = multiply2(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench Multiply2
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply2-2    200000000    9.89 ns/op
    PASS
  7. 9.

    I heard assembly is fast

    The amd64 instruction we want is called mulq

    mulq X
        Unsigned full multiply of %rax by X
        Result stored in %rdx:%rax

    The -q suffix means "quadword" - an eight-byte (64-bit) value.
  8. 10.

    Multiplier in asm

    Uses MULQ
    Nothing clever
    How much do we love Plan9?

    // +build amd64

    // func mulq(x, y uint64) (lo uint64, hi uint64)
    TEXT ·mulq(SB),4,$0
        MOVQ x+0(FP), AX
        MOVQ y+8(FP), CX
        MULQ CX
        MOVQ AX, ret+16(FP) // result low bits
        MOVQ DX, ret+24(FP) // result high bits
        RET
  9. 11.

    Benchmarks

    Take my word for it, this is still REALLY slow.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiplyAsm(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = mulq(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench MultiplyAsm
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiplyAsm-2    300000000    5.59 ns/op
    PASS
  10. 12.

    Function call overhead

    A function call takes multiple nanoseconds, which completely dominates the runtime of small arithmetic functions like this multiplier.

    pprof again! This mode is called “weblist”
    https://github.com/google/pprof
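
Call overhead can also be seen without a profiler by timing the same tiny function twice, once left alone and once with inlining disabled. The sketch below is an illustration, not code from the talk: the function names are made up, and the //go:noinline directive is used to force a real call. Absolute numbers depend on the machine and Go version, so none are quoted here.

```go
package main

import (
	"fmt"
	"testing"
)

// addInlined is small enough that the compiler may inline it into callers.
func addInlined(a, b uint64) uint64 { return a + b }

// addCalled has the same body, but the directive below forces a real call.
//
//go:noinline
func addCalled(a, b uint64) uint64 { return a + b }

// sink keeps the loop results observable so the work is not discarded.
var sink uint64

func main() {
	inlined := testing.Benchmark(func(b *testing.B) {
		var acc uint64
		for i := 0; i < b.N; i++ {
			acc = addInlined(acc, uint64(i))
		}
		sink = acc
	})
	called := testing.Benchmark(func(b *testing.B) {
		var acc uint64
		for i := 0; i < b.N; i++ {
			acc = addCalled(acc, uint64(i))
		}
		sink = acc
	})
	fmt.Println("inlinable:", inlined)
	fmt.Println("noinline: ", called)
}
```

Feeding the previous result back in as an argument keeps the compiler from folding the loop away, which would otherwise make the comparison meaningless.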
  11. 13.

    Inlining

    To avoid this problem, compilers try to move the code of small/simple functions directly into the caller. This is called inlining.

    The Go inliner lives here:
    https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/inl.go

    It walks the instruction tree of each function calculating a “cost” that roughly reflects the complexity of the function. If the cost exceeds the hard-coded max budget, the function will not be inlined.
  12. 14.

    Inlining

    Functions accrue +1 cost for each node in the instruction tree

    Some instructions are more equal than others:
    • Slice ops are +2
    • 3-arg slice ops are +3
    • Direct function calls (OCALLFUNC/OCALLMETH) only OK if target is inlinable and we have budget for it

    Some things are hard stops:
    • Most other types of calls. Interface calls, type conversions
    • Panic/recover
    • Certain runtime funcs and all non-intrinsic assembly [#17373]
  13. 15.

    Multiplier #2

    Does not inline

    func multiply2(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32

        hi := ah*bh + c1 + c2
        lo := (a * b)
        return lo, hi
    }
  14. 16.

    Multiplier #3

    Inlines!
    Propagated some assignments
    Shifts cost less than shift + assign if you only use them once
    But how do we know?

    func multiply3(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        t1 := ah*bl + ((al * bl) >> 32)
        t2 := al*bh + (t1 & 0xFFFFFFFF)

        hi := ah*bh + t1>>32 + t2>>32
        lo := (a * b)
        return lo, hi
    }
  15. 17.

    Compiler Hack

    Prints the inliner cost of each function during builds
    The magic number is 80

    $ go test -bench . -gcflags -m
    # github.com/gtank/gophercon2017-examples
    [inl] func multiply1 costs 99
    [inl] func multiply2 costs 97
    [inl] func multiply3 costs 77
    ./multiply.go:50:6: can inline multiply3
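
Rewriting a function to fit the budget is only worthwhile if the result stays the same. As a sanity check, the sketch below puts multiply2 and multiply3 from the slides side by side and compares them on random inputs. The sameResults helper and its seed are made up for illustration. Building this file with -gcflags -m, as in the transcript above, should report which functions the stock compiler considers inlinable. The [inl] cost lines come from the patched compiler, and the exact messages and costs vary by Go version.

```go
package main

import (
	"fmt"
	"math/rand"
)

// multiply2 is the version from the slides that exceeded the inlining budget.
func multiply2(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := (a * b)
	return lo, hi
}

// multiply3 is the version from the slides with single-use temporaries
// folded into the expressions that consume them.
func multiply3(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	t1 := ah*bl + ((al * bl) >> 32)
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi := ah*bh + t1>>32 + t2>>32
	lo := (a * b)
	return lo, hi
}

// sameResults (hypothetical helper) reports whether both versions agree
// on n pseudo-random input pairs drawn from the given seed.
func sameResults(n int, seed int64) bool {
	r := rand.New(rand.NewSource(seed))
	for i := 0; i < n; i++ {
		a, b := r.Uint64(), r.Uint64()
		lo2, hi2 := multiply2(a, b)
		lo3, hi3 := multiply3(a, b)
		if lo2 != lo3 || hi2 != hi3 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(sameResults(1000000, 1))
}
```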
  16. 18.

    Benchmarks

    Computers *are* fast

    $ go test -bench Multiply3
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply3-2    2000000000    1.18 ns/op
    PASS

    That’s ⅕ of the time the assembly function took.
  17. 19.

    More is NOT better

    This only costs 55 points. But don’t write code like this - it’s a threshold effect.

    func multiply6(a, b uint64) (uint64, uint64) {
        t1 := (a>>32)*(b&0xFFFFFFFF) + ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF) >> 32)
        t2 := (a&0xFFFFFFFF)*(b>>32) + (t1 & 0xFFFFFFFF)
        return (a * b), (a>>32)*(b>>32) + t1>>32 + t2>>32
    }

    $ go test -bench Multiply6
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply6-2    2000000000    1.18 ns/op
    PASS
  18. 20.

    Questions?

    No time for questions! Annoy me on Twitter:

    George Tankersley
    @gtank__ (two underscores)
    https://github.com/gtank/gophercon2017-examples