Let’s make integer multiplication fast

● Crypto means “multiplying really big numbers together”
● 64 bits x 64 bits will produce 128-bit output
● We don't have 128-bit registers on 64-bit machines
● Results have to be stored in multiple registers (ignoring vector instructions)
● Processors provide "widening multiply" instructions that do this
● Some languages provide uint128 and handle this for you
● Go does not
  ○ https://github.com/golang/go/issues/9455

A multiplier

Basically, splits large numbers into smaller components that we can handle more easily

result = (hi << 64) + lo

    func multiply1(a, b uint64) [2]uint64 {
        // Split each operand into 32-bit halves
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        // Carry out of the low 32x32 product
        c0 := (al * bl) >> 32

        // Cross terms, each adding in the carry from below
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32

        t2 := al*bh + t1_lo
        c2 := t2 >> 32

        hi := ah*bh + c1 + c2
        lo := (a * b) // wraps mod 2^64, which is exactly the low half
        return [2]uint64{lo, hi}
    }

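One way to convince yourself the split is right: with a = ah·2^32 + al and b = bh·2^32 + bl, the product is ah·bh·2^64 + (ah·bl + al·bh)·2^32 + al·bl, and the function only has to track the carries out of the lower terms. The sketch below checks that against math/big on a few edge cases. The names mulSplit and mulBig are made up for this example; mulSplit is meant to be the same arithmetic as multiply1, returning (lo, hi) instead of an array.

```go
package main

import (
	"fmt"
	"math/big"
)

// mulSplit follows the slide's multiply1 arithmetic (hypothetical name).
func mulSplit(a, b uint64) (lo, hi uint64) {
	al, ah := a&0xFFFFFFFF, a>>32
	bl, bh := b&0xFFFFFFFF, b>>32

	c0 := (al * bl) >> 32           // carry out of the lowest 32x32 product
	t1 := ah*bl + c0                // first cross term plus that carry
	t2 := al*bh + (t1 & 0xFFFFFFFF) // second cross term plus the low half of t1
	hi = ah*bh + (t1 >> 32) + (t2 >> 32)
	lo = a * b // wraps mod 2^64, which is the low half
	return lo, hi
}

// mulBig computes the same 128-bit product with math/big as a reference.
func mulBig(a, b uint64) (lo, hi uint64) {
	p := new(big.Int).Mul(new(big.Int).SetUint64(a), new(big.Int).SetUint64(b))
	mask := new(big.Int).SetUint64(^uint64(0))
	lo = new(big.Int).And(p, mask).Uint64() // low 64 bits
	hi = new(big.Int).Rsh(p, 64).Uint64()   // high 64 bits
	return lo, hi
}

func main() {
	cases := [][2]uint64{
		{0, 0},
		{1, ^uint64(0)},
		{^uint64(0), ^uint64(0)},
		{1 << 32, 1 << 32},
		{0xFFFFFFFF, 0xFFFFFFFF00000001},
	}
	for _, c := range cases {
		lo, hi := mulSplit(c[0], c[1])
		wantLo, wantHi := mulBig(c[0], c[1])
		fmt.Printf("%#x * %#x: hi=%#x lo=%#x match=%v\n",
			c[0], c[1], hi, lo, lo == wantLo && hi == wantHi)
	}
}
```

A handful of edge cases is not a proof, but the all-ones case exercises every carry path at once.
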
I heard assembly is fast

The amd64 instruction we want is called mulq

    mulq X
        Unsigned full multiply of %rax by X
        Result stored in %rdx:%rax (high 64 bits in %rdx, low 64 bits in %rax)

The -q suffix means "quadword" - an eight-byte (64-bit) value.

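To see what the %rdx:%rax pair holds without writing any assembly, here is a small sketch using math/bits.Mul64. That function was added in Go 1.12, so it is newer than this talk, and whether the compiler lowers it to this exact instruction on your platform is an assumption you would want to check in the disassembly. wideMul is a hypothetical wrapper name.

```go
package main

import (
	"fmt"
	"math/bits"
)

// wideMul returns the two halves of the 128-bit product x*y:
// hi corresponds to what MULQ leaves in RDX, lo to what it leaves in RAX.
func wideMul(x, y uint64) (hi, lo uint64) {
	return bits.Mul64(x, y)
}

func main() {
	// 2^63 * 4 = 2^65, which does not fit in 64 bits: hi is 2, lo is 0.
	hi, lo := wideMul(1<<63, 4)
	fmt.Printf("hi=%#x lo=%#x\n", hi, lo)

	// (2^64 - 1)^2 = (2^64 - 2) * 2^64 + 1
	hi, lo = wideMul(^uint64(0), ^uint64(0))
	fmt.Printf("hi=%#x lo=%#x\n", hi, lo)
}
```

Note the return order: bits.Mul64 gives (hi, lo), the reverse of the (lo, hi) convention used by the multipliers in these slides.
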
Multiplier in asm

Uses MULQ
Nothing clever
How much do we love Plan9?

    // +build amd64

    // func mulq(x, y uint64) (lo uint64, hi uint64)
    TEXT ·mulq(SB),4,$0         // flag 4 is NOSPLIT, $0 is the frame size
        MOVQ x+0(FP), AX
        MOVQ y+8(FP), CX
        MULQ CX                 // DX:AX = AX * CX
        MOVQ AX, ret+16(FP)     // result low bits
        MOVQ DX, ret+24(FP)     // result high bits
        RET

Benchmarks

Take my word for it, this is still REALLY slow.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiplyAsm(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _, _ = mulq(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench MultiplyAsm
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiplyAsm-2    300000000    5.59 ns/op
    PASS

Function call overhead

A function call takes multiple nanoseconds, which completely dominates the runtime of small arithmetic functions like this multiplier.

pprof again! This mode is called “weblist”

https://github.com/google/pprof

Inlining

To avoid this problem, compilers try to move the code of small/simple functions directly into the caller. This is called inlining.

The Go inliner lives here:
https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/inl.go

It walks the instruction tree of each function, calculating a “cost” that roughly reflects the complexity of the function.

If the cost exceeds the hard-coded max budget, the function will not be inlined.

Inlining

Functions accrue +1 cost for each node in the instruction tree

Some instructions are more equal than others:
● Slice ops are +2
● 3-arg slice ops are +3
● Direct function calls (OCALLFUNC/OCALLMETH) only OK if target is inlinable and we have budget for it

Some things are hard stops:
● Most other types of calls: interface calls, type conversions
● Panic/recover
● Certain runtime funcs and all non-intrinsic assembly [#17373]

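To get a feel for "+1 per node", here is a toy cost counter built on go/parser and go/ast. It is only an illustration: the real inliner works on the compiler's internal tree and applies the extra weights and hard stops listed above, so the numbers it prints will not match the compiler's. The names toyCosts and src are made up for this sketch; the budget of 80 is the hard-coded limit this talk refers to.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// budget is the inliner's hard-coded limit as quoted in this talk.
const budget = 80

// src holds two hypothetical functions to score.
const src = `package p

func small(a, b uint64) uint64 { return a * b }

func larger(a, b uint64) uint64 {
	al := a & 0xFFFFFFFF
	bl := b & 0xFFFFFFFF
	return (al * bl) >> 32
}
`

// toyCosts parses Go source and returns, for each function, the number of
// go/ast nodes in its body. Every node counts as +1; nothing is weighted.
func toyCosts(source string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", source, 0)
	if err != nil {
		return nil, err
	}
	costs := make(map[string]int)
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		n := 0
		ast.Inspect(fn.Body, func(node ast.Node) bool {
			if node != nil { // Inspect also calls back with nil after each subtree
				n++
			}
			return true
		})
		costs[fn.Name.Name] = n
	}
	return costs, nil
}

func main() {
	costs, err := toyCosts(src)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"small", "larger"} {
		fmt.Printf("%s: toy cost %d, within budget: %v\n",
			name, costs[name], costs[name] <= budget)
	}
}
```

Even in this crude model you can see why naming an intermediate value costs more than using the expression in place: each extra assignment brings its own statement and identifier nodes.
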
Multiplier #3

Inlines!
Propagated some assignments
Shifts cost less than shift + assign if you only use them once
But how do we know?

    func multiply3(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        t1 := ah*bl + ((al * bl) >> 32)
        t2 := al*bh + (t1 & 0xFFFFFFFF)

        hi := ah*bh + t1>>32 + t2>>32
        lo := (a * b)
        return lo, hi
    }

Compiler Hack

Prints the inliner cost of each function during builds
The magic number is 80: that is the inliner's budget, so only multiply3 (cost 77) fits under it

    $ go test -bench . -gcflags -m
    # github.com/gtank/gophercon2017-examples
    [inl] func multiply1 costs 99
    [inl] func multiply2 costs 97
    [inl] func multiply3 costs 77
    ./multiply.go:50:6: can inline multiply3

Benchmarks

Computers *are* fast

    $ go test -bench Multiply3
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply3-2    2000000000    1.18 ns/op
    PASS

That’s about ⅕ of the time the assembly function took (1.18 ns vs. 5.59 ns).

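If you want to reproduce a number like this yourself, keep in mind that once a function inlines, the compiler can see through a call with constant arguments and unused results. The sketch below is one cautious way to set up the measurement, not the benchmark from the talk: inputs come from package-level variables, each iteration feeds into the next, and the final result lands in a sink. The names inA, inB, sinkLo, sinkHi, benchMultiply3, and runBench are made up here, and testing.Benchmark is used so the example runs as an ordinary program.

```go
package main

import (
	"fmt"
	"testing"
)

// multiply3 is the inlinable multiplier from the slides.
func multiply3(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	t1 := ah*bl + ((al * bl) >> 32)
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi := ah*bh + t1>>32 + t2>>32
	lo := a * b
	return lo, hi
}

var (
	// Inputs live in variables so they are not compile-time constants.
	// 13 << 51 is the same value as aLargeNumber in the earlier benchmark.
	inA, inB uint64 = 13 << 51, 13 << 51
	// Sinks keep the final result observably used.
	sinkLo, sinkHi uint64
)

func benchMultiply3(b *testing.B) {
	x, y := inA, inB
	var lo, hi uint64
	for i := 0; i < b.N; i++ {
		lo, hi = multiply3(x, y)
		x += hi // make each iteration depend on the previous one
	}
	sinkLo, sinkHi = lo, hi
}

// runBench runs the benchmark outside of "go test".
func runBench() testing.BenchmarkResult {
	return testing.Benchmark(benchMultiply3)
}

func main() {
	fmt.Println("multiply3:", runBench())
}
```

The dependency chain measures latency rather than throughput, so expect the figure to differ from the one on the slide; the point is only that the multiplication is still there when you measure it.
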