I Wanna Go Fast!

I wanna Go fast! George Tankersley @gtank__

Let’s make integer multiplication fast

• Crypto means “multiplying really big numbers together”
• 64 bits x 64 bits will produce 128-bit output
• We don't have 128-bit registers on 64-bit machines
• Results have to be stored in multiple registers (ignoring vector insns)
• Processors provide "widening multiply" instructions that do this
• Some languages provide uint128 and handle this for you
• Go does not
  ◦ https://github.com/golang/go/issues/9455
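To make the 128-bit claim concrete, here is a minimal sketch that computes a full product with math/big and splits it into two 64-bit words. The helper name split128 is made up for illustration. It is a slow reference for checking values, not something you would use in crypto code.

```go
package main

import (
	"fmt"
	"math/big"
)

// split128 (hypothetical helper) computes a*b exactly with math/big and
// splits the product into a high and a low 64-bit word, the same layout a
// widening multiply leaves in two registers. It also reports how many bits
// the full product needs.
func split128(a, b uint64) (hi, lo uint64, bitLen int) {
	p := new(big.Int).Mul(new(big.Int).SetUint64(a), new(big.Int).SetUint64(b))
	mask := new(big.Int).SetUint64(^uint64(0))
	lo = new(big.Int).And(p, mask).Uint64()
	hi = new(big.Int).Rsh(p, 64).Uint64()
	return hi, lo, p.BitLen()
}

func main() {
	// Same arbitrary value the deck uses in its benchmarks.
	a := uint64(3*(1<<52) + 7*(1<<51))

	hi, lo, n := split128(a, a)
	fmt.Printf("full product: hi=%#x lo=%#x (%d bits)\n", hi, lo, n)

	// A plain uint64 multiply silently keeps only the low word.
	fmt.Printf("a*a in uint64: %#x\n", a*a)
}
```

For this input the full product needs 110 bits, so everything above bit 63 is lost by the ordinary `a*a`.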

A multiplier

Basically, splits large numbers into smaller components that we can handle more easily

result = hi << 64 + lo

    func multiply1(a, b uint64) [2]uint64 {
        // Split each operand into 32-bit halves.
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32

        // Accumulate the cross products, carrying the upper 32 bits each time.
        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32

        hi := ah*bh + c1 + c2
        lo := (a * b) // the wrapping product is already the low 64 bits
        return [2]uint64{lo, hi}
    }
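A quick way to gain confidence in the carry handling is to compare against a known-good widening multiply. The sketch below uses math/bits.Mul64 as the oracle. That function was added to the standard library in Go 1.12, after this talk was given, so it is an assumption about your toolchain and not part of the original deck. The helper name agrees is illustrative.

```go
package main

import (
	"fmt"
	"math/bits"
)

// multiply1 is the deck's schoolbook multiplier, returning [lo, hi].
func multiply1(a, b uint64) [2]uint64 {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := a * b
	return [2]uint64{lo, hi}
}

// agrees (hypothetical helper) reports whether multiply1 matches
// bits.Mul64 for one pair of inputs. Note that bits.Mul64 returns
// (hi, lo), the opposite order from the deck's [lo, hi] array.
func agrees(a, b uint64) bool {
	hi, lo := bits.Mul64(a, b)
	got := multiply1(a, b)
	return got[0] == lo && got[1] == hi
}

func main() {
	// Edge cases around the 32-bit split plus the deck's benchmark value.
	inputs := []uint64{0, 1, 0xFFFFFFFF, 1 << 32, 3*(1<<52) + 7*(1<<51), ^uint64(0)}
	for _, a := range inputs {
		for _, b := range inputs {
			if !agrees(a, b) {
				fmt.Printf("mismatch for %#x * %#x\n", a, b)
				return
			}
		}
	}
	fmt.Println("multiply1 agrees with bits.Mul64 on all sample inputs")
}
```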


Benchmarks

Computers go fast, right?

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiply1(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = multiply1(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench Multiply1
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply1-2    100000000    13.3 ns/op
    PASS

Profiling

Which part is slow? Returning that array!

btw pprof is awesome
https://github.com/google/pprof

Multiplier #2

Uses two return values instead of an array

    func multiply2(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32
        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32
        hi := ah*bh + c1 + c2
        lo := (a * b)
        return lo, hi
    }

Benchmarks

Better, but still not great.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiply2(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // Two return values need two blanks to compile.
            _, _ = multiply2(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench Multiply2
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply2-2    200000000    9.89 ns/op
    PASS

I heard assembly is fast

The amd64 instruction we want is called mulq

    mulq X    Unsigned full multiply of %rax by X
              Result stored in %rdx:%rax

The -q suffix means "quadword" - an eight-byte (64-bit) value.

Multiplier in asm

Uses MULQ
Nothing clever
How much do we love Plan9?

    // +build amd64

    // func mulq(x, y uint64) (lo uint64, hi uint64)
    TEXT ·mulq(SB),4,$0
        MOVQ x+0(FP), AX
        MOVQ y+8(FP), CX
        MULQ CX
        MOVQ AX, ret+16(FP) // result low bits
        MOVQ DX, ret+24(FP) // result high bits
        RET

Benchmarks

Take my word for it, this is still REALLY slow.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiplyAsm(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // Two return values need two blanks to compile.
            _, _ = mulq(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench MultiplyAsm
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiplyAsm-2    300000000    5.59 ns/op
    PASS

Function call overhead

A function call takes multiple nanoseconds, which completely dominates the runtime of small arithmetic functions like this multiplier.

pprof again! This mode is called “weblist”
https://github.com/google/pprof

Inlining

To avoid this problem, compilers try to move the code of small/simple functions directly into the caller. This is called inlining.

The Go inliner lives here:
https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/inl.go

It walks the instruction tree of each function, calculating a “cost” that roughly reflects the complexity of the function. If the cost exceeds the hard-coded max budget, the function will not be inlined.

Inlining

Functions accrue +1 cost for each node in the instruction tree

Some instructions are more equal than others:
• Slice ops are +2
• 3-arg slice ops are +3
• Direct function calls (OCALLFUNC/OCALLMETH) only OK if target is inlinable and we have budget for it

Some things are hard stops:
• Most other types of calls. Interface calls, type conversions
• Panic/recover
• Certain runtime funcs and all non-intrinsic assembly [#17373]
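The shape of that cost walk can be sketched with the standard go/ast package. This is a toy model only: the real inliner works on the compiler's internal tree, not on go/ast, so the numbers printed here will not match the compiler's. The function toyCost, the sample source, and the budget of 80 (the figure quoted later in the deck) are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// sample holds a few functions to score; it is parsed, never compiled.
const sample = `package p

func small(a uint64) uint64 { return a >> 32 }

func multiply2(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := (a * b)
	return lo, hi
}

func loud() { panic("no") }
`

// toyCost (hypothetical) walks the body of the named function and charges
// +1 per syntax node, +2 for a slice expression and +3 for a 3-arg slice.
// A call to panic or recover marks the function as not inlinable.
func toyCost(src, name string) (cost int, ok bool, err error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return 0, false, err
	}
	ok = true
	for _, decl := range file.Decls {
		fn, isFunc := decl.(*ast.FuncDecl)
		if !isFunc || fn.Name.Name != name || fn.Body == nil {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			switch x := n.(type) {
			case nil:
				return false // end-of-children marker, nothing to charge
			case *ast.SliceExpr:
				if x.Slice3 {
					cost += 3
				} else {
					cost += 2
				}
			case *ast.CallExpr:
				if id, isIdent := x.Fun.(*ast.Ident); isIdent && (id.Name == "panic" || id.Name == "recover") {
					ok = false // hard stop
				}
				cost++
			default:
				cost++
			}
			return true
		})
		return cost, ok, nil
	}
	return 0, false, fmt.Errorf("function %q not found", name)
}

func main() {
	const budget = 80 // assumption: the "magic number" from the deck
	for _, name := range []string{"small", "multiply2", "loud"} {
		cost, ok, err := toyCost(sample, name)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%-10s toy cost %3d, would inline: %v\n", name, cost, ok && cost <= budget)
	}
}
```

The point is the mechanism: every extra named temporary adds nodes, so two functions that compute the same thing can land on different sides of the budget.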

Multiplier #2

Does not inline

    func multiply2(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32
        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32
        hi := ah*bh + c1 + c2
        lo := (a * b)
        return lo, hi
    }

Multiplier #3

Inlines!
Propagated some assignments
Shifts cost less than shift + assign if you only use them once
But how do we know?

    func multiply3(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32
        t1 := ah*bl + ((al * bl) >> 32)
        t2 := al*bh + (t1 & 0xFFFFFFFF)
        hi := ah*bh + t1>>32 + t2>>32
        lo := (a * b)
        return lo, hi
    }
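When propagating assignments by hand like this, it is worth checking that the rewrite still computes the same thing. One possible sketch uses testing/quick from the standard library, whose CheckEqual calls two functions with the same random arguments and compares the results. The wrapper name sameResults is illustrative.

```go
package main

import (
	"fmt"
	"testing/quick"
)

// multiply2 is the deck's version with named temporaries.
func multiply2(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1_lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1_lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := a * b
	return lo, hi
}

// multiply3 is the deck's version with single-use temporaries folded in.
func multiply3(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	t1 := ah*bl + ((al * bl) >> 32)
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi := ah*bh + t1>>32 + t2>>32
	lo := a * b
	return lo, hi
}

// sameResults (hypothetical helper) feeds random uint64 pairs to both
// versions and returns an error describing the first mismatch, if any.
func sameResults() error {
	return quick.CheckEqual(multiply2, multiply3, nil)
}

func main() {
	if err := sameResults(); err != nil {
		fmt.Println("versions disagree:", err)
		return
	}
	fmt.Println("multiply2 and multiply3 agree on random inputs")
}
```

Random testing does not prove equivalence, but it catches the usual slip of dropping a mask or a carry while inlining expressions.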

Compiler Hack

Prints the inliner cost of each function during builds
The magic number is 80

    $ go test -bench . -gcflags -m
    # github.com/gtank/gophercon2017-examples
    [inl] func multiply1 costs 99
    [inl] func multiply2 costs 97
    [inl] func multiply3 costs 77
    ./multiply.go:50:6: can inline multiply3

Benchmarks

Computers *are* fast

    $ go test -bench Multiply3
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply3-2    2000000000    1.18 ns/op
    PASS

That’s about ⅕ of the time the assembly function took (1.18 ns vs. 5.59 ns).
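To see the call overhead in isolation, without any assembly, one option is to benchmark the same Go code with and without inlining. The sketch below is an assumption-laden illustration, not the deck's own benchmark: the names mulInline, mulNoInline, benchInline, benchNoInline and sink are made up, and the numbers will vary by machine and Go version. `//go:noinline` is a real compiler directive. The input changes every iteration and the results go into a package-level variable so that the compiler has less room to fold or discard the work once the call is inlined.

```go
package main

import (
	"fmt"
	"testing"
)

// mulInline is the deck's multiply3, small enough to fit the inline budget.
func mulInline(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	t1 := ah*bl + ((al * bl) >> 32)
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi := ah*bh + t1>>32 + t2>>32
	lo := a * b
	return lo, hi
}

// mulNoInline does the same arithmetic but opts out of inlining, so every
// call from the benchmark loop pays the function call overhead.
//
//go:noinline
func mulNoInline(a, b uint64) (uint64, uint64) {
	return mulInline(a, b)
}

// sink keeps the results observable outside the loop.
var sink uint64

func benchInline(b *testing.B) {
	x := uint64(3*(1<<52) + 7*(1<<51))
	for i := 0; i < b.N; i++ {
		lo, hi := mulInline(x+uint64(i), x)
		sink += lo ^ hi
	}
}

func benchNoInline(b *testing.B) {
	x := uint64(3*(1<<52) + 7*(1<<51))
	for i := 0; i < b.N; i++ {
		lo, hi := mulNoInline(x+uint64(i), x)
		sink += lo ^ hi
	}
}

func main() {
	testing.Init() // needed when using the testing package outside "go test"
	fmt.Println("inlined:    ", testing.Benchmark(benchInline))
	fmt.Println("not inlined:", testing.Benchmark(benchNoInline))
}
```

If the two lines print similar times on your setup, check with `-gcflags -m` whether mulInline was actually inlined.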

More is NOT better

This only costs 55 points. But don’t write code like this - it’s a threshold effect. Once a function fits under the budget, cutting its cost further buys nothing: this runs at the same 1.18 ns/op as multiply3.

    func multiply6(a, b uint64) (uint64, uint64) {
        t1 := (a>>32)*(b&0xFFFFFFFF) + ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF) >> 32)
        t2 := (a&0xFFFFFFFF)*(b>>32) + (t1 & 0xFFFFFFFF)
        return (a * b), (a>>32)*(b>>32) + t1>>32 + t2>>32
    }

    $ go test -bench Multiply6
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply6-2    2000000000    1.18 ns/op
    PASS

Questions?

No time for questions! Annoy me on Twitter:
George Tankersley @gtank__ (two underscores)
https://github.com/gtank/gophercon2017-examples
