I Wanna Go Fast!

GopherCon 2017

Lightning talk about the Go inliner and why "assembly" doesn't always mean "fast".

Example code: https://github.com/gtank/gophercon2017-examples

George Tankersley

July 15, 2017

Transcript

  1. I wanna Go fast!
    George Tankersley
    @gtank__

  2. Let’s make integer multiplication fast
    ● Crypto means “multiplying really big numbers together”
    ● 64 bits x 64 bits will produce 128-bit output
    ● We don't have 128-bit registers on 64-bit machines
    ● Results have to be stored in multiple registers (ignoring vector insns)
    ● Processors provide "widening multiply" instructions that do this
    ● Some languages provide uint128 and handle this for you
    ● Go does not
    ○ https://github.com/golang/go/issues/9455
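
Since this talk was given, the standard library gained `math/bits.Mul64` (added in Go 1.12), which returns both halves of a 64x64 product. The sketch below wraps it to match the `(lo, hi)` order used on the following slides; the wrapper name `mul128` is made up for illustration and is not from the talk or its example repository.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mul128 (hypothetical helper) returns the full 128-bit product of a and b
// as two 64-bit halves, low half first to match the slides.
func mul128(a, b uint64) (lo, hi uint64) {
	// bits.Mul64 returns (hi, lo), the opposite order from the slides.
	hi, lo = bits.Mul64(a, b)
	return lo, hi
}

func main() {
	// Same arbitrary value as the benchmarks in the talk: 13 * 2^51.
	const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

	lo, hi := mul128(aLargeNumber, aLargeNumber)
	// Bit length of the 128-bit result, assuming hi != 0.
	fmt.Printf("hi=%#x lo=%#x bits=%d\n", hi, lo, 64+bits.Len64(hi))
}
```

The rest of the deck still applies to any small arithmetic helper that you have to write yourself.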

  3. A multiplier
    Basically, splits large numbers into smaller components that we can handle more easily.
    result = hi << 64 + lo

    func multiply1(a, b uint64) [2]uint64 {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32
        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32
        hi := ah*bh + c1 + c2
        lo := (a * b)
        return [2]uint64{lo, hi}
    }
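
The splitting can be written out as a short derivation. This is a reconstruction from the code on this slide rather than something shown in the talk; the symbols a_l, a_h, b_l, b_h correspond to `al`, `ah`, `bl`, `bh`.

```latex
\begin{aligned}
a &= a_h\,2^{32} + a_l, \qquad b = b_h\,2^{32} + b_l, \qquad 0 \le a_l, a_h, b_l, b_h < 2^{32} \\
ab &= a_h b_h\,2^{64} + (a_h b_l + a_l b_h)\,2^{32} + a_l b_l \\
c_0 &= \lfloor a_l b_l / 2^{32} \rfloor \\
t_1 &= a_h b_l + c_0 \\
t_2 &= a_l b_h + (t_1 \bmod 2^{32}) \\
\mathit{hi} &= a_h b_h + \lfloor t_1 / 2^{32} \rfloor + \lfloor t_2 / 2^{32} \rfloor \\
\mathit{lo} &= ab \bmod 2^{64}
\end{aligned}
```

None of the intermediate sums overflows 64 bits: each partial product is at most (2^32 - 1)^2, and adding a value below 2^32 to it gives at most 2^64 - 2^32.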

  4. A multiplier
    Basically, splits large numbers into smaller components that we can handle more easily.

  5. Benchmarks
    Computers go fast, right?

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiply1(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = multiply1(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench Multiply1
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply1-2    100000000    13.3 ns/op
    PASS

  6. Profiling
    Which part is slow?
    Returning that array!
    btw pprof is awesome https://github.com/google/pprof
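
For readers who want to reproduce a profile like this one, `go test` accepts a `-cpuprofile` flag. The sketch below collects a CPU profile from inside a program with the standard `runtime/pprof` package instead. The helper `profileSize` and the loop inputs are made up for illustration; this is not how the talk's profile was produced.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

// multiply1 is the array-returning multiplier from slide 3.
func multiply1(a, b uint64) [2]uint64 {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1Lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1Lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := a * b
	return [2]uint64{lo, hi}
}

// sink keeps the results alive so the loop is not optimized away.
var sink [2]uint64

// profileSize (hypothetical helper) runs multiply1 in a loop under the CPU
// profiler and reports how many bytes of profile data were written.
func profileSize(iterations int) (int, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return 0, err
	}
	for i := 0; i < iterations; i++ {
		sink = multiply1(aLargeNumber+uint64(i), aLargeNumber)
	}
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Len(), nil
}

func main() {
	n, err := profileSize(100000000)
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("collected %d bytes of CPU profile\n", n)
}
```

Writing the profile to a file instead of a buffer would let you open it in pprof afterwards.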

  7. Multiplier #2
    Uses two return values instead of an array

    func multiply2(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32
        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32
        hi := ah*bh + c1 + c2
        lo := (a * b)
        return lo, hi
    }

  8. Benchmarks
    Better, but still not great.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiply2(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // multiply2 returns two values, so both must be discarded
            _, _ = multiply2(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench Multiply2
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply2-2    200000000    9.89 ns/op
    PASS

  9. I heard assembly is fast
    The amd64 instruction we want is called mulq
    mulq X
    Unsigned full multiply of %rax by X
    Result stored in %rdx:%rax (high 64 bits in %rdx, low 64 bits in %rax)
    The -q suffix means "quadword" - an eight-byte (64-bit) value.

  10. Multiplier in asm
    Uses MULQ. Nothing clever.
    How much do we love Plan9?

    // +build amd64

    #include "textflag.h"

    // func mulq(x, y uint64) (lo uint64, hi uint64)
    TEXT ·mulq(SB),NOSPLIT,$0-32
        MOVQ x+0(FP), AX
        MOVQ y+8(FP), CX
        MULQ CX               // DX:AX = AX * CX
        MOVQ AX, lo+16(FP)    // result low bits
        MOVQ DX, hi+24(FP)    // result high bits
        RET

  11. Benchmarks
    Take my word for it, this is still REALLY slow.

    // Arbitrary value. aLargeNumber^2 is 110 bits
    const aLargeNumber uint64 = 3*(1<<52) + 7*(1<<51)

    func BenchmarkMultiplyAsm(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // mulq returns two values, so both must be discarded
            _, _ = mulq(aLargeNumber, aLargeNumber)
        }
    }

    $ go test -bench MultiplyAsm
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiplyAsm-2    300000000    5.59 ns/op
    PASS

  12. Function call overhead
    A function call takes multiple nanoseconds, which completely dominates the
    runtime of small arithmetic functions like this multiplier.
    pprof again! this mode is called “weblist” https://github.com/google/pprof

  13. Inlining
    To avoid this problem, compilers try to move the code of small/simple functions
    directly into the caller. This is called inlining. The Go inliner lives here:
    https://github.com/golang/go/blob/master/src/cmd/compile/internal/gc/inl.go
    It walks the instruction tree of each function calculating a “cost” that roughly
    reflects the complexity of the function.
    If the cost exceeds the hard-coded max budget, the function will not be inlined.
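
To see the effect of inlining in isolation, one option is to compare two identical functions where one carries the `//go:noinline` compiler directive. This is a minimal sketch, not code from the talk: the function names and the accumulate-in-a-loop benchmark shape are made up for illustration, and the absolute timings will vary by machine and Go version.

```go
package main

import (
	"fmt"
	"testing"
)

// addInline is small enough that the compiler is expected to inline it.
func addInline(a, b uint64) uint64 { return a + b }

// addNoInline has the same body, but the directive below forbids inlining,
// so every use pays for a real function call.
//
//go:noinline
func addNoInline(a, b uint64) uint64 { return a + b }

// sink keeps the benchmark results alive so the loops are not removed.
var sink uint64

func main() {
	inlined := testing.Benchmark(func(b *testing.B) {
		var acc uint64
		for i := 0; i < b.N; i++ {
			acc = addInline(acc, uint64(i))
		}
		sink = acc
	})
	called := testing.Benchmark(func(b *testing.B) {
		var acc uint64
		for i := 0; i < b.N; i++ {
			acc = addNoInline(acc, uint64(i))
		}
		sink = acc
	})
	fmt.Println("inlinable:  ", inlined)
	fmt.Println("noinline:   ", called)
}
```

Building with `-gcflags -m`, as a later slide does, prints which of these functions the compiler decided it can inline.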

  14. Inlining
    Functions accrue +1 cost for each node in the instruction tree
    Some instructions are more equal than others:
    ● Slice ops are +2
    ● 3-arg slice ops are +3
    ● Direct function calls (OCALLFUNC/OCALLMETH) only OK if target is inlinable
    and we have budget for it
    Some things are hard stops:
    ● Most other types of calls: interface calls, type conversions
    ● Panic/recover
    ● Certain runtime funcs and all non-intrinsic assembly [#17373]
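
The "+1 per node" idea can be illustrated with a toy counter built on the standard `go/ast` package. This is only an illustration: it is not the compiler's inliner, it walks the parser's syntax tree rather than the compiler's internal tree, it ignores the extra weights and hard stops listed above, and its numbers will not match the real costs. The function name `toyCosts` and the sample source are made up.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// src holds two sample functions of different sizes.
const src = `package p

func small(a, b uint64) uint64 { return a + b }

func bigger(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	return al * b, ah * b
}
`

// toyCosts (hypothetical helper) parses source and returns, per function,
// the number of syntax-tree nodes in its body. More nodes means a "costlier"
// function in this toy model.
func toyCosts(source string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "toy.go", source, 0)
	if err != nil {
		return nil, err
	}
	costs := make(map[string]int)
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		n := 0
		ast.Inspect(fn.Body, func(node ast.Node) bool {
			if node != nil {
				n++ // +1 for every node visited
			}
			return true
		})
		costs[fn.Name.Name] = n
	}
	return costs, nil
}

func main() {
	costs, err := toyCosts(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("small: ", costs["small"])
	fmt.Println("bigger:", costs["bigger"])
}
```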

  15. Multiplier #2
    Does not inline

    func multiply2(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32
        c0 := (al * bl) >> 32
        t1 := ah*bl + c0
        t1_lo := t1 & 0xFFFFFFFF
        c1 := t1 >> 32
        t2 := al*bh + t1_lo
        c2 := t2 >> 32
        hi := ah*bh + c1 + c2
        lo := (a * b)
        return lo, hi
    }

  16. Multiplier #3
    Inlines!
    Propagated some assignments
    Shifts cost less than shift + assign if you only use them once
    But how do we know?

    func multiply3(a, b uint64) (uint64, uint64) {
        al := a & 0xFFFFFFFF
        ah := a >> 32
        bl := b & 0xFFFFFFFF
        bh := b >> 32
        t1 := ah*bl + ((al * bl) >> 32)
        t2 := al*bh + (t1 & 0xFFFFFFFF)
        hi := ah*bh + t1>>32 + t2>>32
        lo := (a * b)
        return lo, hi
    }
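
Separately from whether it inlines, a rewrite aimed at the inliner should not change results. A quick differential check between `multiply2` and `multiply3` might look like the sketch below. The helper `agree`, the edge-case list, and the seed are made up for illustration and are not from the talk's repository.

```go
package main

import (
	"fmt"
	"math/rand"
)

// multiply2 is the two-return-value multiplier from the earlier slide.
func multiply2(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	c0 := (al * bl) >> 32
	t1 := ah*bl + c0
	t1Lo := t1 & 0xFFFFFFFF
	c1 := t1 >> 32
	t2 := al*bh + t1Lo
	c2 := t2 >> 32
	hi := ah*bh + c1 + c2
	lo := a * b
	return lo, hi
}

// multiply3 is the rewritten, inlinable version from this slide.
func multiply3(a, b uint64) (uint64, uint64) {
	al := a & 0xFFFFFFFF
	ah := a >> 32
	bl := b & 0xFFFFFFFF
	bh := b >> 32
	t1 := ah*bl + ((al * bl) >> 32)
	t2 := al*bh + (t1 & 0xFFFFFFFF)
	hi := ah*bh + t1>>32 + t2>>32
	lo := a * b
	return lo, hi
}

// agree (hypothetical helper) reports whether both versions return the same
// halves for a few edge cases and n pseudo-random input pairs.
func agree(n int, seed int64) bool {
	same := func(a, b uint64) bool {
		lo2, hi2 := multiply2(a, b)
		lo3, hi3 := multiply3(a, b)
		return lo2 == lo3 && hi2 == hi3
	}
	edges := []uint64{0, 1, 0xFFFFFFFF, 1 << 32, ^uint64(0)}
	for _, a := range edges {
		for _, b := range edges {
			if !same(a, b) {
				return false
			}
		}
	}
	rng := rand.New(rand.NewSource(seed))
	for i := 0; i < n; i++ {
		if !same(rng.Uint64(), rng.Uint64()) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("multiply2 and multiply3 agree:", agree(100000, 1))
}
```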

  17. Compiler Hack
    Prints the inliner cost of each function during builds.
    The magic number is 80: that is the inliner's budget, and only functions that fit within it get inlined.

    $ go test -bench . -gcflags -m
    # github.com/gtank/gophercon2017-examples
    [inl] func multiply1 costs 99
    [inl] func multiply2 costs 97
    [inl] func multiply3 costs 77
    ./multiply.go:50:6: can inline multiply3

  18. Benchmarks
    Computers *are* fast

    $ go test -bench Multiply3
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply3-2    2000000000    1.18 ns/op
    PASS

    That’s roughly ⅕ of the time the assembly function took (1.18 ns vs 5.59 ns).

  19. More is NOT better
    This only costs 55 points. But don’t write code like this. Inlining is a threshold effect: once a function fits
    the budget, shrinking its cost further buys nothing (same 1.18 ns/op as multiply3).

    func multiply6(a, b uint64) (uint64, uint64) {
        t1 := (a>>32)*(b&0xFFFFFFFF) + ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF) >> 32)
        t2 := (a&0xFFFFFFFF)*(b>>32) + (t1 & 0xFFFFFFFF)
        return (a * b), (a>>32)*(b>>32) + t1>>32 + t2>>32
    }

    $ go test -bench Multiply6
    goos: linux
    goarch: amd64
    pkg: github.com/gtank/gophercon2017-examples
    BenchmarkMultiply6-2    2000000000    1.18 ns/op
    PASS

  20. Questions?
    No time for questions! Annoy me on Twitter:
    George Tankersley
    @gtank__
    (two underscores)
    https://github.com/gtank/gophercon2017-examples
