Understanding CPU Microarchitecture for Performance (JChampionsConf)

alblue
January 20, 2022

Microprocessors have evolved over decades to eke out performance from existing code. But the microarchitecture of the CPU leaks into the assumptions of a flat memory model, with the result that equivalent code can run significantly faster by working with, rather than fighting against, the microarchitecture of the CPU.

This talk, given for the JChampionsConf in 2022, presents the microarchitecture of modern CPUs. It shows how misaligned data can cause cache line false sharing, how branch prediction works and when it fails, and how to read CPU-specific performance monitoring counters and use them with tools like perf and toplev to discover where the bottlenecks in CPU-heavy code live. We’ll use these facts to revisit performance advice on general code patterns and the things to look out for in executing systems. The talk is language agnostic, although it is based on the Linux/x86_64 architecture.

The presentation was recorded at the JChampionsConf meeting in January 2022, and a recording is available here: https://youtu.be/Pa_l3aHCoGc

Transcript

  1. @alblue
    ©2022 Alex Blewitt
    Understanding CPU Microarchitecture 🏎
    For Maximum Performance

  2. Overview
    • What happens inside a CPU?
    • Where do CPU intensive programs get delayed?
    • What tools are there to help measure performance bottlenecks?
    • How can we make programs run faster?

  3. Performance Pyramid
    Layers, top to bottom: distributed architecture, system architecture, algorithm, hardware, memory, cpu, inst
    Bracket labels: This talk, Other talks

  4. Platform Topologies
    2S Configurations (2S-2UPI & 2S-3UPI shown)
    4S Configurations (4S-2UPI & 4S-3UPI shown)
    8S Configuration
    Diagram labels: SKL, LBG, Intel® UPI, DMI, DMI x4, 3x16 PCIe*, 1x100G Intel® OP Fabric
    Intel® Xeon® Scalable Processor supports configurations ranging from 2S-2UPI to 8S
    Non Uniform Memory Architecture (NUMA)
    This slide under embargo until 9:15 AM PDT July 11, 2017
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  8. 12
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  9. 18
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  10. https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  11. Cascade/Skylake 10-core die
    10
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  12. Cascade/Skylake 18-core die
    18
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  13. Cascade/Skylake 28-core die
    Sub NUMA cluster 0
    Sub NUMA cluster 1
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  14. Cascade 56 core ‘die’
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  15. Cascade 56 core package
    Package: Die, Die
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

  16. Ice Lake SP 28 core
    https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_Intel_Irma_ICX-CPU-final3.pdf

  17. Ice Lake SP 40 core
    Sunny Cove core
    https://www.nextplatform.com/2021/04/19/deep-dive-into-intels-ice-lake-xeon-sp-architecture/

  18. Alder Lake/Sapphire Rapids
    • Alder Lake is the successor to Ice Lake, built on Willow/Golden Cove
    • Server will be available in Sapphire Rapids
    • Desktop/laptop cores exist under Alder Lake name
    • Increases in cache sizes, reduced L1 access latency, higher µop
    • Moving towards heterogeneous System on Chip designs for laptops
    • Mix of Efficiency and Performance cores in same system
    • Efficiency cores use less power but are slower

  19. Information for Ice Lake (Sunny Cove)
    Register file: 280 Integer, 224 Floating Point
    L1 Data (L1D$): 48 KiB 12-way
    L1 Instruction (L1I$): 32 KiB 8-way
    L2$: 512 KiB 8-way, Inclusive
    L3$ (LLC): 1.5 MiB 12-way, Non-inclusive
    RAM
    Latency labels on the diagram (clock cycles): 1, 5, 5, 40, 12, 50, 150, 300

  20. Beer cache hierarchy
    Register: 🤤 Beer in mouth
    L1: 🍺 Beer in hand
    L2: 🍻 Beer in ice bucket
    L3/LLC: 🗄 Beer in fridge
    Main memory: 🏪 Beer in local store
    SSD: 🚙 Beer in remote store
    Network: 🍺💰 Beer in another country
    Magnetic disk: 🧑🍳 Beer being brewed
    Magnetic Tape: 🌱 Beer being planted
    https://www.slideshare.net/michael_heinrichs/quantum-physics-of-java
    https://netopyr.com/2014/11/28/reactions-to-the-beer-cache-hierarchy/

  21. 💻 lstopo --no-io
    Machine (16GB)
      Package P#0
        L4 (128MB)
          L3 (6144KB)
            L2 (256KB), L1d (32KB), L1i (32KB), Core P#0: PU P#0, PU P#4
            L2 (256KB), L1d (32KB), L1i (32KB), Core P#1: PU P#1, PU P#5
            L2 (256KB), L1d (32KB), L1i (32KB), Core P#2: PU P#2, PU P#6
            L2 (256KB), L1d (32KB), L1i (32KB), Core P#3: PU P#3, PU P#7
    Annotations: Single socket system; Four core processor; Cache levels; HyperThreads; Shared memory between CPU and GPU

  22. https://goodies.dotnetos.org/files/dotnetos-poster-ram.pdf

  23. Information for Ice Lake (Sunny Cove)
    Register file: 280 Integer, 224 Floating Point
    L1 Data (L1D$): 48 KiB 12-way
    L1 Instruction (L1I$): 32 KiB 8-way
    L2$: 512 KiB 8-way, Inclusive
    L3$ (LLC): 1.5 MiB 12-way, Non-inclusive
    RAM
    Data TLB: 4K: 128 8-way; 2M/4M: 64 8-way; 1G: 4 4-way
    Instruction TLB: 4K: 128 8-way; 2M/4M: 16/T assoc; 1G: 4 4-way
    STLB: 4K: 2048 12-way; 2M/4M: 1024 12-way; 1G: 1024 4-way
    Latency labels on the diagram (clock cycles): 1, 5, 5, 40, 12, 50, 150, 300
    Virtual         Physical        PCID
    00008000(1234)  5e38450c(1234)  10
    00008000(1234)  48656c6f(1234)  20
    ff ff ffff ff fb(8080) 2345
    ffff ff b(8080) 0
    grep /proc/cpuinfo for pcid ↑

  24. Memory Pages
    Two layer page table structure shown
    Diagram labels: CR3; 0000, 7fff, 8000, f000, ffaa, ffbb, ffff; 0x000080001234
    x86_64 has 4 level paging (48 bits, 256TiB virtual, 64TiB real)
    Ice Lake processors support 5 level paging (57 bits, 128PiB virtual, 4PiB real)
    Pages can be 4k, 2M or 1G
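
The address widths on the Memory Pages slide follow from the x86_64 layout of a 12-bit offset into a 4K page plus 9 index bits per page table level. The sketch below is only an illustration of that arithmetic; the function names are made up for this example, and the sample address is the one shown on the slide.

```python
PAGE_OFFSET_BITS = 12  # a 4 KiB page needs 12 bits of offset
INDEX_BITS = 9         # each table level holds 512 entries, so 9 bits of index


def virtual_address_bits(levels):
    """Number of virtual address bits covered by `levels` levels of paging."""
    return PAGE_OFFSET_BITS + levels * INDEX_BITS


def split_virtual_address(addr, levels=4):
    """Return ([index at top level, ..., index at lowest level], page offset)."""
    offset = addr & ((1 << PAGE_OFFSET_BITS) - 1)
    indices = []
    for level in range(levels - 1, -1, -1):
        shift = PAGE_OFFSET_BITS + level * INDEX_BITS
        indices.append((addr >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


TIB = 1 << 40
PIB = 1 << 50

print(virtual_address_bits(4), (1 << virtual_address_bits(4)) // TIB, "TiB")
print(virtual_address_bits(5), (1 << virtual_address_bits(5)) // PIB, "PiB")
print(split_virtual_address(0x000080001234))
```

Four levels give 48 bits (256 TiB) and five levels give 57 bits (128 PiB), matching the figures on the slide. The slide's diagram simplifies the split; this sketch assumes real 4K pages.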

  25. Huge Pages
    Pages can be 4K, 2M or 1G
    grep /proc/cpuinfo
      pse: 2M support
      pdpe1gb: 1G support
    👍 Better use of TLB
    👍 Fewer memory cache misses
    👎 More complex to set up
    👎 May waste memory
    👎 hugetlbfs needs to be configured
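
One way to see why huge pages make better use of the TLB is to compute the "reach" of a TLB: the amount of memory it can map without a miss. The sketch below uses the second-level TLB (STLB) entry counts quoted on the earlier Ice Lake slide (2048 entries for 4K pages, 1024 for 2M pages); the helper name is made up for this example.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30


def tlb_reach(entries, page_size):
    """Bytes of memory that `entries` TLB entries can map at a given page size."""
    return entries * page_size


reach_4k = tlb_reach(2048, 4 * KIB)
reach_2m = tlb_reach(1024, 2 * MIB)

print(reach_4k // MIB, "MiB reachable with 4K pages")
print(reach_2m // GIB, "GiB reachable with 2M pages")
```

With half as many entries, 2M pages cover 256 times more memory than 4K pages in this example, which is the TLB benefit the slide refers to.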

  26. 👎 hugetlbfs 👎
    • Requires kernel configuration to reserve memory ahead of time
    • Boot parameter hugepages=N puts aside memory for huge page use
    • Boot parameter hugepagesz={2M,1G} specifies huge page size
    • Requires a hugetlbfs mount to be provided
    • Requires root (or suitably permissioned app) to use hugepages

  27. 👍 Transparent Huge Pages 👎
    • Does not require boot time configuration or special permissions
    • khugepaged assembles contiguous physical memory for large pages
    • Default page size is still 4k, but processes can madvise() use of large pages
    • Allows specific apps to opt-in on demand
    • Benefits of a smaller TLB footprint with less wasted memory
    # echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    Enable opt-in through use of madvise
    # echo defer > /sys/kernel/mm/transparent_hugepage/defrag
    Defer instead of blocking large page request
    💡
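
The madvise() opt-in mentioned above can be sketched from Python, which exposes the call on mmap objects (Python 3.8+) and MADV_HUGEPAGE on Linux builds that define it. This is a minimal sketch, not the talk's own code: `request_huge_pages` is a hypothetical helper, and whether the kernel actually backs the mapping with huge pages depends on the transparent_hugepage settings shown on the slide.

```python
import mmap


def request_huge_pages(length):
    """Map `length` bytes of anonymous memory and hint that huge pages are wanted.

    Returns (buffer, hinted); `hinted` is False when the platform does not
    offer MADV_HUGEPAGE or the kernel rejects the advice.
    """
    buf = mmap.mmap(-1, length)  # anonymous, private mapping
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    if advice is None or not hasattr(buf, "madvise"):
        return buf, False
    try:
        buf.madvise(advice)  # advisory only: the kernel may ignore it
    except OSError:
        return buf, False
    return buf, True


buf, hinted = request_huge_pages(4 * 1024 * 1024)
buf[0:5] = b"hello"  # touching the memory is what triggers allocation
print("huge page hint accepted:", hinted)
```

The hint is a request, not a guarantee, so any benefit should be confirmed by measurement.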

  28. Cache lines, loads and stores
    • Unit of granularity of a cache entry is 64/128 bytes (512/1024 bits)
    • Even if you only read/write 1 byte, the whole 64/128 byte line is read/written
    • Cache lines can generally be in different states:
      ➡ M – exclusively owned by that core, and modified (dirty)
      ➡ E – exclusively owned by that core, but not modified
      ➡ S – shared read-only with other cores
      ➡ I – invalid, cache line not used
    Various extensions to MESI exist … see https://en.wikipedia.org/wiki/MESI_protocol for more

  29. Memory prefetching (CPU)
    CPU issues automatic prefetch for streamed data
    Also notices striding by certain amounts as well
    💡 Can also use __builtin_prefetch to explicitly suggest prefetching memory elsewhere but needs to be a measured improvement

  30. False sharing
    • Two cores trying to write to bytes in the same cache-line will thrash
    • First thread will try to acquire exclusive ownership of cache line
    • Second thread (on different core) will try to do the same
    • Performance will suffer when cache line repeatedly moved
    • Avoid by padding to at least cacheline size * 2 (128/256 bytes) for writes
    Thread 1: data[0] = 'A'
    Thread 2: data[7] = 'C'
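
The data[0] / data[7] example can be checked with a little address arithmetic. The sketch below assumes a 64-byte cache line and a hypothetical line-aligned base address for data; the helper names are invented for this illustration.

```python
CACHE_LINE = 64  # bytes; assumed line size (the slide mentions 64 or 128)


def cache_line_of(address, line_size=CACHE_LINE):
    """Index of the cache line that holds `address`."""
    return address // line_size


def shares_line(a, b, line_size=CACHE_LINE):
    """True when two addresses fall in the same cache line."""
    return cache_line_of(a, line_size) == cache_line_of(b, line_size)


def padded_offset(slot, line_size=CACHE_LINE, lines_per_slot=2):
    """Byte offset of per-thread slot `slot` when padded to 2 cache lines."""
    return slot * line_size * lines_per_slot


base = 0x1000  # hypothetical, line-aligned start of data[]

print(shares_line(base + 0, base + 7))
print(shares_line(base + padded_offset(0), base + padded_offset(1)))
```

The first check shows that the two threads on the slide write to the same line; the second shows that slots padded to twice the line size never do.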

  31. Memory performance strategies
    • Ensure data fits in L1/L2/L3 cache where possible
    • Stream or stride through data in a single pass if possible
    • Consider pivoting data (array-of-structs or structs-of-arrays)
    • Add padding for multi-threaded contended writes
    • Prefer thread-local or cpu-local accumulators with final merge step
    • Compress data where practical (compressed pointers)
    💡
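
The "thread-local accumulators with final merge step" strategy has roughly the following shape. This is a structural sketch only: the function names are invented, and in CPython the global interpreter lock means it illustrates the pattern rather than the cache behaviour or speedup a native implementation would see.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Accumulate into a variable private to this worker (no shared writes)."""
    total = 0
    for value in chunk:
        total += value
    return total


def parallel_sum(values, workers=4):
    """Split the input, sum each chunk independently, then merge once at the end."""
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))  # single final merge step


print(parallel_sum(list(range(1000))))
```

Each worker touches only its own accumulator, so the only shared write is the merge at the end.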

  32. Pinning memory/threads
    • Pinning memory or threads to a particular core can improve performance
    • Reduces inter-core memory ownership traffic
    • Less likely to have cache invalidations
    • isolcpus allows reservation of CPUs for non-kernel use with cpusets
    • taskset allows binding of a process to specific cores
    • numactl allows cores/memory to be clamped for a process
    • libnuma has additional affinity settings for programmatic use
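
Programmatic pinning, similar in effect to taskset, can be sketched with the Linux-only affinity calls in Python's os module. `run_pinned` is a hypothetical helper written for this example, and it falls back to running unpinned where the calls are unavailable.

```python
import os


def run_pinned(cpu, fn):
    """Run fn() with the current process restricted to one CPU, then restore.

    On platforms without sched_setaffinity the function is simply called.
    """
    if not hasattr(os, "sched_setaffinity"):
        return fn()
    previous = os.sched_getaffinity(0)  # 0 means the calling process
    try:
        os.sched_setaffinity(0, {cpu})
        return fn()
    finally:
        os.sched_setaffinity(0, previous)  # always restore the original mask
```

A caller would pick a CPU from `os.sched_getaffinity(0)` and pass the work as a function, for example `run_pinned(cpu, lambda: sum(range(1000)))`. Whether pinning helps is workload dependent and should be measured.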

  33. Core
    x86_64
    Frontend
    µop 5/cycle
    Backend
    L1 Data: 48 KiB 12-way
    L1 Instruction: 32 KiB 8-way

  34. Core
    x86_64
    Pre-decode, Instructions, µop decoders
    µop cache: 2304 8-way
    loop decode
    branch prediction
    256/5120 entry
    µop 5/cycle
    Backend
    L1 Data: 48 KiB 12-way
    L1 Instruction: 32 KiB 8-way

  35. Core (diagram as on slide 34), instruction bytes:
    55 48 89 e5 fe 04 25 d2
    04 00 00 41 6c 42 6c 75

  36. Core (diagram as on slide 34), instruction bytes split at instruction boundaries:
    55|48 89 e5|fe 04 25 d2
    04 00 00|41 6c 42 6c 75

  37. Core (diagram as on slide 34), decoded instructions:
    push %rbp
    mov %rsp, %rbp
    incb 0x4d2
    ?

  38. Core (diagram as on slide 34), instruction moving to the µop decoders:
    incb 0x4d2

  39. Core (diagram as on slide 34), µops:
    TMP = [0x4d2]
    INC TMP
    [0x4d2] = TMP

  40. Core (diagram as on slide 34), µops:
    TMP = [0x4d2] INC TMP [0x4d2] = TMP

  41. Branch Prediction ⤵🧐
    • Correct 95% of the time
    • Queues up instructions assuming the branch has been taken
    • Learns patterns in code based on existing behaviour
    • Iterating through predictable (sorted) data may be more efficient
    • Throws away inaccurate work if incorrect
    • May cause observable side channel behaviour e.g. cache invalidation 👻
    • Unrolled loops and inlining avoid branches, so improve performance
    cmp eax,42; jne
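
The point about sorted data can be illustrated with a toy model. The sketch below simulates a single 2-bit saturating counter, a textbook predictor that is far simpler than what real CPUs use, and feeds it the outcomes of a hypothetical `if v >= 128` branch over the same data in shuffled and sorted order.

```python
import random


def prediction_accuracy(outcomes):
    """Fraction of branches a single 2-bit saturating counter predicts correctly."""
    counter = 0  # 0 or 1 predicts not taken, 2 or 3 predicts taken
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        # Move one step towards the observed outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)


def branch_outcomes(values):
    """Outcome of the hypothetical branch `if v >= 128` for each value."""
    return [v >= 128 for v in values]


rng = random.Random(42)  # fixed seed so the run is repeatable
data = [rng.randrange(256) for _ in range(10_000)]

shuffled = prediction_accuracy(branch_outcomes(data))
ordered = prediction_accuracy(branch_outcomes(sorted(data)))
print(f"shuffled: {shuffled:.3f}  sorted: {ordered:.3f}")
```

On sorted input the branch changes direction once, so the toy predictor is wrong only around that transition; on shuffled input it does little better than guessing.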

  42. Branch Target Predictor 🎯
    • Predicts where the branch is going if taken
    • Hard coded addresses/offsets always predictable
    • Jump to location of register may be more difficult
    • Often seen when jumping through object oriented code
    • Inlining is the master optimisation because it avoids unknowable branches
    • Dan Luu has a good write up https://danluu.com/branch-prediction/
    jmp [eax]

  43. Pipeline stalls
    • CPUs are deeply pipelined (15-20 stages)
    • A change to the PC (program counter/instruction pointer) causes a flush
    • Gaps occur after jump due to pipeline refilling
    • Good branch predictor guesses correctly to avoid this
    • HyperThreading can take advantage of 'empty' slots

  44. Pipeline stalls
    Axes: Instructions against Time
    F D X W
    F D X W
    F D X W
    F D X W
    F D X W
    F D X
    Jump instruction 👻
    Jump causes stall while buffer fills when branch not predicted
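
A back-of-the-envelope model of the diagram: in an idealised pipeline, one instruction completes per cycle once the pipeline is full, and each unpredicted jump costs a refill. The sketch below assumes the four stages drawn on the slide (F D X W) and a refill penalty of three cycles; both the penalty and the function name are assumptions made for this illustration, and real cores are much deeper.

```python
def pipeline_cycles(instructions, stages=4, flushes=0, refill_penalty=3):
    """Cycles to run `instructions` through an idealised in-order pipeline.

    The first instruction takes `stages` cycles, each later one completes a
    cycle after its predecessor, and every flush adds `refill_penalty` cycles.
    """
    if instructions == 0:
        return 0
    return stages + (instructions - 1) + flushes * refill_penalty


print(pipeline_cycles(6))             # no mispredicted jumps
print(pipeline_cycles(6, flushes=1))  # one jump that was not predicted
```

Even in this simple model a single flush adds a third to the runtime of the six-instruction sequence, which is why a deep pipeline leans so heavily on the branch predictor.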

  45. Core
    Frontend
    Allocate, Rename, Retire
    reorder: 352 entry
    load buffer: 128 entry
    store buffer: 72 entry
    register files: 280 + 224
    Scheduler ports: 2 3 4 7 8 9 0 1 5 6
    Address generation
    Integer Unit | Floating Unit
    ALU LEA Shift Branch | ALU FMA Shift Divide
    ALU LEA Multiply Divide | ALU FMA Shift Shuffle
    ALU LEA Multiply | ALU FMA Shuffle
    ALU LEA Shift Branch
    Execution units added in Ice Lake
    Port 8 and 9 added in Ice Lake
    Port 0 and 1 can be fused for a 512 bit operation
    Port 5 is a 512 bit wide operation
    All others handle 256 bits
    L1 Data: 48 KiB 12-way
    L1 Instruction: 32 KiB 8-way

  46. Core (diagram as on slide 45)
    TMP = [0x4d2] INC TMP [0x4d2] = TMP

  47. Core (diagram as on slide 45)
    R99 = [0x4d2] INC R99 [0x4d2] = R99

  48. Core (diagram as on slide 45)
    R99 = [0x4d2] INC R99
    [0x4d2] = R99

  49. Core (diagram as on slide 45)
    R99 = 2A INC R99
    [0x4d2] = R99

  50. Core (diagram as on slide 45)
    R99 = 2A
    INC R99
    [0x4d2] = R99

  51. Core (diagram as on slide 45)
    INC R99
    [0x4d2] = R99
    R99 = 2B

  52. Core (diagram as on slide 45)
    INC R99
    R99 = 2B
    [0x4d2] = 2B

  54. Micro-architecture diagrams
    https://drive.google.com/drive/folders/1W4CIRKtNML74BKjSbXerRsIzAUk3ppSG
    https://twitter.com/Cardyak

  55. perf
    • Linux perf (compiled from linux/tools/perf, or from linux-tools/linux-perf)
    • Running in Docker may require compilation from source if kernel mismatch
    • Commands available
    • record – record execution performance for process/pid
    • report – generate a report from prior recording
    • annotate – annotate a report from a prior recording
    • stat – record performance counters for process/pid
    https://perf.wiki.kernel.org
    https://github.com/alblue/scripts/blob/master/perf-Dockerfile

  56. JMH and perf
    • The Java Microbenchmark Harness supports perf natively on Linux
    • Can also run on Windows using a different mechanism
    • perf - runs and collects perf statistics
    • perfnorm - collects perf statistics and normalises them
    • perfasm - shows assembly output of perf
    https://github.com/openjdk/jmh

  57. Async profiler and perf
    • The Async Profiler for Java supports perf output for native JIT
    • When running in containers need to make sure the perfmap is exported
    • perf-map-agent - Java agent to generate JIT maps for use with perf
    • Can be used to generate mixed-mode flame graphs (JIT, Java etc.)
    https://github.com/jvm-profiling-tools

  58. perf record
    • Perf record will sample the process(es) and generate stack traces
    • Events may be skewed from their location
    • Improve accuracy with :p, :pp or :ppp suffix to event
    • Can capture branches, last branch records or use processor tracing
    • perf record -b program
    • perf record --call-graph lbr -j any_call,any_ret program
    • perf record -e intel_pt//u program
    https://lwn.net/Articles/680985/
    https://lwn.net/Articles/680996/

  59. perf stat
    $ perf stat base64 <(echo world)
    d29ybGQK

    Performance counter stats for 'base64 /dev/fd/63':

         0.341382  task-clock (msec)          # 0.649 CPUs utilized
                0  context-switches           # 0.000 K/sec
                0  cpu-migrations             # 0.000 K/sec
               65  page-faults                # 0.190 M/sec
        1,218,176  cycles                     # 3.568 GHz
          811,468  stalled-cycles-frontend    # 66.61% frontend cycles idle
          855,999  instructions               # 0.70 insn per cycle
                                              # 0.95 stalled cycles per insn
          169,032  branches                   # 495.140 M/sec
            8,883  branch-misses              # 5.26% of all branches

    0.000526160 seconds time elapsed

    IPC: > 4 👍, < 1 👎
    https://perf.wiki.kernel.org
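
The derived figures in the right-hand column of the perf stat output are simple ratios of the raw counters. The sketch below recomputes them from the counts on the slide; `derive_metrics` and the dictionary layout are made up for this example and are not part of perf.

```python
def derive_metrics(counters):
    """Recompute the ratios perf stat prints next to the raw counts."""
    cycles = counters["cycles"]
    instructions = counters["instructions"]
    stalled = counters["stalled-cycles-frontend"]
    return {
        "insn per cycle": instructions / cycles,
        "frontend cycles idle %": 100 * stalled / cycles,
        "stalled cycles per insn": stalled / instructions,
        "branch miss %": 100 * counters["branch-misses"] / counters["branches"],
    }


# Raw counts copied from the perf stat output on the slide.
sample = {
    "cycles": 1_218_176,
    "stalled-cycles-frontend": 811_468,
    "instructions": 855_999,
    "branches": 169_032,
    "branch-misses": 8_883,
}

for name, value in derive_metrics(sample).items():
    print(f"{name}: {value:.2f}")
```

An IPC of about 0.7 falls on the wrong side of the slide's rule of thumb, although a run this short is dominated by process startup.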

  60. Performance counters
    • Intel cores have a few dedicated and programmable counters
    • Instruction cycles, branches, branch misses …
    • Counters can be multiplexed (read X for 1µs, read Y for 1µs)
    • Programmable counters can be set to specific measurements
    • iTLB-load-misses, LLC-load-misses, uops_dispatched_port.port_5 ...
    • Undocumented performance counters can be specified with events
    • cpu/event=0x3c,umask=0x0,any=1/
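
An event string such as cpu/event=0x3c,umask=0x0,any=1/ is packed into a single configuration value. The sketch below assumes the usual Intel layout, with the event select in bits 0-7, the unit mask in bits 8-15 and the any-thread flag in bit 21; on a given machine the authoritative layout is whatever the kernel lists under /sys/bus/event_source/devices/cpu/format/. The function name is invented for this example.

```python
ANY_THREAD_BIT = 21  # assumed position of the any-thread flag


def raw_event_config(event, umask=0, any_thread=False):
    """Pack event select, unit mask and any-thread flag into one config value."""
    config = (event & 0xFF) | ((umask & 0xFF) << 8)
    if any_thread:
        config |= 1 << ANY_THREAD_BIT
    return config


# The event from the slide: event=0x3c, umask=0x0, any=1
print(hex(raw_event_config(0x3C, 0x0, any_thread=True)))
```

Not every processor generation supports the any-thread flag, so treat the bit layout here as an assumption to check against the target machine.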

  61. @alblue
    22
    ©2022 Alex Blewitt
    19
    Locating Issues
    Have Precise events for sampling
    Precise events added in Skylake
    Top-down Microarchitecture Analysis
    https://www.researchgate.net/publication/269302126_A_Top-Down_method_for_performance_analysis_and_counters_architecture
    Ahmed Yasin

  62. Top-down Analysis Method
    USING PERFORMANCE MONITORING EVENTS
    Additionally, the metric uses the UOPS_ISSUED.ANY, which is common in recent Intel microarchitectures,
    as the denominator. The UOPS_ISSUED.ANY event counts the total number of Uops that the RAT
    issues to RS.
    The VectorMixRate metric gives the percentage of injected blend uops out of all uops issued. Usually a
    VectorMixRate over 5% is worth investigating.
    VectorMixRate[%] = 100 * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY
    Note the actual penalty may vary as it stems from the additional data-dependency on the destination
    register the injected blend operations add.
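
The VectorMixRate formula and its 5% rule of thumb translate directly into code. A small sketch, with hypothetical function names and invented counter values:

```python
def vector_mix_rate(width_mismatch_uops, issued_uops):
    """VectorMixRate[%] = 100 * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY."""
    if issued_uops == 0:
        return 0.0  # nothing was issued, so no blend uops were injected either
    return 100.0 * width_mismatch_uops / issued_uops


def worth_investigating(rate_pct, threshold_pct=5.0):
    """Apply the 'over 5%' guideline quoted in the excerpt."""
    return rate_pct > threshold_pct


# Invented counts for illustration only.
rate = vector_mix_rate(width_mismatch_uops=60_000, issued_uops=1_000_000)
print(rate, worth_investigating(rate))
```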
    B.2 PERFORMANCE MONITORING AND MICROARCHITECTURE
    This section provides information of performance monitoring hardware and terminology related to the
    Silvermont, Airmont and Goldmont microarchitectures. The features described here may be specific to
    individual microarchitecture, as indicated in Table B-1.
    Figure B-3. TMAM Hierarchy Supported by Skylake Microarchitecture
    [Figure: Pipeline Slots divide into Retiring, Bad Speculation, Frontend Bound and Backend Bound. Retiring: Base, MS ROM. Bad Speculation: Branch Mispredict, Machine Clear. Frontend Bound: Fetch Latency, Fetch Bandwidth. Backend Bound: Core Bound (Divider, Execution ports Utilization) and Memory Bound (L1, L2, L3, Ext. Memory and Stores Bound).]
    The single entry point of division at a pipeline’s issue-stage (allocation-stage) makes the four categories
    additive to the total possible slots. The classification at slots granularity (sub-cycle) makes the
    breakdown very accurate and robust for superscalar cores, which is a necessity at the top-level.
    Figure B-2. TMAM’s Top Level Drill Down Flowchart
    [Flowchart: Uop allocated? If yes: does the uop ever retire? Yes: Retiring; no: Bad Speculation. If not allocated: back-end stalls? Yes: Backend Bound; no: Frontend Bound.]
    https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual
    Ahmed Yasin
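
Because the four top-level categories are additive over the available issue slots, they can be computed from a handful of counters. The sketch below assumes a 4-wide core and the level-1 formulas as published in Yasin's top-down paper; event names and the pipeline width differ between microarchitectures, and the sample counts are invented:

```python
PIPELINE_WIDTH = 4  # assumption: 4 issue slots per cycle


def topdown_level1(clk_unhalted, idq_uops_not_delivered, uops_issued,
                   uops_retired_slots, recovery_cycles):
    """Split all issue slots into the four top-level top-down categories."""
    slots = PIPELINE_WIDTH * clk_unhalted
    frontend = idq_uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + PIPELINE_WIDTH * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever is left over was blocked by the back end.
    backend = 1.0 - (frontend + bad_speculation + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Invented counts: 1000 cycles, so 4000 slots in total.
level1 = topdown_level1(
    clk_unhalted=1_000,
    idq_uops_not_delivered=1_200,
    uops_issued=1_700,
    uops_retired_slots=1_600,
    recovery_cycles=25,
)
for category, share in level1.items():
    print(f"{category}: {share:.1%}")
```

Backend Bound is computed as the remainder, so the four shares always sum to 100% of the slots.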

  63. perf stat --topdown
    $ perf stat -a --topdown sleep 1
    nmi_watchdog enabled with topdown. May give wrong results.
    Disable with echo 0 > /proc/sys/kernel/nmi_watchdog

    Performance counter stats for 'system wide':

               retiring  bad speculation  frontend bound  backend bound
    S0-C0   2     15.3%             2.8%           32.1%          49.9%
    S0-C1   2     23.3%             4.0%           27.3%          45.4%
    S0-C2   2     15.2%             2.9%           29.8%          52.1%
    S0-C3   2     16.7%             0.0%           31.8%          51.5%
    S0-C4   2     35.7%            10.7%           26.2%          27.4%
    S0-C5   2     14.9%             2.5%           34.1%          48.5%

    1.000889285 seconds time elapsed
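
Output in this shape is easy to post-process. A rough sketch that reads the per-core rows and reports which category dominates each core; the parser is hypothetical, relies only on each data row carrying four percentages, and uses two rows from the slide as sample input:

```python
CATEGORIES = ("retiring", "bad speculation", "frontend bound", "backend bound")


def parse_topdown(text):
    """Map each core label to its four top-level percentages."""
    rows = {}
    for line in text.splitlines():
        tokens = line.split()
        percents = [t for t in tokens if t.endswith("%")]
        if len(percents) != len(CATEGORIES):
            continue  # header, warning or timing line
        values = (float(p.rstrip("%")) for p in percents)
        rows[tokens[0]] = dict(zip(CATEGORIES, values))
    return rows


def dominant(rows):
    """Pick the category with the largest share for each core."""
    return {core: max(shares, key=shares.get) for core, shares in rows.items()}


SAMPLE = """\
retiring bad speculation frontend bound backend bound
S0-C0 2 15.3% 2.8% 32.1% 49.9%
S0-C4 2 35.7% 10.7% 26.2% 27.4%
"""

print(dominant(parse_topdown(SAMPLE)))
```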


  64. Toplev PMU tools
    • Andi Kleen has written toplev.py, which automates top-down analysis

    • The first run downloads and caches processor event lists from download.01.org

    • Uses perf to record stats, but with custom event filters

    • If the workload is repeatable, use --no-multiplex to rerun it as often as needed instead of multiplexing counters

    • Run with -l1, see if issues are present, then run with -l2 ...
    https://github.com/andikleen/pmu-tools/wiki/toplev-manual

  65. toplev.py --single-thread
    $ dd if=/dev/urandom of=/tmp/rand bs=4096 count=4096

    $ ./toplev.py --single-thread --no-multiplex -l1 -- base64 /tmp/rand > /dev/null
    # 3.6-full on Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
    BE       Backend_Bound                                  % Slots   24.07 <==

    $ ./toplev.py --single-thread --no-multiplex -l2 -- base64 /tmp/rand > /dev/null
    BE       Backend_Bound                                  % Slots   23.82
    BE/Core  Backend_Bound.Core_Bound                       % Slots   16.08 <==

    $ ./toplev.py --single-thread --no-multiplex -l3 -- base64 /tmp/rand > /dev/null
    BE       Backend_Bound                                  % Slots   23.96
    BE/Core  Backend_Bound.Core_Bound                       % Slots   16.35
    BE/Core  Backend_Bound.Core_Bound.Ports_Utilization     % Clocks  24.51 <==
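
Each level narrows the search: the `<==` marker points at the node worth drilling into on the next run. A minimal sketch of picking that node out of the text, assuming the line layout shown on the slide (toplev can also emit other formats, which this hypothetical parser ignores):

```python
def parse_toplev(text):
    """Return (node, value, flagged) for every metric line in toplev's text output."""
    results = []
    for line in text.splitlines():
        tokens = line.split()
        flagged = bool(tokens) and tokens[-1] == "<=="
        if flagged:
            tokens = tokens[:-1]
        if len(tokens) < 4 or "%" not in tokens:
            continue  # command lines, the version banner, blank lines
        try:
            value = float(tokens[-1])
        except ValueError:
            continue
        results.append((tokens[1], value, flagged))
    return results


def bottlenecks(results):
    """Names of the nodes toplev marked with '<=='."""
    return [name for name, _value, flagged in results if flagged]


SAMPLE = """\
# 3.6-full on Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
BE Backend_Bound % Slots 23.96
BE/Core Backend_Bound.Core_Bound % Slots 16.35
BE/Core Backend_Bound.Core_Bound.Ports_Utilization % Clocks 24.51 <==
"""

print(bottlenecks(parse_toplev(SAMPLE)))
```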


  66. Instruction layout
    [Figure: two cache-line layouts of the blocks Before, Is Error?, Error and After, one built with __builtin_expect(error,0) and one with __builtin_expect(error,1)]

  67. Loop stream detector
    [Figure: two adjacent cache lines, showing a "Good" loop placement and a "Bad" loop placement]
    32 byte alignment
    Align with

    -mllvm -align-all-nofallthru-blocks=5

    -mllvm -align-all-functions=5
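
The value 5 in these flags is a log2 alignment, so it requests 32-byte boundaries. The arithmetic behind placing a loop is sketched below, assuming 64-byte cache lines and an invented loop address and size; the helper names are hypothetical:

```python
CACHE_LINE = 64  # assumption: 64-byte cache lines, as on current x86_64 parts


def alignment_bytes(log2_value):
    """Convert the log2 flag value into a byte alignment."""
    return 1 << log2_value


def align_up(address, alignment):
    """Round an address up to the next multiple of a power-of-two alignment."""
    return (address + alignment - 1) & ~(alignment - 1)


def lines_touched(start, size, line_size=CACHE_LINE):
    """Count the cache lines covered by the byte range [start, start + size)."""
    first = start // line_size
    last = (start + size - 1) // line_size
    return last - first + 1


# Invented example: a 24-byte loop body that starts 48 bytes into a cache line.
loop_start, loop_size = 0x1030, 24
aligned_start = align_up(loop_start, alignment_bytes(5))

print(lines_touched(loop_start, loop_size))     # unaligned placement
print(lines_touched(aligned_start, loop_size))  # after 32-byte alignment
```

Padding to the next 32-byte boundary costs a few bytes of code size, which is the trade-off for keeping a small loop inside one line.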

  68. Facebook BOLT
    https://arxiv.org/abs/1807.06735
    Figure 9: Heat maps for instruction memory accesses of the HHVM binary, without and with BOLT. Heat is a log scale.
    Executed instructions are distributed across icache space
    After sorting basic blocks guided by profiling data, the icache space is defragmented
    https://github.com/facebookincubator/BOLT
    https://github.com/llvm/llvm-project/commit/4c106cfdf7cf7eec861ad3983a3dd9a9e8f3a8ae
    Now merged into the LLVM project!

  69. Google llvm-propeller
    https://github.com/google/llvm-propeller
    https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
    Exe perf.data perf.propeller Optimised exe
    C C
    perf record
    clang -fpropeller-label
    create_llvm_prof clang -fpropeller-optimize
    func1() {…}
    func2() {…}
    func3() {…}

    func1() {…}
    func2() {…}
    func3() {…}
    clang -ffunction-sections
    clang -fbasicblock-sections=perf.propeller lld + thinLTO + PGO
    Obsolete?!

  70. SIMD JSON parser
    https://github.com/lemire/simdjson
    https://arxiv.org/abs/1902.08318
    <- over 2.5 GB/s
    https://simdjson.org

  71. Summary: Memory
    • Use cacheline-aligned or cacheline-aware data structures

    • Compress data in memory and decompress on the fly

    • Avoid random memory access when possible

    • Configure huge pages and use madvise & defer

    • Partition memory with libnuma for data locality
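
To make the first point concrete, the sketch below uses ctypes to lay out two counters as a C struct would and checks whether they land on the same cache line. It assumes 64-byte lines and a struct that itself starts on a line boundary; the struct and field names are invented:

```python
import ctypes

CACHE_LINE = 64  # assumption: 64-byte cache lines


class Counters(ctypes.Structure):
    """Two hot fields packed next to each other: they share a cache line."""
    _fields_ = [
        ("produced", ctypes.c_uint64),
        ("consumed", ctypes.c_uint64),
    ]


class PaddedCounters(ctypes.Structure):
    """Same fields, with padding so that each one gets its own cache line."""
    _fields_ = [
        ("produced", ctypes.c_uint64),
        ("_pad", ctypes.c_uint8 * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("consumed", ctypes.c_uint64),
    ]


def line_index(struct_type, field_name):
    """Cache line (relative to the start of the struct) that holds the field."""
    return getattr(struct_type, field_name).offset // CACHE_LINE


for layout in (Counters, PaddedCounters):
    shared = line_index(layout, "produced") == line_index(layout, "consumed")
    print(layout.__name__, ctypes.sizeof(layout), "shared line" if shared else "separate lines")
```

If two threads each update one of the fields, the packed layout is the one at risk of false sharing; the padded layout trades 56 bytes of memory for independent lines.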

  72. Summary: CPU
    • Each CPU is its own networked mesh cluster

    • Branch mispredictions and memory/TLB misses are costly

    • Use branch-free and lock-free algorithms when possible

    • Analyse perf counters with top-down microarchitecture analysis

    • Use (auto)vectorisation and use XMM/YMM/ZMM when sensible

  73. References
    https://danluu.com/branch-prediction/

    https://arxiv.org/abs/1807.06735 → https://github.com/facebookincubator/BOLT

    https://arxiv.org/abs/1902.08318 → https://github.com/lemire/simdjson/

    https://github.com/andikleen/pmu-tools/wiki/toplev-manual

    https://github.com/google/llvm-propeller/

    https://lwn.net/Articles/680985/ && https://lwn.net/Articles/680996/

    https://perf.wiki.kernel.org

    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf

    https://www.nextplatform.com/2021/04/19/deep-dive-into-intels-ice-lake-xeon-sp-architecture/

    https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual

    https://www.researchgate.net/publication/269302126_A_Top-Down_method_for_performance_analysis_and_counters_architecture

    https://drive.google.com/drive/folders/1W4CIRKtNML74BKjSbXerRsIzAUk3ppSG → https://twitter.com/Cardyak

    https://goodies.dotnetos.org/files/dotnetos-poster-ram.pdf → https://twitter.com/dotnetosorg

  74. Links
    https://easyperf.net/notes/

    https://epickrram.blogspot.com/

    https://groups.google.com/forum/#!forum/mechanical-sympathy/

    https://lemire.me/en/

    https://psy-lob-saw.blogspot.com/

    https://richardstartin.github.io/

    https://travisdowns.github.io/

    https://www.agner.org/optimize/

    https://www.real-logic.co.uk/

  75. Thanks
    @andreipangin

    @cardyak

    @chriswhocodes

    @danluu

    @dendibakh

    @dotnetosorg

    @epickrram

    @holly_cummins

    @jChampionsConf

    @Java_Champions

    @javajuneau

    @javaperftuning

    @lemire

    @mjpt777

    @mon_beck

    @net0pyr

    @nitsanw

    @perfsummit1

    @richardstartin

    @shipilev

    @sharat_chandler

    @trav_downs

    And many more …

  76. Thank you
    🗒 https://alblue.bandlem.com

    🦉 https://twitter.com/alblue

    🐙 https://github.com/alblue

    📺 https://vimeo.com/alblue

    📇 https://speakerdeck.com/alblue
