Understanding CPU Microarchitecture for Performance (JChampionsConf)

alblue
January 20, 2022

Microprocessors have evolved over decades to eke out performance from existing code. But the microarchitecture of the CPU leaks into the assumptions of a flat memory model, with the result that equivalent code can run significantly faster by working with, rather than fighting against, the microarchitecture of the CPU.

This talk, given for the JChampionsConf in 2022, presents the microarchitecture of modern CPUs. It shows how misaligned data can cause cache line false sharing, how branch prediction works and when it fails, and how to read CPU-specific performance monitoring counters and use them in conjunction with tools like perf and toplev to discover where the bottlenecks in CPU-heavy code live. We’ll use these facts to revisit performance advice on general code patterns and the things to look out for in executing systems. The talk will be language agnostic, although it will be based on the Linux/x86_64 architecture.

The presentation was recorded at the JChampionsConf meeting in January 2022, and a recording is available here: https://youtu.be/Pa_l3aHCoGc

Transcript

  1. @alblue 22 ©2022 Alex Blewitt Overview

    • What happens inside a CPU?
    • Where do CPU intensive programs get delayed?
    • What tools are there to help measure performance bottlenecks?
    • How can we make programs run faster?
  2. Performance Pyramid

    Layers: distributed architecture, system architecture, algorithm, hardware, memory, cpu, inst
    Braces mark the layers covered by "This talk" and by "Other talks"
  3. Platform Topologies

    Intel® Xeon® Scalable Processor supports configurations ranging from 2S-2UPI to 8S
    Diagram: 2S configurations (2S-2UPI & 2S-3UPI shown), 4S configurations (4S-2UPI & 4S-3UPI shown) and an 8S configuration of SKL sockets linked by Intel® UPI, with LBG chipsets attached over DMI x4, plus 3x16 PCIe* and 1x100G Intel® OP Fabric per socket
    Non Uniform Memory Architecture (NUMA)
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
  7. Cascade/Skylake 28-core die

    Sub NUMA cluster 0
    Sub NUMA cluster 1
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
  8. Cascade 56 core package

    Package: Die, Die
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
  9. Ice Lake SP 28 core

    https://hc32.hotchips.org/assets/program/conference/day1/HotChips2020_Server_Processors_Intel_Irma_ICX-CPU-final3.pdf
  10. Ice Lake SP 40 core

    Sunny Cove core
    https://www.nextplatform.com/2021/04/19/deep-dive-into-intels-ice-lake-xeon-sp-architecture/
  11. Alder Lake/Sapphire Rapids

    • Alder Lake is the successor to Ice Lake, built on Willow/Golden Cove
    • Server parts will be available as Sapphire Rapids
    • Desktop/laptop cores exist under the Alder Lake name
    • Increases in cache sizes, reduced L1 access latency, higher µop
    • Moving towards heterogeneous System on Chip designs for laptops
    • Mix of Efficiency and Performance cores in the same system
    • Efficiency cores use less power but are slower
  12. Information for Ice Lake (Sunny Cove)

    Register file: 280 Integer, 224 Floating Point
    L1 Data (L1D$): 48 KiB 12-way
    L1 Instruction (L1I$): 32 KiB 8-way
    L2$: 512 KiB 8-way, Inclusive
    L3$ (LLC): 1.5 MiB 12-way, Non-inclusive
    RAM
    Clock Cycles labelled on the diagram: 1🕐 5🕐 5🕐 40🕐 12🕐 50🕐 150🕐 300🕐
  13. Beer cache hierarchy

    Register: 🤤 Beer in mouth
    L1: 🍺 Beer in hand
    L2: 🍻 Beer in ice bucket
    L3/LLC: 🗄 Beer in fridge
    Main memory: 🏪 Beer in local store
    SSD: 🚙 Beer in remote store
    Network: ✈ Beer in another country
    Magnetic disk: 🧑🍳 Beer being brewed
    Magnetic Tape: 🌱 Beer being planted
    https://www.slideshare.net/michael_heinrichs/quantum-physics-of-java
    https://netopyr.com/2014/11/28/reactions-to-the-beer-cache-hierarchy/
  14. 💻 lstopo --no-io

    Machine (16GB)
      Package P#0
        L4 (128MB)
          L3 (6144KB)
            L2 (256KB)  L1d (32KB)  L1i (32KB)  Core P#0: PU P#0, PU P#4
            L2 (256KB)  L1d (32KB)  L1i (32KB)  Core P#1: PU P#1, PU P#5
            L2 (256KB)  L1d (32KB)  L1i (32KB)  Core P#2: PU P#2, PU P#6
            L2 (256KB)  L1d (32KB)  L1i (32KB)  Core P#3: PU P#3, PU P#7
    Annotations: Single socket system, Shared memory between CPU and GPU, Cache levels, Four core processor, HyperThreads
  15. Information for Ice Lake (Sunny Cove)

    Register file: 280 Integer, 224 Floating Point
    L1 Data (L1D$): 48 KiB 12-way
    L1 Instruction (L1I$): 32 KiB 8-way
    L2$: 512 KiB 8-way, Inclusive
    L3$ (LLC): 1.5 MiB 12-way, Non-inclusive
    Data TLB: 4K: 128 8-way, 2M/4M: 64 8-way, 1G: 4 4-way
    Instruction TLB: 4K: 128 8-way, 2M/4M: 16/T assoc, 1G: 4 4-way
    STLB: 4K: 2048 12-way, 2M/4M: 1024 12-way, 1G: 1024 4-way
    RAM
    Clock Cycles labelled on the diagram: 1🕐 5🕐 5🕐 40🕐 12🕐 50🕐 150🕐 300🕐

    Virtual                  Physical               PCID
    00008000(1234)           5e38450c(1234)         10
    00008000(1234)           48656c6f(1234)         20
    ff ff ffff ff fb(8080)   2345 ffff ff b(8080)   0

    grep /proc/cpuinfo for pcid ↑
  16. Memory Pages

    Diagram: two layer page table structure rooted at CR3 (labels 0000, 7fff, 8000, f000, ffff, ffaa, ffbb), translating the example address 0x000080001234
    x86_64 has 4 level paging (48 bits, 256 TiB virtual, 64 TiB real)
    Ice Lake processors support 5 level paging (57 bits, 128 PiB virtual, 4 PiB real)
    Pages can be 4K, 2M or 1G
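To make the paging arithmetic concrete, here is a minimal sketch of how a virtual address splits into per-level table indices. It assumes the usual x86_64 layout of 4 KiB pages (12 offset bits) and 9 index bits per level; the function names are illustrative, not part of any real API.

```python
PAGE_SHIFT = 12   # assumption: 4 KiB pages, so the low 12 bits are the page offset
INDEX_BITS = 9    # assumption: each page table level indexes 512 entries

def split_virtual_address(vaddr, levels=4):
    """Return ([top-level index, ..., last-level index], offset within the page)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(levels - 1, -1, -1):
        shift = PAGE_SHIFT + INDEX_BITS * level
        indices.append((vaddr >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset

def addressable_bytes(levels):
    """Size of the virtual address space covered by this many paging levels."""
    return 1 << (PAGE_SHIFT + INDEX_BITS * levels)

indices, offset = split_virtual_address(0x000080001234)
print(indices, hex(offset))
print(addressable_bytes(4) // 2**40, "TiB with 4 levels")
print(addressable_bytes(5) // 2**50, "PiB with 5 levels")
```

The 48-bit and 57-bit figures on the slide fall out of 12 + 9 × 4 and 12 + 9 × 5 respectively.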
  17. Huge Pages

    Pages can be 4K, 2M or 1G
    grep /proc/cpuinfo for: pse (2M support), pdpe1gb (1G support)
    👍 Better use of TLB
    👍 Fewer memory cache misses
    👎 More complex to set up
    👎 May waste memory
    👎 hugetlbfs needs to be configured
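The "better use of TLB" point can be quantified as TLB reach: the amount of memory that can be addressed without a TLB miss. This is a rough sketch that takes the STLB entry counts quoted on the earlier Ice Lake slide (2048 entries for 4K pages, 1024 for 2M pages) at face value; `tlb_reach` is an illustrative helper.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps a distinct page."""
    return entries * page_size

reach_4k = tlb_reach(2048, 4 * KIB)   # STLB entries for 4K pages, as quoted on the slide
reach_2m = tlb_reach(1024, 2 * MIB)   # STLB entries for 2M pages, as quoted on the slide

print(reach_4k // MIB, "MiB reachable with 4K pages")
print(reach_2m // GIB, "GiB reachable with 2M pages")
```

Under these assumptions a working set larger than a few MiB will start missing the STLB with 4K pages, while 2M pages cover a far larger range.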
  18. 👎 hugetlbfs 👎

    • Requires kernel configuration to reserve memory ahead of time
    • Boot parameter hugepages=N puts aside memory for huge page use
    • Boot parameter hugepagesz={2M,1G} specifies huge page size
    • Requires a hugetlbfs mount to be provided
    • Requires root (or suitably permissioned app) to use hugepages
  19. 👍 Transparent Huge Pages 👎

    • Does not require boot time configuration or special permissions
    • khugepaged assembles contiguous physical memory for large pages
    • Default page size is still 4K, but processes can madvise() use of large pages
    • Allows specific apps to opt-in on demand
    • Benefits of a smaller TLB footprint with less wasted memory
    💡 Enable opt-in through use of madvise:
    # echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    💡 Defer instead of blocking large page request:
    # echo defer > /sys/kernel/mm/transparent_hugepage/defrag
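As an illustration of the opt-in, a process can hint a mapping with madvise(). The sketch below uses Python's `mmap` module, whose `madvise` method and `MADV_HUGEPAGE` constant are only present on platforms that support them, so both are probed first. `map_with_huge_page_hint` is a hypothetical helper, and whether the kernel actually backs the range with huge pages depends on the sysfs settings above.

```python
import mmap

def map_with_huge_page_hint(length=4 * 1024 * 1024):
    """Map anonymous memory and ask for transparent huge pages where possible.

    Returns (mapping, hinted); hinted is True only if the madvise call succeeded.
    """
    if hasattr(mmap, "MAP_PRIVATE"):
        mapping = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE)  # anonymous private memory
    else:
        mapping = mmap.mmap(-1, length)                          # platforms without Unix flags
    hinted = False
    advice = getattr(mmap, "MADV_HUGEPAGE", None)                # absent on non-Linux builds
    if advice is not None and hasattr(mapping, "madvise"):
        try:
            mapping.madvise(advice)                              # a hint only, not a guarantee
            hinted = True
        except OSError:
            pass                                                 # e.g. kernel built without THP
    return mapping, hinted

mapping, hinted = map_with_huge_page_hint()
print("huge page hint accepted:", hinted)
mapping.close()
```

The hint is advisory: the program behaves the same either way, which is why it needs no special permissions.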
  20. Cache lines, loads and stores

    • Unit of granularity of a cache entry is 64/128 bytes (512/1024 bits)
    • Even if you only read/write 1 byte you're reading/writing 64/128 bytes
    • Cache lines can generally be in different states:
      ➡ M – exclusively owned by that core, and modified (dirty)
      ➡ E – exclusively owned by that core, but not modified
      ➡ S – shared read-only with other cores
      ➡ I – invalid, cache line not used
    Various extensions to MESI exist … see https://en.wikipedia.org/wiki/MESI_protocol for more
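A small sketch of the granularity point: given an address and an access size, which cache lines are involved? It assumes 64-byte lines (the slide also mentions 128), and `cache_lines_touched` is an illustrative name.

```python
CACHE_LINE = 64   # assumption: 64-byte lines

def cache_lines_touched(address, size, line=CACHE_LINE):
    """Return the line-aligned base address of every cache line an access touches."""
    first = address // line
    last = (address + size - 1) // line
    return [n * line for n in range(first, last + 1)]

# A 1-byte access still involves a whole 64-byte line.
print([hex(a) for a in cache_lines_touched(0x1234, 1)])
# An 8-byte access that straddles a line boundary involves two lines.
print([hex(a) for a in cache_lines_touched(0x123c, 8)])
```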
  21. Memory prefetching (CPU)

    CPU issues automatic prefetch for streamed data
    Also notices striding by certain amounts
    💡 Can also use __builtin_prefetch to explicitly suggest prefetching memory elsewhere, but this needs to be a measured improvement
  22. False sharing

    • Two cores trying to write to bytes in the same cache-line will thrash
    • First thread will try to acquire exclusive ownership of cache line
    • Second thread (on different core) will try to do the same
    • Performance will suffer when cache line repeatedly moved
    • Avoid by padding to at least cacheline size * 2 (128/256 bytes) for writes
    Thread 1: data[0] = 'A'
    Thread 2: data[7] = 'C'
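The thrashing can be illustrated with a toy ownership model. This is not a MESI implementation: it only tracks which core last wrote each 64-byte line and counts how often ownership has to move. All names are illustrative.

```python
def ownership_transfers(writes, line=64):
    """Count how often a cache line changes owner for a list of (core, address) writes."""
    owner = {}        # line number -> core that last wrote it
    transfers = 0
    for core, address in writes:
        line_no = address // line
        previous = owner.get(line_no)
        if previous is not None and previous != core:
            transfers += 1        # the other core's copy must be invalidated first
        owner[line_no] = core
    return transfers

def alternating_writes(addr_a, addr_b, rounds):
    """Core 1 writes addr_a and core 2 writes addr_b, taking turns."""
    writes = []
    for _ in range(rounds):
        writes.append((1, addr_a))
        writes.append((2, addr_b))
    return writes

shared = ownership_transfers(alternating_writes(0, 7, 1000))     # data[0] and data[7]
padded = ownership_transfers(alternating_writes(0, 128, 1000))   # padded 128 bytes apart
print("same line:", shared, "transfers; padded:", padded, "transfers")
```

In this model the padded layout never moves a line between cores, which is the effect the padding advice on the slide is aiming for.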
  23. Memory performance strategies 💡

    • Ensure data fits in L1/L2/L3 cache where possible
    • Stream or stride through data in a single pass if possible
    • Consider pivoting data (array-of-structs or structs-of-arrays)
    • Add padding for multi-threaded contended writes
    • Prefer thread-local or cpu-local accumulators with final merge step
    • Compress data where practical (compressed pointers)
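The pivot from array-of-structs to structs-of-arrays can be sketched as follows. Python's `array` module is used here only because it stores values contiguously; the record layout and the name `pivot_to_columns` are made up for illustration.

```python
from array import array

def pivot_to_columns(points):
    """Turn a list of (x, y) records into one contiguous array per field."""
    xs = array("d", (p[0] for p in points))
    ys = array("d", (p[1] for p in points))
    return xs, ys

records = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]   # array-of-structs layout
xs, ys = pivot_to_columns(records)                   # structs-of-arrays layout

# Summing one field now streams through a single contiguous buffer
# instead of skipping over the unused field of every record.
print(sum(xs), sum(ys))
```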
  24. Pinning memory/threads

    • Pinning memory or threads to a particular core can improve performance
    • Reduces inter-core memory ownership traffic
    • Less likely to have cache invalidations
    • isolcpus allows reservation of CPUs for non-kernel use with cpusets
    • taskset allows binding of a process to specific cores
    • numactl allows cores/memory to be clamped for a process
    • libnuma has additional affinity settings for programmatic use
  25. Core

    Frontend: x86_64 instructions in, µop 5/cycle out to the Backend
    L1 Instruction 32 KiB 8-way
    L1 Data 48 KiB 12-way
  26. Core

    Frontend blocks: x86_64 Instructions, Pre-decode, µop decoders, µop cache (2304 8-way), loop decode, branch prediction, 256/5120 entry
    µop 5/cycle to the Backend
    L1 Instruction 32 KiB 8-way
    L1 Data 48 KiB 12-way
  27. Core

    Frontend diagram as on slide 26, with raw instruction bytes arriving:
    55 48 89 e5 fe 04 25 d2 04 00 00 41 6c 42 6c 75
  28. Core

    Frontend diagram as on slide 26, with the bytes split into instructions by Pre-decode:
    55|48 89 e5|fe 04 25 d2 04 00 00|41 6c 42 6c 75
  29. Core

    Frontend diagram as on slide 26, with the decoded instructions:
    push %rbp
    mov %rsp, %rbp
    incb 0x4d2
    ?
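The byte boundaries shown on these slides can be reproduced with a deliberately tiny decoder. It recognises only the three encodings that appear on the slide and is in no way a general x86_64 decoder; `toy_decode` is an illustrative name.

```python
CODE = bytes.fromhex("55 48 89 e5 fe 04 25 d2 04 00 00")

def toy_decode(code):
    """Decode only the three instruction encodings shown on the slide."""
    out = []
    i = 0
    while i < len(code):
        if code[i] == 0x55:                           # one-byte opcode
            out.append("push %rbp")
            i += 1
        elif code[i:i + 3] == b"\x48\x89\xe5":        # REX.W prefix + opcode + ModRM
            out.append("mov %rsp, %rbp")
            i += 3
        elif code[i:i + 3] == b"\xfe\x04\x25":        # opcode + ModRM + SIB, then disp32
            address = int.from_bytes(code[i + 3:i + 7], "little")
            out.append(f"incb {address:#x}")
            i += 7
        else:
            out.append("?")                           # anything else is out of scope here
            break
    return out

print(toy_decode(CODE))
```

The variable lengths (1, 3 and 7 bytes) are why a pre-decode stage is needed before the decoders can work in parallel.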
  30. Core

    Frontend diagram as on slide 26, with a repeated instruction:
    incb 0x4d2
    incb 0x4d2
    incb 0x4d2
  31. Core

    Frontend diagram as on slide 26, with incb 0x4d2 broken into µops:
    TMP = [0x4d2]
    INC TMP
    [0x4d2] = TMP
  32. Core

    Frontend diagram as on slide 26, with the µops passed to the Backend:
    TMP = [0x4d2]
    INC TMP
    [0x4d2] = TMP
  33. Branch Prediction ⤵🧐

    • Correct 95% of the time
    • Queues up instructions assuming the branch has been taken
    • Learns patterns in code based on existing behaviour
    • Iterating through predictable (sorted) data may be more efficient
    • Throws away inaccurate work if incorrect
    • May cause observable side channel behaviour e.g. cache invalidation 👻
    • Unrolled loops and inlining avoid branches, so improve performance
    cmp eax,42; jne
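Why sorted data helps can be shown with a textbook 2-bit saturating counter. This is a deliberately simple stand-in, not how a modern Intel predictor works, and all names are illustrative.

```python
import random

def mispredictions(outcomes):
    """Count misses of a single 2-bit saturating counter over a sequence of branch outcomes."""
    counter = 2        # 0 and 1 predict not taken, 2 and 3 predict taken
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

def branch_outcomes(data, threshold=50):
    """Outcome of a branch like `if (v < threshold)` for each element."""
    return [v < threshold for v in data]

rng = random.Random(42)                              # fixed seed so runs are repeatable
values = [rng.randrange(100) for _ in range(10_000)]

random_misses = mispredictions(branch_outcomes(values))
sorted_misses = mispredictions(branch_outcomes(sorted(values)))
print("unsorted:", random_misses, "misses; sorted:", sorted_misses, "misses")
```

With sorted input the branch changes direction once, so this predictor only misses around that transition; with unsorted input it is wrong roughly half the time.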
  34. Branch Target Predictor 🎯

    • Predicts where the target is going if taken
    • Hard coded addresses/offsets always predictable
    • Jump to location of register may be more difficult
    • Often seen when jumping through object oriented code
    • Inlining is the master optimisation because it avoids unknowable branches
    • Dan Luu has a good write up https://danluu.com/branch-prediction/
    jmp [eax]
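The difficulty with indirect jumps can be sketched with a predictor that simply guesses the previous target of a call site. Real target predictors are more sophisticated; the method names below are made up for illustration.

```python
def target_misses(call_targets):
    """Count misses for a predictor that guesses the previous target of one indirect jump."""
    last = None
    misses = 0
    for target in call_targets:
        if target != last:
            misses += 1
        last = target
    return misses

monomorphic = ["Circle.area"] * 1000                  # one receiver type at the call site
polymorphic = ["Circle.area", "Square.area"] * 500    # receiver types alternate

print("monomorphic misses:", target_misses(monomorphic))
print("polymorphic misses:", target_misses(polymorphic))
```

A call site that always lands on the same method is trivially predictable in this model, while an alternating one defeats it completely.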
  35. Pipeline stalls

    • CPUs are deeply pipelined (15-20 stages)
    • A change to the PC (program counter/instruction pointer) causes a flush
    • Gaps occur after jump due to pipeline refilling
    • Good branch predictor guesses correctly to avoid this
    • HyperThreading can take advantage of 'empty' slots
  36. Pipeline stalls

    Diagram: six Instructions pass through the F D X W pipeline stages over Time
    Jump instruction: jump causes stall 👻 while buffer fills when branch not predicted
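The cost of a refill can be estimated with a simple cycle count. The sketch assumes an ideal pipeline that completes one instruction per cycle once full, and that a mispredicted jump is only resolved in the last stage, so the whole pipeline has to refill; real cores resolve branches earlier and overlap work.

```python
def total_cycles(instructions, stages=4, mispredicted_jumps=0):
    """Cycles for an ideal pipeline, plus a full refill bubble per mispredicted jump."""
    fill = stages - 1                              # cycles before the first instruction completes
    refill = mispredicted_jumps * (stages - 1)     # assumed penalty per misprediction
    return fill + instructions + refill

print(total_cycles(6))                                       # six instructions through F D X W
print(total_cycles(6, mispredicted_jumps=1))                 # same, with one unpredicted jump
print(total_cycles(6, stages=20, mispredicted_jumps=1))      # a deeper pipeline pays more
```

Under these assumptions the penalty grows with pipeline depth, which is why accurate prediction matters more on deeply pipelined cores.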
  37. Core

    Frontend; L1 Instruction 32 KiB 8-way; L1 Data 48 KiB 12-way
    Allocate / Rename / Retire
    load buffer 128 entry; store buffer 72 entry; register files 280 + 224; reorder 352 entry
    Scheduler, issuing to ports 0 to 9:
      Port 0: Integer Unit ALU, LEA, Shift, Branch; Floating Unit ALU, FMA, Shift, Divide
      Port 1: Integer Unit ALU, LEA, Multiply, Divide; Floating Unit ALU, FMA, Shift, Shuffle
      Port 5: Integer Unit ALU, LEA, Multiply; Floating Unit ALU, FMA, Shuffle
      Port 6: Integer Unit ALU, LEA, Shift, Branch
      Ports 2, 3, 4, 7, 8, 9: Address generation
    Execution units added in Ice Lake
    Port 8 and 9 added in Ice Lake
    Port 0 and 1 can be fused for a 512 bit operation
    Port 5 is a 512 bit wide operation
    All others handle 256 bits
  38. Core

    Backend diagram as on slide 37, with the µops arriving:
    TMP = [0x4d2]
    INC TMP
    [0x4d2] = TMP
  39. Core

    Backend diagram as on slide 37, with TMP renamed to R99:
    R99 = [0x4d2]
    INC R99
    [0x4d2] = R99
  41. Core

    Backend diagram as on slide 37, with the load completed:
    R99 = 2A
    INC R99
    [0x4d2] = R99
  43. Core

    Backend diagram as on slide 37, with the increment executed:
    INC R99
    [0x4d2] = R99
    R99 = 2B
  44. Core

    Backend diagram as on slide 37, with the store executed:
    INC R99
    R99 = 2B
    [0x4d2] = 2B
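The walk-through on these slides (load, increment, store, with TMP renamed to a physical register) can be mimicked with a tiny interpreter. Everything here is a toy: the µop tuples, the register name R99 taken from the slides, and the dictionaries standing in for memory and the register file.

```python
def execute(uops, memory, registers):
    """Run a list of (operation, destination, source) µops against toy state."""
    for op, dst, src in uops:
        if op == "load":
            registers[dst] = memory[src]
        elif op == "inc":
            registers[dst] = (registers[dst] + 1) & 0xFF   # incb operates on a single byte
        elif op == "store":
            memory[dst] = registers[src]
        else:
            raise ValueError(f"unknown µop {op!r}")

memory = {0x4d2: 0x2A}
registers = {}
incb_uops = [
    ("load", "R99", 0x4d2),     # R99 = [0x4d2]
    ("inc", "R99", None),       # INC R99
    ("store", 0x4d2, "R99"),    # [0x4d2] = R99
]
execute(incb_uops, memory, registers)
print(hex(registers["R99"]), hex(memory[0x4d2]))
```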
  46. perf

    • Linux perf (compiled from linux/tools/perf, or from linux-tools/linux-perf)
    • Running in Docker may require compilation from source if kernel mismatch
    • Commands available
      • record – record execution performance for process/pid
      • report – generate a report from prior recording
      • annotate – annotate a report from a prior recording
      • stat – record performance counters for process/pid
    https://perf.wiki.kernel.org
    https://github.com/alblue/scripts/blob/master/perf-Dockerfile
  47. JMH and perf

    • The Java Microbenchmarking Harness supports perf natively on Linux
    • Also able to run on Windows using a different mechanism
      • perf - runs and collects perf statistics
      • perfnorm - and normalises them
      • perfasm - shows assembly output of perf
    https://github.com/openjdk/jmh
  48. Async profiler and perf

    • The Async Profiler for Java supports perf output for native JIT
    • When running in containers need to make sure the perfmap is exported
    • perf-map-agent - Java agent to generate JIT maps for use with perf
    • Can be used to generate mixed-mode flame graphs (JIT, Java etc.)
    https://github.com/jvm-profiling-tools
  49. perf record

    • Perf record will sample the process(es) and generate stack traces
    • Events may be skewed from their location
    • Improve accuracy with :p, :pp or :ppp suffix to event
    • Can capture branches, last branch records or use processor tracing
      • perf record -b program
      • perf record --call-graph lbr -j any_call,any_ret program
      • perf record -e intel_pt//u program
    https://lwn.net/Articles/680985/
    https://lwn.net/Articles/680996/
  50. perf stat

    $ perf stat base64 <(echo hello)
    aGVsbG8K

     Performance counter stats for 'base64 /dev/fd/63':

          0.341382  task-clock (msec)         #    0.649 CPUs utilized
                 0  context-switches          #    0.000 K/sec
                 0  cpu-migrations            #    0.000 K/sec
                65  page-faults               #    0.190 M/sec
         1,218,176  cycles                    #    3.568 GHz
           811,468  stalled-cycles-frontend   #   66.61% frontend cycles idle
           855,999  instructions              #    0.70  insn per cycle
                                              #    0.95  stalled cycles per insn
           169,032  branches                  #  495.140 M/sec
             8,883  branch-misses             #    5.26% of all branches

       0.000526160 seconds time elapsed

    IPC: > 4 👍, < 1 👎
    https://perf.wiki.kernel.org
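The derived figures in the right-hand column of this output are plain ratios of the raw counters, which the following sketch recomputes from the numbers on the slide. The dictionary keys simply mirror the event names shown above.

```python
counters = {
    "cycles": 1_218_176,
    "stalled-cycles-frontend": 811_468,
    "instructions": 855_999,
    "branches": 169_032,
    "branch-misses": 8_883,
}

ipc = counters["instructions"] / counters["cycles"]
frontend_idle_pct = 100 * counters["stalled-cycles-frontend"] / counters["cycles"]
stalled_per_insn = counters["stalled-cycles-frontend"] / counters["instructions"]
branch_miss_pct = 100 * counters["branch-misses"] / counters["branches"]

print(f"{ipc:.2f} insn per cycle")
print(f"{frontend_idle_pct:.2f}% frontend cycles idle")
print(f"{stalled_per_insn:.2f} stalled cycles per insn")
print(f"{branch_miss_pct:.2f}% of all branches missed")
```

An IPC of about 0.7 sits at the 👎 end of the slide's rule of thumb, which is expected for such a short-lived process.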
  51. Performance counters

    • Intel cores have a few dedicated and programmable counters
      • Instruction cycles, branches, branch misses …
    • Counters can be multiplexed (read X for 1µs, read Y for 1µs)
    • Programmable counters can be set to specific measurements
      • iTLB-load-misses, LLC-load-misses, uops_dispatched_port.port_5 ...
    • Undocumented performance counters can be specified with events
      • cpu/event=0x3c,umask=0x0,any=1/
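An event specification such as cpu/event=0x3c,umask=0x0,any=1/ is shorthand for bit fields in a raw counter configuration value. The sketch below assumes the field layout commonly exposed for Intel core PMUs under /sys/bus/event_source/devices/cpu/format (event in bits 0-7, umask in bits 8-15, any in bit 21); check those files on the target machine before relying on it. `raw_event` is an illustrative helper, not part of perf.

```python
def raw_event(event, umask=0, any_thread=0):
    """Pack event fields into a raw config value (assumed Intel core PMU layout)."""
    return (event & 0xFF) | ((umask & 0xFF) << 8) | ((any_thread & 1) << 21)

config = raw_event(0x3c, umask=0x0, any_thread=1)
print(f"raw config: {config:#x}")
```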
  52. Top-down Microarchitecture Analysis

    Locating Issues
    Have Precise events for sampling
    Precise events added in Skylake
    https://www.researchgate.net/publication/269302126_A_Top-Down_method_for_performance_analysis_and_counters_architecture Ahmed Yasin
  53. Top-down Analysis Method

    Excerpts from USING PERFORMANCE MONITORING EVENTS:
    Additionally, the metric uses the UOPS_ISSUED.ANY, which is common in recent Intel microarchitectures, as the denominator. The UOPS_ISSUED.ANY event counts the total number of Uops that the RAT issues to RS. The VectorMixRate metric gives the percentage of injected blend uops out of all uops issued. Usually a VectorMixRate over 5% is worth investigating.
    VectorMixRate[%] = 100 * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY
    Note the actual penalty may vary as it stems from the additional data-dependency on the destination register the injected blend operations add.
    B.2 PERFORMANCE MONITORING AND MICROARCHITECTURE
    This section provides information of performance monitoring hardware and terminology related to the Silvermont, Airmont and Goldmont microarchitectures. The features described here may be specific to individual microarchitecture, as indicated in Table B-1.
    [Figure B-3. TMAM Hierarchy Supported by Skylake Microarchitecture: Pipeline Slots divided into Retiring, Bad Speculation, Frontend Bound and Backend Bound, each broken down into further levels]
    The single entry point of division at a pipeline’s issue-stage (allocation-stage) makes the four categories additive to the total possible slots. The classification at slots granularity (sub-cycle) makes the breakdown very accurate and robust for superscalar cores, which is a necessity at the top-level.
    [Figure B-2. TMAM’s Top Level Drill Down Flowchart: Uop Allocate? If yes, Uop Ever Retires? (yes: Retiring, no: Bad Speculation). If no, Back End Stalls? (yes: Back End Bound, no: Front End Bound)]
    https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual Ahmed Yasin
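The top level drill down flowchart (Figure B-2) amounts to two yes/no questions per pipeline slot. Here is a minimal sketch of that decision and of the resulting breakdown; the tuple layout and function names are illustrative, and real tools derive these fractions from performance counters rather than from per-slot records.

```python
from collections import Counter

def classify_slot(uop_allocated, uop_retires=False, backend_stalled=False):
    """Assign one pipeline slot to a top-level category."""
    if uop_allocated:
        return "Retiring" if uop_retires else "Bad Speculation"
    return "Backend Bound" if backend_stalled else "Frontend Bound"

def breakdown(slots):
    """Fraction of slots per category; each slot is (allocated, retires, backend_stalled)."""
    counts = Counter(classify_slot(*slot) for slot in slots)
    return {category: counts[category] / len(slots) for category in counts}

slots = [
    (True, True, False),     # allocated and eventually retired
    (True, False, False),    # allocated but thrown away
    (False, False, True),    # nothing allocated because the backend was stalled
    (False, False, False),   # nothing allocated because the frontend had nothing to offer
]
print(breakdown(slots))
```

Because every slot lands in exactly one category, the four fractions always add up to 1, which is the additive property the manual excerpt describes.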
  54. perf stat --topdown

    $ perf stat -a --topdown sleep 1
    nmi_watchdog enabled with topdown. May give wrong results.
    Disable with echo 0 > /proc/sys/kernel/nmi_watchdog

     Performance counter stats for 'system wide':

                 retiring   bad speculation   frontend bound   backend bound
    S0-C0   2       15.3%              2.8%            32.1%           49.9%
    S0-C1   2       23.3%              4.0%            27.3%           45.4%
    S0-C2   2       15.2%              2.9%            29.8%           52.1%
    S0-C3   2       16.7%              0.0%            31.8%           51.5%
    S0-C4   2       35.7%             10.7%            26.2%           27.4%
    S0-C5   2       14.9%              2.5%            34.1%           48.5%

       1.000889285 seconds time elapsed
  55. Toplev PMU tools

    • Andi Kleen has written toplev.py which allows top-down analysis
    • Initial download caches processor information from download.01.org
    • Uses perf to record stats, but with custom event filters
    • If workload is repeatable, can use --no-multiplex to repeat results
    • Run with -l1, see if issues are present, run with -l2 ...
    https://github.com/andikleen/pmu-tools/wiki/toplev-manual
  56. toplev.py --single-thread

    $ dd if=/dev/urandom of=/tmp/rand bs=4096 count=4096
    $ ./toplev.py --single-thread --no-multiplex -l1 -- base64 /tmp/rand > /dev/null
    # 3.6-full on Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
    BE       Backend_Bound                                % Slots    24.07 <==

    $ ./toplev.py --single-thread --no-multiplex -l2 -- base64 /tmp/rand > /dev/null
    BE       Backend_Bound                                % Slots    23.82
    BE/Core  Backend_Bound.Core_Bound                     % Slots    16.08 <==

    $ ./toplev.py --single-thread --no-multiplex -l3 -- base64 /tmp/rand > /dev/null
    BE       Backend_Bound                                % Slots    23.96
    BE/Core  Backend_Bound.Core_Bound                     % Slots    16.35
    BE/Core  Backend_Bound.Core_Bound.Ports_Utilization   % Clocks   24.51 <==
  57. Instruction layout

    Diagram: the blocks Before, Is Error?, Error and After laid out across a Cache line, shown once for __builtin_expect(error,0) and once for __builtin_expect(error,1)
  58. Loop stream detector

    Diagram: a Good Loop that fits within a Cache line and a Bad Loop that crosses into the next Cache line
    32 byte alignment
    Align with
    -mllvm -align-all-nofallthru-blocks=5
    -mllvm -align-all-functions=5
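The value 5 in these flags is a log2 alignment, so 2^5 = 32 bytes. The sketch below shows the arithmetic behind the good and bad loop placement: how an address is rounded up to the next boundary, and how many aligned 32-byte blocks a loop body spans. The addresses and loop size are invented for illustration.

```python
def align_up(address, log2_alignment):
    """Round an address up to the next multiple of 2**log2_alignment."""
    alignment = 1 << log2_alignment
    return (address + alignment - 1) & ~(alignment - 1)

def blocks_spanned(start, length, log2_alignment=5):
    """Number of aligned blocks touched by `length` bytes of code starting at `start`."""
    first = start >> log2_alignment
    last = (start + length - 1) >> log2_alignment
    return last - first + 1

loop_start, loop_bytes = 0x1014, 24        # hypothetical 24-byte loop body
aligned_start = align_up(loop_start, 5)

print(hex(aligned_start))
print("unaligned loop spans", blocks_spanned(loop_start, loop_bytes), "blocks")
print("aligned loop spans", blocks_spanned(aligned_start, loop_bytes), "block")
```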
  59. Facebook BOLT

    https://arxiv.org/abs/1807.06735
    Figure 9: Heat maps for instruction memory accesses of the HHVM binary, without and with BOLT. Heat is a log scale.
    Without BOLT: executed instructions are distributed across icache space
    With BOLT: after sorting basic blocks guided by profiling data, the icache space is defragmented
    https://github.com/facebookincubator/BOLT
    https://github.com/llvm/llvm-project/commit/4c106cfdf7cf7eec861ad3983a3dd9a9e8f3a8ae
    Now merged into Clang!
  60. Google llvm-propeller

    https://github.com/google/llvm-propeller
    https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
    Diagram labels: C → clang -fpropeller-label → Exe → perf record → perf.data → create_llvm_prof → perf.propeller; C → clang -fpropeller-optimize → Optimised exe
    func1() {…} func2() {…} func3() {…} …: clang -ffunction-sections, clang -fbasicblock-sections=perf.propeller
    lld + thinLTO + PGO
    Obsolete?!
  61. Summary: Memory

    • Use cacheline-aligned or cacheline-aware data structures
    • Compress data in memory and decompress on the fly
    • Avoid random memory access when possible
    • Configure huge pages and use madvise & defer
    • Partition memory with libnuma for data locality
  62. Summary: CPU

    • Each CPU is its own networked mesh cluster
    • Branch speculation and memory/TLB misses are costly
    • Use branch free and lock free algorithms when possible
    • Analyse perf counters with top down architectural analysis
    • Use (auto)vectorisation and use XMM/YMM/ZMM when sensible
  63. References

    https://danluu.com/branch-prediction/
    https://arxiv.org/abs/1807.06735 → https://github.com/facebookincubator/BOLT
    https://arxiv.org/abs/1902.08318 → https://github.com/lemire/simdjson/
    https://github.com/andikleen/pmu-tools/wiki/toplev-manual
    https://github.com/google/llvm-propeller/
    https://lwn.net/Articles/680985/ && https://lwn.net/Articles/680996/
    https://perf.wiki.kernel.org
    https://simplecore-ger.intel.com/swdevcon-uk/wp-content/uploads/sites/5/2017/10/UK-Dev-Con_Toby-Smith-Track-A_1000.pdf
    https://www.nextplatform.com/2021/04/19/deep-dive-into-intels-ice-lake-xeon-sp-architecture/
    https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual
    https://www.researchgate.net/publication/269302126_A_Top-Down_method_for_performance_analysis_and_counters_architecture
    https://drive.google.com/drive/folders/1W4CIRKtNML74BKjSbXerRsIzAUk3ppSG → https://twitter.com/Cardyak
    https://goodies.dotnetos.org/files/dotnetos-poster-ram.pdf → https://twitter.com/dotnetosorg
  64. Links

    https://easyperf.net/notes/
    https://epickrram.blogspot.com/
    https://groups.google.com/forum/#!forum/mechanical-sympathy/
    https://lemire.me/en/
    https://psy-lob-saw.blogspot.com/
    https://richardstartin.github.io/
    https://travisdowns.github.io/
    https://www.agner.org/optimize/
    https://www.real-logic.co.uk/
  65. Thanks

    @andreipangin @cardyak @chriswhocodes @danluu @dendibakh @dotnetosorg @epickrram @holly_cummins @jChampionsConf @Java_Champions @javajuneau @javaperftuning @lemire @mjpt777 @mon_beck @net0pyr @nitsanw @perfsummit1 @richardstartin @shipilev @sharat_chandler @trav_downs
    And many more …
  66. Thank you

    🗒 https://alblue.bandlem.com
    🦉 https://twitter.com/alblue
    🐙 https://github.com/alblue
    📺 https://vimeo.com/alblue
    📇 https://speakerdeck.com/alblue