Slide 1

Computer Architecture, C++, and High Performance
Matt P. Dziubinski
CppCon 2016
[email protected] // @matt_dz
Department of Mathematical Sciences, Aalborg University
CREATES (Center for Research in Econometric Analysis of Time Series)

Slide 2

Outline
• Performance
  • Why do we care?
  • What is it?
  • How to: measure it - reason about it - improve it?
2

Slide 3

Why? 3

Slide 4

Costs and Curves Moore, Gordon E. (1965). "Cramming more components onto integrated circuits". Electronics Magazine. 4

Slide 5

Cramming more components onto integrated circuits Moore, Gordon E. (1965). "Cramming more components onto integrated circuits". Electronics Magazine. 5

Slide 6

Spending Moore’s Dividend "Spending Moore's Dividend," James Larus, Microsoft Research Technical Report MSR-TR-2008-69, May 2008. 6

Slide 7

Transformation Hierarchy Yale N. Patt, Microprocessor Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 7

Slide 8

Phase I & The Walls Yale N. Patt, Microprocessor Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 8

Slide 9

CPU Performance Trends
[Figure: performance relative to the VAX-11/780 (log scale, 1 to 100,000) vs. year, 1978-2012, with growth regimes of 25%/year, 52%/year, and 22%/year; data points run from the VAX-11/780 (5 MHz) through MIPS, Sun, HP, IBM, and Digital Alpha workstations up to multi-core Intel Xeons (24,129x).]
Hennessy, John L.; Patterson, David A., 2011, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann.
9

Slide 10

40 Years of Microprocessor Trend Data
https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/
10

Slide 11

Processor-Memory Performance Gap
[Figure: performance (log scale, 1 to 100,000) vs. year, 1980-2010, with diverging "Processor" and "Memory" curves.]
The difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access.
Hennessy, John L.; Patterson, David A., 2011, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann.
Computer Architecture is Back: Parallel Computing Landscape https://www.youtube.com/watch?v=On-k-E5HpcQ
11

Slide 12

DRAM Performance Trends D. Lee: "Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity." http://arxiv.org/abs/1604.08041 (2016) D. Lee et al., "Tiered-latency DRAM: A low latency and low cost DRAM architecture," in HPCA, 2013. 12

Slide 13

Emerging Memory Technologies - Further Down The Hierarchy Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. 13

Slide 14

NVMs as Storage Class Memories - Bottlenecks: New & Old Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, 2013. 14

Slide 15

DBs Execution Cycles: Useful Computation vs. Stall Cycles R. Panda, C. Erb, M. LeBeane, J. H. Ryoo and L. K. John, "Performance Characterization of Modern Databases on Out-of-Order CPUs," Computer Architecture and High Performance Computing (SBAC-PAD), 2015 27th International Symposium on, Florianopolis, 2015, pp. 114-121. 15

Slide 16

System Calls - Performance Impact Livio Soares and Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 16

Slide 17

System Calls, Interrupts, and Asynchronous I/O Jisoo Yang, Dave B. Minturn, and Frank Hady. 2012. "When poll is better than interrupt." In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA. 17

Slide 18

System Calls as CPU Exceptions Craig B. Zilles, Joel S. Emer, and Gurindar S. Sohi. 1999. "The use of multithreading for exception handling." In Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture (MICRO 32). IEEE Computer Society, Washington, DC, USA, 219-229. 18

Slide 19

Pollution & Context Switch Misses Replaced Miss (D) & Reordered Miss (C) F. Liu, F. Guo, Y. Solihin, S. Kim and A. Eker, "Characterizing and modeling the behavior of context switch misses", Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008. 19

Slide 20

Beyond Mode Switch Time: Footprint & Pollution Livio Soares and Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 20

Slide 21

Beyond Mode Switch Time: Direct & Indirect Costs Livio Soares and Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 21

Slide 22

Feature Scaling Trends Lee, Yunsup, "Decoupled Vector-Fetch Architecture with a Scalarizing Compiler," EECS Department, University of California, Berkeley. 2016. http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-82.html 22

Slide 23

Process-Architecture-Optimization Intel's Annual Report on Form 10-K for the fiscal year ended December 26, 2015, filed with the SEC on February 12, 2016. https://www.sec.gov/Archives/edgar/data/50863/000005086316000105/a10kdocument12262015q4.htm 23

Slide 24

Make it fast Butler W. Lampson. 1983. "Hints for computer system design." In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48. 24

Slide 25

What? 25

Slide 26

Performance: The Early Days A. Greenbaum and T. Chartier. "Numerical Methods: Design, analysis, and computer implementation of algorithms." 2010. Course Notes for Short Course on Numerical Analysis. 26

Slide 27

Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), "On the computational complexity of algorithms", Transactions of the American Mathematical Society 117: 285–306. 27

Slide 29

Complexity: Algorithms & Data Structures
O(N): std::find, http://en.cppreference.com/w/cpp/algorithm/find
O(N·log(N)): std::sort, http://en.cppreference.com/w/cpp/algorithm/sort
O(log(N)): std::lower_bound (on a sorted range), http://en.cppreference.com/w/cpp/algorithm/lower_bound
O(log(N)): std::set::find, http://en.cppreference.com/w/cpp/container/set/find
29
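
A hedged illustration (my addition, not the deck's code) of how the same membership query lands in each of these classes:

// Membership test at three complexity classes (illustrative sketch).
#include <algorithm>
#include <set>
#include <vector>

bool contains_linear(const std::vector<int> & v, int key)
{   // O(N): scan every element
    return std::find(begin(v), end(v), key) != end(v);
}

bool contains_sorted(const std::vector<int> & v, int key)
{   // O(log N) per query, but requires an O(N log N) sort up front
    return std::binary_search(begin(v), end(v), key);
}

bool contains_set(const std::set<int> & s, int key)
{   // O(log N): balanced-tree (red-black) lookup
    return s.find(key) != end(s);
}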

Slide 30

Analysis of Algorithms - Scientific Method Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 30

Slide 31

Analysis of Algorithms - Problem Size N vs. Running Time T(N) Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 31

Slide 32

Analysis of Algorithms - Tilde Notation & Tilde Approximations Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 32

Slide 33

Analysis of Algorithms - Doubling Ratio Experiments Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 33
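
A minimal doubling-ratio experiment (my sketch, not the book's code): if T(N) ~ c·N^b, then T(2N)/T(N) tends to 2^b, so the base-2 log of the measured ratio estimates the exponent b.

#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static double time_sort(std::size_t n)
{
    std::vector<int> v(n);
    std::iota(begin(v), end(v), 0);
    std::shuffle(begin(v), end(v), std::mt19937{42});
    const auto t0 = std::chrono::steady_clock::now();
    std::sort(begin(v), end(v));
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    double prev = time_sort(1u << 16);
    for (std::size_t n = 1u << 17; n <= (1u << 22); n <<= 1) {
        const double cur = time_sort(n);
        std::printf("N = %zu  T(N) = %g s  ratio = %.2f  lg(ratio) = %.2f\n",
                    n, cur, cur / prev, std::log2(cur / prev));
        prev = cur; // lg(ratio) near 1 suggests ~linear, near 2 ~quadratic, etc.
    }
}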

Slide 34

Find Example C++ Code I

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <set>
#include <vector>
#include <boost/container/flat_set.hpp>
#include <EASTL/vector_set.h>

// EASTL
// https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project
void* operator new[](size_t size, const char* pName, int flags,
                     unsigned debugFlags, const char* file, int line)
{
    return malloc(size);
}

34

Slide 35

Find Example C++ Code II

void* operator new[](size_t size, size_t alignment, size_t alignmentOffset,
                     const char* pName, int flags, unsigned debugFlags,
                     const char* file, int line)
{
    return malloc(size);
}

using T = std::uint32_t;

std::vector<T> odd_numbers(std::size_t count)
{
    std::vector<T> result;
    result.reserve(count);
    for (std::size_t i = 0; i != count; i++)
        result.push_back(2 * i + 1);
    return result;
}

35

Slide 36

Find Example C++ Code III

template <typename container_type>
void ctor_and_find(const char * type_name,
                   const std::vector<T> & v, std::size_t q)
{
    printf("%s\n", type_name);
    std::mt19937 prng(1);
    const std::size_t n = v.size();
    std::uniform_int_distribution<T> uniform(0, 2 * n + 2);

    printf("ctor\t");
    auto time_start = std::chrono::steady_clock::now();
    const container_type s(begin(v), end(v));
    auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    printf("duration: %g \n", duration.count());

    printf("search\t");
    time_start = std::chrono::steady_clock::now();
    T sum = 0;
    for (std::size_t i = 0; i != q; ++i) {

36

Slide 37

Find Example C++ Code IV

        const auto it = s.find(uniform(prng));
        sum += (it != end(s)) ? *it : 0;
    }
    time_end = std::chrono::steady_clock::now();
    duration = time_end - time_start;
    printf("duration: %g \t", duration.count());
    printf("sum: %zu \n\n", sum);
}

void ctor_and_find(const char * type_name,
                   const std::vector<T> & v_src, std::size_t q)
{
    printf("%s\n", type_name);
    std::mt19937 prng(1);
    const std::size_t n = v_src.size();
    std::uniform_int_distribution<T> uniform(0, 2*n + 2);

    printf("prep\t");
    auto time_start = std::chrono::steady_clock::now();
    auto v = v_src;

37

Slide 38

Find Example C++ Code V

    std::sort(begin(v), end(v));
    auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    printf("duration: %g \n", duration.count());

    printf("search\t");
    time_start = std::chrono::steady_clock::now();
    T sum = 0;
    for (std::size_t i = 0; i != q; ++i) {
        const auto k = uniform(prng);
        const auto it = std::lower_bound(begin(v), end(v), k);
        sum += (it != end(v)) ? (*it == k ? k : 0) : 0;
    }
    time_end = std::chrono::steady_clock::now();
    duration = time_end - time_start;
    printf("duration: %g \t", duration.count());
    printf("sum: %zu \n\n", sum);
}

38

Slide 39

Find Example C++ Code VI

int main(int argc, char * argv[])
{
    // `n`: elements count (size)
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100;
    printf("size: %zu \n", n);
    // `q`: queries count
    const std::size_t q = (argc > 2) ? std::atoll(argv[2]) : 10;
    printf("queries: %zu \n", q);

    const auto v = odd_numbers(n);
    printf("\n");

    ctor_and_find<std::set<T>>("std::set", v, q);
    ctor_and_find("std::vector: copy & sort", v, q);
    ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q);
    ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q);
}

39

Slide 40

Find Example - Benchmark (Nonius) Code I

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <set>
#include <vector>
#include <boost/container/flat_set.hpp>
#include <EASTL/vector_set.h>
#include <nonius/nonius.h++>

NONIUS_PARAM(size, std::size_t{100u})
NONIUS_PARAM(queries, std::size_t{10u})

40

Slide 41

Find Example - Benchmark (Nonius) Code II

// EASTL
// https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project
void* operator new[](size_t size, const char* pName, int flags,
                     unsigned debugFlags, const char* file, int line)
{
    return malloc(size);
}

void* operator new[](size_t size, size_t alignment, size_t alignmentOffset,
                     const char* pName, int flags, unsigned debugFlags,
                     const char* file, int line)
{
    return malloc(size);
}

using T = std::uint32_t;

std::vector<T> odd_numbers(std::size_t count)
{
    std::vector<T> result;
    result.reserve(count);
    for (std::size_t i = 0; i != count; i++)
        result.push_back(2 * i + 1);
    return result;
}

41

Slide 42

Find Example - Benchmark (Nonius) Code III

template <typename container_type>
T ctor_and_find(const char * type_name,
                const std::vector<T> & v, std::size_t q)
{
    std::mt19937 prng(1);
    const std::size_t n = v.size();
    std::uniform_int_distribution<T> uniform(0, 2 * n + 2);

    const container_type s(begin(v), end(v));

    T sum = 0;
    for (std::size_t i = 0; i != q; ++i) {
        const auto it = s.find(uniform(prng));
        sum += (it != end(s)) ? *it : 0;
    }
    return sum;
}

42

Slide 43

Find Example - Benchmark (Nonius) Code IV

T ctor_and_find(const char * type_name,
                const std::vector<T> & v_src, std::size_t q)
{
    std::mt19937 prng(1);
    const std::size_t n = v_src.size();
    std::uniform_int_distribution<T> uniform(0, 2*n + 2);

    auto v = v_src;
    std::sort(begin(v), end(v));

    T sum = 0;
    for (std::size_t i = 0; i != q; ++i) {
        const auto k = uniform(prng);
        const auto it = std::lower_bound(begin(v), end(v), k);
        sum += (it != end(v)) ? (*it == k ? k : 0) : 0;
    }
    return sum;
}

43

Slide 44

Find Example - Benchmark (Nonius) Code V

NONIUS_BENCHMARK("std::set", [](nonius::chronometer meter) {
    const auto n = meter.param<size>();
    const auto q = meter.param<queries>();
    const auto v = odd_numbers(n);
    meter.measure([q, &v] {
        ctor_and_find<std::set<T>>("std::set", v, q);
    });
});

NONIUS_BENCHMARK("std::vector: copy & sort", [](nonius::chronometer meter) {
    const auto n = meter.param<size>();
    const auto q = meter.param<queries>();
    const auto v = odd_numbers(n);
    meter.measure([q, &v] {
        ctor_and_find("std::vector: copy & sort", v, q);
    });
});

44

Slide 45

Find Example - Benchmark (Nonius) Code VI

NONIUS_BENCHMARK("boost::container::flat_set", [](nonius::chronometer meter) {
    const auto n = meter.param<size>();
    const auto q = meter.param<queries>();
    const auto v = odd_numbers(n);
    meter.measure([q, &v] {
        ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q);
    });
});

NONIUS_BENCHMARK("eastl::vector_set", [](nonius::chronometer meter) {
    const auto n = meter.param<size>();
    const auto q = meter.param<queries>();
    const auto v = odd_numbers(n);
    meter.measure([q, &v] {
        ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q);
    });
});

int main(int argc, char * argv[]) { nonius::main(argc, argv); }

45

Slide 46

Find Example - Benchmark (Nonius) Code I

Nonius: statistics-powered micro-benchmarking framework:
https://nonius.io/
https://github.com/libnonius/nonius

Running:
BNSIZE=10000; BNQUERIES=1000
./find --param=size:$BNSIZE --param=queries:$BNQUERIES > results.size=$BNSIZE.queries=$BNQUERIES.txt
./find --param=size:$BNSIZE --param=queries:$BNQUERIES --reporter=html --output=results.size=$BNSIZE.queries=$BNQUERIES.html

46

Slide 47

Find Example - Results: size=10,000 queries=1,000 47

Slide 48

Find Example - Results: size=10,000,000 queries=1,000,000 48

Slide 49

How? 49

Slide 50

Asymptotic growth & "random access machines"? Tomasz Jurkiewicz and Kurt Mehlhorn. 2015. "On a Model of Virtual Address Translation." J. Exp. Algorithmics 19. http://arxiv.org/abs/1212.0703 & https://people.mpi-inf.mpg.de/~mehlhorn/ftp/KMvat.pdf 50

Slide 51

Asymptotic growth & "random access machines"?
Asymptotic - growing problem size
• for large data, we need to take into account the costs of actually bringing the data in
• communication complexity vs. computation complexity
• including overlapping computation-communication latencies
51

Slide 52

"Operation"? Jack Dongarra. 2016. "With Extreme Scale Computing the Rules Have Changed." In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). 52

Slide 56

Complexity - constants, microarchitecture?
"Array Layouts for Comparison-Based Searching" Paul-Virak Khuong, Pat Morin
http://cglab.ca/~morin/misc/arraylayout-v2/
• "With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search
• (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees).
• It was only through careful and controlled experimentation with different implementations of each of the search algorithms that we are able to understand how the interactions between processor features such as pipelining, prefetching, speculative execution, and conditional moves affect the running times of the search algorithms."
54

Slide 57

Reasoning about Performance: The Scientific Method
Requires - and is enabled by - the knowledge of microarchitectural details.
Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi, Chapter 2 "Methods" from "Readings in Computer Architecture," Morgan Kaufmann, 2000.
Prefetching benefits evaluation: disable/enable prefetchers using likwid-features:
https://github.com/RRZE-HPC/likwid/wiki/likwid-features
Example: https://gist.github.com/MattPD/06e293fb935eaf67ee9c301e70db6975
55
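
As a usage sketch (my addition; flags per the likwid wiki, so double-check the feature names available on your CPU), toggling the hardware prefetcher on core 0 looks roughly like:

likwid-features -c 0 -l                  # list prefetcher features and their state
likwid-features -c 0 -d HW_PREFETCHER    # disable
likwid-features -c 0 -e HW_PREFETCHER    # re-enable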

Slide 58

Microarchitecture
Intel® 64 and IA-32 Architectures Optimization Reference Manual
https://www-ssl.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
56

Slide 59

Pervasive CPU Parallelism
pipeline-level parallelism (PLP)
instruction-level parallelism (ILP)
memory-level parallelism (MLP)
data-level parallelism (DLP)
thread-level parallelism (TLP)
57

Slide 60

Pipelining & Temporal Parallelism D. Sima, "Decisive aspects in the evolution of microprocessors", Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 58

Slide 61

Pipelining: Base N. P. Jouppi and D. W. Wall. 1989. "Available instruction-level parallelism for superscalar and superpipelined machines." In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 59

Slide 62

Pipelining: Superscalar N. P. Jouppi and D. W. Wall. 1989. "Available instruction-level parallelism for superscalar and superpipelined machines." In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 60

Slide 63

The Cache Liptay, J. S. (1968) "Structural Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 61

Slide 64

The Cache: Processor-Memory Performance Gap Liptay, J. S. (1968) "Structural Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 62

Slide 65

The Cache: Assumptions & Effectiveness Liptay, J. S. (1968) "Structural Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 63

Slide 66

Out-of-Order Execution R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 64

Slide 67

Out-of-Order Execution: Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 65

Slide 68

Out-of-Order Execution: Reservation Stations R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 66

Slide 69

Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 67

Slide 70

Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 68

Slide 71

Out-of-Order Execution of Simple Micro-Operations Y.N. Patt, W.M. Hwu, and M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 69

Slide 72

Out-of-Order Execution: Restricted Dataflow Y.N. Patt, W.M. Hwu, and M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 70

Slide 73

Out-of-Order Execution: Results Buffer Y.N. Patt, W.M. Hwu, and M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 71

Slide 74

Pipelining & Precise Exceptions: Reorder Buffer (ROB) J.E. Smith and A.R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. 12th Ann. IEEE/ACM Int’l Symp. Computer Architecture, 1985, pp. 36–44. 72

Slide 75

Execution: Superscalar & Out-Of-Order J.E. Smith and G.S. Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 73

Slide 76

Superscalar CPU Organization J.E. Smith and G.S. Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 74

Slide 77

Superscalar CPU: ROB J.E. Smith and G.S. Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 75

Slide 78

Computer Architecture: A Science of Tradeoffs "My tongue in cheek phrase to emphasize the importance of tradeoffs to the discipline of computer architecture. Clearly, computer architecture is more art than science. Science, we like to think, involves a coherent body of knowledge, even though we have yet to figure out all the connections. Art, on the other hand, is the result of individual expressions of the various artists. Since each computer architecture is the result of the individual(s) who specified it, there is no such completely coherent structure. So, I opined if computer architecture is a science at all, it is a science of tradeoffs. In class, we keep coming up with design choices that involve tradeoffs. In my view, "tradeoffs" is at the heart of computer architecture." — Yale N. Patt 76

Slide 79

Design Points: Dictated by the Application Space
The design of a microprocessor is about making relevant tradeoffs. We refer to the set of considerations, along with the relevant importance of each, as the "design point" for the microprocessor - that is, the characteristics that are most important to the use of the microprocessor, such that one is willing to be less concerned about other characteristics. In each case, it is usually the problem we are addressing . . . which dictates the design point for the microprocessor, and the resulting tradeoffs that must be made.
Patt, Y., & Cockrell, E. (2001). "Requirements, bottlenecks, and good fortune: Agents for microprocessor evolution." Proceedings of the IEEE, 89(11), 1553-1559.
77

Slide 80

A Science of Tradeoffs Software Performance Optimization - Analogous! The multiplicity of tradeoffs: • Multidimensional • Multiple levels • Costs and benefits 78

Slide 81

Trade-offs - Latency & Bandwidth I

Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
        Memory node
Socket       0
     0    60.4

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using traffic with the following read-write ratios
ALL Reads        : 24152.0
3:1 Reads-Writes : 22313.2
2:1 Reads-Writes : 22050.5
1:1 Reads-Writes : 21130.4
Stream-triad like: 21559.4

79

Slide 82

Trade-offs - Latency & Bandwidth II

Measuring Memory Bandwidths between nodes within system
Using Read-only traffic type
        Memory node
Socket       0
     0    24155.0

Measuring Loaded Latencies for the system
Using Read-only traffic type
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
 00000  122.27   24109.6
 00002  121.99   24082.7
 00008  120.60   23952.1
 00015  119.28   23837.6
 00050   70.87   17408.7
 00100   64.59   12496.6

80

Slide 83

Trade-offs - Latency & Bandwidth III

Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
 00200   61.76    8129.1
 00300   60.75    6194.8
 00400   60.63    5085.6
 00500   60.12    4377.0
 00700   60.51    3505.2
 01000   60.60    2812.6
 01300   60.66    2425.3
 01700   60.51    2117.0
 02500   60.36    1789.5
 03500   60.33    1585.4
 05000   60.29    1430.9
 09000   60.31    1267.9
 20000   60.32    1154.7

81

Slide 84

Trade-offs - Latency & Size I
Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
http://www.7-cpu.com/cpu/SandyBridge.html

Size    Latency (cycles)  Increase    Description
32 K     4
64 K     8                4           + 8 (L2)
128 K   10                2
256 K   11                1
512 K   20                9           + 16 (L3)
1 M     24                4
2 M     26                2
4 M     27 + 18 ns        1 + 18 ns   + 56 ns (RAM)
8 M     28 + 38 ns        1 + 20 ns
16 M    28 + 47 ns        9 ns
32 M    28 + 52 ns        5 ns
64 M    28 + 54 ns        2 ns
128 M   36 + 55 ns        8 + 1 ns    + 16 (TLB miss)

82

Slide 85

Trade-offs - Latency & Size II

Size     Latency (cycles)  Increase  Description
256 M    40 + 56 ns        4 + 1 ns
512 M    42 + 56 ns        2
1024 M   43 + 56 ns        1
2048 M   44 + 56 ns        1
4096 M   44 + 56 ns        0
8192 M   53 + 56 ns        9         + 18 (PDPTE cache miss)

Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
http://www.7-cpu.com/cpu/SandyBridge.html
83
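
How numbers like these are typically measured - an illustrative pointer-chasing sketch (my addition, not 7-cpu.com's code): build a single random cycle over a working set of a given size and chase it, so every load depends on the previous one and the time per step approximates the load-to-use latency at that size.

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const std::size_t bytes = 32u << 20;                // working-set size to probe (~32 MB here)
    const std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> next(n);
    std::iota(begin(next), end(next), std::size_t{0});
    std::mt19937_64 rng{1};
    // Sattolo's algorithm: a uniformly random *single* cycle over all slots
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    const std::size_t steps = 10000000;
    std::size_t p = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != steps; ++i)
        p = next[p];                                    // serialized, data-dependent loads
    const auto t1 = std::chrono::steady_clock::now();
    std::printf("%g ns/load (sink: %zu)\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / steps, p);
}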

Slide 86

Trade-offs - Least Squares Golub & Van Loan (2013) "Matrix Computations" Trade-offs: FLOPs (FLoating-point OPerations) vs. Applicability / Numerical Stability / Speed / Accuracy Example: Catalogue of dense decompositions: http://eigen.tuxfamily.org/dox/group__TopicLinearAlgebraDecompositions.html 84

Slide 87

Trade-offs - Multidimensional - Numerical Optimization Ben Recht, Feng Niu, Christopher Ré, Stephen Wright. "Lock-Free Approaches to Parallelizing Stochastic Gradient Descent" OPT 2011: 4th International Workshop on Optimization for Machine Learning http://opt.kyb.tuebingen.mpg.de/slides/opt2011-recht.pdf 85

Slide 88

Trade-offs - Multiple levels - Numerical Optimization
Gradient computation - accuracy vs. function evaluations, f : R^d → R^N
• Finite differencing:
  • forward-difference: O(√ε_M) error, d · O(Cost(f)) evaluations
  • central-difference: O(ε_M^(2/3)) error, 2d · O(Cost(f)) evaluations
  w/ the machine epsilon ε_M := inf{ε > 0 : 1.0 + ε ≠ 1.0}
• Algorithmic differentiation (AD): precision - as in hand-coded analytical gradient
  • rough forward-mode cost d · O(Cost(f))
  • rough reverse-mode cost N · O(Cost(f))
86
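
A hedged sketch of the two finite-difference schemes above (my addition, not the speaker's code), using the standard step sizes √ε_M and ε_M^(1/3) that balance truncation against rounding error:

#include <cmath>
#include <cstdio>
#include <limits>

double f(double x) { return std::sin(x); }   // example function; exact derivative is cos(x)

int main()
{
    const double eps = std::numeric_limits<double>::epsilon();
    const double x = 1.0;
    const double h_fwd = std::sqrt(eps);     // forward-difference step ~ sqrt(eps_M)
    const double h_ctr = std::cbrt(eps);     // central-difference step ~ eps_M^(1/3)
    const double d_fwd = (f(x + h_fwd) - f(x)) / h_fwd;            // 1 extra evaluation
    const double d_ctr = (f(x + h_ctr) - f(x - h_ctr)) / (2 * h_ctr); // 2 extra evaluations
    std::printf("forward: %.12f  central: %.12f  exact: %.12f\n",
                d_fwd, d_ctr, std::cos(x));
}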

Slide 89

Trade-offs: Costs and Benefits Gabriel, Richard P. (1985). "Performance and Evaluation of Lisp Systems." Cambridge, Mass: MIT Press; Computer Systems Series. 87

Slide 90

Costs and Benefits: Implications • Important to know what to focus on • Optimize the optimization: so that it doesn't always take hours or days or weeks or months... 88

Slide 91

Superscalar CPU Model Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2009. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2, Article 3. 89

Slide 92

Instruction Level Parallelism & Loop Unrolling - Code I

#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <boost/timer/timer.hpp>

90

Slide 93

Instruction Level Parallelism & Loop Unrolling - Code II

using T = double;

T sum_1(const std::vector<T> & input)
{
    T sum = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; ++i)
        sum += input[i];
    return sum;
}

T sum_2(const std::vector<T> & input)
{
    T sum1 = 0.0, sum2 = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; i += 2) {
        sum1 += input[i];
        sum2 += input[i + 1];
    }
    return sum1 + sum2;
}

91

Slide 94

Instruction Level Parallelism & Loop Unrolling - Code III

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 10000000;
    const std::size_t f = (argc > 2) ? std::atoll(argv[2]) : 1;
    std::cout << "n = " << n << '\n'; // iterations count
    std::cout << "f = " << f << '\n'; // unroll factor

    const std::vector<T> a(n, T(1));

    boost::timer::auto_cpu_timer timer;
    const T sum = (f == 1) ? sum_1(a) : (f == 2) ? sum_2(a) : 0;
    std::cout << sum << '\n';
}

92

Slide 95

Instruction Level Parallelism & Loop Unrolling - Results

make vector_sums CXXFLAGS="-std=c++14 -O2 -march=native" LDLIBS=-lboost_timer

$ ./vector_sums 1000000000 2
n = 1000000000
f = 2
1e+09
0.466293s wall, 0.460000s user + 0.000000s system = 0.460000s CPU (98.7%)

$ ./vector_sums 1000000000 1
n = 1000000000
f = 1
1e+09
0.841269s wall, 0.840000s user + 0.010000s system = 0.850000s CPU (101.0%)

93

Slide 96

perf • https://perf.wiki.kernel.org/ • http://www.brendangregg.com/perf.html 94
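
The counter readings on the next two slides are perf stat output; a typical invocation (assuming the vector_sums binary built above) would be:

$ perf stat ./vector_sums 1000000000 1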

Slide 97

perf Results - sum_1

Performance counter stats for './vector_sums 1000000000 1':

      1675.812457  task-clock (msec)        #    0.850 CPUs utilized
               34  context-switches         #    0.020 K/sec
                5  cpu-migrations           #    0.003 K/sec
            8,953  page-faults              #    0.005 M/sec
    5,760,418,457  cycles                   #    3.437 GHz
    3,456,046,515  stalled-cycles-frontend  #   60.00% frontend cycles idle
    8,225,763,566  instructions             #    1.43  insns per cycle
                                            #    0.42  stalled cycles per insn
    2,050,710,005  branches                 # 1223.711 M/sec
          104,331  branch-misses            #    0.01% of all branches

      1.970909249 seconds time elapsed

95

Slide 98

perf Results - sum_2

Performance counter stats for './vector_sums 1000000000 2':

      1283.910371  task-clock (msec)        #    0.835 CPUs utilized
               38  context-switches         #    0.030 K/sec
                3  cpu-migrations           #    0.002 K/sec
            9,466  page-faults              #    0.007 M/sec
    4,458,594,733  cycles                   #    3.473 GHz
    2,149,690,303  stalled-cycles-frontend  #   48.21% frontend cycles idle
    6,734,925,029  instructions             #    1.51  insns per cycle
                                            #    0.32  stalled cycles per insn
    1,552,029,608  branches                 # 1208.830 M/sec
          119,358  branch-misses            #    0.01% of all branches

      1.537971058 seconds time elapsed

96

Slide 99

GCC Explorer: sum_1 (C++) http://gcc.godbolt.org/ 97

Slide 100

GCC Explorer: sum_1 (x86-64 Assembly) http://gcc.godbolt.org/ 98

Slide 101

GCC Explorer: sum_2 (C++) http://gcc.godbolt.org/ 99

Slide 102

GCC Explorer: sum_2 (x86-64 Assembly) http://gcc.godbolt.org/ 100

Slide 103

Intel Architecture Code Analyzer (IACA)

#include <iacaMarks.h>

T sum_2(const std::vector<T> & input)
{
    T sum1 = 0.0, sum2 = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; i += 2) {
        IACA_START
        sum1 += input[i];
        sum2 += input[i + 1];
    }
    IACA_END
    return sum1 + sum2;
}

$ g++ -std=c++14 -O2 -march=native vector_sums_2i.cpp -o vector_sums_2i
$ iaca -64 -arch IVB -graph ./vector_sums_2i

• https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
• https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it
• http://kylehegeman.com/blog/2013/12/28/introduction-to-iaca/
101

Slide 104

IACA Results - sum_1

$ iaca -64 -arch IVB -graph ./vector_sums_1i
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ./vector_sums_1i
Binary Format - 64Bit
Architecture - IVB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0  -  DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |
-------------------------------------------------------------------------
| Cycles | 1.0    0.0 | 1.0 | 1.0   1.0 | 1.0   1.0 | 0.0 | 1.0 |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |            Ports pressure in cycles             |    |
|  Uops  | 0 - DV | 1   | 2  -  D   | 3  -  D   | 4   | 5   |    |
---------------------------------------------------------------------
|   1    |        |     | 1.0   1.0 |           |     |     |    | mov rdx, qword ptr [rdi]
|   2    |        | 1.0 |           | 1.0   1.0 |     |     | CP | vaddsd xmm0, xmm0, qword ptr [rdx+rax*8]
|   1    | 1.0    |     |           |           |     |     |    | add rax, 0x1
|   1    |        |     |           |           |     | 1.0 |    | cmp rax, rcx
|   0F   |        |     |           |           |     |     |    | jnz 0xffffffffffffffe7
Total Num Of Uops: 5

102

Slide 105

IACA Results - sum_2

$ iaca -64 -arch IVB -graph ./vector_sums_2i
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ./vector_sums_2i
Binary Format - 64Bit
Architecture - IVB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 6.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0  -  DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |
-------------------------------------------------------------------------
| Cycles | 1.5    0.0 | 3.0 | 1.5   1.5 | 1.5   1.5 | 0.0 | 1.5 |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |            Ports pressure in cycles             |    |
|  Uops  | 0 - DV | 1   | 2  -  D   | 3  -  D   | 4   | 5   |    |
---------------------------------------------------------------------
|   1    |        |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rcx, qword ptr [rdi]
|   2    |        | 1.0 | 0.5   0.5 | 0.5   0.5 |     |     | CP | vaddsd xmm0, xmm0, qword ptr [rcx+rax*8]
|   1    | 1.0    |     |           |           |     |     |    | add rax, 0x2
|   2    |        | 1.0 | 0.5   0.5 | 0.5   0.5 |     |     |    | vaddsd xmm1, xmm1, qword ptr [rcx+rdx*1]
|   1    | 0.5    |     |           |           |     | 0.5 |    | add rdx, 0x10
|   1    |        |     |           |           |     | 1.0 |    | cmp rax, rsi
|   0F   |        |     |           |           |     |     |    | jnz 0xffffffffffffffde
|   1    |        | 1.0 |           |           |     |     | CP | vaddsd xmm0, xmm0, xmm1
Total Num Of Uops: 9

103

Slide 106

IACA Data Dependency Graph - sum_1 104

Slide 107

IACA Data Dependency Graph - sum_2 105

Slide 108

Work, Depth, and Parallelism Guy E. Blelloch, "Programming parallel algorithms", Communications of the ACM, 1996. 106
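
A standard worked example, added here for concreteness (following Blelloch's definitions): summing N numbers with a balanced reduction tree performs W(N) = N − 1 additions (work) in D(N) = ⌈log2 N⌉ steps (depth), so the available parallelism is W/D, roughly N / log2 N - for N = 10^6, about 50,000-fold.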

Slide 109

ILP & Data (In)dependence G. S. Tjaden and M. J. Flynn, ‘‘Detection and Parallel Execution of Independent Instructions,’’ IEEE Transactions on Computers, vol. C-19, pp. 889-895, October 1970. 107

Slide 110

ILP vs. Dependencies D. W. Wall, “Limits of instruction-level parallelism,” Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 108

Slide 111

ILP, Criticality & Latency Hiding D. W. Wall, “Limits of instruction-level parallelism,” Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 109

Slide 112

Empty Issue Slots: Horizontal Waste & Vertical Waste D. M. Tullsen, S. J. Eggers and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Proceedings, 22nd Annual International Symposium on Computer Architecture, 1995. 110

Slide 113

Wasted Slots: Causes D. M. Tullsen, S. J. Eggers and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, Santa Margherita Ligure, Italy, 1995, pp. 392-403. 111

Slide 114

Wasted Slots: Miss Events Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2006. "A performance counter architecture for computing accurate CPI components." SIGOPS Oper. Syst. Rev. 40, 5 (October 2006), 175-184. 112

Slide 115

likwid • https://github.com/RRZE-HPC/likwid • https://github.com/RRZE-HPC/likwid/wiki • https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr 113

Slide 116

likwid Results - sum_1: 489 Scalar MUOPS/s

$ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 1
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 1000000000
f = 1
1e+09
1.090122s wall, 0.880000s user + 0.000000s system = 0.880000s CPU (80.7%)
--------------------------------------------------------------------------------
Group 1: FLOPS_DP
+--------------------------------------+---------+------------+
|                Event                 | Counter |   Core 0   |
+--------------------------------------+---------+------------+
|          INSTR_RETIRED_ANY           |  FIXC0  | 8002493499 |
|        CPU_CLK_UNHALTED_CORE         |  FIXC1  | 4285189526 |
|        CPU_CLK_UNHALTED_REF          |  FIXC2  | 3258346806 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE |  PMC0   |     0      |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE |  PMC1   | 1000155741 |
|      SIMD_FP_256_PACKED_DOUBLE       |  PMC2   |     0      |
+--------------------------------------+---------+------------+
+----------------------+-----------+
|        Metric        |  Core 0   |
+----------------------+-----------+
| Runtime (RDTSC) [s]  |  2.0456   |
| Runtime unhalted [s] |  1.6536   |
|     Clock [MHz]      | 3408.2011 |
|         CPI          |  0.5355   |
|       MFLOP/s        | 488.9303  |
|     AVX MFLOP/s      |     0     |
|    Packed MUOPS/s    |     0     |
|    Scalar MUOPS/s    | 488.9303  |
+----------------------+-----------+

114

Slide 117

likwid Results - sum_2: 595 Scalar MUOPS/s

$ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 2
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 1000000000
f = 2
1e+09
0.620421s wall, 0.470000s user + 0.000000s system = 0.470000s CPU (75.8%)
--------------------------------------------------------------------------------
Group 1: FLOPS_DP
+--------------------------------------+---------+------------+
|                Event                 | Counter |   Core 0   |
+--------------------------------------+---------+------------+
|          INSTR_RETIRED_ANY           |  FIXC0  | 6502566958 |
|        CPU_CLK_UNHALTED_CORE         |  FIXC1  | 2948446599 |
|        CPU_CLK_UNHALTED_REF          |  FIXC2  | 2223894218 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE |  PMC0   |     0      |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE |  PMC1   | 1000328727 |
|      SIMD_FP_256_PACKED_DOUBLE       |  PMC2   |     0      |
+--------------------------------------+---------+------------+
+----------------------+-----------+
|        Metric        |  Core 0   |
+----------------------+-----------+
| Runtime (RDTSC) [s]  |  1.6809   |
| Runtime unhalted [s] |  1.1377   |
|     Clock [MHz]      | 3435.8987 |
|         CPI          |  0.4534   |
|       MFLOP/s        | 595.1079  |
|     AVX MFLOP/s      |     0     |
|    Packed MUOPS/s    |     0     |
|    Scalar MUOPS/s    | 595.1079  |
+----------------------+-----------+

115

Slide 118

likwid Results: sum_vectorized: 676 AVX MFLOP/s

g++ -std=c++14 -O2 -ftree-vectorize -ffast-math -march=native -lboost_timer vector_sums.cpp -o vector_sums_vf

$ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums_vf 1000000000 1
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 1000000000
f = 1
1e+09
0.561288s wall, 0.390000s user + 0.000000s system = 0.390000s CPU (69.5%)
--------------------------------------------------------------------------------
Group 1: FLOPS_DP
+--------------------------------------+---------+------------+
|                Event                 | Counter |   Core 0   |
+--------------------------------------+---------+------------+
|          INSTR_RETIRED_ANY           |  FIXC0  | 3002491149 |
|        CPU_CLK_UNHALTED_CORE         |  FIXC1  | 2709364345 |
|        CPU_CLK_UNHALTED_REF          |  FIXC2  | 2043804906 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE |  PMC0   |     0      |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE |  PMC1   |     91     |
|      SIMD_FP_256_PACKED_DOUBLE       |  PMC2   | 260258099  |
+--------------------------------------+---------+------------+
+----------------------+-----------+
|        Metric        |  Core 0   |
+----------------------+-----------+
| Runtime (RDTSC) [s]  |  1.5390   |
| Runtime unhalted [s] |  1.0454   |
|     Clock [MHz]      | 3435.5297 |
|         CPI          |  0.9024   |
|       MFLOP/s        | 676.4420  |
|     AVX MFLOP/s      | 676.4420  |
|    Packed MUOPS/s    | 169.1105  |
|    Scalar MUOPS/s    |  0.0001   |
+----------------------+-----------+

116

Slide 119

Performance: CPI Steven K. Przybylski, "Cache and Memory Hierarchy Design – A Performance-Directed Approach," San Fransisco, Morgan-Kaufmann, 1990. 117

Slide 120

Performance: [YMMV]PI - Power Grochowski, E., Ronen, R., Shen, J., & Wang, H. (2004). "Best of Both Latency and Throughput." Proceedings of the IEEE International Conference on Computer Design. 118

Slide 121

Performance: [YMMV]PI - Graphs Scott Beamer, Krste Asanović, and David A. Patterson. "GAIL: The Graph Algorithm Iron Law." Workshop on Irregular Applications: Architectures and Algorithms (IAˆ3), at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015. 119

Slide 122

Performance: [YMMV]PI - Packets

packet_processing_time = seconds/packet
                       = instructions/packet * clock_cycles/instruction * seconds/clock_cycle
                       = clock_cycles/packet * seconds/clock_cycle
                       = CPP / core_frequency

cycles per packet (CPP)
http://blogs.cisco.com/sp/a-bigger-helping-of-internet-please
120
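
To attach magnitudes (illustrative numbers, my addition): at a 3 GHz core clock, a budget of CPP = 600 cycles per packet corresponds to 3e9 / 600 = 5 million packets per second per core; halving CPP doubles the packet rate.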

Slide 123

Performance: separable components of a CPI

CPI = (Infinite-cache CPI) + finite-cache effect (FCE)
Infinite-cache CPI = execute busy (EBusy) + execute idle (EIdle)
FCE = (cycles per miss) × (misses per instruction)
    = (miss penalty) × (miss rate)

P. G. Emma. "Understanding some simple processor-performance limits." IBM Journal of Research and Development, 41(3):215-232, May 1997.
121
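
A quick plug-in example (numbers invented for illustration): with an infinite-cache CPI of 0.8, a miss rate of 0.01 misses/instruction, and a 100-cycle miss penalty, FCE = 0.01 × 100 = 1.0, so CPI = 0.8 + 1.0 = 1.8 - the finite-cache effect alone exceeds the entire infinite-cache component.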

Slide 124

Pipelining & Branches P. Emma and E. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Trans. Computers C-36, No. 7, 859-875 (July 1987) 122

Slide 125

Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2009. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2, Article 3. 123

Slide 126

Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and James E. Smith, "A Performance Counter Architecture for Computing Accurate CPI Components", ASPLOS 2006, pp. 175-184. 124

Slide 127

Branch (Mis)Prediction Example I

#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <random>
#include <vector>
#include <boost/timer/timer.hpp>

double sum1(const std::vector<double> & x, const std::vector<bool> & which)
{
    double sum = 0.0;
    for (std::size_t i = 0, n = which.size(); i != n; ++i) {
        sum += which[i] ? std::cos(x[i]) : std::sin(x[i]);
    }
    return sum;
}

125

Slide 128

Branch (Mis)Prediction Example II

double sum2(const std::vector<double> & x, const std::vector<bool> & which)
{
    double sum = 0.0;
    for (std::size_t i = 0, n = which.size(); i != n; ++i) {
        sum += which[i] ? std::sin(x[i]) : std::cos(x[i]);
    }
    return sum;
}

std::vector<bool> inclusion_random(std::size_t n, double p)
{
    std::vector<bool> which;
    which.reserve(n);
    static std::mt19937 g(1);
    std::bernoulli_distribution decision(p);
    for (std::size_t i = 0; i != n; ++i)
        which.push_back(decision(g));

126

Slide 129

Branch (Mis)Prediction Example III

    return which;
}

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    std::cout << "n = " << n << '\n';

    // branch takenness / predictability type
    // 0: never; 1: always; 2: random
    const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0;
    std::cout << "type = " << type << '\n';

    // takenness probability
    // 0.0: never; 1.0: always
    const double p = (argc > 3) ? std::atof(argv[3]) : 0.5;
    std::cout << "p = " << p << '\n';

127

Slide 130

Branch (Mis)Prediction Example IV

    std::vector<bool> which;
    if (type == 0) which.resize(n, false);
    else if (type == 1) which.resize(n, true);
    else if (type == 2) which = inclusion_random(n, p);

    const std::vector<double> x(n, 1.1);

    boost::timer::auto_cpu_timer timer;
    std::cout << sum1(x, which) + sum2(x, which) << '\n';
}

128

Slide 131

Timing: Branch (Mis)Prediction Example

$ make BP CXXFLAGS="-std=c++14 -O3 -march=native" LDLIBS=-lboost_timer-mt

$ ./BP 10000000 0
n = 10000000
type = 0
1.3448e+007
1.190391s wall, 1.187500s user + 0.000000s system = 1.187500s CPU (99.8%)

$ ./BP 10000000 1
n = 10000000
type = 1
1.3448e+007
1.172734s wall, 1.156250s user + 0.000000s system = 1.156250s CPU (98.6%)

$ ./BP 10000000 2
n = 10000000
type = 2
1.3448e+007
1.296455s wall, 1.296875s user + 0.000000s system = 1.296875s CPU (100.0%)

129

Slide 132

Likwid: Branch (Mis)Prediction Example

$ likwid-perfctr -C S0:1 -g BRANCH -f ./BP 10000000 0
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 10000000
type = 0
1.3448e+07
0.445464s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%)
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|            Event             | Counter |   Core 1   |
+------------------------------+---------+------------+
|      INSTR_RETIRED_ANY       |  FIXC0  | 2495177597 |
|    CPU_CLK_UNHALTED_CORE     |  FIXC1  | 1167613066 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1167632206 |
| BR_INST_RETIRED_ALL_BRANCHES |  PMC0   | 372952380  |
| BR_MISP_RETIRED_ALL_BRANCHES |  PMC1   |   14796    |
+------------------------------+---------+------------+
+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|    Runtime (RDTSC) [s]     |    0.4586    |
|    Runtime unhalted [s]    |    0.4505    |
|        Clock [MHz]         |  2591.5373   |
|            CPI             |    0.4679    |
|        Branch rate         |    0.1495    |
| Branch misprediction rate  | 5.929838e-06 |
| Branch misprediction ratio | 3.967263e-05 |
|  Instructions per branch   |    6.6903    |
+----------------------------+--------------+

130

Slide 133

Likwid: Branch (Mis)Prediction Example

$ likwid-perfctr -C S0:1 -g BRANCH -f ./BP 10000000 1
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 10000000
type = 1
1.3448e+07
0.445354s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%)
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|            Event             | Counter |   Core 1   |
+------------------------------+---------+------------+
|      INSTR_RETIRED_ANY       |  FIXC0  | 2495177490 |
|    CPU_CLK_UNHALTED_CORE     |  FIXC1  | 1167125701 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1167146162 |
| BR_INST_RETIRED_ALL_BRANCHES |  PMC0   | 372952366  |
| BR_MISP_RETIRED_ALL_BRANCHES |  PMC1   |   14720    |
+------------------------------+---------+------------+
+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|    Runtime (RDTSC) [s]     |    0.4584    |
|    Runtime unhalted [s]    |    0.4504    |
|        Clock [MHz]         |  2591.5345   |
|            CPI             |    0.4678    |
|        Branch rate         |    0.1495    |
| Branch misprediction rate  | 5.899380e-06 |
| Branch misprediction ratio | 3.946885e-05 |
|  Instructions per branch   |    6.6903    |
+----------------------------+--------------+

131

Slide 134

Likwid: Branch (Mis)Prediction Example

$ likwid-perfctr -C S0:1 -g BRANCH -f ./BP 10000000 2
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 10000000
type = 2
1.3448e+07
0.509917s wall, 0.510000s user + 0.000000s system = 0.510000s CPU (100.0%)
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|            Event             | Counter |   Core 1   |
+------------------------------+---------+------------+
|      INSTR_RETIRED_ANY       |  FIXC0  | 3191479747 |
|    CPU_CLK_UNHALTED_CORE     |  FIXC1  | 2264945099 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 2264967068 |
| BR_INST_RETIRED_ALL_BRANCHES |  PMC0   | 468135649  |
| BR_MISP_RETIRED_ALL_BRANCHES |  PMC1   |  15326586  |
+------------------------------+---------+------------+
+----------------------------+-----------+
|           Metric           |  Core 1   |
+----------------------------+-----------+
|    Runtime (RDTSC) [s]     |  0.8822   |
|    Runtime unhalted [s]    |  0.8740   |
|        Clock [MHz]         | 2591.5589 |
|            CPI             |  0.7097   |
|        Branch rate         |  0.1467   |
| Branch misprediction rate  |  0.0048   |
| Branch misprediction ratio |  0.0327   |
|  Instructions per branch   |  6.8174   |
+----------------------------+-----------+

132

Slide 135

Perf: Branch (Mis)Prediction Example

$ perf stat -e branches,branch-misses -r 10 ./BP 10000000 0
Performance counter stats for './BP 10000000 0' (10 runs):
    374,121,213  branches                                  ( +- 0.02% )
         23,260  branch-misses    # 0.01% of all branches  ( +- 0.35% )
    0.460392835 seconds time elapsed                       ( +- 0.50% )

$ perf stat -e branches,branch-misses -r 10 ./BP 10000000 1
Performance counter stats for './BP 10000000 1' (10 runs):
    374,040,282  branches                                  ( +- 0.01% )
         23,124  branch-misses    # 0.01% of all branches  ( +- 0.45% )
    0.457583418 seconds time elapsed                       ( +- 0.04% )

$ perf stat -e branches,branch-misses -r 10 ./BP 10000000 2
Performance counter stats for './BP 10000000 2' (10 runs):
    469,331,762  branches                                  ( +- 0.01% )
     15,326,501  branch-misses    # 3.27% of all branches  ( +- 0.01% )
    0.884858777 seconds time elapsed                       ( +- 0.30% )

133

Slide 136

Sniper The Sniper Multi-Core Simulator http://snipersim.org/ 134

Slide 137

Sniper: Branch (Mis)Prediction Example CPI stack: never taken 135

Slide 138

Sniper: Branch (Mis)Prediction Example CPI stack: always taken 136

Slide 139

Sniper: Branch (Mis)Prediction Example CPI stack: randomly taken 137

Slide 140

Sniper: Branch (Mis)Prediction Example CPI graph: never taken 138

Slide 141

Sniper: Branch (Mis)Prediction Example CPI graph: always taken 139

Slide 142

Sniper: Branch (Mis)Prediction Example CPI graph: randomly taken 140

Slide 143

Sniper: Branch (Mis)Prediction Example CPI graph (detailed): always taken 141

Slide 144

Sniper: Branch (Mis)Prediction Example CPI graph (detailed): randomly taken 142

Slide 145

Branch Prediction & Speculative Execution D. Sima, "Decisive aspects in the evolution of microprocessors", Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 143

Slide 146

Block Enlargement Fisher, J. A. (1983). "Very Long Instruction Word architectures and the ELI-512." Proceedings of the 10th Annual International Symposium on Computer Architecture. 144

Slide 147

Block Enlargement Joseph A. Fisher and John J. O'Donnell, "VLIW Machines: Multiprocessors We Can Actually Program," CompCon 84 Proceedings, pp. 299-305, IEEE, 1984. 145

Slide 158

Branch Predictability
• takenness rate?
• transition rate?
• compare:
  • 01010101 (i % 2)
  • 01101101 (i % 3)
  • 10101010 !(i % 2)
  • 10010010 !(i % 3)
  • 00110011 (i / 2) % 2
  • 00011100 (i / 3) % 2
• what they have in common:
  • all predictable!
146
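
A hypothetical driver (my addition, not the deck's code) for experimenting with these patterns: generate one of the sequences above, branch on it, and compare branch-misses across patterns under perf stat -e branches,branch-misses (the compiler may if-convert the branch, so it is worth inspecting the generated assembly):

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100000000;
    std::vector<bool> taken(n);
    for (std::size_t i = 0; i != n; ++i)
        taken[i] = (i % 3) != 0;   // the 01101101... pattern; try i % 2, (i / 2) % 2, ...
    long long t = 0, nt = 0;
    for (std::size_t i = 0; i != n; ++i) {
        if (taken[i]) ++t;         // ~2/3 taken, yet fully predictable
        else ++nt;
    }
    std::printf("taken: %lld, not taken: %lld\n", t, nt);
}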

Slide 159

Branch Predictability & Marker API
https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#using-the-marker-api
https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC

g++ -Ofast -march=native source.cpp -o application -std=c++14 -DLIKWID_PERFMON -lpthread -llikwid
likwid-perfctr -f -C 0-3 -g BRANCH -m ./application

#include <likwid.h>
// . . .
LIKWID_MARKER_START("branch");
// branch code
LIKWID_MARKER_STOP("branch");

147
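
Note (my addition, per the likwid marker-API documentation): in a complete program the marked region also has to be bracketed by the marker lifecycle macros, roughly:

LIKWID_MARKER_INIT;
// ... marked regions as above ...
LIKWID_MARKER_CLOSE;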

Slide 160

Branch Entropy
linear entropy: E_L(p) = 2 × min(p, 1 − p)
intuition: the miss rate is proportional to the probability of the least frequent outcome
148
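
A minimal sketch of the definition above (my addition, not the paper's tooling): E_L is 0 for an always/never-taken branch and peaks at 1 for a 50/50 branch.

#include <algorithm>
#include <cstdio>

double linear_entropy(double p) { return 2.0 * std::min(p, 1.0 - p); }

int main()
{
    for (double p : {0.0, 0.05, 0.5, 0.95, 1.0})
        std::printf("p = %.2f  E_L(p) = %.2f\n", p, linear_entropy(p));
}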

Slide 161

Branch Takenness Probability Sander De Pestel, Stijn Eyerman and Lieven Eeckhout, "Micro-Architecture Independent Branch Behavior Characterization", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 149

Slide 162

Branch Entropy & Miss Rate: Linear Relationship Sander De Pestel, Stijn Eyerman and Lieven Eeckhout, "Micro-Architecture Independent Branch Behavior Characterization", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 150

Slide 163

Branches & Expectations: Code I

#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

#define likely(x) (__builtin_expect(!!(x), 1))
#define unlikely(x) (__builtin_expect(!!(x), 0))
#define unpredictable(x) (__builtin_unpredictable((x)))

151

Slide 164

Slide 164 text

Branches & Expectations: Code II

using T = int;

void f(T z, T & x, T & y) {
    ((z < 0) ? x : y) = 5;
}

void generate_never(std::size_t n, std::vector<T> & zs) {
    zs.reserve(n);
    static std::mt19937 g(1);
    std::uniform_int_distribution<T> z(10, 19);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(z(g));
    return;
}
152

Slide 165

Slide 165 text

Branches & Expectations: Code III

void generate_always(std::size_t n, std::vector<T> & zs) {
    zs.reserve(n);
    static std::mt19937 g(1);
    std::uniform_int_distribution<T> z(-19, -10);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(z(g));
    return;
}

void generate_random(std::size_t n, std::vector<T> & zs) {
    zs.reserve(n);
    static std::mt19937 g(1);
    std::uniform_int_distribution<T> z(-5, 4);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(z(g));
    return;
}
153

Slide 166

Slide 166 text

Branches & Expectations: Code IV

int main(int argc, char * argv[]) {
    // sample size
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    std::cout << "n = " << n << '\n';
    // takenness predictability type
    // 0: never; 1: always; 2: random
    const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0;
    std::cout << "type = " << type << '\t';
    std::vector<T> xs(n), ys(n), zs;
    if (type == 0) { std::cout << "never"; generate_never(n, zs); }
    else if (type == 1) { std::cout << "always"; generate_always(n, zs); }
    else if (type == 2) { std::cout << "random"; generate_random(n, zs); }
    endl(std::cout);
154

Slide 167

Slide 167 text

Branches & Expectations: Code V

    const auto time_start = std::chrono::steady_clock::now();
    T sum = 0;
    for (std::size_t i = 0; i != n; ++i) { f(zs[i], xs[i], ys[i]); }
    const auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    std::cout << "duration: " << duration.count() << '\n';
    endl(std::cout);
    std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n';
    std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n';
}
155

Slide 168

Slide 168 text

Branches & Expectations: Compiling & Timing

g++ -ggdb -std=c++14 -march=native -Ofast ./branches.cpp -o branches_g
clang++ -ggdb -std=c++14 -march=native -Ofast ./branches.cpp -o branches_c

time ./branches_g 1000000 0
time ./branches_g 1000000 1
time ./branches_g 1000000 2
time ./branches_c 1000000 0
time ./branches_c 1000000 1
time ./branches_c 1000000 2
156

Slide 169

Slide 169 text

Branches & Expectations: Timings (GCC)

$ time ./branches_g 1000000 0
n = 1000000
type = 0    never
duration: 0.00082991
sum(xs): 0
sum(ys): 5000000
real 0m0.034s  user 0m0.033s  sys 0m0.003s

$ time ./branches_g 1000000 1
n = 1000000
type = 1    always
duration: 0.000839488
sum(xs): 5000000
sum(ys): 0
real 0m0.031s  user 0m0.030s  sys 0m0.000s

$ time ./branches_g 1000000 2
n = 1000000
type = 2    random
duration: 0.0052968
sum(xs): 2498105
sum(ys): 2501895
real 0m0.038s  user 0m0.033s  sys 0m0.003s
157

Slide 170

Slide 170 text

Branches & Expectations: Timings (Clang)

$ time ./branches_c 1000000 0
n = 1000000
type = 0    never
duration: 0.00091161
sum(xs): 0
sum(ys): 5000000
real 0m0.036s  user 0m0.033s  sys 0m0.000s

$ time ./branches_c 1000000 1
n = 1000000
type = 1    always
duration: 0.000765925
sum(xs): 5000000
sum(ys): 0
real 0m0.036s  user 0m0.033s  sys 0m0.000s

$ time ./branches_c 1000000 2
n = 1000000
type = 2    random
duration: 0.00554585
sum(xs): 2498105
sum(ys): 2501895
real 0m0.041s  user 0m0.040s  sys 0m0.000s
158

Slide 171

Slide 171 text

So many performance events, so little time "So many performance events, so little time," Gerd Zellweger, Denny Lin, Timothy Roscoe. Proceedings of the 7th Asia-Pacific Workshop on Systems (APSys, Hong Kong, China, August 2016). 159

Slide 172

Slide 172 text

Hierarchical cycle accounting Andrzej Nowak, David Levinthal, Willy Zwaenepoel: "Hierarchical cycle accounting: a new method for application performance tuning." ISPASS 2015. https://github.com/David-Levinthal/gooda 160

Slide 173

Slide 173 text

Top-down Microarchitecture Analysis Method (TMAM) https://github.com/andikleen/pmu-tools/wiki/toplev-manual https://sites.google.com/site/analysismethods/yasin-pubs "A Top-Down Method for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 161

Slide 174

Slide 174 text

TMAM: Bottlenecks "A Top-Down Method for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 162

Slide 175

Slide 175 text

TMAM: Breakdown "A Top-Down Method for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 163

Slide 176

Slide 176 text

TMAM: Meaning Updates: https://download.01.org/perfmon/ "A Top-Down Method for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 164

Slide 177

Slide 177 text

Branches & Expectations: TMAM, Level 1 (GCC)

$ ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2
Using level 1.
RUN #1 of 1
perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/
n = 1000000
type = 2    random
duration: 0.00523105
sum(xs): 2498105
sum(ys): 2501895
FE Frontend_Bound: 53.92 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_g 1000000 2
165

Slide 178

Slide 178 text

Branches & Expectations: TMAM, Level 2 (GCC)

$ ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2
Using level 2.
RUN #1 of 2
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1
n = 1000000
type = 2    random
duration: 0.00528841
sum(xs): 2498105
sum(ys): 2501895
RUN #2 of 2
perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,c
n = 1000000
type = 2    random
duration: 0.00550316
sum(xs): 2498105
sum(ys): 2501895
FE Frontend_Bound: 53.94 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 47.54 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
RET Retiring.Microcode_Sequencer: 16.41 % [100.00%]
This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.
Sampling events: idq.ms_uops:u
Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./branches_g 1000000 2
166

Slide 179

Slide 179 text

Branches & Expectations: TMAM, Level 2, perf (GCC)

perf record -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period -o perf.data ./branches_g 1000000 2
perf report -Mintel
167

Slide 180

Slide 180 text

Branches & Expectations: TMAM, Level 1 (Clang)

$ ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2
Using level 1.
RUN #1 of 1
perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/
n = 1000000
type = 2    random
duration: 0.00555177
sum(xs): 2498105
sum(ys): 2501895
FE Frontend_Bound: 45.53 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_c 1000000 2
168

Slide 181

Slide 181 text

Branches & Expectations: TMAM, Level 2 (Clang)

$ ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2
Using level 2.
RUN #1 of 2
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask
n = 1000000
type = 2    random
duration: 0.0055571
sum(xs): 2498105
sum(ys): 2501895
RUN #2 of 2
perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions
n = 1000000
type = 2    random
duration: 0.00556777
sum(xs): 2498105
sum(ys): 2501895
FE Frontend_Bound: 45.54 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 39.20 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
RET Retiring.Microcode_Sequencer: 15.18 % [100.00%]
This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.
Sampling events: idq.ms_uops:u
Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,pe
169

Slide 182

Slide 182 text

Branches & Expectations: TMAM, Level 2, perf (Clang)

perf record -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./branches_c 1000000 2
perf report -Mintel
170

Slide 183

Slide 183 text

Virtual Functions & Indirect Branches: Code I

// headers reconstructed: the slide export stripped the <...> names
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

#define str(s) #s
#define likely(x) (__builtin_expect(!!(x), 1))
#define unlikely(x) (__builtin_expect(!!(x), 0))
#define unpredictable(x) (__builtin_unpredictable(!!(x)))
171

Slide 184

Slide 184 text

Virtual Functions & Indirect Branches: Code II

using T = int;

struct base { virtual T f() const { return 0; } };
struct derived_taken : base { T f() const override { return -1; } };
struct derived_untaken : base { T f() const override { return 1; } };

void f(const base & b, T & x, T & y) {
    ((b.f() < 0) ? x : y) = 119;
}

void generate_never(std::size_t n, std::vector<std::unique_ptr<base>> & zs) {
    zs.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(std::make_unique<derived_untaken>());
    return;
172

Slide 185

Slide 185 text

Virtual Functions & Indirect Branches: Code III

}

void generate_always(std::size_t n, std::vector<std::unique_ptr<base>> & zs)
{
    zs.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(std::make_unique<derived_taken>());
    return;
}

void generate_random(std::size_t n, std::vector<std::unique_ptr<base>> & zs)
{
    zs.reserve(n);
    static std::mt19937 g(1);
    std::bernoulli_distribution z(0.5);
    for (std::size_t i = 0; i != n; ++i) {
        if (z(g)) zs.emplace_back(std::make_unique<derived_taken>());
        else zs.emplace_back(std::make_unique<derived_untaken>());
173

Slide 186

Slide 186 text

Virtual Functions & Indirect Branches: Code IV

    }
    return;
}

int main(int argc, char * argv[]) {
    // sample size
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    std::cout << "n = " << n << '\n';
    // takenness predictability type
    // 0: never; 1: always; 2: random
    std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0;
    std::cout << "type = " << type << '\t';
    std::vector<T> xs(n), ys(n);
    std::vector<std::unique_ptr<base>> zs;
    if (type == 0) { std::cout << "never"; generate_never(n, zs); }
    else if (type == 1) { std::cout << "always"; generate_always(n, zs); }
174

Slide 187

Slide 187 text

Virtual Functions & Indirect Branches: Code V

    else if (type == 2) { std::cout << "random"; generate_random(n, zs); }
    endl(std::cout);

    auto time_start = std::chrono::steady_clock::now();
    T sum = 0;
    for (std::size_t i = 0; i != n; ++i) { f(*zs[i], xs[i], ys[i]); }
    auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    std::cout << "duration: " << duration.count() << '\n';
    endl(std::cout);
    std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n';
    std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n';
}
175

Slide 188

Slide 188 text

Virtual Functions & Indirect Branches: Compiling & Timing

g++ -ggdb -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_g
clang++ -ggdb -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_c

time ./vbranches_g 10000000 0
time ./vbranches_g 10000000 1
time ./vbranches_g 10000000 2
time ./vbranches_c 10000000 0
time ./vbranches_c 10000000 1
time ./vbranches_c 10000000 2
176

Slide 189

Slide 189 text

Virtual Functions & Indirect Branches: Timings (GCC)

$ time ./vbranches_g 10000000 0
n = 10000000
type = 0    never
duration: 0.0338749
sum(xs): 0
sum(ys): 1190000000
real 0m0.645s  user 0m0.573s  sys 0m0.070s

$ time ./vbranches_g 10000000 1
n = 10000000
type = 1    always
duration: 0.0406144
sum(xs): 1190000000
sum(ys): 0
real 0m0.648s  user 0m0.563s  sys 0m0.083s

$ time ./vbranches_g 10000000 2
n = 10000000
type = 2    random
duration: 0.131803
sum(xs): 595154105
sum(ys): 594845895
real 0m0.956s  user 0m0.863s  sys 0m0.090s
177

Slide 190

Slide 190 text

Virtual Functions & Indirect Branches: Timings (Clang)

$ time ./vbranches_c 10000000 0
n = 10000000
type = 0    never
duration: 0.0314749
sum(xs): 0
sum(ys): 1190000000
real 0m0.623s  user 0m0.530s  sys 0m0.090s

$ time ./vbranches_c 10000000 1
n = 10000000
type = 1    always
duration: 0.0314727
sum(xs): 1190000000
sum(ys): 0
real 0m0.623s  user 0m0.557s  sys 0m0.063s

$ time ./vbranches_c 10000000 2
n = 10000000
type = 2    random
duration: 0.0854935
sum(xs): 595154105
sum(ys): 594845895
real 0m1.863s  user 0m1.800s  sys 0m0.063s
178

Slide 191

Slide 191 text

Virtual Functions & Indirect Branches: TMAM, Level 1 (GCC)

$ ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2
Using level 1.
RUN #1 of 1
perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x9c,umask=0x1/u,cycles:u}'
n = 10000000
type = 2    random
duration: 0.131386
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 35.96 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
BAD Bad_Speculation: 12.98 % [100.00%]
This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.
Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_g 10000000 2
179

Slide 192

Slide 192 text

Virtual Functions & Indirect Branches: TMAM, Level 2 (GCC)

$ ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2
Using level 2.
RUN #1 of 2
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0xc5,umask=0x0/u,cp
n = 10000000
type = 2    random
duration: 0.131247
sum(xs): 595154105
sum(ys): 594845895
RUN #2 of 2
perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,cycles:u,cpu/event=0xa3,umask=0x4,cmask=4
n = 10000000
type = 2    random
duration: 0.131361
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 36.02 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 17.41 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
BAD Bad_Speculation: 12.92 % [100.00%]
This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.
BAD Bad_Speculation.Branch_Mispredicts: 12.75 % [100.00%]
This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Sampling events: br_misp_retired.all_branches:u
Sampling: perf record -g -e cycles:pp:u,cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data .
180

Slide 193

Slide 193 text

Virtual Functions & Indirect Branches: TMAM, Level 3 (GCC)

$ ~/builds/pmu-tools/toplev.py -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2
n = 10000000
type = 2    random
duration: 0.13145
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 35.96 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 17.44 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.69 % [100.00%]
This metric represents cycles fraction the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path, following all sorts of miss-predicted branches. For example, branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.
Sampling events: br_misp_retired.all_branches:u
BAD Bad_Speculation: 12.97 % [100.00%]
This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.
BAD Bad_Speculation.Branch_Mispredicts: 12.82 % [100.00%]
This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Sampling events: br_misp_retired.all_branches:u
Sampling: perf record -g -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./vbranches_g 10000000 2
181

Slide 194

Slide 194 text

Virtual Functions: TMAM, Level 3, perf (GCC)

perf record -g -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_g 10000000 2
perf report -Mintel
182

Slide 195

Slide 195 text

Virtual Functions & Indirect Branches: TMAM, Level 1 (Clang)

$ ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2
Using level 1.
RUN #1 of 1
perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/
n = 10000000
type = 2    random
duration: 0.0858722
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 37.66 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_c 10000000 2
183

Slide 196

Slide 196 text

Virtual Functions & Indirect Branches: TMAM, Level 2 (Clang)

$ ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2
Using level 2.
RUN #1 of 2
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask
n = 10000000
type = 2    random
duration: 0.0859943
sum(xs): 595154105
sum(ys): 594845895
RUN #2 of 2
perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions
n = 10000000
type = 2    random
duration: 0.0861661
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 37.61 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 26.64 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
RET Retiring.Microcode_Sequencer: 9.04 % [100.00%]
This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.
Sampling events: idq.ms_uops:u
Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,pe
184

Slide 197

Slide 197 text

Virtual Functions & Indirect Branches: TMAM, Level 3 (Clang)

~/builds/pmu-tools/toplev.py -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 37.65 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 26.63 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
FE Frontend_Bound.Frontend_Latency.MS_Switches: 8.40 % [100.00%]
This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB or MITE pipelines. Certain operations cannot be handled natively by the execution pipeline, and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID, or uncommon conditions like Floating Point Assists when dealing with Denormals.
Sampling events: idq.ms_switches:u
RET Retiring.Microcode_Sequencer: 9.04 % [100.00%]
This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.
Sampling events: idq.ms_uops:u
Sampling: perf record -g -e cycles:pp:u, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,edge=1,cmask=1,name=MS_Switches_IDQ_MS_SWITCHES,period=2000003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./vbranches_c 10000000 2
185

Slide 198

Slide 198 text

Virtual Functions: TMAM, Level 3, perf (Clang)

perf record -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_c 10000000 2
perf report -Mintel
186

Slide 199

Slide 199 text

Branches vs. Predicated Execution https://github.com/mongodb-labs/disasm Interactive Disassembler GUI with optional Intel Architecture Code Analyzer (IACA) integration 187

Slide 200

Slide 200 text

Branches vs. Predicated Execution https://github.com/mongodb-labs/disasm Interactive Disassembler GUI with optional Intel Architecture Code Analyzer (IACA) integration 188
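For comparison with the predicated-execution screenshots, a hypothetical branchless rewrite of f() from the earlier example (my sketch, not necessarily the variant behind the vbranches_cmov_xy binaries above): computing both target addresses and selecting one replaces the control dependence with a data dependence, a form the compiler can often lower to a conditional move.

using T = int;

void f_select(T z, T & x, T & y) {
    T * p[2] = { &y, &x };   // both targets computed unconditionally
    *p[z < 0] = 5;           // data-dependent select: index 1 (x) if z < 0
}

Whether this actually compiles to cmov depends on the compiler and flags; the disassembler/IACA setup on this slide is one way to check.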

Slide 201

Slide 201 text

Compiler-Specific Built-in Functions GCC & Clang: __builtin_expect http://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html likely & unlikely https://kernelnewbies.org/FAQ/LikelyUnlikely Clang: __builtin_unpredictable http://clang.llvm.org/docs/LanguageExtensions.html#builtin-unpredictable 189
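A short usage sketch for the builtins listed above (mine; __builtin_unpredictable is Clang-only, so this snippet assumes Clang):

int process(const int * p, int n, int threshold) {
    if (__builtin_expect(p == nullptr, 0))   // "unlikely": cold error path
        return -1;
    int sum = 0;
    for (int i = 0; i != n; ++i)
        if (__builtin_unpredictable(p[i] < threshold))  // data-dependent,
            sum += p[i];   // near-50/50 condition: hint to prefer predication
    return sum;
}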

Slide 202

Slide 202 text

Branch Misprediction, Speculation, and Wrong-Path Execution J. Reineke et al., “A Definition and Classification of Timing Anomalies,” Proc. Int'l Workshop Worst Case Execution Time (WCET), 2006. 190

Slide 203

Slide 203 text

Branch Misprediction Penalty & Wrong-Path Execution Tejas S. Karkhanis and James E. Smith. 2004. "A First-Order Superscalar Processor Model." In Proceedings of the 31st annual international symposium on Computer architecture (ISCA '04). 191

Slide 204

Slide 204 text

The Curse of Multiple Granularities Seshadri, V. (2016). "Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 192

Slide 205

Slide 205 text

Word Granularity != Cache Line Granularity Seshadri, V. (2016). "Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 193

Slide 206

Slide 206 text

Shortcomings of Strided Access Patterns Seshadri, V. (2016). "Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 194

Slide 207

Slide 207 text

Pointer Chasing Example - Vector http://pythontutor.com/cpp.html https://github.com/pgbovine/opt-cpp-backend 195

Slide 208

Slide 208 text

Pointer Chasing Example - (Singly) Linked List 196

Slide 209

Slide 209 text

Pointer Chasing Example - Doubly Linked List 197

Slide 210

Slide 210 text

Pointer Chasing Example - Linked List - C++

#include <algorithm>
#include <forward_list>
#include <iterator>

bool found(const std::forward_list<int> & list, int value) {
    return find(begin(list), end(list), value) != end(list);
}

int main() {
    std::forward_list<int> list {11, 22, 33, 44, 55};
    return found(list, 42);
}
198

Slide 211

Slide 211 text

Pointer Chasing Example - Linked List - ASM https://godbolt.org/g/rkzQ90 199

Slide 212

Slide 212 text

Pointer Chasing Example - Linked List - ASM (r2) http://rada.re/ 200

Slide 213

Slide 213 text

Pointer Chasing Example - Linked List - CFG (r2) 201

Slide 214

Slide 214 text

Pointer Chasing Example - Linked List - CFG (r2)

radiff2 -g sym.found forward_list_app forward_list_app > forward_list_found.dot
xdot forward_list_found.dot
dot -Tpng -o forward_list_found.png forward_list_found.dot
202

Slide 215

Slide 215 text

Isolated & Clustered Cache Misses Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, and Mateo Valero. 2008. "MLP-aware dynamic cache partitioning." In Proceedings of the 3rd international conference on High performance embedded architectures and compilers (HiPEAC'08). 203

Slide 216

Slide 216 text

Cache Miss Cost & Miss Clustering Thomas R. Puzak, A. Hartstein, P. G. Emma, V. Srinivasan, and Jim Mitchell. 2007. "An analysis of the effects of miss clustering on the cost of a cache miss." In Proceedings of the 4th international conference on Computing frontiers (CF '07). ACM, New York, NY, USA, 3-12.204

Slide 217

Slide 217 text

Cache Miss Penalty: Different STC due to different MLP MLP (memory-level parallelism) & STC (stall-time criticality) R. Das, O. Mutlu, T. Moscibroda and C. R. Das, "Application-aware prioritization mechanisms for on-chip networks," 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), New York, NY, 2009, pp. 280-291. 205

Slide 218

Slide 218 text

Skip Lists William Pugh. 1990. "Skip lists: a probabilistic alternative to balanced trees." Commun. ACM 33, 6, 668-676. 206

Slide 219

Slide 219 text

Jump Pointers S. Chen, P. B. Gibbons, and T. C. Mowry. “Improving Index Performance through Prefetching.” In Proc. of the 20th Annual ACM SIGMOD International Conference on Management of Data, 2001. 207
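A minimal jump-pointer sketch in the spirit of Chen et al. (my illustration; the node layout and helper are assumptions, not the paper's code): each node carries an extra pointer d hops ahead, used only to issue a prefetch while the current node is processed.

struct node {
    int value;
    node * next;
    node * jump;   // points d hops ahead (nullptr near the tail)
};

void set_jump_pointers(node * head, int d) {
    node * lead = head;
    for (int i = 0; i != d && lead; ++i) lead = lead->next;  // advance d hops
    for (node * p = head; p; p = p->next) {
        p->jump = lead;                    // d-ahead node, or nullptr
        if (lead) lead = lead->next;
    }
}

int sum_with_jump_prefetch(node * head) {
    int sum = 0;
    for (node * p = head; p; p = p->next) {
        if (p->jump) __builtin_prefetch(p->jump);  // start the future miss now
        sum += p->value;
    }
    return sum;
}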

Slide 220

Slide 220 text

Prefetching Aggressiveness: Distance & Degree Sparsh Mittal. 2016. "A Survey of Recent Prefetching Techniques for Processor Caches." ACM Comput. Surv. 49, 2, Article 35. 208
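A common first-order rule from the software-prefetching literature (my paraphrase, not the survey's notation): with an expected miss latency of l cycles and a loop body taking s cycles per iteration, the prefetch distance D should satisfy D ≥ ⌈l / s⌉, so the data arrives by the time the loop reaches it; the degree then controls how many future iterations are covered per prefetch issued. The prefetching code later in this section sweeps exactly these two knobs.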

Slide 221

Slide 221 text

Prefetching Timeliness Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. "When Prefetching Works, When It Doesn’t, and Why." ACM Trans. Archit. Code Optim. 9, 1, Article 2. 209

Slide 222

Slide 222 text

Prefetches Classification Huaiyu Zhu, Yong Chen, and Xian-He Sun. 2010. "Timing local streams: improving timeliness in data prefetching." In Proceedings of the 24th ACM International Conference on Supercomputing (ICS '10). ACM, New York, NY, USA, 169-178. 210

Slide 223

Slide 223 text

Prefetching I

// headers reconstructed: the slide export stripped the <...> names
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <future>
#include <memory>
#include <random>
#include <vector>

struct point { double x, y, z; };
using T = point;
211

Slide 224

Slide 224 text

Prefetching II

struct timing_result {
    double duration_initial;
    double duration_non_prefetched;
    double duration_degree;
    double sum_initial;
    double sum_non_prefetched;
    double sum_degree;
};

timing_result chase(std::size_t n, bool shuffle, std::size_t d, bool prefetch) {
    timing_result chase_result;
    std::vector<std::unique_ptr<point>> v;
    for (std::size_t i = 0; i != n; ++i) {
        v.emplace_back(new point{1. * i, 2. * i, 5. * i});
    }
    if (shuffle) {
        std::mt19937 g(1);
212

Slide 225

Slide 225 text

Prefetching III

        std::shuffle(begin(v), end(v), g);
    }

    double sum = 0.0;
    auto time_start = std::chrono::steady_clock::now();
    if (prefetch) {
        for (std::size_t i = 0; i != n; ++i) {
            // note: v[i + d] runs past the end of v for the last d
            // iterations (as on the slide)
            __builtin_prefetch(v[i + d].get());
            sum += std::exp(-v[i]->y);
        }
    } else {
        for (std::size_t i = 0; i != n; ++i) {
            sum += std::exp(-v[i]->y);
        }
    }
    auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    chase_result.duration_initial = duration.count();
    chase_result.sum_initial = sum;
213

Slide 226

Slide 226 text

Prefetching IV

    sum = 0.0;
    time_start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != n; ++i) {
        sum += std::exp(-v[i]->y);
    }
    time_end = std::chrono::steady_clock::now();
    duration = time_end - time_start;
    chase_result.duration_non_prefetched = duration.count();
    chase_result.sum_non_prefetched = sum;

    sum = 0.0;
    time_start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != n; ++i) {
        __builtin_prefetch(v[i + d].get());      // degree two: two prefetches
        __builtin_prefetch(v[i + 2*d].get());    // per iteration
        sum += std::exp(-v[i]->y);
    }
    time_end = std::chrono::steady_clock::now();
214

Slide 227

Slide 227 text

Prefetching V

    duration = time_end - time_start;
    chase_result.duration_degree = duration.count();
    chase_result.sum_degree = sum;
    return chase_result;
}

int main(int argc, char * argv[]) {
    // sample size
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100;
    const bool shuffle = (argc > 2) ? std::atoi(argv[2]) : false;
    const std::size_t d = (argc > 3) ? std::atoll(argv[3]) : 3;
    const bool prefetch = (argc > 4) ? std::atoi(argv[4]) : false;
    const std::size_t threads_count = (argc > 5) ? std::atoll(argv[5]) : 4;
    printf("size: %zu \n", n);
    printf("shuffle: %d \n", shuffle);
    printf("distance: %zu \n", d);
    printf("prefetch: %d \n", prefetch);
215

Slide 228

Slide 228 text

Prefetching VI

    printf("threads_count: %zu \n", threads_count);  // %zu for std::size_t

    const auto thread_work = [n, shuffle, d, prefetch]() {
        return chase(n, shuffle, d, prefetch);
    };

    std::vector<std::future<timing_result>> results;
    for (std::size_t thread = 0; thread != threads_count; ++thread)
        results.emplace_back(std::async(std::launch::async, thread_work));
    for (auto && future_result : results)
        if (future_result.valid())
            future_result.wait();

    std::vector<double> timings_initial, timings_non_prefetched, timings_degree;
    for (auto && future_result : results) {
        timing_result chase_result = future_result.get();
        timings_initial.push_back(chase_result.duration_initial);
216

Slide 229

Slide 229 text

Prefetching VII

        timings_non_prefetched.push_back(chase_result.duration_non_prefetched);
        timings_degree.push_back(chase_result.duration_degree);
    }

    const auto timings_initial_minmax =
        std::minmax_element(begin(timings_initial), end(timings_initial));
    const auto timings_non_prefetched_minmax =
        std::minmax_element(begin(timings_non_prefetched), end(timings_non_prefetched));
    const auto timings_degree_minmax =
        std::minmax_element(begin(timings_degree), end(timings_degree));

    printf(prefetch ? "prefetched" : "non-prefetched");
    printf(" initial duration: [%g, %g] \n",
           *timings_initial_minmax.first, *timings_initial_minmax.second);
    printf("non-prefetched duration: [%g, %g] \n",
           *timings_non_prefetched_minmax.first, *timings_non_prefetched_minmax.second);
    printf("degree-two prefetching duration: [%g, %g] \n",
           *timings_degree_minmax.first, *timings_degree_minmax.second);
}
217
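A plausible build line for the listing above (assumed; the slides do not show one, but it matches the pattern used for branches.cpp earlier, and std::async needs pthreads on Linux):

g++ -ggdb -std=c++14 -march=native -Ofast ./prefetch.cpp -o prefetch -lpthread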

Slide 230

Slide 230 text

Prefetch Overhead S. Van der Wiel and D. Lilja, "A Survey of Data Prefetching Techniques," Technical Report No. HPPC 96-05, University of Minnesota, October 1996. 218

Slide 231

Slide 231 text

Prefetching Timings: No Prefetch

$ likwid-perfctr -f -C 0-3 -g L3 -m ./prefetch 100000 1 0 0 4
distance: 0
prefetch: 0
non-prefetched initial duration: [0.00280393, 0.00289815]
non-prefetched duration: [0.00254968, 0.00257311]
degree-two prefetching duration: [0.00290615, 0.00296243]

Region chase_initial, Group 1: L3
| CPI STAT | 5.8641 | 1.4529 | 1.4744 | 1.4660 |
| L3 bandwidth [MBytes/s] STAT | 10733.6308 | 2666.0364 | 2710.9325 | 2683.4077 |

Region chase_initial, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| L3 miss rate STAT | 0.0584 | 0.0145 | 0.0148 | 0.0146 |
| L3 miss ratio STAT | 3.7723 | 0.9117 | 0.9789 | 0.9431 |

$ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 0 0 4
| Cycles without execution [%] STAT | 228.2316 | 56.8136 | 57.4443 | 57.0579 |
| Cycles without execution [%] STAT | 227.0385 | 56.5980 | 57.0024 | 56.7596 |
219

Slide 232

Slide 232 text

Prefetching Timings: useless 0-distance prefetch (overhead)

$ likwid-perfctr -f -C 0-3 -g L3 -m ./prefetch 100000 1 0 1 4
distance: 0
prefetch: 1
prefetched initial duration: [0.00288751, 0.00295978]
non-prefetched duration: [0.0025575, 0.00258342]
degree-two prefetching duration: [0.00285772, 0.00287839]

Region chase_initial, Group 1: L3
| CPI STAT | 5.7454 | 1.4345 | 1.4387 | 1.4364 |
| L3 bandwidth [MBytes/s] STAT | 10518.6383 | 2618.5405 | 2645.6096 | 2629.6596 |
220

Slide 233

Slide 233 text

Prefetching Timings: 1-distance prefetch (mostly overhead)

$ likwid-perfctr -f -C 0-3 -g L3CACHE -m ./prefetch 100000 1 1 1 4
prefetched initial duration: [0.00250957, 0.00257662]
non-prefetched duration: [0.00255286, 0.00258417]
degree-two prefetching duration: [0.00230482, 0.00235828]

Region chase_initial, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| CPI STAT | 4.9595 | 1.2343 | 1.2433 | 1.2399 |
| L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 |
| L3 miss ratio STAT | 2.0889 | 0.4381 | 0.6454 | 0.5222 |

$ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 1 1 4
| Cycles without execution [%] STAT | 214.1614 | 53.4628 | 53.6716 | 53.5404 |
| Cycles without execution [%] STAT | 200.4785 | 50.0405 | 50.1857 | 50.1196 |

Formulas:
L3 request rate = MEM_LOAD_UOPS_RETIRED_L3_ALL/UOPS_RETIRED_ALL
L3 miss rate = MEM_LOAD_UOPS_RETIRED_L3_MISS/UOPS_RETIRED_ALL
L3 miss ratio = MEM_LOAD_UOPS_RETIRED_L3_MISS/MEM_LOAD_UOPS_RETIRED_L3_ALL
https://github.com/RRZE-HPC/likwid/blob/master/groups/ivybridge/L3CACHE.txt
221

Slide 234

Slide 234 text

Prefetching Timings: 2-distance prefetch

$ likwid-perfctr -f -C 0-3 -g L3CACHE -m ./prefetch 100000 1 2 1 4
size: 100000
shuffle: 1
distance: 2
prefetch: 1
threads_count: 4
prefetched initial duration: [0.0023392, 0.00241287]
non-prefetched duration: [0.00257006, 0.00260938]
degree-two prefetching duration: [0.00199431, 0.00203528]

Region chase_initial, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| CPI STAT | 4.5557 | 1.1331 | 1.1423 | 1.1389 |
| L3 request rate STAT | 0.0006 | 0.0001 | 0.0002 | 0.0002 |
| L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 |
| L3 miss ratio STAT | 2.2317 | 0.3138 | 0.6791 | 0.5579 |

Region chase_degree, Group 1: L3CACHE
| CPI STAT | 3.6990 | 0.9243 | 0.9253 | 0.9248 |
| L3 request rate STAT | 0.0005 | 0.0001 | 0.0002 | 0.0001 |
| L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 |
| L3 miss ratio STAT | 2.0145 | 0.3597 | 0.6550 | 0.5036 |
222

Slide 235

Slide 235 text

Prefetching Timings: 8-distance prefetch

$ likwid-perfctr -f -C 0-3 -g L3CACHE -m ./prefetch 100000 1 8 1 4
size: 100000
shuffle: 1
distance: 8
prefetch: 1
threads_count: 4
prefetched initial duration: [0.00181161, 0.00188783]
non-prefetched duration: [0.00257601, 0.0026076]
degree-two prefetching duration: [0.00152468, 0.00156814]

Region chase_initial, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| Runtime (RDTSC) [s] STAT | 0.0065 | 0.0016 | 0.0017 | 0.0016 |
| CPI STAT | 3.4808 | 0.8650 | 0.8788 | 0.8702 |
| L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 |
| L3 miss ratio STAT | 2.2431 | 0.4694 | 0.6640 | 0.5608 |

Region chase_degree, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| Runtime (RDTSC) [s] STAT | 0.0053 | 0.0013 | 0.0014 | 0.0013 |
| CPI STAT | 2.7450 | 0.6832 | 0.6882 | 0.6863 |
| L3 miss rate STAT | 0.0016 | 0.0004 | 0.0004 | 0.0004 |
| L3 miss ratio STAT | 3.4045 | 0.7778 | 0.9346 | 0.8511 |
223

Slide 236

Slide 236 text

Prefetching Timings: 8-distance prefetch

$ likwid-perfctr -f -C 0-3 -g L3 -m ./prefetch 100000 1 8 1 4
size: 100000
shuffle: 1
distance: 8
prefetch: 1
threads_count: 4
prefetched initial duration: [0.00180738, 0.00189831]
non-prefetched duration: [0.00254486, 0.00258013]
degree-two prefetching duration: [0.00154542, 0.00158065]

Region chase_initial, Group 1: L3
| CPI STAT | 3.5027 | 0.8668 | 0.8835 | 0.8757 |
| L3 bandwidth [MBytes/s] STAT | 17384.8731 | 4296.5905 | 4381.7164 | 4346.2183 |

Region chase_degree, Group 1: L3
| Metric | Sum | Min | Max | Avg |
| CPI STAT | 2.7626 | 0.6894 | 0.6919 | 0.6906 |
| L3 bandwidth [MBytes/s] STAT | 21505.6670 | 5333.6653 | 5396.4473 | 5376.4168 |

$ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 8 1 4
| Cycles without execution [%] STAT | 187.6689 | 46.3938 | 47.3055 | 46.9172 |
| Cycles without execution [%] STAT | 151.5095 | 37.6872 | 38.0656 | 37.8774 |
224

Slide 237

Slide 237 text

Prefetching Timings: suboptimal (untimely) prefetch

$ likwid-perfctr -f -C 0-3 -g L3 -m ./prefetch 100000 1 512 1 4
size: 100000
shuffle: 1
distance: 512
prefetch: 1
threads_count: 4
prefetched initial duration: [0.00177956, 0.00186644]
non-prefetched duration: [0.00257188, 0.0026064]
degree-two prefetching duration: [0.00173249, 0.00178712]

Region chase_initial, Group 1: L3
| CPI STAT | 3.4343 | 0.8523 | 0.8683 | 0.8586 |
| L3 data volume [GBytes] STAT | 0.0293 | 0.0073 | 0.0074 | 0.0073 |

Region chase_degree, Group 1: L3
| Metric | Sum | Min | Max | Avg |
| CPI STAT | 3.1891 | 0.7903 | 0.8034 | 0.7973 |
| L3 bandwidth [MBytes/s] STAT | 19902.4764 | 4954.4107 | 5013.4006 | 4975.6191 |
225

Slide 238

Slide 238 text

Gem5 http://www.gem5.org/ 226

Slide 239

Slide 239 text

Gem5 - std::vector & std::list I Filling with numbers - std::vector vs. std::list Machine code & assembly (std::vector) Micro-ops execution breakdown (std::vector) Assembly is Too High Level: http://xlogicx.net/?p=369 227

Slide 240

Slide 240 text

Gem5 - std::vector & std::list II Micro-ops pipeline stages (std::vector) 228

Slide 241

Slide 241 text

Gem5 - std::vector & std::list III Pipeline diagram - one iteration (std::vector) Pipeline diagram - three iterations (std::vector) 229

Slide 242

Slide 242 text

Gem5 - std::vector & std::list IV Machine code & assembly (std::list) heap allocation in the loop @ 400d85 what could possibly go wrong? 230

Slide 243

Slide 243 text

std::list - one iteration

Slide 244

Slide 244 text

std::list - one iteration (continued...)

Slide 245

Slide 245 text

std::list - one iteration (...continued still)

Slide 246

Slide 246 text

std::list - one iteration (...done!)

Slide 247

Slide 247 text

(The GNU C library's) malloc https://sourceware.org/glibc/wiki/MallocInternals Arena A structure that is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are "free". Threads assigned to each arena will allocate memory from that arena's free lists. Glibc Heap Analysis in Linux Systems with Radare2 https://youtube.com/watch?v=Svm5V4leEho r2con-2016 - rada.re/con/ 235
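A tiny sketch (mine, not from the slides) for poking at this interactively: glibc's malloc_stats() prints per-arena system/in-use byte totals to stderr, a quick way to watch arenas and free-list state change.

#include <malloc.h>   // glibc-specific: malloc_stats()
#include <cstdlib>

int main() {
    void * p = std::malloc(1 << 20);
    malloc_stats();   // "Arena 0: ..." system bytes / in use bytes
    std::free(p);
    malloc_stats();   // in-use bytes drop; the chunk is back on a free list
}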

Slide 248

Slide 248 text

malloc & free - new, new[], delete, delete[]

int main() {
    double * a = new double[8];
    double * b = new double[8];
    delete[] b;
    delete[] a;
    double * c = new double[8];
    delete[] c;
}
236

Slide 249

Slide 249 text

new[] & delete[] - dmhg 1/6 237

Slide 250

Slide 250 text

new[] & delete[] - dmhg 2/6 238

Slide 251

Slide 251 text

new[] & delete[] - dmhg 3/6 239

Slide 252

Slide 252 text

new[] & delete[] - dmhg 4/6 240

Slide 253

Slide 253 text

new[] & delete[] - dmhg 5/6 241

Slide 254

Slide 254 text

new[] & delete[] - dmhg 6/6 242

Slide 255

Slide 255 text

Memory Access Patterns: Temporal & Spatial Locality horizontal axis - time vertical axis - address D. J. Hatfield and J. Gerald. "Program restructuring for virtual memory." IBM Systems Journal, 10(3):168–192, 1971. 243
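A minimal illustration of spatial locality (mine, not from the slides): the same reduction over an n×n row-major matrix, with unit-stride and stride-n traversal.

#include <cstddef>
#include <vector>

double sum_rows(const std::vector<double> & m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i != n; ++i)
        for (std::size_t j = 0; j != n; ++j)
            s += m[i * n + j];   // unit stride: every element of a fetched
    return s;                    // cache line is used
}

double sum_cols(const std::vector<double> & m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j != n; ++j)
        for (std::size_t i = 0; i != n; ++i)
            s += m[i * n + j];   // stride n: for large n, a new cache line
    return s;                    // per access, same data volume touched
}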

Slide 256

Slide 256 text

Loop Fusion
0.429504s (unfused) down to 0.287501s (fused)
g++ -Ofast -march=native (5.2.0)

void unfused(double * a, double * b, double * c, double * d, size_t N) {
    for (size_t i = 0; i != N; ++i)
        a[i] = b[i] * c[i];
    for (size_t i = 0; i != N; ++i)
        d[i] = a[i] * c[i];
}

void fused(double * a, double * b, double * c, double * d, size_t N) {
    for (size_t i = 0; i != N; ++i) {
        a[i] = b[i] * c[i];
        d[i] = a[i] * c[i];
    }
}
244

Slide 257

Slide 257 text

Pin - A Dynamic Binary Instrumentation Tool
http://www.intel.com/software/pintool

pin -t $PIN_ROOT/source/tools/ManualExamples/obj-intel64/pinatrace.so -- ./loop_fusion

. . .
0x400e43,R,0x401c48
0x400e59,R,0x401d40
0x400e65,W,0x1c789c0
0x400e65,W,0x1c789e0
. . .

r-project.org rstudio.com ggplot2.org rcpp.org
245

Slide 258

Slide 258 text

Loop Fusion: unfused over time PC: Program Counter (instruction pointer) 246

Slide 259

Slide 259 text

Loop Fusion: unfused space-time PC: Program Counter (instruction pointer) MA: Memory Address (array element pointer) 247

Slide 260

Slide 260 text

Loop Fusion: unfused space over time MA: Memory Address (array element pointer) 248

Slide 261

Slide 261 text

Loop Fusion: fused over time PC: Program Counter (instruction pointer) 249

Slide 262

Slide 262 text

Loop Fusion: fused space-time PC: Program Counter (instruction pointer) MA: Memory Address (array element pointer) 250

Slide 263

Slide 263 text

Loop Fusion: fused space over time MA: Memory Address (array element pointer) 251

Slide 264

Slide 264 text

Takeaway: Overlapping Latencies as a General Principle Overlapping latencies also works on a "macro" scale • load as "get the data from the Internet" • compute as "process the data" Another example: Communication Avoiding and Overlapping for Numerical Linear Algebra • https://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-65.html • http://www.cs.berkeley.edu/~egeor/sc12_slides_final.pdf 252
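A macro-scale sketch of the same principle (mine; fetch_quotes and process are hypothetical stand-ins for "get the data from the Internet" and "process the data"): kick off the next download before processing the current one, so network latency overlaps with compute, which is what produces the overlapped timings on the next slides.

#include <cstddef>
#include <future>
#include <string>
#include <vector>

// stubs standing in for real I/O and real computation
std::string fetch_quotes(const std::string & symbol) { return symbol; }
double process(const std::string & raw_quotes) { return raw_quotes.size(); }

double run(const std::vector<std::string> & symbols) {
    double total = 0.0;
    if (symbols.empty()) return total;
    auto pending = std::async(std::launch::async, fetch_quotes, symbols[0]);
    for (std::size_t i = 0; i != symbols.size(); ++i) {
        std::string raw = pending.get();             // wait for current data
        if (i + 1 != symbols.size())                 // overlap: start the next
            pending = std::async(std::launch::async, // fetch before computing
                                 fetch_quotes, symbols[i + 1]);
        total += process(raw);                       // compute while it loads
    }
    return total;
}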

Slide 265

Slide 265 text

Non-Overlapped Timings

id,symbol,count,time
1,AAPL,565449,1.59043
2,AXP,731366,3.43745
3,BA,867366,5.40218
4,CAT,830327,7.08103
5,CSCO,400440,8.49192
6,CVX,687198,9.98761
7,DD,910932,12.2254
8,DIS,910430,14.058
9,GE,871676,15.8333
10,GS,280604,17.059
11,HD,556611,18.2738
12,IBM,860071,20.3876
13,INTC,559127,21.9856
14,JNJ,724724,25.5534
15,JPM,500473,26.576
16,KO,864903,28.5405
17,MCD,717021,30.087
18,MMM,698996,31.749
19,MRK,733948,33.2642
20,MSFT,475451,34.3134
21,NKE,556344,36.4545
253

Slide 266

Slide 266 text

Overlapped Timings

id,symbol,count,time
1,AAPL,565449,2.00713
2,AXP,731366,2.09158
3,BA,867366,2.13468
4,CAT,830327,2.19194
5,CSCO,400440,2.19197
6,CVX,687198,2.19198
7,DD,910932,2.51895
8,DIS,910430,2.51898
9,GE,871676,2.51899
10,GS,280604,2.519
11,HD,556611,2.51901
12,IBM,860071,2.51902
13,INTC,559127,2.51902
14,JNJ,724724,2.51903
15,JPM,500473,2.51904
16,KO,864903,2.51905
17,MCD,717021,2.51906
18,MMM,698996,2.51907
19,MRK,733948,2.51908
20,MSFT,475451,2.51908
21,NKE,556344,2.51909
254

Slide 267

Slide 267 text

Visualizing & Monitoring Performance https://github.com/Celtoys/Remotery 255

Slide 268

Slide 268 text

Timeline: Without Overlapping 256

Slide 269

Slide 269 text

Timeline: With Overlapping 257

Slide 270

Slide 270 text

Cache Misses, MLP, and STC: Slack R. Das et al., "Aérgia: Exploiting Packet Latency Slack in On-Chip Networks," Proc. 37th Ann. Int’l Symp. Computer Architecture (ISCA 10), ACM Press, 2010. 258

Slide 271

Slide 271 text

Dependent Cache Misses - Non-Overlapped - Serialized A Day in the Life of a Cache Miss Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and James E. Smith, "A Performance Counter Architecture for Computing Accurate CPI Components", ASPLOS 2006, pp. 175-184. 1. load instruction enters the window (ROB) 2. the load issues from the instruction buffer (RS) 3. the load blocks the ROB head 4. ROB eventually fills 5. dispatch stops, instruction window drains 6. eventually issue and commit stop

Slide 272

Slide 272 text

Independent Cache Misses in ROB - Overlapped Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith, "A Top-Down Approach to Architecting CPI Component Performance Counters", IEEE Micro, Special Issue on Top Picks from 2006 Microarchitecture Conferences, Vol 27, No 1, pp. 84-93. 260

Slide 273

Slide 273 text

Miss-Dependent Mispredicted Branch - Penalties Serialization S. Eyerman, J.E. Smith and L. Eeckhout, "Characterizing the branch misprediction penalty", Performance Analysis of Systems and Software 2006 IEEE International Symposium on 2006, pp. 48-58. 261

Slide 274

Slide 274 text

Dependent Cache Misses - Non-Overlapped - Serialized Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. "Accelerating Dependent Cache Misses with an Enhanced Memory Controller." In ISCA, 2016. 262

Slide 275

Slide 275 text

Independent Misses Connected by a Pending Cache Hit • MLP - supported by non-blocking caches, out-of-order execution • multiple outstanding cache-misses - Miss Status Holding Registers (MSHRs) / Line Fill Buffers (LFBs) • MSHR file entries - merging redundant (same cache line) memory requests Xi E. Chen and Tor M. Aamodt. 2008. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41). 263

Slide 276

Slide 276 text

Independent Misses Connected by a Pending Cache Hit Xi E. Chen and Tor M. Aamodt. 2011. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 264

Slide 277

Slide 277 text

Finite MSHRs => Finite MLP Xi E. Chen and Tor M. Aamodt. 2011. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 265

Slide 278

Slide 278 text

Cache Miss Penalty: Leading Edge & Trailing Edge "The End of Scaling? Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node," Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 266

Slide 279

Slide 279 text

Cache Miss Penalty: Bandwidth Utilization Impact "The End of Scaling? Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node," Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 267

Slide 280

Slide 280 text

Memory Capacity & Multicore Processors Memory utilization even more important - contention for capacity & bandwidth! "Disaggregated Memory Architectures for Blade Servers," Kevin Te-Ming Lim, Ph.D. Thesis, The University of Michigan, 2010. 268

Slide 281

Slide 281 text

Multicore: Sequential / Parallel Execution Model L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 269

Slide 282

Slide 282 text

Multicore: Amdahl's Law, Strong Scaling "Reevaluating Amdahl's Law," John L. Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 270

Slide 283

Slide 283 text

Multicore: Gustafson's Law, Weak Scaling "Reevaluating Amdahl's Law," John L. Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 271

Slide 284

Slide 284 text

Amdahl's Law Optimistic

Assumes perfect parallelism of the parallel portion: Only Serial Bottlenecks, No Parallel Bottlenecks

Counterpoint: https://blogs.msdn.microsoft.com/ddperf/2009/04/29/parallel-scalability-isnt-childs-play-part-2-amdahls-law-vs-gunthers-law/ 272
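For reference, a minimal sketch of the two laws from the preceding slides (the standard textbook formulas, not code from the talk): Amdahl's speedup for a fixed-size problem and Gustafson's scaled speedup.

#include <cstdio>

// Amdahl (strong scaling): f = parallelizable fraction, p = #cores.
double amdahl(double f, double p) { return 1.0 / ((1.0 - f) + f / p); }

// Gustafson (weak scaling): s = serial fraction of the scaled run.
double gustafson(double s, double p) { return p - s * (p - 1.0); }

int main()
{
    for (double p : {2.0, 4.0, 8.0, 16.0})
        std::printf("p = %4.0f  Amdahl(f = 0.9): %5.2f  Gustafson(s = 0.1): %5.2f\n",
                    p, amdahl(0.9, p), gustafson(0.1, p));
}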

Slide 285

Slide 285 text

Multicore: Synchronization, Actual Scaling M. A. Suleman, M. K. Qureshi, and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 273

Slide 286

Slide 286 text

Multicore: Communication, Actual Scaling M. A. Suleman, M. K. Qureshi, and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 274

Slide 287

Slide 287 text

Multicore & DRAM: AoS I

#include <cstddef>
#include <cstdlib>
#include <future>
#include <iostream>
#include <random>
#include <vector>
#include <boost/timer/timer.hpp>

struct contract
{
    double K;
    double T;
    double P;
};

using element = contract;
using container = std::vector<element>; 275

Slide 288

Slide 288 text

Multicore & DRAM: AoS II

double sum_if(const container & a, const container & b,
              const std::vector<std::size_t> & index)
{
    double sum = 0.0;
    for (std::size_t i = 0, n = index.size(); i != n; ++i)
    {
        std::size_t j = index[i];
        if (a[j].K == b[j].K) sum += a[j].K;
    }
    return sum;
}

template <typename F>
double average(F f, std::size_t m)
{
    double average = 0.0;
    for (std::size_t i = 0; i != m; ++i)
        average += f() / m;
    return average;
} 276

Slide 289

Slide 289 text

Multicore & DRAM: AoS III

std::vector<std::size_t> index_stream(std::size_t n)
{
    std::vector<std::size_t> index;
    index.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        index.push_back(i);
    return index;
}

std::vector<std::size_t> index_random(std::size_t n)
{
    std::vector<std::size_t> index;
    index.reserve(n);
    std::random_device rd;
    static std::mt19937 g(rd());
    std::uniform_int_distribution<std::size_t> u(0, n - 1);
    for (std::size_t i = 0; i != n; ++i)
        index.push_back(u(g));
    return index;
} 277

Slide 290

Slide 290 text

Multicore & DRAM: AoS IV

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10;
    std::cout << "n = " << n << '\n';
    std::cout << "m = " << m << '\n';

    const std::size_t threads_count = 4;
    // thread access locality type
    // 0: none (default); 1: stream; 2: random
    std::vector<int> thread_type(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        thread_type[thread] = (argc > 3 + thread) ? std::atoll(argv[3 + thread]) : 0;
        std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n';
    } 278

Slide 291

Slide 291 text

Multicore & DRAM: AoS V

    endl(std::cout);

    std::vector<std::vector<std::size_t>> index(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        index[thread].resize(n);
        if (thread_type[thread] == 1) index[thread] = index_stream(n);
        else if (thread_type[thread] == 2) index[thread] = index_random(n);
    }

    const container v1(n, {1.0, 0.5, 3.0});
    const container v2(n, {1.0, 2.0, 1.0});

    const auto thread_work = [m, &v1, &v2](const auto & thread_index)
    {
        const auto f = [&v1, &v2, &thread_index]
        { return sum_if(v1, v2, thread_index); };
        return average(f, m);
    }; 279

Slide 292

Slide 292 text

Multicore & DRAM: AoS VI

    boost::timer::auto_cpu_timer timer;

    std::vector<std::future<double>> results;
    results.reserve(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        results.emplace_back(std::async(std::launch::async,
            [thread, &thread_work, &index]
            { return thread_work(index[thread]); }));
    }
    for (auto && result : results) if (result.valid()) result.wait();
    for (auto && result : results) std::cout << result.get() << '\n';
} 280

Slide 293

Slide 293 text

Multicore & DRAM: AoS Timings

1 thread, sequential access

$ ./DRAM_CMP 10000000 10 1
n = 10000000
m = 10
thread_type[0] = 1

1e+007
0.395408s wall, 0.406250s user + 0.000000s system = 0.406250s CPU (102.7%) 281

Slide 294

Slide 294 text

Multicore & DRAM: AoS Timings

1 thread, random access

$ ./DRAM_CMP 10000000 10 2
n = 10000000
m = 10
thread_type[0] = 2

1e+007
5.348314s wall, 5.343750s user + 0.000000s system = 5.343750s CPU (99.9%) 282

Slide 295

Slide 295 text

Multicore & DRAM: AoS Timings

4 threads, sequential access

$ ./DRAM_CMP 10000000 10 1 1 1 1
n = 10000000
m = 10
thread_type[0] = 1
thread_type[1] = 1
thread_type[2] = 1
thread_type[3] = 1

1e+007
1e+007
1e+007
1e+007
0.508894s wall, 2.000000s user + 0.000000s system = 2.000000s CPU (393.0%) 283

Slide 296

Slide 296 text

Multicore & DRAM: AoS Timings

4 threads: 3 sequential access + 1 random access

$ ./DRAM_CMP 10000000 10 1 1 1 2
n = 10000000
m = 10
thread_type[0] = 1
thread_type[1] = 1
thread_type[2] = 1
thread_type[3] = 2

1e+007
1e+007
1e+007
1e+007
5.666049s wall, 7.265625s user + 0.000000s system = 7.265625s CPU (128.2%) 284

Slide 297

Slide 297 text

Multicore & DRAM: AoS Timings

Memory Access Patterns & Multicore: Interactions Matter

Inter-thread Interference: Sharing - Contention - Interference - Slowdown

Threads using a shared resource (such as on-chip/off-chip interconnects and memory) contend for it, interfering with each other's progress and slowing each other down (and thus yielding negative returns to an increased thread count).

cf. Thomas Moscibroda and Onur Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," Microsoft Research Technical Report, MSR-TR-2007-15, February 2007. 285

Slide 298

Slide 298 text

Multicore & DRAM: SoA I

#include <cstddef>
#include <cstdlib>
#include <future>
#include <iostream>
#include <random>
#include <vector>
#include <boost/timer/timer.hpp>

// SoA (structure-of-arrays)
struct data
{
    std::vector<double> K;
    std::vector<double> T;
    std::vector<double> P;
}; 286

Slide 299

Slide 299 text

Multicore & DRAM: SoA II

double sum_if(const data & a, const data & b,
              const std::vector<std::size_t> & index)
{
    double sum = 0.0;
    for (std::size_t i = 0, n = index.size(); i != n; ++i)
    {
        std::size_t j = index[i];
        if (a.K[j] == b.K[j]) sum += a.K[j];
    }
    return sum;
}

template <typename F>
double average(F f, std::size_t m)
{
    double average = 0.0;
    for (std::size_t i = 0; i != m; ++i)
    {
        average += f() / m;
    } 287

Slide 300

Slide 300 text

Multicore & DRAM: SoA III

    return average;
}

std::vector<std::size_t> index_stream(std::size_t n)
{
    std::vector<std::size_t> index;
    index.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        index.push_back(i);
    return index;
}

std::vector<std::size_t> index_random(std::size_t n)
{
    std::vector<std::size_t> index;
    index.reserve(n);
    std::random_device rd;
    static std::mt19937 g(rd());
    std::uniform_int_distribution<std::size_t> u(0, n - 1); 288

Slide 301

Slide 301 text

Multicore & DRAM: SoA IV

    for (std::size_t i = 0; i != n; ++i)
        index.push_back(u(g));
    return index;
}

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10;
    std::cout << "n = " << n << '\n';
    std::cout << "m = " << m << '\n';

    const std::size_t threads_count = 4;
    // thread access locality type
    // 0: none (default); 1: stream; 2: random
    std::vector<int> thread_type(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    { 289

Slide 302

Slide 302 text

Multicore & DRAM: SoA V

        thread_type[thread] = (argc > 3 + thread) ? std::atoll(argv[3 + thread]) : 0;
        std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n';
    }
    endl(std::cout);

    std::vector<std::vector<std::size_t>> index(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        index[thread].resize(n);
        if (thread_type[thread] == 1) index[thread] = index_stream(n);
        else if (thread_type[thread] == 2) index[thread] = index_random(n);
    }

    data v1;
    v1.K.resize(n, 1.0);
    v1.T.resize(n, 0.5);
    v1.P.resize(n, 3.0); 290

Slide 303

Slide 303 text

Multicore & DRAM: SoA VI

    data v2;
    v2.K.resize(n, 1.0);
    v2.T.resize(n, 2.0);
    v2.P.resize(n, 1.0);

    const auto thread_work = [m, &v1, &v2](const auto & thread_index)
    {
        const auto f = [&v1, &v2, &thread_index]
        { return sum_if(v1, v2, thread_index); };
        return average(f, m);
    }; 291

Slide 304

Slide 304 text

Multicore & DRAM: SoA VII

    boost::timer::auto_cpu_timer timer;

    std::vector<std::future<double>> results;
    results.reserve(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        results.emplace_back(std::async(std::launch::async,
            [thread, &thread_work, &index]
            { return thread_work(index[thread]); }));
    }
    for (auto && result : results) if (result.valid()) result.wait();
    for (auto && result : results) std::cout << result.get() << '\n';
} 292

Slide 305

Slide 305 text

Multicore & DRAM: SoA Timings

1 thread, sequential access

$ ./DRAM_CMP.SoA 10000000 10 1
n = 10000000
m = 10
thread_type[0] = 1

1e+007
0.211877s wall, 0.203125s user + 0.000000s system = 0.203125s CPU (95.9%) 293

Slide 306

Slide 306 text

Multicore & DRAM: SoA Timings

1 thread, random access

$ ./DRAM_CMP.SoA 10000000 10 2
n = 10000000
m = 10
thread_type[0] = 2

1e+007
4.534646s wall, 4.546875s user + 0.000000s system = 4.546875s CPU (100.3%) 294

Slide 307

Slide 307 text

Multicore & DRAM: SoA Timings

4 threads, sequential access

$ ./DRAM_CMP.SoA 10000000 10 1 1 1 1
n = 10000000
m = 10
thread_type[0] = 1
thread_type[1] = 1
thread_type[2] = 1
thread_type[3] = 1

1e+007
1e+007
1e+007
1e+007
0.256391s wall, 1.031250s user + 0.000000s system = 1.031250s CPU (402.2%) 295

Slide 308

Slide 308 text

Multicore & DRAM: SoA Timings

4 threads: 3 sequential access + 1 random access

$ ./DRAM_CMP.SoA 10000000 10 1 1 1 2
n = 10000000
m = 10
thread_type[0] = 1
thread_type[1] = 1
thread_type[2] = 1
thread_type[3] = 2

1e+007
1e+007
1e+007
1e+007
4.581033s wall, 5.265625s user + 0.000000s system = 5.265625s CPU (114.9%) 296

Slide 309

Slide 309 text

Multicore & DRAM: SoA Timings Better Access Patterns yield Better Single-core Performance but also Reduced Interference and thus Better Multi-core Performance 297

Slide 310

Slide 310 text

Multicore: Arithmetic Intensity L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 298

Slide 311

Slide 311 text

Multicore: Synchronization & Connectivity Intensity L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 299

Slide 312

Slide 312 text

Speedup: Synchronization and Connectivity Bottlenecks

f: parallelizable fraction
f1: connectivity intensity
f2: synchronization intensity

L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 300

Slide 313

Slide 313 text

Speedup: Synchronization & Connectivity Bottlenecks Speedup - affected by sequential-to-parallel data synchronization and inter-core communication. L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 301

Slide 314

Slide 314 text

Partitioning-Sharing Tradeoffs Butler W. Lampson. 1983. "Hints for computer system design." In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48. 302

Slide 315

Slide 315 text

Shared Resource: DRAM Heechul Yun, Renato, Zheng-Pei Wu, Rodolfo Pellizzoni. "PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms," IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), 2014. https://github.com/heechul/palloc 303

Slide 316

Slide 316 text

Shared Resource: MSHRs Heechul Yun, Rodolfo Pellizzon, and Prathap Kumar Valsan. 2015. "Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems." In Proceedings of the 2015 27th Euromicro Conference on Real-Time Systems (ECRTS '15). 304

Slide 317

Slide 317 text

Partitioning Multithreading

• Thread affinity
  • POSIX: sched_getcpu, pthread_setaffinity_np
  • http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/
  • https://github.com/RRZE-HPC/likwid/blob/master/groups/skylake/FALSE_SHARE.txt
  • Local LLC false sharing rate = MEM_LOAD_L3_HIT_RETIRED_XSNP_HITM / MEM_INST_RETIRED_ALL
• NUMA: Remote Memory Accesses (RMA), Local Memory Accesses (LMA), RMA/LMA ratio
  • https://01.org/numatop/
  • https://github.com/01org/numatop 305
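A minimal sketch of pinning the calling thread to a given core with the POSIX calls named above (Linux/glibc-specific; error handling mostly elided; build with g++ on Linux):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to a single CPU; returns 0 on success.
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main()
{
    if (pin_to_cpu(0) == 0) // the scheduler now keeps this thread on CPU 0
        std::printf("running on CPU %d\n", sched_getcpu());
}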

Slide 318

Slide 318 text

Cache Partitioning: Index-Based & Way-Based Giovani Gracioli, Ahmed Alhammad, Renato Mancuso, Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems." ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 306

Slide 319

Slide 319 text

Cache Partitioning: CPU Support Giovani Gracioli, Ahmed Alhammad, Renato Mancuso, Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems." ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 307

Slide 320

Slide 320 text

Cache Partitioning & Intel: CAT & CMT Cache Monitoring Technology and Cache Allocation Technology https://github.com/01org/intel-cmt-cat A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, “Cache QoS: From concept to reality in the Intel Xeon processor E5-2600 v3 product family,” in Intl. Symp. on High Performance Computer Architecture (HPCA), Mar. 2016. 308

Slide 321

Slide 321 text

Cache Partitioning != Cache Access Timing Isolation H. Yun and P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 309

Slide 322

Slide 322 text

Cache Partitioning != Cache Access Timing Isolation H. Yun and P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 310

Slide 323

Slide 323 text

Cache Partitioning != Cache Access Timing Isolation H. Yun and P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 311

Slide 324

Slide 324 text

Cache Partitioning != Cache Access Timing Isolation https://github.com/CSL-KU/IsolBench Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 312

Slide 325

Slide 325 text

Cache Partitioning != Cache Access Timing Isolation • Shared: MSHRs (Miss information/Status Holding Registers) / LFBs (Line Fill Buffers) • Contention => cache space partitioning != cache access timing isolation Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 313

Slide 326

Slide 326 text

Cache Partitioning != Cache Access Timing Isolation

• multiple MSHRs support multiple outstanding cache misses
• the number of MSHRs determines the MLP of the cache
• local MLP - outstanding misses one core can generate
• global MLP - parallelism of the entire shared memory hierarchy (i.e., shared LLC and DRAM)
• "the aggregated parallelism of the cores (the sum of local MLP) exceeds the parallelism supported by the shared LLC and DRAM (global MLP) in the out-of-order architectures"

Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 314

Slide 327

Slide 327 text

Shared Resource (MSHRs) & Prefetching: Xeon Phi Zhenman Fang, Sanyam Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. "Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking." ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 315

Slide 328

Slide 328 text

Shared Resource (MSHRs) & Prefetching: SNB Zhenman Fang, Sanyam Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. "Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking." ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 316

Slide 329

Slide 329 text

Weighted Speedup

A. Snavely and D. M. Tullsen, "Symbiotic jobscheduling for a simultaneous multithreading processor," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Nov. 2000, pp. 234-244.

S. Eyerman and L. Eeckhout, "Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance," in Computer Architecture Letters, vol. 13, no. 2, 2014. 317

Slide 330

Slide 330 text

The Number of Cycles

Sam Van den Steen, Stijn Eyerman, Sander De Pestel, Moncef Mechri, Trevor E. Carlson, David Black-Schaffer, Erik Hagersten, Lieven Eeckhout, "Analytical Processor Performance and Power Modeling using Micro-Architecture Independent Characteristics," Transactions on Computers (TC) 2016.

C - #cycles
N - #instructions
Deff - effective dispatch rate
mbpred - #branch mispredictions
cres - branch resolution time
cfe - front-end pipeline depth
mILi - #instruction fetch misses at each level i of the cache hierarchy
cLi - access latency of each cache level
ROB - size of the Reorder Buffer
mLLC - #LLC load misses
cmem - memory access time
cbus - memory bus transfer and waiting time
MLP - amount of memory-level parallelism
PhLLC - LLC hit chain penalty 318
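The model equation itself appears to have been an image on the original slide and did not survive extraction. As a hedged reconstruction, assuming the interval-model structure this line of work builds on (my assumption from the listed symbols, not a quote from the paper; the ROB size and the LLC hit chain penalty PhLLC enter through the overlap/MLP terms):

C \approx \frac{N}{D_{\mathrm{eff}}}
    + m_{\mathrm{bpred}} \cdot (c_{\mathrm{res}} + c_{\mathrm{fe}})
    + \sum_{i} m_{IL_i} \cdot c_{L_i}
    + \frac{m_{\mathrm{LLC}}}{\mathrm{MLP}} \cdot (c_{\mathrm{mem}} + c_{\mathrm{bus}})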

Slide 331

Slide 331 text

Roofline Model: Potential "Auto-tuning Performance on Multicore Computers," S. Williams, PhD, 2008. 319

Slide 332

Slide 332 text

Roofline Model: Optimization "Auto-tuning Performance on Multicore Computers," S. Williams, PhD, 2008. 320

Slide 333

Slide 333 text

Cache-aware Roofline model "Cache-aware Roofline model: Upgrading the loft." Aleksandar Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 321

Slide 334

Slide 334 text

Cache-aware Roofline model "Cache-aware Roofline model: Upgrading the loft." Aleksandar Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 322

Slide 335

Slide 335 text

Roofline Model: Microarchitectural Bottlenecks "Extending the Roofline Model: Bottleneck Analysis with Microarchitectural Constraints." Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 323

Slide 336

Slide 336 text

Roofline Model: Microarchitectural Bottlenecks "Extending the Roofline Model: Bottleneck Analysis with Microarchitectural Constraints." Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 324

Slide 337

Slide 337 text

C++ Standards: C++11 & C++14

Atomic Operations & Concurrent Memory Model
http://en.cppreference.com/w/cpp/atomic
http://github.com/MattPD/cpplinks/blob/master/atomics.lockfree.memory_model.md
"The C11 and C++11 Concurrency Model" by Mark John Batty: http://www.cl.cam.ac.uk/~mjb220/thesis/

Move semantics
https://isocpp.org/wiki/faq/cpp11-language#rval
http://thbecker.net/articles/rvalue_references/section_01.html
http://kholdstare.github.io/technical/2013/11/23/moves-demystified.html

scoped_allocator (stateful allocators support)
https://isocpp.org/wiki/faq/cpp11-library#scoped-allocator
http://en.cppreference.com/w/cpp/header/scoped_allocator
https://accu.org/content/conf2012/JonathanWakely-CXX11_allocators.pdf
https://accu.org/content/conf2013/Frank_Birbacher_Allocators.r210article.pdf 325
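As a small illustration of the C++11 atomics referenced above, the classic release/acquire message-passing idiom (a generic textbook example, not code from the talk):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                // plain (non-atomic) data
std::atomic<bool> ready{false}; // publication flag

int main()
{
    std::thread producer([] {
        payload = 42;                                 // write the data...
        ready.store(true, std::memory_order_release); // ...then publish it
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) // acquire pairs with release
            ; // spin until published
        assert(payload == 42); // visible: release/acquire gives happens-before
    });
    producer.join();
    consumer.join();
}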

Slide 338

Slide 338 text

C++ Standards: C++11, C++14, and C++17

reducing the need for conditional compilation via macros and template metaprogramming

constexpr
https://isocpp.org/wiki/faq/cpp11-language#cpp11-constexpr
https://isocpp.org/wiki/faq/cpp14-language#extended-constexpr

if constexpr
http://en.cppreference.com/w/cpp/language/if#Constexpr_If 326
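A brief sketch of if constexpr selecting a branch at compile time (an illustrative example of the C++17 feature linked above):

#include <iostream>
#include <type_traits>

// Only the taken branch is instantiated, so each instantiation
// compiles exactly one of the two bodies - no macros, no SFINAE.
template <typename T>
auto halve(T x)
{
    if constexpr (std::is_floating_point<T>::value)
        return x * 0.5; // floating-point path
    else
        return x / 2;   // integral path
}

int main()
{
    std::cout << halve(3.0) << ' ' << halve(3) << '\n'; // prints: 1.5 1
}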

Slide 339

Slide 339 text

C++17 Standard

std::string_view
http://en.cppreference.com/w/cpp/string/basic_string_view
interoperability with C APIs (e.g., sockets) without extra allocations / copies

std::aligned_alloc (C11)
http://en.cppreference.com/w/cpp/memory/c/aligned_alloc
aligned uninitialized storage allocation (vectorization)

Hardware interference size
http://eel.is/c++draft/hardware.interference
http://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
portable cache line size information (e.g., padding to avoid false sharing)

Extended allocators & polymorphic memory resources
http://en.cppreference.com/w/cpp/memory/polymorphic_allocator
http://stackoverflow.com/questions/38010544/polymorphic-allocator-when-and-why-should-i-use-it
http://boost.org/doc/libs/release/doc/html/container/extended_functionality.html 327
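For instance, a minimal sketch of using the interference-size constant to keep per-thread counters on separate cache lines (assumes a standard library that ships std::hardware_destructive_interference_size; where it is unavailable, a constant such as 64 is the usual fallback):

#include <atomic>
#include <new> // std::hardware_destructive_interference_size (C++17)

// Align each counter to its own "destructive interference" region,
// so writers on different threads do not false-share a cache line.
struct alignas(std::hardware_destructive_interference_size) padded_counter
{
    std::atomic<long> value{0};
};

padded_counter counters[4]; // e.g., one per worker thread

void worker(int id, long iterations)
{
    for (long i = 0; i != iterations; ++i)
        counters[id].value.fetch_add(1, std::memory_order_relaxed);
}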

Slide 340

Slide 340 text

C++ Core Guidelines

P: Philosophy
• P.9: Don't waste time or space.

Per: Performance
• Per.3: Don't optimize something that's not performance critical.
• Per.6: Don't make claims about performance without measurements.
• Per.7: Design to enable optimization.
• Per.18: Space is time.
• Per.19: Access memory predictably.
• Per.30: Avoid context switches on the critical path.

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-performance
https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#S-performance 328

Slide 341

Slide 341 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? 329

Slide 342

Slide 342 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? 329

Slide 343

Slide 343 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency 329

Slide 344

Slide 344 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence 329

Slide 345

Slide 345 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) 329

Slide 346

Slide 346 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? 329

Slide 347

Slide 347 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty 329

Slide 348

Slide 348 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time 329

Slide 349

Slide 349 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality 329

Slide 350

Slide 350 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead 329

Slide 351

Slide 351 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work 329

Slide 352

Slide 352 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict 329

Slide 353

Slide 353 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict • cmov & tradeoffs: converting control dependencies to data dependencies 329
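As a concrete illustration of the last two points (my example, not from the slides): the branchy loop below is fast when the predicate is predictable and slow when it is near-random, while the branchless form trades the control dependency for a data dependency that compilers typically lower to cmov - immune to misprediction, but adding latency to the dependence chain.

#include <cstddef>
#include <vector>

// Branchy: cost = predictability x penalty; cheap on sorted/biased
// data, expensive when the outcome is close to a coin flip.
long sum_branchy(const std::vector<int> & v, int threshold)
{
    long sum = 0;
    for (std::size_t i = 0, n = v.size(); i != n; ++i)
        if (v[i] < threshold) sum += v[i];
    return sum;
}

// Branchless: control dependency converted to a data dependency;
// compilers typically emit cmov/select - no misprediction penalty,
// but every element now pays the select latency.
long sum_branchless(const std::vector<int> & v, int threshold)
{
    long sum = 0;
    for (std::size_t i = 0, n = v.size(); i != n; ++i)
        sum += (v[i] < threshold) ? v[i] : 0;
    return sum;
}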

Slide 354

Slide 354 text

Takeaways

Principles

Data structures & data layout - fundamental part of design

CPUs & pervasive forms of parallelism
• can support each other: PLP, ILP (MLP!), TLP, DLP

Balanced design vs. bottlenecks

Overlapping latencies

Sharing - contention - interference - slowdown

Yale Patt's Phase 2: Break the layers:
• break through the hardware/software interface
• harness all levels of the transformation hierarchy 330

Slide 355

Slide 355 text

Phase 2: Harnessing the Transformation Hierarchy Yale N. Patt, Microprocessor Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 331

Slide 356

Slide 356 text

Break the Layers Yale N. Patt, Microprocessor Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 332

Slide 357

Slide 357 text

Pigeonholing has to go

Yale N. Patt at Yale Patt 75 Visions of the Future Computer Architecture Workshop:
"Are you a software person or a hardware person?"
I'm a person - this pigeonholing has to go
We must break the layers
Abstractions are great - AFTER you understand what's being abstracted

Yale N. Patt, 2013 IEEE CS Harry H. Goode Award Recipient Interview — https://youtu.be/S7wXivUy-tk
Yale N. Patt at Yale Patt 75 Visions of the Future Computer Architecture Workshop — https://youtu.be/x4LH1cJCvxs 333

Slide 358

Slide 358 text

Resources http://www.agner.org/optimize/ https://users.ece.cmu.edu/~omutlu/lecture-videos.html https://github.com/MattPD/cpplinks/ 334

Slide 359

Slide 359 text

Slides https://speakerdeck.com/mattpd 335

Slide 360

Slide 360 text

Thank You! Questions? 336