
Computer Architecture, C++, and High Performance (Meeting C++ 2016)

As the available computational power increases, Nathan Myhrvold's Laws of Software continue to apply: new opportunities enable new applications with increased needs, which subsequently become constrained by the hardware that used to be "modern" at adoption time. C++ itself opens access to high-quality optimizing compilers and a wide ecosystem of high-performance tooling and libraries. At the same time, simply turning on the highest optimization flags and hoping for the best is not going to automagically yield the highest performance -- i.e., the lowest execution time. The reasons are twofold: algorithms' performance can differ in theory -- and that of their implementations can differ even more in practice.

Modern CPU architecture has continued to yield increases in performance through advances in microarchitecture, such as pipelining, multiple issue (superscalar), out-of-order execution, branch prediction, SIMD-within-a-register (SWAR) vector units, and chip multi-processor (CMP, also known as multi-core) architecture. All of these developments provide opportunities for higher peak performance -- while at the same time raising new optimization challenges when actually trying to reach that peak.

In this talk we'll consider the properties of code that can make it either friendly -- or hostile -- to a modern microprocessor. We will offer advice on achieving higher performance, from ways of analyzing it beyond algorithmic complexity, through recognizing the aspects we can entrust to the compiler, to practical optimization of existing code. Instead of stopping at the "you should measure it" advice (which is correct, but incomplete), the talk focuses on practical, hands-on examples of _how_ to actually perform the measurements (presenting tools -- including perf and likwid -- that simplify access to CPU performance monitoring counters) and how to reason about the resulting measurements (informed by an understanding of modern CPU architecture, the generated assembly code, as well as an in-depth look at how the CPU cycles are spent using modern microarchitectural simulation tools) in order to improve the performance of C++ applications.
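As a minimal illustration of the "how to measure" theme, here is a hedged sketch of a repeated-timing harness using only the standard library (the talk itself relies on Nonius, perf, and likwid for the actual measurements; the kernel and sizes below are illustrative assumptions):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Times f() `repeats` times and reports the minimum, usually the least noisy
// summary statistic for a CPU-bound kernel.
template <typename F>
double time_min_seconds(F f, int repeats = 5) {
    double best = 1e300;
    for (int r = 0; r != repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        f();
        const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}

int main() {
    std::vector<double> v(10'000'000, 1.0);
    volatile double sink = 0.0; // keep the compiler from discarding the work
    const double t = time_min_seconds([&] {
        double s = 0.0;
        for (double x : v) s += x;
        sink = s;
    });
    std::printf("min time: %g s\n", t);
}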

Resources:
https://github.com/MattPD/cpplinks/
https://github.com/MattPD/cpplinks/blob/master/assembly.x86.md
https://github.com/MattPD/cpplinks/blob/master/comparch.md
https://github.com/MattPD/cpplinks/blob/master/performance.tools.md

References:
http://www.agner.org/optimize/
https://users.ece.cmu.edu/~omutlu/lecture-videos.html

Matt P. Dziubinski

November 19, 2016
Transcript

  1. Computer Architecture, C++, and High Performance Matt P. Dziubinski Meeting

    C++ 2016 [email protected] // @matt_dz Department of Mathematical Sciences, Aalborg University CREATES (Center for Research in Econometric Analysis of Time Series)
  2. Outline • Performance • Why do we care? • What

    is it? • How to • measure it - reason about it - improve it? 2
  3. Costs and Curves Moore, Gordon E. (1965). ”Cramming more components

    onto integrated circuits”. Electronics Magazine. 4
  4. Cramming more components onto integrated circuits Moore, Gordon E. (1965).

    ”Cramming more components onto integrated circuits”. Electronics Magazine. 5
  5. Transformation Hierarchy Yale N. Patt, Microprocessor Performance, Phase 2: Can

    We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 7
  6. Phase I & The Walls Yale N. Patt, Microprocessor Performance,

    Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 8
  7. CPU Performance Trends [chart: growth in single-processor performance relative to the VAX-11/780,

    with 25%/year, 52%/year, and 22%/year growth phases] Hennessy, John L.; Patterson, David A., 2011, ”Computer Architecture: A Quantitative Approach,” Morgan Kaufmann. 9
  8. Processor-Memory Performance Gap [chart: processor vs. memory performance, 1980–2010]

    The difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access. Hennessy, John L.; Patterson, David A., 2011, ”Computer Architecture: A Quantitative Approach,” Morgan Kaufmann. Computer Architecture is Back: Parallel Computing Landscape https://www.youtube.com/watch?v=On-k-E5HpcQ 11
  9. DRAM Performance Trends D. Lee: ”Reducing DRAM Latency at Low

    Cost by Exploiting Heterogeneity.” http://arxiv.org/abs/1604.08041 (2016) D. Lee et al., ”Tiered-latency DRAM: A low latency and low cost DRAM architecture,” in HPCA, 2013. 12
  10. Emerging Memory Technologies - Further Down The Hierarchy Qureshi et

    al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. 13
  11. Feature Scaling Trends Lee, Yunsup, ”Decoupled Vector-Fetch Architecture with a

    Scalarizing Compiler,” EECS Department, University of California, Berkeley. 2016. http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-82.html 14
  12. Process-Architecture-Optimization Intel’s Annual Report on Form 10-K for the fiscal

    year ended December 26, 2015, filed with the SEC on February 12, 2016. https://www.sec.gov/Archives/edgar/data/50863/000005086316000105/a10kdocument12262015q4.htm 15
  13. Make it fast Butler W. Lampson. 1983. ”Hints for computer

    system design.” In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP ’83). ACM, New York, NY, USA, 33-48. 16
  14. Performance: The Early Days A. Greenbaum and T. Chartier. ”Numerical

    Methods: Design, analysis, and computer implementation of algorithms.” 2010. Course Notes for Short Course on Numerical Analysis. 18
  15. Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), ”On

    the computational complexity of algorithms”, Transactions of the American Mathematical Society 117: 285–306. 19
  16. Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), ”On

    the computational complexity of algorithms”, Transactions of the American Mathematical Society 117: 285–306. 20
  17. Analysis of Algorithms - Scientific Method Robert Sedgewick and Kevin

    Wayne, ”Algorithms,” 4th Edition, Addison-Wesley Professional, 2011. 21
  18. Analysis of Algorithms - Problem Size N vs. Running Time

    T(N) Robert Sedgewick and Kevin Wayne, ”Algorithms,” 4th Edition, Addison-Wesley Professional, 2011. 22
  19. Analysis of Algorithms - Tilde Notation & Tilde Approximations Robert

    Sedgewick and Kevin Wayne, ”Algorithms,” 4th Edition, Addison-Wesley Professional, 2011. 23
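For reference, the standard definition of the tilde notation as used by Sedgewick & Wayne (stated here rather than quoted from the slide):

    f(N) \sim g(N) \iff \lim_{N \to \infty} \frac{f(N)}{g(N)} = 1, \qquad \text{e.g.} \quad \frac{N(N-1)}{2} \sim \frac{N^2}{2}.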
  20. Analysis of Algorithms - Doubling Ratio Experiments Robert Sedgewick and

    Kevin Wayne, ”Algorithms,” 4th Edition, Addison-Wesley Professional, 2011. 24
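A minimal sketch of a doubling-ratio experiment: assuming a power-law running time T(N) ≈ a·N^b, the ratio T(2N)/T(N) approaches 2^b, so b can be estimated as log2 of the ratio. The kernel below is an illustrative stand-in (an O(N) summation, so the ratio should approach 2), not code from the talk:

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

double work(std::size_t n) {
    std::vector<double> v(n, 1.0);
    return std::accumulate(v.begin(), v.end(), 0.0);
}

double time_seconds(std::size_t n) {
    const auto start = std::chrono::steady_clock::now();
    volatile double sink = work(n);
    (void)sink;
    const std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
    return d.count();
}

int main() {
    double previous = time_seconds(std::size_t(1) << 20);
    for (std::size_t n = std::size_t(1) << 21; n <= (std::size_t(1) << 26); n *= 2) {
        const double current = time_seconds(n);
        std::printf("N = %zu  T(N)/T(N/2) = %.2f  estimated exponent b = %.2f\n",
                    n, current / previous, std::log2(current / previous));
        previous = current;
    }
}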
  21. Find Example - Benchmark (Nonius) Code I #include <algorithm> #include

    <cstddef> #include <cstdint> #include <cstdio> #include <iterator> #include <random> #include <set> #include <vector> #include <boost/container/flat_set.hpp> #include <EASTL/vector_set.h> #include <nonius/nonius.h++> #include <nonius/main.h++> NONIUS_PARAM(size, std::size_t{100u}) NONIUS_PARAM(queries, std::size_t{10u}) 26
  22. Find Example - Benchmark (Nonius) Code II // EASTL //

    https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project void* operator new[](size_t size, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } void* operator new[](size_t size, size_t alignment, size_t alignmentOffset, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } using T = std::uint32_t; std::vector<T> odd_numbers(std::size_t count) { std::vector<T> result; result.reserve(count); for (std::size_t i = 0; i != count; i++) result.push_back(2 * i + 1); return result; } 27
  23. Find Example - Benchmark (Nonius) Code III template <typename container_type>

    T ctor_and_find(const char * type_name, const std::vector<T> & v, std::size_t q) { std::mt19937 prng(1); const std::size_t n = v.size(); std::uniform_int_distribution<T> uniform(0, 2 * n + 2); const container_type s(begin(v), end(v)); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto it = s.find(uniform(prng)); sum += (it != end(s)) ? *it : 0; } return sum; } 28
  24. Find Example - Benchmark (Nonius) Code IV T ctor_and_find(const char

    * type_name, const std::vector<T> & v_src, std::size_t q) { std::mt19937 prng(1); const std::size_t n = v_src.size(); std::uniform_int_distribution<T> uniform(0, 2*n + 2); auto v = v_src; std::sort(begin(v), end(v)); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto k = uniform(prng); const auto it = std::lower_bound(begin(v), end(v), k); sum += (it != end(v)) ? (*it == k ? k : 0) : 0; } return sum; } 29
  25. Find Example - Benchmark (Nonius) Code V NONIUS_BENCHMARK("std::set", [](nonius::chronometer meter)

    { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<std::set<T>>("std::set", v, q); }); }); NONIUS_BENCHMARK("std::vector: copy & sort", [](nonius::chronometer meter) { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find("std::vector: copy & sort", v, q); }); }); 30
  26. Find Example - Benchmark (Nonius) Code VI NONIUS_BENCHMARK("boost::container::flat_set", [](nonius::chronometer meter) {

    const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q); }); }); NONIUS_BENCHMARK("eastl::vector_set", [](nonius::chronometer meter) { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q); }); }); int main(int argc, char * argv[]) { nonius::main(argc, argv); } 31
  27. Find Example - Benchmark (Nonius) Code I Nonius: statistics-powered micro-benchmarking

    framework: https://nonius.io/ https://github.com/libnonius/nonius Running: BNSIZE=10000; BNQUERIES=1000 ./find --param=size:$BNSIZE --param=queries:$BNQUERIES > results.size=$BNSIZE.queries=$BNQUERIES.txt ./find --param=size:$BNSIZE --param=queries:$BNQUERIES --reporter=html --output=results.size=$BNSIZE.queries=$BNQUERIES.html 32
  28. Asymptotic growth & "random access machines"? Tomasz Jurkiewicz and Kurt

    Mehlhorn. 2015. ”On a Model of Virtual Address Translation.” J. Exp. Algorithmics 19. http://arxiv.org/abs/1212.0703 & https://people.mpi-inf.mpg.de/~mehlhorn/ftp/KMvat.pdf 36
  29. Asymptotic growth & "random access machines"? Asymptotic - growing problem

    size • for large data need to take into account the costs of actually bringing it in • communication complexity vs. computation complexity • including overlapping computation-communication latencies 37
  30. "Operation"? Jack Dongarra. 2016. ”With Extreme Scale Computing the Rules

    Have Changed.” In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’16). 38
  31. "Operation"? Jack Dongarra. 2016. ”With Extreme Scale Computing the Rules

    Have Changed.” In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’16). 39
  32. Complexity - constants, microarchitecture? ”Array Layouts for Comparison-Based Searching” Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • ”With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search 40
  33. Complexity - constants, microarchitecture? ”Array Layouts for Comparison-Based Searching” Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • ”With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search • (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees). 40
  34. Complexity - constants, microarchitecture? ”Array Layouts for Comparison-Based Searching” Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • ”With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search • (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees). • It was only through careful and controlled experimentation with different implementations of each of the search algorithms that we are able to understand how the interactions between processor features such as pipelining, prefetching, speculative execution, and conditional moves affect the running times of the search algorithms.” 40
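One ingredient behind such results is replacing hard-to-predict branches with conditional moves. A minimal branch-free lower_bound sketch in that spirit (a common formulation, not the authors' exact code):

#include <cstddef>
#include <vector>

// Branch-free binary search: the comparison feeds a conditional select rather than
// a conditional jump, removing a mispredict-prone branch from the hot loop.
std::size_t lower_bound_branchfree(const std::vector<int> & a, int key) {
    if (a.empty()) return 0;
    const int * base = a.data();
    std::size_t n = a.size();
    while (n > 1) {
        const std::size_t half = n / 2;
        base = (base[half] < key) ? base + half : base; // typically compiles to cmov on x86-64
        n -= half;
    }
    return static_cast<std::size_t>(base - a.data()) + (*base < key);
}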
  35. Reasoning about Performance: The Scientific Method Requires - enabled by

    - the knowledge of microarchitectural details. Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi, Chapter 2 ”Methods” from ”Readings in Computer Architecture,” Morgan Kaufmann, 2000. Prefetching benefits evaluation: Disable/enable prefetchers using likwid-features: https://github.com/RRZE-HPC/likwid/wiki/likwid-features Example: https://gist.github.com/MattPD/06e293fb935eaf67ee9c301e70db6975 41
  36. Instruction Level Parallelism & Loop Unrolling - Code I #include

    <cstddef> #include <cstdint> #include <cstdlib> #include <iostream> #include <vector> #include <boost/timer/timer.hpp> 43
  37. Instruction Level Parallelism & Loop Unrolling - Code II using

    T = double; T sum_1(const std::vector<T> & input) { T sum = 0.0; for (std::size_t i = 0, n = input.size(); i != n; ++i) sum += input[i]; return sum; } T sum_2(const std::vector<T> & input) { T sum1 = 0.0, sum2 = 0.0; for (std::size_t i = 0, n = input.size(); i != n; i += 2) { sum1 += input[i]; sum2 += input[i + 1]; } return sum1 + sum2; } 44
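Continuing the same idea (a hedged sketch, not part of the original benchmark): unrolling further with four independent accumulators exposes more instruction-level parallelism across the loop-carried floating-point add chains, assuming input.size() is a multiple of 4:

T sum_4(const std::vector<T> & input) {
    T sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; i += 4) {
        sum1 += input[i];
        sum2 += input[i + 1];
        sum3 += input[i + 2];
        sum4 += input[i + 3];
    }
    return (sum1 + sum2) + (sum3 + sum4);
}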
  38. Instruction Level Parallelism & Loop Unrolling - Code III int

    main(int argc, char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 10000000; const std::size_t f = (argc > 2) ? std::atoll(argv[2]) : 1; std::cout << "n = " << n << '\n'; // iterations count std::cout << "f = " << f << '\n'; // unroll factor const std::vector<T> a(n, T(1)); boost::timer::auto_cpu_timer timer; const T sum = (f == 1) ? sum_1(a) : (f == 2) ? sum_2(a) : 0; std::cout << sum << '\n'; } 45
  39. Instruction Level Parallelism & Loop Unrolling - Results make vector_sums

    CXXFLAGS="-std=c++14 -O2 -march=native" LDLIBS=-lboost_timer $ ./vector_sums 1000000000 1 n = 1000000000 f = 1 1e+09 0.841269s wall, 0.840000s user + 0.010000s system = 0.850000s CPU (101.0%) $ ./vector_sums 1000000000 2 n = 1000000000 f = 2 1e+09 0.466293s wall, 0.460000s user + 0.000000s system = 0.460000s CPU (98.7%) 46
  40. perf Results - sum_1 Performance counter stats for './vector_sums 1000000000

    1': 1675.812457 task-clock (msec) # 0.850 CPUs utilized 34 context-switches # 0.020 K/sec 5 cpu-migrations # 0.003 K/sec 8,953 page-faults # 0.005 M/sec 5,760,418,457 cycles # 3.437 GHz 3,456,046,515 stalled-cycles-frontend # 60.00% frontend cycles id 8,225,763,566 instructions # 1.43 insns per cycle # 0.42 stalled cycles per 2,050,710,005 branches # 1223.711 M/sec 104,331 branch-misses # 0.01% of all branches 1.970909249 seconds time elapsed 48
  41. perf Results - sum_2 Performance counter stats for './vector_sums 1000000000

    2': 1283.910371 task-clock (msec) # 0.835 CPUs utilized 38 context-switches # 0.030 K/sec 3 cpu-migrations # 0.002 K/sec 9,466 page-faults # 0.007 M/sec 4,458,594,733 cycles # 3.473 GHz 2,149,690,303 stalled-cycles-frontend # 48.21% frontend cycles id 6,734,925,029 instructions # 1.51 insns per cycle # 0.32 stalled cycles per 1,552,029,608 branches # 1208.830 M/sec 119,358 branch-misses # 0.01% of all branches 1.537971058 seconds time elapsed 49
  42. Compiler Explorer: sum_2 (x86-64 Assembly) http://gcc.godbolt.org/ now with: embedded view,

    Visual C++, Intel C++ compilers http://xania.org/201611/compiler-explorer-now-supports-embedded-view 53
  43. Intel Architecture Code Analyzer (IACA) #include <iacaMarks.h> T sum_2(const std::vector<T>

    & input) { T sum1 = 0.0, sum2 = 0.0; for (std::size_t i = 0, n = input.size(); i != n; i += 2) { IACA_START sum1 += input[i]; sum2 += input[i + 1]; } IACA_END return sum1 + sum2; } $ g++ -std=c++14 -O2 -march=native vector_sums_2i.cpp -o vector_sums_2i $ iaca -64 -arch IVB -graph ./vector_sums_2i • https://software.intel.com/en-us/articles/intel-architecture-code-analyzer • https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i- use-it • http://kylehegeman.com/blog/2013/12/28/introduction-to-iaca/ 54
  44. Microarchitecture: Sandy Bridge Intel® 64 and IA-32 Architectures Optimization Reference

    Manual https://www-ssl.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html 55
  45. IACA Results - sum_1 $ iaca -64 -arch IVB -graph

    ./vector_sums_1i Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./vector_sums_1i Binary Format - 64Bit Architecture - IVB Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 3.00 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: ------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | ------------------------------------------------------------------------- | Cycles | 1.0 0.0 | 1.0 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.0 | ------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | | 1.0 1.0 | | | | | mov rdx, qword ptr [rdi] | 2 | | 1.0 | | 1.0 1.0 | | | CP | vaddsd xmm0, xmm0, qword ptr [rdx+rax*8] | 1 | 1.0 | | | | | | | add rax, 0x1 | 1 | | | | | | 1.0 | | cmp rax, rcx | 0F | | | | | | | | jnz 0xffffffffffffffe7 Total Num Of Uops: 5 56
  46. IACA Results - sum_2 $ iaca -64 -arch IVB -graph

    ./vector_sums_2i Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./vector_sums_2i Binary Format - 64Bit Architecture - IVB Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 6.00 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: ------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | ------------------------------------------------------------------------- | Cycles | 1.5 0.0 | 3.0 | 1.5 1.5 | 1.5 1.5 | 0.0 | 1.5 | ------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | mov rcx, qword ptr [rdi] | 2 | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | CP | vaddsd xmm0, xmm0, qword ptr [rcx+rax*8] | 1 | 1.0 | | | | | | | add rax, 0x2 | 2 | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | vaddsd xmm1, xmm1, qword ptr [rcx+rdx*1] | 1 | 0.5 | | | | | 0.5 | | add rdx, 0x10 | 1 | | | | | | 1.0 | | cmp rax, rsi | 0F | | | | | | | | jnz 0xffffffffffffffde | 1 | | 1.0 | | | | | CP | vaddsd xmm0, xmm0, xmm1 Total Num Of Uops: 9 57
  47. Parallelism: Work (no. of ops) / Span (critical path length)

    Guy E. Blelloch, ”Programming parallel algorithms”, Communications of the ACM, 1996. 60
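For reference, the standard work/span bounds (stated here rather than quoted from the slide): with total work T_1 and span (critical path length) T_\infty on p processors,

    \text{speedup} = \frac{T_1}{T_p} \le \min\!\left(p, \frac{T_1}{T_\infty}\right), \qquad T_p \le \frac{T_1}{p} + T_\infty \quad \text{(greedy scheduling, Brent's bound)}.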
  48. ILP & Data (In)dependence G. S. Tjaden and M. J.

    Flynn, ‘‘Detection and Parallel Execution of Independent Instructions,’’ IEEE Transactions on Computers, vol. C-19, pp. 889-895, October 1970. 61
  49. ILP vs. Dependencies D. W. Wall, “Limits of instruction-level parallelism,”

    Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 62
  50. ILP, Criticality & Latency Hiding D. W. Wall, “Limits of

    instruction-level parallelism,” Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 63
  51. Empty Issue Slots: Horizontal Waste & Vertical Waste D. M.

    Tullsen, S. J. Eggers and H. M. Levy, ”Simultaneous multithreading: Maximizing on-chip parallelism,” Proceedings, 22nd Annual International Symposium on Computer Architecture, 1995. 64
  52. Wasted Slots: Causes D. M. Tullsen, S. J. Eggers and

    H. M. Levy, ”Simultaneous multithreading: Maximizing on-chip parallelism,” Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, Santa Margherita Ligure, Italy, 1995, pp. 392-403. 65
  53. Wasted Slots: Miss Events Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis,

    and James E. Smith. 2006. ”A performance counter architecture for computing accurate CPI components.” SIGOPS Oper. Syst. Rev. 40, 5 (October 2006), 175-184. 66
  54. likwid Results - sum_1: 489 Scalar MUOPS/s $ likwid-perfctr -C

    S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 1 1e+09 1.090122s wall, 0.880000s user + 0.000000s system = 0.880000s CPU (80.7%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 8002493499 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 4285189526 | | CPU_CLK_UNHALTED_REF | FIXC2 | 3258346806 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 1000155741 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 0 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 2.0456 | | Runtime unhalted [s] | 1.6536 | | Clock [MHz] | 3408.2011 | | CPI | 0.5355 | | MFLOP/s | 488.9303 | | AVX MFLOP/s | 0 | | Packed MUOPS/s | 0 | | Scalar MUOPS/s | 488.9303 | +----------------------+-----------+ 68
  55. likwid Results - sum_2: 595 Scalar MUOPS/s $ likwid-perfctr -C

    S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 2 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 2 1e+09 0.620421s wall, 0.470000s user + 0.000000s system = 0.470000s CPU (75.8%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 6502566958 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2948446599 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2223894218 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 1000328727 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 0 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 1.6809 | | Runtime unhalted [s] | 1.1377 | | Clock [MHz] | 3435.8987 | | CPI | 0.4534 | | MFLOP/s | 595.1079 | | AVX MFLOP/s | 0 | | Packed MUOPS/s | 0 | | Scalar MUOPS/s | 595.1079 | +----------------------+-----------+ 69
  56. likwid Results: sum_vectorized: 676 AVX MFLOP/s g++ -std=c++14 -O2 -ftree-vectorize

    -ffast-math -march=native -lboost_timer vector_sums.cpp -o vector_sums_vf $ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums_vf 1000000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 1 1e+09 0.561288s wall, 0.390000s user + 0.000000s system = 0.390000s CPU (69.5%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 3002491149 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2709364345 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2043804906 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 91 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 260258099 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 1.5390 | | Runtime unhalted [s] | 1.0454 | | Clock [MHz] | 3435.5297 | | CPI | 0.9024 | | MFLOP/s | 676.4420 | | AVX MFLOP/s | 676.4420 | | Packed MUOPS/s | 169.1105 | | Scalar MUOPS/s | 0.0001 | +----------------------+-----------+ 70
  57. Intel PCM (Performance Counter Monitor) Intel PCM (Performance Counter Monitor)

    https://software.intel.com/en-us/articles/intel-performance-counter-monitor/ Cross-platform: FreeBSD, Linux, Mac OS X, Windows C++ Performance Counters API https://software.intel.com/en-us/articles/intel-performance-counter-monitor/#calling_pcm PCM-core utility - similar to perf: pcm-core -e cpu/umask=0x01,event=0x05,name=MISALIGN_MEM_REF.LOADS 71
  58. Performance: CPI Steven K. Przybylski, ”Cache and Memory Hierarchy Design

    – A Performance-Directed Approach,” San Francisco, Morgan Kaufmann, 1990. 72
  59. Performance: [YMMV]PI - Power Grochowski, E., Ronen, R., Shen, J.,

    & Wang, H. (2004). ”Best of Both Latency and Throughput.” Proceedings of the IEEE International Conference on Computer Design. 73
  60. Performance: [YMMV]PI - Graphs Scott Beamer, Krste Asanović, and David

    A. Patterson. ”GAIL: The Graph Algorithm Iron Law.” Workshop on Irregular Applications: Architectures and Algorithms (IA 3), at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015. 74
  61. Performance: [YMMV]PI - Packets packet_processing_times = seconds/packet = instructions/packet *

    clock_cycles/instruction * seconds/clock_cycle = clock_cycles/packet * seconds/clock_cycle = CPP / core_frequency cycles per packet (CPP) http://blogs.cisco.com/sp/a-bigger-helping-of-internet-please 75
  62. Performance: separable components of a CPI CPI = (Infinite-cache CPI)

    + finite-cache effect (FCE) Infinite-cache CPI = execute busy (EBusy) + execute idle (EIdle) FCE = (cycles per miss) × (misses per instruction) = (miss penalty) × (miss rate) P. G. Emma. ”Understanding some simple processor-performance limits.” IBM Journal of Research and Development, 41(3):215–232, May 1997. 76
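As a worked example with illustrative (made-up) numbers: EBusy = 0.4, EIdle = 0.1, a miss rate of 0.02 misses/instruction, and a 200-cycle miss penalty give

    \text{CPI} = \underbrace{(0.4 + 0.1)}_{\text{infinite-cache CPI}} + \underbrace{0.02 \times 200}_{\text{FCE}} = 0.5 + 4 = 4.5.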
  63. Pervasive CPU Parallelism pipeline-level parallelism (PLP) instruction-level parallelism (ILP) memory-level

    parallelism (MLP) data-level parallelism (DLP) thread-level parallelism (TLP) 77
  64. The Cache Liptay, J. S. (1968) ”Structural Aspects of the

    System/360 Model 85, Part II: The Cache,” IBM System Journal, 7(1). 78
  65. The Cache: Processor-Memory Performance Gap Liptay, J. S. (1968) ”Structural

    Aspects of the System/360 Model 85, Part II: The Cache,” IBM System Journal, 7(1). 79
  66. The Cache: Assumptions & Effectiveness Liptay, J. S. (1968) ”Structural

    Aspects of the System/360 Model 85, Part II: The Cache,” IBM System Journal, 7(1). 80
  67. The Curse of Multiple Granularities Seshadri, V. (2016). ”Simple DRAM

    and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems.” CoRR, abs/1605.06483. 81
  68. Word Granularity != Cache Line Granularity Seshadri, V. (2016). ”Simple

    DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems.” CoRR, abs/1605.06483. 82
  69. Shortcomings of Strided Access Patterns Seshadri, V. (2016). ”Simple DRAM

    and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems.” CoRR, abs/1605.06483. 83
  70. Pointer Chasing Example - Linked List - C++ #include <algorithm>

    #include <forward_list> #include <iterator> bool found(const std::forward_list<int> & list, int value) { return find(begin(list), end(list), value) != end(list); } int main() { std::forward_list<int> list {11, 22, 33, 44, 55}; return found(list, 42); } 87
  71. Pointer Chasing Example - Linked List - CFG (r2) radiff2

    -g sym.found forward_list_app forward_list_app > forward_list_found.dot xdot forward_list_found.dot dot -Tpng -o forward_list_found.png forward_list_found.dot 91
  72. Isolated & Clustered Cache Misses Miquel Moreto, Francisco J. Cazorla,

    Alex Ramirez, and Mateo Valero. 2008. ”MLP-aware dynamic cache partitioning.” In Proceedings of the 3rd international conference on High performance embedded architectures and compilers (HiPEAC’08). 92
  73. Cache Miss Cost & Miss Clustering Thomas R. Puzak, A.

    Hartstein, P. G. Emma, V. Srinivasan, and Jim Mitchell. 2007. ”An analysis of the effects of miss clustering on the cost of a cache miss.” In Proceedings of the 4th international conference on Computing frontiers (CF ’07). ACM, New York, NY, USA, 3-12. 93
  74. Cache Miss Penalty: Different STC due to different MLP MLP

    (memory-level parallelism) & STC (stall-time criticality) R. Das, O. Mutlu, T. Moscibroda and C. R. Das, ”Application-aware prioritization mechanisms for on-chip networks,” 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), New York, NY, 2009, pp. 280-291. 94
  75. Skip Lists William Pugh. 1990. ”Skip lists: a probabilistic alternative

    to balanced trees.” Commun. ACM 33, 6, 668-676. 95
  76. Jump Pointers S. Chen, P. B. Gibbons, and T. C.

    Mowry. “Improving Index Performance through Prefetching.” In Proc. of the 20th Annual ACM SIGMOD International Conference on Management of Data, 2001. 96
  77. Prefetching Aggressiveness: Distance & Degree Sparsh Mittal. 2016. ”A Survey

    of Recent Prefetching Techniques for Processor Caches.” ACM Comput. Surv. 49, 2, Article 35. 97
  78. Prefetching Timeliness Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012.

    ”When Prefetching Works, When It Doesn’t, and Why.” ACM Trans. Archit. Code Optim. 9, 1, Article 2. 98
  79. Prefetches Classification Huaiyu Zhu, Yong Chen, and Xian-He Sun. 2010.

    ”Timing local streams: improving timeliness in data prefetching.” In Proceedings of the 24th ACM International Conference on Supercomputing (ICS ’10). ACM, New York, NY, USA, 169-178. 99
  80. Prefetching I #include <algorithm> #include <chrono> #include <cinttypes> #include <cstddef>

    #include <cmath> #include <cstdio> #include <cstdlib> #include <future> #include <iterator> #include <memory> #include <random> #include <vector> struct point { double x, y, z; }; using T = point; 100
  81. Prefetching II struct timing_result { double duration_initial; double duration_non_prefetched; double

    duration_degree; double sum_initial; double sum_non_prefetched; double sum_degree; }; timing_result chase(std::size_t n, bool shuffle, std::size_t d, bool prefetch) { timing_result chase_result; std::vector<std::unique_ptr<T>> v; for (std::size_t i = 0; i != n; ++i) { v.emplace_back(new point{1. * i, 2. * i, 5.* i}); } if (shuffle) { std::mt19937 g(1); 101
  82. Prefetching III std::shuffle(begin(v), end(v), g); } double sum = 0.0;

    auto time_start = std::chrono::steady_clock::now(); if (prefetch) { for (std::size_t i = 0; i != n; ++i) { __builtin_prefetch(v[std::min(i + d, n - 1)].get()); sum += std::exp(-v[i]->y); } } else { for (std::size_t i = 0; i != n; ++i) { sum += std::exp(-v[i]->y); } } auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; chase_result.duration_initial = duration.count(); chase_result.sum_initial = sum; 102
  83. Prefetching IV sum = 0.0; time_start = std::chrono::steady_clock::now(); for (std::size_t

    i = 0; i != n; ++i) { sum += std::exp(-v[i]->y); } time_end = std::chrono::steady_clock::now(); duration = time_end - time_start; chase_result.duration_non_prefetched = duration.count(); chase_result.sum_non_prefetched = sum; sum = 0.0; time_start = std::chrono::steady_clock::now(); for (std::size_t i = 0; i != n; ++i) { __builtin_prefetch(v[std::min(i + d, n - 1)].get()); __builtin_prefetch(v[std::min(i + 2*d, n - 1)].get()); sum += std::exp(-v[i]->y); } time_end = std::chrono::steady_clock::now(); 103
  84. Prefetching V duration = time_end - time_start; chase_result.duration_degree = duration.count();

    chase_result.sum_degree = sum; return chase_result; } int main(int argc, char * argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100; const bool shuffle = (argc > 2) ? std::atoi(argv[2]) : false; const std::size_t d = (argc > 3) ? std::atoll(argv[3]) : 3; const bool prefetch = (argc > 4) ? std::atoi(argv[4]) : false; const std::size_t threads_count = (argc > 5) ? std::atoll(argv[5]) : 4; printf("size: %zu \n", n); printf("shuffle: %d \n", shuffle); printf("distance: %zu \n", d); printf("prefetch: %d \n", prefetch); 104
  85. Prefetching VI printf("threads_count: %zu \n", threads_count); const auto thread_work =

    [n, shuffle, d, prefetch]() { return chase(n, shuffle, d, prefetch); }; std::vector<std::future<timing_result>> results; for (std::size_t thread = 0; thread != threads_count; ++thread) results.emplace_back(std::async(std::launch::async, thread_work)); for (auto && future_result : results) if (future_result.valid()) future_result.wait(); std::vector<double> timings_initial, timings_non_prefetched, timings_degree; for (auto && future_result : results) { timing_result chase_result = future_result.get(); timings_initial.push_back(chase_result.duration_initial); 105
  86. Prefetching VII timings_non_prefetched.push_back(chase_result.duration_non_prefetched); timings_degree.push_back(chase_result.duration_degree); } const auto timings_initial_minmax = std::minmax_element(begin(timings_initial),

    end(timings_initial)); const auto timings_non_prefetched_minmax = std::minmax_element(begin(timings_non_prefetched), end(timings_non_prefetched)); const auto timings_degree_minmax = std::minmax_element(begin(timings_degree), end(timings_degree)); printf(prefetch ? "prefetched" : "non-prefetched"); printf(" initial duration: [%g, %g] \n", *timings_initial_minmax.first, *timings_initial_minmax.second); printf("non-prefetched duration: [%g, %g] \n", *timings_non_prefetched_minmax.first, *timings_non_prefetched_minmax.second); printf("degree-two prefetching duration: [%g, %g] \n", *timings_degree_minmax.first, *timings_degree_minmax.second); } 106
  87. Prefetch Overhead S. Van der Wiel and D. Lilja, ”A

    Survey of Data Prefetching Techniques,” Technical Report No. HPPC 96-05, University of Minnesota, October 1996. 107
  88. Prefetching Timings: No Prefetch $ likwid-perfctr -f -C 0-3 -g

    L3 -m ./prefetch 100000 1 0 0 4 distance: 0 prefetch: 0 non-prefetched initial duration: [0.00280393, 0.00289815] non-prefetched duration: [0.00254968, 0.00257311] degree-two prefetching duration: [0.00290615, 0.00296243] Region chase_initial, Group 1: L3 | CPI STAT | 5.8641 | 1.4529 | 1.4744 | 1.4660 | | L3 bandwidth [MBytes/s] STAT | 10733.6308 | 2666.0364 | 2710.9325 | 2683.4077 | Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | L3 miss rate STAT | 0.0584 | 0.0145 | 0.0148 | 0.0146 | | L3 miss ratio STAT | 3.7723 | 0.9117 | 0.9789 | 0.9431 | $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 0 0 4 | Cycles without execution [%] STAT | 228.2316 | 56.8136 | 57.4443 | 57.0579 | | Cycles without execution [%] STAT | 227.0385 | 56.5980 | 57.0024 | 56.7596 | 108
  89. Prefetching Timings: useless 0-distance prefetch (overhead) $ likwid-perfctr -f -C

    0-3 -g L3 -m ./prefetch 100000 1 0 1 4 distance: 0 prefetch: 1 prefetched initial duration: [0.00288751, 0.00295978] non-prefetched duration: [0.0025575, 0.00258342] degree-two prefetching duration: [0.00285772, 0.00287839] Region chase_initial, Group 1: L3 | CPI STAT | 5.7454 | 1.4345 | 1.4387 | 1.4364 | | L3 bandwidth [MBytes/s] STAT | 10518.6383 | 2618.5405 | 2645.6096 | 2629.6596 | 109
  90. Prefetching Timings: 1-distance prefetch (mostly overhead) $ likwid-perfctr -f -C

    0-3 -g L3CACHE -m ./prefetch 100000 1 1 1 4 prefetched initial duration: [0.00250957, 0.00257662] non-prefetched duration: [0.00255286, 0.00258417] degree-two prefetching duration: [0.00230482, 0.00235828] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | CPI STAT | 4.9595 | 1.2343 | 1.2433 | 1.2399 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.0889 | 0.4381 | 0.6454 | 0.5222 | $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 1 1 4 | Cycles without execution [%] STAT | 214.1614 | 53.4628 | 53.6716 | 53.5404 | | Cycles without execution [%] STAT | 200.4785 | 50.0405 | 50.1857 | 50.1196 | Formulas: L3 request rate = MEM_LOAD_UOPS_RETIRED_L3_ALL/UOPS_RETIRED_ALL L3 miss rate = MEM_LOAD_UOPS_RETIRED_L3_MISS/UOPS_RETIRED_ALL L3 miss ratio = MEM_LOAD_UOPS_RETIRED_L3_MISS/MEM_LOAD_UOPS_RETIRED_L3_ALL https://github.com/RRZE-HPC/likwid/blob/master/groups/ivybridge/L3CACHE.txt 110
  91. Prefetching Timings: 2-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3CACHE -m ./prefetch 100000 1 2 1 4 size: 100000 shuffle: 1 distance: 2 prefetch: 1 threads_count: 4 prefetched initial duration: [0.0023392, 0.00241287] non-prefetched duration: [0.00257006, 0.00260938] degree-two prefetching duration: [0.00199431, 0.00203528] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | CPI STAT | 4.5557 | 1.1331 | 1.1423 | 1.1389 | | L3 request rate STAT | 0.0006 | 0.0001 | 0.0002 | 0.0002 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.2317 | 0.3138 | 0.6791 | 0.5579 | Region chase_degree, Group 1: L3CACHE | CPI STAT | 3.6990 | 0.9243 | 0.9253 | 0.9248 | | L3 request rate STAT | 0.0005 | 0.0001 | 0.0002 | 0.0001 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.0145 | 0.3597 | 0.6550 | 0.5036 | 111
  92. Prefetching Timings: 8-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3CACHE -m ./prefetch 100000 1 8 1 4 size: 100000 shuffle: 1 distance: 8 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00181161, 0.00188783] non-prefetched duration: [0.00257601, 0.0026076] degree-two prefetching duration: [0.00152468, 0.00156814] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | Runtime (RDTSC) [s] STAT | 0.0065 | 0.0016 | 0.0017 | 0.0016 | | CPI STAT | 3.4808 | 0.8650 | 0.8788 | 0.8702 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.2431 | 0.4694 | 0.6640 | 0.5608 | Region chase_degree, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | Runtime (RDTSC) [s] STAT | 0.0053 | 0.0013 | 0.0014 | 0.0013 | | CPI STAT | 2.7450 | 0.6832 | 0.6882 | 0.6863 | | L3 miss rate STAT | 0.0016 | 0.0004 | 0.0004 | 0.0004 | | L3 miss ratio STAT | 3.4045 | 0.7778 | 0.9346 | 0.8511 | 112
  93. Prefetching Timings: 8-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3 -m ./prefetch 100000 1 8 1 4 size: 100000 shuffle: 1 distance: 8 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00180738, 0.00189831] non-prefetched duration: [0.00254486, 0.00258013] degree-two prefetching duration: [0.00154542, 0.00158065] Region chase_initial, Group 1: L3 | CPI STAT | 3.5027 | 0.8668 | 0.8835 | 0.8757 | L3 bandwidth [MBytes/s] STAT | 17384.8731 | 4296.5905 | 4381.7164 | 4346.2183 Region chase_degree, Group 1: L3 | Metric | Sum | Min | Max | | CPI STAT | 2.7626 | 0.6894 | 0.6919 | 0 | L3 bandwidth [MBytes/s] STAT | 21505.6670 | 5333.6653 | 5396.4473 | 53 $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 8 1 4 | Cycles without execution [%] STAT | 187.6689 | 46.3938 | 47.3055 | 46.91 | Cycles without execution [%] STAT | 151.5095 | 37.6872 | 38.0656 | 37.87 113
  94. Prefetching Timings: suboptimal (untimely) prefetch $ likwid-perfctr -f -C 0-3

    -g L3 -m ./prefetch 100000 1 512 1 4 size: 100000 shuffle: 1 distance: 512 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00177956, 0.00186644] non-prefetched duration: [0.00257188, 0.0026064] degree-two prefetching duration: [0.00173249, 0.00178712] Region chase_initial, Group 1: L3 | CPI STAT | 3.4343 | 0.8523 | 0.8683 | 0.8586 | | L3 data volume [GBytes] STAT | 0.0293 | 0.0073 | 0.0074 | 0.0073 | Region chase_degree, Group 1: L3 | Metric | Sum | Min | Max | Avg | | CPI STAT | 3.1891 | 0.7903 | 0.8034 | 0.7973 | | L3 bandwidth [MBytes/s] STAT | 19902.4764 | 4954.4107 | 5013.4006 | 4975.6191 | 114
  95. Gem5 - std::vector & std::list I Filling with numbers -

    std::vector vs. std::list Machine code & assembly (std::vector) Micro-ops execution breakdown (std::vector) Assembly is Too High Level: http://xlogicx.net/?p=369 116
  96. Gem5 - std::vector & std::list III Pipeline diagram - one

    iteration (std::vector) Pipeline diagram - three iterations (std::vector) 118
  97. Gem5 - std::vector & std::list IV Machine code & assembly

    (std::list) heap allocation in the loop @ 400d85 what could possibly go wrong? 119
  98. (The GNU C library's) malloc https://sourceware.org/glibc/wiki/MallocInternals Arena A structure that

    is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are ”free”. Threads assigned to each arena will allocate memory from that arena’s free lists. Glibc Heap Analysis in Linux Systems with Radare2 https://youtube.com/watch?v=Svm5V4leEho r2con-2016 - rada.re/con/ 124
  99. malloc & free - new, new[], delete, delete[] int main()

    { double * a = new double[8]; double * b = new double[8]; delete[] b; delete[] a; double * c = new double[8]; delete[] c; } 125
  100. Memory Access Patterns: Temporal & Spatial Locality horizontal axis -

    time vertical axis - address D. J. Hatfield and J. Gerald. ”Program restructuring for virtual memory.” IBM Systems Journal, 10(3):168–192, 1971. 132
  101. Loop Fusion 0.429504s (unfused) down to 0.287501s (fused) g++ -Ofast

    -march=native (5.2.0) void unfused(double * a, double * b, double * c, double * d, size_t N) { for (size_t i = 0; i != N; ++i) a[i] = b[i] * c[i]; for (size_t i = 0; i != N; ++i) d[i] = a[i] * c[i]; } void fused(double * a, double * b, double * c, double * d, size_t N) { for (size_t i = 0; i != N; ++i) { a[i] = b[i] * c[i]; d[i] = a[i] * c[i]; } } 133
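A minimal driver along these lines (a hedged sketch: the array size, fill values, and timing harness are assumptions, not the setup behind the quoted numbers; the unfused/fused functions are the ones defined on the slide above):

#include <cstddef>
#include <vector>
#include <boost/timer/timer.hpp>

// Defined on the slide above; declared here so the driver is self-contained to read.
void unfused(double * a, double * b, double * c, double * d, std::size_t N);
void fused(double * a, double * b, double * c, double * d, std::size_t N);

int main() {
    const std::size_t N = 10'000'000; // assumed size; the slide does not state the one used
    std::vector<double> a(N), b(N, 2.0), c(N, 3.0), d(N);
    {
        boost::timer::auto_cpu_timer t; // times the unfused version
        unfused(a.data(), b.data(), c.data(), d.data(), N);
    }
    {
        boost::timer::auto_cpu_timer t; // times the fused version
        fused(a.data(), b.data(), c.data(), d.data(), N);
    }
}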
  102. Pin - A Dynamic Binary Instrumentation Tool http://www.intel.com/software/pintool pin -t

    $PIN_ROOT/source/tools/ManualExamples/obj-intel64/pinatrace.so -- ./loop_fusion . . . 0x400e43,R,0x401c48 0x400e59,R,0x401d40 0x400e65,W,0x1c789c0 0x400e65,W,0x1c789e0 . . . r-project.org rstudio.com ggplot2.org rcpp.org 134
  103. Takeaway: Overlapping Latencies as a General Principle Overlapping latencies also

    works on a ”macro” scale • load as ”get the data from the Internet” • compute as ”process the data” Another example: Communication Avoiding and Overlapping for Numerical Linear Algebra • https://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-65.html • http://www.cs.berkeley.edu/~egeor/sc12_slides_final.pdf 141
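A hedged sketch of the same principle at the macro scale, overlapping the next "load" with the current "compute" via std::async (the fetch_batch/process functions are illustrative stand-ins, not code from the talk):

#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for "get the data from the Internet": here it just materializes a batch.
std::vector<double> fetch_batch(int id) {
    return std::vector<double>(1000000, 1.0 * id);
}

// Stand-in for "process the data".
double process(const std::vector<double> & batch) {
    return std::accumulate(batch.begin(), batch.end(), 0.0);
}

int main() {
    const int batches = 8;
    double total = 0.0;
    auto next = std::async(std::launch::async, fetch_batch, 0);
    for (int i = 0; i != batches; ++i) {
        auto batch = next.get(); // wait for the batch fetched in the background
        if (i + 1 != batches)
            next = std::async(std::launch::async, fetch_batch, i + 1); // start the next fetch...
        total += process(batch); // ...and overlap it with the current computation
    }
    std::printf("%g\n", total);
}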
  104. Non-Overlapped Timings id,symbol,count,time 1,AAPL,565449,1.59043 2,AXP,731366,3.43745 3,BA,867366,5.40218 4,CAT,830327,7.08103 5,CSCO,400440,8.49192 6,CVX,687198,9.98761 7,DD,910932,12.2254

    8,DIS,910430,14.058 9,GE,871676,15.8333 10,GS,280604,17.059 11,HD,556611,18.2738 12,IBM,860071,20.3876 13,INTC,559127,21.9856 14,JNJ,724724,25.5534 15,JPM,500473,26.576 16,KO,864903,28.5405 17,MCD,717021,30.087 18,MMM,698996,31.749 19,MRK,733948,33.2642 20,MSFT,475451,34.3134 21,NKE,556344,36.4545 142
  105. Overlapped Timings id,symbol,count,time 1,AAPL,565449,2.00713 2,AXP,731366,2.09158 3,BA,867366,2.13468 4,CAT,830327,2.19194 5,CSCO,400440,2.19197 6,CVX,687198,2.19198 7,DD,910932,2.51895

    8,DIS,910430,2.51898 9,GE,871676,2.51899 10,GS,280604,2.519 11,HD,556611,2.51901 12,IBM,860071,2.51902 13,INTC,559127,2.51902 14,JNJ,724724,2.51903 15,JPM,500473,2.51904 16,KO,864903,2.51905 17,MCD,717021,2.51906 18,MMM,698996,2.51907 19,MRK,733948,2.51908 20,MSFT,475451,2.51908 21,NKE,556344,2.51909 143
  106. Cache Misses, MLP, and STC: Slack R. Das et al.,

    ”Aérgia: Exploiting Packet Latency Slack in On-Chip Networks,” Proc. 37th Ann. Int’l Symp. Computer Architecture (ISCA 10), ACM Press, 2010. 147
  107. Dependent Cache Misses - Non-Overlapped - Serialized A Day in

    the Life of a Cache Miss Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and James E. Smith, ”A Performance Counter Architecture for Computing Accurate CPI Components”, ASPLOS 2006, pp. 175-184. 1. load instruction enters the window (ROB) 2. the load issues from the instruction buffer (RS) 3. the load blocks the ROB head 4. ROB eventually fills 5. dispatch stops, instruction window drains 6. eventually issue and commit stop
  108. Independent Cache Misses in ROB - Overlapped Stijn Eyerman, Lieven

    Eeckhout, Tejas Karkhanis, and James E. Smith, ”A Top-Down Approach to Architecting CPI Component Performance Counters”, IEEE Micro, Special Issue on Top Picks from 2006 Microarchitecture Conferences, Vol 27, No 1, pp. 84-93. 149
  109. Miss-Dependent Mispredicted Branch - Penalties Serialization S. Eyerman, J.E. Smith

    and L. Eeckhout, ”Characterizing the branch misprediction penalty”, Performance Analysis of Systems and Software 2006 IEEE International Symposium on 2006, pp. 48-58. 150
  110. Dependent Cache Misses - Non-Overlapped - Serialized Milad Hashemi, Khubaib,

    Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. ”Accelerating Dependent Cache Misses with an Enhanced Memory Controller.” In ISCA, 2016. 151
  111. Independent Misses Connected by a Pending Cache Hit • MLP

    - supported by non-blocking caches, out-of-order execution • multiple outstanding cache-misses - Miss Status Holding Registers (MSHRs) / Line Fill Buffers (LFBs) • MSHR file entries - merging redundant (same cache line) memory requests Xi E. Chen and Tor M. Aamodt. 2008. ”Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.” In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41). 152
  112. Independent Misses Connected by a Pending Cache Hit Xi E.

    Chen and Tor M. Aamodt. 2011. ”Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.” ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 153
  113. Finite MSHRs => Finite MLP Xi E. Chen and Tor

    M. Aamodt. 2011. ”Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.” ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 154
  114. Multicore Processors & Shared Memory Memory utilization even more important

    - contention for capacity & bandwidth! ”Disaggregated Memory Architectures for Blade Servers,” Kevin Te-Ming Lim, Ph.D. Thesis, The University of Michigan, 2010. 155
  115. Cache Miss Penalty: Leading Edge & Trailing Edge ”The End

    of Scaling? Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node,” Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 156
  116. Cache Miss Penalty: Bandwidth Utilization Impact ”The End of Scaling?

    Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node,” Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 157
  117. Multicore: Sequential / Parallel Execution Model L. Yavits, A. Morad,

    R. Ginosar, The effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 158
  118. Multicore: Amdahl's Law, Strong Scaling ”Reevaluating Amdahl’s Law,” John L.

    Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 159
  119. Multicore: Gustafson's Law, Weak Scaling ”Reevaluating Amdahl’s Law,” John L.

    Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 160
  120. Amdahl's Law Optimistic Assumes perfect parallelism of the parallel portion:

    Only Serial Bottlenecks, No Parallel Bottlenecks Counterpoint: https://blogs.msdn.microsoft.com/ddperf/2009/04/29/parallel-scalability-isnt-childs-play-part-2-amdahls-law-vs-gunthers-law/ 161
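For reference, with parallel fraction p and N processors (standard formulations of the two laws; the optimism noted above lies in assuming the parallel portion incurs no synchronization or communication costs):

    S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}, \qquad S_{\text{Gustafson}}(N) = (1 - p) + pN.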
  121. Multicore: Synchronization, Actual Scaling M. A. Suleman, M. K. Qureshi,

    and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 162
  122. Multicore: Communication, Actual Scaling M. A. Suleman, M. K. Qureshi,

    and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 163
  123. Multicore & DRAM: AoS I #include <cstddef> #include <cstdlib> #include

    <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> struct contract { double K; double T; double P; }; using element = contract; using container = std::vector<element>; 164
  124. Multicore & DRAM: AoS II double sum_if(const container & a,

    const container & b, const std::vector<std::size_t> & index) { double sum = 0.0; for (std::size_t i = 0, n = index.size(); i != n; ++i) { std::size_t j = index[i]; if (a[j].K == b[j].K) sum += a[j].K; } return sum; } template <typename F> double average(F f, std::size_t m) { double average = 0.0; for (std::size_t i = 0; i != m; ++i) average += f() / m; return average; } 165
  125. Multicore & DRAM: AoS III std::vector<std::size_t> index_stream(std::size_t n) { std::vector<std::size_t>

    index; index.reserve(n); for (std::size_t i = 0; i != n; ++i) index.push_back(i); return index; } std::vector<std::size_t> index_random(std::size_t n) { std::vector<std::size_t> index; index.reserve(n); std::random_device rd; static std::mt19937 g(rd()); std::uniform_int_distribution<std::size_t> u(0, n - 1); for (std::size_t i = 0; i != n; ++i) index.push_back(u(g)); return index; } 166
  126. Multicore & DRAM: AoS IV int main(int argc, char *

    argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10; std::cout << "n = " << n << '\n'; std::cout << "m = " << m << '\n'; const std::size_t threads_count = 4; // thread access locality type // 0: none (default); 1: stream; 2: random std::vector<std::size_t> thread_type(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { thread_type[thread] = (argc > 3 + thread) ? std::atoll(argv[3 + thread]) : 0; std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n'; } 167
  127. Multicore & DRAM: AoS V endl(std::cout); std::vector<std::vector<std::size_t>> index(threads_count); for (std::size_t

    thread = 0; thread != threads_count; ++thread) { index[thread].resize(n); if (thread_type[thread] == 1) index[thread] = index_stream(n); else if (thread_type[thread] == 2) index[thread] = index_random(n); } const container v1(n, {1.0, 0.5, 3.0}); const container v2(n, {1.0, 2.0, 1.0}); const auto thread_work = [m, &v1, &v2](const auto & thread_index) { const auto f = [&v1, &v2, &thread_index] { return sum_if(v1, v2, thread_index); }; return average(f, m); }; 168
  128. Multicore & DRAM: AoS VI boost::timer::auto_cpu_timer timer; std::vector<std::future<double>> results; results.reserve(threads_count);

    for (std::size_t thread = 0; thread != threads_count; ++thread) { results.emplace_back(std::async(std::launch::async, [thread, &thread_work, &index] { return thread_work(index[thread]); })); } for (auto && result : results) if (result.valid()) result.wait(); for (auto && result : results) std::cout << result.get() << '\n'; } 169
  129. Multicore & DRAM: AoS Timings 1 thread, sequential access $

    ./DRAM_CMP 10000000 10 1 n = 10000000 m = 10 thread_type[0] = 1 1e+007 0.395408s wall, 0.406250s user + 0.000000s system = 0.406250s CPU (102.7%) 170
  130. Multicore & DRAM: AoS Timings 1 thread, random access $

    ./DRAM_CMP 10000000 10 2 n = 10000000 m = 10 thread_type[0] = 2 1e+007 5.348314s wall, 5.343750s user + 0.000000s system = 5.343750s CPU (99.9%) 171
  131. Multicore & DRAM: AoS Timings 4 threads, sequential access $

    ./DRAM_CMP 10000000 10 1 1 1 1 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 1 1e+007 1e+007 1e+007 1e+007 0.508894s wall, 2.000000s user + 0.000000s system = 2.000000s CPU (393.0%) 172
  132. Multicore & DRAM: AoS Timings 4 threads: 3 sequential access

    + 1 random access $ ./DRAM_CMP 10000000 10 1 1 1 2 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 2 1e+007 1e+007 1e+007 1e+007 5.666049s wall, 7.265625s user + 0.000000s system = 7.265625s CPU (128.2%) 173
  133. Multicore & DRAM: AoS Timings Memory Access Patterns & Multicore:

    Interactions Matter Inter-thread Interference Sharing - Contention - Interference - Slowdown Threads using a shared resource (like on-chip/off-chip interconnects and memory) contend for it, interfering with each other’s progress, resulting in slowdown (and thus negative returns to increased threads count). cf. Thomas Moscibroda and Onur Mutlu, ”Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,” Microsoft Research Technical Report, MSR-TR-2007-15, February 2007. 174
  134. Multicore & DRAM: SoA I #include <cstddef> #include <cstdlib> #include

    <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> // SoA (structure-of-arrays) struct data { std::vector<double> K; std::vector<double> T; std::vector<double> P; }; 175
  135. Multicore & DRAM: SoA II double sum_if(const data & a,

    const data & b, const std::vector<std::size_t> & index) { double sum = 0.0; for (std::size_t i = 0, n = index.size(); i != n; ++i) { std::size_t j = index[i]; if (a.K[j] == b.K[j]) sum += a.K[j]; } return sum; } template <typename F> double average(F f, std::size_t m) { double average = 0.0; for (std::size_t i = 0; i != m; ++i) { average += f() / m; } 176
  136. Multicore & DRAM: SoA III return average; } std::vector<std::size_t> index_stream(std::size_t

    n) { std::vector<std::size_t> index; index.reserve(n); for (std::size_t i = 0; i != n; ++i) index.push_back(i); return index; } std::vector<std::size_t> index_random(std::size_t n) { std::vector<std::size_t> index; index.reserve(n); std::random_device rd; static std::mt19937 g(rd()); std::uniform_int_distribution<std::size_t> u(0, n - 1); 177
  137. Multicore & DRAM: SoA IV for (std::size_t i = 0;

    i != n; ++i) index.push_back(u(g)); return index; } int main(int argc, char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10; std::cout << "n = " << n << '\n'; std::cout << "m = " << m << '\n'; const std::size_t threads_count = 4; // thread access locality type // 0: none (default); 1: stream; 2: random std::vector<std::size_t> thread_type(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { 178
  138. Multicore & DRAM: SoA V thread_type[thread] = (argc > 3

    + thread) ? std::atoll(argv[3 + thread]) : 0; std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n'; } endl(std::cout); std::vector<std::vector<std::size_t>> index(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { index[thread].resize(n); if (thread_type[thread] == 1) index[thread] = index_stream(n); else if (thread_type[thread] == 2) index[thread] = index_random(n); } data v1; v1.K.resize(n, 1.0); v1.T.resize(n, 0.5); v1.P.resize(n, 3.0); 179
  139. Multicore & DRAM: SoA VI data v2; v2.K.resize(n, 1.0); v2.T.resize(n,

    2.0); v2.P.resize(n, 1.0); const auto thread_work = [m, &v1, &v2](const auto & thread_index) { const auto f = [&v1, &v2, &thread_index] { return sum_if(v1, v2, thread_index); }; return average(f, m); }; 180
  140. Multicore & DRAM: SoA VII boost::timer::auto_cpu_timer timer; std::vector<std::future<double>> results; results.reserve(threads_count);

    for (std::size_t thread = 0; thread != threads_count; ++thread) { results.emplace_back(std::async(std::launch::async, [thread, &thread_work, &index] { return thread_work(index[thread]); })); } for (auto && result : results) if (result.valid()) result.wait(); for (auto && result : results) std::cout << result.get() << '\n'; } 181
  141. Multicore & DRAM: SoA Timings 1 thread, sequential access $

    ./DRAM_CMP.SoA 10000000 10 1 n = 10000000 m = 10 thread_type[0] = 1 1e+007 0.211877s wall, 0.203125s user + 0.000000s system = 0.203125s CPU (95.9%) 182
  142. Multicore & DRAM: SoA Timings 1 thread, random access $

    ./DRAM_CMP.SoA 10000000 10 2 n = 10000000 m = 10 thread_type[0] = 2 1e+007 4.534646s wall, 4.546875s user + 0.000000s system = 4.546875s CPU (100.3%) 183
  143. Multicore & DRAM: SoA Timings 4 threads, sequential access $

    ./DRAM_CMP.SoA 10000000 10 1 1 1 1 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 1 1e+007 1e+007 1e+007 1e+007 0.256391s wall, 1.031250s user + 0.000000s system = 1.031250s CPU (402.2%) 184
  144. Multicore & DRAM: SoA Timings 4 threads: 3 sequential access

    + 1 random access $ ./DRAM_CMP.SoA 10000000 10 1 1 1 2 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 2 1e+007 1e+007 1e+007 1e+007 4.581033s wall, 5.265625s user + 0.000000s system = 5.265625s CPU (114.9%) 185
  145. Multicore & DRAM: SoA Timings Better Access Patterns yield Better

    Single-core Performance but also Reduced Interference and thus Better Multi-core Performance 186
  146. Multicore: Arithmetic Intensity L. Yavits, A. Morad, R. Ginosar, The

    effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 187
  147. Multicore: Synchronization & Connectivity Intensity L. Yavits, A. Morad, R.

    Ginosar, The effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 188
  148. Speedup: Synchronization and Connectivity Bottlenecks f: parallelizable fraction f1 :

    connectivity intensity f2 : synchronization intensity L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 189
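For orientation, the baseline bound that the cited model extends is the classical Amdahl's law (the paper's full equation additionally charges the connectivity term f1 and the synchronization term f2, so the achievable speedup saturates earlier; the exact extended form is in the paper and is not reproduced here):

\[ \mathrm{Speedup}(n) \;=\; \frac{1}{(1 - f) + \dfrac{f}{n}} \]

where n is the number of cores and f the parallelizable fraction.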
  149. Speedup: Synchronization & Connectivity Bottlenecks Speedup - affected by sequential-to-parallel

    data synchronization and inter-core communication. L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 190
  150. Pipelining & Temporal Parallelism D. Sima, ”Decisive aspects in the

    evolution of microprocessors”, Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 191
  151. Pipelining: Base N. P. Jouppi and D. W. Wall. 1989.

    ”Available instruction-level parallelism for superscalar and superpipelined machines.” In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 192
  152. Pipelining: Superscalar N. P. Jouppi and D. W. Wall. 1989.

    ”Available instruction-level parallelism for superscalar and superpipelined machines.” In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 193
  153. Pipelining & Branches P. Emma and E. Davidson, ”Characterization of

    Branch and Data Dependencies in Programs for Evaluating Pipeline Performance,” IEEE Trans. Computers C-36, No. 7, 859-875 (July 1987) 194
  154. Branch (Mis)Prediction Example I #include <cmath> #include <cstddef> #include <cstdlib>

    #include <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> double sum1(const std::vector<double> & x, const std::vector<bool> & which) { double sum = 0.0; for (std::size_t i = 0, n = which.size(); i != n; ++i) { sum += which[i] ? std::cos(x[i]) : std::sin(x[i]); } return sum; } 195
  155. Branch (Mis)Prediction Example II double sum2(const std::vector<double> & x, const

    std::vector<bool> & which) { double sum = 0.0; for (std::size_t i = 0, n = which.size(); i != n; ++i) { sum += which[i] ? std::sin(x[i]) : std::cos(x[i]); } return sum; } std::vector<bool> inclusion_random(std::size_t n, double p) { std::vector<bool> which; which.reserve(n); static std::mt19937 g(1); std::bernoulli_distribution decision(p); for (std::size_t i = 0; i != n; ++i) which.push_back(decision(g)); 196
  156. Branch (Mis)Prediction Example III return which; } int main(int argc,

    char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // branch takenness / predictability type // 0: never; 1: always; 2: random const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\n'; // takenness probability // 0.0: never; 1.0: always const double p = (argc > 3) ? std::atof(argv[3]) : 0.5; std::cout << "p = " << p << '\n'; 197
  157. Branch (Mis)Prediction Example IV std::vector<bool> which; if (type == 0)

    which.resize(n, false); else if (type == 1) which.resize(n, true); else if (type == 2) which = inclusion_random(n, p); const std::vector<double> x(n, 1.1); boost::timer::auto_cpu_timer timer; std::cout << sum1(x, which) + sum2(x, which) << '\n'; } 198
  158. Timing: Branch (Mis)Prediction Example $ make BP CXXFLAGS="-std=c++14 -O3 -march=native"

    LDLIBS=-lboost_timer-mt $ ./BP 10000000 0 n = 10000000 type = 0 1.3448e+007 1.190391s wall, 1.187500s user + 0.000000s system = 1.187500s CPU (99.8%) $ ./BP 10000000 1 n = 10000000 type = 1 1.3448e+007 1.172734s wall, 1.156250s user + 0.000000s system = 1.156250s CPU (98.6%) $ ./BP 10000000 2 n = 10000000 type = 2 1.3448e+007 1.296455s wall, 1.296875s user + 0.000000s system = 1.296875s CPU (100.0%) 199
  159. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 0 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 0 1.3448e+07 0.445464s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 2495177597 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 1167613066 | | CPU_CLK_UNHALTED_REF | FIXC2 | 1167632206 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 372952380 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 14796 | +------------------------------+---------+------------+ +----------------------------+--------------+ | Metric | Core 1 | +----------------------------+--------------+ | Runtime (RDTSC) [s] | 0.4586 | | Runtime unhalted [s] | 0.4505 | | Clock [MHz] | 2591.5373 | | CPI | 0.4679 | | Branch rate | 0.1495 | | Branch misprediction rate | 5.929838e-06 | | Branch misprediction ratio | 3.967263e-05 | | Instructions per branch | 6.6903 | +----------------------------+--------------+ 200
  160. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 1 1.3448e+07 0.445354s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 2495177490 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 1167125701 | | CPU_CLK_UNHALTED_REF | FIXC2 | 1167146162 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 372952366 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 14720 | +------------------------------+---------+------------+ +----------------------------+--------------+ | Metric | Core 1 | +----------------------------+--------------+ | Runtime (RDTSC) [s] | 0.4584 | | Runtime unhalted [s] | 0.4504 | | Clock [MHz] | 2591.5345 | | CPI | 0.4678 | | Branch rate | 0.1495 | | Branch misprediction rate | 5.899380e-06 | | Branch misprediction ratio | 3.946885e-05 | | Instructions per branch | 6.6903 | +----------------------------+--------------+ 201
  161. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 2 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 2 1.3448e+07 0.509917s wall, 0.510000s user + 0.000000s system = 0.510000s CPU (100.0%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 3191479747 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2264945099 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2264967068 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 468135649 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 15326586 | +------------------------------+---------+------------+ +----------------------------+-----------+ | Metric | Core 1 | +----------------------------+-----------+ | Runtime (RDTSC) [s] | 0.8822 | | Runtime unhalted [s] | 0.8740 | | Clock [MHz] | 2591.5589 | | CPI | 0.7097 | | Branch rate | 0.1467 | | Branch misprediction rate | 0.0048 | | Branch misprediction ratio | 0.0327 | | Instructions per branch | 6.8174 | +----------------------------+-----------+ 202
  162. Perf: Branch (Mis)Prediction Example $ perf stat -e branches,branch-misses -r

    10 ./BP 10000000 0 Performance counter stats for './BP 10000000 0' (10 runs): 374,121,213 branches ( +- 0.02% ) 23,260 branch-misses # 0.01% of all branches ( +- 0.35% ) 0.460392835 seconds time elapsed ( +- 0.50% ) $ perf stat -e branches,branch-misses -r 10 ./BP 10000000 1 Performance counter stats for './BP 10000000 1' (10 runs): 374,040,282 branches ( +- 0.01% ) 23,124 branch-misses # 0.01% of all branches ( +- 0.45% ) 0.457583418 seconds time elapsed ( +- 0.04% ) $ perf stat -e branches,branch-misses -r 10 ./BP 10000000 2 Performance counter stats for './BP 10000000 2' (10 runs): 469,331,762 branches ( +- 0.01% ) 15,326,501 branch-misses # 3.27% of all branches ( +- 0.01% ) 0.884858777 seconds time elapsed ( +- 0.30% ) 203
  163. Branch Prediction & Speculative Execution D. Sima, ”Decisive aspects in

    the evolution of microprocessors”, Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 213
  164. Block Enlargement Fisher, J. A. (1983). ”Very Long Instruction Word

    architectures and the ELI-512.” Proceedings of the 10th Annual International Symposium on Computer Architecture. 214
  165. Block Enlargement Joseph A. Fisher and John J. O’Donnell, ”VLIW

    Machines: Multiprocessors We Can Actually Program,” CompCon 84 Proceedings, pp. 299-305, IEEE, 1984. 215
  166. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) 216
  167. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) 216
  168. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 216
  169. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 216
  170. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 • what they have in common: 216
  171. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 • what they have in common: • all predictable! 216
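A minimal sketch (illustrative code, not from the slides) for experimenting with these patterns: every predicate below is deterministic and periodic, so after a short warm-up a modern predictor handles all of them with a near-zero miss rate, regardless of their differing takenness rates; the Bernoulli-driven branch is the genuinely hard case to compare against.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    const std::size_t n = 10000000;
    static std::mt19937 g(1);
    std::bernoulli_distribution coin(0.5);
    std::vector<std::uint8_t> random_bits(n);
    for (std::size_t i = 0; i != n; ++i) random_bits[i] = coin(g);

    std::uint64_t taken[5] = {};
    for (std::size_t i = 0; i != n; ++i) {
        if (i % 2)          ++taken[0]; // 01010101...: 50% taken, period 2
        if (i % 3)          ++taken[1]; // 01101101...: ~67% taken, period 3
        if ((i / 2) % 2)    ++taken[2]; // 00110011...: 50% taken, period 4
        if ((i / 3) % 2)    ++taken[3]; // 00011100...: 50% taken, period 6
        if (random_bits[i]) ++taken[4]; // ~50% taken, aperiodic: the hard one
    }
    for (auto t : taken) std::cout << t << '\n';
}

Running this under perf stat -e branches,branch-misses (or the likwid BRANCH group, as on the earlier slides) should attribute essentially all mispredictions to the last branch.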
  172. Branch Predictability & Marker API https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#using-the-marker-api https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC g++ -Ofast

    -march=native source.cpp -o application -std=c++14 -DLIKWID_PERFMON -lpthread -llikwid likwid-perfctr -f -C 0-3 -g BRANCH -m ./application #include <likwid.h> // . . . LIKWID_MARKER_START("branch"); // branch code LIKWID_MARKER_STOP("branch"); 217
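A slightly fuller sketch of the marker usage (following the likwid wiki linked above; the surrounding init/close calls are required for likwid-perfctr -m to report the region at all):

#include <likwid.h>

int main()
{
    LIKWID_MARKER_INIT;             // set up the marker environment once per process
    LIKWID_MARKER_START("branch");  // begin the named measurement region
    // ... branch-heavy code under test ...
    LIKWID_MARKER_STOP("branch");   // end the region
    LIKWID_MARKER_CLOSE;            // write out results for likwid-perfctr -m
}

Compiled with -DLIKWID_PERFMON -llikwid as shown above; without the define, the marker macros expand to nothing, so the instrumentation costs nothing in ordinary builds.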
  173. Branch Entropy linear entropy: EL(p) = 2 × min(p, 1

    − p) intuition: miss rate proportional to the probability of the least frequent outcome 218
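A quick numeric check of that intuition (a small helper, not from the slides):

#include <algorithm>
#include <iostream>

// Linear branch entropy: E_L(p) = 2 * min(p, 1 - p), p = takenness probability
double linear_entropy(double p) { return 2.0 * std::min(p, 1.0 - p); }

int main()
{
    std::cout << linear_entropy(0.50) << '\n'; // 1.0: 50/50 branch, hardest case
    std::cout << linear_entropy(0.95) << '\n'; // 0.1: mostly-taken branch, rarely missed
    std::cout << linear_entropy(1.00) << '\n'; // 0.0: always taken, trivially predictable
}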
  174. Branch Takenness Probability Sander De Pestel, Stijn Eyerman and Lieven

    Eeckhout, ”Micro-Architecture Independent Branch Behavior Characterization”, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 219
  175. Branch Entropy & Miss Rate: Linear Relationship Sander De Pestel,

    Stijn Eyerman and Lieven Eeckhout, ”Micro-Architecture Independent Branch Behavior Characterization”, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 220
  176. Branches & Expectations: Code I #include <chrono> #include <cmath> #include

    <cstdint> #include <cstdlib> #include <iostream> #include <iterator> #include <numeric> #include <random> #include <string> #include <vector> #define likely(x) (__builtin_expect(!!(x), 1)) #define unlikely(x) (__builtin_expect(!!(x), 0)) #define unpredictable(x) (__builtin_unpredictable((x))) 221
  177. Branches & Expectations: Code II using T = int; void

    f(T z, T & x, T & y) { ((z < 0) ? x : y) = 5; } void generate_never(std::size_t n, std::vector<T> & zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(10, 19); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } 222
  178. Branches & Expectations: Code III void generate_always(std::size_t n, std::vector<T> &

    zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(-19, -10); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } void generate_random(std::size_t n, std::vector<T> & zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(-5, 4); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } 223
  179. Branches & Expectations: Code IV int main(int argc, char *

    argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // takenness predictability type // 0: never; 1: always; 2: random const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\t'; std::vector<T> xs(n), ys(n), zs; if (type == 0) { std::cout << "never"; generate_never(n, zs); } else if (type == 1) { std::cout << "always"; generate_always(n, zs); } else if (type == 2) { std::cout << "random"; generate_random(n, zs); } endl(std::cout); 224
  180. Branches & Expectations: Code V const auto time_start = std::chrono::steady_clock::now();

    T sum = 0; for (std::size_t i = 0; i != n; ++i) { f(zs[i], xs[i], ys[i]); } const auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; std::cout << "duration: " << duration.count() << '\n'; endl(std::cout); std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n'; std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n'; } 225
  181. Branches & Expectations: Compiling & Timing g++ -ggdb -std=c++14 -march=native

    -Ofast ./branches.cpp -o branches_g clang++ -ggdb -std=c++14 -march=native -Ofast ./branches.cpp -o branches_c time ./branches_g 1000000 0 time ./branches_g 1000000 1 time ./branches_g 1000000 2 time ./branches_c 1000000 0 time ./branches_c 1000000 1 time ./branches_c 1000000 2 226
  182. Branches & Expectations: Timings (GCC) $ time ./branches_g 1000000 0

    n = 1000000 type = 0 never duration: 0.00082991 sum(xs): 0 sum(ys): 5000000 real 0m0.034s user 0m0.033s sys 0m0.003s $ time ./branches_g 1000000 1 n = 1000000 type = 1 always duration: 0.000839488 sum(xs): 5000000 sum(ys): 0 real 0m0.031s user 0m0.030s sys 0m0.000s $ time ./branches_g 1000000 2 n = 1000000 type = 2 random duration: 0.0052968 sum(xs): 2498105 sum(ys): 2501895 real 0m0.038s user 0m0.033s sys 0m0.003s 227
  183. Branches & Expectations: Timings (Clang) $ time ./branches_c 1000000 0

    n = 1000000 type = 0 never duration: 0.00091161 sum(xs): 0 sum(ys): 5000000 real 0m0.036s user 0m0.033s sys 0m0.000s $ time ./branches_c 1000000 1 n = 1000000 type = 1 always duration: 0.000765925 sum(xs): 5000000 sum(ys): 0 real 0m0.036s user 0m0.033s sys 0m0.000s $ time ./branches_c 1000000 2 n = 1000000 type = 2 random duration: 0.00554585 sum(xs): 2498105 sum(ys): 2501895 real 0m0.041s user 0m0.040s sys 0m0.000s 228
  184. So many performance events, so little time ”So many performance

    events, so little time,” Gerd Zellweger, Denny Lin, Timothy Roscoe. Proceedings of the 7th Asia-Pacific Workshop on Systems (APSys, Hong Kong, China, August 2016). 229
  185. Hierarchical cycle accounting Andrzej Nowak, David Levinthal, Willy Zwaenepoel: ”Hierarchical

    cycle accounting: a new method for application performance tuning.” ISPASS 2015. https://github.com/David-Levinthal/gooda 230
  186. Top-down Microarchitecture Analysis Method (TMAM) https://github.com/andikleen/pmu-tools/wiki/toplev-manual https://sites.google.com/site/analysismethods/yasin-pubs ”A Top-Down Method

    for Performance Analysis and Counters Architecture,” Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 231
  187. TMAM: Bottlenecks ”A Top-Down Method for Performance Analysis and Counters

    Architecture,” Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 232
  188. TMAM: Breakdown ”A Top-Down Method for Performance Analysis and Counters

    Architecture,” Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 233
  189. TMAM: Meaning Updates: https://download.01.org/perfmon/ ”A Top-Down Method for Performance Analysis

    and Counters Architecture,” Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 234
  190. Branches & Expectations: TMAM, Level 1 (GCC) $ ~/builds/pmu-tools/toplev.py -l1

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu n = 1000000 type = 2 random duration: 0.00523105 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 53.92 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_g 1000000 2 235
  191. Branches & Expectations: TMAM, Level 2 (GCC) $ ~/builds/pmu-tools/toplev.py -l2

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x n = 1000000 type = 2 random duration: 0.00528841 sum(xs): 2498105 sum(ys): 2501895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u, n = 1000000 type = 2 random duration: 0.00550316 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 53.94 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 47.54 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 16.41 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./branches_g 1000000 2 236
  192. Branches & Expectations: TMAM, Level 2, perf (GCC) perf record

    -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period -o perf.data ./branches_g 1000000 2 perf report -Mintel 237
  193. Branches & Expectations: TMAM, Level 1 (Clang) $ ~/builds/pmu-tools/toplev.py -l1

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu n = 1000000 type = 2 random duration: 0.00555177 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 45.53 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_c 1000000 2 238
  194. Branches & Expectations: TMAM, Level 2 (Clang) $ ~/builds/pmu-tools/toplev.py -l2

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umas n = 1000000 type = 2 random duration: 0.0055571 sum(xs): 2498105 sum(ys): 2501895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instruction n = 1000000 type = 2 random duration: 0.00556777 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 45.54 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 39.20 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 15.18 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,p 239
  195. Branches & Expectations: TMAM, Level 2, perf (Clang) perf record

    -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./branches_c 1000000 2 perf report -Mintel 240
  196. Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and

    James E. Smith. 2009. ”A mechanistic performance model for superscalar out-of-order processors.” ACM Trans. Comput. Syst. 27, 2, Article 3. 241
  197. Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and

    James E. Smith, ”A Performance Counter Architecture for Computing Accurate CPI Components”, ASPLOS 2006, pp. 175-184. 242
  198. Virtual Functions & Indirect Branches: Code I #include <chrono> #include

    <cmath> #include <cstdint> #include <cstdlib> #include <iostream> #include <iterator> #include <memory> #include <numeric> #include <random> #include <string> #include <vector> #define str(s) #s #define likely(x) (__builtin_expect(!!(x), 1)) #define unlikely(x) (__builtin_expect(!!(x), 0)) #define unpredictable(x) (__builtin_unpredictable(!!(x))) 243
  199. Virtual Functions & Indirect Branches: Code II using T =

    int; struct base { virtual T f() const { return 0; } }; struct derived_taken : base { T f() const override { return -1; } }; struct derived_untaken : base { T f() const override { return 1; } }; void f(const base & b, T & x, T & y) { ((b.f() < 0) ? x : y) = 119; } void generate_never(std::size_t n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); for (std::size_t i = 0; i != n; ++i) zs.push_back(std::make_unique<derived_untaken>()); return; 244
  200. Virtual Functions & Indirect Branches: Code III } void generate_always(std::size_t

    n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); for (std::size_t i = 0; i != n; ++i) zs.push_back(std::make_unique<derived_taken>()); return; } void generate_random(std::size_t n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); static std::mt19937 g(1); std::bernoulli_distribution z(0.5); for (std::size_t i = 0; i != n; ++i) { if (z(g)) zs.emplace_back(std::make_unique<derived_taken>()); else zs.emplace_back(std::make_unique<derived_untaken>()); 245
  201. Virtual Functions & Indirect Branches: Code IV } return; }

    int main(int argc, char * argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // takenness predictability type // 0: never; 1: always; 2: random std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\t'; std::vector<T> xs(n), ys(n); std::vector<std::unique_ptr<base>> zs; if (type == 0) { std::cout << "never"; generate_never(n, zs); } else if (type == 1) { std::cout << "always"; generate_always(n, zs); } 246
  202. Virtual Functions & Indirect Branches: Code V else if (type

    == 2) { std::cout << "random"; generate_random(n, zs); } endl(std::cout); auto time_start = std::chrono::steady_clock::now(); T sum = 0; for (std::size_t i = 0; i != n; ++i) { f(*zs[i], xs[i], ys[i]); } auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; std::cout << "duration: " << duration.count() << '\n'; endl(std::cout); std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n'; std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n'; } 247
  203. Virtual Functions & Indirect Branches: Compiling & Timing g++ -ggdb

    -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_g clang++ -ggdb -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_c time ./vbranches_g 10000000 0 time ./vbranches_g 10000000 1 time ./vbranches_g 10000000 2 time ./vbranches_c 10000000 0 time ./vbranches_c 10000000 1 time ./vbranches_c 10000000 2 248
  204. Virtual Functions & Indirect Branches: Timings (GCC) $ time ./vbranches_g

    10000000 0 n = 10000000 type = 0 never duration: 0.0338749 sum(xs): 0 sum(ys): 1190000000 real 0m0.645s user 0m0.573s sys 0m0.070s $ time ./vbranches_g 10000000 1 n = 10000000 type = 1 always duration: 0.0406144 sum(xs): 1190000000 sum(ys): 0 real 0m0.648s user 0m0.563s sys 0m0.083s $ time ./vbranches_g 10000000 2 n = 10000000 type = 2 random duration: 0.131803 sum(xs): 595154105 sum(ys): 594845895 real 0m0.956s user 0m0.863s sys 0m0.090s 249
  205. Branches & Expectations: Timings (Clang) $ time ./vbranches_c 10000000 0

    n = 10000000 type = 0 never duration: 0.0314749 sum(xs): 0 sum(ys): 1190000000 real 0m0.623s user 0m0.530s sys 0m0.090s $ time ./vbranches_c 10000000 1 n = 10000000 type = 1 always duration: 0.0314727 sum(xs): 1190000000 sum(ys): 0 real 0m0.623s user 0m0.557s sys 0m0.063s $ time ./vbranches_c 10000000 2 n = 10000000 type = 2 random duration: 0.0854935 sum(xs): 595154105 sum(ys): 594845895 real 0m1.863s user 0m1.800s sys 0m0.063s 250
  206. Virtual Functions & Indirect Branches: TMAM, Level 1 (GCC) $

    ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x9c,umask=0x1/u,cycles:u} n = 10000000 type = 2 random duration: 0.131386 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 35.96 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. BAD Bad_Speculation: 12.98 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_g 10000000 2 251
  207. Virtual Functions & Indirect Branches: TMAM, Level 2 (GCC) $

    ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0xc5,umask=0x0/u,c n = 10000000 type = 2 random duration: 0.131247 sum(xs): 595154105 sum(ys): 594845895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,cycles:u,cpu/event=0xa3,umask=0x4,cmask= n = 10000000 type = 2 random duration: 0.131361 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 36.02 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 17.41 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u BAD Bad_Speculation: 12.92 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. BAD Bad_Speculation.Branch_Mispredicts: 12.75 % [100.00%] This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues.. http://www.intel.com/content/www/us/en/architecture-and- technology/64-ia-32-architectures-optimization-manual.html Sampling events: br_misp_retired.all_branches:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data 252
  208. Virtual Functions & Indirect Branches: TMAM, Level 3 (GCC) $

    ~/builds/pmu-tools/toplev.py -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 n = 10000000 type = 2 random duration: 0.13145 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 35.96 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 17.44 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.69 % [100.00%] This metric represents cycles fraction the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path, following all sorts of miss-predicted branches. For example, branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sampling events: br_misp_retired.all_branches:u BAD Bad_Speculation: 12.97 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. BAD Bad_Speculation.Branch_Mispredicts: 12.82 % [100.00%] This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues.. http://www.intel.com/content/www/us/en/architecture-and- technology/64-ia-32-architectures-optimization-manual.html Sampling events: br_misp_retired.all_branches:u Sampling: perf record -g -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./vbranches_g 10000000 2 253
  209. Virtual Functions: TMAM, Level 3, perf (GCC) perf record -g

    -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_g 10000000 2 perf report -Mintel 254
  210. Virtual Functions & Indirect Branches: TMAM, Level 1 (Clang) $

    ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu n = 10000000 type = 2 random duration: 0.0858722 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.66 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_c 10000000 2 255
  211. Virtual Functions & Indirect Branches: TMAM, Level 2 (Clang) $

    ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umas n = 10000000 type = 2 random duration: 0.0859943 sum(xs): 595154105 sum(ys): 594845895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instruction n = 10000000 type = 2 random duration: 0.0861661 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.61 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 26.64 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 9.04 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,p 256
  212. Virtual Functions & Indirect Branches: TMAM, Level 3 (Clang) ~/builds/pmu-tools/toplev.py

    -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.65 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 26.63 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u FE Frontend_Bound.Frontend_Latency.MS_Switches: 8.40 % [100.00%] This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB or MITE pipelines. Certain operations cannot be handled natively by the execution pipeline, and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID, or uncommon conditions like Floating Point Assists when dealing with Denormals. Sampling events: idq.ms_switches:u RET Retiring.Microcode_Sequencer: 9.04 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,edge=1,cmask=1,name=MS_Switches_IDQ_MS_SWITCHES,period=2000003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./vbranches_c 10000000 2 257
  213. Virtual Functions: TMAM, Level 3, perf (Clang) perf record -g

    -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_c 10000000 2 perf report -Mintel 258
  214. Compiler-Specific Built-in Functions GCC & Clang: __builtin_expect http://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

    likely & unlikely https://kernelnewbies.org/FAQ/LikelyUnlikely Clang: __builtin_unpredictable http://clang.llvm.org/docs/LanguageExtensions.html#builtin-unpredictable 261
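A minimal usage sketch (hypothetical code, reusing the macro definitions from the earlier listing): the hint only tells the compiler which outcome to treat as the hot, fall-through path when laying out the code; it does not make a genuinely random branch predictable.

#define likely(x)   (__builtin_expect(!!(x), 1))
#define unlikely(x) (__builtin_expect(!!(x), 0))

int process(int value)
{
    if (unlikely(value < 0)) // error path: expected cold, laid out out-of-line
        return -1;
    return value * 2;        // hot path: falls through without a taken branch
}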
  215. Branch Misprediction, Speculation, and Wrong-Path Execution J. Reineke et al.,

    “A Definition and Classification of Timing Anomalies,” Proc. Int’l Workshop Worst Case Execution Time (WCET), 2006. 262
  216. Branch Misprediction Penalty & Wrong-Path Execution Tejas S. Karkhanis and

    James E. Smith. 2004. ”A First-Order Superscalar Processor Model.” In Proceedings of the 31st annual international symposium on Computer architecture (ISCA ’04). 263
  217. The Number of Cycles Sam Van den Steen; Stijn Eyerman;

    Sander De Pestel; Moncef Mechri; Trevor E. Carlson; David Black-Schaffer; Erik Hagersten; Lieven Eeckhout, “Analytical Processor Performance and Power Modeling using Micro-Architecture Independent Characteristics,” Transactions on Computers (TC) 2016. C - #cycles, N - #instructions, Deff - effective dispatch rate, mbpred - #branch mispredictions, cres - branch resolution time, cfe - front-end pipeline depth, mILi - #instruction fetch misses at each level i in the cache hierarchy, cLi - access latency to each cache level, ROB - size of the Reorder Buffer, mLLC - #number of LLC load misses, cmem - memory access time, cbus - memory bus transfer and waiting time, MLP - amount of memory-level parallelism, PhLLC - LLC hit chain penalty 264
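The slide lists the model's inputs only; as a rough orientation (an illustrative first-order, interval-style decomposition in the spirit of the cited mechanistic models, not the paper's exact equation, which also accounts for the ROB size and the LLC hit-chain penalty PhLLC):

\[ C \;\approx\; \frac{N}{D_{\mathit{eff}}} \;+\; m_{\mathit{bpred}} \cdot (c_{\mathit{res}} + c_{\mathit{fe}}) \;+\; \sum_{i} m_{\mathit{IL}_i} \cdot c_{L_i} \;+\; \frac{m_{\mathit{LLC}}}{\mathit{MLP}} \cdot (c_{\mathit{mem}} + c_{\mathit{bus}}) \]

i.e., a base term of useful dispatch plus penalty terms for branch mispredictions, instruction-fetch misses per cache level, and (partially overlappable) long-latency loads.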
  218. Cache-aware Roofline model ”Cache-aware Roofline model: Upgrading the loft.” Aleksandar

    Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 267
  219. Cache-aware Roofline model ”Cache-aware Roofline model: Upgrading the loft.” Aleksandar

    Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 268
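The bound underlying these figures is the classic roofline, which the cache-aware variant evaluates once per level of the memory hierarchy (one roof each for L1, L2, L3, and DRAM):

\[ P_{\mathrm{attainable}}(I) \;=\; \min\bigl(P_{\mathrm{peak}},\; I \cdot B\bigr) \]

where I is the arithmetic (operational) intensity in flops/byte and B the sustainable bandwidth of the memory level under consideration.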
  220. Roofline Model: Microarchitectural Bottlenecks ”Extending the Roofline Model: Bottleneck Analysis

    with Microarchitectural Constraints.” Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 269
  221. Roofline Model: Microarchitectural Bottlenecks ”Extending the Roofline Model: Bottleneck Analysis

    with Microarchitectural Constraints.” Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 270
  222. C++ Standards: C++11 & C++14 Atomic Operations & Concurrent Memory

    Model http://en.cppreference.com/w/cpp/atomic http://github.com/MattPD/cpplinks/blob/master/atomics.lockfree.memory_model.md ”The C11 and C++11 Concurrency Model” by Mark John Batty: http://www.cl.cam.ac.uk/~mjb220/thesis/ Move semantics https://isocpp.org/wiki/faq/cpp11-language#rval http://thbecker.net/articles/rvalue_references/section_01.html http://kholdstare.github.io/technical/2013/11/23/moves-demystified.html scoped_allocator (stateful allocators support) https://isocpp.org/wiki/faq/cpp11-library#scoped-allocator http://en.cppreference.com/w/cpp/header/scoped_allocator https://accu.org/content/conf2012/JonathanWakely-CXX11_allocators.pdf https://accu.org/content/conf2013/Frank_Birbacher_Allocators.r210article.pdf 271
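A minimal sketch of the atomics surface listed above (hypothetical counter, not from the slides): the explicit memory-order arguments are where the C++11 concurrency memory model becomes visible in code.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<long> counter{0};

int main()
{
    auto work = [] {
        for (int i = 0; i != 1000000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed); // atomicity only, no ordering
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    std::cout << counter.load() << '\n'; // 2000000: no updates lost
}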
  223. C++ Standards: C++11, C++14, and C++17 reducing the need for

    conditional compilation via macros and template metaprogramming constexpr https://isocpp.org/wiki/faq/cpp11-language#cpp11-constexpr https://isocpp.org/wiki/faq/cpp14-language#extended-constexpr if constexpr http://en.cppreference.com/w/cpp/language/if#Constexpr_If 272
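A small sketch of that point (hypothetical helper): with C++17 if constexpr the discarded branch is not instantiated, so a single template can replace an #ifdef or a tag-dispatch overload set.

#include <iostream>
#include <type_traits>

template <typename T>
auto describe(T value)
{
    if constexpr (std::is_floating_point<T>::value)
        return value * 0.5; // only instantiated for floating-point T
    else
        return value / 2;   // only instantiated for integral T
}

int main()
{
    std::cout << describe(3.0) << ' ' << describe(3) << '\n'; // prints: 1.5 1
}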
  224. C++17 Standard std::string_view http://en.cppreference.com/w/cpp/string/basic_string_view interoperability with C APIs (e.g., sockets)

    without extra allocations / copies std::aligned_alloc (C11) http://en.cppreference.com/w/cpp/memory/c/aligned_alloc aligned uninitialized storage allocation (vectorization) Hardware interference size http://eel.is/c++draft/hardware.interference http://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size portable cache line size information (e.g., padding to avoid false sharing) Extended allocators & polymorphic memory resources http://en.cppreference.com/w/cpp/memory/polymorphic_allocator http://stackoverflow.com/questions/38010544/polymorphic-allocator-when-and-why-should-i-use-it http://boost.org/doc/libs/release/doc/html/container/extended_functionality.html 273
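A sketch of the false-sharing use case for the interference-size constant (hypothetical; compiler support for the constant was still arriving around the time of the talk, hence the hand-written 64-byte fallback, which assumes a typical x86 cache line):

#include <atomic>
#include <cstddef>
#include <new>

#if defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t cacheline = std::hardware_destructive_interference_size;
#else
constexpr std::size_t cacheline = 64; // assumption: typical x86 cache line size
#endif

// Two counters updated by different threads; the alignment keeps them on separate
// cache lines so the writes do not ping-pong a shared line between cores.
struct counters {
    alignas(cacheline) std::atomic<long> a{0};
    alignas(cacheline) std::atomic<long> b{0};
};

int main()
{
    counters c;
    c.a.fetch_add(1);
    c.b.fetch_add(1);
    return static_cast<int>(c.a.load() + c.b.load()) - 2; // 0
}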
  225. C++ Core Guidelines P: Philosophy • P.9: Don’t waste time

    or space. Per: Performance • Per.3: Don’t optimize something that’s not performance critical. • Per.6: Don’t make claims about performance without measurements. • Per.7: Design to enable optimization • Per.18: Space is time. • Per.19: Access memory predictably. • Per.30: Avoid context switches on the critical path https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-performance https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#S-performance 274
  236. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict • cmov & tradeoffs: converting control dependencies to data dependencies 275
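To make the branches-vs-predication bullets concrete, a small sketch of my own: the second loop converts the control dependency into a data dependency, trading possible mispredictions for unconditional work. Whether the compiler actually emits cmov/setcc for it depends on the compiler and flags.

    #include <cstdint>
    #include <random>
    #include <vector>

    // Branchy version: cost depends on how predictable (x > threshold) is;
    // hard-to-predict data pays the misprediction penalty on many iterations.
    std::int64_t sum_above_branchy(const std::vector<int>& v, int threshold) {
        std::int64_t s = 0;
        for (int x : v)
            if (x > threshold)
                s += x;
        return s;
    }

    // Branch-free version: the condition feeds the data path (multiply by 0 or 1)
    // instead of the control path -- no branch to mispredict, some useless work.
    std::int64_t sum_above_branchless(const std::vector<int>& v, int threshold) {
        std::int64_t s = 0;
        for (int x : v)
            s += static_cast<std::int64_t>(x) * (x > threshold);
        return s;
    }

    int main() {
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> dist(0, 100);
        std::vector<int> v(1 << 20);
        for (int& x : v) x = dist(rng);   // hard-to-predict pattern
        return sum_above_branchy(v, 50) == sum_above_branchless(v, 50) ? 0 : 1;
    }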
  237. Takeaways Principles Data structures & data layout - fundamental part

    of design CPUs & pervasive forms of parallelism • can support each other: PLP, ILP (MLP!), TLP, DLP Balanced design vs. bottlenecks Overlapping latencies Sharing-contention-interference-slowdown Yale Patt’s Phase 2: Break the layers: • break through the hardware/software interface • harness all levels of the transformation hierarchy 276
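A sketch of the ILP/overlapping-latencies point (my illustration, not code from the talk): multiple independent accumulators break a latency-bound dependency chain so the out-of-order core can overlap the adds.

    #include <cstddef>
    #include <vector>

    // One accumulator: every add waits for the previous one (latency-bound chain).
    double sum_serial(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s;
    }

    // Four independent accumulators: the adds can execute in parallel (ILP),
    // moving the loop from add latency toward add throughput. Results may differ
    // in the last bits because floating-point addition is not associative.
    double sum_ilp(const std::vector<double>& v) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0, n = v.size();
        for (; i + 4 <= n; i += 4) {
            s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
        }
        for (; i < n; ++i) s0 += v[i];
        return (s0 + s1) + (s2 + s3);
    }

    int main() {
        std::vector<double> v(1 << 22, 1.0);
        return sum_serial(v) == sum_ilp(v) ? 0 : 1;   // equal here: all elements are 1.0
    }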
  238. Phase 2: Harnessing the Transformation Hierarchy Yale N. Patt, Microprocessor

    Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 277
  239. Break the Layers Yale N. Patt, Microprocessor Performance, Phase 2:

    Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 278
  240. Pigeonholing has to go Yale N. Patt at Yale Patt

    75 Visions of the Future Computer Architecture Workshop: ”Are you a software person or a hardware person?” I’m a person; this pigeonholing has to go. We must break the layers. Abstractions are great - AFTER you understand what’s being abstracted. Yale N. Patt, 2013 IEEE CS Harry H. Goode Award Recipient Interview — https://youtu.be/S7wXivUy-tk Yale N. Patt at Yale Patt 75 Visions of the Future Computer Architecture Workshop — https://youtu.be/x4LH1cJCvxs 279
  241. Out-of-Order Execution: Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting

    Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 282
  242. Out-of-Order Execution: Reservation Stations R.M. Tomasulo, “An Efficient Algorithm for

    Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 283
  243. Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm

    for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 284
  245. Out-of-Order Execution of Simple Micro-Operations Y.N. Patt, W.M. Hwu, and

    M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 286
  246. Out-of-Order Execution: Restricted Dataflow Y.N. Patt, W.M. Hwu, and M.

    Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 287
  247. Out-of-Order Execution: Results Buffer Y.N. Patt, W.M. Hwu, and M.

    Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 288
  248. Pipelining & Precise Exceptions: Reorder Buffer (ROB) J.E. Smith and

    A.R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. 12th Ann. IEEE/ACM Int’l Symp. Computer Architecture, 1985, pp. 36–44. 289
  249. Execution: Superscalar & Out-Of-Order J.E. Smith and G.S. Sohi, ”The

    Microarchitecture of Superscalar Processors,” Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 290
  250. Superscalar CPU Organization J.E. Smith and G.S. Sohi, ”The Microarchitecture

    of Superscalar Processors,” Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 291
  251. Superscalar CPU: ROB J.E. Smith and G.S. Sohi, ”The Microarchitecture

    of Superscalar Processors,” Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 292
  252. Computer Architecture: A Science of Tradeoffs ”My tongue in cheek

    phrase to emphasize the importance of tradeoffs to the discipline of computer architecture. Clearly, computer architecture is more art than science. Science, we like to think, involves a coherent body of knowledge, even though we have yet to figure out all the connections. Art, on the other hand, is the result of individual expressions of the various artists. Since each computer architecture is the result of the individual(s) who specified it, there is no such completely coherent structure. So, I opined if computer architecture is a science at all, it is a science of tradeoffs. In class, we keep coming up with design choices that involve tradeoffs. In my view, ”tradeoffs” is at the heart of computer architecture.” — Yale N. Patt 293
  253. Design Points: Dictated by the Application Space The design of a

    microprocessor is about making relevant tradeoffs. We refer to the set of considerations, along with the relevant importance of each, as the “design point” for the microprocessor—that is, the characteristics that are most important to the use of the microprocessor, such that one is willing to be less concerned about other characteristics. In each case, it is usually the problem we are addressing . . . which dictates the design point for the microprocessor, and the resulting tradeoffs that must be made. Patt, Y., & Cockrell, E. (2001). ”Requirements, bottlenecks, and good fortune: Agents for microprocessor evolution.” Proceedings of the IEEE, 89(11), 1553-1559. 294
  254. A Science of Tradeoffs Software Performance Optimization - Analogous! The

    multiplicity of tradeoffs: • Multidimensional • Multiple levels • Costs and benefits 295
  255. Trade-offs - Latency & Bandwidth I

    Intel(R) Memory Latency Checker - v3.1a
    Measuring idle latencies (in ns)...
            Memory node
    Socket       0
         0    60.4
    Measuring Peak Memory Bandwidths for the system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using traffic with the following read-write ratios
    ALL Reads        : 24152.0
    3:1 Reads-Writes : 22313.2
    2:1 Reads-Writes : 22050.5
    1:1 Reads-Writes : 21130.4
    Stream-triad like: 21559.4
    296
  256. Trade-offs - Latency & Bandwidth II

    Measuring Memory Bandwidths between nodes within system
    Using Read-only traffic type
            Memory node
    Socket       0
         0   24155.0
    Measuring Loaded Latencies for the system
    Using Read-only traffic type
    Inject   Latency   Bandwidth
    Delay    (ns)      MB/sec
    ==========================
    00000    122.27    24109.6
    00002    121.99    24082.7
    00008    120.60    23952.1
    00015    119.28    23837.6
    00050     70.87    17408.7
    00100     64.59    12496.6
    297
  257. Trade-offs - Latency & Bandwidth III

    Inject   Latency   Bandwidth
    Delay    (ns)      MB/sec
    ==========================
    00200    61.76     8129.1
    00300    60.75     6194.8
    00400    60.63     5085.6
    00500    60.12     4377.0
    00700    60.51     3505.2
    01000    60.60     2812.6
    01300    60.66     2425.3
    01700    60.51     2117.0
    02500    60.36     1789.5
    03500    60.33     1585.4
    05000    60.29     1430.9
    09000    60.31     1267.9
    20000    60.32     1154.7
    298
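The numbers above come from Intel MLC. As a rough, hedged illustration of why dependent loads expose latency rather than bandwidth, a minimal pointer-chasing microbenchmark might look like the sketch below; the working-set size, step count, and the single-cycle (Sattolo) shuffle are arbitrary choices of mine, and the result is only an approximation of what MLC reports.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        // ~256 MiB of 8-byte indices: far larger than the last-level cache.
        constexpr std::size_t n = (256u << 20) / sizeof(std::size_t);
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's shuffle: a single cycle over all n elements, so the chase
        // visits the whole working set in a cache-hostile order.
        std::mt19937_64 rng{42};
        for (std::size_t i = n - 1; i > 0; --i)
            std::swap(next[i], next[rng() % i]);

        constexpr std::size_t steps = std::size_t{1} << 24;
        std::size_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < steps; ++i)
            idx = next[idx];                  // each load depends on the previous one
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        std::printf("~%.1f ns per dependent load (idx=%zu)\n", ns, idx);
        return 0;
    }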
  258. Trade-offs - Latency & Size I

    Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm.
    RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
    http://www.7-cpu.com/cpu/SandyBridge.html
    Size    Latency       Increase     Description
    32 K    4
    64 K    8             4            + 8 (L2)
    128 K   10            2
    256 K   11            1
    512 K   20            9            + 16 (L3)
    1 M     24            4
    2 M     26            2
    4 M     27 + 18 ns    1 + 18 ns    + 56 ns (RAM)
    8 M     28 + 38 ns    1 + 20 ns
    16 M    28 + 47 ns    9 ns
    32 M    28 + 52 ns    5 ns
    64 M    28 + 54 ns    2 ns
    128 M   36 + 55 ns    8 + 1 ns     + 16 (TLB miss)
    299
  259. Trade-offs - Latency & Size II

    Size      Latency       Increase    Description
    256 M     40 + 56 ns    4 + 1 ns
    512 M     42 + 56 ns    2
    1024 M    43 + 56 ns    1
    2048 M    44 + 56 ns    1
    4096 M    44 + 56 ns    0
    8192 M    53 + 56 ns    9           + 18 (PDPTE cache miss)
    Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm.
    RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
    http://www.7-cpu.com/cpu/SandyBridge.html
    300
  260. Trade-offs - Least Squares Golub & Van Loan (2013) ”Matrix

    Computations” Trade-offs: FLOPs (FLoating-point OPerations) vs. Applicability / Numerical Stability / Speed / Accuracy Example: Catalogue of dense decompositions: http://eigen.tuxfamily.org/dox/group__TopicLinearAlgebraDecompositions.html 301
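A brief sketch (assuming Eigen is available, as in the catalogue linked above) contrasting three least-squares solvers: normal equations via LDLT are cheapest in FLOPs but square the condition number, while column-pivoted QR and SVD cost more and are progressively more robust. Matrix sizes here are arbitrary.

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(200, 4);
        Eigen::VectorXd b = Eigen::VectorXd::Random(200);

        // Fastest, least robust: solve the normal equations A^T A x = A^T b.
        Eigen::VectorXd x_ne  = (A.transpose() * A).ldlt().solve(A.transpose() * b);
        // Middle ground: column-pivoted Householder QR.
        Eigen::VectorXd x_qr  = A.colPivHouseholderQr().solve(b);
        // Most robust (and most expensive): SVD.
        Eigen::VectorXd x_svd = A.jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(b);

        std::cout << (x_ne - x_svd).norm() << " " << (x_qr - x_svd).norm() << "\n";
        return 0;
    }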
  261. Trade-offs - Multidimensional - Numerical Optimization Ben Recht, Feng Niu,

    Christopher Ré, Stephen Wright. ”Lock-Free Approaches to Parallelizing Stochastic Gradient Descent” OPT 2011: 4th International Workshop on Optimization for Machine Learning http://opt.kyb.tuebingen.mpg.de/slides/opt2011-recht.pdf 302
  262. Trade-offs - Multiple levels - Numerical Optimization Gradient computation -

    accuracy vs. function evaluations f : R^d → R^N • Finite differencing: • forward-difference: O(√ϵ_M) error, d · O(Cost(f)) evaluations • central-difference: O(ϵ_M^(2/3)) error, 2d · O(Cost(f)) evaluations w/ the machine epsilon ϵ_M := inf{ϵ > 0 : 1.0 + ϵ ≠ 1.0} • Algorithmic differentiation (AD): precision - as in hand-coded analytical gradient • rough forward-mode cost d · O(Cost(f)) • rough reverse-mode cost N · O(Cost(f)) 303
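For the scalar case (N = 1), a hedged sketch of forward- vs. central-difference gradients using the usual step-size rules of thumb h ≈ √ϵ_M and h ≈ ϵ_M^(1/3); grad_forward and grad_central are illustrative names, not code from the talk.

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <limits>
    #include <vector>

    using Fn = std::function<double(const std::vector<double>&)>;

    // Forward difference: d extra evaluations of f, O(sqrt(eps_M)) accuracy.
    std::vector<double> grad_forward(const Fn& f, std::vector<double> x) {
        const double h = std::sqrt(std::numeric_limits<double>::epsilon());
        const double f0 = f(x);
        std::vector<double> g(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            const double xi = x[i];
            x[i] = xi + h;
            g[i] = (f(x) - f0) / h;
            x[i] = xi;
        }
        return g;
    }

    // Central difference: 2d evaluations of f, O(eps_M^(2/3)) accuracy.
    std::vector<double> grad_central(const Fn& f, std::vector<double> x) {
        const double h = std::cbrt(std::numeric_limits<double>::epsilon());
        std::vector<double> g(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            const double xi = x[i];
            x[i] = xi + h; const double fp = f(x);
            x[i] = xi - h; const double fm = f(x);
            g[i] = (fp - fm) / (2 * h);
            x[i] = xi;
        }
        return g;
    }

    int main() {
        Fn f = [](const std::vector<double>& x) { return x[0] * x[0] + std::sin(x[1]); };
        auto gf = grad_forward(f, {1.0, 0.5});
        auto gc = grad_central(f, {1.0, 0.5});
        std::printf("forward: %.8f %.8f\ncentral: %.8f %.8f\n", gf[0], gf[1], gc[0], gc[1]);
    }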
  263. Trade-offs: Costs and Benefits Gabriel, Richard P. (1985). ”Performance and

    Evaluation of Lisp Systems.” Cambridge, Mass: MIT Press; Computer Systems Series. 304
  264. Costs and Benefits: Implications • Important to know what to

    focus on • Optimize the optimization: so that it doesn’t always take hours or days or weeks or months... 305
  265. Superscalar CPU Model Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and

    James E. Smith. 2009. ”A mechanistic performance model for superscalar out-of-order processors.” ACM Trans. Comput. Syst. 27, 2, Article 3. 306
  266. NVMs as Storage Class Memories - Bottlenecks: New & Old

    Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, ”A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory” Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, 2013. 307
  267. DBs Execution Cycles: Useful Computation vs. Stall Cycles R. Panda,

    C. Erb, M. LeBeane, J. H. Ryoo and L. K. John, ”Performance Characterization of Modern Databases on Out-of-Order CPUs,” Computer Architecture and High Performance Computing (SBAC-PAD), 2015 27th International Symposium on, Florianopolis, 2015, pp. 114-121. 308
  268. System Calls - Performance Impact Livio Soares and Michael Stumm.

    2010. ”FlexSC: flexible system call scheduling with exception-less system calls.” In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI’10). USENIX Association, Berkeley, CA, USA, 33-46. 309
  269. System Calls, Interrupts, and Asynchronous I/O Jisoo Yang, Dave B.

    Minturn, and Frank Hady. 2012. ”When poll is better than interrupt.” In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST’12). USENIX Association, Berkeley, CA, USA. 310
  270. System Calls as CPU Exceptions Craig B. Zilles, Joel S.

    Emer, and Gurindar S. Sohi. 1999. ”The use of multithreading for exception handling.” In Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture (MICRO 32). IEEE Computer Society, Washington, DC, USA, 311
  271. Pollution & Context Switch Misses Replaced Miss (D) & Reordered

    Miss (C) F. Liu, F. Guo, Y. Solihin, S. Kim and A. Eker, ”Characterizing and modeling the behavior of context switch misses”, Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008. 312
  272. Beyond Mode Switch Time: Footprint & Pollution Livio Soares and

    Michael Stumm. 2010. ”FlexSC: flexible system call scheduling with exception-less system calls.” In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI’10). USENIX Association, Berkeley, CA, USA, 33-46. 313
  273. Beyond Mode Switch Time: Direct & Indirect Costs Livio Soares

    and Michael Stumm. 2010. ”FlexSC: flexible system call scheduling with exception-less system calls.” In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI’10). USENIX Association, Berkeley, CA, USA, 33-46. 314
  274. Partitioning-Sharing Tradeoffs Butler W. Lampson. 1983. ”Hints for computer system

    design.” In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP ’83). ACM, New York, NY, USA, 33-48. 315
  275. Shared Resource: DRAM Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni.

    ”PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms,” IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), 2014. https://github.com/heechul/palloc 316
  276. Shared Resource: MSHRs Heechul Yun, Rodolfo Pellizzoni, and Prathap Kumar

    Valsan. 2015. ”Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems.” In Proceedings of the 2015 27th Euromicro Conference on Real-Time Systems (ECRTS ’15). 317
  277. Partitioning Multithreading • Thread affinity • POSIX: sched_getcpu, pthread_setaffinity_np •

    http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/ • https://github.com/RRZE-HPC/likwid/blob/master/groups/skylake/FALSE_SHARE.txt • Local LLC false sharing rate = MEM_LOAD_L3_HIT_RETIRED_XSNP_HITM / MEM_INST_RETIRED_ALL • NUMA: Remote Memory Accesses (RMA), Local Memory Accesses (LMA), RMA/LMA ratio • https://01.org/numatop/ • https://github.com/01org/numatop 318
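A Linux/glibc-specific sketch of the affinity calls named above (pthread_setaffinity_np, sched_getcpu); compile with -pthread. Pinning to core 2 is an arbitrary choice for illustration.

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);   // allow this thread to run on CPU 2 only

        // Pin the calling thread; pinned threads make per-core performance
        // counters and cache behavior easier to reason about.
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0) {
            std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
            return 1;
        }
        std::printf("now running on CPU %d\n", sched_getcpu());
        return 0;
    }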
  278. Cache Partitioning: Index-Based & Way-Based Giovani Gracioli, Ahmed Alhammad, Renato

    Mancuso, Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. ”A Survey on Cache Management Mechanisms for Real-Time Embedded Systems.” ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 319
  279. Cache Partitioning: CPU Support Giovani Gracioli, Ahmed Alhammad, Renato Mancuso,

    Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. ”A Survey on Cache Management Mechanisms for Real-Time Embedded Systems.” ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 320
  280. Cache Partitioning & Intel: CAT & CMT Cache Monitoring Technology

    and Cache Allocation Technology https://github.com/01org/intel-cmt-cat A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, “Cache QoS: From concept to reality in the Intel Xeon processor E5-2600 v3 product family,” in Intl. Symp. on High Performance Computer Architecture (HPCA), Mar. 2016. 321
  281. Cache Partitioning != Cache Access Timing Isolation H. Yun and

    P. Valsan, ”Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms”, International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 322
  284. Cache Partitioning != Cache Access Timing Isolation https://github.com/CSL-KU/IsolBench Prathap Kumar

    Valsan, Heechul Yun, Farzad Farshchi. ”Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems.” IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 325
  285. Cache Partitioning != Cache Access Timing Isolation • Shared: MSHRs

    (Miss information/Status Holding Registers) / LFBs (Line Fill Buffers) • Contention => cache space partitioning != cache access timing isolation Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. ”Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems.” IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 326
  286. Cache Partitioning != Cache Access Timing Isolation • multiple MSHRs

    support multiple outstanding cache-misses • the number of MSHRs determines the MLP of the cache • local MLP - outstanding misses one core can generate • global MLP - parallelism of the entire shared memory hierarchy (i.e., shared LLC and DRAM) • ”the aggregated parallelism of the cores (the sum of local MLP) exceeds the parallelism supported by the shared LLC and DRAM (global MLP) in the out-of-order architectures” Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. ”Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems.” IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 327
  287. Shared Resource (MSHRs) & Prefetching: Xeon Phi Zhenman Fang, Sanyam

    Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. ”Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking.” ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 328
  288. Shared Resource (MSHRs) & Prefetching: SNB Zhenman Fang, Sanyam Mehta,

    Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. ”Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking.” ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 329
  289. Weighted Speedup A. Snavely and D. M. Tullsen, “Symbiotic jobscheduling

    for a simultaneous multithreading processor,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Nov. 2000, pp. 234–244. S. Eyerman and L. Eeckhout, “Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance,” in Computer Architecture Letters, vol. 13, no. 2, 2014. 330
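As a brief summary of the metric in these papers (my wording, not slide text): for n programs co-running on an SMT/CMP machine, weighted speedup is typically computed as WS = sum over i of IPC_i(shared) / IPC_i(alone), i.e., each program's throughput under sharing normalized by its throughput when running alone. WS close to n indicates little interference; WS well below n indicates contention-induced slowdown.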