
Computer Architecture, C++, and High Performance (CppCon 2016)

With the increase in available computational power, Nathan Myhrvold's Laws of Software continue to apply: new opportunities enable new applications with increased needs, which subsequently become constrained by the hardware that used to be "modern" at adoption time. C++ itself opens access to high-quality optimizing compilers and a wide ecosystem of high-performance tooling and libraries. At the same time, simply turning on the highest optimization flags and hoping for the best is not going to automagically yield the highest performance -- i.e., the lowest execution time. The reasons are twofold: algorithms' performance can differ in theory -- and that of their implementations can differ even more in practice.

Modern CPU architecture has continued to yield increases in performance through advances in microarchitecture, such as pipelining, multiple issue (superscalar), out-of-order execution, branch prediction, SIMD-within-a-register (SWAR) vector units, and chip multi-processor (CMP, also known as multi-core) architecture. All of these developments have provided the opportunity of higher peak performance -- while at the same time raising new optimization challenges when actually trying to reach that peak.

In this talk we'll consider the properties of code which can make it either friendly -- or hostile -- to a modern microprocessor. We will offer advice on achieving higher performance: from ways of analyzing it beyond algorithmic complexity, through recognizing the aspects we can entrust to the compiler, to practical optimization of existing code. Instead of stopping at the "you should measure it" advice (which is correct, but incomplete), the talk focuses on practical, hands-on examples of _how_ to actually perform the measurements (presenting tools -- including perf and likwid -- that simplify access to CPU performance-monitoring counters) and how to reason about the resulting measurements (informed by an understanding of modern CPU architecture, the generated assembly code, and an in-depth look at how CPU cycles are spent using modern microarchitectural simulation tools) in order to improve the performance of C++ applications.

Matt P. Dziubinski

September 19, 2016


Transcript

  1. Computer Architecture, C++, and High Performance Matt P. Dziubinski CppCon

    2016 [email protected] // @matt_dz Department of Mathematical Sciences, Aalborg University CREATES (Center for Research in Econometric Analysis of Time Series)
  2. Outline • Performance • Why do we care? • What

    is it? • How to • measure it - reason about it - improve it? 2
  3. Costs and Curves Moore, Gordon E. (1965). "Cramming more components

    onto integrated circuits". Electronics Magazine. 4
  4. Cramming more components onto integrated circuits Moore, Gordon E. (1965).

    "Cramming more components onto integrated circuits". Electronics Magazine. 5
  5. Transformation Hierarchy Yale N. Patt, Microprocessor Performance, Phase 2: Can

    We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 7
  6. Phase I & The Walls Yale N. Patt, Microprocessor Performance,

    Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 8
  7. CPU Performance Trends

    [chart: processor performance relative to the VAX-11/780, 1978-2012, from the VAX-11/785 through MIPS, Sun, Alpha, POWER, Athlon, Pentium, and multi-core Core i7/Xeon machines; growth of roughly 25%/year until 1986, 52%/year from 1986 to 2003, and 22%/year thereafter] Hennessy, John L.; Patterson, David A., 2011, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann. 9
  8. Processor-Memory Performance Gap

    [chart: processor vs. memory performance, 1980-2010, log scale] The difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access. Hennessy, John L.; Patterson, David A., 2011, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann. Computer Architecture is Back: Parallel Computing Landscape https://www.youtube.com/watch?v=On-k-E5HpcQ 11
  9. DRAM Performance Trends D. Lee: "Reducing DRAM Latency at Low

    Cost by Exploiting Heterogeneity." http://arxiv.org/abs/1604.08041 (2016) D. Lee et al., "Tiered-latency DRAM: A low latency and low cost DRAM architecture," in HPCA, 2013. 12
  10. Emerging Memory Technologies - Further Down The Hierarchy Qureshi et

    al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. 13
  11. NVMs as Storage Class Memories - Bottlenecks: New & Old

    Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, 2013. 14
  12. DBs Execution Cycles: Useful Computation vs. Stall Cycles R. Panda,

    C. Erb, M. LeBeane, J. H. Ryoo and L. K. John, "Performance Characterization of Modern Databases on Out-of-Order CPUs," Computer Architecture and High Performance Computing (SBAC-PAD), 2015 27th International Symposium on, Florianopolis, 2015, pp. 114-121. 15
  13. System Calls - Performance Impact Livio Soares and Michael Stumm.

    2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 16
  14. System Calls, Interrupts, and Asynchronous I/O Jisoo Yang, Dave B.

    Minturn, and Frank Hady. 2012. "When poll is better than interrupt." In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA. 17
  15. System Calls as CPU Exceptions Craig B. Zilles, Joel S.

    Emer, and Gurindar S. Sohi. 1999. "The use of multithreading for exception handling." In Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture (MICRO 32). IEEE Computer Society, Washington, DC, USA, 219-229. 18
  16. Pollution & Context Switch Misses Replaced Miss (D) & Reordered

    Miss (C) F. Liu, F. Guo, Y. Solihin, S. Kim and A. Eker, "Characterizing and modeling the behavior of context switch misses", Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008. 19
  17. Beyond Mode Switch Time: Footprint & Pollution Livio Soares and

    Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 20
  18. Beyond Mode Switch Time: Direct & Indirect Costs Livio Soares

    and Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 21
  19. Feature Scaling Trends Lee, Yunsup, "Decoupled Vector-Fetch Architecture with a

    Scalarizing Compiler," EECS Department, University of California, Berkeley. 2016. http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-82.html 22
  20. Process-Architecture-Optimization Intel's Annual Report on Form 10-K for the fiscal

    year ended December 26, 2015, filed with the SEC on February 12, 2016. https://www.sec.gov/Archives/edgar/data/50863/000005086316000105/a10kdocument12262015q4.htm 23
  21. Make it fast Butler W. Lampson. 1983. "Hints for computer

    system design." In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48. 24
  22. Performance: The Early Days A. Greenbaum and T. Chartier. "Numerical

    Methods: Design, analysis, and computer implementation of algorithms." 2010. Course Notes for Short Course on Numerical Analysis. 26
  23. Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), "On

    the computational complexity of algorithms", Transactions of the American Mathematical Society 117: 285–306. 27
  24. Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), "On

    the computational complexity of algorithms", Transactions of the American Mathematical Society 117: 285–306. 28
  25. Analysis of Algorithms - Scientific Method Robert Sedgewick and Kevin

    Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 30
  26. Analysis of Algorithms - Problem Size N vs. Running Time

    T(N) Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 31
  27. Analysis of Algorithms - Tilde Notation & Tilde Approximations Robert

    Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 32
  28. Analysis of Algorithms - Doubling Ratio Experiments Robert Sedgewick and

    Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 33
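
(Not part of the deck:) a minimal C++ sketch of a doubling-ratio experiment in the spirit of Sedgewick and Wayne -- the workload work(), the sizes, and the timing loop are purely illustrative. If T(N) ~ a*N^b, the ratio T(2N)/T(N) approaches 2^b, so lg of the ratio estimates the exponent b.

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical workload with expected exponent b ~ 1 (linear in N).
double work(std::size_t n) {
    std::vector<double> x(n, 1.0), y(n, 2.0);
    return std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
}

double time_once(std::size_t n) {
    const auto t0 = std::chrono::steady_clock::now();
    volatile double sink = work(n);  // keep the call from being optimized away
    (void)sink;
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    double previous = time_once(std::size_t{1} << 16);
    for (std::size_t n = std::size_t{1} << 17; n <= (std::size_t{1} << 24); n <<= 1) {
        const double current = time_once(n);
        const double ratio = current / previous;
        std::printf("N = %zu  T(N) = %g s  ratio = %.2f  lg(ratio) = %.2f\n",
                    n, current, ratio, std::log2(ratio));
        previous = current;
    }
}
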
  29. Find Example C++ Code I #include <algorithm> #include <chrono> #include

    <cstddef> #include <cstdint> #include <cstdio> #include <iterator> #include <random> #include <set> #include <vector> #include <boost/container/flat_set.hpp> #include <EASTL/vector_set.h> // EASTL // https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project void* operator new[](size_t size, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } 34
  30. Find Example C++ Code II void* operator new[](size_t size, size_t

    alignment, size_t alignmentOffset, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } using T = std::uint32_t; std::vector<T> odd_numbers(std::size_t count) { std::vector<T> result; result.reserve(count); for (std::size_t i = 0; i != count; i++) result.push_back(2 * i + 1); return result; } 35
  31. Find Example C++ Code III template <typename container_type> void ctor_and_find(const

    char * type_name, const std::vector<T> & v, std::size_t q) { printf("%s\n", type_name); std::mt19937 prng(1); const std::size_t n = v.size(); std::uniform_int_distribution<T> uniform(0, 2 * n + 2); printf("ctor\t"); auto time_start = std::chrono::steady_clock::now(); const container_type s(begin(v), end(v)); auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; printf("duration: %g \n", duration.count()); printf("search\t"); time_start = std::chrono::steady_clock::now(); T sum = 0; for (std::size_t i = 0; i != q; ++i) { 36
  32. Find Example C++ Code IV const auto it = s.find(uniform(prng));

    sum += (it != end(s)) ? *it : 0; } time_end = std::chrono::steady_clock::now(); duration = time_end - time_start; printf("duration: %g \t", duration.count()); printf("sum: %zu \n\n", sum); } void ctor_and_find(const char * type_name, const std::vector<T> & v_src, std::size_t q) { printf("%s\n", type_name); std::mt19937 prng(1); const std::size_t n = v_src.size(); std::uniform_int_distribution<T> uniform(0, 2*n + 2); printf("prep\t"); auto time_start = std::chrono::steady_clock::now(); auto v = v_src; 37
  33. Find Example C++ Code V std::sort(begin(v), end(v)); auto time_end =

    std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; printf("duration: %g \n", duration.count()); printf("search\t"); time_start = std::chrono::steady_clock::now(); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto k = uniform(prng); const auto it = std::lower_bound(begin(v), end(v), k); sum += (it != end(v)) ? (*it == k ? k : 0) : 0; } time_end = std::chrono::steady_clock::now(); duration = time_end - time_start; printf("duration: %g \t", duration.count()); printf("sum: %zu \n\n", sum); } 38
  34. Find Example C++ Code VI int main(int argc, char *

    argv[]) { // `n`: elements count (size) const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100; printf("size: %zu \n", n); // `q`: queries count const std::size_t q = (argc > 2) ? std::atoll(argv[2]) : 10; printf("queries: %zu \n", q); const auto v = odd_numbers(n); printf("\n"); ctor_and_find<std::set<T>>("std::set", v, q); ctor_and_find("std::vector: copy & sort", v, q); ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q); ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q); } 39
  35. Find Example - Benchmark (Nonius) Code I #include <algorithm> #include

    <cstddef> #include <cstdint> #include <cstdio> #include <iterator> #include <random> #include <set> #include <vector> #include <boost/container/flat_set.hpp> #include <EASTL/vector_set.h> #include <nonius/nonius.h++> #include <nonius/main.h++> NONIUS_PARAM(size, std::size_t{100u}) NONIUS_PARAM(queries, std::size_t{10u}) 40
  36. Find Example - Benchmark (Nonius) Code II // EASTL //

    https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project void* operator new[](size_t size, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } void* operator new[](size_t size, size_t alignment, size_t alignmentOffset, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } using T = std::uint32_t; std::vector<T> odd_numbers(std::size_t count) { std::vector<T> result; result.reserve(count); for (std::size_t i = 0; i != count; i++) result.push_back(2 * i + 1); return result; } 41
  37. Find Example - Benchmark (Nonius) Code III template <typename container_type>

    T ctor_and_find(const char * type_name, const std::vector<T> & v, std::size_t q) { std::mt19937 prng(1); const std::size_t n = v.size(); std::uniform_int_distribution<T> uniform(0, 2 * n + 2); const container_type s(begin(v), end(v)); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto it = s.find(uniform(prng)); sum += (it != end(s)) ? *it : 0; } return sum; } 42
  38. Find Example - Benchmark (Nonius) Code IV T ctor_and_find(const char

    * type_name, const std::vector<T> & v_src, std::size_t q) { std::mt19937 prng(1); const std::size_t n = v_src.size(); std::uniform_int_distribution<T> uniform(0, 2*n + 2); auto v = v_src; std::sort(begin(v), end(v)); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto k = uniform(prng); const auto it = std::lower_bound(begin(v), end(v), k); sum += (it != end(v)) ? (*it == k ? k : 0) : 0; } return sum; } 43
  39. Find Example - Benchmark (Nonius) Code V NONIUS_BENCHMARK("std::set", [](nonius::chronometer meter)

    { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<std::set<T>>("std::set", v, q); }); }); NONIUS_BENCHMARK("std::vector: copy & sort", [](nonius::chronometer meter) { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find("std::vector: copy & sort", v, q); }); }); 44
  40. Find Example - Benchmark (Nonius) Code VI NONIUS_BENCHMARK("boost::container::flat_set", [](nonius::chronometer meter) {

    const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q); }); }); NONIUS_BENCHMARK("eastl::vector_set", [](nonius::chronometer meter) { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q); }); }); int main(int argc, char * argv[]) { nonius::main(argc, argv); } 45
  41. Find Example - Benchmark (Nonius) Code I Nonius: statistics-powered micro-benchmarking

    framework: https://nonius.io/ https://github.com/libnonius/nonius Running: BNSIZE=10000; BNQUERIES=1000 ./find --param=size:$BNSIZE --param=queries:$BNQUERIES > results.size=$BNSIZE.queries=$BNQUERIES.txt ./find --param=size:$BNSIZE --param=queries:$BNQUERIES --reporter=html --output=results.size=$BNSIZE.queries=$BNQUERIES.html 46
  42. Asymptotic growth & "random access machines"? Tomasz Jurkiewicz and Kurt

    Mehlhorn. 2015. "On a Model of Virtual Address Translation." J. Exp. Algorithmics 19. http://arxiv.org/abs/1212.0703 & https://people.mpi-inf.mpg.de/~mehlhorn/ftp/KMvat.pdf 50
  43. Asymptotic growth & "random access machines"? Asymptotic - growing problem

    size • for large data need to take into account the costs of actually bringing it in • communication complexity vs. computation complexity • including overlapping computation-communication latencies 51
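
A back-of-the-envelope illustration of the first bullet (numbers purely hypothetical): summing 10^9 doubles means streaming 8 GB through the core; at a sustained memory bandwidth of roughly 20 GB/s that takes about 0.4 s no matter how cheap the additions are, whereas the 10^9 additions themselves need only ~0.08 s at 4 adds/cycle and 3 GHz -- the operation count alone says little about the running time once the data no longer fits in cache.
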
  44. "Operation"? Jack Dongarra. 2016. "With Extreme Scale Computing the Rules

    Have Changed." In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). 52
  45. "Operation"? Jack Dongarra. 2016. "With Extreme Scale Computing the Rules

    Have Changed." In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). 53
  46. Complexity - constants, microarchitecture? "Array Layouts for Comparison-Based Searching" Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • "With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search 54
  47. Complexity - constants, microarchitecture? "Array Layouts for Comparison-Based Searching" Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • "With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search • (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees). 54
  48. Complexity - constants, microarchitecture? "Array Layouts for Comparison-Based Searching" Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • "With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search • (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees). • It was only through careful and controlled experimentation with different implementations of each of the search algorithms that we are able to understand how the interactions between processor features such as pipelining, prefetching, speculative execution, and conditional moves affect the running times of the search algorithms." 54
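
To make the idea of an alternative array layout concrete, here is a minimal sketch (mine, not the paper's code) of the Eytzinger/BFS layout studied by Khuong and Morin: the sorted array is rearranged so that element i's children sit at 2i and 2i+1, keeping the first levels of the implicit search tree packed into a handful of cache lines. The paper's implementations additionally use branch-free comparisons and explicit prefetching; T is std::uint32_t to match the element type used elsewhere in these slides.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

using T = std::uint32_t;

// In-order construction: place the sorted elements into BFS (Eytzinger) order.
// eyt has size sorted.size() + 1; eyt[0] is unused (1-based indexing).
std::size_t build_eytzinger(const std::vector<T>& sorted, std::vector<T>& eyt,
                            std::size_t k, std::size_t i) {
    if (i <= sorted.size()) {
        k = build_eytzinger(sorted, eyt, k, 2 * i);      // left subtree
        eyt[i] = sorted[k++];                            // this node
        k = build_eytzinger(sorted, eyt, k, 2 * i + 1);  // right subtree
    }
    return k;
}

// Lower bound over the BFS layout: returns the (1-based) index of the
// smallest element >= key, or 0 if every element is smaller.
std::size_t eytzinger_lower_bound(const std::vector<T>& eyt, T key) {
    const std::size_t n = eyt.size() - 1;
    std::size_t i = 1, result = 0;
    while (i <= n) {
        if (eyt[i] >= key) { result = i; i = 2 * i; }  // go left, remember candidate
        else               { i = 2 * i + 1; }          // go right
    }
    return result;
}

int main() {
    const std::vector<T> sorted{1, 3, 5, 7, 9, 11, 13};
    std::vector<T> eyt(sorted.size() + 1);
    build_eytzinger(sorted, eyt, 0, 1);
    const std::size_t idx = eytzinger_lower_bound(eyt, 6);
    if (idx) std::printf("lower_bound(6) = %u\n", eyt[idx]);  // prints 7
}
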
  49. Reasoning about Performance: The Scientific Method Requires - enabled by

    - the knowledge of microarchitectural details. Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi, Chapter 2 "Methods" from "Readings in Computer Architecture," Morgan Kaufmann, 2000. Prefetching benefits evaluation: Disable/enable prefetchers using likwid-features: https://github.com/RRZE-HPC/likwid/wiki/likwid-features Example: https://gist.github.com/MattPD/06e293fb935eaf67ee9c301e70db6975 55
  50. Pervasive CPU Parallelism pipeline-level parallelism (PLP) instruction-level parallelism (ILP) memory-level

    parallelism (MLP) data-level parallelism (DLP) thread-level parallelism (TLP) 57
  51. Pipelining & Temporal Parallelism D. Sima, "Decisive aspects in the

    evolution of microprocessors", Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 58
  52. Pipelining: Base N. P. Jouppi and D. W. Wall. 1989.

    "Available instruction-level parallelism for superscalar and superpipelined machines." In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 59
  53. Pipelining: Superscalar N. P. Jouppi and D. W. Wall. 1989.

    "Available instruction-level parallelism for superscalar and superpipelined machines." In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 60
  54. The Cache Liptay, J. S. (1968) "Structural Aspects of the

    System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 61
  55. The Cache: Processor-Memory Performance Gap Liptay, J. S. (1968) "Structural

    Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 62
  56. The Cache: Assumptions & Effectiveness Liptay, J. S. (1968) "Structural

    Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 63
  57. Out-of-Order Execution: Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting

    Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 65
  58. Out-of-Order Execution: Reservation Stations R.M. Tomasulo, “An Efficient Algorithm for

    Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 66
  59. Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm

    for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 67
  60. Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm

    for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 68
  61. Out-of-Order Execution of Simple Micro-Operations Y.N. Patt, W.M. Hwu, and

    M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 69
  62. Out-of-Order Execution: Restricted Dataflow Y.N. Patt, W.M. Hwu, and M.

    Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 70
  63. Out-of-Order Execution: Results Buffer Y.N. Patt, W.M. Hwu, and M.

    Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 71
  64. Pipelining & Precise Exceptions: Reorder Buffer (ROB) J.E. Smith and

    A.R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. 12th Ann. IEEE/ACM Int’l Symp. Computer Architecture, 1985, pp. 36–44. 72
  65. Execution: Superscalar & Out-Of-Order J.E. Smith and G.S. Sohi, "The

    Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 73
  66. Superscalar CPU Organization J.E. Smith and G.S. Sohi, "The Microarchitecture

    of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 74
  67. Superscalar CPU: ROB J.E. Smith and G.S. Sohi, "The Microarchitecture

    of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 75
  68. Computer Architecture: A Science of Tradeoffs "My tongue in cheek

    phrase to emphasize the importance of tradeoffs to the discipline of computer architecture. Clearly, computer architecture is more art than science. Science, we like to think, involves a coherent body of knowledge, even though we have yet to figure out all the connections. Art, on the other hand, is the result of individual expressions of the various artists. Since each computer architecture is the result of the individual(s) who specified it, there is no such completely coherent structure. So, I opined if computer architecture is a science at all, it is a science of tradeoffs. In class, we keep coming up with design choices that involve tradeoffs. In my view, "tradeoffs" is at the heart of computer architecture." — Yale N. Patt 76
  69. Design Points: Dictated the Application Space The design of a

    microprocessor is about making relevant tradeoffs. We refer to the set of considerations, along with the relevant importance of each, as the “design point” for the microprocessor—that is, the characteristics that are most important to the use of the microprocessor, such that one is willing to be less concerned about other characteristics. In each case, it is usually the problem we are addressing . . . which dictates the design point for the microprocessor, and the resulting tradeoffs that must be made. Patt, Y., & Cockrell, E. (2001). "Requirements, bottlenecks, and good fortune: Agents for microprocessor evolution." Proceedings of the IEEE, 89(11), 1553-1559. 77
  70. A Science of Tradeoffs Software Performance Optimization - Analogous! The

    multiplicity of tradeoffs: • Multidimensional • Multiple levels • Costs and benefits 78
  71. Trade-offs - Latency & Bandwidth I Intel(R) Memory Latency Checker

    - v3.1a Measuring idle latencies (in ns)... Memory node Socket 0 0 60.4 Measuring Peak Memory Bandwidths for the system Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec) Using traffic with the following read-write ratios ALL Reads : 24152.0 3:1 Reads-Writes : 22313.2 2:1 Reads-Writes : 22050.5 1:1 Reads-Writes : 21130.4 Stream-triad like: 21559.4 79
  72. Trade-offs - Latency & Bandwidth II Measuring Memory Bandwidths between

    nodes within system Using Read-only traffic type Memory node Socket 0 0 24155.0 Measuring Loaded Latencies for the system Using Read-only traffic type Inject Latency Bandwidth Delay (ns) MB/sec ========================== 00000 122.27 24109.6 00002 121.99 24082.7 00008 120.60 23952.1 00015 119.28 23837.6 00050 70.87 17408.7 00100 64.59 12496.6 80
  73. Trade-offs - Latency & Bandwidth III Inject Latency Bandwidth Delay

    (ns) MB/sec ========================== 00200 61.76 8129.1 00300 60.75 6194.8 00400 60.63 5085.6 00500 60.12 4377.0 00700 60.51 3505.2 01000 60.60 2812.6 01300 60.66 2425.3 01700 60.51 2117.0 02500 60.36 1789.5 03500 60.33 1585.4 05000 60.29 1430.9 09000 60.31 1267.9 20000 60.32 1154.7 81
  74. Trade-offs - Latency & Size I Intel i3-2120 (Sandy Bridge),

    3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T. http://www.7-cpu.com/cpu/SandyBridge.html Size Latency Increase Description 32 K 4 64 K 8 4 + 8 (L2) 128 K 10 2 256 K 11 1 512 K 20 9 + 16 (L3) 1 M 24 4 2 M 26 2 4 M 27 + 18 ns 1 + 18 ns + 56 ns (RAM) 8 M 28 + 38 ns 1 + 20 ns 16 M 28 + 47 ns 9 ns 32 M 28 + 52 ns 5 ns 64 M 28 + 54 ns 2 ns 128 M 36 + 55 ns 8 + 1 ns + 16 (TLB miss) 82
  75. Trade-offs - Latency & Size II Size Latency Increase Description

    256 M 40 + 56 ns 4 + 1 ns 512 M 42 + 56 ns 2 1024 M 43 + 56 ns 1 2048 M 44 + 56 ns 1 4096 M 44 + 56 ns 0 8192 M 53 + 56 ns 9 + 18 (PDPTE cache miss) Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T. http://www.7-cpu.com/cpu/SandyBridge.html 83
  76. Trade-offs - Least Squares Golub & Van Loan (2013) "Matrix

    Computations" Trade-offs: FLOPs (FLoating-point OPerations) vs. Applicability / Numerical Stability / Speed / Accuracy Example: Catalogue of dense decompositions: http://eigen.tuxfamily.org/dox/group__TopicLinearAlgebraDecompositions.html 84
  77. Trade-offs - Multidimensional - Numerical Optimization Ben Recht, Feng Niu,

    Christopher Ré, Stephen Wright. "Lock-Free Approaches to Parallelizing Stochastic Gradient Descent" OPT 2011: 4th International Workshop on Optimization for Machine Learning http://opt.kyb.tuebingen.mpg.de/slides/opt2011-recht.pdf 85
  78. Trade-offs - Multiple levels - Numerical Optimization Gradient computation -

    accuracy vs. function evaluations f : R^d → R^N • Finite differencing: • forward-difference: O(√ε_M) error, d · O(Cost(f)) evaluations • central-difference: O(ε_M^(2/3)) error, 2d · O(Cost(f)) evaluations, with the machine epsilon ε_M := inf{ε > 0 : 1.0 + ε ≠ 1.0} • Algorithmic differentiation (AD): precision - as in hand-coded analytical gradient • rough forward-mode cost d · O(Cost(f)) • rough reverse-mode cost N · O(Cost(f)) 86
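
A hedged C++ sketch (not from the slides) making the evaluation counts above concrete for a scalar-valued f : R^d -> R: forward differencing reuses f(x) and needs d extra evaluations, central differencing needs 2d; the step sizes follow the usual sqrt(eps_M) and cbrt(eps_M) rules of thumb matching the error orders quoted on the slide.

#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

using Vec = std::vector<double>;

// Forward difference: (f(x + h*e_i) - f(x)) / h, with h ~ sqrt(eps_M).
Vec grad_forward(const std::function<double(const Vec&)>& f, Vec x) {
    const double h = std::sqrt(std::numeric_limits<double>::epsilon());
    const double fx = f(x);            // one shared evaluation
    Vec g(x.size());
    for (std::size_t i = 0; i != x.size(); ++i) {
        const double xi = x[i];
        x[i] = xi + h;
        g[i] = (f(x) - fx) / h;        // one extra evaluation per coordinate
        x[i] = xi;
    }
    return g;
}

// Central difference: (f(x + h*e_i) - f(x - h*e_i)) / (2h), with h ~ cbrt(eps_M).
Vec grad_central(const std::function<double(const Vec&)>& f, Vec x) {
    const double h = std::cbrt(std::numeric_limits<double>::epsilon());
    Vec g(x.size());
    for (std::size_t i = 0; i != x.size(); ++i) {
        const double xi = x[i];
        x[i] = xi + h; const double fp = f(x);
        x[i] = xi - h; const double fm = f(x);   // two evaluations per coordinate
        x[i] = xi;
        g[i] = (fp - fm) / (2 * h);
    }
    return g;
}
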
  79. Trade-offs: Costs and Benefits Gabriel, Richard P. (1985). "Performance and

    Evaluation of Lisp Systems." Cambridge, Mass: MIT Press; Computer Systems Series. 87
  80. Costs and Benefits: Implications • Important to know what to

    focus on • Optimize the optimization: so that it doesn't always take hours or days or weeks or months... 88
  81. Superscalar CPU Model Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and

    James E. Smith. 2009. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2, Article 3. 89
  82. Instruction Level Parallelism & Loop Unrolling - Code I #include

    <cstddef> #include <cstdint> #include <cstdlib> #include <iostream> #include <vector> #include <boost/timer/timer.hpp> 90
  83. Instruction Level Parallelism & Loop Unrolling - Code II using

    T = double; T sum_1(const std::vector<T> & input) { T sum = 0.0; for (std::size_t i = 0, n = input.size(); i != n; ++i) sum += input[i]; return sum; } T sum_2(const std::vector<T> & input) { T sum1 = 0.0, sum2 = 0.0; for (std::size_t i = 0, n = input.size(); i != n; i += 2) { sum1 += input[i]; sum2 += input[i + 1]; } return sum1 + sum2; } 91
  84. Instruction Level Parallelism & Loop Unrolling - Code III int

    main(int argc, char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 10000000; const std::size_t f = (argc > 2) ? std::atoll(argv[2]) : 1; std::cout << "n = " << n << '\n'; // iterations count std::cout << "f = " << f << '\n'; // unroll factor const std::vector<T> a(n, T(1)); boost::timer::auto_cpu_timer timer; const T sum = (f == 1) ? sum_1(a) : (f == 2) ? sum_2(a) : 0; std::cout << sum << '\n'; } 92
  85. Instruction Level Parallelism & Loop Unrolling - Results make vector_sums

    CXXFLAGS="-std=c++14 -O2 -march=native" LDLIBS=-lboost_timer $ ./vector_sums 1000000000 2 n = 1000000000 f = 2 1e+09 0.466293s wall, 0.460000s user + 0.000000s system = 0.460000s CPU (98.7%) $ ./vector_sums 1000000000 1 n = 1000000000 f = 1 1e+09 0.841269s wall, 0.840000s user + 0.010000s system = 0.850000s CPU (101.0%) 93
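
The slides stop at an unroll factor of 2; as a hypothetical extension (sum_4 is not in the deck), the same transformation generalizes to four independent accumulators, further shortening the loop-carried dependence chain on the additions -- until load throughput, register pressure, or vectorization makes further unrolling moot. A sketch, assuming input.size() is a multiple of 4, with T and std::vector as in the earlier listing:

T sum_4(const std::vector<T> & input) {
    T sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; i += 4) {
        sum1 += input[i];
        sum2 += input[i + 1];
        sum3 += input[i + 2];
        sum4 += input[i + 3];
    }
    return (sum1 + sum2) + (sum3 + sum4);
}
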
  86. perf Results - sum_1 Performance counter stats for './vector_sums 1000000000

    1': 1675.812457 task-clock (msec) # 0.850 CPUs utilized 34 context-switches # 0.020 K/sec 5 cpu-migrations # 0.003 K/sec 8,953 page-faults # 0.005 M/sec 5,760,418,457 cycles # 3.437 GHz 3,456,046,515 stalled-cycles-frontend # 60.00% frontend cycles id 8,225,763,566 instructions # 1.43 insns per cycle # 0.42 stalled cycles per 2,050,710,005 branches # 1223.711 M/sec 104,331 branch-misses # 0.01% of all branches 1.970909249 seconds time elapsed 95
  87. perf Results - sum_2 Performance counter stats for './vector_sums 1000000000

    2': 1283.910371 task-clock (msec) # 0.835 CPUs utilized 38 context-switches # 0.030 K/sec 3 cpu-migrations # 0.002 K/sec 9,466 page-faults # 0.007 M/sec 4,458,594,733 cycles # 3.473 GHz 2,149,690,303 stalled-cycles-frontend # 48.21% frontend cycles id 6,734,925,029 instructions # 1.51 insns per cycle # 0.32 stalled cycles per 1,552,029,608 branches # 1208.830 M/sec 119,358 branch-misses # 0.01% of all branches 1.537971058 seconds time elapsed 96
  88. Intel Architecture Code Analyzer (IACA) #include <iacaMarks.h> T sum_2(const std::vector<T>

    & input) { T sum1 = 0.0, sum2 = 0.0; for (std::size_t i = 0, n = input.size(); i != n; i += 2) { IACA_START sum1 += input[i]; sum2 += input[i + 1]; } IACA_END return sum1 + sum2; } $ g++ -std=c++14 -O2 -march=native vector_sums_2i.cpp -o vector_sums_2i $ iaca -64 -arch IVB -graph ./vector_sums_2i • https://software.intel.com/en-us/articles/intel-architecture-code-analyzer • https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it • http://kylehegeman.com/blog/2013/12/28/introduction-to-iaca/ 101
  89. IACA Results - sum_1 $ iaca -64 -arch IVB -graph

    ./vector_sums_1i Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./vector_sums_1i Binary Format - 64Bit Architecture - IVB Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 3.00 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: ------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | ------------------------------------------------------------------------- | Cycles | 1.0 0.0 | 1.0 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.0 | ------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | | 1.0 1.0 | | | | | mov rdx, qword ptr [rdi] | 2 | | 1.0 | | 1.0 1.0 | | | CP | vaddsd xmm0, xmm0, qword ptr [rdx+rax*8] | 1 | 1.0 | | | | | | | add rax, 0x1 | 1 | | | | | | 1.0 | | cmp rax, rcx | 0F | | | | | | | | jnz 0xffffffffffffffe7 Total Num Of Uops: 5 102
  90. IACA Results - sum_2 $ iaca -64 -arch IVB -graph

    ./vector_sums_2i Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./vector_sums_2i Binary Format - 64Bit Architecture - IVB Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 6.00 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: ------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | ------------------------------------------------------------------------- | Cycles | 1.5 0.0 | 3.0 | 1.5 1.5 | 1.5 1.5 | 0.0 | 1.5 | ------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | mov rcx, qword ptr [rdi] | 2 | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | CP | vaddsd xmm0, xmm0, qword ptr [rcx+rax*8] | 1 | 1.0 | | | | | | | add rax, 0x2 | 2 | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | vaddsd xmm1, xmm1, qword ptr [rcx+rdx*1] | 1 | 0.5 | | | | | 0.5 | | add rdx, 0x10 | 1 | | | | | | 1.0 | | cmp rax, rsi | 0F | | | | | | | | jnz 0xffffffffffffffde | 1 | | 1.0 | | | | | CP | vaddsd xmm0, xmm0, xmm1 Total Num Of Uops: 9 103
  91. ILP & Data (In)dependence G. S. Tjaden and M. J.

    Flynn, ‘‘Detection and Parallel Execution of Independent Instructions,’’ IEEE Transactions on Computers, vol. C-19, pp. 889-895, October 1970. 107
  92. ILP vs. Dependencies D. W. Wall, “Limits of instruction-level parallelism,”

    Digital Western Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 108
  93. ILP, Criticality & Latency Hiding D. W. Wall, “Limits of

    instruction-level parallelism,” Digital Western Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 109
  94. Empty Issue Slots: Horizontal Waste & Vertical Waste D. M.

    Tullsen, S. J. Eggers and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Proceedings, 22nd Annual International Symposium on Computer Architecture, 1995. 110
  95. Wasted Slots: Causes D. M. Tullsen, S. J. Eggers and

    H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, Santa Margherita Ligure, Italy, 1995, pp. 392-403. 111
  96. Wasted Slots: Miss Events Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis,

    and James E. Smith. 2006. "A performance counter architecture for computing accurate CPI components." SIGOPS Oper. Syst. Rev. 40, 5 (October 2006), 175-184. 112
  97. likwid Results - sum_1: 489 Scalar MUOPS/s $ likwid-perfctr -C

    S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 1 1e+09 1.090122s wall, 0.880000s user + 0.000000s system = 0.880000s CPU (80.7%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 8002493499 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 4285189526 | | CPU_CLK_UNHALTED_REF | FIXC2 | 3258346806 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 1000155741 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 0 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 2.0456 | | Runtime unhalted [s] | 1.6536 | | Clock [MHz] | 3408.2011 | | CPI | 0.5355 | | MFLOP/s | 488.9303 | | AVX MFLOP/s | 0 | | Packed MUOPS/s | 0 | | Scalar MUOPS/s | 488.9303 | +----------------------+-----------+ 114
  98. likwid Results - sum_2: 595 Scalar MUOPS/s $ likwid-perfctr -C

    S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 2 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 2 1e+09 0.620421s wall, 0.470000s user + 0.000000s system = 0.470000s CPU (75.8%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 6502566958 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2948446599 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2223894218 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 1000328727 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 0 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 1.6809 | | Runtime unhalted [s] | 1.1377 | | Clock [MHz] | 3435.8987 | | CPI | 0.4534 | | MFLOP/s | 595.1079 | | AVX MFLOP/s | 0 | | Packed MUOPS/s | 0 | | Scalar MUOPS/s | 595.1079 | +----------------------+-----------+ 115
  99. likwid Results: sum_vectorized: 676 AVX MFLOP/s g++ -std=c++14 -O2 -ftree-vectorize

    -ffast-math -march=native -lboost_timer vector_sums.cpp -o vector_sums_vf $ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums_vf 1000000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 1 1e+09 0.561288s wall, 0.390000s user + 0.000000s system = 0.390000s CPU (69.5%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 3002491149 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2709364345 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2043804906 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 91 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 260258099 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 1.5390 | | Runtime unhalted [s] | 1.0454 | | Clock [MHz] | 3435.5297 | | CPI | 0.9024 | | MFLOP/s | 676.4420 | | AVX MFLOP/s | 676.4420 | | Packed MUOPS/s | 169.1105 | | Scalar MUOPS/s | 0.0001 | +----------------------+-----------+ 116
  100. Performance: CPI Steven K. Przybylski, "Cache and Memory Hierarchy Design

    – A Performance-Directed Approach," San Francisco, Morgan-Kaufmann, 1990. 117
  101. Performance: [YMMV]PI - Power Grochowski, E., Ronen, R., Shen, J.,

    & Wang, H. (2004). "Best of Both Latency and Throughput." Proceedings of the IEEE International Conference on Computer Design. 118
  102. Performance: [YMMV]PI - Graphs Scott Beamer, Krste Asanović, and David

    A. Patterson. "GAIL: The Graph Algorithm Iron Law." Workshop on Irregular Applications: Architectures and Algorithms (IAˆ3), at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015. 119
  103. Performance: [YMMV]PI - Packets packet_processing_time = seconds/packet = instructions/packet *

    clock_cycles/instruction * seconds/clock_cycle = clock_cycles/packet * seconds/clock_cycle = CPP / core_frequency, where CPP = cycles per packet http://blogs.cisco.com/sp/a-bigger-helping-of-internet-please 120
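
A quick worked example with purely illustrative numbers: at CPP = 20 cycles/packet on a 2 GHz core, packet_processing_time = 20 / (2 × 10^9) s = 10 ns per packet, i.e., roughly 100 million packets per second per core -- halving CPP or doubling the clock each buys the same factor of two.
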
  104. Performance: separable components of a CPI CPI = (Infinite-cache CPI)

    + finite-cache effect (FCE) Infinite-cache CPI = execute busy (EBusy) + execute idle (EIdle) FCE = (cycles per miss) × (misses per instruction) = (miss penalty) × (miss rate) P. G. Emma. "Understanding some simple processor-performance limits." IBM Journal of Research and Development, 41(3):215–232, May 1997. 121
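
A hedged numeric illustration of the decomposition (values hypothetical): with an infinite-cache CPI of 0.5, a miss rate of 0.01 misses/instruction, and a miss penalty of 200 cycles, FCE = 200 × 0.01 = 2.0, so CPI = 0.5 + 2.0 = 2.5 -- the finite-cache effect dominates even though only 1% of instructions miss.
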
  105. Pipelining & Branches P. Emma and E. Davidson, "Characterization of

    Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Trans. Computers C-36, No. 7, 859-875 (July 1987) 122
  106. Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and

    James E. Smith. 2009. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2, Article 3. 123
  107. Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and

    James E. Smith, "A Performance Counter Architecture for Computing Accurate CPI Components", ASPLOS 2006, pp. 175-184. 124
  108. Branch (Mis)Prediction Example I #include <cmath> #include <cstddef> #include <cstdlib>

    #include <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> double sum1(const std::vector<double> & x, const std::vector<bool> & which) { double sum = 0.0; for (std::size_t i = 0, n = which.size(); i != n; ++i) { sum += which[i] ? std::cos(x[i]) : std::sin(x[i]); } return sum; } 125
  109. Branch (Mis)Prediction Example II double sum2(const std::vector<double> & x, const

    std::vector<bool> & which) { double sum = 0.0; for (std::size_t i = 0, n = which.size(); i != n; ++i) { sum += which[i] ? std::sin(x[i]) : std::cos(x[i]); } return sum; } std::vector<bool> inclusion_random(std::size_t n, double p) { std::vector<bool> which; which.reserve(n); static std::mt19937 g(1); std::bernoulli_distribution decision(p); for (std::size_t i = 0; i != n; ++i) which.push_back(decision(g)); 126
  110. Branch (Mis)Prediction Example III return which; } int main(int argc,

    char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // branch takenness / predictability type // 0: never; 1: always; 2: random const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\n'; // takenness probability // 0.0: never; 1.0: always const double p = (argc > 3) ? std::atof(argv[3]) : 0.5; std::cout << "p = " << p << '\n'; 127
  111. Branch (Mis)Prediction Example IV std::vector<bool> which; if (type == 0)

    which.resize(n, false); else if (type == 1) which.resize(n, true); else if (type == 2) which = inclusion_random(n, p); const std::vector<double> x(n, 1.1); boost::timer::auto_cpu_timer timer; std::cout << sum1(x, which) + sum2(x, which) << '\n'; } 128
  112. Timing: Branch (Mis)Prediction Example $ make BP CXXFLAGS="-std=c++14 -O3 -march=native"

    LDLIBS=-lboost_timer-mt $ ./BP 10000000 0 n = 10000000 type = 0 1.3448e+007 1.190391s wall, 1.187500s user + 0.000000s system = 1.187500s CPU (99.8%) $ ./BP 10000000 1 n = 10000000 type = 1 1.3448e+007 1.172734s wall, 1.156250s user + 0.000000s system = 1.156250s CPU (98.6%) $ ./BP 10000000 2 n = 10000000 type = 2 1.3448e+007 1.296455s wall, 1.296875s user + 0.000000s system = 1.296875s CPU (100.0%) 129
  113. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 0 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 0 1.3448e+07 0.445464s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 2495177597 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 1167613066 | | CPU_CLK_UNHALTED_REF | FIXC2 | 1167632206 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 372952380 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 14796 | +------------------------------+---------+------------+ +----------------------------+--------------+ | Metric | Core 1 | +----------------------------+--------------+ | Runtime (RDTSC) [s] | 0.4586 | | Runtime unhalted [s] | 0.4505 | | Clock [MHz] | 2591.5373 | | CPI | 0.4679 | | Branch rate | 0.1495 | | Branch misprediction rate | 5.929838e-06 | | Branch misprediction ratio | 3.967263e-05 | | Instructions per branch | 6.6903 | +----------------------------+--------------+ 130
  114. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 1 1.3448e+07 0.445354s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 2495177490 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 1167125701 | | CPU_CLK_UNHALTED_REF | FIXC2 | 1167146162 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 372952366 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 14720 | +------------------------------+---------+------------+ +----------------------------+--------------+ | Metric | Core 1 | +----------------------------+--------------+ | Runtime (RDTSC) [s] | 0.4584 | | Runtime unhalted [s] | 0.4504 | | Clock [MHz] | 2591.5345 | | CPI | 0.4678 | | Branch rate | 0.1495 | | Branch misprediction rate | 5.899380e-06 | | Branch misprediction ratio | 3.946885e-05 | | Instructions per branch | 6.6903 | +----------------------------+--------------+ 131
  115. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 2 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 2 1.3448e+07 0.509917s wall, 0.510000s user + 0.000000s system = 0.510000s CPU (100.0%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 3191479747 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2264945099 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2264967068 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 468135649 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 15326586 | +------------------------------+---------+------------+ +----------------------------+-----------+ | Metric | Core 1 | +----------------------------+-----------+ | Runtime (RDTSC) [s] | 0.8822 | | Runtime unhalted [s] | 0.8740 | | Clock [MHz] | 2591.5589 | | CPI | 0.7097 | | Branch rate | 0.1467 | | Branch misprediction rate | 0.0048 | | Branch misprediction ratio | 0.0327 | | Instructions per branch | 6.8174 | +----------------------------+-----------+ 132
  116. Perf: Branch (Mis)Prediction Example $ perf stat -e branches,branch-misses -r

    10 ./BP 10000000 0 Performance counter stats for './BP 10000000 0' (10 runs): 374,121,213 branches ( +- 0.02% ) 23,260 branch-misses # 0.01% of all branches ( +- 0.35% ) 0.460392835 seconds time elapsed ( +- 0.50% ) $ perf stat -e branches,branch-misses -r 10 ./BP 10000000 1 Performance counter stats for './BP 10000000 1' (10 runs): 374,040,282 branches ( +- 0.01% ) 23,124 branch-misses # 0.01% of all branches ( +- 0.45% ) 0.457583418 seconds time elapsed ( +- 0.04% ) $ perf stat -e branches,branch-misses -r 10 ./BP 10000000 2 Performance counter stats for './BP 10000000 2' (10 runs): 469,331,762 branches ( +- 0.01% ) 15,326,501 branch-misses # 3.27% of all branches ( +- 0.01% ) 0.884858777 seconds time elapsed ( +- 0.30% ) 133
  117. Branch Prediction & Speculative Execution D. Sima, "Decisive aspects in

    the evolution of microprocessors", Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 143
  118. Block Enlargement Fisher, J. A. (1983). "Very Long Instruction Word

    architectures and the ELI-512." Proceedings of the 10th Annual International Symposium on Computer Architecture. 144
  119. Block Enlargement Joseph A. Fisher and John J. O'Donnell, "VLIW

    Machines: Multiprocessors We Can Actually Program," CompCon 84 Proceedings, pp. 299-305, IEEE, 1984. 145
  120. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) 146
  121. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) 146
  122. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 146
  123. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 146
  124. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 • what they have in common: 146
  125. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 • what they have in common: • all predictable! 146
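
To make the two questions above concrete, a small C++ sketch (not from the deck) that computes the takenness rate and the transition rate of a recorded branch-outcome sequence; the patterns listed on the slide have unremarkable takenness rates, but their transitions repeat with a short period, which is exactly what a history-based predictor learns.

#include <cstddef>
#include <cstdio>
#include <vector>

// Print the takenness rate and the transition rate of a branch-outcome history.
void characterize(const std::vector<bool>& taken) {
    std::size_t taken_count = 0, transitions = 0;
    for (std::size_t i = 0; i != taken.size(); ++i) {
        taken_count += taken[i];
        if (i > 0 && taken[i] != taken[i - 1]) ++transitions;
    }
    std::printf("takenness rate: %.2f  transition rate: %.2f\n",
                double(taken_count) / taken.size(),
                taken.size() > 1 ? double(transitions) / (taken.size() - 1) : 0.0);
}

int main() {
    std::vector<bool> pattern;                 // the 01101101... (i % 3) pattern from the slide
    for (std::size_t i = 0; i != 24; ++i) pattern.push_back(i % 3 != 0);
    characterize(pattern);                     // ~0.67 taken, perfectly regular transitions
}
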
  126. Branch Predictability & Marker API https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#using- the-marker-api https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC g++ -Ofast

    -march=native source.cpp -o application -std=c++14 -DLIKWID_PERFMON -lpthread -llikwid likwid-perfctr -f -C 0-3 -g BRANCH -m ./application #include <likwid.h> // . . . LIKWID_MARKER_START("branch"); // branch code LIKWID_MARKER_STOP("branch"); 147
  127. Branch Entropy linear entropy: E_L(p) = 2 × min(p, 1 − p)

    intuition: miss rate proportional to the probability of the least frequent outcome 148
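
For example (illustrative): a branch taken with probability p = 0.5 has E_L = 2 × min(0.5, 0.5) = 1, the worst case for a predictor that exploits only bias, while p = 0.05 gives E_L = 0.1, consistent with a miss rate around the 5% frequency of the rarer outcome; regular patterns such as those on the earlier Branch Predictability slides can still be predicted far better than their takenness alone suggests.
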
  128. Branch Takenness Probability Sander De Pestel, Stijn Eyerman and Lieven

    Eeckhout, "Micro-Architecture Independent Branch Behavior Characterization", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 149
  129. Branch Entropy & Miss Rate: Linear Relationship Sander De Pestel,

    Stijn Eyerman and Lieven Eeckhout, "Micro-Architecture Independent Branch Behavior Characterization", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 150
  130. Branches & Expectations: Code I #include <chrono> #include <cmath> #include

    <cstdint> #include <cstdlib> #include <iostream> #include <iterator> #include <numeric> #include <random> #include <string> #include <vector> #define likely(x) (__builtin_expect(!!(x), 1)) #define unlikely(x) (__builtin_expect(!!(x), 0)) #define unpredictable(x) (__builtin_unpredictable((x))) 151
  131. Branches & Expectations: Code II using T = int; void

    f(T z, T & x, T & y) { ((z < 0) ? x : y) = 5; } void generate_never(std::size_t n, std::vector<T> & zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(10, 19); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } 152
  132. Branches & Expectations: Code III void generate_always(std::size_t n, std::vector<T> &

    zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(-19, -10); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } void generate_random(std::size_t n, std::vector<T> & zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(-5, 4); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } 153
  133. Branches & Expectations: Code IV int main(int argc, char *

    argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // takenness predictability type // 0: never; 1: always; 2: random const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\t'; std::vector<T> xs(n), ys(n), zs; if (type == 0) { std::cout << "never"; generate_never(n, zs); } else if (type == 1) { std::cout << "always"; generate_always(n, zs); } else if (type == 2) { std::cout << "random"; generate_random(n, zs); } endl(std::cout); 154
  134. Branches & Expectations: Code V const auto time_start = std::chrono::steady_clock::now();

    T sum = 0; for (std::size_t i = 0; i != n; ++i) { f(zs[i], xs[i], ys[i]); } const auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; std::cout << "duration: " << duration.count() << '\n'; endl(std::cout); std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n'; std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n'; } 155
  135. Branches & Expectations: Compiling & Timing g++ -ggdb -std=c++14 -march=native

    -Ofast ./branches.cpp -o branches_g clang++ -ggdb -std=c++14 -march=native -Ofast ./branches.cpp -o branches_c time ./branches_g 1000000 0 time ./branches_g 1000000 1 time ./branches_g 1000000 2 time ./branches_c 1000000 0 time ./branches_c 1000000 1 time ./branches_c 1000000 2 156
  136. Branches & Expectations: Timings (GCC) $ time ./branches_g 1000000 0

    n = 1000000 type = 0 never duration: 0.00082991 sum(xs): 0 sum(ys): 5000000 real 0m0.034s user 0m0.033s sys 0m0.003s $ time ./branches_g 1000000 1 n = 1000000 type = 1 always duration: 0.000839488 sum(xs): 5000000 sum(ys): 0 real 0m0.031s user 0m0.030s sys 0m0.000s $ time ./branches_g 1000000 2 n = 1000000 type = 2 random duration: 0.0052968 sum(xs): 2498105 sum(ys): 2501895 real 0m0.038s user 0m0.033s sys 0m0.003s 157
  137. Branches & Expectations: Timings (Clang) $ time ./branches_c 1000000 0

    n = 1000000 type = 0 never duration: 0.00091161 sum(xs): 0 sum(ys): 5000000 real 0m0.036s user 0m0.033s sys 0m0.000s $ time ./branches_c 1000000 1 n = 1000000 type = 1 always duration: 0.000765925 sum(xs): 5000000 sum(ys): 0 real 0m0.036s user 0m0.033s sys 0m0.000s $ time ./branches_c 1000000 2 n = 1000000 type = 2 random duration: 0.00554585 sum(xs): 2498105 sum(ys): 2501895 real 0m0.041s user 0m0.040s sys 0m0.000s 158
  138. So many performance events, so little time "So many performance

    events, so little time," Gerd Zellweger, Denny Lin, Timothy Roscoe. Proceedings of the 7th Asia-Pacific Workshop on Systems (APSys, Hong Kong, China, August 2016). 159
  139. Hierarchical cycle accounting Andrzej Nowak, David Levinthal, Willy Zwaenepoel: "Hierarchical

    cycle accounting: a new method for application performance tuning." ISPASS 2015. https://github.com/David-Levinthal/gooda 160
  140. Top-down Microarchitecture Analysis Method (TMAM) https://github.com/andikleen/pmu-tools/wiki/toplev-manual https://sites.google.com/site/analysismethods/yasin-pubs "A Top-Down Method

    for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 161
  141. TMAM: Bottlenecks "A Top-Down Method for Performance Analysis and Counters

    Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 162
  142. TMAM: Breakdown "A Top-Down Method for Performance Analysis and Counters

    Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 163
  143. TMAM: Meaning Updates: https://download.01.org/perfmon/ "A Top-Down Method for Performance Analysis

    and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 164
  144. Branches & Expectations: TMAM, Level 1 (GCC) $ ~/builds/pmu-tools/toplev.py -l1

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/ n = 1000000 type = 2 random duration: 0.00523105 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 53.92 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_g 1000000 2 165
  145. Branches & Expectations: TMAM, Level 2 (GCC) $ ~/builds/pmu-tools/toplev.py -l2

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1 n = 1000000 type = 2 random duration: 0.00528841 sum(xs): 2498105 sum(ys): 2501895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,c n = 1000000 type = 2 random duration: 0.00550316 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 53.94 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 47.54 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 16.41 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./branches_g 1000000 2 166
  146. Branches & Expectations: TMAM, Level 2, perf (GCC) perf record

    -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period -o perf.data ./branches_g 1000000 2 perf report -Mintel 167
  147. Branches & Expectations: TMAM, Level 1 (Clang) $ ~/builds/pmu-tools/toplev.py -l1

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/ n = 1000000 type = 2 random duration: 0.00555177 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 45.53 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_c 1000000 2 168
  148. Branches & Expectations: TMAM, Level 2 (Clang) $ ~/builds/pmu-tools/toplev.py -l2

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask n = 1000000 type = 2 random duration: 0.0055571 sum(xs): 2498105 sum(ys): 2501895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions n = 1000000 type = 2 random duration: 0.00556777 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 45.54 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 39.20 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 15.18 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,pe 169
  149. Branches & Expectations: TMAM, Level 2, perf (Clang) perf record

    -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./branches_c 1000000 2 perf report -Mintel 170
  150. Virtual Functions & Indirect Branches: Code I #include <chrono> #include

    <cmath> #include <cstdint> #include <cstdlib> #include <iostream> #include <iterator> #include <memory> #include <numeric> #include <random> #include <string> #include <vector> #define str(s) #s #define likely(x) (__builtin_expect(!!(x), 1)) #define unlikely(x) (__builtin_expect(!!(x), 0)) #define unpredictable(x) (__builtin_unpredictable(!!(x))) 171
  151. Virtual Functions & Indirect Branches: Code II using T =

    int; struct base { virtual T f() const { return 0; } }; struct derived_taken : base { T f() const override { return -1; } }; struct derived_untaken : base { T f() const override { return 1; } }; void f(const base & b, T & x, T & y) { ((b.f() < 0) ? x : y) = 119; } void generate_never(std::size_t n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); for (std::size_t i = 0; i != n; ++i) zs.push_back(std::make_unique<derived_untaken>()); return; 172
  152. Virtual Functions & Indirect Branches: Code III } void generate_always(std::size_t

    n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); for (std::size_t i = 0; i != n; ++i) zs.push_back(std::make_unique<derived_taken>()); return; } void generate_random(std::size_t n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); static std::mt19937 g(1); std::bernoulli_distribution z(0.5); for (std::size_t i = 0; i != n; ++i) { if (z(g)) zs.emplace_back(std::make_unique<derived_taken>()); else zs.emplace_back(std::make_unique<derived_untaken>()); 173
  153. Virtual Functions & Indirect Branches: Code IV } return; }

    int main(int argc, char * argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // takenness predictability type // 0: never; 1: always; 2: random std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\t'; std::vector<T> xs(n), ys(n); std::vector<std::unique_ptr<base>> zs; if (type == 0) { std::cout << "never"; generate_never(n, zs); } else if (type == 1) { std::cout << "always"; generate_always(n, zs); } 174
  154. Virtual Functions & Indirect Branches: Code V else if (type

    == 2) { std::cout << "random"; generate_random(n, zs); } endl(std::cout); auto time_start = std::chrono::steady_clock::now(); T sum = 0; for (std::size_t i = 0; i != n; ++i) { f(*zs[i], xs[i], ys[i]); } auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; std::cout << "duration: " << duration.count() << '\n'; endl(std::cout); std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n'; std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n'; } 175
  155. Virtual Functions & Indirect Branches: Compiling & Timing g++ -ggdb

    -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_g clang++ -ggdb -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_c time ./vbranches_g 10000000 0 time ./vbranches_g 10000000 1 time ./vbranches_g 10000000 2 time ./vbranches_c 10000000 0 time ./vbranches_c 10000000 1 time ./vbranches_c 10000000 2 176
  156. Virtual Functions & Indirect Branches: Timings (GCC) $ time ./vbranches_g

    10000000 0 n = 10000000 type = 0 never duration: 0.0338749 sum(xs): 0 sum(ys): 1190000000 real 0m0.645s user 0m0.573s sys 0m0.070s $ time ./vbranches_g 10000000 1 n = 10000000 type = 1 always duration: 0.0406144 sum(xs): 1190000000 sum(ys): 0 real 0m0.648s user 0m0.563s sys 0m0.083s $ time ./vbranches_g 10000000 2 n = 10000000 type = 2 random duration: 0.131803 sum(xs): 595154105 sum(ys): 594845895 real 0m0.956s user 0m0.863s sys 0m0.090s 177
  157. Virtual Functions & Indirect Branches: Timings (Clang) $ time ./vbranches_c 10000000 0

    n = 10000000 type = 0 never duration: 0.0314749 sum(xs): 0 sum(ys): 1190000000 real 0m0.623s user 0m0.530s sys 0m0.090s $ time ./vbranches_c 10000000 1 n = 10000000 type = 1 always duration: 0.0314727 sum(xs): 1190000000 sum(ys): 0 real 0m0.623s user 0m0.557s sys 0m0.063s $ time ./vbranches_c 10000000 2 n = 10000000 type = 2 random duration: 0.0854935 sum(xs): 595154105 sum(ys): 594845895 real 0m1.863s user 0m1.800s sys 0m0.063s 178
  158. Virtual Functions & Indirect Branches: TMAM, Level 1 (GCC) $

    ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x9c,umask=0x1/u,cycles:u}' n = 10000000 type = 2 random duration: 0.131386 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 35.96 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. BAD Bad_Speculation: 12.98 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_g 10000000 2 179
  159. Virtual Functions & Indirect Branches: TMAM, Level 2 (GCC) $

    ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0xc5,umask=0x0/u,cp n = 10000000 type = 2 random duration: 0.131247 sum(xs): 595154105 sum(ys): 594845895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,cycles:u,cpu/event=0xa3,umask=0x4,cmask=4 n = 10000000 type = 2 random duration: 0.131361 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 36.02 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 17.41 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u BAD Bad_Speculation: 12.92 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. BAD Bad_Speculation.Branch_Mispredicts: 12.75 % [100.00%] This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues.. http://www.intel.com/content/www/us/en/architecture-and- technology/64-ia-32-architectures-optimization-manual.html Sampling events: br_misp_retired.all_branches:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data . 180
  160. Virtual Functions & Indirect Branches: TMAM, Level 3 (GCC) $

    ~/builds/pmu-tools/toplev.py -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 n = 10000000 type = 2 random duration: 0.13145 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 35.96 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 17.44 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.69 % [100.00%] This metric represents cycles fraction the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path, following all sorts of miss-predicted branches. For example, branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sampling events: br_misp_retired.all_branches:u BAD Bad_Speculation: 12.97 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. BAD Bad_Speculation.Branch_Mispredicts: 12.82 % [100.00%] This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues.. http://www.intel.com/content/www/us/en/architecture-and- technology/64-ia-32-architectures-optimization-manual.html Sampling events: br_misp_retired.all_branches:u Sampling: perf record -g -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./vbranches_g 10000000 2 181
  161. Virtual Functions: TMAM, Level 3, perf (GCC) perf record -g

    -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_g 10000000 2 perf report -Mintel 182
  162. Virtual Functions & Indirect Branches: TMAM, Level 1 (Clang) $

    ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/ n = 10000000 type = 2 random duration: 0.0858722 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.66 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_c 10000000 2 183
  163. Virtual Functions & Indirect Branches: TMAM, Level 2 (Clang) $

    ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask n = 10000000 type = 2 random duration: 0.0859943 sum(xs): 595154105 sum(ys): 594845895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions n = 10000000 type = 2 random duration: 0.0861661 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.61 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 26.64 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 9.04 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,pe 184
  164. Virtual Functions & Indirect Branches: TMAM, Level 3 (Clang) ~/builds/pmu-tools/toplev.py

    -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.65 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 26.63 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u FE Frontend_Bound.Frontend_Latency.MS_Switches: 8.40 % [100.00%] This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB or MITE pipelines. Certain operations cannot be handled natively by the execution pipeline, and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID, or uncommon conditions like Floating Point Assists when dealing with Denormals. Sampling events: idq.ms_switches:u RET Retiring.Microcode_Sequencer: 9.04 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,edge=1,cmask=1,name=MS_Switches_IDQ_MS_SWITCHES,period=2000003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./vbranches_c 10000000 2 185
  165. Virtual Functions: TMAM, Level 3, perf (Clang) perf record -g

    -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_c 10000000 2 perf report -Mintel 186
  166. Compiler-Specific Built-in Functions GCC & Clang: __builtin_expect http://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

    likely & unlikely https://kernelnewbies.org/FAQ/LikelyUnlikely Clang: __builtin_unpredictable http://clang.llvm.org/docs/LanguageExtensions.html#builtin-unpredictable 189
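A minimal usage sketch (my own, reusing the macro spellings defined earlier in these slides; process and handle_error are placeholders): __builtin_expect biases the block layout toward the expected path, while Clang's __builtin_unpredictable(cond) marks a condition as having no exploitable pattern, nudging the compiler toward branchless code such as cmov.

    #include <cstdio>

    #define likely(x)   (__builtin_expect(!!(x), 1))
    #define unlikely(x) (__builtin_expect(!!(x), 0))

    static int handle_error()  { return -1; }     // assumed cold path
    static int process(int x)  { return 2 * x; }  // assumed hot path

    int f(int x) {
        if (unlikely(x < 0))       // hint: keep the hot path laid out fall-through
            return handle_error();
        return process(x);
    }

    int main(int argc, char **) { std::printf("%d\n", f(argc)); return 0; }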
  167. Branch Misprediction, Speculation, and Wrong-Path Execution J. Reineke et al.,

    “A Definition and Classification of Timing Anomalies,” Proc. Int'l Workshop Worst Case Execution Time (WCET), 2006. 190
  168. Branch Misprediction Penalty & Wrong-Path Execution Tejas S. Karkhanis and

    James E. Smith. 2004. "A First-Order Superscalar Processor Model." In Proceedings of the 31st annual international symposium on Computer architecture (ISCA '04). 191
  169. The Curse of Multiple Granularities Seshadri, V. (2016). "Simple DRAM

    and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 192
  170. Word Granularity != Cache Line Granularity Seshadri, V. (2016). "Simple

    DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 193
  171. Shortcomings of Strided Access Patterns Seshadri, V. (2016). "Simple DRAM

    and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 194
  172. Pointer Chasing Example - Linked List - C++ #include <algorithm>

    #include <forward_list> #include <iterator> bool found(const std::forward_list<int> & list, int value) { return find(begin(list), end(list), value) != end(list); } int main() { std::forward_list<int> list {11, 22, 33, 44, 55}; return found(list, 42); } 198
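For contrast (my addition, not on the slide): the same lookup over a contiguous std::vector walks consecutive addresses rather than chasing a pointer to a separately allocated node per element, which is exactly the access-pattern difference the following slides quantify.

    #include <algorithm>
    #include <vector>

    bool found(const std::vector<int> & v, int value) {
        // contiguous storage: the hardware prefetcher can simply stream the data
        return std::find(v.begin(), v.end(), value) != v.end();
    }

    int main() {
        std::vector<int> v {11, 22, 33, 44, 55};
        return found(v, 42);
    }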
  173. Pointer Chasing Example - Linked List - CFG (r2) radiff2

    -g sym.found forward_list_app forward_list_app > forward_list_found.dot xdot forward_list_found.dot dot -Tpng -o forward_list_found.png forward_list_found.dot 202
  174. Isolated & Clustered Cache Misses Miquel Moreto, Francisco J. Cazorla,

    Alex Ramirez, and Mateo Valero. 2008. "MLP-aware dynamic cache partitioning." In Proceedings of the 3rd international conference on High performance embedded architectures and compilers (HiPEAC'08). 203
  175. Cache Miss Cost & Miss Clustering Thomas R. Puzak, A.

    Hartstein, P. G. Emma, V. Srinivasan, and Jim Mitchell. 2007. "An analysis of the effects of miss clustering on the cost of a cache miss." In Proceedings of the 4th international conference on Computing frontiers (CF '07). ACM, New York, NY, USA, 3-12.204
  176. Cache Miss Penalty: Different STC due to different MLP MLP

    (memory-level parallelism) & STC (stall-time criticality) R. Das, O. Mutlu, T. Moscibroda and C. R. Das, "Application-aware prioritization mechanisms for on-chip networks," 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), New York, NY, 2009, pp. 280-291. 205
  177. Skip Lists William Pugh. 1990. "Skip lists: a probabilistic alternative

    to balanced trees." Commun. ACM 33, 6, 668-676. 206
  178. Jump Pointers S. Chen, P. B. Gibbons, and T. C.

    Mowry. “Improving Index Performance through Prefetching.” In Proc. of the 20th Annual ACM SIGMOD International Conference on Management of Data, 2001. 207
  179. Prefetching Aggressiveness: Distance & Degree Sparsh Mittal. 2016. "A Survey

    of Recent Prefetching Techniques for Processor Caches." ACM Comput. Surv. 49, 2, Article 35. 208
  180. Prefetching Timeliness Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012.

    "When Prefetching Works, When It Doesn’t, and Why." ACM Trans. Archit. Code Optim. 9, 1, Article 2. 209
  181. Prefetches Classification Huaiyu Zhu, Yong Chen, and Xian-He Sun. 2010.

    "Timing local streams: improving timeliness in data prefetching." In Proceedings of the 24th ACM International Conference on Supercomputing (ICS '10). ACM, New York, NY, USA, 169-178. 210
  182. Prefetching I #include <algorithm> #include <chrono> #include <cinttypes> #include <cstddef>

    #include <cmath> #include <cstdio> #include <cstdlib> #include <future> #include <iterator> #include <memory> #include <random> #include <vector> struct point { double x, y, z; }; using T = point; 211
  183. Prefetching II struct timing_result { double duration_initial; double duration_non_prefetched; double

    duration_degree; double sum_initial; double sum_non_prefetched; double sum_degree; }; timing_result chase(std::size_t n, bool shuffle, std::size_t d, bool prefetch) { timing_result chase_result; std::vector<std::unique_ptr<T>> v; for (std::size_t i = 0; i != n; ++i) { v.emplace_back(new point{1. * i, 2. * i, 5. * i}); } if (shuffle) { std::mt19937 g(1); 212
  184. Prefetching III std::shuffle(begin(v), end(v), g); } double sum = 0.0;

    auto time_start = std::chrono::steady_clock::now(); if (prefetch) { for (std::size_t i = 0; i != n; ++i) { if (i + d < n) __builtin_prefetch(v[i + d].get()); sum += std::exp(-v[i]->y); } } else { for (std::size_t i = 0; i != n; ++i) { sum += std::exp(-v[i]->y); } } auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; chase_result.duration_initial = duration.count(); chase_result.sum_initial = sum; 213
  185. Prefetching IV sum = 0.0; time_start = std::chrono::steady_clock::now(); for (std::size_t

    i = 0; i != n; ++i) { sum += std::exp(-v[i]->y); } time_end = std::chrono::steady_clock::now(); duration = time_end - time_start; chase_result.duration_non_prefetched = duration.count(); chase_result.sum_non_prefetched = sum; sum = 0.0; time_start = std::chrono::steady_clock::now(); for (std::size_t i = 0; i != n; ++i) { if (i + d < n) __builtin_prefetch(v[i + d].get()); if (i + 2*d < n) __builtin_prefetch(v[i + 2*d].get()); sum += std::exp(-v[i]->y); } time_end = std::chrono::steady_clock::now(); 214
  186. Prefetching V duration = time_end - time_start; chase_result.duration_degree = duration.count();

    chase_result.sum_degree = sum; return chase_result; } int main(int argc, char * argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100; const bool shuffle = (argc > 2) ? std::atoi(argv[2]) : false; const std::size_t d = (argc > 3) ? std::atoll(argv[3]) : 3; const bool prefetch = (argc > 4) ? std::atoi(argv[4]) : false; const std::size_t threads_count = (argc > 5) ? std::atoll(argv[5]) : 4; printf("size: %zu \n", n); printf("shuffle: %d \n", shuffle); printf("distance: %zu \n", d); printf("prefetch: %d \n", prefetch); 215
  187. Prefetching VI printf("threads_count: %zu \n", threads_count); const auto thread_work =

    [n, shuffle, d, prefetch]() { return chase(n, shuffle, d, prefetch); }; std::vector<std::future<timing_result>> results; for (std::size_t thread = 0; thread != threads_count; ++thread) results.emplace_back(std::async(std::launch::async, thread_work)); for (auto && future_result : results) if (future_result.valid()) future_result.wait(); std::vector<double> timings_initial, timings_non_prefetched, timings_degree; for (auto && future_result : results) { timing_result chase_result = future_result.get(); timings_initial.push_back(chase_result.duration_initial); 216
  188. Prefetching VII timings_non_prefetched.push_back(chase_result.duration_non_prefetched); timings_degree.push_back(chase_result.duration_degree); } const auto timings_initial_minmax = std::minmax_element(begin(timings_initial),

    end(timings_initial)); const auto timings_non_prefetched_minmax = std::minmax_element(begin(timings_non_prefetched), end(timings_non_prefetched)); const auto timings_degree_minmax = std::minmax_element(begin(timings_degree), end(timings_degree)); printf(prefetch ? "prefetched" : "non-prefetched"); printf(" initial duration: [%g, %g] \n", *timings_initial_minmax.first, *timings_initial_minmax.second); printf("non-prefetched duration: [%g, %g] \n", *timings_non_prefetched_minmax.first, *timings_non_prefetched_minmax.second); printf("degree-two prefetching duration: [%g, %g] \n", *timings_degree_minmax.first, *timings_degree_minmax.second); } 217
  189. Prefetch Overhead S. Van der Wiel and D. Lilja, "A

    Survey of Data Prefetching Techniques," Technical Report No. HPPC 96-05, University of Minnesota, October 1996. 218
  190. Prefetching Timings: No Prefetch $ likwid-perfctr -f -C 0-3 -g

    L3 -m ./prefetch 100000 1 0 0 4 distance: 0 prefetch: 0 non-prefetched initial duration: [0.00280393, 0.00289815] non-prefetched duration: [0.00254968, 0.00257311] degree-two prefetching duration: [0.00290615, 0.00296243] Region chase_initial, Group 1: L3 | CPI STAT | 5.8641 | 1.4529 | 1.4744 | 1.4660 | | L3 bandwidth [MBytes/s] STAT | 10733.6308 | 2666.0364 | 2710.9325 | 2683.4077 | Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | L3 miss rate STAT | 0.0584 | 0.0145 | 0.0148 | 0.0146 | | L3 miss ratio STAT | 3.7723 | 0.9117 | 0.9789 | 0.9431 | $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 0 0 4 | Cycles without execution [%] STAT | 228.2316 | 56.8136 | 57.4443 | 57.0579 | | Cycles without execution [%] STAT | 227.0385 | 56.5980 | 57.0024 | 56.7596 | 219
  191. Prefetching Timings: useless 0-distance prefetch (overhead) $ likwid-perfctr -f -C

    0-3 -g L3 -m ./prefetch 100000 1 0 1 4 distance: 0 prefetch: 1 prefetched initial duration: [0.00288751, 0.00295978] non-prefetched duration: [0.0025575, 0.00258342] degree-two prefetching duration: [0.00285772, 0.00287839] Region chase_initial, Group 1: L3 | CPI STAT | 5.7454 | 1.4345 | 1.4387 | 1.4364 | | L3 bandwidth [MBytes/s] STAT | 10518.6383 | 2618.5405 | 2645.6096 | 2629.6596 | 220
  192. Prefetching Timings: 1-distance prefetch (mostly overhead) $ likwid-perfctr -f -C

    0-3 -g L3CACHE -m ./prefetch 100000 1 1 1 4 prefetched initial duration: [0.00250957, 0.00257662] non-prefetched duration: [0.00255286, 0.00258417] degree-two prefetching duration: [0.00230482, 0.00235828] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | CPI STAT | 4.9595 | 1.2343 | 1.2433 | 1.2399 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.0889 | 0.4381 | 0.6454 | 0.5222 | $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 1 1 4 | Cycles without execution [%] STAT | 214.1614 | 53.4628 | 53.6716 | 53.5404 | | Cycles without execution [%] STAT | 200.4785 | 50.0405 | 50.1857 | 50.1196 | Formulas: L3 request rate = MEM_LOAD_UOPS_RETIRED_L3_ALL/UOPS_RETIRED_ALL L3 miss rate = MEM_LOAD_UOPS_RETIRED_L3_MISS/UOPS_RETIRED_ALL L3 miss ratio = MEM_LOAD_UOPS_RETIRED_L3_MISS/MEM_LOAD_UOPS_RETIRED_L3_ALL https://github.com/RRZE-HPC/likwid/blob/master/groups/ivybridge/L3CACHE.txt 221
  193. Prefetching Timings: 2-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3CACHE -m ./prefetch 100000 1 2 1 4 size: 100000 shuffle: 1 distance: 2 prefetch: 1 threads_count: 4 prefetched initial duration: [0.0023392, 0.00241287] non-prefetched duration: [0.00257006, 0.00260938] degree-two prefetching duration: [0.00199431, 0.00203528] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | CPI STAT | 4.5557 | 1.1331 | 1.1423 | 1.1389 | | L3 request rate STAT | 0.0006 | 0.0001 | 0.0002 | 0.0002 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.2317 | 0.3138 | 0.6791 | 0.5579 | Region chase_degree, Group 1: L3CACHE | CPI STAT | 3.6990 | 0.9243 | 0.9253 | 0.9248 | | L3 request rate STAT | 0.0005 | 0.0001 | 0.0002 | 0.0001 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.0145 | 0.3597 | 0.6550 | 0.5036 | 222
  194. Prefetching Timings: 8-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3CACHE -m ./prefetch 100000 1 8 1 4 size: 100000 shuffle: 1 distance: 8 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00181161, 0.00188783] non-prefetched duration: [0.00257601, 0.0026076] degree-two prefetching duration: [0.00152468, 0.00156814] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | Runtime (RDTSC) [s] STAT | 0.0065 | 0.0016 | 0.0017 | 0.0016 | | CPI STAT | 3.4808 | 0.8650 | 0.8788 | 0.8702 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.2431 | 0.4694 | 0.6640 | 0.5608 | Region chase_degree, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | Runtime (RDTSC) [s] STAT | 0.0053 | 0.0013 | 0.0014 | 0.0013 | | CPI STAT | 2.7450 | 0.6832 | 0.6882 | 0.6863 | | L3 miss rate STAT | 0.0016 | 0.0004 | 0.0004 | 0.0004 | | L3 miss ratio STAT | 3.4045 | 0.7778 | 0.9346 | 0.8511 | 223
  195. Prefetching Timings: 8-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3 -m ./prefetch 100000 1 8 1 4 size: 100000 shuffle: 1 distance: 8 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00180738, 0.00189831] non-prefetched duration: [0.00254486, 0.00258013] degree-two prefetching duration: [0.00154542, 0.00158065] Region chase_initial, Group 1: L3 | CPI STAT | 3.5027 | 0.8668 | 0.8835 | 0.8757 | | L3 bandwidth [MBytes/s] STAT | 17384.8731 | 4296.5905 | 4381.7164 | 4346.2183 | Region chase_degree, Group 1: L3 | Metric | Sum | Min | Max | Avg | | CPI STAT | 2.7626 | 0.6894 | 0.6919 | 0.6906 | | L3 bandwidth [MBytes/s] STAT | 21505.6670 | 5333.6653 | 5396.4473 | 5376.4168 $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 8 1 4 | Cycles without execution [%] STAT | 187.6689 | 46.3938 | 47.3055 | 46.9172 | | Cycles without execution [%] STAT | 151.5095 | 37.6872 | 38.0656 | 37.8774 | 224
  196. Prefetching Timings: suboptimal (untimely) prefetch $ likwid-perfctr -f -C 0-3

    -g L3 -m ./prefetch 100000 1 512 1 4 size: 100000 shuffle: 1 distance: 512 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00177956, 0.00186644] non-prefetched duration: [0.00257188, 0.0026064] degree-two prefetching duration: [0.00173249, 0.00178712] Region chase_initial, Group 1: L3 | CPI STAT | 3.4343 | 0.8523 | 0.8683 | 0.8586 | | L3 data volume [GBytes] STAT | 0.0293 | 0.0073 | 0.0074 | 0.0073 | Region chase_degree, Group 1: L3 | Metric | Sum | Min | Max | Avg | | CPI STAT | 3.1891 | 0.7903 | 0.8034 | 0.7973 | | L3 bandwidth [MBytes/s] STAT | 19902.4764 | 4954.4107 | 5013.4006 | 4975.6191 | 225
  197. Gem5 - std::vector & std::list I Filling with numbers -

    std::vector vs. std::list Machine code & assembly (std::vector) Micro-ops execution breakdown (std::vector) Assembly is Too High Level: http://xlogicx.net/?p=369 227
  198. Gem5 - std::vector & std::list III Pipeline diagram - one

    iteration (std::vector) Pipeline diagram - three iterations (std::vector) 229
  199. Gem5 - std::vector & std::list IV Machine code & assembly

    (std::list) heap allocation in the loop @ 400d85 what could possibly go wrong? 230
  200. (The GNU C library's) malloc https://sourceware.org/glibc/wiki/MallocInternals Arena A structure that

    is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are "free". Threads assigned to each arena will allocate memory from that arena's free lists. Glibc Heap Analysis in Linux Systems with Radare2 https://youtube.com/watch?v=Svm5V4leEho r2con-2016 - rada.re/con/ 235
  201. malloc & free - new, new[], delete, delete[] int main()

    { double * a = new double[8]; double * b = new double[8]; delete[] b; delete[] a; double * c = new double[8]; delete[] c; } 236
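A possible follow-up experiment (my sketch; the exact behaviour depends on the glibc version, the allocation sizes, and whether operator new forwards to malloc): printing the pointers typically shows the chunk freed for a being handed back for c, i.e. the allocator's free lists make allocation/deallocation order visible in where your data ends up.

    #include <cstdio>

    int main() {
        double * a = new double[8];
        double * b = new double[8];
        std::printf("a = %p, b = %p\n", static_cast<void *>(a), static_cast<void *>(b));
        delete[] b;
        delete[] a;
        double * c = new double[8];
        std::printf("c = %p (often the same address as a)\n", static_cast<void *>(c));
        delete[] c;
    }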
  202. Memory Access Patterns: Temporal & Spatial Locality horizontal axis -

    time vertical axis - address D. J. Hatfield and J. Gerald. "Program restructuring for virtual memory." IBM Systems Journal, 10(3):168–192, 1971. 243
  203. Loop Fusion 0.429504s (unfused) down to 0.287501s (fused) g++ -Ofast

    -march=native (5.2.0) void unfused(double * a, double * b, double * c, double * d, size_t N) { for (size_t i = 0; i != N; ++i) a[i] = b[i] * c[i]; for (size_t i = 0; i != N; ++i) d[i] = a[i] * c[i]; } void fused(double * a, double * b, double * c, double * d, size_t N) { for (size_t i = 0; i != N; ++i) { a[i] = b[i] * c[i]; d[i] = a[i] * c[i]; } } 244
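A possible harness for reproducing the comparison (my sketch; the 0.43 s vs. 0.29 s figures above are from the speaker's machine and flags, and the array size below is an arbitrary assumption chosen to exceed the caches):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    void unfused(double * a, double * b, double * c, double * d, std::size_t N) {
        for (std::size_t i = 0; i != N; ++i) a[i] = b[i] * c[i];
        for (std::size_t i = 0; i != N; ++i) d[i] = a[i] * c[i];   // re-reads a[] and c[]
    }

    void fused(double * a, double * b, double * c, double * d, std::size_t N) {
        for (std::size_t i = 0; i != N; ++i) { a[i] = b[i] * c[i]; d[i] = a[i] * c[i]; }
    }

    int main() {
        const std::size_t N = std::size_t{1} << 24;   // large enough to stream from DRAM
        std::vector<double> a(N), b(N, 2.0), c(N, 3.0), d(N);
        for (auto run : {&unfused, &fused}) {
            const auto t0 = std::chrono::steady_clock::now();
            run(a.data(), b.data(), c.data(), d.data(), N);
            const auto t1 = std::chrono::steady_clock::now();
            std::printf("%g s\n", std::chrono::duration<double>(t1 - t0).count());
        }
    }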
  204. Pin - A Dynamic Binary Instrumentation Tool http://www.intel.com/software/pintool pin -t

    $PIN_ROOT/source/tools/ManualExamples/obj-intel64/pinatrace.so -- ./loop_fusion . . . 0x400e43,R,0x401c48 0x400e59,R,0x401d40 0x400e65,W,0x1c789c0 0x400e65,W,0x1c789e0 . . . r-project.org rstudio.com ggplot2.org rcpp.org 245
  205. Takeaway: Overlapping Latencies as a General Principle Overlapping latencies also

    works on a "macro" scale • load as "get the data from the Internet" • compute as "process the data" Another example: Communication Avoiding and Overlapping for Numerical Linear Algebra • https://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-65.html • http://www.cs.berkeley.edu/~egeor/sc12_slides_final.pdf 252
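A toy sketch of the macro-scale idea (entirely illustrative; fetch and process below are stand-ins I made up, not code from the talk): start the next "download" asynchronously so its latency hides behind processing the current chunk.

    #include <chrono>
    #include <cstdio>
    #include <future>
    #include <numeric>
    #include <thread>
    #include <vector>

    std::vector<double> fetch(int id) {               // stand-in for "get the data from the Internet"
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return std::vector<double>(1 << 20, id * 1.0);
    }

    double process(const std::vector<double> & v) {   // stand-in for "process the data"
        return std::accumulate(v.begin(), v.end(), 0.0);
    }

    int main() {
        const int chunks = 8;
        double total = 0.0;
        auto next = std::async(std::launch::async, fetch, 0);   // kick off the first fetch
        for (int id = 0; id != chunks; ++id) {
            std::vector<double> data = next.get();              // wait for the current chunk
            if (id + 1 != chunks)
                next = std::async(std::launch::async, fetch, id + 1);  // overlap the next fetch...
            total += process(data);                                    // ...with this computation
        }
        std::printf("total = %g\n", total);
    }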
  206. Non-Overlapped Timings id,symbol,count,time 1,AAPL,565449,1.59043 2,AXP,731366,3.43745 3,BA,867366,5.40218 4,CAT,830327,7.08103 5,CSCO,400440,8.49192 6,CVX,687198,9.98761 7,DD,910932,12.2254

    8,DIS,910430,14.058 9,GE,871676,15.8333 10,GS,280604,17.059 11,HD,556611,18.2738 12,IBM,860071,20.3876 13,INTC,559127,21.9856 14,JNJ,724724,25.5534 15,JPM,500473,26.576 16,KO,864903,28.5405 17,MCD,717021,30.087 18,MMM,698996,31.749 19,MRK,733948,33.2642 20,MSFT,475451,34.3134 21,NKE,556344,36.4545 253
  207. Overlapped Timings id,symbol,count,time 1,AAPL,565449,2.00713 2,AXP,731366,2.09158 3,BA,867366,2.13468 4,CAT,830327,2.19194 5,CSCO,400440,2.19197 6,CVX,687198,2.19198 7,DD,910932,2.51895

    8,DIS,910430,2.51898 9,GE,871676,2.51899 10,GS,280604,2.519 11,HD,556611,2.51901 12,IBM,860071,2.51902 13,INTC,559127,2.51902 14,JNJ,724724,2.51903 15,JPM,500473,2.51904 16,KO,864903,2.51905 17,MCD,717021,2.51906 18,MMM,698996,2.51907 19,MRK,733948,2.51908 20,MSFT,475451,2.51908 21,NKE,556344,2.51909 254
  208. Cache Misses, MLP, and STC: Slack R. Das et al.,

    "Aérgia: Exploiting Packet Latency Slack in On-Chip Networks," Proc. 37th Ann. Int’l Symp. Computer Architecture (ISCA 10), ACM Press, 2010. 258
  209. Dependent Cache Misses - Non-Overlapped - Serialized A Day in

    the Life of a Cache Miss Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and James E. Smith, "A Performance Counter Architecture for Computing Accurate CPI Components", ASPLOS 2006, pp. 175-184. 1. load instruction enters the window (ROB) 2. the load issues from the instruction buffer (RS) 3. the load blocks the ROB head 4. ROB eventually fills 5. dispatch stops, instruction window drains 6. eventually issue and commit stop
  210. Independent Cache Misses in ROB - Overlapped Stijn Eyerman, Lieven

    Eeckhout, Tejas Karkhanis, and James E. Smith, "A Top-Down Approach to Architecting CPI Component Performance Counters", IEEE Micro, Special Issue on Top Picks from 2006 Microarchitecture Conferences, Vol 27, No 1, pp. 84-93. 260
  211. Miss-Dependent Mispredicted Branch - Penalties Serialization S. Eyerman, J.E. Smith

    and L. Eeckhout, "Characterizing the branch misprediction penalty", Performance Analysis of Systems and Software 2006 IEEE International Symposium on 2006, pp. 48-58. 261
  212. Dependent Cache Misses - Non-Overlapped - Serialized Milad Hashemi, Khubaib,

    Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. "Accelerating Dependent Cache Misses with an Enhanced Memory Controller." In ISCA, 2016. 262
  213. Independent Misses Connected by a Pending Cache Hit • MLP

    - supported by non-blocking caches, out-of-order execution • multiple outstanding cache-misses - Miss Status Holding Registers (MSHRs) / Line Fill Buffers (LFBs) • MSHR file entries - merging redundant (same cache line) memory requests Xi E. Chen and Tor M. Aamodt. 2008. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41). 263
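A sketch of the consequence for code (mine, not from the cited paper; sizes are arbitrary): a single dependent pointer chase serializes its misses, while interleaving two independent chases lets the out-of-order core keep several misses in flight at once, up to the MSHR/LFB limit.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    // next[i] holds the index of the node following i, so the loads form a dependency chain
    std::vector<std::size_t> random_permutation(std::size_t n) {
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});
        std::shuffle(next.begin(), next.end(), std::mt19937{1});
        return next;
    }

    std::size_t chase_one(const std::vector<std::size_t> & next, std::size_t steps) {
        std::size_t i = 0;
        for (std::size_t s = 0; s != steps; ++s) i = next[i];   // each miss waits for the previous one
        return i;
    }

    std::size_t chase_two(const std::vector<std::size_t> & a,
                          const std::vector<std::size_t> & b, std::size_t steps) {
        std::size_t i = 0, j = 0;
        for (std::size_t s = 0; s != steps; ++s) { i = a[i]; j = b[j]; }  // independent chains:
        return i + j;                                                     // misses can overlap (MLP)
    }

    int main() {
        const std::size_t n = std::size_t{1} << 22, steps = n;   // ~32 MB per array
        const auto a = random_permutation(n), b = random_permutation(n);
        std::printf("%zu %zu\n", chase_one(a, steps), chase_two(a, b, steps));
    }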
  214. Independent Misses Connected by a Pending Cache Hit Xi E.

    Chen and Tor M. Aamodt. 2011. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 264
  215. Finite MSHRs => Finite MLP Xi E. Chen and Tor

    M. Aamodt. 2011. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 265
  216. Cache Miss Penalty: Leading Edge & Trailing Edge "The End

    of Scaling? Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node," Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 266
  217. Cache Miss Penalty: Bandwidth Utilization Impact "The End of Scaling?

    Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node," Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 267
  218. Memory Capacity & Multicore Processors Memory utilization even more important

    - contention for capacity & bandwidth! "Disaggregated Memory Architectures for Blade Servers," Kevin Te-Ming Lim, Ph.D. Thesis, The University of Michigan, 2010. 268
  219. Multicore: Sequential / Parallel Execution Model L. Yavits, A. Morad,

    R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 269
  220. Multicore: Amdahl's Law, Strong Scaling "Reevaluating Amdahl's Law," John L.

    Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 270
  221. Multicore: Gustafson's Law, Weak Scaling "Reevaluating Amdahl's Law," John L.

    Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 271
  222. Amdahl's Law Optimistic Assumes perfect parallelism of the parallel portion:

    Only Serial Bottlenecks, No Parallel Bottlenecks Counterpoint: https://blogs.msdn.microsoft.com/ddperf/2009/04/29/parallel-scalability-isnt-childs-play-part-2-amdahls-law-vs-gunthers-law/ 272
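As a quick numeric illustration of the two scaling laws on the preceding slides (my sketch; p = 0.95 is an arbitrary choice): Amdahl's law for a fixed problem size versus Gustafson's scaled-size formulation.

    #include <cstdio>

    double amdahl(double p, double N)    { return 1.0 / ((1.0 - p) + p / N); }  // strong scaling
    double gustafson(double p, double N) { return (1.0 - p) + p * N; }          // weak scaling

    int main() {
        const double p = 0.95;   // parallel fraction
        for (double N : {2.0, 4.0, 8.0, 64.0})
            std::printf("N = %2.0f  Amdahl: %6.2fx   Gustafson: %6.2fx\n",
                        N, amdahl(p, N), gustafson(p, N));
    }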
  223. Multicore: Synchronization, Actual Scaling M. A. Suleman, M. K. Qureshi,

    and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 273
  224. Multicore: Communication, Actual Scaling M. A. Suleman, M. K. Qureshi,

    and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 274
  225. Multicore & DRAM: AoS I #include <cstddef> #include <cstdlib> #include

    <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> struct contract { double K; double T; double P; }; using element = contract; using container = std::vector<element>; 275
  226. Multicore & DRAM: AoS II double sum_if(const container & a,

    const container & b, const std::vector<std::size_t> & index) { double sum = 0.0; for (std::size_t i = 0, n = index.size(); i != n; ++i) { std::size_t j = index[i]; if (a[j].K == b[j].K) sum += a[j].K; } return sum; } template <typename F> double average(F f, std::size_t m) { double average = 0.0; for (std::size_t i = 0; i != m; ++i) average += f() / m; return average; } 276
  227. Multicore & DRAM: AoS III std::vector<std::size_t> index_stream(std::size_t n) { std::vector<std::size_t>

    index; index.reserve(n); for (std::size_t i = 0; i != n; ++i) index.push_back(i); return index; } std::vector<std::size_t> index_random(std::size_t n) { std::vector<std::size_t> index; index.reserve(n); std::random_device rd; static std::mt19937 g(rd()); std::uniform_int_distribution<std::size_t> u(0, n - 1); for (std::size_t i = 0; i != n; ++i) index.push_back(u(g)); return index; } 277
  228. Multicore & DRAM: AoS IV int main(int argc, char *

    argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10; std::cout << "n = " << n << '\n'; std::cout << "m = " << m << '\n'; const std::size_t threads_count = 4; // thread access locality type // 0: none (default); 1: stream; 2: random std::vector<std::size_t> thread_type(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { thread_type[thread] = (argc > 3 + thread) ? std::atoll(argv[3 + thread]) : 0; std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n'; } 278
  229. Multicore & DRAM: AoS V endl(std::cout); std::vector<std::vector<std::size_t>> index(threads_count); for (std::size_t

    thread = 0; thread != threads_count; ++thread) { index[thread].resize(n); if (thread_type[thread] == 1) index[thread] = index_stream(n); else if (thread_type[thread] == 2) index[thread] = index_random(n); } const container v1(n, {1.0, 0.5, 3.0}); const container v2(n, {1.0, 2.0, 1.0}); const auto thread_work = [m, &v1, &v2](const auto & thread_index) { const auto f = [&v1, &v2, &thread_index] { return sum_if(v1, v2, thread_index); }; return average(f, m); }; 279
  230. Multicore & DRAM: AoS VI boost::timer::auto_cpu_timer timer; std::vector<std::future<double>> results; results.reserve(threads_count);

    for (std::size_t thread = 0; thread != threads_count; ++thread) { results.emplace_back(std::async(std::launch::async, [thread, &thread_work, &index] { return thread_work(index[thread]); })); } for (auto && result : results) if (result.valid()) result.wait(); for (auto && result : results) std::cout << result.get() << '\n'; } 280
  231. Multicore & DRAM: AoS Timings 1 thread, sequential access $

    ./DRAM_CMP 10000000 10 1 n = 10000000 m = 10 thread_type[0] = 1 1e+007 0.395408s wall, 0.406250s user + 0.000000s system = 0.406250s CPU (102.7%) 281
  232. Multicore & DRAM: AoS Timings 1 thread, random access $

    ./DRAM_CMP 10000000 10 2 n = 10000000 m = 10 thread_type[0] = 2 1e+007 5.348314s wall, 5.343750s user + 0.000000s system = 5.343750s CPU (99.9%) 282
  233. Multicore & DRAM: AoS Timings 4 threads, sequential access $

    ./DRAM_CMP 10000000 10 1 1 1 1 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 1 1e+007 1e+007 1e+007 1e+007 0.508894s wall, 2.000000s user + 0.000000s system = 2.000000s CPU (393.0%) 283
  234. Multicore & DRAM: AoS Timings 4 threads: 3 sequential access

    + 1 random access $ ./DRAM_CMP 10000000 10 1 1 1 2 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 2 1e+007 1e+007 1e+007 1e+007 5.666049s wall, 7.265625s user + 0.000000s system = 7.265625s CPU (128.2%) 284
  235. Multicore & DRAM: AoS Timings Memory Access Patterns & Multicore:

    Interactions Matter Inter-thread Interference Sharing - Contention - Interference - Slowdown Threads using a shared resource (like on-chip/off-chip interconnects and memory) contend for it, interfering with each other's progress, resulting in slowdown (and thus negative returns to increased threads count). cf. Thomas Moscibroda and Onur Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," Microsoft Research Technical Report, MSR-TR-2007-15, February 2007. 285
  236. Multicore & DRAM: SoA I #include <cstddef> #include <cstdlib> #include

    <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> // SoA (structure-of-arrays) struct data { std::vector<double> K; std::vector<double> T; std::vector<double> P; }; 286
  237. Multicore & DRAM: SoA II double sum_if(const data & a,

    const data & b, const std::vector<std::size_t> & index) { double sum = 0.0; for (std::size_t i = 0, n = index.size(); i != n; ++i) { std::size_t j = index[i]; if (a.K[j] == b.K[j]) sum += a.K[j]; } return sum; } template <typename F> double average(F f, std::size_t m) { double average = 0.0; for (std::size_t i = 0; i != m; ++i) { average += f() / m; } 287
  238. Multicore & DRAM: SoA III return average; } std::vector<std::size_t> index_stream(std::size_t

    n) { std::vector<std::size_t> index; index.reserve(n); for (std::size_t i = 0; i != n; ++i) index.push_back(i); return index; } std::vector<std::size_t> index_random(std::size_t n) { std::vector<std::size_t> index; index.reserve(n); std::random_device rd; static std::mt19937 g(rd()); std::uniform_int_distribution<std::size_t> u(0, n - 1); 288
  239. Multicore & DRAM: SoA IV for (std::size_t i = 0;

    i != n; ++i) index.push_back(u(g)); return index; } int main(int argc, char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10; std::cout << "n = " << n << '\n'; std::cout << "m = " << m << '\n'; const std::size_t threads_count = 4; // thread access locality type // 0: none (default); 1: stream; 2: random std::vector<std::size_t> thread_type(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { 289
  240. Multicore & DRAM: SoA V thread_type[thread] = (argc > 3

    + thread) ? std::atoll(argv[3 + thread]) : 0; std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n'; } endl(std::cout); std::vector<std::vector<std::size_t>> index(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { index[thread].resize(n); if (thread_type[thread] == 1) index[thread] = index_stream(n); else if (thread_type[thread] == 2) index[thread] = index_random(n); } data v1; v1.K.resize(n, 1.0); v1.T.resize(n, 0.5); v1.P.resize(n, 3.0); 290
  241. Multicore & DRAM: SoA VI data v2; v2.K.resize(n, 1.0); v2.T.resize(n,

    2.0); v2.P.resize(n, 1.0); const auto thread_work = [m, &v1, &v2](const auto & thread_index) { const auto f = [&v1, &v2, &thread_index] { return sum_if(v1, v2, thread_index); }; return average(f, m); }; 291
  242. Multicore & DRAM: SoA VII boost::timer::auto_cpu_timer timer; std::vector<std::future<double>> results; results.reserve(threads_count);

    for (std::size_t thread = 0; thread != threads_count; ++thread) { results.emplace_back(std::async(std::launch::async, [thread, &thread_work, &index] { return thread_work(index[thread]); })); } for (auto && result : results) if (result.valid()) result.wait(); for (auto && result : results) std::cout << result.get() << '\n'; } 292
  243. Multicore & DRAM: SoA Timings 1 thread, sequential access $

    ./DRAM_CMP.SoA 10000000 10 1 n = 10000000 m = 10 thread_type[0] = 1 1e+007 0.211877s wall, 0.203125s user + 0.000000s system = 0.203125s CPU (95.9%) 293
  244. Multicore & DRAM: SoA Timings 1 thread, random access $

    ./DRAM_CMP.SoA 10000000 10 2 n = 10000000 m = 10 thread_type[0] = 2 1e+007 4.534646s wall, 4.546875s user + 0.000000s system = 4.546875s CPU (100.3%) 294
  245. Multicore & DRAM: SoA Timings 4 threads, sequential access $

    ./DRAM_CMP.SoA 10000000 10 1 1 1 1 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 1 1e+007 1e+007 1e+007 1e+007 0.256391s wall, 1.031250s user + 0.000000s system = 1.031250s CPU (402.2%) 295
  246. Multicore & DRAM: SoA Timings 4 threads: 3 sequential access

    + 1 random access $ ./DRAM_CMP.SoA 10000000 10 1 1 1 2 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 2 1e+007 1e+007 1e+007 1e+007 4.581033s wall, 5.265625s user + 0.000000s system = 5.265625s CPU (114.9%) 296
  247. Multicore & DRAM: SoA Timings Better Access Patterns yield Better

    Single-core Performance but also Reduced Interference and thus Better Multi-core Performance 297
  248. Multicore: Arithmetic Intensity L. Yavits, A. Morad, R. Ginosar, The

    effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 298
  249. Multicore: Synchronization & Connectivity Intensity L. Yavits, A. Morad, R.

    Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 299
  250. Speedup: Synchronization and Connectivity Bottlenecks f: parallelizable fraction f1 :

    connectivity intensity f2 : synchronization intensity L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 300
  251. Speedup: Synchronization & Connectivity Bottlenecks Speedup - affected by sequential-to-parallel

    data synchronization and inter-core communication. L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 301
  252. Partitioning-Sharing Tradeoffs Butler W. Lampson. 1983. "Hints for computer system

    design." In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48. 302
  253. Shared Resource: DRAM Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni.

    "PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms," IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), 2014. https://github.com/heechul/palloc 303
  254. Shared Resource: MSHRs Heechul Yun, Rodolfo Pellizzoni, and Prathap Kumar

    Valsan. 2015. "Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems." In Proceedings of the 2015 27th Euromicro Conference on Real-Time Systems (ECRTS '15). 304
  255. Partitioning Multithreading • Thread affinity • POSIX: sched_getcpu, pthread_setaffinity_np •

    http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/ • https://github.com/RRZE-HPC/likwid/blob/master/groups/skylake/FALSE_SHARE.txt • Local LLC false sharing rate = MEM_LOAD_L3_HIT_RETIRED_XSNP_HITM / MEM_INST_RETIRED_ALL • NUMA: Remote Memory Accesses (RMA), Local Memory Accesses (LMA), RMA/LMA ratio • https://01.org/numatop/ • https://github.com/01org/numatop 305
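
The affinity APIs named above can be exercised with a short Linux-only sketch (GNU extensions assumed; compile with -pthread; the core number is arbitrary):

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Pin the calling thread to one core so measurements are not skewed by migration.
    void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main() {
        pin_to_core(2);  // illustrative core id
        std::printf("running on CPU %d\n", sched_getcpu());
    }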
  256. Cache Partitioning: Index-Based & Way-Based Giovani Gracioli, Ahmed Alhammad, Renato

    Mancuso, Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems." ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 306
  257. Cache Partitioning: CPU Support Giovani Gracioli, Ahmed Alhammad, Renato Mancuso,

    Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems." ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 307
  258. Cache Partitioning & Intel: CAT & CMT Cache Monitoring Technology

    and Cache Allocation Technology https://github.com/01org/intel-cmt-cat A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, “Cache QoS: From concept to reality in the Intel Xeon processor E5-2600 v3 product family,” in Intl. Symp. on High Performance Computer Architecture (HPCA), Mar. 2016. 308
  259. Cache Partitioning != Cache Access Timing Isolation H. Yun and

    P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 309
  260. Cache Partitioning != Cache Access Timing Isolation H. Yun and

    P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 310
  261. Cache Partitioning != Cache Access Timing Isolation H. Yun and

    P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 311
  262. Cache Partitioning != Cache Access Timing Isolation https://github.com/CSL-KU/IsolBench Prathap Kumar

    Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 312
  263. Cache Partitioning != Cache Access Timing Isolation • Shared: MSHRs

    (Miss information/Status Holding Registers) / LFBs (Line Fill Buffers) • Contention => cache space partitioning != cache access timing isolation Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 313
  264. Cache Partitioning != Cache Access Timing Isolation • multiple MSHRs

    support multiple outstanding cache-misses • the number of MSHRs determines the MLP of the cache • local MLP - outstanding misses one core can generate • global MLP - parallelism of the entire shared memory hierarchy (i.e., shared LLC and DRAM) • "the aggregated parallelism of the cores (the sum of local MLP) exceeds the parallelism supported by the shared LLC and DRAM (global MLP) in the out-of-order architectures" Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 314
  265. Shared Resource (MSHRs) & Prefetching: Xeon Phi Zhenman Fang, Sanyam

    Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. "Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking." ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 315
  266. Shared Resource (MSHRs) & Prefetching: SNB Zhenman Fang, Sanyam Mehta,

    Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. "Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking." ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 316
  267. Weighted Speedup A. Snavely and D. M. Tullsen, “Symbiotic jobscheduling

    for a simultaneous multithreading processor,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Nov. 2000, pp. 234–244. S. Eyerman and L. Eeckhout, “Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance,” in Computer Architecture Letters, vol. 13, no. 2, 2014. 317
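
The weighted-speedup metric from these papers sums, over the co-running programs, each program's multiprogram IPC divided by its stand-alone IPC. A small sketch (the IPC values in main are made up):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Weighted speedup = sum over programs i of IPC_shared[i] / IPC_alone[i].
    double weighted_speedup(const std::vector<double>& ipc_shared,
                            const std::vector<double>& ipc_alone) {
        double ws = 0.0;
        for (std::size_t i = 0; i != ipc_shared.size(); ++i)
            ws += ipc_shared[i] / ipc_alone[i];
        return ws;
    }

    int main() {
        std::printf("WS = %.2f\n", weighted_speedup({0.8, 1.1}, {1.2, 1.5}));
    }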
  268. The Number of Cycles Sam Van den Steen; Stijn Eyerman;

    Sander De Pestel; Moncef Mechri; Trevor E. Carlson; David Black-Schaffer; Erik Hagersten; Lieven Eeckhout, “Analytical Processor Performance and Power Modeling using Micro-Architecture Independent Characteristics,” Transactions on Computers (TC) 2016. C - #cycles, N - #instructions, Deff - effective dispatch rate, mbpred - #branch mispredictions, cres - branch resolution time, cfe - front-end pipeline depth, mILi - #instruction fetch misses at each level i in the cache hierarchy, cLi - access latency to each cache level, ROB - size of the Reorder Buffer, mLLC - #number of LLC load misses, cmem - memory access time, cbus - memory bus transfer and waiting time, MLP - amount of memory-level parallelism, PhLLC - LLC hit chain penalty 318
  269. Cache-aware Roofline model "Cache-aware Roofline model: Upgrading the loft." Aleksandar

    Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 321
  270. Cache-aware Roofline model "Cache-aware Roofline model: Upgrading the loft." Aleksandar

    Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 322
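
The basic roofline bound caps attainable performance at the smaller of peak compute and bandwidth times arithmetic intensity; the cache-aware variant cited above refines this per memory level. A sketch with placeholder machine numbers:

    #include <algorithm>
    #include <cstdio>

    // Attainable GFLOP/s for a kernel with the given arithmetic intensity (FLOP/byte).
    double roofline(double peak_gflops, double bw_gb_per_s, double flops_per_byte) {
        return std::min(peak_gflops, bw_gb_per_s * flops_per_byte);
    }

    int main() {
        const double peak = 100.0, bw = 25.0;  // hypothetical peak and DRAM bandwidth
        for (double ai : {0.25, 1.0, 4.0, 16.0})
            std::printf("AI = %5.2f  bound = %6.2f GFLOP/s\n", ai, roofline(peak, bw, ai));
    }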
  271. Roofline Model: Microarchitectural Bottlenecks "Extending the Roofline Model: Bottleneck Analysis

    with Microarchitectural Constraints." Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 323
  272. Roofline Model: Microarchitectural Bottlenecks "Extending the Roofline Model: Bottleneck Analysis

    with Microarchitectural Constraints." Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 324
  273. C++ Standards: C++11 & C++14 Atomic Operations & Concurrent Memory

    Model http://en.cppreference.com/w/cpp/atomic http://github.com/MattPD/cpplinks/blob/master/atomics.lockfree.memory_model.md "The C11 and C++11 Concurrency Model" by Mark John Batty: http://www.cl.cam.ac.uk/~mjb220/thesis/ Move semantics https://isocpp.org/wiki/faq/cpp11-language#rval http://thbecker.net/articles/rvalue_references/section_01.html http://kholdstare.github.io/technical/2013/11/23/moves-demystified.html scoped_allocator (stateful allocators support) https://isocpp.org/wiki/faq/cpp11-library#scoped-allocator http://en.cppreference.com/w/cpp/header/scoped_allocator https://accu.org/content/conf2012/JonathanWakely-CXX11_allocators.pdf https://accu.org/content/conf2013/Frank_Birbacher_Allocators.r210article.pdf 325
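
As a minimal illustration of the C++11 memory-model links above, a release store paired with an acquire load publishes data between threads without a data race (a sketch, not from the slides):

    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    int main() {
        std::thread producer([] {
            data = 42;
            ready.store(true, std::memory_order_release);  // publish
        });
        std::thread consumer([] {
            while (!ready.load(std::memory_order_acquire)) { /* spin */ }
            assert(data == 42);  // visible thanks to the release/acquire pairing
        });
        producer.join();
        consumer.join();
    }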
  274. C++ Standards: C++11, C++14, and C++17 reducing the need for

    conditional compilation via macros and template metaprogramming constexpr https://isocpp.org/wiki/faq/cpp11-language#cpp11-constexpr https://isocpp.org/wiki/faq/cpp14-language#extended-constexpr if constexpr http://en.cppreference.com/w/cpp/language/if#Constexpr_If 326
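
A brief sketch of the two features just listed: constexpr moves work to compile time, and C++17's if constexpr discards the untaken branch during instantiation (illustrative code, not from the slides):

    #include <type_traits>

    constexpr unsigned long long factorial(unsigned n) {
        return n <= 1 ? 1ULL : n * factorial(n - 1);  // C++11-style constexpr
    }
    static_assert(factorial(10) == 3628800ULL, "computed at compile time");

    template <typename T>
    T magnitude(T x) {
        if constexpr (std::is_signed<T>::value)  // C++17: no overloads or SFINAE needed
            return x < 0 ? -x : x;
        else
            return x;
    }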
  275. C++17 Standard std::string_view http://en.cppreference.com/w/cpp/string/basic_string_view interoperability with C APIs (e.g., sockets)

    without extra allocations / copies std::aligned_alloc (C11) http://en.cppreference.com/w/cpp/memory/c/aligned_alloc aligned uninitialized storage allocation (vectorization) Hardware interference size http://eel.is/c++draft/hardware.interference http://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size portable cache line size information (e.g., padding to avoid false sharing) Extended allocators & polymorphic memory resources http://en.cppreference.com/w/cpp/memory/polymorphic_allocator http://stackoverflow.com/questions/38010544/polymorphic-allocator-when-and-why-should-i-use-it http://boost.org/doc/libs/release/doc/html/container/extended_functionality.html 327
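
The interference-size constants above can be used to pad per-thread data onto separate cache lines. A sketch with a fallback, since compiler support for the constant was still emerging at the time (64 bytes is an assumption for common x86-64 parts):

    #include <atomic>
    #include <cstddef>
    #include <new>

    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t cache_line = std::hardware_destructive_interference_size;
    #else
    constexpr std::size_t cache_line = 64;  // assumed line size when the constant is unavailable
    #endif

    // Each counter occupies its own cache line, so writers on different cores do not false-share.
    struct alignas(cache_line) padded_counter {
        std::atomic<long> value{0};
    };

    padded_counter per_thread_counters[4];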
  276. C++ Core Guidelines P: Philosophy • P.9: Don't waste time

    or space. Per: Performance • Per.3: Don't optimize something that's not performance critical. • Per.6: Don't make claims about performance without measurements. • Per.7: Design to enable optimization • Per.18: Space is time. • Per.19: Access memory predictably. • Per.30: Avoid context switches on the critical path https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-performance https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#S-performance 328
  287. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict • cmov & tradeoffs: converting control dependencies to data dependencies 329
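
As a concrete illustration of the last few bullets, compare a branchy and a branchless form of the sum_if-style kernel used earlier; which one wins depends on how predictable the condition is (a sketch; compilers often, but not always, lower the select form to cmov or SIMD blends):

    #include <cstddef>
    #include <vector>

    // Control dependence: a mispredicted branch flushes the pipeline.
    double sum_branchy(const std::vector<double>& a, const std::vector<double>& b) {
        double sum = 0.0;
        for (std::size_t i = 0; i != a.size(); ++i)
            if (a[i] == b[i]) sum += a[i];
        return sum;
    }

    // Data dependence: no misprediction, but the compare and add always execute.
    double sum_branchless(const std::vector<double>& a, const std::vector<double>& b) {
        double sum = 0.0;
        for (std::size_t i = 0; i != a.size(); ++i)
            sum += (a[i] == b[i]) ? a[i] : 0.0;
        return sum;
    }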
  288. Takeaways Principles Data structures & data layout - fundamental part

    of design CPUs & pervasive forms of parallelism • can support each other: PLP, ILP (MLP!), TLP, DLP Balanced design vs. bottlenecks Overlapping latencies Sharing-contention-interference-slowdown Yale Patt's Phase 2: Break the layers: • break through the hardware/software interface • harness all levels of the transformation hierarchy 330
  289. Phase 2: Harnessing the Transformation Hierarchy Yale N. Patt, Microprocessor

    Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 331
  290. Break the Layers Yale N. Patt, Microprocessor Performance, Phase 2:

    Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 332
  291. Pigeonholing has to go Yale N. Patt at Yale Patt

    75 Visions of the Future Computer Architecture Workshop: "Are you a software person or a hardware person?" I'm a person. This pigeonholing has to go. We must break the layers. Abstractions are great - AFTER you understand what's being abstracted. Yale N. Patt, 2013 IEEE CS Harry H. Goode Award Recipient Interview — https://youtu.be/S7wXivUy-tk Yale N. Patt at Yale Patt 75 Visions of the Future Computer Architecture Workshop — https://youtu.be/x4LH1cJCvxs 333