Slide 1

Computer Architecture, C++, and High Performance
Matt P. Dziubinski
CppCon 2016
[email protected] // @matt_dz
Department of Mathematical Sciences, Aalborg University
CREATES (Center for Research in Econometric Analysis of Time Series)

Slide 2

Outline
• Performance
  • Why do we care?
  • What is it?
  • How to: measure it - reason about it - improve it?
2

Slide 3

Why? 3

Slide 4

Costs and Curves Moore, Gordon E. (1965). "Cramming more components onto integrated circuits". Electronics Magazine. 4

Slide 5

Cramming more components onto integrated circuits Moore, Gordon E. (1965). "Cramming more components onto integrated circuits". Electronics Magazine. 5

Slide 6

Spending Moore’s Dividend "Spending Moore's Dividend," James Larus, Microsoft Research Technical Report MSR-TR-2008-69, May 2008. 6

Slide 7

Transformation Hierarchy Yale N. Patt, Microprocessor Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 7

Slide 8

Phase I & The Walls Yale N. Patt, Microprocessor Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 8

Slide 9

CPU Performance Trends
[Figure: performance relative to the VAX-11/780 (log scale, 1 to 100,000) vs. year, 1978-2012, with growth regimes of 25%/year, 52%/year, and 22%/year; data points run from the VAX-11/780 (5 MHz) through MIPS, Sun, HP, IBM, and Digital Alpha workstations up to multi-core Intel Xeons (24,129x).]
Hennessy, John L.; Patterson, David A., 2011, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann.
9

Slide 10

40 Years of Microprocessor Trend Data
https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/
10

Slide 11

Processor-Memory Performance Gap
[Figure: performance (log scale, 1 to 100,000) vs. year, 1980-2010, with diverging "Processor" and "Memory" curves.]
The difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access.
Hennessy, John L.; Patterson, David A., 2011, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann.
Computer Architecture is Back: Parallel Computing Landscape https://www.youtube.com/watch?v=On-k-E5HpcQ
11

Slide 12

DRAM Performance Trends D. Lee: "Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity." http://arxiv.org/abs/1604.08041 (2016) D. Lee et al., "Tiered-latency DRAM: A low latency and low cost DRAM architecture," in HPCA, 2013. 12

Slide 13

Emerging Memory Technologies - Further Down The Hierarchy Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. 13

Slide 14

NVMs as Storage Class Memories - Bottlenecks: New & Old Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, 2013. 14

Slide 15

DBs Execution Cycles: Useful Computation vs. Stall Cycles R. Panda, C. Erb, M. LeBeane, J. H. Ryoo and L. K. John, "Performance Characterization of Modern Databases on Out-of-Order CPUs," Computer Architecture and High Performance Computing (SBAC-PAD), 2015 27th International Symposium on, Florianopolis, 2015, pp. 114-121. 15

Slide 16

System Calls - Performance Impact Livio Soares and Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 16

Slide 17

System Calls, Interrupts, and Asynchronous I/O Jisoo Yang, Dave B. Minturn, and Frank Hady. 2012. "When poll is better than interrupt." In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA. 17

Slide 18

System Calls as CPU Exceptions Craig B. Zilles, Joel S. Emer, and Gurindar S. Sohi. 1999. "The use of multithreading for exception handling." In Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture (MICRO 32). IEEE Computer Society, Washington, DC, USA, 219-229. 18

Slide 19

Pollution & Context Switch Misses Replaced Miss (D) & Reordered Miss (C) F. Liu, F. Guo, Y. Solihin, S. Kim and A. Eker, "Characterizing and modeling the behavior of context switch misses", Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008. 19

Slide 20

Beyond Mode Switch Time: Footprint & Pollution Livio Soares and Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 20

Slide 21

Beyond Mode Switch Time: Direct & Indirect Costs Livio Soares and Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 21

Slide 22

Feature Scaling Trends Lee, Yunsup, "Decoupled Vector-Fetch Architecture with a Scalarizing Compiler," EECS Department, University of California, Berkeley. 2016. http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-82.html 22

Slide 23

Process-Architecture-Optimization Intel's Annual Report on Form 10-K for the fiscal year ended December 26, 2015, filed with the SEC on February 12, 2016. https://www.sec.gov/Archives/edgar/data/50863/000005086316000105/a10kdocument12262015q4.htm 23

Slide 24

Make it fast Butler W. Lampson. 1983. "Hints for computer system design." In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48. 24

Slide 25

What? 25

Slide 26

Performance: The Early Days A. Greenbaum and T. Chartier. "Numerical Methods: Design, analysis, and computer implementation of algorithms." 2010. Course Notes for Short Course on Numerical Analysis. 26

Slide 27

Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), "On the computational complexity of algorithms", Transactions of the American Mathematical Society 117: 285–306. 27

Slide 29

Complexity: Algorithms & Data Structures
O(N): std::find, http://en.cppreference.com/w/cpp/algorithm/find
O(N·log(N)): std::sort, http://en.cppreference.com/w/cpp/algorithm/sort
O(log(N)): std::lower_bound (on a sorted range), http://en.cppreference.com/w/cpp/algorithm/lower_bound
O(log(N)): std::set::find, http://en.cppreference.com/w/cpp/container/set/find
29
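
A hedged illustration (my addition, not the deck's code) of how the same membership query lands in each of these classes:

// Membership test at three complexity classes (illustrative sketch).
#include <algorithm>
#include <set>
#include <vector>

bool contains_linear(const std::vector<int> & v, int key)
{   // O(N): scan every element
    return std::find(begin(v), end(v), key) != end(v);
}

bool contains_sorted(const std::vector<int> & v, int key)
{   // O(log N) per query, but requires an O(N log N) sort up front
    return std::binary_search(begin(v), end(v), key);
}

bool contains_set(const std::set<int> & s, int key)
{   // O(log N): balanced-tree (red-black) lookup
    return s.find(key) != end(s);
}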

Slide 30

Analysis of Algorithms - Scientific Method Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 30

Slide 31

Analysis of Algorithms - Problem Size N vs. Running Time T(N) Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 31

Slide 32

Analysis of Algorithms - Tilde Notation & Tilde Approximations Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 32

Slide 33

Analysis of Algorithms - Doubling Ratio Experiments Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 33
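
A minimal doubling-ratio experiment (my sketch, not the book's code): if T(N) ~ c·N^b, then T(2N)/T(N) tends to 2^b, so the base-2 log of the measured ratio estimates the exponent b.

#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static double time_sort(std::size_t n)
{
    std::vector<int> v(n);
    std::iota(begin(v), end(v), 0);
    std::shuffle(begin(v), end(v), std::mt19937{42});
    const auto t0 = std::chrono::steady_clock::now();
    std::sort(begin(v), end(v));
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    double prev = time_sort(1u << 16);
    for (std::size_t n = 1u << 17; n <= (1u << 22); n <<= 1) {
        const double cur = time_sort(n);
        std::printf("N = %zu  T(N) = %g s  ratio = %.2f  lg(ratio) = %.2f\n",
                    n, cur, cur / prev, std::log2(cur / prev));
        prev = cur; // lg(ratio) near 1 suggests ~linear, near 2 ~quadratic, etc.
    }
}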

Slide 34

Find Example C++ Code I

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <set>
#include <vector>
#include <boost/container/flat_set.hpp>
#include <EASTL/vector_set.h>

// EASTL
// https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project
void* operator new[](size_t size, const char* pName, int flags,
                     unsigned debugFlags, const char* file, int line)
{
    return malloc(size);
}

34

Slide 35

Find Example C++ Code II

void* operator new[](size_t size, size_t alignment, size_t alignmentOffset,
                     const char* pName, int flags, unsigned debugFlags,
                     const char* file, int line)
{
    return malloc(size);
}

using T = std::uint32_t;

std::vector<T> odd_numbers(std::size_t count)
{
    std::vector<T> result;
    result.reserve(count);
    for (std::size_t i = 0; i != count; i++)
        result.push_back(2 * i + 1);
    return result;
}

35

Slide 36

Find Example C++ Code III

template <typename container_type>
void ctor_and_find(const char * type_name,
                   const std::vector<T> & v, std::size_t q)
{
    printf("%s\n", type_name);
    std::mt19937 prng(1);
    const std::size_t n = v.size();
    std::uniform_int_distribution<T> uniform(0, 2 * n + 2);

    printf("ctor\t");
    auto time_start = std::chrono::steady_clock::now();
    const container_type s(begin(v), end(v));
    auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    printf("duration: %g \n", duration.count());

    printf("search\t");
    time_start = std::chrono::steady_clock::now();
    T sum = 0;
    for (std::size_t i = 0; i != q; ++i) {

36

Slide 37

Find Example C++ Code IV

        const auto it = s.find(uniform(prng));
        sum += (it != end(s)) ? *it : 0;
    }
    time_end = std::chrono::steady_clock::now();
    duration = time_end - time_start;
    printf("duration: %g \t", duration.count());
    printf("sum: %zu \n\n", sum);
}

void ctor_and_find(const char * type_name,
                   const std::vector<T> & v_src, std::size_t q)
{
    printf("%s\n", type_name);
    std::mt19937 prng(1);
    const std::size_t n = v_src.size();
    std::uniform_int_distribution<T> uniform(0, 2*n + 2);

    printf("prep\t");
    auto time_start = std::chrono::steady_clock::now();
    auto v = v_src;

37

Slide 38

Find Example C++ Code V

    std::sort(begin(v), end(v));
    auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    printf("duration: %g \n", duration.count());

    printf("search\t");
    time_start = std::chrono::steady_clock::now();
    T sum = 0;
    for (std::size_t i = 0; i != q; ++i) {
        const auto k = uniform(prng);
        const auto it = std::lower_bound(begin(v), end(v), k);
        sum += (it != end(v)) ? (*it == k ? k : 0) : 0;
    }
    time_end = std::chrono::steady_clock::now();
    duration = time_end - time_start;
    printf("duration: %g \t", duration.count());
    printf("sum: %zu \n\n", sum);
}

38

Slide 39

Find Example C++ Code VI

int main(int argc, char * argv[])
{
    // `n`: elements count (size)
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100;
    printf("size: %zu \n", n);
    // `q`: queries count
    const std::size_t q = (argc > 2) ? std::atoll(argv[2]) : 10;
    printf("queries: %zu \n", q);

    const auto v = odd_numbers(n);
    printf("\n");

    ctor_and_find<std::set<T>>("std::set", v, q);
    ctor_and_find("std::vector: copy & sort", v, q);
    ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q);
    ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q);
}

39

Slide 40

Find Example - Benchmark (Nonius) Code I

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <set>
#include <vector>
#include <boost/container/flat_set.hpp>
#include <EASTL/vector_set.h>
#include <nonius/nonius.h++>

NONIUS_PARAM(size, std::size_t{100u})
NONIUS_PARAM(queries, std::size_t{10u})

40

Slide 41

Find Example - Benchmark (Nonius) Code II

// EASTL
// https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project
void* operator new[](size_t size, const char* pName, int flags,
                     unsigned debugFlags, const char* file, int line)
{
    return malloc(size);
}

void* operator new[](size_t size, size_t alignment, size_t alignmentOffset,
                     const char* pName, int flags, unsigned debugFlags,
                     const char* file, int line)
{
    return malloc(size);
}

using T = std::uint32_t;

std::vector<T> odd_numbers(std::size_t count)
{
    std::vector<T> result;
    result.reserve(count);
    for (std::size_t i = 0; i != count; i++)
        result.push_back(2 * i + 1);
    return result;
}

41

Slide 42

Find Example - Benchmark (Nonius) Code III

template <typename container_type>
T ctor_and_find(const char * type_name,
                const std::vector<T> & v, std::size_t q)
{
    std::mt19937 prng(1);
    const std::size_t n = v.size();
    std::uniform_int_distribution<T> uniform(0, 2 * n + 2);

    const container_type s(begin(v), end(v));

    T sum = 0;
    for (std::size_t i = 0; i != q; ++i) {
        const auto it = s.find(uniform(prng));
        sum += (it != end(s)) ? *it : 0;
    }
    return sum;
}

42

Slide 43

Find Example - Benchmark (Nonius) Code IV

T ctor_and_find(const char * type_name,
                const std::vector<T> & v_src, std::size_t q)
{
    std::mt19937 prng(1);
    const std::size_t n = v_src.size();
    std::uniform_int_distribution<T> uniform(0, 2*n + 2);

    auto v = v_src;
    std::sort(begin(v), end(v));

    T sum = 0;
    for (std::size_t i = 0; i != q; ++i) {
        const auto k = uniform(prng);
        const auto it = std::lower_bound(begin(v), end(v), k);
        sum += (it != end(v)) ? (*it == k ? k : 0) : 0;
    }
    return sum;
}

43

Slide 44

Find Example - Benchmark (Nonius) Code V

NONIUS_BENCHMARK("std::set", [](nonius::chronometer meter) {
    const auto n = meter.param<size>();
    const auto q = meter.param<queries>();
    const auto v = odd_numbers(n);
    meter.measure([q, &v] {
        ctor_and_find<std::set<T>>("std::set", v, q);
    });
});

NONIUS_BENCHMARK("std::vector: copy & sort", [](nonius::chronometer meter) {
    const auto n = meter.param<size>();
    const auto q = meter.param<queries>();
    const auto v = odd_numbers(n);
    meter.measure([q, &v] {
        ctor_and_find("std::vector: copy & sort", v, q);
    });
});

44

Slide 45

Find Example - Benchmark (Nonius) Code VI

NONIUS_BENCHMARK("boost::container::flat_set", [](nonius::chronometer meter) {
    const auto n = meter.param<size>();
    const auto q = meter.param<queries>();
    const auto v = odd_numbers(n);
    meter.measure([q, &v] {
        ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q);
    });
});

NONIUS_BENCHMARK("eastl::vector_set", [](nonius::chronometer meter) {
    const auto n = meter.param<size>();
    const auto q = meter.param<queries>();
    const auto v = odd_numbers(n);
    meter.measure([q, &v] {
        ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q);
    });
});

int main(int argc, char * argv[]) { nonius::main(argc, argv); }

45

Slide 46

Find Example - Benchmark (Nonius) Code I

Nonius: statistics-powered micro-benchmarking framework:
https://nonius.io/
https://github.com/libnonius/nonius

Running:
BNSIZE=10000; BNQUERIES=1000
./find --param=size:$BNSIZE --param=queries:$BNQUERIES > results.size=$BNSIZE.queries=$BNQUERIES.txt
./find --param=size:$BNSIZE --param=queries:$BNQUERIES --reporter=html --output=results.size=$BNSIZE.queries=$BNQUERIES.html

46

Slide 47

Find Example - Results: size=10,000 queries=1,000 47

Slide 48

Find Example - Results: size=10,000,000 queries=1,000,000 48

Slide 49

How? 49

Slide 50

Asymptotic growth & "random access machines"? Tomasz Jurkiewicz and Kurt Mehlhorn. 2015. "On a Model of Virtual Address Translation." J. Exp. Algorithmics 19. http://arxiv.org/abs/1212.0703 & https://people.mpi-inf.mpg.de/~mehlhorn/ftp/KMvat.pdf 50

Slide 51

Asymptotic growth & "random access machines"?
Asymptotic - growing problem size
• for large data, we need to take into account the costs of actually bringing the data in
• communication complexity vs. computation complexity
• including overlapping computation-communication latencies
51

Slide 52

"Operation"? Jack Dongarra. 2016. "With Extreme Scale Computing the Rules Have Changed." In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). 52

Slide 56

Complexity - constants, microarchitecture?
"Array Layouts for Comparison-Based Searching" Paul-Virak Khuong, Pat Morin
http://cglab.ca/~morin/misc/arraylayout-v2/
• "With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search
• (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees).
• It was only through careful and controlled experimentation with different implementations of each of the search algorithms that we are able to understand how the interactions between processor features such as pipelining, prefetching, speculative execution, and conditional moves affect the running times of the search algorithms."
54

Slide 57

Reasoning about Performance: The Scientific Method
Requires - and is enabled by - the knowledge of microarchitectural details.
Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi, Chapter 2 "Methods" from "Readings in Computer Architecture," Morgan Kaufmann, 2000.
Prefetching benefits evaluation: disable/enable prefetchers using likwid-features:
https://github.com/RRZE-HPC/likwid/wiki/likwid-features
Example: https://gist.github.com/MattPD/06e293fb935eaf67ee9c301e70db6975
55
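
As a usage sketch (my addition; flags per the likwid wiki, so double-check the feature names available on your CPU), toggling the hardware prefetcher on core 0 looks roughly like:

likwid-features -c 0 -l                  # list prefetcher features and their state
likwid-features -c 0 -d HW_PREFETCHER    # disable
likwid-features -c 0 -e HW_PREFETCHER    # re-enable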

Slide 58

Microarchitecture
Intel® 64 and IA-32 Architectures Optimization Reference Manual
https://www-ssl.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
56

Slide 59

Pervasive CPU Parallelism
pipeline-level parallelism (PLP)
instruction-level parallelism (ILP)
memory-level parallelism (MLP)
data-level parallelism (DLP)
thread-level parallelism (TLP)
57

Slide 60

Pipelining & Temporal Parallelism D. Sima, "Decisive aspects in the evolution of microprocessors", Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 58

Slide 61

Pipelining: Base N. P. Jouppi and D. W. Wall. 1989. "Available instruction-level parallelism for superscalar and superpipelined machines." In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 59

Slide 62

Pipelining: Superscalar N. P. Jouppi and D. W. Wall. 1989. "Available instruction-level parallelism for superscalar and superpipelined machines." In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 60

Slide 63

The Cache Liptay, J. S. (1968) "Structural Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 61

Slide 64

The Cache: Processor-Memory Performance Gap Liptay, J. S. (1968) "Structural Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 62

Slide 65

The Cache: Assumptions & Effectiveness Liptay, J. S. (1968) "Structural Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 63

Slide 66

Out-of-Order Execution R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 64

Slide 67

Out-of-Order Execution: Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 65

Slide 68

Out-of-Order Execution: Reservation Stations R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 66

Slide 69

Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 67

Slide 70

Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 68

Slide 71

Out-of-Order Execution of Simple Micro-Operations Y.N. Patt, W.M. Hwu, and M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 69

Slide 72

Out-of-Order Execution: Restricted Dataflow Y.N. Patt, W.M. Hwu, and M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 70

Slide 73

Out-of-Order Execution: Results Buffer Y.N. Patt, W.M. Hwu, and M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 71

Slide 74

Pipelining & Precise Exceptions: Reorder Buffer (ROB) J.E. Smith and A.R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. 12th Ann. IEEE/ACM Int’l Symp. Computer Architecture, 1985, pp. 36–44. 72

Slide 75

Execution: Superscalar & Out-Of-Order J.E. Smith and G.S. Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 73

Slide 76

Superscalar CPU Organization J.E. Smith and G.S. Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 74

Slide 77

Superscalar CPU: ROB J.E. Smith and G.S. Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 75

Slide 78

Computer Architecture: A Science of Tradeoffs "My tongue in cheek phrase to emphasize the importance of tradeoffs to the discipline of computer architecture. Clearly, computer architecture is more art than science. Science, we like to think, involves a coherent body of knowledge, even though we have yet to figure out all the connections. Art, on the other hand, is the result of individual expressions of the various artists. Since each computer architecture is the result of the individual(s) who specified it, there is no such completely coherent structure. So, I opined if computer architecture is a science at all, it is a science of tradeoffs. In class, we keep coming up with design choices that involve tradeoffs. In my view, "tradeoffs" is at the heart of computer architecture." — Yale N. Patt 76

Slide 79

Design Points: Dictated by the Application Space
The design of a microprocessor is about making relevant tradeoffs. We refer to the set of considerations, along with the relevant importance of each, as the "design point" for the microprocessor - that is, the characteristics that are most important to the use of the microprocessor, such that one is willing to be less concerned about other characteristics. In each case, it is usually the problem we are addressing . . . which dictates the design point for the microprocessor, and the resulting tradeoffs that must be made.
Patt, Y., & Cockrell, E. (2001). "Requirements, bottlenecks, and good fortune: Agents for microprocessor evolution." Proceedings of the IEEE, 89(11), 1553-1559.
77

Slide 80

A Science of Tradeoffs Software Performance Optimization - Analogous! The multiplicity of tradeoffs: • Multidimensional • Multiple levels • Costs and benefits 78

Slide 81

Trade-offs - Latency & Bandwidth I

Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
        Memory node
Socket       0
     0    60.4

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using traffic with the following read-write ratios
ALL Reads        : 24152.0
3:1 Reads-Writes : 22313.2
2:1 Reads-Writes : 22050.5
1:1 Reads-Writes : 21130.4
Stream-triad like: 21559.4

79

Slide 82

Trade-offs - Latency & Bandwidth II

Measuring Memory Bandwidths between nodes within system
Using Read-only traffic type
        Memory node
Socket       0
     0    24155.0

Measuring Loaded Latencies for the system
Using Read-only traffic type
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
 00000  122.27   24109.6
 00002  121.99   24082.7
 00008  120.60   23952.1
 00015  119.28   23837.6
 00050   70.87   17408.7
 00100   64.59   12496.6

80

Slide 83

Trade-offs - Latency & Bandwidth III

Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
 00200   61.76    8129.1
 00300   60.75    6194.8
 00400   60.63    5085.6
 00500   60.12    4377.0
 00700   60.51    3505.2
 01000   60.60    2812.6
 01300   60.66    2425.3
 01700   60.51    2117.0
 02500   60.36    1789.5
 03500   60.33    1585.4
 05000   60.29    1430.9
 09000   60.31    1267.9
 20000   60.32    1154.7

81

Slide 84

Trade-offs - Latency & Size I
Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
http://www.7-cpu.com/cpu/SandyBridge.html

Size    Latency (cycles)  Increase    Description
32 K     4
64 K     8                4           + 8 (L2)
128 K   10                2
256 K   11                1
512 K   20                9           + 16 (L3)
1 M     24                4
2 M     26                2
4 M     27 + 18 ns        1 + 18 ns   + 56 ns (RAM)
8 M     28 + 38 ns        1 + 20 ns
16 M    28 + 47 ns        9 ns
32 M    28 + 52 ns        5 ns
64 M    28 + 54 ns        2 ns
128 M   36 + 55 ns        8 + 1 ns    + 16 (TLB miss)

82

Slide 85

Trade-offs - Latency & Size II

Size     Latency (cycles)  Increase  Description
256 M    40 + 56 ns        4 + 1 ns
512 M    42 + 56 ns        2
1024 M   43 + 56 ns        1
2048 M   44 + 56 ns        1
4096 M   44 + 56 ns        0
8192 M   53 + 56 ns        9         + 18 (PDPTE cache miss)

Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
http://www.7-cpu.com/cpu/SandyBridge.html
83
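
How numbers like these are typically measured - an illustrative pointer-chasing sketch (my addition, not 7-cpu.com's code): build a single random cycle over a working set of a given size and chase it, so every load depends on the previous one and the time per step approximates the load-to-use latency at that size.

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const std::size_t bytes = 32u << 20;                // working-set size to probe (~32 MB here)
    const std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> next(n);
    std::iota(begin(next), end(next), std::size_t{0});
    std::mt19937_64 rng{1};
    // Sattolo's algorithm: a uniformly random *single* cycle over all slots
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    const std::size_t steps = 10000000;
    std::size_t p = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != steps; ++i)
        p = next[p];                                    // serialized, data-dependent loads
    const auto t1 = std::chrono::steady_clock::now();
    std::printf("%g ns/load (sink: %zu)\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / steps, p);
}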

Slide 86

Trade-offs - Least Squares Golub & Van Loan (2013) "Matrix Computations" Trade-offs: FLOPs (FLoating-point OPerations) vs. Applicability / Numerical Stability / Speed / Accuracy Example: Catalogue of dense decompositions: http://eigen.tuxfamily.org/dox/group__TopicLinearAlgebraDecompositions.html 84

Slide 87

Trade-offs - Multidimensional - Numerical Optimization Ben Recht, Feng Niu, Christopher Ré, Stephen Wright. "Lock-Free Approaches to Parallelizing Stochastic Gradient Descent" OPT 2011: 4th International Workshop on Optimization for Machine Learning http://opt.kyb.tuebingen.mpg.de/slides/opt2011-recht.pdf 85

Slide 88

Trade-offs - Multiple levels - Numerical Optimization
Gradient computation - accuracy vs. function evaluations, f : R^d → R^N
• Finite differencing:
  • forward-difference: O(√ε_M) error, d · O(Cost(f)) evaluations
  • central-difference: O(ε_M^(2/3)) error, 2d · O(Cost(f)) evaluations
  w/ the machine epsilon ε_M := inf{ε > 0 : 1.0 + ε ≠ 1.0}
• Algorithmic differentiation (AD): precision - as in hand-coded analytical gradient
  • rough forward-mode cost d · O(Cost(f))
  • rough reverse-mode cost N · O(Cost(f))
86
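
A hedged sketch of the two finite-difference schemes above (my addition, not the speaker's code), using the standard step sizes √ε_M and ε_M^(1/3) that balance truncation against rounding error:

#include <cmath>
#include <cstdio>
#include <limits>

double f(double x) { return std::sin(x); }   // example function; exact derivative is cos(x)

int main()
{
    const double eps = std::numeric_limits<double>::epsilon();
    const double x = 1.0;
    const double h_fwd = std::sqrt(eps);     // forward-difference step ~ sqrt(eps_M)
    const double h_ctr = std::cbrt(eps);     // central-difference step ~ eps_M^(1/3)
    const double d_fwd = (f(x + h_fwd) - f(x)) / h_fwd;            // 1 extra evaluation
    const double d_ctr = (f(x + h_ctr) - f(x - h_ctr)) / (2 * h_ctr); // 2 extra evaluations
    std::printf("forward: %.12f  central: %.12f  exact: %.12f\n",
                d_fwd, d_ctr, std::cos(x));
}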

Slide 89

Trade-offs: Costs and Benefits Gabriel, Richard P. (1985). "Performance and Evaluation of Lisp Systems." Cambridge, Mass: MIT Press; Computer Systems Series. 87

Slide 90

Costs and Benefits: Implications • Important to know what to focus on • Optimize the optimization: so that it doesn't always take hours or days or weeks or months... 88

Slide 91

Superscalar CPU Model Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2009. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2, Article 3. 89

Slide 92

Instruction Level Parallelism & Loop Unrolling - Code I

#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <boost/timer/timer.hpp>

90

Slide 93

Instruction Level Parallelism & Loop Unrolling - Code II

using T = double;

T sum_1(const std::vector<T> & input)
{
    T sum = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; ++i)
        sum += input[i];
    return sum;
}

T sum_2(const std::vector<T> & input)
{
    T sum1 = 0.0, sum2 = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; i += 2) {
        sum1 += input[i];
        sum2 += input[i + 1];
    }
    return sum1 + sum2;
}

91

Slide 94

Instruction Level Parallelism & Loop Unrolling - Code III

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 10000000;
    const std::size_t f = (argc > 2) ? std::atoll(argv[2]) : 1;
    std::cout << "n = " << n << '\n'; // iterations count
    std::cout << "f = " << f << '\n'; // unroll factor

    const std::vector<T> a(n, T(1));

    boost::timer::auto_cpu_timer timer;
    const T sum = (f == 1) ? sum_1(a) : (f == 2) ? sum_2(a) : 0;
    std::cout << sum << '\n';
}

92

Slide 95

Instruction Level Parallelism & Loop Unrolling - Results

make vector_sums CXXFLAGS="-std=c++14 -O2 -march=native" LDLIBS=-lboost_timer

$ ./vector_sums 1000000000 2
n = 1000000000
f = 2
1e+09
0.466293s wall, 0.460000s user + 0.000000s system = 0.460000s CPU (98.7%)

$ ./vector_sums 1000000000 1
n = 1000000000
f = 1
1e+09
0.841269s wall, 0.840000s user + 0.010000s system = 0.850000s CPU (101.0%)

93

Slide 96

perf • https://perf.wiki.kernel.org/ • http://www.brendangregg.com/perf.html 94
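
The counter readings on the next two slides are perf stat output; a typical invocation (assuming the vector_sums binary built above) would be:

$ perf stat ./vector_sums 1000000000 1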

Slide 97

perf Results - sum_1

Performance counter stats for './vector_sums 1000000000 1':

      1675.812457  task-clock (msec)        #    0.850 CPUs utilized
               34  context-switches         #    0.020 K/sec
                5  cpu-migrations           #    0.003 K/sec
            8,953  page-faults              #    0.005 M/sec
    5,760,418,457  cycles                   #    3.437 GHz
    3,456,046,515  stalled-cycles-frontend  #   60.00% frontend cycles idle
    8,225,763,566  instructions             #    1.43  insns per cycle
                                            #    0.42  stalled cycles per insn
    2,050,710,005  branches                 # 1223.711 M/sec
          104,331  branch-misses            #    0.01% of all branches

      1.970909249 seconds time elapsed

95

Slide 98

perf Results - sum_2

Performance counter stats for './vector_sums 1000000000 2':

      1283.910371  task-clock (msec)        #    0.835 CPUs utilized
               38  context-switches         #    0.030 K/sec
                3  cpu-migrations           #    0.002 K/sec
            9,466  page-faults              #    0.007 M/sec
    4,458,594,733  cycles                   #    3.473 GHz
    2,149,690,303  stalled-cycles-frontend  #   48.21% frontend cycles idle
    6,734,925,029  instructions             #    1.51  insns per cycle
                                            #    0.32  stalled cycles per insn
    1,552,029,608  branches                 # 1208.830 M/sec
          119,358  branch-misses            #    0.01% of all branches

      1.537971058 seconds time elapsed

96

Slide 99

GCC Explorer: sum_1 (C++) http://gcc.godbolt.org/ 97

Slide 100

GCC Explorer: sum_1 (x86-64 Assembly) http://gcc.godbolt.org/ 98

Slide 101

GCC Explorer: sum_2 (C++) http://gcc.godbolt.org/ 99

Slide 102

GCC Explorer: sum_2 (x86-64 Assembly) http://gcc.godbolt.org/ 100

Slide 103

Intel Architecture Code Analyzer (IACA)

#include <iacaMarks.h>

T sum_2(const std::vector<T> & input)
{
    T sum1 = 0.0, sum2 = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; i += 2) {
        IACA_START
        sum1 += input[i];
        sum2 += input[i + 1];
    }
    IACA_END
    return sum1 + sum2;
}

$ g++ -std=c++14 -O2 -march=native vector_sums_2i.cpp -o vector_sums_2i
$ iaca -64 -arch IVB -graph ./vector_sums_2i

• https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
• https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it
• http://kylehegeman.com/blog/2013/12/28/introduction-to-iaca/
101

Slide 104

IACA Results - sum_1

$ iaca -64 -arch IVB -graph ./vector_sums_1i
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ./vector_sums_1i
Binary Format - 64Bit
Architecture - IVB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0  -  DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |
-------------------------------------------------------------------------
| Cycles | 1.0    0.0 | 1.0 | 1.0   1.0 | 1.0   1.0 | 0.0 | 1.0 |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |            Ports pressure in cycles             |    |
|  Uops  | 0 - DV | 1   | 2  -  D   | 3  -  D   | 4   | 5   |    |
---------------------------------------------------------------------
|   1    |        |     | 1.0   1.0 |           |     |     |    | mov rdx, qword ptr [rdi]
|   2    |        | 1.0 |           | 1.0   1.0 |     |     | CP | vaddsd xmm0, xmm0, qword ptr [rdx+rax*8]
|   1    | 1.0    |     |           |           |     |     |    | add rax, 0x1
|   1    |        |     |           |           |     | 1.0 |    | cmp rax, rcx
|   0F   |        |     |           |           |     |     |    | jnz 0xffffffffffffffe7
Total Num Of Uops: 5

102

Slide 105

IACA Results - sum_2

$ iaca -64 -arch IVB -graph ./vector_sums_2i
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - ./vector_sums_2i
Binary Format - 64Bit
Architecture - IVB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 6.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0  -  DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |
-------------------------------------------------------------------------
| Cycles | 1.5    0.0 | 3.0 | 1.5   1.5 | 1.5   1.5 | 0.0 | 1.5 |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |            Ports pressure in cycles             |    |
|  Uops  | 0 - DV | 1   | 2  -  D   | 3  -  D   | 4   | 5   |    |
---------------------------------------------------------------------
|   1    |        |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rcx, qword ptr [rdi]
|   2    |        | 1.0 | 0.5   0.5 | 0.5   0.5 |     |     | CP | vaddsd xmm0, xmm0, qword ptr [rcx+rax*8]
|   1    | 1.0    |     |           |           |     |     |    | add rax, 0x2
|   2    |        | 1.0 | 0.5   0.5 | 0.5   0.5 |     |     |    | vaddsd xmm1, xmm1, qword ptr [rcx+rdx*1]
|   1    | 0.5    |     |           |           |     | 0.5 |    | add rdx, 0x10
|   1    |        |     |           |           |     | 1.0 |    | cmp rax, rsi
|   0F   |        |     |           |           |     |     |    | jnz 0xffffffffffffffde
|   1    |        | 1.0 |           |           |     |     | CP | vaddsd xmm0, xmm0, xmm1
Total Num Of Uops: 9

103

Slide 106

IACA Data Dependency Graph - sum_1 104

Slide 107

IACA Data Dependency Graph - sum_2 105

Slide 108

Work, Depth, and Parallelism Guy E. Blelloch, "Programming parallel algorithms", Communications of the ACM, 1996. 106
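
A standard worked example, added here for concreteness (following Blelloch's definitions): summing N numbers with a balanced reduction tree performs W(N) = N − 1 additions (work) in D(N) = ⌈log2 N⌉ steps (depth), so the available parallelism is W/D, roughly N / log2 N - for N = 10^6, about 50,000-fold.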

Slide 109

ILP & Data (In)dependence G. S. Tjaden and M. J. Flynn, ‘‘Detection and Parallel Execution of Independent Instructions,’’ IEEE Transactions on Computers, vol. C-19, pp. 889-895, October 1970. 107

Slide 110

ILP vs. Dependencies D. W. Wall, “Limits of instruction-level parallelism,” Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 108

Slide 111

ILP, Criticality & Latency Hiding D. W. Wall, “Limits of instruction-level parallelism,” Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 109

Slide 112

Empty Issue Slots: Horizontal Waste & Vertical Waste D. M. Tullsen, S. J. Eggers and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Proceedings, 22nd Annual International Symposium on Computer Architecture, 1995. 110

Slide 113

Wasted Slots: Causes D. M. Tullsen, S. J. Eggers and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, Santa Margherita Ligure, Italy, 1995, pp. 392-403. 111

Slide 114

Wasted Slots: Miss Events Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2006. "A performance counter architecture for computing accurate CPI components." SIGOPS Oper. Syst. Rev. 40, 5 (October 2006), 175-184. 112

Slide 115

likwid • https://github.com/RRZE-HPC/likwid • https://github.com/RRZE-HPC/likwid/wiki • https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr 113

Slide 116

likwid Results - sum_1: 489 Scalar MUOPS/s

$ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 1
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 1000000000
f = 1
1e+09
1.090122s wall, 0.880000s user + 0.000000s system = 0.880000s CPU (80.7%)
--------------------------------------------------------------------------------
Group 1: FLOPS_DP
+--------------------------------------+---------+------------+
|                Event                 | Counter |   Core 0   |
+--------------------------------------+---------+------------+
|          INSTR_RETIRED_ANY           |  FIXC0  | 8002493499 |
|        CPU_CLK_UNHALTED_CORE         |  FIXC1  | 4285189526 |
|        CPU_CLK_UNHALTED_REF          |  FIXC2  | 3258346806 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE |  PMC0   |     0      |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE |  PMC1   | 1000155741 |
|      SIMD_FP_256_PACKED_DOUBLE       |  PMC2   |     0      |
+--------------------------------------+---------+------------+
+----------------------+-----------+
|        Metric        |  Core 0   |
+----------------------+-----------+
| Runtime (RDTSC) [s]  |  2.0456   |
| Runtime unhalted [s] |  1.6536   |
|     Clock [MHz]      | 3408.2011 |
|         CPI          |  0.5355   |
|       MFLOP/s        | 488.9303  |
|     AVX MFLOP/s      |     0     |
|    Packed MUOPS/s    |     0     |
|    Scalar MUOPS/s    | 488.9303  |
+----------------------+-----------+

114

Slide 117

likwid Results - sum_2: 595 Scalar MUOPS/s

$ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 2
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 1000000000
f = 2
1e+09
0.620421s wall, 0.470000s user + 0.000000s system = 0.470000s CPU (75.8%)
--------------------------------------------------------------------------------
Group 1: FLOPS_DP
+--------------------------------------+---------+------------+
|                Event                 | Counter |   Core 0   |
+--------------------------------------+---------+------------+
|          INSTR_RETIRED_ANY           |  FIXC0  | 6502566958 |
|        CPU_CLK_UNHALTED_CORE         |  FIXC1  | 2948446599 |
|        CPU_CLK_UNHALTED_REF          |  FIXC2  | 2223894218 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE |  PMC0   |     0      |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE |  PMC1   | 1000328727 |
|      SIMD_FP_256_PACKED_DOUBLE       |  PMC2   |     0      |
+--------------------------------------+---------+------------+
+----------------------+-----------+
|        Metric        |  Core 0   |
+----------------------+-----------+
| Runtime (RDTSC) [s]  |  1.6809   |
| Runtime unhalted [s] |  1.1377   |
|     Clock [MHz]      | 3435.8987 |
|         CPI          |  0.4534   |
|       MFLOP/s        | 595.1079  |
|     AVX MFLOP/s      |     0     |
|    Packed MUOPS/s    |     0     |
|    Scalar MUOPS/s    | 595.1079  |
+----------------------+-----------+

115

Slide 118

likwid Results: sum_vectorized: 676 AVX MFLOP/s

g++ -std=c++14 -O2 -ftree-vectorize -ffast-math -march=native -lboost_timer vector_sums.cpp -o vector_sums_vf

$ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums_vf 1000000000 1
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 1000000000
f = 1
1e+09
0.561288s wall, 0.390000s user + 0.000000s system = 0.390000s CPU (69.5%)
--------------------------------------------------------------------------------
Group 1: FLOPS_DP
+--------------------------------------+---------+------------+
|                Event                 | Counter |   Core 0   |
+--------------------------------------+---------+------------+
|          INSTR_RETIRED_ANY           |  FIXC0  | 3002491149 |
|        CPU_CLK_UNHALTED_CORE         |  FIXC1  | 2709364345 |
|        CPU_CLK_UNHALTED_REF          |  FIXC2  | 2043804906 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE |  PMC0   |     0      |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE |  PMC1   |     91     |
|      SIMD_FP_256_PACKED_DOUBLE       |  PMC2   | 260258099  |
+--------------------------------------+---------+------------+
+----------------------+-----------+
|        Metric        |  Core 0   |
+----------------------+-----------+
| Runtime (RDTSC) [s]  |  1.5390   |
| Runtime unhalted [s] |  1.0454   |
|     Clock [MHz]      | 3435.5297 |
|         CPI          |  0.9024   |
|       MFLOP/s        | 676.4420  |
|     AVX MFLOP/s      | 676.4420  |
|    Packed MUOPS/s    | 169.1105  |
|    Scalar MUOPS/s    |  0.0001   |
+----------------------+-----------+

116

Slide 119

Performance: CPI Steven K. Przybylski, "Cache and Memory Hierarchy Design – A Performance-Directed Approach," San Fransisco, Morgan-Kaufmann, 1990. 117

Slide 120

Performance: [YMMV]PI - Power Grochowski, E., Ronen, R., Shen, J., & Wang, H. (2004). "Best of Both Latency and Throughput." Proceedings of the IEEE International Conference on Computer Design. 118

Slide 121

Performance: [YMMV]PI - Graphs Scott Beamer, Krste Asanović, and David A. Patterson. "GAIL: The Graph Algorithm Iron Law." Workshop on Irregular Applications: Architectures and Algorithms (IAˆ3), at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015. 119

Slide 122

Performance: [YMMV]PI - Packets

packet_processing_time = seconds/packet
                       = instructions/packet * clock_cycles/instruction * seconds/clock_cycle
                       = clock_cycles/packet * seconds/clock_cycle
                       = CPP / core_frequency

cycles per packet (CPP)
http://blogs.cisco.com/sp/a-bigger-helping-of-internet-please
120
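
To attach magnitudes (illustrative numbers, my addition): at a 3 GHz core clock, a budget of CPP = 600 cycles per packet corresponds to 3e9 / 600 = 5 million packets per second per core; halving CPP doubles the packet rate.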

Slide 123

Performance: separable components of a CPI

CPI = (Infinite-cache CPI) + finite-cache effect (FCE)
Infinite-cache CPI = execute busy (EBusy) + execute idle (EIdle)
FCE = (cycles per miss) × (misses per instruction)
    = (miss penalty) × (miss rate)

P. G. Emma. "Understanding some simple processor-performance limits." IBM Journal of Research and Development, 41(3):215-232, May 1997.
121
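
A quick plug-in example (numbers invented for illustration): with an infinite-cache CPI of 0.8, a miss rate of 0.01 misses/instruction, and a 100-cycle miss penalty, FCE = 0.01 × 100 = 1.0, so CPI = 0.8 + 1.0 = 1.8 - the finite-cache effect alone exceeds the entire infinite-cache component.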

Slide 124

Pipelining & Branches P. Emma and E. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Trans. Computers C-36, No. 7, 859-875 (July 1987) 122

Slide 125

Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2009. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2, Article 3. 123

Slide 126

Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and James E. Smith, "A Performance Counter Architecture for Computing Accurate CPI Components", ASPLOS 2006, pp. 175-184. 124

Slide 127

Branch (Mis)Prediction Example I

#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <random>
#include <vector>
#include <boost/timer/timer.hpp>

double sum1(const std::vector<double> & x, const std::vector<bool> & which)
{
    double sum = 0.0;
    for (std::size_t i = 0, n = which.size(); i != n; ++i) {
        sum += which[i] ? std::cos(x[i]) : std::sin(x[i]);
    }
    return sum;
}

125

Slide 128

Branch (Mis)Prediction Example II

double sum2(const std::vector<double> & x, const std::vector<bool> & which)
{
    double sum = 0.0;
    for (std::size_t i = 0, n = which.size(); i != n; ++i) {
        sum += which[i] ? std::sin(x[i]) : std::cos(x[i]);
    }
    return sum;
}

std::vector<bool> inclusion_random(std::size_t n, double p)
{
    std::vector<bool> which;
    which.reserve(n);
    static std::mt19937 g(1);
    std::bernoulli_distribution decision(p);
    for (std::size_t i = 0; i != n; ++i)
        which.push_back(decision(g));

126

Slide 129

Branch (Mis)Prediction Example III

    return which;
}

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    std::cout << "n = " << n << '\n';

    // branch takenness / predictability type
    // 0: never; 1: always; 2: random
    const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0;
    std::cout << "type = " << type << '\n';

    // takenness probability
    // 0.0: never; 1.0: always
    const double p = (argc > 3) ? std::atof(argv[3]) : 0.5;
    std::cout << "p = " << p << '\n';

127

Slide 130

Branch (Mis)Prediction Example IV

    std::vector<bool> which;
    if (type == 0) which.resize(n, false);
    else if (type == 1) which.resize(n, true);
    else if (type == 2) which = inclusion_random(n, p);

    const std::vector<double> x(n, 1.1);

    boost::timer::auto_cpu_timer timer;
    std::cout << sum1(x, which) + sum2(x, which) << '\n';
}

128

Slide 131

Timing: Branch (Mis)Prediction Example

$ make BP CXXFLAGS="-std=c++14 -O3 -march=native" LDLIBS=-lboost_timer-mt

$ ./BP 10000000 0
n = 10000000
type = 0
1.3448e+007
1.190391s wall, 1.187500s user + 0.000000s system = 1.187500s CPU (99.8%)

$ ./BP 10000000 1
n = 10000000
type = 1
1.3448e+007
1.172734s wall, 1.156250s user + 0.000000s system = 1.156250s CPU (98.6%)

$ ./BP 10000000 2
n = 10000000
type = 2
1.3448e+007
1.296455s wall, 1.296875s user + 0.000000s system = 1.296875s CPU (100.0%)

129

Slide 132

Likwid: Branch (Mis)Prediction Example

$ likwid-perfctr -C S0:1 -g BRANCH -f ./BP 10000000 0
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 10000000
type = 0
1.3448e+07
0.445464s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%)
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|            Event             | Counter |   Core 1   |
+------------------------------+---------+------------+
|      INSTR_RETIRED_ANY       |  FIXC0  | 2495177597 |
|    CPU_CLK_UNHALTED_CORE     |  FIXC1  | 1167613066 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1167632206 |
| BR_INST_RETIRED_ALL_BRANCHES |  PMC0   | 372952380  |
| BR_MISP_RETIRED_ALL_BRANCHES |  PMC1   |   14796    |
+------------------------------+---------+------------+
+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|    Runtime (RDTSC) [s]     |    0.4586    |
|    Runtime unhalted [s]    |    0.4505    |
|        Clock [MHz]         |  2591.5373   |
|            CPI             |    0.4679    |
|        Branch rate         |    0.1495    |
| Branch misprediction rate  | 5.929838e-06 |
| Branch misprediction ratio | 3.967263e-05 |
|  Instructions per branch   |    6.6903    |
+----------------------------+--------------+

130

Slide 133

Likwid: Branch (Mis)Prediction Example

$ likwid-perfctr -C S0:1 -g BRANCH -f ./BP 10000000 1
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 10000000
type = 1
1.3448e+07
0.445354s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%)
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|            Event             | Counter |   Core 1   |
+------------------------------+---------+------------+
|      INSTR_RETIRED_ANY       |  FIXC0  | 2495177490 |
|    CPU_CLK_UNHALTED_CORE     |  FIXC1  | 1167125701 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1167146162 |
| BR_INST_RETIRED_ALL_BRANCHES |  PMC0   | 372952366  |
| BR_MISP_RETIRED_ALL_BRANCHES |  PMC1   |   14720    |
+------------------------------+---------+------------+
+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|    Runtime (RDTSC) [s]     |    0.4584    |
|    Runtime unhalted [s]    |    0.4504    |
|        Clock [MHz]         |  2591.5345   |
|            CPI             |    0.4678    |
|        Branch rate         |    0.1495    |
| Branch misprediction rate  | 5.899380e-06 |
| Branch misprediction ratio | 3.946885e-05 |
|  Instructions per branch   |    6.6903    |
+----------------------------+--------------+

131

Slide 134

Likwid: Branch (Mis)Prediction Example

$ likwid-perfctr -C S0:1 -g BRANCH -f ./BP 10000000 2
--------------------------------------------------------------------------------
CPU name:  Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type:  Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 10000000
type = 2
1.3448e+07
0.509917s wall, 0.510000s user + 0.000000s system = 0.510000s CPU (100.0%)
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|            Event             | Counter |   Core 1   |
+------------------------------+---------+------------+
|      INSTR_RETIRED_ANY       |  FIXC0  | 3191479747 |
|    CPU_CLK_UNHALTED_CORE     |  FIXC1  | 2264945099 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 2264967068 |
| BR_INST_RETIRED_ALL_BRANCHES |  PMC0   | 468135649  |
| BR_MISP_RETIRED_ALL_BRANCHES |  PMC1   |  15326586  |
+------------------------------+---------+------------+
+----------------------------+-----------+
|           Metric           |  Core 1   |
+----------------------------+-----------+
|    Runtime (RDTSC) [s]     |  0.8822   |
|    Runtime unhalted [s]    |  0.8740   |
|        Clock [MHz]         | 2591.5589 |
|            CPI             |  0.7097   |
|        Branch rate         |  0.1467   |
| Branch misprediction rate  |  0.0048   |
| Branch misprediction ratio |  0.0327   |
|  Instructions per branch   |  6.8174   |
+----------------------------+-----------+

132

Slide 135

Perf: Branch (Mis)Prediction Example

$ perf stat -e branches,branch-misses -r 10 ./BP 10000000 0
Performance counter stats for './BP 10000000 0' (10 runs):
    374,121,213  branches                                  ( +- 0.02% )
         23,260  branch-misses    # 0.01% of all branches  ( +- 0.35% )
    0.460392835 seconds time elapsed                       ( +- 0.50% )

$ perf stat -e branches,branch-misses -r 10 ./BP 10000000 1
Performance counter stats for './BP 10000000 1' (10 runs):
    374,040,282  branches                                  ( +- 0.01% )
         23,124  branch-misses    # 0.01% of all branches  ( +- 0.45% )
    0.457583418 seconds time elapsed                       ( +- 0.04% )

$ perf stat -e branches,branch-misses -r 10 ./BP 10000000 2
Performance counter stats for './BP 10000000 2' (10 runs):
    469,331,762  branches                                  ( +- 0.01% )
     15,326,501  branch-misses    # 3.27% of all branches  ( +- 0.01% )
    0.884858777 seconds time elapsed                       ( +- 0.30% )

133

Slide 136

Sniper The Sniper Multi-Core Simulator http://snipersim.org/ 134

Slide 137

Sniper: Branch (Mis)Prediction Example CPI stack: never taken 135

Slide 138

Sniper: Branch (Mis)Prediction Example CPI stack: always taken 136

Slide 139

Sniper: Branch (Mis)Prediction Example CPI stack: randomly taken 137

Slide 140

Sniper: Branch (Mis)Prediction Example CPI graph: never taken 138

Slide 141

Sniper: Branch (Mis)Prediction Example CPI graph: always taken 139

Slide 142

Sniper: Branch (Mis)Prediction Example CPI graph: randomly taken 140

Slide 143

Sniper: Branch (Mis)Prediction Example CPI graph (detailed): always taken 141

Slide 144

Sniper: Branch (Mis)Prediction Example CPI graph (detailed): randomly taken 142

Slide 145

Branch Prediction & Speculative Execution D. Sima, "Decisive aspects in the evolution of microprocessors", Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 143

Slide 146

Block Enlargement Fisher, J. A. (1983). "Very Long Instruction Word architectures and the ELI-512." Proceedings of the 10th Annual International Symposium on Computer Architecture. 144

Slide 147

Block Enlargement Joseph A. Fisher and John J. O'Donnell, "VLIW Machines: Multiprocessors We Can Actually Program," CompCon 84 Proceedings, pp. 299-305, IEEE, 1984. 145

Slide 158

Branch Predictability
• takenness rate?
• transition rate?
• compare:
  • 01010101 (i % 2)
  • 01101101 (i % 3)
  • 10101010 !(i % 2)
  • 10010010 !(i % 3)
  • 00110011 (i / 2) % 2
  • 00011100 (i / 3) % 2
• what they have in common:
  • all predictable!
146
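
A hypothetical driver (my addition, not the deck's code) for experimenting with these patterns: generate one of the sequences above, branch on it, and compare branch-misses across patterns under perf stat -e branches,branch-misses (the compiler may if-convert the branch, so it is worth inspecting the generated assembly):

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100000000;
    std::vector<bool> taken(n);
    for (std::size_t i = 0; i != n; ++i)
        taken[i] = (i % 3) != 0;   // the 01101101... pattern; try i % 2, (i / 2) % 2, ...
    long long t = 0, nt = 0;
    for (std::size_t i = 0; i != n; ++i) {
        if (taken[i]) ++t;         // ~2/3 taken, yet fully predictable
        else ++nt;
    }
    std::printf("taken: %lld, not taken: %lld\n", t, nt);
}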

Slide 159

Branch Predictability & Marker API
https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#using-the-marker-api
https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC

g++ -Ofast -march=native source.cpp -o application -std=c++14 -DLIKWID_PERFMON -lpthread -llikwid
likwid-perfctr -f -C 0-3 -g BRANCH -m ./application

#include <likwid.h>
// . . .
LIKWID_MARKER_START("branch");
// branch code
LIKWID_MARKER_STOP("branch");

147
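
Note (my addition, per the likwid marker-API documentation): in a complete program the marked region also has to be bracketed by the marker lifecycle macros, roughly:

LIKWID_MARKER_INIT;
// ... marked regions as above ...
LIKWID_MARKER_CLOSE;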

Slide 160

Branch Entropy
linear entropy: E_L(p) = 2 × min(p, 1 − p)
intuition: the miss rate is proportional to the probability of the least frequent outcome
148
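
A minimal sketch of the definition above (my addition, not the paper's tooling): E_L is 0 for an always/never-taken branch and peaks at 1 for a 50/50 branch.

#include <algorithm>
#include <cstdio>

double linear_entropy(double p) { return 2.0 * std::min(p, 1.0 - p); }

int main()
{
    for (double p : {0.0, 0.05, 0.5, 0.95, 1.0})
        std::printf("p = %.2f  E_L(p) = %.2f\n", p, linear_entropy(p));
}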

Slide 161

Branch Takenness Probability Sander De Pestel, Stijn Eyerman and Lieven Eeckhout, "Micro-Architecture Independent Branch Behavior Characterization", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 149

Slide 162

Branch Entropy & Miss Rate: Linear Relationship Sander De Pestel, Stijn Eyerman and Lieven Eeckhout, "Micro-Architecture Independent Branch Behavior Characterization", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 150

Slide 163

Branches & Expectations: Code I

#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

#define likely(x) (__builtin_expect(!!(x), 1))
#define unlikely(x) (__builtin_expect(!!(x), 0))
#define unpredictable(x) (__builtin_unpredictable((x)))

151

Slide 164

Slide 164 text

Branches & Expectations: Code II

using T = int;

void f(T z, T & x, T & y) {
    ((z < 0) ? x : y) = 5;
}

void generate_never(std::size_t n, std::vector<T> & zs) {
    zs.reserve(n);
    static std::mt19937 g(1);
    std::uniform_int_distribution<T> z(10, 19);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(z(g));
    return;
}
152

Slide 165

Slide 165 text

Branches & Expectations: Code III

void generate_always(std::size_t n, std::vector<T> & zs) {
    zs.reserve(n);
    static std::mt19937 g(1);
    std::uniform_int_distribution<T> z(-19, -10);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(z(g));
    return;
}

void generate_random(std::size_t n, std::vector<T> & zs) {
    zs.reserve(n);
    static std::mt19937 g(1);
    std::uniform_int_distribution<T> z(-5, 4);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(z(g));
    return;
}
153

Slide 166

Slide 166 text

Branches & Expectations: Code IV

int main(int argc, char * argv[]) {
    // sample size
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    std::cout << "n = " << n << '\n';
    // takenness predictability type
    // 0: never; 1: always; 2: random
    const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0;
    std::cout << "type = " << type << '\t';
    std::vector<T> xs(n), ys(n), zs;
    if (type == 0) { std::cout << "never"; generate_never(n, zs); }
    else if (type == 1) { std::cout << "always"; generate_always(n, zs); }
    else if (type == 2) { std::cout << "random"; generate_random(n, zs); }
    endl(std::cout);
154

Slide 167

Slide 167 text

Branches & Expectations: Code V

    const auto time_start = std::chrono::steady_clock::now();
    T sum = 0;
    for (std::size_t i = 0; i != n; ++i) { f(zs[i], xs[i], ys[i]); }
    const auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    std::cout << "duration: " << duration.count() << '\n';
    endl(std::cout);
    std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n';
    std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n';
}
155

Slide 168

Slide 168 text

Branches & Expectations: Compiling & Timing

g++ -ggdb -std=c++14 -march=native -Ofast ./branches.cpp -o branches_g
clang++ -ggdb -std=c++14 -march=native -Ofast ./branches.cpp -o branches_c

time ./branches_g 1000000 0
time ./branches_g 1000000 1
time ./branches_g 1000000 2
time ./branches_c 1000000 0
time ./branches_c 1000000 1
time ./branches_c 1000000 2
156

Slide 169

Slide 169 text

Branches & Expectations: Timings (GCC)

$ time ./branches_g 1000000 0
n = 1000000
type = 0    never
duration: 0.00082991
sum(xs): 0
sum(ys): 5000000
real 0m0.034s  user 0m0.033s  sys 0m0.003s

$ time ./branches_g 1000000 1
n = 1000000
type = 1    always
duration: 0.000839488
sum(xs): 5000000
sum(ys): 0
real 0m0.031s  user 0m0.030s  sys 0m0.000s

$ time ./branches_g 1000000 2
n = 1000000
type = 2    random
duration: 0.0052968
sum(xs): 2498105
sum(ys): 2501895
real 0m0.038s  user 0m0.033s  sys 0m0.003s
157

Slide 170

Slide 170 text

Branches & Expectations: Timings (Clang)

$ time ./branches_c 1000000 0
n = 1000000
type = 0    never
duration: 0.00091161
sum(xs): 0
sum(ys): 5000000
real 0m0.036s  user 0m0.033s  sys 0m0.000s

$ time ./branches_c 1000000 1
n = 1000000
type = 1    always
duration: 0.000765925
sum(xs): 5000000
sum(ys): 0
real 0m0.036s  user 0m0.033s  sys 0m0.000s

$ time ./branches_c 1000000 2
n = 1000000
type = 2    random
duration: 0.00554585
sum(xs): 2498105
sum(ys): 2501895
real 0m0.041s  user 0m0.040s  sys 0m0.000s
158

Slide 171

Slide 171 text

So many performance events, so little time "So many performance events, so little time," Gerd Zellweger, Denny Lin, Timothy Roscoe. Proceedings of the 7th Asia-Pacific Workshop on Systems (APSys, Hong Kong, China, August 2016). 159

Slide 172

Slide 172 text

Hierarchical cycle accounting Andrzej Nowak, David Levinthal, Willy Zwaenepoel: "Hierarchical cycle accounting: a new method for application performance tuning." ISPASS 2015. https://github.com/David-Levinthal/gooda 160

Slide 173

Slide 173 text

Top-down Microarchitecture Analysis Method (TMAM) https://github.com/andikleen/pmu-tools/wiki/toplev-manual https://sites.google.com/site/analysismethods/yasin-pubs "A Top-Down Method for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 161

Slide 174

Slide 174 text

TMAM: Bottlenecks "A Top-Down Method for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 162

Slide 175

Slide 175 text

TMAM: Breakdown "A Top-Down Method for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 163

Slide 176

Slide 176 text

TMAM: Meaning Updates: https://download.01.org/perfmon/ "A Top-Down Method for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 164

Slide 177

Slide 177 text

Branches & Expectations: TMAM, Level 1 (GCC)

$ ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2
Using level 1.
RUN #1 of 1
perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/
n = 1000000
type = 2    random
duration: 0.00523105
sum(xs): 2498105
sum(ys): 2501895
FE Frontend_Bound: 53.92 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_g 1000000 2
165

Slide 178

Slide 178 text

Branches & Expectations: TMAM, Level 2 (GCC)

$ ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2
Using level 2.
RUN #1 of 2
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1
n = 1000000
type = 2    random
duration: 0.00528841
sum(xs): 2498105
sum(ys): 2501895
RUN #2 of 2
perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,c
n = 1000000
type = 2    random
duration: 0.00550316
sum(xs): 2498105
sum(ys): 2501895
FE Frontend_Bound: 53.94 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 47.54 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
RET Retiring.Microcode_Sequencer: 16.41 % [100.00%]
This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.
Sampling events: idq.ms_uops:u
Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./branches_g 1000000 2
166

Slide 179

Slide 179 text

Branches & Expectations: TMAM, Level 2, perf (GCC)

perf record -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period -o perf.data ./branches_g 1000000 2
perf report -Mintel
167

Slide 180

Slide 180 text

Branches & Expectations: TMAM, Level 1 (Clang)

$ ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2
Using level 1.
RUN #1 of 1
perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/
n = 1000000
type = 2    random
duration: 0.00555177
sum(xs): 2498105
sum(ys): 2501895
FE Frontend_Bound: 45.53 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_c 1000000 2
168

Slide 181

Slide 181 text

Branches & Expectations: TMAM, Level 2 (Clang)

$ ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2
Using level 2.
RUN #1 of 2
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask
n = 1000000
type = 2    random
duration: 0.0055571
sum(xs): 2498105
sum(ys): 2501895
RUN #2 of 2
perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions
n = 1000000
type = 2    random
duration: 0.00556777
sum(xs): 2498105
sum(ys): 2501895
FE Frontend_Bound: 45.54 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 39.20 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
RET Retiring.Microcode_Sequencer: 15.18 % [100.00%]
This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.
Sampling events: idq.ms_uops:u
Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,pe
169

Slide 182

Slide 182 text

Branches & Expectations: TMAM, Level 2, perf (Clang)

perf record -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./branches_c 1000000 2
perf report -Mintel
170

Slide 183

Slide 183 text

Virtual Functions & Indirect Branches: Code I

// headers reconstructed: the slide export stripped the <...> names
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

#define str(s) #s
#define likely(x) (__builtin_expect(!!(x), 1))
#define unlikely(x) (__builtin_expect(!!(x), 0))
#define unpredictable(x) (__builtin_unpredictable(!!(x)))
171

Slide 184

Slide 184 text

Virtual Functions & Indirect Branches: Code II

using T = int;

struct base { virtual T f() const { return 0; } };
struct derived_taken : base { T f() const override { return -1; } };
struct derived_untaken : base { T f() const override { return 1; } };

void f(const base & b, T & x, T & y) {
    ((b.f() < 0) ? x : y) = 119;
}

void generate_never(std::size_t n, std::vector<std::unique_ptr<base>> & zs) {
    zs.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(std::make_unique<derived_untaken>());
    return;
172

Slide 185

Slide 185 text

Virtual Functions & Indirect Branches: Code III

}

void generate_always(std::size_t n, std::vector<std::unique_ptr<base>> & zs)
{
    zs.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        zs.push_back(std::make_unique<derived_taken>());
    return;
}

void generate_random(std::size_t n, std::vector<std::unique_ptr<base>> & zs)
{
    zs.reserve(n);
    static std::mt19937 g(1);
    std::bernoulli_distribution z(0.5);
    for (std::size_t i = 0; i != n; ++i) {
        if (z(g)) zs.emplace_back(std::make_unique<derived_taken>());
        else zs.emplace_back(std::make_unique<derived_untaken>());
173

Slide 186

Slide 186 text

Virtual Functions & Indirect Branches: Code IV

    }
    return;
}

int main(int argc, char * argv[]) {
    // sample size
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    std::cout << "n = " << n << '\n';
    // takenness predictability type
    // 0: never; 1: always; 2: random
    std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0;
    std::cout << "type = " << type << '\t';
    std::vector<T> xs(n), ys(n);
    std::vector<std::unique_ptr<base>> zs;
    if (type == 0) { std::cout << "never"; generate_never(n, zs); }
    else if (type == 1) { std::cout << "always"; generate_always(n, zs); }
174

Slide 187

Slide 187 text

Virtual Functions & Indirect Branches: Code V

    else if (type == 2) { std::cout << "random"; generate_random(n, zs); }
    endl(std::cout);

    auto time_start = std::chrono::steady_clock::now();
    T sum = 0;
    for (std::size_t i = 0; i != n; ++i) { f(*zs[i], xs[i], ys[i]); }
    auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    std::cout << "duration: " << duration.count() << '\n';
    endl(std::cout);
    std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n';
    std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n';
}
175

Slide 188

Slide 188 text

Virtual Functions & Indirect Branches: Compiling & Timing

g++ -ggdb -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_g
clang++ -ggdb -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_c

time ./vbranches_g 10000000 0
time ./vbranches_g 10000000 1
time ./vbranches_g 10000000 2
time ./vbranches_c 10000000 0
time ./vbranches_c 10000000 1
time ./vbranches_c 10000000 2
176

Slide 189

Slide 189 text

Virtual Functions & Indirect Branches: Timings (GCC)

$ time ./vbranches_g 10000000 0
n = 10000000
type = 0    never
duration: 0.0338749
sum(xs): 0
sum(ys): 1190000000
real 0m0.645s  user 0m0.573s  sys 0m0.070s

$ time ./vbranches_g 10000000 1
n = 10000000
type = 1    always
duration: 0.0406144
sum(xs): 1190000000
sum(ys): 0
real 0m0.648s  user 0m0.563s  sys 0m0.083s

$ time ./vbranches_g 10000000 2
n = 10000000
type = 2    random
duration: 0.131803
sum(xs): 595154105
sum(ys): 594845895
real 0m0.956s  user 0m0.863s  sys 0m0.090s
177

Slide 190

Slide 190 text

Virtual Functions & Indirect Branches: Timings (Clang)

$ time ./vbranches_c 10000000 0
n = 10000000
type = 0    never
duration: 0.0314749
sum(xs): 0
sum(ys): 1190000000
real 0m0.623s  user 0m0.530s  sys 0m0.090s

$ time ./vbranches_c 10000000 1
n = 10000000
type = 1    always
duration: 0.0314727
sum(xs): 1190000000
sum(ys): 0
real 0m0.623s  user 0m0.557s  sys 0m0.063s

$ time ./vbranches_c 10000000 2
n = 10000000
type = 2    random
duration: 0.0854935
sum(xs): 595154105
sum(ys): 594845895
real 0m1.863s  user 0m1.800s  sys 0m0.063s
178

Slide 191

Slide 191 text

Virtual Functions & Indirect Branches: TMAM, Level 1 (GCC)

$ ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2
Using level 1.
RUN #1 of 1
perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x9c,umask=0x1/u,cycles:u}'
n = 10000000
type = 2    random
duration: 0.131386
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 35.96 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
BAD Bad_Speculation: 12.98 % [100.00%]
This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.
Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_g 10000000 2
179

Slide 192

Slide 192 text

Virtual Functions & Indirect Branches: TMAM, Level 2 (GCC)

$ ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2
Using level 2.
RUN #1 of 2
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0xc5,umask=0x0/u,cp
n = 10000000
type = 2    random
duration: 0.131247
sum(xs): 595154105
sum(ys): 594845895
RUN #2 of 2
perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,cycles:u,cpu/event=0xa3,umask=0x4,cmask=4
n = 10000000
type = 2    random
duration: 0.131361
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 36.02 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 17.41 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
BAD Bad_Speculation: 12.92 % [100.00%]
This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.
BAD Bad_Speculation.Branch_Mispredicts: 12.75 % [100.00%]
This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Sampling events: br_misp_retired.all_branches:u
Sampling: perf record -g -e cycles:pp:u,cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data .
180

Slide 193

Slide 193 text

Virtual Functions & Indirect Branches: TMAM, Level 3 (GCC)

$ ~/builds/pmu-tools/toplev.py -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2
n = 10000000
type = 2    random
duration: 0.13145
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 35.96 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 17.44 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.69 % [100.00%]
This metric represents cycles fraction the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path, following all sorts of miss-predicted branches. For example, branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.
Sampling events: br_misp_retired.all_branches:u
BAD Bad_Speculation: 12.97 % [100.00%]
This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.
BAD Bad_Speculation.Branch_Mispredicts: 12.82 % [100.00%]
This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Sampling events: br_misp_retired.all_branches:u
Sampling: perf record -g -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./vbranches_g 10000000 2
181

Slide 194

Slide 194 text

Virtual Functions: TMAM, Level 3, perf (GCC)

perf record -g -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_g 10000000 2
perf report -Mintel
182

Slide 195

Slide 195 text

Virtual Functions & Indirect Branches: TMAM, Level 1 (Clang)

$ ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2
Using level 1.
RUN #1 of 1
perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/
n = 10000000
type = 2    random
duration: 0.0858722
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 37.66 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_c 10000000 2
183

Slide 196

Slide 196 text

Virtual Functions & Indirect Branches: TMAM, Level 2 (Clang)

$ ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2
Using level 2.
RUN #1 of 2
perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask
n = 10000000
type = 2    random
duration: 0.0859943
sum(xs): 595154105
sum(ys): 594845895
RUN #2 of 2
perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions
n = 10000000
type = 2    random
duration: 0.0861661
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 37.61 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 26.64 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
RET Retiring.Microcode_Sequencer: 9.04 % [100.00%]
This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.
Sampling events: idq.ms_uops:u
Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,pe
184

Slide 197

Slide 197 text

Virtual Functions & Indirect Branches: TMAM, Level 3 (Clang)

~/builds/pmu-tools/toplev.py -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2
sum(xs): 595154105
sum(ys): 594845895
FE Frontend_Bound: 37.65 % [100.00%]
This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound.
FE Frontend_Bound.Frontend_Latency: 26.63 % [100.00%]
This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction-cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period.
Sampling events: rs_events.empty_end:u
FE Frontend_Bound.Frontend_Latency.MS_Switches: 8.40 % [100.00%]
This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB or MITE pipelines. Certain operations cannot be handled natively by the execution pipeline, and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID, or uncommon conditions like Floating Point Assists when dealing with Denormals.
Sampling events: idq.ms_switches:u
RET Retiring.Microcode_Sequencer: 9.04 % [100.00%]
This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.
Sampling events: idq.ms_uops:u
Sampling: perf record -g -e cycles:pp:u, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,edge=1,cmask=1,name=MS_Switches_IDQ_MS_SWITCHES,period=2000003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./vbranches_c 10000000 2
185

Slide 198

Slide 198 text

Virtual Functions: TMAM, Level 3, perf (Clang)

perf record -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_c 10000000 2
perf report -Mintel
186

Slide 199

Slide 199 text

Branches vs. Predicated Execution https://github.com/mongodb-labs/disasm Interactive Disassembler GUI with optional Intel Architecture Code Analyzer (IACA) integration 187

Slide 200

Slide 200 text

Branches vs. Predicated Execution https://github.com/mongodb-labs/disasm Interactive Disassembler GUI with optional Intel Architecture Code Analyzer (IACA) integration 188
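For comparison with the predicated-execution screenshots, a hypothetical branchless rewrite of f() from the earlier example (my sketch, not necessarily the variant behind the vbranches_cmov_xy binaries above): computing both target addresses and selecting one replaces the control dependence with a data dependence, a form the compiler can often lower to a conditional move.

using T = int;

void f_select(T z, T & x, T & y) {
    T * p[2] = { &y, &x };   // both targets computed unconditionally
    *p[z < 0] = 5;           // data-dependent select: index 1 (x) if z < 0
}

Whether this actually compiles to cmov depends on the compiler and flags; the disassembler/IACA setup on this slide is one way to check.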

Slide 201

Slide 201 text

Compiler-Specific Built-in Functions GCC & Clang: __builtin_expect http://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html likely & unlikely https://kernelnewbies.org/FAQ/LikelyUnlikely Clang: __builtin_unpredictable http://clang.llvm.org/docs/LanguageExtensions.html#builtin-unpredictable 189
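A short usage sketch for the builtins listed above (mine; __builtin_unpredictable is Clang-only, so this snippet assumes Clang):

int process(const int * p, int n, int threshold) {
    if (__builtin_expect(p == nullptr, 0))   // "unlikely": cold error path
        return -1;
    int sum = 0;
    for (int i = 0; i != n; ++i)
        if (__builtin_unpredictable(p[i] < threshold))  // data-dependent,
            sum += p[i];   // near-50/50 condition: hint to prefer predication
    return sum;
}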

Slide 202

Slide 202 text

Branch Misprediction, Speculation, and Wrong-Path Execution J. Reineke et al., “A Definition and Classification of Timing Anomalies,” Proc. Int'l Workshop Worst Case Execution Time (WCET), 2006. 190

Slide 203

Slide 203 text

Branch Misprediction Penalty & Wrong-Path Execution Tejas S. Karkhanis and James E. Smith. 2004. "A First-Order Superscalar Processor Model." In Proceedings of the 31st annual international symposium on Computer architecture (ISCA '04). 191

Slide 204

Slide 204 text

The Curse of Multiple Granularities Seshadri, V. (2016). "Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 192

Slide 205

Slide 205 text

Word Granularity != Cache Line Granularity Seshadri, V. (2016). "Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 193

Slide 206

Slide 206 text

Shortcomings of Strided Access Patterns Seshadri, V. (2016). "Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 194

Slide 207

Slide 207 text

Pointer Chasing Example - Vector http://pythontutor.com/cpp.html https://github.com/pgbovine/opt-cpp-backend 195

Slide 208

Slide 208 text

Pointer Chasing Example - (Singly) Linked List 196

Slide 209

Slide 209 text

Pointer Chasing Example - Doubly Linked List 197

Slide 210

Slide 210 text

Pointer Chasing Example - Linked List - C++

#include <algorithm>
#include <forward_list>
#include <iterator>

bool found(const std::forward_list<int> & list, int value) {
    return find(begin(list), end(list), value) != end(list);
}

int main() {
    std::forward_list<int> list {11, 22, 33, 44, 55};
    return found(list, 42);
}
198

Slide 211

Slide 211 text

Pointer Chasing Example - Linked List - ASM https://godbolt.org/g/rkzQ90 199

Slide 212

Slide 212 text

Pointer Chasing Example - Linked List - ASM (r2) http://rada.re/ 200

Slide 213

Slide 213 text

Pointer Chasing Example - Linked List - CFG (r2) 201

Slide 214

Slide 214 text

Pointer Chasing Example - Linked List - CFG (r2)

radiff2 -g sym.found forward_list_app forward_list_app > forward_list_found.dot
xdot forward_list_found.dot
dot -Tpng -o forward_list_found.png forward_list_found.dot
202

Slide 215

Slide 215 text

Isolated & Clustered Cache Misses Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, and Mateo Valero. 2008. "MLP-aware dynamic cache partitioning." In Proceedings of the 3rd international conference on High performance embedded architectures and compilers (HiPEAC'08). 203

Slide 216

Slide 216 text

Cache Miss Cost & Miss Clustering Thomas R. Puzak, A. Hartstein, P. G. Emma, V. Srinivasan, and Jim Mitchell. 2007. "An analysis of the effects of miss clustering on the cost of a cache miss." In Proceedings of the 4th international conference on Computing frontiers (CF '07). ACM, New York, NY, USA, 3-12.204

Slide 217

Slide 217 text

Cache Miss Penalty: Different STC due to different MLP MLP (memory-level parallelism) & STC (stall-time criticality) R. Das, O. Mutlu, T. Moscibroda and C. R. Das, "Application-aware prioritization mechanisms for on-chip networks," 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), New York, NY, 2009, pp. 280-291. 205

Slide 218

Slide 218 text

Skip Lists William Pugh. 1990. "Skip lists: a probabilistic alternative to balanced trees." Commun. ACM 33, 6, 668-676. 206

Slide 219

Slide 219 text

Jump Pointers S. Chen, P. B. Gibbons, and T. C. Mowry. “Improving Index Performance through Prefetching.” In Proc. of the 20th Annual ACM SIGMOD International Conference on Management of Data, 2001. 207
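A minimal jump-pointer sketch in the spirit of Chen et al. (my illustration; the node layout and helper are assumptions, not the paper's code): each node carries an extra pointer d hops ahead, used only to issue a prefetch while the current node is processed.

struct node {
    int value;
    node * next;
    node * jump;   // points d hops ahead (nullptr near the tail)
};

void set_jump_pointers(node * head, int d) {
    node * lead = head;
    for (int i = 0; i != d && lead; ++i) lead = lead->next;  // advance d hops
    for (node * p = head; p; p = p->next) {
        p->jump = lead;                    // d-ahead node, or nullptr
        if (lead) lead = lead->next;
    }
}

int sum_with_jump_prefetch(node * head) {
    int sum = 0;
    for (node * p = head; p; p = p->next) {
        if (p->jump) __builtin_prefetch(p->jump);  // start the future miss now
        sum += p->value;
    }
    return sum;
}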

Slide 220

Slide 220 text

Prefetching Aggressiveness: Distance & Degree Sparsh Mittal. 2016. "A Survey of Recent Prefetching Techniques for Processor Caches." ACM Comput. Surv. 49, 2, Article 35. 208
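A common first-order rule from the software-prefetching literature (my paraphrase, not the survey's notation): with an expected miss latency of l cycles and a loop body taking s cycles per iteration, the prefetch distance D should satisfy D ≥ ⌈l / s⌉, so the data arrives by the time the loop reaches it; the degree then controls how many future iterations are covered per prefetch issued. The prefetching code later in this section sweeps exactly these two knobs.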

Slide 221

Slide 221 text

Prefetching Timeliness Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. "When Prefetching Works, When It Doesn’t, and Why." ACM Trans. Archit. Code Optim. 9, 1, Article 2. 209

Slide 222

Slide 222 text

Prefetches Classification Huaiyu Zhu, Yong Chen, and Xian-He Sun. 2010. "Timing local streams: improving timeliness in data prefetching." In Proceedings of the 24th ACM International Conference on Supercomputing (ICS '10). ACM, New York, NY, USA, 169-178. 210

Slide 223

Slide 223 text

Prefetching I

// headers reconstructed: the slide export stripped the <...> names
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <future>
#include <memory>
#include <random>
#include <vector>

struct point { double x, y, z; };
using T = point;
211

Slide 224

Slide 224 text

Prefetching II

struct timing_result {
    double duration_initial;
    double duration_non_prefetched;
    double duration_degree;
    double sum_initial;
    double sum_non_prefetched;
    double sum_degree;
};

timing_result chase(std::size_t n, bool shuffle, std::size_t d, bool prefetch) {
    timing_result chase_result;
    std::vector<std::unique_ptr<point>> v;
    for (std::size_t i = 0; i != n; ++i) {
        v.emplace_back(new point{1. * i, 2. * i, 5. * i});
    }
    if (shuffle) {
        std::mt19937 g(1);
212

Slide 225

Slide 225 text

Prefetching III

        std::shuffle(begin(v), end(v), g);
    }

    double sum = 0.0;
    auto time_start = std::chrono::steady_clock::now();
    if (prefetch) {
        for (std::size_t i = 0; i != n; ++i) {
            // note: v[i + d] runs past the end of v for the last d
            // iterations (as on the slide)
            __builtin_prefetch(v[i + d].get());
            sum += std::exp(-v[i]->y);
        }
    } else {
        for (std::size_t i = 0; i != n; ++i) {
            sum += std::exp(-v[i]->y);
        }
    }
    auto time_end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = time_end - time_start;
    chase_result.duration_initial = duration.count();
    chase_result.sum_initial = sum;
213

Slide 226

Slide 226 text

Prefetching IV

    sum = 0.0;
    time_start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != n; ++i) {
        sum += std::exp(-v[i]->y);
    }
    time_end = std::chrono::steady_clock::now();
    duration = time_end - time_start;
    chase_result.duration_non_prefetched = duration.count();
    chase_result.sum_non_prefetched = sum;

    sum = 0.0;
    time_start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i != n; ++i) {
        __builtin_prefetch(v[i + d].get());      // degree two: two prefetches
        __builtin_prefetch(v[i + 2*d].get());    // per iteration
        sum += std::exp(-v[i]->y);
    }
    time_end = std::chrono::steady_clock::now();
214

Slide 227

Slide 227 text

Prefetching V

    duration = time_end - time_start;
    chase_result.duration_degree = duration.count();
    chase_result.sum_degree = sum;
    return chase_result;
}

int main(int argc, char * argv[]) {
    // sample size
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100;
    const bool shuffle = (argc > 2) ? std::atoi(argv[2]) : false;
    const std::size_t d = (argc > 3) ? std::atoll(argv[3]) : 3;
    const bool prefetch = (argc > 4) ? std::atoi(argv[4]) : false;
    const std::size_t threads_count = (argc > 5) ? std::atoll(argv[5]) : 4;
    printf("size: %zu \n", n);
    printf("shuffle: %d \n", shuffle);
    printf("distance: %zu \n", d);
    printf("prefetch: %d \n", prefetch);
215

Slide 228

Slide 228 text

Prefetching VI

    printf("threads_count: %zu \n", threads_count);  // %zu for std::size_t

    const auto thread_work = [n, shuffle, d, prefetch]() {
        return chase(n, shuffle, d, prefetch);
    };

    std::vector<std::future<timing_result>> results;
    for (std::size_t thread = 0; thread != threads_count; ++thread)
        results.emplace_back(std::async(std::launch::async, thread_work));
    for (auto && future_result : results)
        if (future_result.valid())
            future_result.wait();

    std::vector<double> timings_initial, timings_non_prefetched, timings_degree;
    for (auto && future_result : results) {
        timing_result chase_result = future_result.get();
        timings_initial.push_back(chase_result.duration_initial);
216

Slide 229

Slide 229 text

Prefetching VII

        timings_non_prefetched.push_back(chase_result.duration_non_prefetched);
        timings_degree.push_back(chase_result.duration_degree);
    }

    const auto timings_initial_minmax =
        std::minmax_element(begin(timings_initial), end(timings_initial));
    const auto timings_non_prefetched_minmax =
        std::minmax_element(begin(timings_non_prefetched), end(timings_non_prefetched));
    const auto timings_degree_minmax =
        std::minmax_element(begin(timings_degree), end(timings_degree));

    printf(prefetch ? "prefetched" : "non-prefetched");
    printf(" initial duration: [%g, %g] \n",
           *timings_initial_minmax.first, *timings_initial_minmax.second);
    printf("non-prefetched duration: [%g, %g] \n",
           *timings_non_prefetched_minmax.first, *timings_non_prefetched_minmax.second);
    printf("degree-two prefetching duration: [%g, %g] \n",
           *timings_degree_minmax.first, *timings_degree_minmax.second);
}
217
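A plausible build line for the listing above (assumed; the slides do not show one, but it matches the pattern used for branches.cpp earlier, and std::async needs pthreads on Linux):

g++ -ggdb -std=c++14 -march=native -Ofast ./prefetch.cpp -o prefetch -lpthread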

Slide 230

Slide 230 text

Prefetch Overhead S. Van der Wiel and D. Lilja, "A Survey of Data Prefetching Techniques," Technical Report No. HPPC 96-05, University of Minnesota, October 1996. 218

Slide 231

Slide 231 text

Prefetching Timings: No Prefetch

$ likwid-perfctr -f -C 0-3 -g L3 -m ./prefetch 100000 1 0 0 4
distance: 0
prefetch: 0
non-prefetched initial duration: [0.00280393, 0.00289815]
non-prefetched duration: [0.00254968, 0.00257311]
degree-two prefetching duration: [0.00290615, 0.00296243]

Region chase_initial, Group 1: L3
| CPI STAT | 5.8641 | 1.4529 | 1.4744 | 1.4660 |
| L3 bandwidth [MBytes/s] STAT | 10733.6308 | 2666.0364 | 2710.9325 | 2683.4077 |

Region chase_initial, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| L3 miss rate STAT | 0.0584 | 0.0145 | 0.0148 | 0.0146 |
| L3 miss ratio STAT | 3.7723 | 0.9117 | 0.9789 | 0.9431 |

$ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 0 0 4
| Cycles without execution [%] STAT | 228.2316 | 56.8136 | 57.4443 | 57.0579 |
| Cycles without execution [%] STAT | 227.0385 | 56.5980 | 57.0024 | 56.7596 |
219

Slide 232

Slide 232 text

Prefetching Timings: useless 0-distance prefetch (overhead)

$ likwid-perfctr -f -C 0-3 -g L3 -m ./prefetch 100000 1 0 1 4
distance: 0
prefetch: 1
prefetched initial duration: [0.00288751, 0.00295978]
non-prefetched duration: [0.0025575, 0.00258342]
degree-two prefetching duration: [0.00285772, 0.00287839]

Region chase_initial, Group 1: L3
| CPI STAT | 5.7454 | 1.4345 | 1.4387 | 1.4364 |
| L3 bandwidth [MBytes/s] STAT | 10518.6383 | 2618.5405 | 2645.6096 | 2629.6596 |
220

Slide 233

Slide 233 text

Prefetching Timings: 1-distance prefetch (mostly overhead)

$ likwid-perfctr -f -C 0-3 -g L3CACHE -m ./prefetch 100000 1 1 1 4
prefetched initial duration: [0.00250957, 0.00257662]
non-prefetched duration: [0.00255286, 0.00258417]
degree-two prefetching duration: [0.00230482, 0.00235828]

Region chase_initial, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| CPI STAT | 4.9595 | 1.2343 | 1.2433 | 1.2399 |
| L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 |
| L3 miss ratio STAT | 2.0889 | 0.4381 | 0.6454 | 0.5222 |

$ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 1 1 4
| Cycles without execution [%] STAT | 214.1614 | 53.4628 | 53.6716 | 53.5404 |
| Cycles without execution [%] STAT | 200.4785 | 50.0405 | 50.1857 | 50.1196 |

Formulas:
L3 request rate = MEM_LOAD_UOPS_RETIRED_L3_ALL/UOPS_RETIRED_ALL
L3 miss rate = MEM_LOAD_UOPS_RETIRED_L3_MISS/UOPS_RETIRED_ALL
L3 miss ratio = MEM_LOAD_UOPS_RETIRED_L3_MISS/MEM_LOAD_UOPS_RETIRED_L3_ALL
https://github.com/RRZE-HPC/likwid/blob/master/groups/ivybridge/L3CACHE.txt
221

Slide 234

Slide 234 text

Prefetching Timings: 2-distance prefetch

$ likwid-perfctr -f -C 0-3 -g L3CACHE -m ./prefetch 100000 1 2 1 4
size: 100000
shuffle: 1
distance: 2
prefetch: 1
threads_count: 4
prefetched initial duration: [0.0023392, 0.00241287]
non-prefetched duration: [0.00257006, 0.00260938]
degree-two prefetching duration: [0.00199431, 0.00203528]

Region chase_initial, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| CPI STAT | 4.5557 | 1.1331 | 1.1423 | 1.1389 |
| L3 request rate STAT | 0.0006 | 0.0001 | 0.0002 | 0.0002 |
| L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 |
| L3 miss ratio STAT | 2.2317 | 0.3138 | 0.6791 | 0.5579 |

Region chase_degree, Group 1: L3CACHE
| CPI STAT | 3.6990 | 0.9243 | 0.9253 | 0.9248 |
| L3 request rate STAT | 0.0005 | 0.0001 | 0.0002 | 0.0001 |
| L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 |
| L3 miss ratio STAT | 2.0145 | 0.3597 | 0.6550 | 0.5036 |
222

Slide 235

Slide 235 text

Prefetching Timings: 8-distance prefetch

$ likwid-perfctr -f -C 0-3 -g L3CACHE -m ./prefetch 100000 1 8 1 4
size: 100000
shuffle: 1
distance: 8
prefetch: 1
threads_count: 4
prefetched initial duration: [0.00181161, 0.00188783]
non-prefetched duration: [0.00257601, 0.0026076]
degree-two prefetching duration: [0.00152468, 0.00156814]

Region chase_initial, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| Runtime (RDTSC) [s] STAT | 0.0065 | 0.0016 | 0.0017 | 0.0016 |
| CPI STAT | 3.4808 | 0.8650 | 0.8788 | 0.8702 |
| L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 |
| L3 miss ratio STAT | 2.2431 | 0.4694 | 0.6640 | 0.5608 |

Region chase_degree, Group 1: L3CACHE
| Metric | Sum | Min | Max | Avg |
| Runtime (RDTSC) [s] STAT | 0.0053 | 0.0013 | 0.0014 | 0.0013 |
| CPI STAT | 2.7450 | 0.6832 | 0.6882 | 0.6863 |
| L3 miss rate STAT | 0.0016 | 0.0004 | 0.0004 | 0.0004 |
| L3 miss ratio STAT | 3.4045 | 0.7778 | 0.9346 | 0.8511 |
223

Slide 236

Slide 236 text

Prefetching Timings: 8-distance prefetch

$ likwid-perfctr -f -C 0-3 -g L3 -m ./prefetch 100000 1 8 1 4
size: 100000
shuffle: 1
distance: 8
prefetch: 1
threads_count: 4
prefetched initial duration: [0.00180738, 0.00189831]
non-prefetched duration: [0.00254486, 0.00258013]
degree-two prefetching duration: [0.00154542, 0.00158065]

Region chase_initial, Group 1: L3
| CPI STAT | 3.5027 | 0.8668 | 0.8835 | 0.8757 |
| L3 bandwidth [MBytes/s] STAT | 17384.8731 | 4296.5905 | 4381.7164 | 4346.2183 |

Region chase_degree, Group 1: L3
| Metric | Sum | Min | Max | Avg |
| CPI STAT | 2.7626 | 0.6894 | 0.6919 | 0.6906 |
| L3 bandwidth [MBytes/s] STAT | 21505.6670 | 5333.6653 | 5396.4473 | 5376.4168 |

$ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 8 1 4
| Cycles without execution [%] STAT | 187.6689 | 46.3938 | 47.3055 | 46.9172 |
| Cycles without execution [%] STAT | 151.5095 | 37.6872 | 38.0656 | 37.8774 |
224

Slide 237

Slide 237 text

Prefetching Timings: suboptimal (untimely) prefetch

$ likwid-perfctr -f -C 0-3 -g L3 -m ./prefetch 100000 1 512 1 4
size: 100000
shuffle: 1
distance: 512
prefetch: 1
threads_count: 4
prefetched initial duration: [0.00177956, 0.00186644]
non-prefetched duration: [0.00257188, 0.0026064]
degree-two prefetching duration: [0.00173249, 0.00178712]

Region chase_initial, Group 1: L3
| CPI STAT | 3.4343 | 0.8523 | 0.8683 | 0.8586 |
| L3 data volume [GBytes] STAT | 0.0293 | 0.0073 | 0.0074 | 0.0073 |

Region chase_degree, Group 1: L3
| Metric | Sum | Min | Max | Avg |
| CPI STAT | 3.1891 | 0.7903 | 0.8034 | 0.7973 |
| L3 bandwidth [MBytes/s] STAT | 19902.4764 | 4954.4107 | 5013.4006 | 4975.6191 |
225

Slide 238

Slide 238 text

Gem5 http://www.gem5.org/ 226

Slide 239

Slide 239 text

Gem5 - std::vector & std::list I Filling with numbers - std::vector vs. std::list Machine code & assembly (std::vector) Micro-ops execution breakdown (std::vector) Assembly is Too High Level: http://xlogicx.net/?p=369 227

Slide 240

Slide 240 text

Gem5 - std::vector & std::list II Micro-ops pipeline stages (std::vector) 228

Slide 241

Slide 241 text

Gem5 - std::vector & std::list III Pipeline diagram - one iteration (std::vector) Pipeline diagram - three iterations (std::vector) 229

Slide 242

Slide 242 text

Gem5 - std::vector & std::list IV Machine code & assembly (std::list) heap allocation in the loop @ 400d85 what could possibly go wrong? 230

Slide 243

Slide 243 text

std::list - one iteration

Slide 244

Slide 244 text

std::list - one iteration (continued...)

Slide 245

Slide 245 text

std::list - one iteration (...continued still)

Slide 246

Slide 246 text

std::list - one iteration (...done!)

Slide 247

Slide 247 text

(The GNU C library's) malloc https://sourceware.org/glibc/wiki/MallocInternals Arena A structure that is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are "free". Threads assigned to each arena will allocate memory from that arena's free lists. Glibc Heap Analysis in Linux Systems with Radare2 https://youtube.com/watch?v=Svm5V4leEho r2con-2016 - rada.re/con/ 235
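A tiny sketch (mine, not from the slides) for poking at this interactively: glibc's malloc_stats() prints per-arena system/in-use byte totals to stderr, a quick way to watch arenas and free-list state change.

#include <malloc.h>   // glibc-specific: malloc_stats()
#include <cstdlib>

int main() {
    void * p = std::malloc(1 << 20);
    malloc_stats();   // "Arena 0: ..." system bytes / in use bytes
    std::free(p);
    malloc_stats();   // in-use bytes drop; the chunk is back on a free list
}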

Slide 248

Slide 248 text

malloc & free - new, new[], delete, delete[]

int main() {
    double * a = new double[8];
    double * b = new double[8];
    delete[] b;
    delete[] a;
    double * c = new double[8];
    delete[] c;
}
236

Slide 249

Slide 249 text

new[] & delete[] - dmhg 1/6 237

Slide 250

Slide 250 text

new[] & delete[] - dmhg 2/6 238

Slide 251

Slide 251 text

new[] & delete[] - dmhg 3/6 239

Slide 252

Slide 252 text

new[] & delete[] - dmhg 4/6 240

Slide 253

Slide 253 text

new[] & delete[] - dmhg 5/6 241

Slide 254

Slide 254 text

new[] & delete[] - dmhg 6/6 242

Slide 255

Slide 255 text

Memory Access Patterns: Temporal & Spatial Locality horizontal axis - time vertical axis - address D. J. Hatfield and J. Gerald. "Program restructuring for virtual memory." IBM Systems Journal, 10(3):168–192, 1971. 243
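A minimal illustration of spatial locality (mine, not from the slides): the same reduction over an n×n row-major matrix, with unit-stride and stride-n traversal.

#include <cstddef>
#include <vector>

double sum_rows(const std::vector<double> & m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i != n; ++i)
        for (std::size_t j = 0; j != n; ++j)
            s += m[i * n + j];   // unit stride: every element of a fetched
    return s;                    // cache line is used
}

double sum_cols(const std::vector<double> & m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j != n; ++j)
        for (std::size_t i = 0; i != n; ++i)
            s += m[i * n + j];   // stride n: for large n, a new cache line
    return s;                    // per access, same data volume touched
}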

Slide 256

Slide 256 text

Loop Fusion
0.429504s (unfused) down to 0.287501s (fused)
g++ -Ofast -march=native (5.2.0)

void unfused(double * a, double * b, double * c, double * d, size_t N) {
    for (size_t i = 0; i != N; ++i)
        a[i] = b[i] * c[i];
    for (size_t i = 0; i != N; ++i)
        d[i] = a[i] * c[i];
}

void fused(double * a, double * b, double * c, double * d, size_t N) {
    for (size_t i = 0; i != N; ++i) {
        a[i] = b[i] * c[i];
        d[i] = a[i] * c[i];
    }
}
244

Slide 257

Slide 257 text

Pin - A Dynamic Binary Instrumentation Tool
http://www.intel.com/software/pintool

pin -t $PIN_ROOT/source/tools/ManualExamples/obj-intel64/pinatrace.so -- ./loop_fusion

. . .
0x400e43,R,0x401c48
0x400e59,R,0x401d40
0x400e65,W,0x1c789c0
0x400e65,W,0x1c789e0
. . .

r-project.org rstudio.com ggplot2.org rcpp.org
245

Slide 258

Slide 258 text

Loop Fusion: unfused over time PC: Program Counter (instruction pointer) 246

Slide 259

Slide 259 text

Loop Fusion: unfused space-time PC: Program Counter (instruction pointer) MA: Memory Address (array element pointer) 247

Slide 260

Slide 260 text

Loop Fusion: unfused space over time MA: Memory Address (array element pointer) 248

Slide 261

Slide 261 text

Loop Fusion: fused over time PC: Program Counter (instruction pointer) 249

Slide 262

Slide 262 text

Loop Fusion: fused space-time PC: Program Counter (instruction pointer) MA: Memory Address (array element pointer) 250

Slide 263

Slide 263 text

Loop Fusion: fused space over time MA: Memory Address (array element pointer) 251

Slide 264

Slide 264 text

Takeaway: Overlapping Latencies as a General Principle Overlapping latencies also works on a "macro" scale • load as "get the data from the Internet" • compute as "process the data" Another example: Communication Avoiding and Overlapping for Numerical Linear Algebra • https://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-65.html • http://www.cs.berkeley.edu/~egeor/sc12_slides_final.pdf 252
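A macro-scale sketch of the same principle (mine; fetch_quotes and process are hypothetical stand-ins for "get the data from the Internet" and "process the data"): kick off the next download before processing the current one, so network latency overlaps with compute, which is what produces the overlapped timings on the next slides.

#include <cstddef>
#include <future>
#include <string>
#include <vector>

// stubs standing in for real I/O and real computation
std::string fetch_quotes(const std::string & symbol) { return symbol; }
double process(const std::string & raw_quotes) { return raw_quotes.size(); }

double run(const std::vector<std::string> & symbols) {
    double total = 0.0;
    if (symbols.empty()) return total;
    auto pending = std::async(std::launch::async, fetch_quotes, symbols[0]);
    for (std::size_t i = 0; i != symbols.size(); ++i) {
        std::string raw = pending.get();             // wait for current data
        if (i + 1 != symbols.size())                 // overlap: start the next
            pending = std::async(std::launch::async, // fetch before computing
                                 fetch_quotes, symbols[i + 1]);
        total += process(raw);                       // compute while it loads
    }
    return total;
}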

Slide 265

Slide 265 text

Non-Overlapped Timings

id,symbol,count,time
1,AAPL,565449,1.59043
2,AXP,731366,3.43745
3,BA,867366,5.40218
4,CAT,830327,7.08103
5,CSCO,400440,8.49192
6,CVX,687198,9.98761
7,DD,910932,12.2254
8,DIS,910430,14.058
9,GE,871676,15.8333
10,GS,280604,17.059
11,HD,556611,18.2738
12,IBM,860071,20.3876
13,INTC,559127,21.9856
14,JNJ,724724,25.5534
15,JPM,500473,26.576
16,KO,864903,28.5405
17,MCD,717021,30.087
18,MMM,698996,31.749
19,MRK,733948,33.2642
20,MSFT,475451,34.3134
21,NKE,556344,36.4545
253

Slide 266

Slide 266 text

Overlapped Timings

id,symbol,count,time
1,AAPL,565449,2.00713
2,AXP,731366,2.09158
3,BA,867366,2.13468
4,CAT,830327,2.19194
5,CSCO,400440,2.19197
6,CVX,687198,2.19198
7,DD,910932,2.51895
8,DIS,910430,2.51898
9,GE,871676,2.51899
10,GS,280604,2.519
11,HD,556611,2.51901
12,IBM,860071,2.51902
13,INTC,559127,2.51902
14,JNJ,724724,2.51903
15,JPM,500473,2.51904
16,KO,864903,2.51905
17,MCD,717021,2.51906
18,MMM,698996,2.51907
19,MRK,733948,2.51908
20,MSFT,475451,2.51908
21,NKE,556344,2.51909
254

Slide 267

Slide 267 text

Visualizing & Monitoring Performance https://github.com/Celtoys/Remotery 255

Slide 268

Slide 268 text

Timeline: Without Overlapping 256

Slide 269

Slide 269 text

Timeline: With Overlapping 257

Slide 270

Slide 270 text

Cache Misses, MLP, and STC: Slack R. Das et al., "Aérgia: Exploiting Packet Latency Slack in On-Chip Networks," Proc. 37th Ann. Int’l Symp. Computer Architecture (ISCA 10), ACM Press, 2010. 258

Slide 271

Slide 271 text

Dependent Cache Misses - Non-Overlapped - Serialized A Day in the Life of a Cache Miss Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and James E. Smith, "A Performance Counter Architecture for Computing Accurate CPI Components", ASPLOS 2006, pp. 175-184. 1. load instruction enters the window (ROB) 2. the load issues from the instruction buffer (RS) 3. the load blocks the ROB head 4. ROB eventually fills 5. dispatch stops, instruction window drains 6. eventually issue and commit stop

Slide 272

Slide 272 text

Independent Cache Misses in ROB - Overlapped Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith, "A Top-Down Approach to Architecting CPI Component Performance Counters", IEEE Micro, Special Issue on Top Picks from 2006 Microarchitecture Conferences, Vol 27, No 1, pp. 84-93. 260

Slide 273

Slide 273 text

Miss-Dependent Mispredicted Branch - Penalties Serialization S. Eyerman, J.E. Smith and L. Eeckhout, "Characterizing the branch misprediction penalty", Performance Analysis of Systems and Software 2006 IEEE International Symposium on 2006, pp. 48-58. 261

Slide 274

Slide 274 text

Dependent Cache Misses - Non-Overlapped - Serialized Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. "Accelerating Dependent Cache Misses with an Enhanced Memory Controller." In ISCA, 2016. 262

Slide 275

Slide 275 text

Independent Misses Connected by a Pending Cache Hit • MLP - supported by non-blocking caches, out-of-order execution • multiple outstanding cache-misses - Miss Status Holding Registers (MSHRs) / Line Fill Buffers (LFBs) • MSHR file entries - merging redundant (same cache line) memory requests Xi E. Chen and Tor M. Aamodt. 2008. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41). 263

Slide 276

Slide 276 text

Independent Misses Connected by a Pending Cache Hit Xi E. Chen and Tor M. Aamodt. 2011. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 264

Slide 277

Slide 277 text

Finite MSHRs => Finite MLP Xi E. Chen and Tor M. Aamodt. 2011. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 265

Slide 278

Slide 278 text

Cache Miss Penalty: Leading Edge & Trailing Edge "The End of Scaling? Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node," Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 266

Slide 279

Slide 279 text

Cache Miss Penalty: Bandwidth Utilization Impact "The End of Scaling? Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node," Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 267

Slide 280

Slide 280 text

Memory Capacity & Multicore Processors Memory utilization even more important - contention for capacity & bandwidth! "Disaggregated Memory Architectures for Blade Servers," Kevin Te-Ming Lim, Ph.D. Thesis, The University of Michigan, 2010. 268

Slide 281

Slide 281 text

Multicore: Sequential / Parallel Execution Model L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 269

Slide 282

Slide 282 text

Multicore: Amdahl's Law, Strong Scaling "Reevaluating Amdahl's Law," John L. Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 270

Slide 283

Slide 283 text

Multicore: Gustafson's Law, Weak Scaling "Reevaluating Amdahl's Law," John L. Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 271

Slide 284

Slide 284 text

Amdahl's Law Optimistic

Assumes perfect parallelism of the parallel portion: Only Serial Bottlenecks, No Parallel Bottlenecks

Counterpoint: https://blogs.msdn.microsoft.com/ddperf/2009/04/29/parallel-scalability-isnt-childs-play-part-2-amdahls-law-vs-gunthers-law/ 272
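For reference, a minimal sketch of the two laws from the preceding slides (the standard textbook formulas, not code from the talk): Amdahl's speedup for a fixed-size problem and Gustafson's scaled speedup.

#include <cstdio>

// Amdahl (strong scaling): f = parallelizable fraction, p = #cores.
double amdahl(double f, double p) { return 1.0 / ((1.0 - f) + f / p); }

// Gustafson (weak scaling): s = serial fraction of the scaled run.
double gustafson(double s, double p) { return p - s * (p - 1.0); }

int main()
{
    for (double p : {2.0, 4.0, 8.0, 16.0})
        std::printf("p = %4.0f  Amdahl(f = 0.9): %5.2f  Gustafson(s = 0.1): %5.2f\n",
                    p, amdahl(0.9, p), gustafson(0.1, p));
}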

Slide 285

Slide 285 text

Multicore: Synchronization, Actual Scaling M. A. Suleman, M. K. Qureshi, and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 273

Slide 286

Slide 286 text

Multicore: Communication, Actual Scaling M. A. Suleman, M. K. Qureshi, and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 274

Slide 287

Slide 287 text

Multicore & DRAM: AoS I

#include <cstddef>
#include <cstdlib>
#include <future>
#include <iostream>
#include <random>
#include <vector>
#include <boost/timer/timer.hpp>

struct contract
{
    double K;
    double T;
    double P;
};

using element = contract;
using container = std::vector<element>; 275

Slide 288

Slide 288 text

Multicore & DRAM: AoS II

double sum_if(const container & a, const container & b,
              const std::vector<std::size_t> & index)
{
    double sum = 0.0;
    for (std::size_t i = 0, n = index.size(); i != n; ++i)
    {
        std::size_t j = index[i];
        if (a[j].K == b[j].K) sum += a[j].K;
    }
    return sum;
}

template <typename F>
double average(F f, std::size_t m)
{
    double average = 0.0;
    for (std::size_t i = 0; i != m; ++i)
        average += f() / m;
    return average;
} 276

Slide 289

Slide 289 text

Multicore & DRAM: AoS III

std::vector<std::size_t> index_stream(std::size_t n)
{
    std::vector<std::size_t> index;
    index.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        index.push_back(i);
    return index;
}

std::vector<std::size_t> index_random(std::size_t n)
{
    std::vector<std::size_t> index;
    index.reserve(n);
    std::random_device rd;
    static std::mt19937 g(rd());
    std::uniform_int_distribution<std::size_t> u(0, n - 1);
    for (std::size_t i = 0; i != n; ++i)
        index.push_back(u(g));
    return index;
} 277

Slide 290

Slide 290 text

Multicore & DRAM: AoS IV

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10;
    std::cout << "n = " << n << '\n';
    std::cout << "m = " << m << '\n';

    const std::size_t threads_count = 4;
    // thread access locality type
    // 0: none (default); 1: stream; 2: random
    std::vector<int> thread_type(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        thread_type[thread] = (argc > 3 + thread) ? std::atoll(argv[3 + thread]) : 0;
        std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n';
    } 278

Slide 291

Slide 291 text

Multicore & DRAM: AoS V

    endl(std::cout);

    std::vector<std::vector<std::size_t>> index(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        index[thread].resize(n);
        if (thread_type[thread] == 1) index[thread] = index_stream(n);
        else if (thread_type[thread] == 2) index[thread] = index_random(n);
    }

    const container v1(n, {1.0, 0.5, 3.0});
    const container v2(n, {1.0, 2.0, 1.0});

    const auto thread_work = [m, &v1, &v2](const auto & thread_index)
    {
        const auto f = [&v1, &v2, &thread_index]
        { return sum_if(v1, v2, thread_index); };
        return average(f, m);
    }; 279

Slide 292

Slide 292 text

Multicore & DRAM: AoS VI

    boost::timer::auto_cpu_timer timer;

    std::vector<std::future<double>> results;
    results.reserve(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        results.emplace_back(std::async(std::launch::async,
            [thread, &thread_work, &index]
            { return thread_work(index[thread]); }));
    }
    for (auto && result : results) if (result.valid()) result.wait();
    for (auto && result : results) std::cout << result.get() << '\n';
} 280

Slide 293

Slide 293 text

Multicore & DRAM: AoS Timings

1 thread, sequential access

$ ./DRAM_CMP 10000000 10 1
n = 10000000
m = 10
thread_type[0] = 1

1e+007
0.395408s wall, 0.406250s user + 0.000000s system = 0.406250s CPU (102.7%) 281

Slide 294

Slide 294 text

Multicore & DRAM: AoS Timings

1 thread, random access

$ ./DRAM_CMP 10000000 10 2
n = 10000000
m = 10
thread_type[0] = 2

1e+007
5.348314s wall, 5.343750s user + 0.000000s system = 5.343750s CPU (99.9%) 282

Slide 295

Slide 295 text

Multicore & DRAM: AoS Timings

4 threads, sequential access

$ ./DRAM_CMP 10000000 10 1 1 1 1
n = 10000000
m = 10
thread_type[0] = 1
thread_type[1] = 1
thread_type[2] = 1
thread_type[3] = 1

1e+007
1e+007
1e+007
1e+007
0.508894s wall, 2.000000s user + 0.000000s system = 2.000000s CPU (393.0%) 283

Slide 296

Slide 296 text

Multicore & DRAM: AoS Timings

4 threads: 3 sequential access + 1 random access

$ ./DRAM_CMP 10000000 10 1 1 1 2
n = 10000000
m = 10
thread_type[0] = 1
thread_type[1] = 1
thread_type[2] = 1
thread_type[3] = 2

1e+007
1e+007
1e+007
1e+007
5.666049s wall, 7.265625s user + 0.000000s system = 7.265625s CPU (128.2%) 284

Slide 297

Slide 297 text

Multicore & DRAM: AoS Timings

Memory Access Patterns & Multicore: Interactions Matter

Inter-thread Interference: Sharing - Contention - Interference - Slowdown

Threads using a shared resource (such as on-chip/off-chip interconnects and memory) contend for it, interfering with each other's progress and slowing each other down (and thus yielding negative returns to an increased thread count).

cf. Thomas Moscibroda and Onur Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," Microsoft Research Technical Report, MSR-TR-2007-15, February 2007. 285

Slide 298

Slide 298 text

Multicore & DRAM: SoA I

#include <cstddef>
#include <cstdlib>
#include <future>
#include <iostream>
#include <random>
#include <vector>
#include <boost/timer/timer.hpp>

// SoA (structure-of-arrays)
struct data
{
    std::vector<double> K;
    std::vector<double> T;
    std::vector<double> P;
}; 286

Slide 299

Slide 299 text

Multicore & DRAM: SoA II

double sum_if(const data & a, const data & b,
              const std::vector<std::size_t> & index)
{
    double sum = 0.0;
    for (std::size_t i = 0, n = index.size(); i != n; ++i)
    {
        std::size_t j = index[i];
        if (a.K[j] == b.K[j]) sum += a.K[j];
    }
    return sum;
}

template <typename F>
double average(F f, std::size_t m)
{
    double average = 0.0;
    for (std::size_t i = 0; i != m; ++i)
    {
        average += f() / m;
    } 287

Slide 300

Slide 300 text

Multicore & DRAM: SoA III

    return average;
}

std::vector<std::size_t> index_stream(std::size_t n)
{
    std::vector<std::size_t> index;
    index.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        index.push_back(i);
    return index;
}

std::vector<std::size_t> index_random(std::size_t n)
{
    std::vector<std::size_t> index;
    index.reserve(n);
    std::random_device rd;
    static std::mt19937 g(rd());
    std::uniform_int_distribution<std::size_t> u(0, n - 1); 288

Slide 301

Slide 301 text

Multicore & DRAM: SoA IV

    for (std::size_t i = 0; i != n; ++i)
        index.push_back(u(g));
    return index;
}

int main(int argc, char * argv[])
{
    const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000;
    const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10;
    std::cout << "n = " << n << '\n';
    std::cout << "m = " << m << '\n';

    const std::size_t threads_count = 4;
    // thread access locality type
    // 0: none (default); 1: stream; 2: random
    std::vector<int> thread_type(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    { 289

Slide 302

Slide 302 text

Multicore & DRAM: SoA V

        thread_type[thread] = (argc > 3 + thread) ? std::atoll(argv[3 + thread]) : 0;
        std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n';
    }
    endl(std::cout);

    std::vector<std::vector<std::size_t>> index(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        index[thread].resize(n);
        if (thread_type[thread] == 1) index[thread] = index_stream(n);
        else if (thread_type[thread] == 2) index[thread] = index_random(n);
    }

    data v1;
    v1.K.resize(n, 1.0);
    v1.T.resize(n, 0.5);
    v1.P.resize(n, 3.0); 290

Slide 303

Slide 303 text

Multicore & DRAM: SoA VI

    data v2;
    v2.K.resize(n, 1.0);
    v2.T.resize(n, 2.0);
    v2.P.resize(n, 1.0);

    const auto thread_work = [m, &v1, &v2](const auto & thread_index)
    {
        const auto f = [&v1, &v2, &thread_index]
        { return sum_if(v1, v2, thread_index); };
        return average(f, m);
    }; 291

Slide 304

Slide 304 text

Multicore & DRAM: SoA VII

    boost::timer::auto_cpu_timer timer;

    std::vector<std::future<double>> results;
    results.reserve(threads_count);
    for (std::size_t thread = 0; thread != threads_count; ++thread)
    {
        results.emplace_back(std::async(std::launch::async,
            [thread, &thread_work, &index]
            { return thread_work(index[thread]); }));
    }
    for (auto && result : results) if (result.valid()) result.wait();
    for (auto && result : results) std::cout << result.get() << '\n';
} 292

Slide 305

Slide 305 text

Multicore & DRAM: SoA Timings

1 thread, sequential access

$ ./DRAM_CMP.SoA 10000000 10 1
n = 10000000
m = 10
thread_type[0] = 1

1e+007
0.211877s wall, 0.203125s user + 0.000000s system = 0.203125s CPU (95.9%) 293

Slide 306

Slide 306 text

Multicore & DRAM: SoA Timings

1 thread, random access

$ ./DRAM_CMP.SoA 10000000 10 2
n = 10000000
m = 10
thread_type[0] = 2

1e+007
4.534646s wall, 4.546875s user + 0.000000s system = 4.546875s CPU (100.3%) 294

Slide 307

Slide 307 text

Multicore & DRAM: SoA Timings

4 threads, sequential access

$ ./DRAM_CMP.SoA 10000000 10 1 1 1 1
n = 10000000
m = 10
thread_type[0] = 1
thread_type[1] = 1
thread_type[2] = 1
thread_type[3] = 1

1e+007
1e+007
1e+007
1e+007
0.256391s wall, 1.031250s user + 0.000000s system = 1.031250s CPU (402.2%) 295

Slide 308

Slide 308 text

Multicore & DRAM: SoA Timings

4 threads: 3 sequential access + 1 random access

$ ./DRAM_CMP.SoA 10000000 10 1 1 1 2
n = 10000000
m = 10
thread_type[0] = 1
thread_type[1] = 1
thread_type[2] = 1
thread_type[3] = 2

1e+007
1e+007
1e+007
1e+007
4.581033s wall, 5.265625s user + 0.000000s system = 5.265625s CPU (114.9%) 296

Slide 309

Slide 309 text

Multicore & DRAM: SoA Timings Better Access Patterns yield Better Single-core Performance but also Reduced Interference and thus Better Multi-core Performance 297

Slide 310

Slide 310 text

Multicore: Arithmetic Intensity L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 298

Slide 311

Slide 311 text

Multicore: Synchronization & Connectivity Intensity L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 299

Slide 312

Slide 312 text

Speedup: Synchronization and Connectivity Bottlenecks

f: parallelizable fraction
f1: connectivity intensity
f2: synchronization intensity

L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 300

Slide 313

Slide 313 text

Speedup: Synchronization & Connectivity Bottlenecks Speedup - affected by sequential-to-parallel data synchronization and inter-core communication. L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 301

Slide 314

Slide 314 text

Partitioning-Sharing Tradeoffs Butler W. Lampson. 1983. "Hints for computer system design." In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48. 302

Slide 315

Slide 315 text

Shared Resource: DRAM Heechul Yun, Renato, Zheng-Pei Wu, Rodolfo Pellizzoni. "PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms," IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), 2014. https://github.com/heechul/palloc 303

Slide 316

Slide 316 text

Shared Resource: MSHRs Heechul Yun, Rodolfo Pellizzon, and Prathap Kumar Valsan. 2015. "Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems." In Proceedings of the 2015 27th Euromicro Conference on Real-Time Systems (ECRTS '15). 304

Slide 317

Slide 317 text

Partitioning Multithreading

• Thread affinity
  • POSIX: sched_getcpu, pthread_setaffinity_np
  • http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/
  • https://github.com/RRZE-HPC/likwid/blob/master/groups/skylake/FALSE_SHARE.txt
  • Local LLC false sharing rate = MEM_LOAD_L3_HIT_RETIRED_XSNP_HITM / MEM_INST_RETIRED_ALL
• NUMA: Remote Memory Accesses (RMA), Local Memory Accesses (LMA), RMA/LMA ratio
  • https://01.org/numatop/
  • https://github.com/01org/numatop 305
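A minimal sketch of pinning the calling thread to a given core with the POSIX calls named above (Linux/glibc-specific; error handling mostly elided; build with g++ on Linux):

#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to a single CPU; returns 0 on success.
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main()
{
    if (pin_to_cpu(0) == 0) // the scheduler now keeps this thread on CPU 0
        std::printf("running on CPU %d\n", sched_getcpu());
}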

Slide 318

Slide 318 text

Cache Partitioning: Index-Based & Way-Based Giovani Gracioli, Ahmed Alhammad, Renato Mancuso, Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems." ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 306

Slide 319

Slide 319 text

Cache Partitioning: CPU Support Giovani Gracioli, Ahmed Alhammad, Renato Mancuso, Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems." ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 307

Slide 320

Slide 320 text

Cache Partitioning & Intel: CAT & CMT Cache Monitoring Technology and Cache Allocation Technology https://github.com/01org/intel-cmt-cat A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, “Cache QoS: From concept to reality in the Intel Xeon processor E5-2600 v3 product family,” in Intl. Symp. on High Performance Computer Architecture (HPCA), Mar. 2016. 308

Slide 321

Slide 321 text

Cache Partitioning != Cache Access Timing Isolation H. Yun and P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 309

Slide 322

Slide 322 text

Cache Partitioning != Cache Access Timing Isolation H. Yun and P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 310

Slide 323

Slide 323 text

Cache Partitioning != Cache Access Timing Isolation H. Yun and P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 311

Slide 324

Slide 324 text

Cache Partitioning != Cache Access Timing Isolation https://github.com/CSL-KU/IsolBench Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 312

Slide 325

Slide 325 text

Cache Partitioning != Cache Access Timing Isolation • Shared: MSHRs (Miss information/Status Holding Registers) / LFBs (Line Fill Buffers) • Contention => cache space partitioning != cache access timing isolation Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 313

Slide 326

Slide 326 text

Cache Partitioning != Cache Access Timing Isolation

• multiple MSHRs support multiple outstanding cache misses
• the number of MSHRs determines the MLP of the cache
• local MLP - outstanding misses one core can generate
• global MLP - parallelism of the entire shared memory hierarchy (i.e., shared LLC and DRAM)
• "the aggregated parallelism of the cores (the sum of local MLP) exceeds the parallelism supported by the shared LLC and DRAM (global MLP) in the out-of-order architectures"

Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 314

Slide 327

Slide 327 text

Shared Resource (MSHRs) & Prefetching: Xeon Phi Zhenman Fang, Sanyam Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. "Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking." ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 315

Slide 328

Slide 328 text

Shared Resource (MSHRs) & Prefetching: SNB Zhenman Fang, Sanyam Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. "Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking." ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 316

Slide 329

Slide 329 text

Weighted Speedup

A. Snavely and D. M. Tullsen, "Symbiotic jobscheduling for a simultaneous multithreading processor," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Nov. 2000, pp. 234-244.

S. Eyerman and L. Eeckhout, "Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance," in Computer Architecture Letters, vol. 13, no. 2, 2014. 317

Slide 330

Slide 330 text

The Number of Cycles

Sam Van den Steen, Stijn Eyerman, Sander De Pestel, Moncef Mechri, Trevor E. Carlson, David Black-Schaffer, Erik Hagersten, Lieven Eeckhout, "Analytical Processor Performance and Power Modeling using Micro-Architecture Independent Characteristics," Transactions on Computers (TC) 2016.

C - #cycles
N - #instructions
Deff - effective dispatch rate
mbpred - #branch mispredictions
cres - branch resolution time
cfe - front-end pipeline depth
mILi - #instruction fetch misses at each level i of the cache hierarchy
cLi - access latency of each cache level
ROB - size of the Reorder Buffer
mLLC - #LLC load misses
cmem - memory access time
cbus - memory bus transfer and waiting time
MLP - amount of memory-level parallelism
PhLLC - LLC hit chain penalty 318
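The model equation itself appears to have been an image on the original slide and did not survive extraction. As a hedged reconstruction, assuming the interval-model structure this line of work builds on (my assumption from the listed symbols, not a quote from the paper; the ROB size and the LLC hit chain penalty PhLLC enter through the overlap/MLP terms):

C \approx \frac{N}{D_{\mathrm{eff}}}
    + m_{\mathrm{bpred}} \cdot (c_{\mathrm{res}} + c_{\mathrm{fe}})
    + \sum_{i} m_{IL_i} \cdot c_{L_i}
    + \frac{m_{\mathrm{LLC}}}{\mathrm{MLP}} \cdot (c_{\mathrm{mem}} + c_{\mathrm{bus}})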

Slide 331

Slide 331 text

Roofline Model: Potential "Auto-tuning Performance on Multicore Computers," S. Williams, PhD, 2008. 319

Slide 332

Slide 332 text

Roofline Model: Optimization "Auto-tuning Performance on Multicore Computers," S. Williams, PhD, 2008. 320

Slide 333

Slide 333 text

Cache-aware Roofline model "Cache-aware Roofline model: Upgrading the loft." Aleksandar Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 321

Slide 334

Slide 334 text

Cache-aware Roofline model "Cache-aware Roofline model: Upgrading the loft." Aleksandar Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 322

Slide 335

Slide 335 text

Roofline Model: Microarchitectural Bottlenecks "Extending the Roofline Model: Bottleneck Analysis with Microarchitectural Constraints." Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 323

Slide 336

Slide 336 text

Roofline Model: Microarchitectural Bottlenecks "Extending the Roofline Model: Bottleneck Analysis with Microarchitectural Constraints." Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 324

Slide 337

Slide 337 text

C++ Standards: C++11 & C++14

Atomic Operations & Concurrent Memory Model
http://en.cppreference.com/w/cpp/atomic
http://github.com/MattPD/cpplinks/blob/master/atomics.lockfree.memory_model.md
"The C11 and C++11 Concurrency Model" by Mark John Batty: http://www.cl.cam.ac.uk/~mjb220/thesis/

Move semantics
https://isocpp.org/wiki/faq/cpp11-language#rval
http://thbecker.net/articles/rvalue_references/section_01.html
http://kholdstare.github.io/technical/2013/11/23/moves-demystified.html

scoped_allocator (stateful allocators support)
https://isocpp.org/wiki/faq/cpp11-library#scoped-allocator
http://en.cppreference.com/w/cpp/header/scoped_allocator
https://accu.org/content/conf2012/JonathanWakely-CXX11_allocators.pdf
https://accu.org/content/conf2013/Frank_Birbacher_Allocators.r210article.pdf 325
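As a small illustration of the C++11 atomics referenced above, the classic release/acquire message-passing idiom (a generic textbook example, not code from the talk):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                // plain (non-atomic) data
std::atomic<bool> ready{false}; // publication flag

int main()
{
    std::thread producer([] {
        payload = 42;                                 // write the data...
        ready.store(true, std::memory_order_release); // ...then publish it
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) // acquire pairs with release
            ; // spin until published
        assert(payload == 42); // visible: release/acquire gives happens-before
    });
    producer.join();
    consumer.join();
}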

Slide 338

Slide 338 text

C++ Standards: C++11, C++14, and C++17

reducing the need for conditional compilation via macros and template metaprogramming

constexpr
https://isocpp.org/wiki/faq/cpp11-language#cpp11-constexpr
https://isocpp.org/wiki/faq/cpp14-language#extended-constexpr

if constexpr
http://en.cppreference.com/w/cpp/language/if#Constexpr_If 326
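A brief sketch of if constexpr selecting a branch at compile time (an illustrative example of the C++17 feature linked above):

#include <iostream>
#include <type_traits>

// Only the taken branch is instantiated, so each instantiation
// compiles exactly one of the two bodies - no macros, no SFINAE.
template <typename T>
auto halve(T x)
{
    if constexpr (std::is_floating_point<T>::value)
        return x * 0.5; // floating-point path
    else
        return x / 2;   // integral path
}

int main()
{
    std::cout << halve(3.0) << ' ' << halve(3) << '\n'; // prints: 1.5 1
}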

Slide 339

Slide 339 text

C++17 Standard

std::string_view
http://en.cppreference.com/w/cpp/string/basic_string_view
interoperability with C APIs (e.g., sockets) without extra allocations / copies

std::aligned_alloc (C11)
http://en.cppreference.com/w/cpp/memory/c/aligned_alloc
aligned uninitialized storage allocation (vectorization)

Hardware interference size
http://eel.is/c++draft/hardware.interference
http://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
portable cache line size information (e.g., padding to avoid false sharing)

Extended allocators & polymorphic memory resources
http://en.cppreference.com/w/cpp/memory/polymorphic_allocator
http://stackoverflow.com/questions/38010544/polymorphic-allocator-when-and-why-should-i-use-it
http://boost.org/doc/libs/release/doc/html/container/extended_functionality.html 327
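For instance, a minimal sketch of using the interference-size constant to keep per-thread counters on separate cache lines (assumes a standard library that ships std::hardware_destructive_interference_size; where it is unavailable, a constant such as 64 is the usual fallback):

#include <atomic>
#include <new> // std::hardware_destructive_interference_size (C++17)

// Align each counter to its own "destructive interference" region,
// so writers on different threads do not false-share a cache line.
struct alignas(std::hardware_destructive_interference_size) padded_counter
{
    std::atomic<long> value{0};
};

padded_counter counters[4]; // e.g., one per worker thread

void worker(int id, long iterations)
{
    for (long i = 0; i != iterations; ++i)
        counters[id].value.fetch_add(1, std::memory_order_relaxed);
}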

Slide 340

Slide 340 text

C++ Core Guidelines

P: Philosophy
• P.9: Don't waste time or space.

Per: Performance
• Per.3: Don't optimize something that's not performance critical.
• Per.6: Don't make claims about performance without measurements.
• Per.7: Design to enable optimization.
• Per.18: Space is time.
• Per.19: Access memory predictably.
• Per.30: Avoid context switches on the critical path.

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-performance
https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#S-performance 328

Slide 341

Slide 341 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? 329

Slide 342

Slide 342 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? 329

Slide 343

Slide 343 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency 329

Slide 344

Slide 344 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence 329

Slide 345

Slide 345 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) 329

Slide 346

Slide 346 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? 329

Slide 347

Slide 347 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty 329

Slide 348

Slide 348 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time 329

Slide 349

Slide 349 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality 329

Slide 350

Slide 350 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead 329

Slide 351

Slide 351 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work 329

Slide 352

Slide 352 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict 329

Slide 353

Slide 353 text

Takeaway: It depends! • Memory access cost: latency / bandwidth? • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict • cmov & tradeoffs: converting control dependencies to data dependencies 329
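As a concrete illustration of the last two points (my example, not from the slides): the branchy loop below is fast when the predicate is predictable and slow when it is near-random, while the branchless form trades the control dependency for a data dependency that compilers typically lower to cmov - immune to misprediction, but adding latency to the dependence chain.

#include <cstddef>
#include <vector>

// Branchy: cost = predictability x penalty; cheap on sorted/biased
// data, expensive when the outcome is close to a coin flip.
long sum_branchy(const std::vector<int> & v, int threshold)
{
    long sum = 0;
    for (std::size_t i = 0, n = v.size(); i != n; ++i)
        if (v[i] < threshold) sum += v[i];
    return sum;
}

// Branchless: control dependency converted to a data dependency;
// compilers typically emit cmov/select - no misprediction penalty,
// but every element now pays the select latency.
long sum_branchless(const std::vector<int> & v, int threshold)
{
    long sum = 0;
    for (std::size_t i = 0, n = v.size(); i != n; ++i)
        sum += (v[i] < threshold) ? v[i] : 0;
    return sum;
}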

Slide 354

Slide 354 text

Takeaways

Principles

Data structures & data layout - fundamental part of design

CPUs & pervasive forms of parallelism
• can support each other: PLP, ILP (MLP!), TLP, DLP

Balanced design vs. bottlenecks

Overlapping latencies

Sharing - contention - interference - slowdown

Yale Patt's Phase 2: Break the layers:
• break through the hardware/software interface
• harness all levels of the transformation hierarchy 330

Slide 355

Slide 355 text

Phase 2: Harnessing the Transformation Hierarchy Yale N. Patt, Microprocessor Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 331

Slide 356

Slide 356 text

Break the Layers Yale N. Patt, Microprocessor Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 332

Slide 357

Slide 357 text

Pigeonholing has to go

Yale N. Patt at Yale Patt 75 Visions of the Future Computer Architecture Workshop:
"Are you a software person or a hardware person?"
I'm a person - this pigeonholing has to go
We must break the layers
Abstractions are great - AFTER you understand what's being abstracted

Yale N. Patt, 2013 IEEE CS Harry H. Goode Award Recipient Interview — https://youtu.be/S7wXivUy-tk
Yale N. Patt at Yale Patt 75 Visions of the Future Computer Architecture Workshop — https://youtu.be/x4LH1cJCvxs 333

Slide 358

Slide 358 text

Resources http://www.agner.org/optimize/ https://users.ece.cmu.edu/~omutlu/lecture-videos.html https://github.com/MattPD/cpplinks/ 334

Slide 359

Slide 359 text

Slides https://speakerdeck.com/mattpd 335

Slide 360

Slide 360 text

Thank You! Questions? 336