Computer Architecture, C++, and High Performance (CppCon 2016)

With the increase in available computational power, Nathan Myhrvold's Laws of Software continue to apply: new opportunities enable new applications with increased needs, which subsequently become constrained by the hardware that used to be "modern" at adoption time. C++ itself opens access to high-quality optimizing compilers and a wide ecosystem of high-performance tooling and libraries. At the same time, simply turning on the highest optimization flags and hoping for the best is not going to automagically yield the highest performance -- i.e., the lowest execution time. The reasons are twofold: algorithms' performance can differ in theory -- and that of their implementations can differ even more in practice.

Modern CPU architecture has continued to yield performance increases through advances in microarchitecture, such as pipelining, multiple-issue (superscalar) and out-of-order execution, branch prediction, SIMD-within-a-register (SWAR) vector units, and chip multiprocessor (CMP, also known as multi-core) architecture. All of these developments have provided us with the opportunities associated with a higher peak performance -- while at the same time raising new optimization challenges when actually trying to reach that peak.

In this talk we'll consider the properties of code that can make it either friendly -- or hostile -- to a modern microprocessor. We will offer advice on achieving higher performance, from ways of analyzing it beyond algorithmic complexity, through recognizing the aspects we can entrust to the compiler, to practical optimization of existing code. Instead of stopping at the "you should measure it" advice (which is correct, but incomplete), the talk focuses on practical, hands-on examples of _how_ to actually perform the measurements (presenting tools -- including perf and likwid -- that simplify access to CPU performance monitoring counters) and how to reason about the resulting measurements (informed by an understanding of modern CPU architecture, the generated assembly code, as well as an in-depth look at how the CPU cycles are spent using modern microarchitectural simulation tools) to improve the performance of C++ applications.

Matt P. Dziubinski

September 19, 2016

Transcript

  1. Computer Architecture, C++, and High Performance Matt P. Dziubinski CppCon

    2016 [email protected] // @matt_dz Department of Mathematical Sciences, Aalborg University CREATES (Center for Research in Econometric Analysis of Time Series)
  2. Outline • Performance • Why do we care? • What

    is it? • How to • measure it - reason about it - improve it? 2
  3. Costs and Curves Moore, Gordon E. (1965). "Cramming more components

    onto integrated circuits". Electronics Magazine. 4
  4. Cramming more components onto integrated circuits Moore, Gordon E. (1965).

    "Cramming more components onto integrated circuits". Electronics Magazine. 5
  5. Transformation Hierarchy Yale N. Patt, Microprocessor Performance, Phase 2: Can

    We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 7
  6. Phase I & The Walls Yale N. Patt, Microprocessor Performance,

    Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 8
  7. CPU Performance Trends [chart: growth of single-processor performance relative to the VAX-11/780,

     1978-2012, for machines from the VAX-11/780 (5 MHz) through Intel Core i7 and Xeon systems, with eras of roughly 25%/year, 52%/year, and 22%/year growth] Hennessy, John L.; Patterson, David A., 2011, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann. 9
  8. Processor-Memory Performance Gap [chart: processor vs. memory performance, 1980-2010, log scale]

     The difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access. Hennessy, John L.; Patterson, David A., 2011, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann. Computer Architecture is Back: Parallel Computing Landscape https://www.youtube.com/watch?v=On-k-E5HpcQ 11
  9. DRAM Performance Trends D. Lee: "Reducing DRAM Latency at Low

    Cost by Exploiting Heterogeneity." http://arxiv.org/abs/1604.08041 (2016) D. Lee et al., "Tiered-latency DRAM: A low latency and low cost DRAM architecture," in HPCA, 2013. 12
  10. Emerging Memory Technologies - Further Down The Hierarchy Qureshi et

    al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. 13
  11. NVMs as Storage Class Memories - Bottlenecks: New & Old

    Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, 2013. 14
  12. DBs Execution Cycles: Useful Computation vs. Stall Cycles R. Panda,

    C. Erb, M. LeBeane, J. H. Ryoo and L. K. John, "Performance Characterization of Modern Databases on Out-of-Order CPUs," Computer Architecture and High Performance Computing (SBAC-PAD), 2015 27th International Symposium on, Florianopolis, 2015, pp. 114-121. 15
  13. System Calls - Performance Impact Livio Soares and Michael Stumm.

    2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 16
  14. System Calls, Interrupts, and Asynchronous I/O Jisoo Yang, Dave B.

    Minturn, and Frank Hady. 2012. "When poll is better than interrupt." In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST'12). USENIX Association, Berkeley, CA, USA. 17
  15. System Calls as CPU Exceptions Craig B. Zilles, Joel S.

    Emer, and Gurindar S. Sohi. 1999. "The use of multithreading for exception handling." In Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture (MICRO 32). IEEE Computer Society, Washington, DC, USA, 219-229. 18
  16. Pollution & Context Switch Misses Replaced Miss (D) & Reordered

    Miss (C) F. Liu, F. Guo, Y. Solihin, S. Kim and A. Eker, "Characterizing and modeling the behavior of context switch misses", Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008. 19
  17. Beyond Mode Switch Time: Footprint & Pollution Livio Soares and

    Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 20
  18. Beyond Mode Switch Time: Direct & Indirect Costs Livio Soares

    and Michael Stumm. 2010. "FlexSC: flexible system call scheduling with exception-less system calls." In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 33-46. 21
  19. Feature Scaling Trends Lee, Yunsup, "Decoupled Vector-Fetch Architecture with a

    Scalarizing Compiler," EECS Department, University of California, Berkeley. 2016. http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-82.html 22
  20. Process-Architecture-Optimization Intel's Annual Report on Form 10-K for the fiscal

    year ended December 26, 2015, filed with the SEC on February 12, 2016. https://www.sec.gov/Archives/edgar/data/50863/000005086316000105/a10kdocument12262015q4.htm 23
  21. Make it fast Butler W. Lampson. 1983. "Hints for computer

    system design." In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48. 24
  22. Performance: The Early Days A. Greenbaum and T. Chartier. "Numerical

    Methods: Design, analysis, and computer implementation of algorithms." 2010. Course Notes for Short Course on Numerical Analysis. 26
  23. Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), "On

    the computational complexity of algorithms", Transactions of the American Mathematical Society 117: 285–306. 27
  24. Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), "On

    the computational complexity of algorithms", Transactions of the American Mathematical Society 117: 285–306. 28
  25. Analysis of Algorithms - Scientific Method Robert Sedgewick and Kevin

    Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 30
  26. Analysis of Algorithms - Problem Size N vs. Running Time

    T(N) Robert Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 31
  27. Analysis of Algorithms - Tilde Notation & Tilde Approximations Robert

    Sedgewick and Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 32
  28. Analysis of Algorithms - Doubling Ratio Experiments Robert Sedgewick and

    Kevin Wayne, "Algorithms," 4th Edition, Addison-Wesley Professional, 2011. 33
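A doubling-ratio experiment of the kind referenced above takes only a few lines of C++. The following sketch is not from the deck: it times std::sort (a stand-in workload, substitute your own) at sizes N and 2N and reports the ratio T(2N)/T(N), whose base-2 logarithm estimates the exponent b in T(N) ~ a * N^b:

    #include <algorithm>
    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    // Time one run of the workload under test at problem size n.
    static double run_once(std::size_t n) {
        std::vector<int> v(n);
        std::iota(v.begin(), v.end(), 0);
        std::mt19937 g(1);
        std::shuffle(v.begin(), v.end(), g);
        const auto t0 = std::chrono::steady_clock::now();
        std::sort(v.begin(), v.end());                     // workload under test
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        // Doubling-ratio experiment: for T(N) ~ a * N^b the ratio T(2N)/T(N)
        // approaches 2^b (log factors push it slightly above 2^b).
        for (std::size_t n = 1u << 16; n <= (1u << 22); n *= 2) {
            const double t1 = run_once(n), t2 = run_once(2 * n);
            std::printf("N = %zu  T(N) = %g s  T(2N)/T(N) = %.2f  (lg ratio = %.2f)\n",
                        n, t1, t2 / t1, std::log2(t2 / t1));
        }
    }

For std::sort the reported ratio hovers slightly above 2, consistent with roughly N log N growth.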
  29. Find Example C++ Code I #include <algorithm> #include <chrono> #include

    <cstddef> #include <cstdint> #include <cstdio> #include <iterator> #include <random> #include <set> #include <vector> #include <boost/container/flat_set.hpp> #include <EASTL/vector_set.h> // EASTL // https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project void* operator new[](size_t size, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } 34
  30. Find Example C++ Code II void* operator new[](size_t size, size_t

    alignment, size_t alignmentOffset, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } using T = std::uint32_t; std::vector<T> odd_numbers(std::size_t count) { std::vector<T> result; result.reserve(count); for (std::size_t i = 0; i != count; i++) result.push_back(2 * i + 1); return result; } 35
  31. Find Example C++ Code III template <typename container_type> void ctor_and_find(const

    char * type_name, const std::vector<T> & v, std::size_t q) { printf("%s\n", type_name); std::mt19937 prng(1); const std::size_t n = v.size(); std::uniform_int_distribution<T> uniform(0, 2 * n + 2); printf("ctor\t"); auto time_start = std::chrono::steady_clock::now(); const container_type s(begin(v), end(v)); auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; printf("duration: %g \n", duration.count()); printf("search\t"); time_start = std::chrono::steady_clock::now(); T sum = 0; for (std::size_t i = 0; i != q; ++i) { 36
  32. Find Example C++ Code IV const auto it = s.find(uniform(prng));

    sum += (it != end(s)) ? *it : 0; } time_end = std::chrono::steady_clock::now(); duration = time_end - time_start; printf("duration: %g \t", duration.count()); printf("sum: %zu \n\n", sum); } void ctor_and_find(const char * type_name, const std::vector<T> & v_src, std::size_t q) { printf("%s\n", type_name); std::mt19937 prng(1); const std::size_t n = v_src.size(); std::uniform_int_distribution<T> uniform(0, 2*n + 2); printf("prep\t"); auto time_start = std::chrono::steady_clock::now(); auto v = v_src; 37
  33. Find Example C++ Code V std::sort(begin(v), end(v)); auto time_end =

    std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; printf("duration: %g \n", duration.count()); printf("search\t"); time_start = std::chrono::steady_clock::now(); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto k = uniform(prng); const auto it = std::lower_bound(begin(v), end(v), k); sum += (it != end(v)) ? (*it == k ? k : 0) : 0; } time_end = std::chrono::steady_clock::now(); duration = time_end - time_start; printf("duration: %g \t", duration.count()); printf("sum: %zu \n\n", sum); } 38
  34. Find Example C++ Code VI int main(int argc, char *

     argv[]) { // `n`: elements count (size) const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100; printf("size: %zu \n", n); // `q`: queries count const std::size_t q = (argc > 2) ? std::atoll(argv[2]) : 10; printf("queries: %zu \n", q); const auto v = odd_numbers(n); printf("\n"); ctor_and_find<std::set<T>>("std::set", v, q); ctor_and_find("std::vector: copy & sort", v, q); ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q); ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q); } 39
  35. Find Example - Benchmark (Nonius) Code I #include <algorithm> #include

    <cstddef> #include <cstdint> #include <cstdio> #include <iterator> #include <random> #include <set> #include <vector> #include <boost/container/flat_set.hpp> #include <EASTL/vector_set.h> #include <nonius/nonius.h++> #include <nonius/main.h++> NONIUS_PARAM(size, std::size_t{100u}) NONIUS_PARAM(queries, std::size_t{10u}) 40
  36. Find Example - Benchmark (Nonius) Code II // EASTL //

     https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project void* operator new[](size_t size, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } void* operator new[](size_t size, size_t alignment, size_t alignmentOffset, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } using T = std::uint32_t; std::vector<T> odd_numbers(std::size_t count) { std::vector<T> result; result.reserve(count); for (std::size_t i = 0; i != count; i++) result.push_back(2 * i + 1); return result; } 41
  37. Find Example - Benchmark (Nonius) Code III template <typename container_type>

    T ctor_and_find(const char * type_name, const std::vector<T> & v, std::size_t q) { std::mt19937 prng(1); const std::size_t n = v.size(); std::uniform_int_distribution<T> uniform(0, 2 * n + 2); const container_type s(begin(v), end(v)); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto it = s.find(uniform(prng)); sum += (it != end(s)) ? *it : 0; } return sum; } 42
  38. Find Example - Benchmark (Nonius) Code IV T ctor_and_find(const char

    * type_name, const std::vector<T> & v_src, std::size_t q) { std::mt19937 prng(1); const std::size_t n = v_src.size(); std::uniform_int_distribution<T> uniform(0, 2*n + 2); auto v = v_src; std::sort(begin(v), end(v)); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto k = uniform(prng); const auto it = std::lower_bound(begin(v), end(v), k); sum += (it != end(v)) ? (*it == k ? k : 0) : 0; } return sum; } 43
  39. Find Example - Benchmark (Nonius) Code V NONIUS_BENCHMARK("std::set", [](nonius::chronometer meter)

     { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<std::set<T>>("std::set", v, q); }); }); NONIUS_BENCHMARK("std::vector: copy & sort", [](nonius::chronometer meter) { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find("std::vector: copy & sort", v, q); }); }); 44
  40. Find Example - Benchmark (Nonius) Code VI NONIUS_BENCHMARK("boost::container::flat_set", [](nonius::chronometer meter) {

     const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q); }); }); NONIUS_BENCHMARK("eastl::vector_set", [](nonius::chronometer meter) { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q); }); }); int main(int argc, char * argv[]) { nonius::main(argc, argv); } 45
  41. Find Example - Benchmark (Nonius) Code I Nonius: statistics-powered micro-benchmarking

    framework: https://nonius.io/ https://github.com/libnonius/nonius Running: BNSIZE=10000; BNQUERIES=1000 ./find --param=size:$BNSIZE --param=queries:$BNQUERIES > results.size=$BNSIZE.queries=$BNQUERIES.txt ./find --param=size:$BNSIZE --param=queries:$BNQUERIES --reporter=html --output=results.size=$BNSIZE.queries=$BNQUERIES.html 46
  42. Asymptotic growth & "random access machines"? Tomasz Jurkiewicz and Kurt

    Mehlhorn. 2015. "On a Model of Virtual Address Translation." J. Exp. Algorithmics 19. http://arxiv.org/abs/1212.0703 & https://people.mpi-inf.mpg.de/~mehlhorn/ftp/KMvat.pdf 50
  43. Asymptotic growth & "random access machines"? Asymptotic - growing problem

    size • for large data need to take into account the costs of actually bringing it in • communication complexity vs. computation complexity • including overlapping computation-communication latencies 51
  44. "Operation"? Jack Dongarra. 2016. "With Extreme Scale Computing the Rules

    Have Changed." In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). 52
  45. "Operation"? Jack Dongarra. 2016. "With Extreme Scale Computing the Rules

    Have Changed." In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '16). 53
  46. Complexity - constants, microarchitecture? "Array Layouts for Comparison-Based Searching" Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • "With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search 54
  47. Complexity - constants, microarchitecture? "Array Layouts for Comparison-Based Searching" Paul-Virak

     Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • "With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search • (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees). 54
  48. Complexity - constants, microarchitecture? "Array Layouts for Comparison-Based Searching" Paul-Virak

     Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • "With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search • (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees). • It was only through careful and controlled experimentation with different implementations of each of the search algorithms that we are able to understand how the interactions between processor features such as pipelining, prefetching, speculative execution, and conditional moves affect the running times of the search algorithms." 54
  49. Reasoning about Performance: The Scientific Method Requires (and is enabled by)

     the knowledge of microarchitectural details. Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi, Chapter 2 "Methods" from "Readings in Computer Architecture," Morgan Kaufmann, 2000. Prefetching benefits evaluation: Disable/enable prefetchers using likwid-features: https://github.com/RRZE-HPC/likwid/wiki/likwid-features Example: https://gist.github.com/MattPD/06e293fb935eaf67ee9c301e70db6975 55
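For instance, the prefetchers can be toggled per core before re-running a measurement; the feature names and flags below follow the likwid-features wiki page linked above and may differ between likwid versions:

    $ likwid-features -c 0 -l                   # list prefetcher features and their state on core 0
    $ likwid-features -c 0 -d HW_PREFETCHER     # disable the L2 hardware prefetcher on core 0
    $ ./benchmark                               # re-run the measurement without that prefetcher
    $ likwid-features -c 0 -e HW_PREFETCHER     # re-enable it afterwards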
  50. Pervasive CPU Parallelism pipeline-level parallelism (PLP) instruction-level parallelism (ILP) memory-level

    parallelism (MLP) data-level parallelism (DLP) thread-level parallelism (TLP) 57
  51. Pipelining & Temporal Parallelism D. Sima, "Decisive aspects in the

    evolution of microprocessors", Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 58
  52. Pipelining: Base N. P. Jouppi and D. W. Wall. 1989.

    "Available instruction-level parallelism for superscalar and superpipelined machines." In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 59
  53. Pipelining: Superscalar N. P. Jouppi and D. W. Wall. 1989.

    "Available instruction-level parallelism for superscalar and superpipelined machines." In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 60
  54. The Cache Liptay, J. S. (1968) "Structural Aspects of the

    System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 61
  55. The Cache: Processor-Memory Performance Gap Liptay, J. S. (1968) "Structural

    Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 62
  56. The Cache: Assumptions & Effectiveness Liptay, J. S. (1968) "Structural

    Aspects of the System/360 Model 85, Part II: The Cache," IBM System Journal, 7(1). 63
  57. Out-of-Order Execution: Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting

    Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 65
  58. Out-of-Order Execution: Reservation Stations R.M. Tomasulo, “An Efficient Algorithm for

    Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 66
  59. Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm

    for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 67
  60. Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm

    for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 68
  61. Out-of-Order Execution of Simple Micro-Operations Y.N. Patt, W.M. Hwu, and

    M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 69
  62. Out-of-Order Execution: Restricted Dataflow Y.N. Patt, W.M. Hwu, and M.

    Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 70
  63. Out-of-Order Execution: Results Buffer Y.N. Patt, W.M. Hwu, and M.

    Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 71
  64. Pipelining & Precise Exceptions: Reorder Buffer (ROB) J.E. Smith and

    A.R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. 12th Ann. IEEE/ACM Int’l Symp. Computer Architecture, 1985, pp. 36–44. 72
  65. Execution: Superscalar & Out-Of-Order J.E. Smith and G.S. Sohi, "The

    Microarchitecture of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 73
  66. Superscalar CPU Organization J.E. Smith and G.S. Sohi, "The Microarchitecture

    of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 74
  67. Superscalar CPU: ROB J.E. Smith and G.S. Sohi, "The Microarchitecture

    of Superscalar Processors," Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 75
  68. Computer Architecture: A Science of Tradeoffs "My tongue in cheek

    phrase to emphasize the importance of tradeoffs to the discipline of computer architecture. Clearly, computer architecture is more art than science. Science, we like to think, involves a coherent body of knowledge, even though we have yet to figure out all the connections. Art, on the other hand, is the result of individual expressions of the various artists. Since each computer architecture is the result of the individual(s) who specified it, there is no such completely coherent structure. So, I opined if computer architecture is a science at all, it is a science of tradeoffs. In class, we keep coming up with design choices that involve tradeoffs. In my view, "tradeoffs" is at the heart of computer architecture." — Yale N. Patt 76
  69. Design Points: Dictated the Application Space The design of a

    microprocessor is about making relevant tradeoffs. We refer to the set of considerations, along with the relevant importance of each, as the “design point” for the microprocessor—that is, the characteristics that are most important to the use of the microprocessor, such that one is willing to be less concerned about other characteristics. In each case, it is usually the problem we are addressing . . . which dictates the design point for the microprocessor, and the resulting tradeoffs that must be made. Patt, Y., & Cockrell, E. (2001). "Requirements, bottlenecks, and good fortune: Agents for microprocessor evolution." Proceedings of the IEEE, 89(11), 1553-1559. 77
  70. A Science of Tradeoffs Software Performance Optimization - Analogous! The

    multiplicity of tradeoffs: • Multidimensional • Multiple levels • Costs and benefits 78
  71. Trade-offs - Latency & Bandwidth I Intel(R) Memory Latency Checker

    - v3.1a Measuring idle latencies (in ns)... Memory node Socket 0 0 60.4 Measuring Peak Memory Bandwidths for the system Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec) Using traffic with the following read-write ratios ALL Reads : 24152.0 3:1 Reads-Writes : 22313.2 2:1 Reads-Writes : 22050.5 1:1 Reads-Writes : 21130.4 Stream-triad like: 21559.4 79
  72. Trade-offs - Latency & Bandwidth II Measuring Memory Bandwidths between

     nodes within system Using Read-only traffic type Memory node Socket 0 0 24155.0 Measuring Loaded Latencies for the system Using Read-only traffic type
     Inject Delay   Latency (ns)   Bandwidth (MB/sec)
     ==================================================
     00000          122.27         24109.6
     00002          121.99         24082.7
     00008          120.60         23952.1
     00015          119.28         23837.6
     00050           70.87         17408.7
     00100           64.59         12496.6
     80
  73. Trade-offs - Latency & Bandwidth III

     Inject Delay   Latency (ns)   Bandwidth (MB/sec)
     ==================================================
     00200          61.76          8129.1
     00300          60.75          6194.8
     00400          60.63          5085.6
     00500          60.12          4377.0
     00700          60.51          3505.2
     01000          60.60          2812.6
     01300          60.66          2425.3
     01700          60.51          2117.0
     02500          60.36          1789.5
     03500          60.33          1585.4
     05000          60.29          1430.9
     09000          60.31          1267.9
     20000          60.32          1154.7
     81
  74. Trade-offs - Latency & Size I Intel i3-2120 (Sandy Bridge),

     3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T. http://www.7-cpu.com/cpu/SandyBridge.html
     Size     Latency (cycles)   Increase     Description
     32 K     4
     64 K     8                  4            + 8 (L2)
     128 K    10                 2
     256 K    11                 1
     512 K    20                 9            + 16 (L3)
     1 M      24                 4
     2 M      26                 2
     4 M      27 + 18 ns         1 + 18 ns    + 56 ns (RAM)
     8 M      28 + 38 ns         1 + 20 ns
     16 M     28 + 47 ns         9 ns
     32 M     28 + 52 ns         5 ns
     64 M     28 + 54 ns         2 ns
     128 M    36 + 55 ns         8 + 1 ns     + 16 (TLB miss)
     82
  75. Trade-offs - Latency & Size II

     Size     Latency (cycles)   Increase     Description
     256 M    40 + 56 ns         4 + 1 ns
     512 M    42 + 56 ns         2
     1024 M   43 + 56 ns         1
     2048 M   44 + 56 ns         1
     4096 M   44 + 56 ns         0
     8192 M   53 + 56 ns         9            + 18 (PDPTE cache miss)
     Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T. http://www.7-cpu.com/cpu/SandyBridge.html 83
  76. Trade-offs - Least Squares Golub & Van Loan (2013) "Matrix

    Computations" Trade-offs: FLOPs (FLoating-point OPerations) vs. Applicability / Numerical Stability / Speed / Accuracy Example: Catalogue of dense decompositions: http://eigen.tuxfamily.org/dox/group__TopicLinearAlgebraDecompositions.html 84
  77. Trade-offs - Multidimensional - Numerical Optimization Ben Recht, Feng Niu,

    Christopher Ré, Stephen Wright. "Lock-Free Approaches to Parallelizing Stochastic Gradient Descent" OPT 2011: 4th International Workshop on Optimization for Machine Learning http://opt.kyb.tuebingen.mpg.de/slides/opt2011-recht.pdf 85
  78. Trade-offs - Multiple levels - Numerical Optimization Gradient computation -

     accuracy vs. function evaluations, f : R^d → R^N • Finite differencing: • forward-difference: O(√ε_M) error, d × O(Cost(f)) evaluations • central-difference: O(ε_M^(2/3)) error, 2d × O(Cost(f)) evaluations, w/ the machine epsilon ε_M := inf{ε > 0 : 1.0 + ε ≠ 1.0} • Algorithmic differentiation (AD): precision - as in a hand-coded analytical gradient • rough forward-mode cost d × O(Cost(f)) • rough reverse-mode cost N × O(Cost(f)) 86
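As a concrete illustration of the forward- vs. central-difference trade-off (a minimal sketch, not from the deck; it covers the scalar case N = 1, and the step sizes sqrt(eps) and cbrt(eps) are the usual textbook choices):

    #include <cmath>
    #include <cstdio>
    #include <functional>
    #include <limits>
    #include <vector>

    using Fn = std::function<double(const std::vector<double> &)>;

    // Forward difference: one extra evaluation per dimension, O(sqrt(eps)) error.
    std::vector<double> grad_forward(const Fn & f, std::vector<double> x) {
        const double h = std::sqrt(std::numeric_limits<double>::epsilon());
        const double fx = f(x);
        std::vector<double> g(x.size());
        for (std::size_t i = 0; i != x.size(); ++i) {
            const double xi = x[i];
            x[i] = xi + h; g[i] = (f(x) - fx) / h; x[i] = xi;
        }
        return g;
    }

    // Central difference: two extra evaluations per dimension, O(eps^(2/3)) error.
    std::vector<double> grad_central(const Fn & f, std::vector<double> x) {
        const double h = std::cbrt(std::numeric_limits<double>::epsilon());
        std::vector<double> g(x.size());
        for (std::size_t i = 0; i != x.size(); ++i) {
            const double xi = x[i];
            x[i] = xi + h; const double fp = f(x);
            x[i] = xi - h; const double fm = f(x);
            g[i] = (fp - fm) / (2 * h); x[i] = xi;
        }
        return g;
    }

    int main() {
        const Fn f = [](const std::vector<double> & x) { return x[0] * x[0] + std::sin(x[1]); };
        const auto g1 = grad_forward(f, {1.0, 2.0});
        const auto g2 = grad_central(f, {1.0, 2.0});
        std::printf("forward: %.8f %.8f\ncentral: %.8f %.8f\n", g1[0], g1[1], g2[0], g2[1]);
    }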
  79. Trade-offs: Costs and Benefits Gabriel, Richard P. (1985). "Performance and

    Evaluation of Lisp Systems." Cambridge, Mass: MIT Press; Computer Systems Series. 87
  80. Costs and Benefits: Implications • Important to know what to

    focus on • Optimize the optimization: so that it doesn't always take hours or days or weeks or months... 88
  81. Superscalar CPU Model Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and

    James E. Smith. 2009. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2, Article 3. 89
  82. Instruction Level Parallelism & Loop Unrolling - Code I #include

    <cstddef> #include <cstdint> #include <cstdlib> #include <iostream> #include <vector> #include <boost/timer/timer.hpp> 90
  83. Instruction Level Parallelism & Loop Unrolling - Code II using

    T = double; T sum_1(const std::vector<T> & input) { T sum = 0.0; for (std::size_t i = 0, n = input.size(); i != n; ++i) sum += input[i]; return sum; } T sum_2(const std::vector<T> & input) { T sum1 = 0.0, sum2 = 0.0; for (std::size_t i = 0, n = input.size(); i != n; i += 2) { sum1 += input[i]; sum2 += input[i + 1]; } return sum1 + sum2; } 91
  84. Instruction Level Parallelism & Loop Unrolling - Code III int

    main(int argc, char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 10000000; const std::size_t f = (argc > 2) ? std::atoll(argv[2]) : 1; std::cout << "n = " << n << '\n'; // iterations count std::cout << "f = " << f << '\n'; // unroll factor const std::vector<T> a(n, T(1)); boost::timer::auto_cpu_timer timer; const T sum = (f == 1) ? sum_1(a) : (f == 2) ? sum_2(a) : 0; std::cout << sum << '\n'; } 92
  85. Instruction Level Parallelism & Loop Unrolling - Results make vector_sums

    CXXFLAGS="-std=c++14 -O2 -march=native" LDLIBS=-lboost_timer $ ./vector_sums 1000000000 2 n = 1000000000 f = 2 1e+09 0.466293s wall, 0.460000s user + 0.000000s system = 0.460000s CPU (98.7%) $ ./vector_sums 1000000000 1 n = 1000000000 f = 1 1e+09 0.841269s wall, 0.840000s user + 0.010000s system = 0.850000s CPU (101.0%) 93
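The two-accumulator sum_2 still carries one addition per chain; unrolling further exposes more independent chains to the out-of-order core. A sketch of a four-accumulator variant (my extension, not shown in the deck; it drops into the slides' vector_sums.cpp, reusing its includes and `using T = double;`, and like sum_2 it assumes the element count is divisible by the unroll factor):

    // Four independent accumulators shorten the loop-carried dependence chain,
    // letting more vaddsd operations overlap in flight.
    T sum_4(const std::vector<T> & input) {
        T s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (std::size_t i = 0, n = input.size(); i != n; i += 4) {
            s0 += input[i];
            s1 += input[i + 1];
            s2 += input[i + 2];
            s3 += input[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }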
  86. perf Results - sum_1 Performance counter stats for './vector_sums 1000000000

    1': 1675.812457 task-clock (msec) # 0.850 CPUs utilized 34 context-switches # 0.020 K/sec 5 cpu-migrations # 0.003 K/sec 8,953 page-faults # 0.005 M/sec 5,760,418,457 cycles # 3.437 GHz 3,456,046,515 stalled-cycles-frontend # 60.00% frontend cycles id 8,225,763,566 instructions # 1.43 insns per cycle # 0.42 stalled cycles per 2,050,710,005 branches # 1223.711 M/sec 104,331 branch-misses # 0.01% of all branches 1.970909249 seconds time elapsed 95
  87. perf Results - sum_2 Performance counter stats for './vector_sums 1000000000

    2': 1283.910371 task-clock (msec) # 0.835 CPUs utilized 38 context-switches # 0.030 K/sec 3 cpu-migrations # 0.002 K/sec 9,466 page-faults # 0.007 M/sec 4,458,594,733 cycles # 3.473 GHz 2,149,690,303 stalled-cycles-frontend # 48.21% frontend cycles id 6,734,925,029 instructions # 1.51 insns per cycle # 0.32 stalled cycles per 1,552,029,608 branches # 1208.830 M/sec 119,358 branch-misses # 0.01% of all branches 1.537971058 seconds time elapsed 96
  88. Intel Architecture Code Analyzer (IACA) #include <iacaMarks.h> T sum_2(const std::vector<T>

     & input) { T sum1 = 0.0, sum2 = 0.0; for (std::size_t i = 0, n = input.size(); i != n; i += 2) { IACA_START sum1 += input[i]; sum2 += input[i + 1]; } IACA_END return sum1 + sum2; } $ g++ -std=c++14 -O2 -march=native vector_sums_2i.cpp -o vector_sums_2i $ iaca -64 -arch IVB -graph ./vector_sums_2i • https://software.intel.com/en-us/articles/intel-architecture-code-analyzer • https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it • http://kylehegeman.com/blog/2013/12/28/introduction-to-iaca/ 101
  89. IACA Results - sum_1 $ iaca -64 -arch IVB -graph

    ./vector_sums_1i Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./vector_sums_1i Binary Format - 64Bit Architecture - IVB Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 3.00 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: ------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | ------------------------------------------------------------------------- | Cycles | 1.0 0.0 | 1.0 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.0 | ------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | | 1.0 1.0 | | | | | mov rdx, qword ptr [rdi] | 2 | | 1.0 | | 1.0 1.0 | | | CP | vaddsd xmm0, xmm0, qword ptr [rdx+rax*8] | 1 | 1.0 | | | | | | | add rax, 0x1 | 1 | | | | | | 1.0 | | cmp rax, rcx | 0F | | | | | | | | jnz 0xffffffffffffffe7 Total Num Of Uops: 5 102
  90. IACA Results - sum_2 $ iaca -64 -arch IVB -graph

    ./vector_sums_2i Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./vector_sums_2i Binary Format - 64Bit Architecture - IVB Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 6.00 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: ------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | ------------------------------------------------------------------------- | Cycles | 1.5 0.0 | 3.0 | 1.5 1.5 | 1.5 1.5 | 0.0 | 1.5 | ------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | mov rcx, qword ptr [rdi] | 2 | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | CP | vaddsd xmm0, xmm0, qword ptr [rcx+rax*8] | 1 | 1.0 | | | | | | | add rax, 0x2 | 2 | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | vaddsd xmm1, xmm1, qword ptr [rcx+rdx*1] | 1 | 0.5 | | | | | 0.5 | | add rdx, 0x10 | 1 | | | | | | 1.0 | | cmp rax, rsi | 0F | | | | | | | | jnz 0xffffffffffffffde | 1 | | 1.0 | | | | | CP | vaddsd xmm0, xmm0, xmm1 Total Num Of Uops: 9 103
  91. ILP & Data (In)dependence G. S. Tjaden and M. J.

    Flynn, ‘‘Detection and Parallel Execution of Independent Instructions,’’ IEEE Transactions on Computers, vol. C-19, pp. 889-895, October 1970. 107
  92. ILP vs. Dependencies D. W. Wall, “Limits of instruction-level parallelism,”

    Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 108
  93. ILP, Criticality & Latency Hiding D. W. Wall, “Limits of

    instruction-level parallelism,” Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 109
  94. Empty Issue Slots: Horizontal Waste & Vertical Waste D. M.

    Tullsen, S. J. Eggers and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Proceedings, 22nd Annual International Symposium on Computer Architecture, 1995. 110
  95. Wasted Slots: Causes D. M. Tullsen, S. J. Eggers and

    H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, Santa Margherita Ligure, Italy, 1995, pp. 392-403. 111
  96. Wasted Slots: Miss Events Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis,

    and James E. Smith. 2006. "A performance counter architecture for computing accurate CPI components." SIGOPS Oper. Syst. Rev. 40, 5 (October 2006), 175-184. 112
  97. likwid Results - sum_1: 489 Scalar MUOPS/s $ likwid-perfctr -C

    S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 1 1e+09 1.090122s wall, 0.880000s user + 0.000000s system = 0.880000s CPU (80.7%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 8002493499 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 4285189526 | | CPU_CLK_UNHALTED_REF | FIXC2 | 3258346806 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 1000155741 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 0 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 2.0456 | | Runtime unhalted [s] | 1.6536 | | Clock [MHz] | 3408.2011 | | CPI | 0.5355 | | MFLOP/s | 488.9303 | | AVX MFLOP/s | 0 | | Packed MUOPS/s | 0 | | Scalar MUOPS/s | 488.9303 | +----------------------+-----------+ 114
  98. likwid Results - sum_2: 595 Scalar MUOPS/s $ likwid-perfctr -C

    S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 2 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 2 1e+09 0.620421s wall, 0.470000s user + 0.000000s system = 0.470000s CPU (75.8%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 6502566958 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2948446599 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2223894218 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 1000328727 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 0 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 1.6809 | | Runtime unhalted [s] | 1.1377 | | Clock [MHz] | 3435.8987 | | CPI | 0.4534 | | MFLOP/s | 595.1079 | | AVX MFLOP/s | 0 | | Packed MUOPS/s | 0 | | Scalar MUOPS/s | 595.1079 | +----------------------+-----------+ 115
  99. likwid Results: sum_vectorized: 676 AVX MFLOP/s g++ -std=c++14 -O2 -ftree-vectorize

    -ffast-math -march=native -lboost_timer vector_sums.cpp -o vector_sums_vf $ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums_vf 1000000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 1 1e+09 0.561288s wall, 0.390000s user + 0.000000s system = 0.390000s CPU (69.5%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 3002491149 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2709364345 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2043804906 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 91 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 260258099 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 1.5390 | | Runtime unhalted [s] | 1.0454 | | Clock [MHz] | 3435.5297 | | CPI | 0.9024 | | MFLOP/s | 676.4420 | | AVX MFLOP/s | 676.4420 | | Packed MUOPS/s | 169.1105 | | Scalar MUOPS/s | 0.0001 | +----------------------+-----------+ 116
  100. Performance: CPI Steven K. Przybylski, "Cache and Memory Hierarchy Design

     – A Performance-Directed Approach," San Francisco, Morgan-Kaufmann, 1990. 117
  101. Performance: [YMMV]PI - Power Grochowski, E., Ronen, R., Shen, J.,

    & Wang, H. (2004). "Best of Both Latency and Throughput." Proceedings of the IEEE International Conference on Computer Design. 118
  102. Performance: [YMMV]PI - Graphs Scott Beamer, Krste Asanović, and David

    A. Patterson. "GAIL: The Graph Algorithm Iron Law." Workshop on Irregular Applications: Architectures and Algorithms (IAˆ3), at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015. 119
  103. Performance: [YMMV]PI - Packets packet_processing_time = seconds/packet = instructions/packet *

     clock_cycles/instruction * seconds/clock_cycle = clock_cycles/packet * seconds/clock_cycle = CPP / core_frequency, where CPP denotes cycles per packet. http://blogs.cisco.com/sp/a-bigger-helping-of-internet-please 120
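As a quick plug-in of illustrative numbers (assumed, not from the slide): at CPP = 500 cycles per packet on a 3 GHz core, packet_processing_time = 500 / (3 × 10^9) ≈ 167 ns, i.e., roughly 6 million packets per second per core; halving CPP doubles the attainable packet rate at a fixed clock frequency.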
  104. Performance: separable components of a CPI CPI = (Infinite-cache CPI)

    + finite-cache effect (FCE) Infinite-cache CPI = execute busy (EBusy) + execute idle (EIdle) FCE = (cycles per miss) × (misses per instruction) = (miss penalty) × (miss rate) P. G. Emma. "Understanding some simple processor-performance limits." IBM Journal of Research and Development, 41(3):215–232, May 1997. 121
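For a quick feel for the decomposition (assumed numbers, not from the slide): with an infinite-cache CPI of 0.6, a miss rate of 0.02 misses per instruction, and a miss penalty of 100 cycles, FCE = 100 × 0.02 = 2.0 and CPI = 0.6 + 2.0 = 2.6, so the finite-cache effect, rather than the execute busy/idle cycles, dominates.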
  105. Pipelining & Branches P. Emma and E. Davidson, "Characterization of

    Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Trans. Computers C-36, No. 7, 859-875 (July 1987) 122
  106. Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and

    James E. Smith. 2009. "A mechanistic performance model for superscalar out-of-order processors." ACM Trans. Comput. Syst. 27, 2, Article 3. 123
  107. Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and

    James E. Smith, "A Performance Counter Architecture for Computing Accurate CPI Components", ASPLOS 2006, pp. 175-184. 124
  108. Branch (Mis)Prediction Example I #include <cmath> #include <cstddef> #include <cstdlib>

    #include <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> double sum1(const std::vector<double> & x, const std::vector<bool> & which) { double sum = 0.0; for (std::size_t i = 0, n = which.size(); i != n; ++i) { sum += which[i] ? std::cos(x[i]) : std::sin(x[i]); } return sum; } 125
  109. Branch (Mis)Prediction Example II double sum2(const std::vector<double> & x, const

    std::vector<bool> & which) { double sum = 0.0; for (std::size_t i = 0, n = which.size(); i != n; ++i) { sum += which[i] ? std::sin(x[i]) : std::cos(x[i]); } return sum; } std::vector<bool> inclusion_random(std::size_t n, double p) { std::vector<bool> which; which.reserve(n); static std::mt19937 g(1); std::bernoulli_distribution decision(p); for (std::size_t i = 0; i != n; ++i) which.push_back(decision(g)); 126
  110. Branch (Mis)Prediction Example III return which; } int main(int argc,

    char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // branch takenness / predictability type // 0: never; 1: always; 2: random const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\n'; // takenness probability // 0.0: never; 1.0: always const double p = (argc > 3) ? std::atof(argv[3]) : 0.5; std::cout << "p = " << p << '\n'; 127
  111. Branch (Mis)Prediction Example IV std::vector<bool> which; if (type == 0)

    which.resize(n, false); else if (type == 1) which.resize(n, true); else if (type == 2) which = inclusion_random(n, p); const std::vector<double> x(n, 1.1); boost::timer::auto_cpu_timer timer; std::cout << sum1(x, which) + sum2(x, which) << '\n'; } 128
  112. Timing: Branch (Mis)Prediction Example $ make BP CXXFLAGS="-std=c++14 -O3 -march=native"

    LDLIBS=-lboost_timer-mt $ ./BP 10000000 0 n = 10000000 type = 0 1.3448e+007 1.190391s wall, 1.187500s user + 0.000000s system = 1.187500s CPU (99.8%) $ ./BP 10000000 1 n = 10000000 type = 1 1.3448e+007 1.172734s wall, 1.156250s user + 0.000000s system = 1.156250s CPU (98.6%) $ ./BP 10000000 2 n = 10000000 type = 2 1.3448e+007 1.296455s wall, 1.296875s user + 0.000000s system = 1.296875s CPU (100.0%) 129
  113. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 0 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 0 1.3448e+07 0.445464s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 2495177597 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 1167613066 | | CPU_CLK_UNHALTED_REF | FIXC2 | 1167632206 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 372952380 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 14796 | +------------------------------+---------+------------+ +----------------------------+--------------+ | Metric | Core 1 | +----------------------------+--------------+ | Runtime (RDTSC) [s] | 0.4586 | | Runtime unhalted [s] | 0.4505 | | Clock [MHz] | 2591.5373 | | CPI | 0.4679 | | Branch rate | 0.1495 | | Branch misprediction rate | 5.929838e-06 | | Branch misprediction ratio | 3.967263e-05 | | Instructions per branch | 6.6903 | +----------------------------+--------------+ 130
  114. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 1 1.3448e+07 0.445354s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 2495177490 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 1167125701 | | CPU_CLK_UNHALTED_REF | FIXC2 | 1167146162 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 372952366 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 14720 | +------------------------------+---------+------------+ +----------------------------+--------------+ | Metric | Core 1 | +----------------------------+--------------+ | Runtime (RDTSC) [s] | 0.4584 | | Runtime unhalted [s] | 0.4504 | | Clock [MHz] | 2591.5345 | | CPI | 0.4678 | | Branch rate | 0.1495 | | Branch misprediction rate | 5.899380e-06 | | Branch misprediction ratio | 3.946885e-05 | | Instructions per branch | 6.6903 | +----------------------------+--------------+ 131
  115. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 2 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 2 1.3448e+07 0.509917s wall, 0.510000s user + 0.000000s system = 0.510000s CPU (100.0%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 3191479747 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2264945099 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2264967068 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 468135649 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 15326586 | +------------------------------+---------+------------+ +----------------------------+-----------+ | Metric | Core 1 | +----------------------------+-----------+ | Runtime (RDTSC) [s] | 0.8822 | | Runtime unhalted [s] | 0.8740 | | Clock [MHz] | 2591.5589 | | CPI | 0.7097 | | Branch rate | 0.1467 | | Branch misprediction rate | 0.0048 | | Branch misprediction ratio | 0.0327 | | Instructions per branch | 6.8174 | +----------------------------+-----------+ 132
  116. Perf: Branch (Mis)Prediction Example $ perf stat -e branches,branch-misses -r

    10 ./BP 10000000 0 Performance counter stats for './BP 10000000 0' (10 runs): 374,121,213 branches ( +- 0.02% ) 23,260 branch-misses # 0.01% of all branches ( +- 0.35% ) 0.460392835 seconds time elapsed ( +- 0.50% ) $ perf stat -e branches,branch-misses -r 10 ./BP 10000000 1 Performance counter stats for './BP 10000000 1' (10 runs): 374,040,282 branches ( +- 0.01% ) 23,124 branch-misses # 0.01% of all branches ( +- 0.45% ) 0.457583418 seconds time elapsed ( +- 0.04% ) $ perf stat -e branches,branch-misses -r 10 ./BP 10000000 2 Performance counter stats for './BP 10000000 2' (10 runs): 469,331,762 branches ( +- 0.01% ) 15,326,501 branch-misses # 3.27% of all branches ( +- 0.01% ) 0.884858777 seconds time elapsed ( +- 0.30% ) 133
  117. Branch Prediction & Speculative Execution D. Sima, "Decisive aspects in

    the evolution of microprocessors", Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 143
  118. Block Enlargement Fisher, J. A. (1983). "Very Long Instruction Word

    architectures and the ELI-512." Proceedings of the 10th Annual International Symposium on Computer Architecture. 144
  119. Block Enlargement Joseph A. Fisher and John J. O'Donnell, "VLIW

    Machines: Multiprocessors We Can Actually Program," CompCon 84 Proceedings, pp. 299-305, IEEE, 1984. 145
  120. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) 146
  121. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) 146
  122. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 146
  123. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 146
  124. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 • what they have in common: 146
  125. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 • what they have in common: • all predictable! 146
  126. Branch Predictability & Marker API https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#using-the-marker-api https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC g++ -Ofast

    -march=native source.cpp -o application -std=c++14 -DLIKWID_PERFMON -lpthread -llikwid likwid-perfctr -f -C 0-3 -g BRANCH -m ./application #include <likwid.h> // . . . LIKWID_MARKER_START("branch"); // branch code LIKWID_MARKER_STOP("branch"); 147
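Filling in the parts elided by "// . . ." above, a minimal single-threaded sketch based on the linked tutorial (the region name "branch" and the workload are placeholders; the marker macros compile to no-ops unless -DLIKWID_PERFMON is defined):

    #include <likwid.h>
    #include <cmath>
    #include <cstdio>

    int main() {
        LIKWID_MARKER_INIT;              // initialize the marker API (picks up likwid-perfctr's environment)
        LIKWID_MARKER_THREADINIT;        // register the calling thread

        double sum = 0.0;
        LIKWID_MARKER_START("branch");   // begin the named measurement region
        for (int i = 0; i != 10000000; ++i)
            sum += (i % 3) ? std::cos(i) : std::sin(i);   // branchy code under study
        LIKWID_MARKER_STOP("branch");    // end the region

        LIKWID_MARKER_CLOSE;             // flush per-region counts for likwid-perfctr -m to report
        std::printf("%g\n", sum);
    }

Built and run as on the slide: g++ -Ofast -march=native source.cpp -o application -std=c++14 -DLIKWID_PERFMON -lpthread -llikwid, then likwid-perfctr -f -C 0-3 -g BRANCH -m ./application.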
  127. Branch Entropy linear entropy: E_L(p) = 2 × min(p, 1 − p)

     intuition: the miss rate is proportional to the probability of the least frequent outcome; e.g., a branch taken with probability p = 0.05 (or 0.95) has E_L = 0.1, while a patternless branch with p = 0.5 has E_L = 1. 148
  128. Branch Takenness Probability Sander De Pestel, Stijn Eyerman and Lieven

    Eeckhout, "Micro-Architecture Independent Branch Behavior Characterization", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 149
  129. Branch Entropy & Miss Rate: Linear Relationship Sander De Pestel,

    Stijn Eyerman and Lieven Eeckhout, "Micro-Architecture Independent Branch Behavior Characterization", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 150
  130. Branches & Expectations: Code I #include <chrono> #include <cmath> #include

    <cstdint> #include <cstdlib> #include <iostream> #include <iterator> #include <numeric> #include <random> #include <string> #include <vector> #define likely(x) (__builtin_expect(!!(x), 1)) #define unlikely(x) (__builtin_expect(!!(x), 0)) #define unpredictable(x) (__builtin_unpredictable((x))) 151
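A usage sketch for the macros above (my example, not the speaker's): __builtin_expect only tells the compiler which outcome to lay out as the fall-through path, while __builtin_unpredictable is a Clang-only hint that the outcome follows no pattern, nudging codegen toward a branchless lowering such as a conditional move:

    // assumes the likely/unlikely/unpredictable macros defined above
    int clamp_or_fail(const int * p, int lo) {
        if (unlikely(p == nullptr))                 // error path marked cold: kept off the hot path
            return -1;
        return unpredictable(*p < lo) ? lo : *p;    // data-dependent pick: favor cmov over a branch
    }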
  131. Branches & Expectations: Code II using T = int; void

    f(T z, T & x, T & y) { ((z < 0) ? x : y) = 5; } void generate_never(std::size_t n, std::vector<T> & zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(10, 19); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } 152
  132. Branches & Expectations: Code III void generate_always(std::size_t n, std::vector<T> &

    zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(-19, -10); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } void generate_random(std::size_t n, std::vector<T> & zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(-5, 4); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } 153
  133. Branches & Expectations: Code IV int main(int argc, char *

    argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // takenness predictability type // 0: never; 1: always; 2: random const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\t'; std::vector<T> xs(n), ys(n), zs; if (type == 0) { std::cout << "never"; generate_never(n, zs); } else if (type == 1) { std::cout << "always"; generate_always(n, zs); } else if (type == 2) { std::cout << "random"; generate_random(n, zs); } endl(std::cout); 154
  134. Branches & Expectations: Code V const auto time_start = std::chrono::steady_clock::now();

    T sum = 0; for (std::size_t i = 0; i != n; ++i) { f(zs[i], xs[i], ys[i]); } const auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; std::cout << "duration: " << duration.count() << '\n'; endl(std::cout); std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n'; std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n'; } 155
  135. Branches & Expectations: Compiling & Timing g++ -ggdb -std=c++14 -march=native

    -Ofast ./branches.cpp -o branches_g clang++ -ggdb -std=c++14 -march=native -Ofast ./branches.cpp -o branches_c time ./branches_g 1000000 0 time ./branches_g 1000000 1 time ./branches_g 1000000 2 time ./branches_c 1000000 0 time ./branches_c 1000000 1 time ./branches_c 1000000 2 156
  136. Branches & Expectations: Timings (GCC) $ time ./branches_g 1000000 0

    n = 1000000 type = 0 never duration: 0.00082991 sum(xs): 0 sum(ys): 5000000 real 0m0.034s user 0m0.033s sys 0m0.003s $ time ./branches_g 1000000 1 n = 1000000 type = 1 always duration: 0.000839488 sum(xs): 5000000 sum(ys): 0 real 0m0.031s user 0m0.030s sys 0m0.000s $ time ./branches_g 1000000 2 n = 1000000 type = 2 random duration: 0.0052968 sum(xs): 2498105 sum(ys): 2501895 real 0m0.038s user 0m0.033s sys 0m0.003s 157
  137. Branches & Expectations: Timings (Clang) $ time ./branches_c 1000000 0

    n = 1000000 type = 0 never duration: 0.00091161 sum(xs): 0 sum(ys): 5000000 real 0m0.036s user 0m0.033s sys 0m0.000s $ time ./branches_c 1000000 1 n = 1000000 type = 1 always duration: 0.000765925 sum(xs): 5000000 sum(ys): 0 real 0m0.036s user 0m0.033s sys 0m0.000s $ time ./branches_c 1000000 2 n = 1000000 type = 2 random duration: 0.00554585 sum(xs): 2498105 sum(ys): 2501895 real 0m0.041s user 0m0.040s sys 0m0.000s 158
  138. So many performance events, so little time "So many performance

    events, so little time," Gerd Zellweger, Denny Lin, Timothy Roscoe. Proceedings of the 7th Asia-Pacific Workshop on Systems (APSys, Hong Kong, China, August 2016). 159
  139. Hierarchical cycle accounting Andrzej Nowak, David Levinthal, Willy Zwaenepoel: "Hierarchical

    cycle accounting: a new method for application performance tuning." ISPASS 2015. https://github.com/David-Levinthal/gooda 160
  140. Top-down Microarchitecture Analysis Method (TMAM) https://github.com/andikleen/pmu-tools/wiki/toplev-manual https://sites.google.com/site/analysismethods/yasin-pubs "A Top-Down Method

    for Performance Analysis and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 161
  141. TMAM: Bottlenecks "A Top-Down Method for Performance Analysis and Counters

    Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 162
  142. TMAM: Breakdown "A Top-Down Method for Performance Analysis and Counters

    Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 163
  143. TMAM: Meaning Updates: https://download.01.org/perfmon/ "A Top-Down Method for Performance Analysis

    and Counters Architecture," Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 164
  144. Branches & Expectations: TMAM, Level 1 (GCC) $ ~/builds/pmu-tools/toplev.py -l1

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/ n = 1000000 type = 2 random duration: 0.00523105 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 53.92 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_g 1000000 2 165
  145. Branches & Expectations: TMAM, Level 2 (GCC) $ ~/builds/pmu-tools/toplev.py -l2

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1 n = 1000000 type = 2 random duration: 0.00528841 sum(xs): 2498105 sum(ys): 2501895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,c n = 1000000 type = 2 random duration: 0.00550316 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 53.94 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 47.54 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 16.41 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./branches_g 1000000 2 166
  146. Branches & Expectations: TMAM, Level 2, perf (GCC) perf record

    -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period -o perf.data ./branches_g 1000000 2 perf report -Mintel 167
  147. Branches & Expectations: TMAM, Level 1 (Clang) $ ~/builds/pmu-tools/toplev.py -l1

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/ n = 1000000 type = 2 random duration: 0.00555177 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 45.53 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_c 1000000 2 168
  148. Branches & Expectations: TMAM, Level 2 (Clang) $ ~/builds/pmu-tools/toplev.py -l2

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask n = 1000000 type = 2 random duration: 0.0055571 sum(xs): 2498105 sum(ys): 2501895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions n = 1000000 type = 2 random duration: 0.00556777 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 45.54 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 39.20 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 15.18 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,pe 169
  149. Branches & Expectations: TMAM, Level 2, perf (Clang) perf record

    -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./branches_c 1000000 2 perf report -Mintel 170
  150. Virtual Functions & Indirect Branches: Code I #include <chrono> #include

    <cmath> #include <cstdint> #include <cstdlib> #include <iostream> #include <iterator> #include <memory> #include <numeric> #include <random> #include <string> #include <vector> #define str(s) #s #define likely(x) (__builtin_expect(!!(x), 1)) #define unlikely(x) (__builtin_expect(!!(x), 0)) #define unpredictable(x) (__builtin_unpredictable(!!(x))) 171
  151. Virtual Functions & Indirect Branches: Code II using T =

    int; struct base { virtual T f() const { return 0; } }; struct derived_taken : base { T f() const override { return -1; } }; struct derived_untaken : base { T f() const override { return 1; } }; void f(const base & b, T & x, T & y) { ((b.f() < 0) ? x : y) = 119; } void generate_never(std::size_t n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); for (std::size_t i = 0; i != n; ++i) zs.push_back(std::make_unique<derived_untaken>()); return; 172
  152. Virtual Functions & Indirect Branches: Code III } void generate_always(std::size_t

    n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); for (std::size_t i = 0; i != n; ++i) zs.push_back(std::make_unique<derived_taken>()); return; } void generate_random(std::size_t n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); static std::mt19937 g(1); std::bernoulli_distribution z(0.5); for (std::size_t i = 0; i != n; ++i) { if (z(g)) zs.emplace_back(std::make_unique<derived_taken>()); else zs.emplace_back(std::make_unique<derived_untaken>()); 173
  153. Virtual Functions & Indirect Branches: Code IV } return; }

    int main(int argc, char * argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // takenness predictability type // 0: never; 1: always; 2: random std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\t'; std::vector<T> xs(n), ys(n); std::vector<std::unique_ptr<base>> zs; if (type == 0) { std::cout << "never"; generate_never(n, zs); } else if (type == 1) { std::cout << "always"; generate_always(n, zs); } 174
  154. Virtual Functions & Indirect Branches: Code V else if (type

    == 2) { std::cout << "random"; generate_random(n, zs); } endl(std::cout); auto time_start = std::chrono::steady_clock::now(); T sum = 0; for (std::size_t i = 0; i != n; ++i) { f(*zs[i], xs[i], ys[i]); } auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; std::cout << "duration: " << duration.count() << '\n'; endl(std::cout); std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n'; std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n'; } 175
  155. Virtual Functions & Indirect Branches: Compiling & Timing g++ -ggdb

    -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_g clang++ -ggdb -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_c time ./vbranches_g 10000000 0 time ./vbranches_g 10000000 1 time ./vbranches_g 10000000 2 time ./vbranches_c 10000000 0 time ./vbranches_c 10000000 1 time ./vbranches_c 10000000 2 176
  156. Virtual Functions & Indirect Branches: Timings (GCC) $ time ./vbranches_g

    10000000 0 n = 10000000 type = 0 never duration: 0.0338749 sum(xs): 0 sum(ys): 1190000000 real 0m0.645s user 0m0.573s sys 0m0.070s $ time ./vbranches_g 10000000 1 n = 10000000 type = 1 always duration: 0.0406144 sum(xs): 1190000000 sum(ys): 0 real 0m0.648s user 0m0.563s sys 0m0.083s $ time ./vbranches_g 10000000 2 n = 10000000 type = 2 random duration: 0.131803 sum(xs): 595154105 sum(ys): 594845895 real 0m0.956s user 0m0.863s sys 0m0.090s 177
  157. Virtual Functions & Indirect Branches: Timings (Clang) $ time ./vbranches_c 10000000 0

    n = 10000000 type = 0 never duration: 0.0314749 sum(xs): 0 sum(ys): 1190000000 real 0m0.623s user 0m0.530s sys 0m0.090s $ time ./vbranches_c 10000000 1 n = 10000000 type = 1 always duration: 0.0314727 sum(xs): 1190000000 sum(ys): 0 real 0m0.623s user 0m0.557s sys 0m0.063s $ time ./vbranches_c 10000000 2 n = 10000000 type = 2 random duration: 0.0854935 sum(xs): 595154105 sum(ys): 594845895 real 0m1.863s user 0m1.800s sys 0m0.063s 178
  158. Virtual Functions & Indirect Branches: TMAM, Level 1 (GCC) $

    ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x9c,umask=0x1/u,cycles:u}' n = 10000000 type = 2 random duration: 0.131386 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 35.96 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. BAD Bad_Speculation: 12.98 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_g 10000000 2 179
  159. Virtual Functions & Indirect Branches: TMAM, Level 2 (GCC) $

    ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0xc5,umask=0x0/u,cp n = 10000000 type = 2 random duration: 0.131247 sum(xs): 595154105 sum(ys): 594845895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,cycles:u,cpu/event=0xa3,umask=0x4,cmask=4 n = 10000000 type = 2 random duration: 0.131361 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 36.02 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 17.41 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u BAD Bad_Speculation: 12.92 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. BAD Bad_Speculation.Branch_Mispredicts: 12.75 % [100.00%] This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues.. http://www.intel.com/content/www/us/en/architecture-and- technology/64-ia-32-architectures-optimization-manual.html Sampling events: br_misp_retired.all_branches:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data . 180
  160. Virtual Functions & Indirect Branches: TMAM, Level 3 (GCC) $

    ~/builds/pmu-tools/toplev.py -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 n = 10000000 type = 2 random duration: 0.13145 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 35.96 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 17.44 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.69 % [100.00%] This metric represents cycles fraction the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path, following all sorts of miss-predicted branches. For example, branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sampling events: br_misp_retired.all_branches:u BAD Bad_Speculation: 12.97 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. BAD Bad_Speculation.Branch_Mispredicts: 12.82 % [100.00%] This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues.. http://www.intel.com/content/www/us/en/architecture-and- technology/64-ia-32-architectures-optimization-manual.html Sampling events: br_misp_retired.all_branches:u Sampling: perf record -g -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./vbranches_g 10000000 2 181
  161. Virtual Functions: TMAM, Level 3, perf (GCC) perf record -g

    -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_g 10000000 2 perf report -Mintel 182
  162. Virtual Functions & Indirect Branches: TMAM, Level 1 (Clang) $

    ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/ n = 10000000 type = 2 random duration: 0.0858722 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.66 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_c 10000000 2 183
  163. Virtual Functions & Indirect Branches: TMAM, Level 2 (Clang) $

    ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask n = 10000000 type = 2 random duration: 0.0859943 sum(xs): 595154105 sum(ys): 594845895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions n = 10000000 type = 2 random duration: 0.0861661 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.61 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 26.64 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 9.04 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,pe 184
  164. Virtual Functions & Indirect Branches: TMAM, Level 3 (Clang) ~/builds/pmu-tools/toplev.py

    -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.65 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 26.63 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u FE Frontend_Bound.Frontend_Latency.MS_Switches: 8.40 % [100.00%] This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB or MITE pipelines. Certain operations cannot be handled natively by the execution pipeline, and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID, or uncommon conditions like Floating Point Assists when dealing with Denormals. Sampling events: idq.ms_switches:u RET Retiring.Microcode_Sequencer: 9.04 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,edge=1,cmask=1,name=MS_Switches_IDQ_MS_SWITCHES,period=2000003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./vbranches_c 10000000 2 185
  165. Virtual Functions: TMAM, Level 3, perf (Clang) perf record -g

    -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_c 10000000 2 perf report -Mintel 186
  166. Compiler-Specific Built-in Functions GCC & Clang: __builtin_expect http://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

    likely & unlikely https://kernelnewbies.org/FAQ/LikelyUnlikely Clang: __builtin_unpredictable http://clang.llvm.org/docs/LanguageExtensions.html#builtin-unpredictable 189
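A small usage sketch (the error-handling scenario and the count_errors function are illustrative, not from the talk) showing how the likely/unlikely macros from the earlier listings wrap __builtin_expect; __builtin_unpredictable is Clang-only, so it should be guarded when building with GCC:

#include <cstddef>

#define likely(x)   (__builtin_expect(!!(x), 1))
#define unlikely(x) (__builtin_expect(!!(x), 0))

long count_errors(const int * codes, std::size_t n) {
    long errors = 0;
    for (std::size_t i = 0; i != n; ++i)
        if (unlikely(codes[i] < 0))   // hint to the compiler: the error path is rare
            ++errors;
    // with Clang, if (__builtin_unpredictable(cond)) instead hints that the
    // branch has no stable pattern, nudging the compiler toward branchless code
    return errors;
}

int main() {
    int codes[] = {0, 1, 2, -1, 3};
    return static_cast<int>(count_errors(codes, 5));
}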
  167. Branch Misprediction, Speculation, and Wrong-Path Execution J. Reineke et al.,

    “A Definition and Classification of Timing Anomalies,” Proc. Int'l Workshop Worst Case Execution Time (WCET), 2006. 190
  168. Branch Misprediction Penalty & Wrong-Path Execution Tejas S. Karkhanis and

    James E. Smith. 2004. "A First-Order Superscalar Processor Model." In Proceedings of the 31st annual international symposium on Computer architecture (ISCA '04). 191
  169. The Curse of Multiple Granularities Seshadri, V. (2016). "Simple DRAM

    and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 192
  170. Word Granularity != Cache Line Granularity Seshadri, V. (2016). "Simple

    DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 193
  171. Shortcomings of Strided Access Patterns Seshadri, V. (2016). "Simple DRAM

    and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems." CoRR, abs/1605.06483. 194
  172. Pointer Chasing Example - Linked List - C++ #include <algorithm>

    #include <forward_list> #include <iterator> bool found(const std::forward_list<int> & list, int value) { return find(begin(list), end(list), value) != end(list); } int main() { std::forward_list<int> list {11, 22, 33, 44, 55}; return found(list, 42); } 198
  173. Pointer Chasing Example - Linked List - CFG (r2) radiff2

    -g sym.found forward_list_app forward_list_app > forward_list_found.dot xdot forward_list_found.dot dot -Tpng -o forward_list_found.png forward_list_found.dot 202
  174. Isolated & Clustered Cache Misses Miquel Moreto, Francisco J. Cazorla,

    Alex Ramirez, and Mateo Valero. 2008. "MLP-aware dynamic cache partitioning." In Proceedings of the 3rd international conference on High performance embedded architectures and compilers (HiPEAC'08). 203
  175. Cache Miss Cost & Miss Clustering Thomas R. Puzak, A.

    Hartstein, P. G. Emma, V. Srinivasan, and Jim Mitchell. 2007. "An analysis of the effects of miss clustering on the cost of a cache miss." In Proceedings of the 4th international conference on Computing frontiers (CF '07). ACM, New York, NY, USA, 3-12. 204
  176. Cache Miss Penalty: Different STC due to different MLP MLP

    (memory-level parallelism) & STC (stall-time criticality) R. Das, O. Mutlu, T. Moscibroda and C. R. Das, "Application-aware prioritization mechanisms for on-chip networks," 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), New York, NY, 2009, pp. 280-291. 205
  177. Skip Lists William Pugh. 1990. "Skip lists: a probabilistic alternative

    to balanced trees." Commun. ACM 33, 6, 668-676. 206
  178. Jump Pointers S. Chen, P. B. Gibbons, and T. C.

    Mowry. “Improving Index Performance through Prefetching.” In Proc. of the 20th Annual ACM SIGMOD International Conference on Management of Data, 2001. 207
  179. Prefetching Aggressiveness: Distance & Degree Sparsh Mittal. 2016. "A Survey

    of Recent Prefetching Techniques for Processor Caches." ACM Comput. Surv. 49, 2, Article 35. 208
  180. Prefetching Timeliness Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012.

    "When Prefetching Works, When It Doesn’t, and Why." ACM Trans. Archit. Code Optim. 9, 1, Article 2. 209
  181. Prefetches Classification Huaiyu Zhu, Yong Chen, and Xian-He Sun. 2010.

    "Timing local streams: improving timeliness in data prefetching." In Proceedings of the 24th ACM International Conference on Supercomputing (ICS '10). ACM, New York, NY, USA, 169-178. 210
  182. Prefetching I #include <algorithm> #include <chrono> #include <cinttypes> #include <cstddef>

    #include <cstdio> #include <cstdlib> #include <future> #include <iterator> #include <memory> #include <random> struct point { double x, y, z; }; using T = point; 211
  183. Prefetching II struct timing_result { double duration_initial; double duration_non_prefetched; double

    duration_degree; double sum_initial; double sum_non_prefetched; double sum_degree; }; timing_result chase(std::size_t n, bool shuffle, std::size_t d, bool prefetch) { timing_result chase_result; std::vector<std::unique_ptr<T>> v; for (std::size_t i = 0; i != n; ++i) { v.emplace_back(new point{1. * i, 2. * i, 5.* i}); } if (shuffle) { std::mt19937 g(1); 212
  184. Prefetching III std::shuffle(begin(v), end(v), g); } double sum = 0.0;

    auto time_start = std::chrono::steady_clock::now(); if (prefetch) { for (std::size_t i = 0; i != n; ++i) { __builtin_prefetch(v[i + d].get()); sum += std::exp(-v[i]->y); } } else { for (std::size_t i = 0; i != n; ++i) { sum += std::exp(-v[i]->y); } } auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; chase_result.duration_initial = duration.count(); chase_result.sum_initial = sum; 213
  185. Prefetching IV sum = 0.0; time_start = std::chrono::steady_clock::now(); for (std::size_t

    i = 0; i != n; ++i) { sum += std::exp(-v[i]->y); } time_end = std::chrono::steady_clock::now(); duration = time_end - time_start; chase_result.duration_non_prefetched = duration.count(); chase_result.sum_non_prefetched = sum; sum = 0.0; time_start = std::chrono::steady_clock::now(); for (std::size_t i = 0; i != n; ++i) { __builtin_prefetch(v[i + d].get()); __builtin_prefetch(v[i + 2*d].get()); sum += std::exp(-v[i]->y); } time_end = std::chrono::steady_clock::now(); 214
  186. Prefetching V duration = time_end - time_start; chase_result.duration_degree = duration.count();

    chase_result.sum_degree = sum; return chase_result; } int main(int argc, char * argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100; const bool shuffle = (argc > 2) ? std::atoi(argv[2]) : false; const std::size_t d = (argc > 3) ? std::atoll(argv[3]) : 3; const bool prefetch = (argc > 4) ? std::atoi(argv[4]) : false; const std::size_t threads_count = (argc > 5) ? std::atoll(argv[5]) : 4; printf("size: %zu \n", n); printf("shuffle: %d \n", shuffle); printf("distance: %zu \n", d); printf("prefetch: %d \n", prefetch); 215
  187. Prefetching VI printf("threads_count: %zu \n", threads_count); const auto thread_work =

    [n, shuffle, d, prefetch]() { return chase(n, shuffle, d, prefetch); }; std::vector<std::future<timing_result>> results; for (std::size_t thread = 0; thread != threads_count; ++thread) results.emplace_back(std::async(std::launch::async, thread_work)); for (auto && future_result : results) if (future_result.valid()) future_result.wait(); std::vector<double> timings_initial, timings_non_prefetched, timings_degree; for (auto && future_result : results) { timing_result chase_result = future_result.get(); timings_initial.push_back(chase_result.duration_initial); 216
  188. Prefetching VII timings_non_prefetched.push_back(chase_result.duration_non_prefetched); timings_degree.push_back(chase_result.duration_degree); } const auto timings_initial_minmax = std::minmax_element(begin(timings_initial),

    end(timings_initial)); const auto timings_non_prefetched_minmax = std::minmax_element(begin(timings_non_prefetched), end(timings_non_prefetched)); const auto timings_degree_minmax = std::minmax_element(begin(timings_degree), end(timings_degree)); printf(prefetch ? "prefetched" : "non-prefetched"); printf(" initial duration: [%g, %g] \n", *timings_initial_minmax.first, *timings_initial_minmax.second); printf("non-prefetched duration: [%g, %g] \n", *timings_non_prefetched_minmax.first, *timings_non_prefetched_minmax.second); printf("degree-two prefetching duration: [%g, %g] \n", *timings_degree_minmax.first, *timings_degree_minmax.second); } 217
  189. Prefetch Overhead S. Van der Wiel and D. Lilja, "A

    Survey of Data Prefetching Techniques," Technical Report No. HPPC 96-05, University of Minnesota, October 1996. 218
  190. Prefetching Timings: No Prefetch $ likwid-perfctr -f -C 0-3 -g

    L3 -m ./prefetch 100000 1 0 0 4 distance: 0 prefetch: 0 non-prefetched initial duration: [0.00280393, 0.00289815] non-prefetched duration: [0.00254968, 0.00257311] degree-two prefetching duration: [0.00290615, 0.00296243] Region chase_initial, Group 1: L3 | CPI STAT | 5.8641 | 1.4529 | 1.4744 | 1.4660 | | L3 bandwidth [MBytes/s] STAT | 10733.6308 | 2666.0364 | 2710.9325 | 2683.4077 | Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | L3 miss rate STAT | 0.0584 | 0.0145 | 0.0148 | 0.0146 | | L3 miss ratio STAT | 3.7723 | 0.9117 | 0.9789 | 0.9431 | $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 0 0 4 | Cycles without execution [%] STAT | 228.2316 | 56.8136 | 57.4443 | 57.0579 | | Cycles without execution [%] STAT | 227.0385 | 56.5980 | 57.0024 | 56.7596 | 219
  191. Prefetching Timings: useless 0-distance prefetch (overhead) $ likwid-perfctr -f -C

    0-3 -g L3 -m ./prefetch 100000 1 0 1 4 distance: 0 prefetch: 1 prefetched initial duration: [0.00288751, 0.00295978] non-prefetched duration: [0.0025575, 0.00258342] degree-two prefetching duration: [0.00285772, 0.00287839] Region chase_initial, Group 1: L3 | CPI STAT | 5.7454 | 1.4345 | 1.4387 | 1.4364 | | L3 bandwidth [MBytes/s] STAT | 10518.6383 | 2618.5405 | 2645.6096 | 2629.6596 | 220
  192. Prefetching Timings: 1-distance prefetch (mostly overhead) $ likwid-perfctr -f -C

    0-3 -g L3CACHE -m ./prefetch 100000 1 1 1 4 prefetched initial duration: [0.00250957, 0.00257662] non-prefetched duration: [0.00255286, 0.00258417] degree-two prefetching duration: [0.00230482, 0.00235828] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | CPI STAT | 4.9595 | 1.2343 | 1.2433 | 1.2399 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.0889 | 0.4381 | 0.6454 | 0.5222 | $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 1 1 4 | Cycles without execution [%] STAT | 214.1614 | 53.4628 | 53.6716 | 53.5404 | | Cycles without execution [%] STAT | 200.4785 | 50.0405 | 50.1857 | 50.1196 | Formulas: L3 request rate = MEM_LOAD_UOPS_RETIRED_L3_ALL/UOPS_RETIRED_ALL L3 miss rate = MEM_LOAD_UOPS_RETIRED_L3_MISS/UOPS_RETIRED_ALL L3 miss ratio = MEM_LOAD_UOPS_RETIRED_L3_MISS/MEM_LOAD_UOPS_RETIRED_L3_ALL https://github.com/RRZE-HPC/likwid/blob/master/groups/ivybridge/L3CACHE.txt 221
  193. Prefetching Timings: 2-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3CACHE -m ./prefetch 100000 1 2 1 4 size: 100000 shuffle: 1 distance: 2 prefetch: 1 threads_count: 4 prefetched initial duration: [0.0023392, 0.00241287] non-prefetched duration: [0.00257006, 0.00260938] degree-two prefetching duration: [0.00199431, 0.00203528] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | CPI STAT | 4.5557 | 1.1331 | 1.1423 | 1.1389 | | L3 request rate STAT | 0.0006 | 0.0001 | 0.0002 | 0.0002 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.2317 | 0.3138 | 0.6791 | 0.5579 | Region chase_degree, Group 1: L3CACHE | CPI STAT | 3.6990 | 0.9243 | 0.9253 | 0.9248 | | L3 request rate STAT | 0.0005 | 0.0001 | 0.0002 | 0.0001 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.0145 | 0.3597 | 0.6550 | 0.5036 | 222
  194. Prefetching Timings: 8-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3CACHE -m ./prefetch 100000 1 8 1 4 size: 100000 shuffle: 1 distance: 8 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00181161, 0.00188783] non-prefetched duration: [0.00257601, 0.0026076] degree-two prefetching duration: [0.00152468, 0.00156814] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | Runtime (RDTSC) [s] STAT | 0.0065 | 0.0016 | 0.0017 | 0.0016 | | CPI STAT | 3.4808 | 0.8650 | 0.8788 | 0.8702 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.2431 | 0.4694 | 0.6640 | 0.5608 | Region chase_degree, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | Runtime (RDTSC) [s] STAT | 0.0053 | 0.0013 | 0.0014 | 0.0013 | | CPI STAT | 2.7450 | 0.6832 | 0.6882 | 0.6863 | | L3 miss rate STAT | 0.0016 | 0.0004 | 0.0004 | 0.0004 | | L3 miss ratio STAT | 3.4045 | 0.7778 | 0.9346 | 0.8511 | 223
  195. Prefetching Timings: 8-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3 -m ./prefetch 100000 1 8 1 4 size: 100000 shuffle: 1 distance: 8 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00180738, 0.00189831] non-prefetched duration: [0.00254486, 0.00258013] degree-two prefetching duration: [0.00154542, 0.00158065] Region chase_initial, Group 1: L3 | CPI STAT | 3.5027 | 0.8668 | 0.8835 | 0.8757 | | L3 bandwidth [MBytes/s] STAT | 17384.8731 | 4296.5905 | 4381.7164 | 4346.2183 | Region chase_degree, Group 1: L3 | Metric | Sum | Min | Max | Avg | | CPI STAT | 2.7626 | 0.6894 | 0.6919 | 0.6906 | | L3 bandwidth [MBytes/s] STAT | 21505.6670 | 5333.6653 | 5396.4473 | 5376.4168 $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 8 1 4 | Cycles without execution [%] STAT | 187.6689 | 46.3938 | 47.3055 | 46.9172 | | Cycles without execution [%] STAT | 151.5095 | 37.6872 | 38.0656 | 37.8774 | 224
  196. Prefetching Timings: suboptimal (untimely) prefetch $ likwid-perfctr -f -C 0-3

    -g L3 -m ./prefetch 100000 1 512 1 4 size: 100000 shuffle: 1 distance: 512 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00177956, 0.00186644] non-prefetched duration: [0.00257188, 0.0026064] degree-two prefetching duration: [0.00173249, 0.00178712] Region chase_initial, Group 1: L3 | CPI STAT | 3.4343 | 0.8523 | 0.8683 | 0.8586 | | L3 data volume [GBytes] STAT | 0.0293 | 0.0073 | 0.0074 | 0.0073 | Region chase_degree, Group 1: L3 | Metric | Sum | Min | Max | Avg | | CPI STAT | 3.1891 | 0.7903 | 0.8034 | 0.7973 | | L3 bandwidth [MBytes/s] STAT | 19902.4764 | 4954.4107 | 5013.4006 | 4975.6191 | 225
  197. Gem5 - std::vector & std::list I Filling with numbers -

    std::vector vs. std::list Machine code & assembly (std::vector) Micro-ops execution breakdown (std::vector) Assembly is Too High Level: http://xlogicx.net/?p=369 227
  198. Gem5 - std::vector & std::list III Pipeline diagram - one

    iteration (std::vector) Pipeline diagram - three iterations (std::vector) 229
  199. Gem5 - std::vector & std::list IV Machine code & assembly

    (std::list) heap allocation in the loop @ 400d85 what could possibly go wrong? 230
  200. (The GNU C library's) malloc https://sourceware.org/glibc/wiki/MallocInternals Arena A structure that

    is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are "free". Threads assigned to each arena will allocate memory from that arena's free lists. Glibc Heap Analysis in Linux Systems with Radare2 https://youtube.com/watch?v=Svm5V4leEho r2con-2016 - rada.re/con/ 235
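As a small, glibc-specific sketch (not from the talk; assumes <malloc.h> and the glibc function malloc_stats() are available), one can let two threads allocate and free, then dump the per-arena statistics the allocator maintains:

#include <cstdlib>
#include <malloc.h>
#include <thread>
#include <vector>

void churn() {
    // each thread allocates and frees a batch of small chunks from "its" arena
    std::vector<void *> blocks;
    for (int i = 0; i != 1000; ++i) blocks.push_back(std::malloc(128));
    for (void * p : blocks) std::free(p);
}

int main() {
    std::thread t1(churn), t2(churn);
    t1.join();
    t2.join();
    malloc_stats();   // glibc-specific: dumps per-arena heap statistics to stderr
}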
  201. malloc & free - new, new[], delete, delete[] int main()

    { double * a = new double[8]; double * b = new double[8]; delete[] b; delete[] a; double * c = new double[8]; delete[] c; } 236
  202. Memory Access Patterns: Temporal & Spatial Locality horizontal axis -

    time vertical axis - address D. J. Hatfield and J. Gerald. "Program restructuring for virtual memory." IBM Systems Journal, 10(3):168–192, 1971. 243
  203. Loop Fusion 0.429504s (unfused) down to 0.287501s (fused) g++ -Ofast

    -march=native (5.2.0) void unfused(double * a, double * b, double * c, double * d, size_t N) { for (size_t i = 0; i != N; ++i) a[i] = b[i] * c[i]; for (size_t i = 0; i != N; ++i) d[i] = a[i] * c[i]; } void fused(double * a, double * b, double * c, double * d, size_t N) { for (size_t i = 0; i != N; ++i) { a[i] = b[i] * c[i]; d[i] = a[i] * c[i]; } } 244
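The timings quoted above are from the talk; a minimal driver along the following lines (a sketch, with an assumed array size of 2^24 elements and the unfused/fused definitions taken from the slide) can be used to reproduce the comparison:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// definitions as on the slide
void unfused(double * a, double * b, double * c, double * d, std::size_t N);
void fused(double * a, double * b, double * c, double * d, std::size_t N);

int main() {
    const std::size_t N = std::size_t(1) << 24;   // assumed size, not from the slide
    std::vector<double> a(N), b(N, 2.0), c(N, 3.0), d(N);
    auto time = [&](auto f) {
        const auto t0 = std::chrono::steady_clock::now();
        f(a.data(), b.data(), c.data(), d.data(), N);
        const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        return dt.count();
    };
    std::printf("unfused: %gs\n", time(unfused));
    std::printf("fused:   %gs\n", time(fused));
}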
  204. Pin - A Dynamic Binary Instrumentation Tool http://www.intel.com/software/pintool pin -t

    $PIN_ROOT/source/tools/ManualExamples/obj-intel64/pinatrace.so -- ./loop_fusion . . . 0x400e43,R,0x401c48 0x400e59,R,0x401d40 0x400e65,W,0x1c789c0 0x400e65,W,0x1c789e0 . . . r-project.org rstudio.com ggplot2.org rcpp.org 245
  205. Takeaway: Overlapping Latencies as a General Principle Overlapping latencies also

    works on a "macro" scale • load as "get the data from the Internet" • compute as "process the data" Another example: Communication Avoiding and Overlapping for Numerical Linear Algebra • https://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-65.html • http://www.cs.berkeley.edu/~egeor/sc12_slides_final.pdf 252
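The next two slides quantify this for per-symbol downloads. As a rough sketch of the idea (the fetch/process helpers below are hypothetical stubs, not the talk's code), the per-symbol latencies can be overlapped by launching the work asynchronously instead of serializing it:

#include <future>
#include <string>
#include <vector>

// stubs standing in for "get the data from the Internet" and "process the data"
std::string fetch(const std::string & symbol) { return symbol; }
double process(const std::string & data) { return static_cast<double>(data.size()); }

std::vector<double> run_overlapped(const std::vector<std::string> & symbols) {
    std::vector<std::future<double>> futures;
    futures.reserve(symbols.size());
    for (const auto & s : symbols)                 // launch all fetch+process tasks up front
        futures.emplace_back(std::async(std::launch::async,
                                        [s] { return process(fetch(s)); }));
    std::vector<double> results;
    results.reserve(futures.size());
    for (auto & f : futures)
        results.push_back(f.get());                // latencies now overlap across symbols
    return results;
}

int main() {
    const std::vector<std::string> symbols{"AAPL", "AXP", "BA"};
    return run_overlapped(symbols).size() == symbols.size() ? 0 : 1;
}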
  206. Non-Overlapped Timings id,symbol,count,time 1,AAPL,565449,1.59043 2,AXP,731366,3.43745 3,BA,867366,5.40218 4,CAT,830327,7.08103 5,CSCO,400440,8.49192 6,CVX,687198,9.98761 7,DD,910932,12.2254

    8,DIS,910430,14.058 9,GE,871676,15.8333 10,GS,280604,17.059 11,HD,556611,18.2738 12,IBM,860071,20.3876 13,INTC,559127,21.9856 14,JNJ,724724,25.5534 15,JPM,500473,26.576 16,KO,864903,28.5405 17,MCD,717021,30.087 18,MMM,698996,31.749 19,MRK,733948,33.2642 20,MSFT,475451,34.3134 21,NKE,556344,36.4545 253
  207. Overlapped Timings id,symbol,count,time 1,AAPL,565449,2.00713 2,AXP,731366,2.09158 3,BA,867366,2.13468 4,CAT,830327,2.19194 5,CSCO,400440,2.19197 6,CVX,687198,2.19198 7,DD,910932,2.51895

    8,DIS,910430,2.51898 9,GE,871676,2.51899 10,GS,280604,2.519 11,HD,556611,2.51901 12,IBM,860071,2.51902 13,INTC,559127,2.51902 14,JNJ,724724,2.51903 15,JPM,500473,2.51904 16,KO,864903,2.51905 17,MCD,717021,2.51906 18,MMM,698996,2.51907 19,MRK,733948,2.51908 20,MSFT,475451,2.51908 21,NKE,556344,2.51909 254
  208. Cache Misses, MLP, and STC: Slack R. Das et al.,

    "Aérgia: Exploiting Packet Latency Slack in On-Chip Networks," Proc. 37th Ann. Int’l Symp. Computer Architecture (ISCA 10), ACM Press, 2010. 258
  209. Dependent Cache Misses - Non-Overlapped - Serialized A Day in

    the Life of a Cache Miss Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and James E. Smith, "A Performance Counter Architecture for Computing Accurate CPI Components", ASPLOS 2006, pp. 175-184. 1. load instruction enters the window (ROB) 2. the load issues from the instruction buffer (RS) 3. the load blocks the ROB head 4. ROB eventually fills 5. dispatch stops, instruction window drains 6. eventually issue and commit stop
  210. Independent Cache Misses in ROB - Overlapped Stijn Eyerman, Lieven

    Eeckhout, Tejas Karkhanis, and James E. Smith, "A Top-Down Approach to Architecting CPI Component Performance Counters", IEEE Micro, Special Issue on Top Picks from 2006 Microarchitecture Conferences, Vol 27, No 1, pp. 84-93. 260
  211. Miss-Dependent Mispredicted Branch - Penalties Serialization S. Eyerman, J.E. Smith

    and L. Eeckhout, "Characterizing the branch misprediction penalty", Performance Analysis of Systems and Software 2006 IEEE International Symposium on 2006, pp. 48-58. 261
  212. Dependent Cache Misses - Non-Overlapped - Serialized Milad Hashemi, Khubaib,

    Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. "Accelerating Dependent Cache Misses with an Enhanced Memory Controller." In ISCA, 2016. 262
  213. Independent Misses Connected by a Pending Cache Hit • MLP

    - supported by non-blocking caches, out-of-order execution • multiple outstanding cache-misses - Miss Status Holding Registers (MSHRs) / Line Fill Buffers (LFBs) • MSHR file entries - merging redundant (same cache line) memory requests Xi E. Chen and Tor M. Aamodt. 2008. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41). 263
  214. Independent Misses Connected by a Pending Cache Hit Xi E.

    Chen and Tor M. Aamodt. 2011. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 264
  215. Finite MSHRs => Finite MLP Xi E. Chen and Tor

    M. Aamodt. 2011. "Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs." ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 265
  216. Cache Miss Penalty: Leading Edge & Trailing Edge "The End

    of Scaling? Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node," Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 266
  217. Cache Miss Penalty: Bandwidth Utilization Impact "The End of Scaling?

    Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node," Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 267
  218. Memory Capacity & Multicore Processors Memory utilization even more important

    - contention for capacity & bandwidth! "Disaggregated Memory Architectures for Blade Servers," Kevin Te-Ming Lim, Ph.D. Thesis, The University of Michigan, 2010. 268
  219. Multicore: Sequential / Parallel Execution Model L. Yavits, A. Morad,

    R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 269
  220. Multicore: Amdahl's Law, Strong Scaling "Reevaluating Amdahl's Law," John L.

    Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 270
  221. Multicore: Gustafson's Law, Weak Scaling "Reevaluating Amdahl's Law," John L.

    Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 271
  222. Amdahl's Law Optimistic Assumes perfect parallelism of the parallel portion:

    Only Serial Bottlenecks, No Parallel Bottlenecks Counterpoint: https://blogs.msdn.microsoft.com/ddperf/2009/04/29/parallel-scalability-isnt-childs-play-part-2-amdahls-law-vs- gunthers-law/ 272
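For the two scaling laws cited on the preceding slides, a small sketch (with an illustrative parallel fraction p = 0.95, not from the talk) makes the difference concrete: Amdahl's strong-scaling speedup saturates at 1/(1 − p), while Gustafson's weak-scaling speedup keeps growing with N:

#include <cstdio>

// Amdahl (strong scaling):   S(N) = 1 / ((1 - p) + p / N)
double amdahl(double p, double N)    { return 1.0 / ((1.0 - p) + p / N); }
// Gustafson (weak scaling):  S(N) = (1 - p) + p * N
double gustafson(double p, double N) { return (1.0 - p) + p * N; }

int main() {
    const double p = 0.95;   // assumed parallel fraction, for illustration only
    for (double N : {1.0, 2.0, 4.0, 8.0, 16.0, 64.0, 1024.0})
        std::printf("N = %6.0f  Amdahl = %6.2f  Gustafson = %7.2f\n",
                    N, amdahl(p, N), gustafson(p, N));
}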
  223. Multicore: Synchronization, Actual Scaling M. A. Suleman, M. K. Qureshi,

    and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 273
  224. Multicore: Communication, Actual Scaling M. A. Suleman, M. K. Qureshi,

    and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 274
  225. Multicore & DRAM: AoS I #include <cstddef> #include <cstdlib> #include

    <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> struct contract { double K; double T; double P; }; using element = contract; using container = std::vector<element>; 275
  226. Multicore & DRAM: AoS II double sum_if(const container & a,

    const container & b, const std::vector<std::size_t> & index) { double sum = 0.0; for (std::size_t i = 0, n = index.size(); i != n; ++i) { std::size_t j = index[i]; if (a[j].K == b[j].K) sum += a[j].K; } return sum; } template <typename F> double average(F f, std::size_t m) { double average = 0.0; for (std::size_t i = 0; i != m; ++i) average += f() / m; return average; } 276
  227. Multicore & DRAM: AoS III std::vector<std::size_t> index_stream(std::size_t n) { std::vector<std::size_t>

    index; index.reserve(n); for (std::size_t i = 0; i != n; ++i) index.push_back(i); return index; } std::vector<std::size_t> index_random(std::size_t n) { std::vector<std::size_t> index; index.reserve(n); std::random_device rd; static std::mt19937 g(rd()); std::uniform_int_distribution<std::size_t> u(0, n - 1); for (std::size_t i = 0; i != n; ++i) index.push_back(u(g)); return index; } 277
  228. Multicore & DRAM: AoS IV int main(int argc, char *

    argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10; std::cout << "n = " << n << '\n'; std::cout << "m = " << m << '\n'; const std::size_t threads_count = 4; // thread access locality type // 0: none (default); 1: stream; 2: random std::vector<std::size_t> thread_type(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { thread_type[thread] = (argc > 3 + thread) ? std::atoll(argv[3 + thread]) : 0; std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n'; } 278
  229. Multicore & DRAM: AoS V endl(std::cout); std::vector<std::vector<std::size_t>> index(threads_count); for (std::size_t

    thread = 0; thread != threads_count; ++thread) { index[thread].resize(n); if (thread_type[thread] == 1) index[thread] = index_stream(n); else if (thread_type[thread] == 2) index[thread] = index_random(n); } const container v1(n, {1.0, 0.5, 3.0}); const container v2(n, {1.0, 2.0, 1.0}); const auto thread_work = [m, &v1, &v2](const auto & thread_index) { const auto f = [&v1, &v2, &thread_index] { return sum_if(v1, v2, thread_index); }; return average(f, m); }; 279
  230. Multicore & DRAM: AoS VI boost::timer::auto_cpu_timer timer; std::vector<std::future<double>> results; results.reserve(threads_count);

    for (std::size_t thread = 0; thread != threads_count; ++thread) { results.emplace_back(std::async(std::launch::async, [thread, &thread_work, &index] { return thread_work(index[thread]); })); } for (auto && result : results) if (result.valid()) result.wait(); for (auto && result : results) std::cout << result.get() << '\n'; } 280
  231. Multicore & DRAM: AoS Timings 1 thread, sequential access $

    ./DRAM_CMP 10000000 10 1 n = 10000000 m = 10 thread_type[0] = 1 1e+007 0.395408s wall, 0.406250s user + 0.000000s system = 0.406250s CPU (102.7%) 281
  232. Multicore & DRAM: AoS Timings 1 thread, random access $

    ./DRAM_CMP 10000000 10 2 n = 10000000 m = 10 thread_type[0] = 2 1e+007 5.348314s wall, 5.343750s user + 0.000000s system = 5.343750s CPU (99.9%) 282
  233. Multicore & DRAM: AoS Timings 4 threads, sequential access $

    ./DRAM_CMP 10000000 10 1 1 1 1 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 1 1e+007 1e+007 1e+007 1e+007 0.508894s wall, 2.000000s user + 0.000000s system = 2.000000s CPU (393.0%) 283
  234. Multicore & DRAM: AoS Timings 4 threads: 3 sequential access

    + 1 random access $ ./DRAM_CMP 10000000 10 1 1 1 2 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 2 1e+007 1e+007 1e+007 1e+007 5.666049s wall, 7.265625s user + 0.000000s system = 7.265625s CPU (128.2%) 284
  235. Multicore & DRAM: AoS Timings Memory Access Patterns & Multicore:

    Interactions Matter Inter-thread Interference Sharing - Contention - Interference - Slowdown Threads using a shared resource (such as the on-chip/off-chip interconnects and memory) contend for it, interfering with each other's progress; the result is a slowdown, and thus diminishing or even negative returns from increasing the thread count. cf. Thomas Moscibroda and Onur Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," Microsoft Research Technical Report, MSR-TR-2007-15, February 2007. 285
  236. Multicore & DRAM: SoA I #include <cstddef> #include <cstdlib> #include

    <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> // SoA (structure-of-arrays) struct data { std::vector<double> K; std::vector<double> T; std::vector<double> P; }; 286
  237. Multicore & DRAM: SoA II double sum_if(const data & a,

    const data & b, const std::vector<std::size_t> & index) { double sum = 0.0; for (std::size_t i = 0, n = index.size(); i != n; ++i) { std::size_t j = index[i]; if (a.K[j] == b.K[j]) sum += a.K[j]; } return sum; } template <typename F> double average(F f, std::size_t m) { double average = 0.0; for (std::size_t i = 0; i != m; ++i) { average += f() / m; } 287
  238. Multicore & DRAM: SoA III return average; } std::vector<std::size_t> index_stream(std::size_t

    n) { std::vector<std::size_t> index; index.reserve(n); for (std::size_t i = 0; i != n; ++i) index.push_back(i); return index; } std::vector<std::size_t> index_random(std::size_t n) { std::vector<std::size_t> index; index.reserve(n); std::random_device rd; static std::mt19937 g(rd()); std::uniform_int_distribution<std::size_t> u(0, n - 1); 288
  239. Multicore & DRAM: SoA IV for (std::size_t i = 0;

    i != n; ++i) index.push_back(u(g)); return index; } int main(int argc, char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10; std::cout << "n = " << n << '\n'; std::cout << "m = " << m << '\n'; const std::size_t threads_count = 4; // thread access locality type // 0: none (default); 1: stream; 2: random std::vector<std::size_t> thread_type(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { 289
  240. Multicore & DRAM: SoA V thread_type[thread] = (argc > 3

    + thread) ? std::atoll(argv[3 + thread]) : 0; std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n'; } endl(std::cout); std::vector<std::vector<std::size_t>> index(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { index[thread].resize(n); if (thread_type[thread] == 1) index[thread] = index_stream(n); else if (thread_type[thread] == 2) index[thread] = index_random(n); } data v1; v1.K.resize(n, 1.0); v1.T.resize(n, 0.5); v1.P.resize(n, 3.0); 290
  241. Multicore & DRAM: SoA VI data v2; v2.K.resize(n, 1.0); v2.T.resize(n,

    2.0); v2.P.resize(n, 1.0); const auto thread_work = [m, &v1, &v2](const auto & thread_index) { const auto f = [&v1, &v2, &thread_index] { return sum_if(v1, v2, thread_index); }; return average(f, m); }; 291
  242. Multicore & DRAM: SoA VII boost::timer::auto_cpu_timer timer; std::vector<std::future<double>> results; results.reserve(threads_count);

    for (std::size_t thread = 0; thread != threads_count; ++thread) { results.emplace_back(std::async(std::launch::async, [thread, &thread_work, &index] { return thread_work(index[thread]); })); } for (auto && result : results) if (result.valid()) result.wait(); for (auto && result : results) std::cout << result.get() << '\n'; } 292
  243. Multicore & DRAM: SoA Timings 1 thread, sequential access $

    ./DRAM_CMP.SoA 10000000 10 1 n = 10000000 m = 10 thread_type[0] = 1 1e+007 0.211877s wall, 0.203125s user + 0.000000s system = 0.203125s CPU (95.9%) 293
  244. Multicore & DRAM: SoA Timings 1 thread, random access $

    ./DRAM_CMP.SoA 10000000 10 2 n = 10000000 m = 10 thread_type[0] = 2 1e+007 4.534646s wall, 4.546875s user + 0.000000s system = 4.546875s CPU (100.3%) 294
  245. Multicore & DRAM: SoA Timings 4 threads, sequential access $

    ./DRAM_CMP.SoA 10000000 10 1 1 1 1 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 1 1e+007 1e+007 1e+007 1e+007 0.256391s wall, 1.031250s user + 0.000000s system = 1.031250s CPU (402.2%) 295
  246. Multicore & DRAM: SoA Timings 4 threads: 3 sequential access

    + 1 random access $ ./DRAM_CMP.SoA 10000000 10 1 1 1 2 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 2 1e+007 1e+007 1e+007 1e+007 4.581033s wall, 5.265625s user + 0.000000s system = 5.265625s CPU (114.9%) 296
  247. Multicore & DRAM: SoA Timings Better Access Patterns yield Better

    Single-core Performance but also Reduced Interference and thus Better Multi-core Performance 297
  248. Multicore: Arithmetic Intensity L. Yavits, A. Morad, R. Ginosar, The

    effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 298
  249. Multicore: Synchronization & Connectivity Intensity L. Yavits, A. Morad, R.

    Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 299
  250. Speedup: Synchronization and Connectivity Bottlenecks f: parallelizable fraction f1 :

    connectivity intensity f2 : synchronization intensity L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 300
  251. Speedup: Synchronization & Connectivity Bottlenecks Speedup - affected by sequential-to-parallel

    data synchronization and inter-core communication. L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl's law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 301
  252. Partitioning-Sharing Tradeoffs Butler W. Lampson. 1983. "Hints for computer system

    design." In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP '83). ACM, New York, NY, USA, 33-48. 302
  253. Shared Resource: DRAM Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni.

    "PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms," IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), 2014. https://github.com/heechul/palloc 303
  254. Shared Resource: MSHRs Heechul Yun, Rodolfo Pellizzoni, and Prathap Kumar

    Valsan. 2015. "Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems." In Proceedings of the 2015 27th Euromicro Conference on Real-Time Systems (ECRTS '15). 304
  255. Partitioning Multithreading • Thread affinity • POSIX: sched_getcpu, pthread_setaffinity_np •

    http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/ • https://github.com/RRZE-HPC/likwid/blob/master/groups/skylake/FALSE_SHARE.txt • Local LLC false sharing rate = MEM_LOAD_L3_HIT_RETIRED_XSNP_HITM / MEM_INST_RETIRED_ALL • NUMA: Remote Memory Accesses (RMA), Local Memory Accesses (LMA), RMA/LMA ratio • https://01.org/numatop/ • https://github.com/01org/numatop 305
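As a minimal sketch (Linux/glibc only, not from the slides; the choice of CPU 0 is arbitrary), pinning the calling thread to one CPU with pthread_setaffinity_np and checking the placement with sched_getcpu might look like this:

// Compile with: g++ -O2 -pthread (pthread_setaffinity_np and sched_getcpu are GNU extensions).
#include <pthread.h>
#include <sched.h>
#include <iostream>
#include <thread>

// Pin the calling thread to a single CPU; returns true on success.
bool pin_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
  std::thread t([] {
    if (pin_to_cpu(0))  // hypothetical choice: pin the worker to CPU 0
      std::cout << "worker running on CPU " << sched_getcpu() << '\n';
  });
  t.join();
}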
  256. Cache Partitioning: Index-Based & Way-Based Giovani Gracioli, Ahmed Alhammad, Renato

    Mancuso, Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems." ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 306
  257. Cache Partitioning: CPU Support Giovani Gracioli, Ahmed Alhammad, Renato Mancuso,

    Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. "A Survey on Cache Management Mechanisms for Real-Time Embedded Systems." ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 307
  258. Cache Partitioning & Intel: CAT & CMT Cache Monitoring Technology

    and Cache Allocation Technology https://github.com/01org/intel-cmt-cat A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, “Cache QoS: From concept to reality in the Intel Xeon processor E5-2600 v3 product family,” in Intl. Symp. on High Performance Computer Architecture (HPCA), Mar. 2016. 308
  259. Cache Partitioning != Cache Access Timing Isolation H. Yun and

    P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 309
  260. Cache Partitioning != Cache Access Timing Isolation H. Yun and

    P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 310
  261. Cache Partitioning != Cache Access Timing Isolation H. Yun and

    P. Valsan, "Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms", International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 311
  262. Cache Partitioning != Cache Access Timing Isolation https://github.com/CSL-KU/IsolBench Prathap Kumar

    Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 312
  263. Cache Partitioning != Cache Access Timing Isolation • Shared: MSHRs

    (Miss information/Status Holding Registers) / LFBs (Line Fill Buffers) • Contention => cache space partitioning != cache access timing isolation Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 313
  264. Cache Partitioning != Cache Access Timing Isolation • multiple MSHRs

    support multiple outstanding cache-misses • the number of MSHRs determines the MLP of the cache • local MLP - outstanding misses one core can generate • global MLP - parallelism of the entire shared memory hierarchy (i.e., shared LLC and DRAM) • "the aggregated parallelism of the cores (the sum of local MLP) exceeds the parallelism supported by the shared LLC and DRAM (global MLP) in the out-of-order architectures" Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. "Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems." IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 314
  265. Shared Resource (MSHRs) & Prefetching: Xeon Phi Zhenman Fang, Sanyam

    Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. "Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking." ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 315
  266. Shared Resource (MSHRs) & Prefetching: SNB Zhenman Fang, Sanyam Mehta,

    Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. "Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking." ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 316
  267. Weighted Speedup A. Snavely and D. M. Tullsen, “Symbiotic jobscheduling

    for a simultaneous multithreading processor,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Nov. 2000, pp. 234–244. S. Eyerman and L. Eeckhout, “Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance,” in Computer Architecture Letters, vol. 13, no. 2, 2014. 317
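The weighted speedup metric from the first paper is the sum, over the co-running programs, of each program's IPC when sharing the machine divided by its IPC when running alone. A minimal sketch (the IPC values below are hypothetical measurements):

#include <cassert>
#include <cstddef>
#include <iostream>
#include <vector>

// Weighted speedup = sum_i IPC_shared[i] / IPC_alone[i]
double weighted_speedup(const std::vector<double>& ipc_shared,
                        const std::vector<double>& ipc_alone) {
  assert(ipc_shared.size() == ipc_alone.size());
  double ws = 0.0;
  for (std::size_t i = 0; i != ipc_shared.size(); ++i)
    ws += ipc_shared[i] / ipc_alone[i];
  return ws;
}

int main() {
  // Hypothetical: two programs, each slowed down when co-scheduled.
  std::cout << weighted_speedup({1.2, 0.6}, {1.5, 1.0}) << '\n';  // 0.8 + 0.6 = 1.4
}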
  268. The Number of Cycles Sam Van den Steen; Stijn Eyerman;

    Sander De Pestel; Moncef Mechri; Trevor E. Carlson; David Black-Schaffer; Erik Hagersten; Lieven Eeckhout, “Analytical Processor Performance and Power Modeling using Micro-Architecture Independent Characteristics,” Transactions on Computers (TC) 2016. C - #cycles, N - #instructions, Deff - effective dispatch rate, mbpred - #branch mispredictions, cres - branch resolution time, cfe - front-end pipeline depth, mILi - #instruction fetch misses at each level i in the cache hierarchy, cLi - access latency to each cache level, ROB - size of the Reorder Buffer, mLLC - #LLC load misses, cmem - memory access time, cbus - memory bus transfer and waiting time, MLP - amount of memory-level parallelism, PhLLC - LLC hit chain penalty 318
  269. Cache-aware Roofline model "Cache-aware Roofline model: Upgrading the loft." Aleksandar

    Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 321
  270. Cache-aware Roofline model "Cache-aware Roofline model: Upgrading the loft." Aleksandar

    Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 322
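The underlying roofline bound is simple: attainable performance is the smaller of peak compute throughput and memory bandwidth times arithmetic (operational) intensity; the cache-aware variant adds one bandwidth ceiling per level of the memory hierarchy. A minimal single-level sketch (the peak and bandwidth numbers are hypothetical machine parameters):

#include <algorithm>
#include <initializer_list>
#include <iostream>

// Classic roofline bound: attainable GFLOP/s = min(peak, bandwidth * intensity).
double roofline(double peak_gflops, double bw_gb_per_s, double intensity_flop_per_byte) {
  return std::min(peak_gflops, bw_gb_per_s * intensity_flop_per_byte);
}

int main() {
  const double peak = 100.0, bw = 25.0;    // hypothetical: 100 GFLOP/s, 25 GB/s
  for (double i : {0.25, 1.0, 4.0, 16.0})  // arithmetic intensity in FLOP/byte
    std::cout << "I = " << i << "  bound = " << roofline(peak, bw, i) << " GFLOP/s\n";
}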
  271. Roofline Model: Microarchitectural Bottlenecks "Extending the Roofline Model: Bottleneck Analysis

    with Microarchitectural Constraints." Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 323
  272. Roofline Model: Microarchitectural Bottlenecks "Extending the Roofline Model: Bottleneck Analysis

    with Microarchitectural Constraints." Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 324
  273. C++ Standards: C++11 & C++14 Atomic Operations & Concurrent Memory

    Model http://en.cppreference.com/w/cpp/atomic http://github.com/MattPD/cpplinks/blob/master/atomics.lockfree.memory_model.md "The C11 and C++11 Concurrency Model" by Mark John Batty: http://www.cl.cam.ac.uk/~mjb220/thesis/ Move semantics https://isocpp.org/wiki/faq/cpp11-language#rval http://thbecker.net/articles/rvalue_references/section_01.html http://kholdstare.github.io/technical/2013/11/23/moves-demystified.html scoped_allocator (stateful allocators support) https://isocpp.org/wiki/faq/cpp11-library#scoped-allocator http://en.cppreference.com/w/cpp/header/scoped_allocator https://accu.org/content/conf2012/JonathanWakely-CXX11_allocators.pdf https://accu.org/content/conf2013/Frank_Birbacher_Allocators.r210article.pdf 325
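For instance, a release/acquire pair is enough to publish plain data from one thread to another under the C++11 memory model; a minimal sketch (not from the slides):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain data published by the producer
std::atomic<bool> ready{false};  // synchronization flag

int main() {
  std::thread producer([] {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // release: publish payload
  });
  std::thread consumer([] {
    while (!ready.load(std::memory_order_acquire)) {}  // acquire: wait for publication
    assert(payload == 42);  // guaranteed visible -- no data race
  });
  producer.join();
  consumer.join();
}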
  274. C++ Standards: C++11, C++14, and C++17 reducing the need for

    conditional compilation via macros and template metaprogramming constexpr https://isocpp.org/wiki/faq/cpp11-language#cpp11-constexpr https://isocpp.org/wiki/faq/cpp14-language#extended-constexpr if constexpr http://en.cppreference.com/w/cpp/language/if#Constexpr_If 326
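A minimal sketch of both (compile with -std=c++17; the function names are illustrative only): constexpr moves the computation to compile time, and if constexpr discards the untaken branch so only the chosen one is instantiated:

#include <cstddef>
#include <iostream>
#include <string>
#include <type_traits>

// C++11/14 constexpr: evaluated at compile time when used in a constant expression.
constexpr std::size_t pow2(std::size_t n) { return n == 0 ? 1 : 2 * pow2(n - 1); }
static_assert(pow2(10) == 1024, "computed by the compiler");

// C++17 if constexpr: the rejected branch is not instantiated for the given T.
template <typename T>
std::size_t payload_bytes(const T& x) {
  if constexpr (std::is_trivially_copyable<T>::value)
    return sizeof(x);  // e.g., built-in and POD-like types
  else
    return x.size();   // assumes a container-like interface
}

int main() {
  std::cout << payload_bytes(42) << ' ' << payload_bytes(std::string("hi")) << '\n';
}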
  275. C++17 Standard std::string_view http://en.cppreference.com/w/cpp/string/basic_string_view interoperability with C APIs (e.g., sockets)

    without extra allocations / copies std::aligned_alloc (C11) http://en.cppreference.com/w/cpp/memory/c/aligned_alloc aligned uninitialized storage allocation (vectorization) Hardware interference size http://eel.is/c++draft/hardware.interference http://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size portable cache line size information (e.g., padding to avoid false sharing) Extended allocators & polymorphic memory resources http://en.cppreference.com/w/cpp/memory/polymorphic_allocator http://stackoverflow.com/questions/38010544/polymorphic-allocator-when-and-why-should-i-use-it http://boost.org/doc/libs/release/doc/html/container/extended_functionality.html 327
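A minimal sketch of the false-sharing use case; note that the feature-test guard is an assumption, since not every standard library ships hardware_destructive_interference_size yet, so the code falls back to a common 64-byte line:

#include <atomic>
#include <cstddef>
#include <new>

#if defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t line = std::hardware_destructive_interference_size;
#else
constexpr std::size_t line = 64;  // fallback assumption: 64-byte cache lines
#endif

// Counters updated by different threads: the alignment keeps each counter on
// its own cache line, so the cores do not ping-pong a shared line (false sharing).
struct alignas(line) padded_counter {
  std::atomic<long> value{0};
};

padded_counter a, b;  // a and b no longer share a cache line

int main() {
  a.value.fetch_add(1, std::memory_order_relaxed);
  b.value.fetch_add(1, std::memory_order_relaxed);
}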
  276. C++ Core Guidelines P: Philosophy • P.9: Don't waste time

    or space. Per: Performance • Per.3: Don't optimize something that's not performance critical. • Per.6: Don't make claims about performance without measurements. • Per.7: Design to enable optimization • Per.18: Space is time. • Per.19: Access memory predictably. • Per.30: Avoid context switches on the critical path https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-performance https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#S-performance 328
  277. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency 329
  278. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence 329
  279. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) 329
  280. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? 329
  281. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty 329
  282. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time 329
  283. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality 329
  284. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead 329
  285. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work 329
  286. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict 329
  287. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict • cmov & tradeoffs: converting control dependencies to data dependencies 329
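As one concrete illustration of the last point, compare a branchy reduction with a branchless form that compilers can often lower to cmov or masking (a sketch; whether cmov is actually emitted depends on compiler and flags, and which version wins depends on how predictable the condition is):

#include <cstddef>
#include <iostream>
#include <vector>

// Branchy: cost depends on how predictable (x > t) is; mispredictions flush the pipeline.
long sum_above_branchy(const std::vector<int>& v, int t) {
  long sum = 0;
  for (int x : v)
    if (x > t) sum += x;
  return sum;
}

// Branchless: always performs the add; the control dependence becomes a data dependence,
// trading potential misprediction penalties for unconditional (sometimes useless) work.
long sum_above_branchless(const std::vector<int>& v, int t) {
  long sum = 0;
  for (int x : v)
    sum += (x > t) ? x : 0;
  return sum;
}

int main() {
  std::vector<int> v{3, -1, 7, 0, 5};
  std::cout << sum_above_branchy(v, 0) << ' ' << sum_above_branchless(v, 0) << '\n';  // 15 15
}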
  288. Takeaways Principles Data structures & data layout - fundamental part

    of design CPUs & pervasive forms of parallelism • can support each other: PLP, ILP (MLP!), TLP, DLP Balanced design vs. bottlenecks Overlapping latencies Sharing-contention-interference-slowdown Yale Patt's Phase 2: Break the layers: • break through the hardware/software interface • harness all levels of the transformation hierarchy 330
  289. Phase 2: Harnessing the Transformation Hierarchy Yale N. Patt, Microprocessor

    Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 331
  290. Break the Layers Yale N. Patt, Microprocessor Performance, Phase 2:

    Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 332
  291. Pigeonholing has to go Yale N. Patt at Yale Patt

    75 Visions of the Future Computer Architecture Workshop: "Are you a software person or a hardware person?" I'm a person; this pigeonholing has to go. We must break the layers. Abstractions are great - AFTER you understand what's being abstracted. Yale N. Patt, 2013 IEEE CS Harry H. Goode Award Recipient Interview — https://youtu.be/S7wXivUy-tk Yale N. Patt at Yale Patt 75 Visions of the Future Computer Architecture Workshop — https://youtu.be/x4LH1cJCvxs 333