
Computer Architecture, C++, and High Performance (Meeting C++ 2016)

As the available computational power increases, Nathan Myhrvold's Laws of Software continue to apply: new opportunities enable new applications with increased needs, which subsequently become constrained by the hardware that used to be "modern" at adoption time. C++ itself opens access to high-quality optimizing compilers and a wide ecosystem of high-performance tooling and libraries. At the same time, simply turning on the highest optimization flags and hoping for the best is not going to automagically yield the highest performance -- i.e., the lowest execution time. The reasons are twofold: algorithms' performance can differ in theory -- and that of their implementations can differ even more in practice.

Modern CPU architecture has continued to yield increases in performance through advances in microarchitecture, such as pipelining, multiple issue (superscalar), out-of-order execution, branch prediction, SIMD-within-a-register (SWAR) vector units, and chip multi-processor (CMP, also known as multi-core) architecture. All of these developments provide opportunities for higher peak performance -- while at the same time raising new optimization challenges when actually trying to reach that peak.

In this talk we'll consider the properties of code that can make it either friendly -- or hostile -- to a modern microprocessor. We will offer advice on achieving higher performance, from ways of analyzing it beyond algorithmic complexity, through recognizing the aspects we can entrust to the compiler, to practical optimization of existing code. Instead of stopping at the "you should measure it" advice (which is correct, but incomplete), the talk focuses on practical, hands-on examples of _how_ to actually perform the measurements (presenting tools -- including perf and likwid -- that simplify access to CPU performance monitoring counters) and how to reason about the resulting measurements (informed by an understanding of modern CPU architecture, the generated assembly code, as well as an in-depth look at how the CPU cycles are spent using modern microarchitectural simulation tools) in order to improve the performance of C++ applications.
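As a minimal illustration of the "how to measure" theme, here is a hedged sketch of a repeated-timing harness using only the standard library (the talk itself relies on Nonius, perf, and likwid for the actual measurements; the kernel and sizes below are illustrative assumptions):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Times f() `repeats` times and reports the minimum, usually the least noisy
// summary statistic for a CPU-bound kernel.
template <typename F>
double time_min_seconds(F f, int repeats = 5) {
    double best = 1e300;
    for (int r = 0; r != repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        f();
        const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}

int main() {
    std::vector<double> v(10'000'000, 1.0);
    volatile double sink = 0.0; // keep the compiler from discarding the work
    const double t = time_min_seconds([&] {
        double s = 0.0;
        for (double x : v) s += x;
        sink = s;
    });
    std::printf("min time: %g s\n", t);
}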

Resources:
https://github.com/MattPD/cpplinks/
https://github.com/MattPD/cpplinks/blob/master/assembly.x86.md
https://github.com/MattPD/cpplinks/blob/master/comparch.md
https://github.com/MattPD/cpplinks/blob/master/performance.tools.md

References:
http://www.agner.org/optimize/
https://users.ece.cmu.edu/~omutlu/lecture-videos.html

Matt P. Dziubinski

November 19, 2016
Transcript

  1. Computer Architecture, C++, and High Performance Matt P. Dziubinski Meeting

    C++ 2016 [email protected] // @matt_dz Department of Mathematical Sciences, Aalborg University CREATES (Center for Research in Econometric Analysis of Time Series)
  2. Outline • Performance • Why do we care? • What

    is it? • How to • measure it - reason about it - improve it? 2
  3. Costs and Curves Moore, Gordon E. (1965). ”Cramming more components

    onto integrated circuits”. Electronics Magazine. 4
  4. Cramming more components onto integrated circuits Moore, Gordon E. (1965).

    ”Cramming more components onto integrated circuits”. Electronics Magazine. 5
  5. Transformation Hierarchy Yale N. Patt, Microprocessor Performance, Phase 2: Can

    We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 7
  6. Phase I & The Walls Yale N. Patt, Microprocessor Performance,

    Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 8
  7. CPU Performance Trends [chart: growth in single-processor performance relative to the VAX-11/780,

    with 25%/year, 52%/year, and 22%/year growth phases] Hennessy, John L.; Patterson, David A., 2011, ”Computer Architecture: A Quantitative Approach,” Morgan Kaufmann. 9
  8. Processor-Memory Performance Gap [chart: processor vs. memory performance, 1980–2010]

    The difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access. Hennessy, John L.; Patterson, David A., 2011, ”Computer Architecture: A Quantitative Approach,” Morgan Kaufmann. Computer Architecture is Back: Parallel Computing Landscape https://www.youtube.com/watch?v=On-k-E5HpcQ 11
  9. DRAM Performance Trends D. Lee: ”Reducing DRAM Latency at Low

    Cost by Exploiting Heterogeneity.” http://arxiv.org/abs/1604.08041 (2016) D. Lee et al., ”Tiered-latency DRAM: A low latency and low cost DRAM architecture,” in HPCA, 2013. 12
  10. Emerging Memory Technologies - Further Down The Hierarchy Qureshi et

    al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. 13
  11. Feature Scaling Trends Lee, Yunsup, ”Decoupled Vector-Fetch Architecture with a

    Scalarizing Compiler,” EECS Department, University of California, Berkeley. 2016. http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-82.html 14
  12. Process-Architecture-Optimization Intel’s Annual Report on Form 10-K for the fiscal

    year ended December 26, 2015, filed with the SEC on February 12, 2016. https://www.sec.gov/Archives/edgar/data/50863/000005086316000105/a10kdocument12262015q4.htm 15
  13. Make it fast Butler W. Lampson. 1983. ”Hints for computer

    system design.” In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP ’83). ACM, New York, NY, USA, 33-48. 16
  14. Performance: The Early Days A. Greenbaum and T. Chartier. ”Numerical

    Methods: Design, analysis, and computer implementation of algorithms.” 2010. Course Notes for Short Course on Numerical Analysis. 18
  15. Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), ”On

    the computational complexity of algorithms”, Transactions of the American Mathematical Society 117: 285–306. 19
  16. Algorithms Classification Problem Hartmanis, J.; Stearns, R. E. (1965), ”On

    the computational complexity of algorithms”, Transactions of the American Mathematical Society 117: 285–306. 20
  17. Analysis of Algorithms - Scientific Method Robert Sedgewick and Kevin

    Wayne, ”Algorithms,” 4th Edition, Addison-Wesley Professional, 2011. 21
  18. Analysis of Algorithms - Problem Size N vs. Running Time

    T(N) Robert Sedgewick and Kevin Wayne, ”Algorithms,” 4th Edition, Addison-Wesley Professional, 2011. 22
  19. Analysis of Algorithms - Tilde Notation & Tilde Approximations Robert

    Sedgewick and Kevin Wayne, ”Algorithms,” 4th Edition, Addison-Wesley Professional, 2011. 23
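For reference, the standard definition of the tilde notation as used by Sedgewick & Wayne (stated here rather than quoted from the slide):

    f(N) \sim g(N) \iff \lim_{N \to \infty} \frac{f(N)}{g(N)} = 1, \qquad \text{e.g.} \quad \frac{N(N-1)}{2} \sim \frac{N^2}{2}.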
  20. Analysis of Algorithms - Doubling Ratio Experiments Robert Sedgewick and

    Kevin Wayne, ”Algorithms,” 4th Edition, Addison-Wesley Professional, 2011. 24
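A minimal sketch of a doubling-ratio experiment: assuming a power-law running time T(N) ≈ a·N^b, the ratio T(2N)/T(N) approaches 2^b, so b can be estimated as log2 of the ratio. The kernel below is an illustrative stand-in (an O(N) summation, so the ratio should approach 2), not code from the talk:

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

double work(std::size_t n) {
    std::vector<double> v(n, 1.0);
    return std::accumulate(v.begin(), v.end(), 0.0);
}

double time_seconds(std::size_t n) {
    const auto start = std::chrono::steady_clock::now();
    volatile double sink = work(n);
    (void)sink;
    const std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
    return d.count();
}

int main() {
    double previous = time_seconds(std::size_t(1) << 20);
    for (std::size_t n = std::size_t(1) << 21; n <= (std::size_t(1) << 26); n *= 2) {
        const double current = time_seconds(n);
        std::printf("N = %zu  T(N)/T(N/2) = %.2f  estimated exponent b = %.2f\n",
                    n, current / previous, std::log2(current / previous));
        previous = current;
    }
}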
  21. Find Example - Benchmark (Nonius) Code I #include <algorithm> #include

    <cstddef> #include <cstdint> #include <cstdio> #include <iterator> #include <random> #include <set> #include <vector> #include <boost/container/flat_set.hpp> #include <EASTL/vector_set.h> #include <nonius/nonius.h++> #include <nonius/main.h++> NONIUS_PARAM(size, std::size_t{100u}) NONIUS_PARAM(queries, std::size_t{10u}) 26
  22. Find Example - Benchmark (Nonius) Code II // EASTL //

    https://wuyingren.github.io/howto/2016/02/11/Using-EASTL-in-your-project void* operator new[](size_t size, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } void* operator new[](size_t size, size_t alignment, size_t alignmentOffset, const char* pName, int flags, unsigned debugFlags, const char* file, int line) { return malloc(size); } using T = std::uint32_t; std::vector<T> odd_numbers(std::size_t count) { std::vector<T> result; result.reserve(count); for (std::size_t i = 0; i != count; i++) result.push_back(2 * i + 1); return result; } 27
  23. Find Example - Benchmark (Nonius) Code III template <typename container_type>

    T ctor_and_find(const char * type_name, const std::vector<T> & v, std::size_t q) { std::mt19937 prng(1); const std::size_t n = v.size(); std::uniform_int_distribution<T> uniform(0, 2 * n + 2); const container_type s(begin(v), end(v)); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto it = s.find(uniform(prng)); sum += (it != end(s)) ? *it : 0; } return sum; } 28
  24. Find Example - Benchmark (Nonius) Code IV T ctor_and_find(const char

    * type_name, const std::vector<T> & v_src, std::size_t q) { std::mt19937 prng(1); const std::size_t n = v_src.size(); std::uniform_int_distribution<T> uniform(0, 2*n + 2); auto v = v_src; std::sort(begin(v), end(v)); T sum = 0; for (std::size_t i = 0; i != q; ++i) { const auto k = uniform(prng); const auto it = std::lower_bound(begin(v), end(v), k); sum += (it != end(v)) ? (*it == k ? k : 0) : 0; } return sum; } 29
  25. Find Example - Benchmark (Nonius) Code V NONIUS_BENCHMARK("std::set", [](nonius::chronometer meter)

    { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<std::set<T>>("std::set", v, q); }); }); NONIUS_BENCHMARK("std::vector: copy & sort", [](nonius::chronometer meter) { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find("std::vector: copy & sort", v, q); }); }); 30
  26. Find Example - Benchmark (Nonius) Code VI NONIUS_BENCHMARK("boost::container::flat_set", [](nonius::chronometer meter) {

    const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<boost::container::flat_set<T>>("boost::container::flat_set", v, q); }); }); NONIUS_BENCHMARK("eastl::vector_set", [](nonius::chronometer meter) { const auto n = meter.param<size>(); const auto q = meter.param<queries>(); const auto v = odd_numbers(n); meter.measure([q, &v] { ctor_and_find<eastl::vector_set<T>>("eastl::vector_set", v, q); }); }); int main(int argc, char * argv[]) { nonius::main(argc, argv); } 31
  27. Find Example - Benchmark (Nonius) Code I Nonius: statistics-powered micro-benchmarking

    framework: https://nonius.io/ https://github.com/libnonius/nonius Running: BNSIZE=10000; BNQUERIES=1000 ./find --param=size:$BNSIZE --param=queries:$BNQUERIES > results.size=$BNSIZE.queries=$BNQUERIES.txt ./find --param=size:$BNSIZE --param=queries:$BNQUERIES --reporter=html --output=results.size=$BNSIZE.queries=$BNQUERIES.html 32
  28. Asymptotic growth & "random access machines"? Tomasz Jurkiewicz and Kurt

    Mehlhorn. 2015. ”On a Model of Virtual Address Translation.” J. Exp. Algorithmics 19. http://arxiv.org/abs/1212.0703 & https://people.mpi-inf.mpg.de/~mehlhorn/ftp/KMvat.pdf 36
  29. Asymptotic growth & "random access machines"? Asymptotic - growing problem

    size • for large data need to take into account the costs of actually bringing it in • communication complexity vs. computation complexity • including overlapping computation-communication latencies 37
  30. "Operation"? Jack Dongarra. 2016. ”With Extreme Scale Computing the Rules

    Have Changed.” In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’16). 38
  31. "Operation"? Jack Dongarra. 2016. ”With Extreme Scale Computing the Rules

    Have Changed.” In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’16). 39
  32. Complexity - constants, microarchitecture? ”Array Layouts for Comparison-Based Searching” Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • ”With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search 40
  33. Complexity - constants, microarchitecture? ”Array Layouts for Comparison-Based Searching” Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • ”With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search • (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees). 40
  34. Complexity - constants, microarchitecture? ”Array Layouts for Comparison-Based Searching” Paul-Virak

    Khuong, Pat Morin http://cglab.ca/~morin/misc/arraylayout-v2/ • ”With this understanding, we are able to choose layouts and design search algorithms that perform searches in 1/2 to 2/3 (depending on the array length) the time of the C++ std::lower_bound() implementation of binary search • (which itself performs searches in 1/3 the time of searching in the std::set implementation of red-black trees). • It was only through careful and controlled experimentation with different implementations of each of the search algorithms that we are able to understand how the interactions between processor features such as pipelining, prefetching, speculative execution, and conditional moves affect the running times of the search algorithms.” 40
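One ingredient behind such results is replacing hard-to-predict branches with conditional moves. A minimal branch-free lower_bound sketch in that spirit (a common formulation, not the authors' exact code):

#include <cstddef>
#include <vector>

// Branch-free binary search: the comparison feeds a conditional select rather than
// a conditional jump, removing a mispredict-prone branch from the hot loop.
std::size_t lower_bound_branchfree(const std::vector<int> & a, int key) {
    if (a.empty()) return 0;
    const int * base = a.data();
    std::size_t n = a.size();
    while (n > 1) {
        const std::size_t half = n / 2;
        base = (base[half] < key) ? base + half : base; // typically compiles to cmov on x86-64
        n -= half;
    }
    return static_cast<std::size_t>(base - a.data()) + (*base < key);
}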
  35. Reasoning about Performance: The Scientific Method Requires - enabled by

    - the knowledge of microarchitectural details. Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi, Chapter 2 ”Methods” from ”Readings in Computer Architecture,” Morgan Kaufmann, 2000. Prefetching benefits evaluation: Disable/enable prefetchers using likwid-features: https://github.com/RRZE-HPC/likwid/wiki/likwid-features Example: https://gist.github.com/MattPD/06e293fb935eaf67ee9c301e70db6975 41
  36. Instruction Level Parallelism & Loop Unrolling - Code I #include

    <cstddef> #include <cstdint> #include <cstdlib> #include <iostream> #include <vector> #include <boost/timer/timer.hpp> 43
  37. Instruction Level Parallelism & Loop Unrolling - Code II using

    T = double; T sum_1(const std::vector<T> & input) { T sum = 0.0; for (std::size_t i = 0, n = input.size(); i != n; ++i) sum += input[i]; return sum; } T sum_2(const std::vector<T> & input) { T sum1 = 0.0, sum2 = 0.0; for (std::size_t i = 0, n = input.size(); i != n; i += 2) { sum1 += input[i]; sum2 += input[i + 1]; } return sum1 + sum2; } 44
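Continuing the same idea (a hedged sketch, not part of the original benchmark): unrolling further with four independent accumulators exposes more instruction-level parallelism across the loop-carried floating-point add chains, assuming input.size() is a multiple of 4:

T sum_4(const std::vector<T> & input) {
    T sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    for (std::size_t i = 0, n = input.size(); i != n; i += 4) {
        sum1 += input[i];
        sum2 += input[i + 1];
        sum3 += input[i + 2];
        sum4 += input[i + 3];
    }
    return (sum1 + sum2) + (sum3 + sum4);
}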
  38. Instruction Level Parallelism & Loop Unrolling - Code III int

    main(int argc, char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 10000000; const std::size_t f = (argc > 2) ? std::atoll(argv[2]) : 1; std::cout << "n = " << n << '\n'; // iterations count std::cout << "f = " << f << '\n'; // unroll factor const std::vector<T> a(n, T(1)); boost::timer::auto_cpu_timer timer; const T sum = (f == 1) ? sum_1(a) : (f == 2) ? sum_2(a) : 0; std::cout << sum << '\n'; } 45
  39. Instruction Level Parallelism & Loop Unrolling - Results make vector_sums

    CXXFLAGS="-std=c++14 -O2 -march=native" LDLIBS=-lboost_timer $ ./vector_sums 1000000000 1 n = 1000000000 f = 1 1e+09 0.841269s wall, 0.840000s user + 0.010000s system = 0.850000s CPU (101.0%) $ ./vector_sums 1000000000 2 n = 1000000000 f = 2 1e+09 0.466293s wall, 0.460000s user + 0.000000s system = 0.460000s CPU (98.7%) 46
  40. perf Results - sum_1 Performance counter stats for './vector_sums 1000000000

    1': 1675.812457 task-clock (msec) # 0.850 CPUs utilized 34 context-switches # 0.020 K/sec 5 cpu-migrations # 0.003 K/sec 8,953 page-faults # 0.005 M/sec 5,760,418,457 cycles # 3.437 GHz 3,456,046,515 stalled-cycles-frontend # 60.00% frontend cycles id 8,225,763,566 instructions # 1.43 insns per cycle # 0.42 stalled cycles per 2,050,710,005 branches # 1223.711 M/sec 104,331 branch-misses # 0.01% of all branches 1.970909249 seconds time elapsed 48
  41. perf Results - sum_2 Performance counter stats for './vector_sums 1000000000

    2': 1283.910371 task-clock (msec) # 0.835 CPUs utilized 38 context-switches # 0.030 K/sec 3 cpu-migrations # 0.002 K/sec 9,466 page-faults # 0.007 M/sec 4,458,594,733 cycles # 3.473 GHz 2,149,690,303 stalled-cycles-frontend # 48.21% frontend cycles id 6,734,925,029 instructions # 1.51 insns per cycle # 0.32 stalled cycles per 1,552,029,608 branches # 1208.830 M/sec 119,358 branch-misses # 0.01% of all branches 1.537971058 seconds time elapsed 49
  42. Compiler Explorer: sum_2 (x86-64 Assembly) http://gcc.godbolt.org/ now with: embedded view,

    Visual C++, Intel C++ compilers http://xania.org/201611/compiler-explorer-now-supports-embedded-view 53
  43. Intel Architecture Code Analyzer (IACA) #include <iacaMarks.h> T sum_2(const std::vector<T>

    & input) { T sum1 = 0.0, sum2 = 0.0; for (std::size_t i = 0, n = input.size(); i != n; i += 2) { IACA_START sum1 += input[i]; sum2 += input[i + 1]; } IACA_END return sum1 + sum2; } $ g++ -std=c++14 -O2 -march=native vector_sums_2i.cpp -o vector_sums_2i $ iaca -64 -arch IVB -graph ./vector_sums_2i • https://software.intel.com/en-us/articles/intel-architecture-code-analyzer • https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i- use-it • http://kylehegeman.com/blog/2013/12/28/introduction-to-iaca/ 54
  44. Microarchitecture: Sandy Bridge Intel® 64 and IA-32 Architectures Optimization Reference

    Manual https://www-ssl.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html 55
  45. IACA Results - sum_1 $ iaca -64 -arch IVB -graph

    ./vector_sums_1i Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./vector_sums_1i Binary Format - 64Bit Architecture - IVB Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 3.00 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: ------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | ------------------------------------------------------------------------- | Cycles | 1.0 0.0 | 1.0 | 1.0 1.0 | 1.0 1.0 | 0.0 | 1.0 | ------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | | 1.0 1.0 | | | | | mov rdx, qword ptr [rdi] | 2 | | 1.0 | | 1.0 1.0 | | | CP | vaddsd xmm0, xmm0, qword ptr [rdx+rax*8] | 1 | 1.0 | | | | | | | add rax, 0x1 | 1 | | | | | | 1.0 | | cmp rax, rcx | 0F | | | | | | | | jnz 0xffffffffffffffe7 Total Num Of Uops: 5 56
  46. IACA Results - sum_2 $ iaca -64 -arch IVB -graph

    ./vector_sums_2i Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./vector_sums_2i Binary Format - 64Bit Architecture - IVB Analysis Type - Throughput Throughput Analysis Report -------------------------- Block Throughput: 6.00 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: ------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | ------------------------------------------------------------------------- | Cycles | 1.5 0.0 | 3.0 | 1.5 1.5 | 1.5 1.5 | 0.0 | 1.5 | ------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | | --------------------------------------------------------------------- | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | mov rcx, qword ptr [rdi] | 2 | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | CP | vaddsd xmm0, xmm0, qword ptr [rcx+rax*8] | 1 | 1.0 | | | | | | | add rax, 0x2 | 2 | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | vaddsd xmm1, xmm1, qword ptr [rcx+rdx*1] | 1 | 0.5 | | | | | 0.5 | | add rdx, 0x10 | 1 | | | | | | 1.0 | | cmp rax, rsi | 0F | | | | | | | | jnz 0xffffffffffffffde | 1 | | 1.0 | | | | | CP | vaddsd xmm0, xmm0, xmm1 Total Num Of Uops: 9 57
  47. Parallelism: Work (no. of ops) / Span (critical path length)

    Guy E. Blelloch, ”Programming parallel algorithms”, Communications of the ACM, 1996. 60
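For reference, the standard work/span bounds (stated here rather than quoted from the slide): with total work T_1 and span (critical path length) T_\infty on p processors,

    \text{speedup} = \frac{T_1}{T_p} \le \min\!\left(p, \frac{T_1}{T_\infty}\right), \qquad T_p \le \frac{T_1}{p} + T_\infty \quad \text{(greedy scheduling, Brent's bound)}.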
  48. ILP & Data (In)dependence G. S. Tjaden and M. J.

    Flynn, ‘‘Detection and Parallel Execution of Independent Instructions,’’ IEEE Transactions on Computers, vol. C-19, pp. 889-895, October 1970. 61
  49. ILP vs. Dependencies D. W. Wall, “Limits of instruction-level parallelism,”

    Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 62
  50. ILP, Criticality & Latency Hiding D. W. Wall, “Limits of

    instruction-level parallelism,” Digital Western. Research Laboratory, Tech. Rep. 93/6, Nov. 1993. 63
  51. Empty Issue Slots: Horizontal Waste & Vertical Waste D. M.

    Tullsen, S. J. Eggers and H. M. Levy, ”Simultaneous multithreading: Maximizing on-chip parallelism,” Proceedings, 22nd Annual International Symposium on Computer Architecture, 1995. 64
  52. Wasted Slots: Causes D. M. Tullsen, S. J. Eggers and

    H. M. Levy, ”Simultaneous multithreading: Maximizing on-chip parallelism,” Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, Santa Margherita Ligure, Italy, 1995, pp. 392-403. 65
  53. Wasted Slots: Miss Events Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis,

    and James E. Smith. 2006. ”A performance counter architecture for computing accurate CPI components.” SIGOPS Oper. Syst. Rev. 40, 5 (October 2006), 175-184. 66
  54. likwid Results - sum_1: 489 Scalar MUOPS/s $ likwid-perfctr -C

    S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 1 1e+09 1.090122s wall, 0.880000s user + 0.000000s system = 0.880000s CPU (80.7%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 8002493499 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 4285189526 | | CPU_CLK_UNHALTED_REF | FIXC2 | 3258346806 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 1000155741 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 0 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 2.0456 | | Runtime unhalted [s] | 1.6536 | | Clock [MHz] | 3408.2011 | | CPI | 0.5355 | | MFLOP/s | 488.9303 | | AVX MFLOP/s | 0 | | Packed MUOPS/s | 0 | | Scalar MUOPS/s | 488.9303 | +----------------------+-----------+ 68
  55. likwid Results - sum_2: 595 Scalar MUOPS/s $ likwid-perfctr -C

    S0:0 -g FLOPS_DP -f ./vector_sums 1000000000 2 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 2 1e+09 0.620421s wall, 0.470000s user + 0.000000s system = 0.470000s CPU (75.8%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 6502566958 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2948446599 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2223894218 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 1000328727 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 0 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 1.6809 | | Runtime unhalted [s] | 1.1377 | | Clock [MHz] | 3435.8987 | | CPI | 0.4534 | | MFLOP/s | 595.1079 | | AVX MFLOP/s | 0 | | Packed MUOPS/s | 0 | | Scalar MUOPS/s | 595.1079 | +----------------------+-----------+ 69
  56. likwid Results: sum_vectorized: 676 AVX MFLOP/s g++ -std=c++14 -O2 -ftree-vectorize

    -ffast-math -march=native -lboost_timer vector_sums.cpp -o vector_sums_vf $ likwid-perfctr -C S0:0 -g FLOPS_DP -f ./vector_sums_vf 1000000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 1000000000 f = 1 1e+09 0.561288s wall, 0.390000s user + 0.000000s system = 0.390000s CPU (69.5%) -------------------------------------------------------------------------------- Group 1: FLOPS_DP +--------------------------------------+---------+------------+ | Event | Counter | Core 0 | +--------------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 3002491149 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2709364345 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2043804906 | | FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE | PMC0 | 0 | | FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE | PMC1 | 91 | | SIMD_FP_256_PACKED_DOUBLE | PMC2 | 260258099 | +--------------------------------------+---------+------------+ +----------------------+-----------+ | Metric | Core 0 | +----------------------+-----------+ | Runtime (RDTSC) [s] | 1.5390 | | Runtime unhalted [s] | 1.0454 | | Clock [MHz] | 3435.5297 | | CPI | 0.9024 | | MFLOP/s | 676.4420 | | AVX MFLOP/s | 676.4420 | | Packed MUOPS/s | 169.1105 | | Scalar MUOPS/s | 0.0001 | +----------------------+-----------+ 70
  57. Intel PCM (Performance Counter Monitor) Intel PCM (Performance Counter Monitor)

    https://software.intel.com/en-us/articles/intel-performance-counter-monitor/ Cross-platform: FreeBSD, Linux, Mac OS X, Windows C++ Performance Counters API https://software.intel.com/en-us/articles/intel-performance-counter-monitor/#calling_pcm PCM-core utility - similar to perf: pcm-core -e cpu/umask=0x01,event=0x05,name=MISALIGN_MEM_REF.LOADS 71
  58. Performance: CPI Steven K. Przybylski, ”Cache and Memory Hierarchy Design

    – A Performance-Directed Approach,” San Francisco, Morgan Kaufmann, 1990. 72
  59. Performance: [YMMV]PI - Power Grochowski, E., Ronen, R., Shen, J.,

    & Wang, H. (2004). ”Best of Both Latency and Throughput.” Proceedings of the IEEE International Conference on Computer Design. 73
  60. Performance: [YMMV]PI - Graphs Scott Beamer, Krste Asanović, and David

    A. Patterson. ”GAIL: The Graph Algorithm Iron Law.” Workshop on Irregular Applications: Architectures and Algorithms (IA 3), at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015. 74
  61. Performance: [YMMV]PI - Packets packet_processing_times = seconds/packet = instructions/packet *

    clock_cycles/instruction * seconds/clock_cycle = clock_cycles/packet * seconds/clock_cycle = CPP / core_frequency cycles per packet (CPP) http://blogs.cisco.com/sp/a-bigger-helping-of-internet-please 75
  62. Performance: separable components of a CPI CPI = (Infinite-cache CPI)

    + finite-cache effect (FCE) Infinite-cache CPI = execute busy (EBusy) + execute idle (EIdle) FCE = (cycles per miss) × (misses per instruction) = (miss penalty) × (miss rate) P. G. Emma. ”Understanding some simple processor-performance limits.” IBM Journal of Research and Development, 41(3):215–232, May 1997. 76
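As a worked example with illustrative (made-up) numbers: EBusy = 0.4, EIdle = 0.1, a miss rate of 0.02 misses/instruction, and a 200-cycle miss penalty give

    \text{CPI} = \underbrace{(0.4 + 0.1)}_{\text{infinite-cache CPI}} + \underbrace{0.02 \times 200}_{\text{FCE}} = 0.5 + 4 = 4.5.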
  63. Pervasive CPU Parallelism pipeline-level parallelism (PLP) instruction-level parallelism (ILP) memory-level

    parallelism (MLP) data-level parallelism (DLP) thread-level parallelism (TLP) 77
  64. The Cache Liptay, J. S. (1968) ”Structural Aspects of the

    System/360 Model 85, Part II: The Cache,” IBM System Journal, 7(1). 78
  65. The Cache: Processor-Memory Performance Gap Liptay, J. S. (1968) ”Structural

    Aspects of the System/360 Model 85, Part II: The Cache,” IBM System Journal, 7(1). 79
  66. The Cache: Assumptions & Effectiveness Liptay, J. S. (1968) ”Structural

    Aspects of the System/360 Model 85, Part II: The Cache,” IBM System Journal, 7(1). 80
  67. The Curse of Multiple Granularities Seshadri, V. (2016). ”Simple DRAM

    and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems.” CoRR, abs/1605.06483. 81
  68. Word Granularity != Cache Line Granularity Seshadri, V. (2016). ”Simple

    DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems.” CoRR, abs/1605.06483. 82
  69. Shortcomings of Strided Access Patterns Seshadri, V. (2016). ”Simple DRAM

    and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems.” CoRR, abs/1605.06483. 83
  70. Pointer Chasing Example - Linked List - C++ #include <algorithm>

    #include <forward_list> #include <iterator> bool found(const std::forward_list<int> & list, int value) { return find(begin(list), end(list), value) != end(list); } int main() { std::forward_list<int> list {11, 22, 33, 44, 55}; return found(list, 42); } 87
  71. Pointer Chasing Example - Linked List - CFG (r2) radiff2

    -g sym.found forward_list_app forward_list_app > forward_list_found.dot xdot forward_list_found.dot dot -Tpng -o forward_list_found.png forward_list_found.dot 91
  72. Isolated & Clustered Cache Misses Miquel Moreto, Francisco J. Cazorla,

    Alex Ramirez, and Mateo Valero. 2008. ”MLP-aware dynamic cache partitioning.” In Proceedings of the 3rd international conference on High performance embedded architectures and compilers (HiPEAC’08). 92
  73. Cache Miss Cost & Miss Clustering Thomas R. Puzak, A.

    Hartstein, P. G. Emma, V. Srinivasan, and Jim Mitchell. 2007. ”An analysis of the effects of miss clustering on the cost of a cache miss.” In Proceedings of the 4th international conference on Computing frontiers (CF ’07). ACM, New York, NY, USA, 3-12. 93
  74. Cache Miss Penalty: Different STC due to different MLP MLP

    (memory-level parallelism) & STC (stall-time criticality) R. Das, O. Mutlu, T. Moscibroda and C. R. Das, ”Application-aware prioritization mechanisms for on-chip networks,” 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), New York, NY, 2009, pp. 280-291. 94
  75. Skip Lists William Pugh. 1990. ”Skip lists: a probabilistic alternative

    to balanced trees.” Commun. ACM 33, 6, 668-676. 95
  76. Jump Pointers S. Chen, P. B. Gibbons, and T. C.

    Mowry. “Improving Index Performance through Prefetching.” In Proc. of the 20th Annual ACM SIGMOD International Conference on Management of Data, 2001. 96
  77. Prefetching Aggressiveness: Distance & Degree Sparsh Mittal. 2016. ”A Survey

    of Recent Prefetching Techniques for Processor Caches.” ACM Comput. Surv. 49, 2, Article 35. 97
  78. Prefetching Timeliness Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012.

    ”When Prefetching Works, When It Doesn’t, and Why.” ACM Trans. Archit. Code Optim. 9, 1, Article 2. 98
  79. Prefetches Classification Huaiyu Zhu, Yong Chen, and Xian-He Sun. 2010.

    ”Timing local streams: improving timeliness in data prefetching.” In Proceedings of the 24th ACM International Conference on Supercomputing (ICS ’10). ACM, New York, NY, USA, 169-178. 99
  80. Prefetching I #include <algorithm> #include <chrono> #include <cinttypes> #include <cstddef>

    #include <cmath> #include <cstdio> #include <cstdlib> #include <future> #include <iterator> #include <memory> #include <random> #include <vector> struct point { double x, y, z; }; using T = point; 100
  81. Prefetching II struct timing_result { double duration_initial; double duration_non_prefetched; double

    duration_degree; double sum_initial; double sum_non_prefetched; double sum_degree; }; timing_result chase(std::size_t n, bool shuffle, std::size_t d, bool prefetch) { timing_result chase_result; std::vector<std::unique_ptr<T>> v; for (std::size_t i = 0; i != n; ++i) { v.emplace_back(new point{1. * i, 2. * i, 5.* i}); } if (shuffle) { std::mt19937 g(1); 101
  82. Prefetching III std::shuffle(begin(v), end(v), g); } double sum = 0.0;

    auto time_start = std::chrono::steady_clock::now(); if (prefetch) { for (std::size_t i = 0; i != n; ++i) { __builtin_prefetch(v[std::min(i + d, n - 1)].get()); sum += std::exp(-v[i]->y); } } else { for (std::size_t i = 0; i != n; ++i) { sum += std::exp(-v[i]->y); } } auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; chase_result.duration_initial = duration.count(); chase_result.sum_initial = sum; 102
  83. Prefetching IV sum = 0.0; time_start = std::chrono::steady_clock::now(); for (std::size_t

    i = 0; i != n; ++i) { sum += std::exp(-v[i]->y); } time_end = std::chrono::steady_clock::now(); duration = time_end - time_start; chase_result.duration_non_prefetched = duration.count(); chase_result.sum_non_prefetched = sum; sum = 0.0; time_start = std::chrono::steady_clock::now(); for (std::size_t i = 0; i != n; ++i) { __builtin_prefetch(v[std::min(i + d, n - 1)].get()); __builtin_prefetch(v[std::min(i + 2*d, n - 1)].get()); sum += std::exp(-v[i]->y); } time_end = std::chrono::steady_clock::now(); 103
  84. Prefetching V duration = time_end - time_start; chase_result.duration_degree = duration.count();

    chase_result.sum_degree = sum; return chase_result; } int main(int argc, char * argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 100; const bool shuffle = (argc > 2) ? std::atoi(argv[2]) : false; const std::size_t d = (argc > 3) ? std::atoll(argv[3]) : 3; const bool prefetch = (argc > 4) ? std::atoi(argv[4]) : false; const std::size_t threads_count = (argc > 5) ? std::atoll(argv[5]) : 4; printf("size: %zu \n", n); printf("shuffle: %d \n", shuffle); printf("distance: %zu \n", d); printf("prefetch: %d \n", prefetch); 104
  85. Prefetching VI printf("threads_count: %zu \n", threads_count); const auto thread_work =

    [n, shuffle, d, prefetch]() { return chase(n, shuffle, d, prefetch); }; std::vector<std::future<timing_result>> results; for (std::size_t thread = 0; thread != threads_count; ++thread) results.emplace_back(std::async(std::launch::async, thread_work)); for (auto && future_result : results) if (future_result.valid()) future_result.wait(); std::vector<double> timings_initial, timings_non_prefetched, timings_degree; for (auto && future_result : results) { timing_result chase_result = future_result.get(); timings_initial.push_back(chase_result.duration_initial); 105
  86. Prefetching VII timings_non_prefetched.push_back(chase_result.duration_non_prefetched); timings_degree.push_back(chase_result.duration_degree); } const auto timings_initial_minmax = std::minmax_element(begin(timings_initial),

    end(timings_initial)); const auto timings_non_prefetched_minmax = std::minmax_element(begin(timings_non_prefetched), end(timings_non_prefetched)); const auto timings_degree_minmax = std::minmax_element(begin(timings_degree), end(timings_degree)); printf(prefetch ? "prefetched" : "non-prefetched"); printf(" initial duration: [%g, %g] \n", *timings_initial_minmax.first, *timings_initial_minmax.second); printf("non-prefetched duration: [%g, %g] \n", *timings_non_prefetched_minmax.first, *timings_non_prefetched_minmax.second); printf("degree-two prefetching duration: [%g, %g] \n", *timings_degree_minmax.first, *timings_degree_minmax.second); } 106
  87. Prefetch Overhead S. Van der Wiel and D. Lilja, ”A

    Survey of Data Prefetching Techniques,” Technical Report No. HPPC 96-05, University of Minnesota, October 1996. 107
  88. Prefetching Timings: No Prefetch $ likwid-perfctr -f -C 0-3 -g

    L3 -m ./prefetch 100000 1 0 0 4 distance: 0 prefetch: 0 non-prefetched initial duration: [0.00280393, 0.00289815] non-prefetched duration: [0.00254968, 0.00257311] degree-two prefetching duration: [0.00290615, 0.00296243] Region chase_initial, Group 1: L3 | CPI STAT | 5.8641 | 1.4529 | 1.4744 | 1.4660 | | L3 bandwidth [MBytes/s] STAT | 10733.6308 | 2666.0364 | 2710.9325 | 2683.4077 | Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | L3 miss rate STAT | 0.0584 | 0.0145 | 0.0148 | 0.0146 | | L3 miss ratio STAT | 3.7723 | 0.9117 | 0.9789 | 0.9431 | $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 0 0 4 | Cycles without execution [%] STAT | 228.2316 | 56.8136 | 57.4443 | 57.0579 | | Cycles without execution [%] STAT | 227.0385 | 56.5980 | 57.0024 | 56.7596 | 108
  89. Prefetching Timings: useless 0-distance prefetch (overhead) $ likwid-perfctr -f -C

    0-3 -g L3 -m ./prefetch 100000 1 0 1 4 distance: 0 prefetch: 1 prefetched initial duration: [0.00288751, 0.00295978] non-prefetched duration: [0.0025575, 0.00258342] degree-two prefetching duration: [0.00285772, 0.00287839] Region chase_initial, Group 1: L3 | CPI STAT | 5.7454 | 1.4345 | 1.4387 | 1.4364 | | L3 bandwidth [MBytes/s] STAT | 10518.6383 | 2618.5405 | 2645.6096 | 2629.6596 | 109
  90. Prefetching Timings: 1-distance prefetch (mostly overhead) $ likwid-perfctr -f -C

    0-3 -g L3CACHE -m ./prefetch 100000 1 1 1 4 prefetched initial duration: [0.00250957, 0.00257662] non-prefetched duration: [0.00255286, 0.00258417] degree-two prefetching duration: [0.00230482, 0.00235828] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | CPI STAT | 4.9595 | 1.2343 | 1.2433 | 1.2399 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.0889 | 0.4381 | 0.6454 | 0.5222 | $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 1 1 4 | Cycles without execution [%] STAT | 214.1614 | 53.4628 | 53.6716 | 53.5404 | | Cycles without execution [%] STAT | 200.4785 | 50.0405 | 50.1857 | 50.1196 | Formulas: L3 request rate = MEM_LOAD_UOPS_RETIRED_L3_ALL/UOPS_RETIRED_ALL L3 miss rate = MEM_LOAD_UOPS_RETIRED_L3_MISS/UOPS_RETIRED_ALL L3 miss ratio = MEM_LOAD_UOPS_RETIRED_L3_MISS/MEM_LOAD_UOPS_RETIRED_L3_ALL https://github.com/RRZE-HPC/likwid/blob/master/groups/ivybridge/L3CACHE.txt 110
  91. Prefetching Timings: 2-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3CACHE -m ./prefetch 100000 1 2 1 4 size: 100000 shuffle: 1 distance: 2 prefetch: 1 threads_count: 4 prefetched initial duration: [0.0023392, 0.00241287] non-prefetched duration: [0.00257006, 0.00260938] degree-two prefetching duration: [0.00199431, 0.00203528] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | CPI STAT | 4.5557 | 1.1331 | 1.1423 | 1.1389 | | L3 request rate STAT | 0.0006 | 0.0001 | 0.0002 | 0.0002 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.2317 | 0.3138 | 0.6791 | 0.5579 | Region chase_degree, Group 1: L3CACHE | CPI STAT | 3.6990 | 0.9243 | 0.9253 | 0.9248 | | L3 request rate STAT | 0.0005 | 0.0001 | 0.0002 | 0.0001 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.0145 | 0.3597 | 0.6550 | 0.5036 | 111
  92. Prefetching Timings: 8-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3CACHE -m ./prefetch 100000 1 8 1 4 size: 100000 shuffle: 1 distance: 8 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00181161, 0.00188783] non-prefetched duration: [0.00257601, 0.0026076] degree-two prefetching duration: [0.00152468, 0.00156814] Region chase_initial, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | Runtime (RDTSC) [s] STAT | 0.0065 | 0.0016 | 0.0017 | 0.0016 | | CPI STAT | 3.4808 | 0.8650 | 0.8788 | 0.8702 | | L3 miss rate STAT | 0.0004 | 0.0001 | 0.0001 | 0.0001 | | L3 miss ratio STAT | 2.2431 | 0.4694 | 0.6640 | 0.5608 | Region chase_degree, Group 1: L3CACHE | Metric | Sum | Min | Max | Avg | | Runtime (RDTSC) [s] STAT | 0.0053 | 0.0013 | 0.0014 | 0.0013 | | CPI STAT | 2.7450 | 0.6832 | 0.6882 | 0.6863 | | L3 miss rate STAT | 0.0016 | 0.0004 | 0.0004 | 0.0004 | | L3 miss ratio STAT | 3.4045 | 0.7778 | 0.9346 | 0.8511 | 112
  93. Prefetching Timings: 8-distance prefetch $ likwid-perfctr -f -C 0-3 -g

    L3 -m ./prefetch 100000 1 8 1 4 size: 100000 shuffle: 1 distance: 8 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00180738, 0.00189831] non-prefetched duration: [0.00254486, 0.00258013] degree-two prefetching duration: [0.00154542, 0.00158065] Region chase_initial, Group 1: L3 | CPI STAT | 3.5027 | 0.8668 | 0.8835 | 0.8757 | L3 bandwidth [MBytes/s] STAT | 17384.8731 | 4296.5905 | 4381.7164 | 4346.2183 Region chase_degree, Group 1: L3 | Metric | Sum | Min | Max | | CPI STAT | 2.7626 | 0.6894 | 0.6919 | 0 | L3 bandwidth [MBytes/s] STAT | 21505.6670 | 5333.6653 | 5396.4473 | 53 $ likwid-perfctr -f -C 0-3 -g CYCLE_ACTIVITY -m ./prefetch 100000 1 8 1 4 | Cycles without execution [%] STAT | 187.6689 | 46.3938 | 47.3055 | 46.91 | Cycles without execution [%] STAT | 151.5095 | 37.6872 | 38.0656 | 37.87 113
  94. Prefetching Timings: suboptimal (untimely) prefetch $ likwid-perfctr -f -C 0-3

    -g L3 -m ./prefetch 100000 1 512 1 4 size: 100000 shuffle: 1 distance: 512 prefetch: 1 threads_count: 4 prefetched initial duration: [0.00177956, 0.00186644] non-prefetched duration: [0.00257188, 0.0026064] degree-two prefetching duration: [0.00173249, 0.00178712] Region chase_initial, Group 1: L3 | CPI STAT | 3.4343 | 0.8523 | 0.8683 | 0.8586 | | L3 data volume [GBytes] STAT | 0.0293 | 0.0073 | 0.0074 | 0.0073 | Region chase_degree, Group 1: L3 | Metric | Sum | Min | Max | Avg | | CPI STAT | 3.1891 | 0.7903 | 0.8034 | 0.7973 | | L3 bandwidth [MBytes/s] STAT | 19902.4764 | 4954.4107 | 5013.4006 | 4975.6191 | 114
  95. Gem5 - std::vector & std::list I Filling with numbers -

    std::vector vs. std::list Machine code & assembly (std::vector) Micro-ops execution breakdown (std::vector) Assembly is Too High Level: http://xlogicx.net/?p=369 116
  96. Gem5 - std::vector & std::list III Pipeline diagram - one

    iteration (std::vector) Pipeline diagram - three iterations (std::vector) 118
  97. Gem5 - std::vector & std::list IV Machine code & assembly

    (std::list) heap allocation in the loop @ 400d85 what could possibly go wrong? 119
  98. (The GNU C library's) malloc https://sourceware.org/glibc/wiki/MallocInternals Arena A structure that

    is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are ”free”. Threads assigned to each arena will allocate memory from that arena’s free lists. Glibc Heap Analysis in Linux Systems with Radare2 https://youtube.com/watch?v=Svm5V4leEho r2con-2016 - rada.re/con/ 124
  99. malloc & free - new, new[], delete, delete[] int main()

    { double * a = new double[8]; double * b = new double[8]; delete[] b; delete[] a; double * c = new double[8]; delete[] c; } 125
  100. Memory Access Patterns: Temporal & Spatial Locality horizontal axis -

    time vertical axis - address D. J. Hatfield and J. Gerald. ”Program restructuring for virtual memory.” IBM Systems Journal, 10(3):168–192, 1971. 132
  101. Loop Fusion 0.429504s (unfused) down to 0.287501s (fused) g++ -Ofast

    -march=native (5.2.0) void unfused(double * a, double * b, double * c, double * d, size_t N) { for (size_t i = 0; i != N; ++i) a[i] = b[i] * c[i]; for (size_t i = 0; i != N; ++i) d[i] = a[i] * c[i]; } void fused(double * a, double * b, double * c, double * d, size_t N) { for (size_t i = 0; i != N; ++i) { a[i] = b[i] * c[i]; d[i] = a[i] * c[i]; } } 133
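A minimal driver along these lines (a hedged sketch: the array size, fill values, and timing harness are assumptions, not the setup behind the quoted numbers; the unfused/fused functions are the ones defined on the slide above):

#include <cstddef>
#include <vector>
#include <boost/timer/timer.hpp>

// Defined on the slide above; declared here so the driver is self-contained to read.
void unfused(double * a, double * b, double * c, double * d, std::size_t N);
void fused(double * a, double * b, double * c, double * d, std::size_t N);

int main() {
    const std::size_t N = 10'000'000; // assumed size; the slide does not state the one used
    std::vector<double> a(N), b(N, 2.0), c(N, 3.0), d(N);
    {
        boost::timer::auto_cpu_timer t; // times the unfused version
        unfused(a.data(), b.data(), c.data(), d.data(), N);
    }
    {
        boost::timer::auto_cpu_timer t; // times the fused version
        fused(a.data(), b.data(), c.data(), d.data(), N);
    }
}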
  102. Pin - A Dynamic Binary Instrumentation Tool http://www.intel.com/software/pintool pin -t

    $PIN_ROOT/source/tools/ManualExamples/obj-intel64/pinatrace.so -- ./loop_fusion . . . 0x400e43,R,0x401c48 0x400e59,R,0x401d40 0x400e65,W,0x1c789c0 0x400e65,W,0x1c789e0 . . . r-project.org rstudio.com ggplot2.org rcpp.org 134
  103. Takeaway: Overlapping Latencies as a General Principle Overlapping latencies also

    works on a ”macro” scale • load as ”get the data from the Internet” • compute as ”process the data” Another example: Communication Avoiding and Overlapping for Numerical Linear Algebra • https://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-65.html • http://www.cs.berkeley.edu/~egeor/sc12_slides_final.pdf 141
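A hedged sketch of the same principle at the macro scale, overlapping the next "load" with the current "compute" via std::async (the fetch_batch/process functions are illustrative stand-ins, not code from the talk):

#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for "get the data from the Internet": here it just materializes a batch.
std::vector<double> fetch_batch(int id) {
    return std::vector<double>(1000000, 1.0 * id);
}

// Stand-in for "process the data".
double process(const std::vector<double> & batch) {
    return std::accumulate(batch.begin(), batch.end(), 0.0);
}

int main() {
    const int batches = 8;
    double total = 0.0;
    auto next = std::async(std::launch::async, fetch_batch, 0);
    for (int i = 0; i != batches; ++i) {
        auto batch = next.get(); // wait for the batch fetched in the background
        if (i + 1 != batches)
            next = std::async(std::launch::async, fetch_batch, i + 1); // start the next fetch...
        total += process(batch); // ...and overlap it with the current computation
    }
    std::printf("%g\n", total);
}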
  104. Non-Overlapped Timings id,symbol,count,time 1,AAPL,565449,1.59043 2,AXP,731366,3.43745 3,BA,867366,5.40218 4,CAT,830327,7.08103 5,CSCO,400440,8.49192 6,CVX,687198,9.98761 7,DD,910932,12.2254

    8,DIS,910430,14.058 9,GE,871676,15.8333 10,GS,280604,17.059 11,HD,556611,18.2738 12,IBM,860071,20.3876 13,INTC,559127,21.9856 14,JNJ,724724,25.5534 15,JPM,500473,26.576 16,KO,864903,28.5405 17,MCD,717021,30.087 18,MMM,698996,31.749 19,MRK,733948,33.2642 20,MSFT,475451,34.3134 21,NKE,556344,36.4545 142
  105. Overlapped Timings id,symbol,count,time 1,AAPL,565449,2.00713 2,AXP,731366,2.09158 3,BA,867366,2.13468 4,CAT,830327,2.19194 5,CSCO,400440,2.19197 6,CVX,687198,2.19198 7,DD,910932,2.51895

    8,DIS,910430,2.51898 9,GE,871676,2.51899 10,GS,280604,2.519 11,HD,556611,2.51901 12,IBM,860071,2.51902 13,INTC,559127,2.51902 14,JNJ,724724,2.51903 15,JPM,500473,2.51904 16,KO,864903,2.51905 17,MCD,717021,2.51906 18,MMM,698996,2.51907 19,MRK,733948,2.51908 20,MSFT,475451,2.51908 21,NKE,556344,2.51909 143
  106. Cache Misses, MLP, and STC: Slack R. Das et al.,

    ”Aérgia: Exploiting Packet Latency Slack in On-Chip Networks,” Proc. 37th Ann. Int’l Symp. Computer Architecture (ISCA 10), ACM Press, 2010. 147
  107. Dependent Cache Misses - Non-Overlapped - Serialized A Day in

    the Life of a Cache Miss Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and James E. Smith, ”A Performance Counter Architecture for Computing Accurate CPI Components”, ASPLOS 2006, pp. 175-184. 1. load instruction enters the window (ROB) 2. the load issues from the instruction buffer (RS) 3. the load blocks the ROB head 4. ROB eventually fills 5. dispatch stops, instruction window drains 6. eventually issue and commit stop
  108. Independent Cache Misses in ROB - Overlapped Stijn Eyerman, Lieven

    Eeckhout, Tejas Karkhanis, and James E. Smith, ”A Top-Down Approach to Architecting CPI Component Performance Counters”, IEEE Micro, Special Issue on Top Picks from 2006 Microarchitecture Conferences, Vol 27, No 1, pp. 84-93. 149
  109. Miss-Dependent Mispredicted Branch - Penalties Serialization S. Eyerman, J.E. Smith

    and L. Eeckhout, ”Characterizing the branch misprediction penalty”, Performance Analysis of Systems and Software 2006 IEEE International Symposium on 2006, pp. 48-58. 150
  110. Dependent Cache Misses - Non-Overlapped - Serialized Milad Hashemi, Khubaib,

    Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. ”Accelerating Dependent Cache Misses with an Enhanced Memory Controller.” In ISCA, 2016. 151
  111. Independent Misses Connected by a Pending Cache Hit • MLP

    - supported by non-blocking caches, out-of-order execution • multiple outstanding cache-misses - Miss Status Holding Registers (MSHRs) / Line Fill Buffers (LFBs) • MSHR file entries - merging redundant (same cache line) memory requests Xi E. Chen and Tor M. Aamodt. 2008. ”Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.” In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41). 152
  112. Independent Misses Connected by a Pending Cache Hit Xi E.

    Chen and Tor M. Aamodt. 2011. ”Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.” ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 153
  113. Finite MSHRs => Finite MLP Xi E. Chen and Tor

    M. Aamodt. 2011. ”Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs.” ACM Transactions on Architecture and Code Optimization (TACO) 8, 3, Article 10. 154
  114. Multicore Processors & Shared Memory Memory utilization even more important

    - contention for capacity & bandwidth! ”Disaggregated Memory Architectures for Blade Servers,” Kevin Te-Ming Lim, Ph.D. Thesis, The University of Michigan, 2010. 155
  115. Cache Miss Penalty: Leading Edge & Trailing Edge ”The End

    of Scaling? Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node,” Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 156
  116. Cache Miss Penalty: Bandwidth Utilization Impact ”The End of Scaling?

    Revolutions in Technology and Microarchitecture as we pass the 90 Nanometer Node,” Philip Emma, IBM T. J. Watson Research Center, 33rd International Symposium on Computer Architecture (ISCA 2006) Keynote Address http://www.hpcaconf.org/hpca12/Phil_HPCA_06.pdf 157
  117. Multicore: Sequential / Parallel Execution Model L. Yavits, A. Morad,

    R. Ginosar, The effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 158
  118. Multicore: Amdahl's Law, Strong Scaling ”Reevaluating Amdahl’s Law,” John L.

    Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 159
  119. Multicore: Gustafson's Law, Weak Scaling ”Reevaluating Amdahl’s Law,” John L.

    Gustafson, Communications of the ACM 31(5), 1988. pp. 532-533. 160
  120. Amdahl's Law Optimistic Assumes perfect parallelism of the parallel portion:

    Only Serial Bottlenecks, No Parallel Bottlenecks Counterpoint: https://blogs.msdn.microsoft.com/ddperf/2009/04/29/parallel-scalability-isnt-childs-play-part-2-amdahls-law-vs-gunthers-law/ 161
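For reference, with parallel fraction p and N processors (standard formulations of the two laws; the optimism noted above lies in assuming the parallel portion incurs no synchronization or communication costs):

    S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}, \qquad S_{\text{Gustafson}}(N) = (1 - p) + pN.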
  121. Multicore: Synchronization, Actual Scaling M. A. Suleman, M. K. Qureshi,

    and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 162
  122. Multicore: Communication, Actual Scaling M. A. Suleman, M. K. Qureshi,

    and Y. N. Patt, “Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs,” in Proc. 13th Archit. Support Program. Lang. Oper. Syst., 2008, pp. 277–286. 163
  123. Multicore & DRAM: AoS I #include <cstddef> #include <cstdlib> #include

    <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> struct contract { double K; double T; double P; }; using element = contract; using container = std::vector<element>; 164
  124. Multicore & DRAM: AoS II double sum_if(const container & a,

    const container & b, const std::vector<std::size_t> & index) { double sum = 0.0; for (std::size_t i = 0, n = index.size(); i != n; ++i) { std::size_t j = index[i]; if (a[j].K == b[j].K) sum += a[j].K; } return sum; } template <typename F> double average(F f, std::size_t m) { double average = 0.0; for (std::size_t i = 0; i != m; ++i) average += f() / m; return average; } 165
  125. Multicore & DRAM: AoS III std::vector<std::size_t> index_stream(std::size_t n) { std::vector<std::size_t>

    index; index.reserve(n); for (std::size_t i = 0; i != n; ++i) index.push_back(i); return index; } std::vector<std::size_t> index_random(std::size_t n) { std::vector<std::size_t> index; index.reserve(n); std::random_device rd; static std::mt19937 g(rd()); std::uniform_int_distribution<std::size_t> u(0, n - 1); for (std::size_t i = 0; i != n; ++i) index.push_back(u(g)); return index; } 166
  126. Multicore & DRAM: AoS IV int main(int argc, char *

    argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10; std::cout << "n = " << n << '\n'; std::cout << "m = " << m << '\n'; const std::size_t threads_count = 4; // thread access locality type // 0: none (default); 1: stream; 2: random std::vector<std::size_t> thread_type(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { thread_type[thread] = (argc > 3 + thread) ? std::atoll(argv[3 + thread]) : 0; std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n'; } 167
  127. Multicore & DRAM: AoS V endl(std::cout); std::vector<std::vector<std::size_t>> index(threads_count); for (std::size_t

    thread = 0; thread != threads_count; ++thread) { index[thread].resize(n); if (thread_type[thread] == 1) index[thread] = index_stream(n); else if (thread_type[thread] == 2) index[thread] = index_random(n); } const container v1(n, {1.0, 0.5, 3.0}); const container v2(n, {1.0, 2.0, 1.0}); const auto thread_work = [m, &v1, &v2](const auto & thread_index) { const auto f = [&v1, &v2, &thread_index] { return sum_if(v1, v2, thread_index); }; return average(f, m); }; 168
  128. Multicore & DRAM: AoS VI boost::timer::auto_cpu_timer timer; std::vector<std::future<double>> results; results.reserve(threads_count);

    for (std::size_t thread = 0; thread != threads_count; ++thread) { results.emplace_back(std::async(std::launch::async, [thread, &thread_work, &index] { return thread_work(index[thread]); })); } for (auto && result : results) if (result.valid()) result.wait(); for (auto && result : results) std::cout << result.get() << '\n'; } 169
  129. Multicore & DRAM: AoS Timings 1 thread, sequential access $

    ./DRAM_CMP 10000000 10 1 n = 10000000 m = 10 thread_type[0] = 1 1e+007 0.395408s wall, 0.406250s user + 0.000000s system = 0.406250s CPU (102.7%) 170
  130. Multicore & DRAM: AoS Timings 1 thread, random access $

    ./DRAM_CMP 10000000 10 2 n = 10000000 m = 10 thread_type[0] = 2 1e+007 5.348314s wall, 5.343750s user + 0.000000s system = 5.343750s CPU (99.9%) 171
  131. Multicore & DRAM: AoS Timings 4 threads, sequential access $

    ./DRAM_CMP 10000000 10 1 1 1 1 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 1 1e+007 1e+007 1e+007 1e+007 0.508894s wall, 2.000000s user + 0.000000s system = 2.000000s CPU (393.0%) 172
  132. Multicore & DRAM: AoS Timings 4 threads: 3 sequential access

    + 1 random access $ ./DRAM_CMP 10000000 10 1 1 1 2 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 2 1e+007 1e+007 1e+007 1e+007 5.666049s wall, 7.265625s user + 0.000000s system = 7.265625s CPU (128.2%) 173
  133. Multicore & DRAM: AoS Timings Memory Access Patterns & Multicore:

    Interactions Matter Inter-thread Interference Sharing - Contention - Interference - Slowdown Threads using a shared resource (like on-chip/off-chip interconnects and memory) contend for it, interfering with each other’s progress, resulting in slowdown (and thus negative returns to increased threads count). cf. Thomas Moscibroda and Onur Mutlu, ”Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,” Microsoft Research Technical Report, MSR-TR-2007-15, February 2007. 174
  134. Multicore & DRAM: SoA I #include <cstddef> #include <cstdlib> #include

    <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> // SoA (structure-of-arrays) struct data { std::vector<double> K; std::vector<double> T; std::vector<double> P; }; 175
  135. Multicore & DRAM: SoA II double sum_if(const data & a,

    const data & b, const std::vector<std::size_t> & index) { double sum = 0.0; for (std::size_t i = 0, n = index.size(); i != n; ++i) { std::size_t j = index[i]; if (a.K[j] == b.K[j]) sum += a.K[j]; } return sum; } template <typename F> double average(F f, std::size_t m) { double average = 0.0; for (std::size_t i = 0; i != m; ++i) { average += f() / m; } 176
  136. Multicore & DRAM: SoA III return average; } std::vector<std::size_t> index_stream(std::size_t

    n) { std::vector<std::size_t> index; index.reserve(n); for (std::size_t i = 0; i != n; ++i) index.push_back(i); return index; } std::vector<std::size_t> index_random(std::size_t n) { std::vector<std::size_t> index; index.reserve(n); std::random_device rd; static std::mt19937 g(rd()); std::uniform_int_distribution<std::size_t> u(0, n - 1); 177
  137. Multicore & DRAM: SoA IV for (std::size_t i = 0;

    i != n; ++i) index.push_back(u(g)); return index; } int main(int argc, char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; const std::size_t m = (argc > 2) ? std::atoll(argv[2]) : 10; std::cout << "n = " << n << '\n'; std::cout << "m = " << m << '\n'; const std::size_t threads_count = 4; // thread access locality type // 0: none (default); 1: stream; 2: random std::vector<std::size_t> thread_type(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { 178
  138. Multicore & DRAM: SoA V thread_type[thread] = (argc > 3

    + thread) ? std::atoll(argv[3 + thread]) : 0; std::cout << "thread_type[" << thread << "] = " << thread_type[thread] << '\n'; } endl(std::cout); std::vector<std::vector<std::size_t>> index(threads_count); for (std::size_t thread = 0; thread != threads_count; ++thread) { index[thread].resize(n); if (thread_type[thread] == 1) index[thread] = index_stream(n); else if (thread_type[thread] == 2) index[thread] = index_random(n); } data v1; v1.K.resize(n, 1.0); v1.T.resize(n, 0.5); v1.P.resize(n, 3.0); 179
  139. Multicore & DRAM: SoA VI data v2; v2.K.resize(n, 1.0); v2.T.resize(n,

    2.0); v2.P.resize(n, 1.0); const auto thread_work = [m, &v1, &v2](const auto & thread_index) { const auto f = [&v1, &v2, &thread_index] { return sum_if(v1, v2, thread_index); }; return average(f, m); }; 180
  140. Multicore & DRAM: SoA VII boost::timer::auto_cpu_timer timer; std::vector<std::future<double>> results; results.reserve(threads_count);

    for (std::size_t thread = 0; thread != threads_count; ++thread) { results.emplace_back(std::async(std::launch::async, [thread, &thread_work, &index] { return thread_work(index[thread]); })); } for (auto && result : results) if (result.valid()) result.wait(); for (auto && result : results) std::cout << result.get() << '\n'; } 181
  141. Multicore & DRAM: SoA Timings 1 thread, sequential access $

    ./DRAM_CMP.SoA 10000000 10 1 n = 10000000 m = 10 thread_type[0] = 1 1e+007 0.211877s wall, 0.203125s user + 0.000000s system = 0.203125s CPU (95.9%) 182
  142. Multicore & DRAM: SoA Timings 1 thread, random access $

    ./DRAM_CMP.SoA 10000000 10 2 n = 10000000 m = 10 thread_type[0] = 2 1e+007 4.534646s wall, 4.546875s user + 0.000000s system = 4.546875s CPU (100.3%) 183
  143. Multicore & DRAM: SoA Timings 4 threads, sequential access $

    ./DRAM_CMP.SoA 10000000 10 1 1 1 1 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 1 1e+007 1e+007 1e+007 1e+007 0.256391s wall, 1.031250s user + 0.000000s system = 1.031250s CPU (402.2%) 184
  144. Multicore & DRAM: SoA Timings 4 threads: 3 sequential access

    + 1 random access $ ./DRAM_CMP.SoA 10000000 10 1 1 1 2 n = 10000000 m = 10 thread_type[0] = 1 thread_type[1] = 1 thread_type[2] = 1 thread_type[3] = 2 1e+007 1e+007 1e+007 1e+007 4.581033s wall, 5.265625s user + 0.000000s system = 5.265625s CPU (114.9%) 185
  145. Multicore & DRAM: SoA Timings Better Access Patterns yield Better

    Single-core Performance but also Reduced Interference and thus Better Multi-core Performance 186
  146. Multicore: Arithmetic Intensity L. Yavits, A. Morad, R. Ginosar, The

    effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 187
  147. Multicore: Synchronization & Connectivity Intensity L. Yavits, A. Morad, R.

    Ginosar, The effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 188
  148. Speedup: Synchronization and Connectivity Bottlenecks f: parallelizable fraction f1 :

    connectivity intensity f2 : synchronization intensity L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 189
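For orientation, the baseline bound that the cited model extends is the classical Amdahl's law (the paper's full equation additionally charges the connectivity term f1 and the synchronization term f2, so the achievable speedup saturates earlier; the exact extended form is in the paper and is not reproduced here):

\[ \mathrm{Speedup}(n) \;=\; \frac{1}{(1 - f) + \dfrac{f}{n}} \]

where n is the number of cores and f the parallelizable fraction.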
  149. Speedup: Synchronization & Connectivity Bottlenecks Speedup - affected by sequential-to-parallel

    data synchronization and inter-core communication. L. Yavits, A. Morad, R. Ginosar, The effect of communication and synchronization on Amdahl’s law in multicore systems, Parallel Computing, v. 40 n. 1, p. 1-16, January, 2014 190
  150. Pipelining & Temporal Parallelism D. Sima, ”Decisive aspects in the

    evolution of microprocessors”, Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 191
  151. Pipelining: Base N. P. Jouppi and D. W. Wall. 1989.

    ”Available instruction-level parallelism for superscalar and superpipelined machines.” In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 192
  152. Pipelining: Superscalar N. P. Jouppi and D. W. Wall. 1989.

    ”Available instruction-level parallelism for superscalar and superpipelined machines.” In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS III). ACM, New York, NY, USA, 272-282. 193
  153. Pipelining & Branches P. Emma and E. Davidson, ”Characterization of

    Branch and Data Dependencies in Programs for Evaluating Pipeline Performance,” IEEE Trans. Computers C-36, No. 7, 859-875 (July 1987) 194
  154. Branch (Mis)Prediction Example I #include <cmath> #include <cstddef> #include <cstdlib>

    #include <future> #include <iostream> #include <random> #include <vector> #include <boost/timer/timer.hpp> double sum1(const std::vector<double> & x, const std::vector<bool> & which) { double sum = 0.0; for (std::size_t i = 0, n = which.size(); i != n; ++i) { sum += which[i] ? std::cos(x[i]) : std::sin(x[i]); } return sum; } 195
  155. Branch (Mis)Prediction Example II double sum2(const std::vector<double> & x, const

    std::vector<bool> & which) { double sum = 0.0; for (std::size_t i = 0, n = which.size(); i != n; ++i) { sum += which[i] ? std::sin(x[i]) : std::cos(x[i]); } return sum; } std::vector<bool> inclusion_random(std::size_t n, double p) { std::vector<bool> which; which.reserve(n); static std::mt19937 g(1); std::bernoulli_distribution decision(p); for (std::size_t i = 0; i != n; ++i) which.push_back(decision(g)); 196
  156. Branch (Mis)Prediction Example III return which; } int main(int argc,

    char * argv[]) { const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // branch takenness / predictability type // 0: never; 1: always; 2: random const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\n'; // takenness probability // 0.0: never; 1.0: always const double p = (argc > 3) ? std::atof(argv[3]) : 0.5; std::cout << "p = " << p << '\n'; 197
  157. Branch (Mis)Prediction Example IV std::vector<bool> which; if (type == 0)

    which.resize(n, false); else if (type == 1) which.resize(n, true); else if (type == 2) which = inclusion_random(n, p); const std::vector<double> x(n, 1.1); boost::timer::auto_cpu_timer timer; std::cout << sum1(x, which) + sum2(x, which) << '\n'; } 198
  158. Timing: Branch (Mis)Prediction Example $ make BP CXXFLAGS="-std=c++14 -O3 -march=native"

    LDLIBS=-lboost_timer-mt $ ./BP 10000000 0 n = 10000000 type = 0 1.3448e+007 1.190391s wall, 1.187500s user + 0.000000s system = 1.187500s CPU (99.8%) $ ./BP 10000000 1 n = 10000000 type = 1 1.3448e+007 1.172734s wall, 1.156250s user + 0.000000s system = 1.156250s CPU (98.6%) $ ./BP 10000000 2 n = 10000000 type = 2 1.3448e+007 1.296455s wall, 1.296875s user + 0.000000s system = 1.296875s CPU (100.0%) 199
  159. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 0 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 0 1.3448e+07 0.445464s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 2495177597 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 1167613066 | | CPU_CLK_UNHALTED_REF | FIXC2 | 1167632206 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 372952380 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 14796 | +------------------------------+---------+------------+ +----------------------------+--------------+ | Metric | Core 1 | +----------------------------+--------------+ | Runtime (RDTSC) [s] | 0.4586 | | Runtime unhalted [s] | 0.4505 | | Clock [MHz] | 2591.5373 | | CPI | 0.4679 | | Branch rate | 0.1495 | | Branch misprediction rate | 5.929838e-06 | | Branch misprediction ratio | 3.967263e-05 | | Instructions per branch | 6.6903 | +----------------------------+--------------+ 200
  160. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 1 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 1 1.3448e+07 0.445354s wall, 0.440000s user + 0.000000s system = 0.440000s CPU (98.8%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 2495177490 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 1167125701 | | CPU_CLK_UNHALTED_REF | FIXC2 | 1167146162 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 372952366 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 14720 | +------------------------------+---------+------------+ +----------------------------+--------------+ | Metric | Core 1 | +----------------------------+--------------+ | Runtime (RDTSC) [s] | 0.4584 | | Runtime unhalted [s] | 0.4504 | | Clock [MHz] | 2591.5345 | | CPI | 0.4678 | | Branch rate | 0.1495 | | Branch misprediction rate | 5.899380e-06 | | Branch misprediction ratio | 3.946885e-05 | | Instructions per branch | 6.6903 | +----------------------------+--------------+ 201
  161. Likwid: Branch (Mis)Prediction Example $ likwid-perfctr -C S0:1 -g BRANCH

    -f ./BP 10000000 2 -------------------------------------------------------------------------------- CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz CPU type: Intel Core IvyBridge processor CPU clock: 2.59 GHz -------------------------------------------------------------------------------- n = 10000000 type = 2 1.3448e+07 0.509917s wall, 0.510000s user + 0.000000s system = 0.510000s CPU (100.0%) -------------------------------------------------------------------------------- Group 1: BRANCH +------------------------------+---------+------------+ | Event | Counter | Core 1 | +------------------------------+---------+------------+ | INSTR_RETIRED_ANY | FIXC0 | 3191479747 | | CPU_CLK_UNHALTED_CORE | FIXC1 | 2264945099 | | CPU_CLK_UNHALTED_REF | FIXC2 | 2264967068 | | BR_INST_RETIRED_ALL_BRANCHES | PMC0 | 468135649 | | BR_MISP_RETIRED_ALL_BRANCHES | PMC1 | 15326586 | +------------------------------+---------+------------+ +----------------------------+-----------+ | Metric | Core 1 | +----------------------------+-----------+ | Runtime (RDTSC) [s] | 0.8822 | | Runtime unhalted [s] | 0.8740 | | Clock [MHz] | 2591.5589 | | CPI | 0.7097 | | Branch rate | 0.1467 | | Branch misprediction rate | 0.0048 | | Branch misprediction ratio | 0.0327 | | Instructions per branch | 6.8174 | +----------------------------+-----------+ 202
  162. Perf: Branch (Mis)Prediction Example $ perf stat -e branches,branch-misses -r

    10 ./BP 10000000 0 Performance counter stats for './BP 10000000 0' (10 runs): 374,121,213 branches ( +- 0.02% ) 23,260 branch-misses # 0.01% of all branches ( +- 0.35% ) 0.460392835 seconds time elapsed ( +- 0.50% ) $ perf stat -e branches,branch-misses -r 10 ./BP 10000000 1 Performance counter stats for './BP 10000000 1' (10 runs): 374,040,282 branches ( +- 0.01% ) 23,124 branch-misses # 0.01% of all branches ( +- 0.45% ) 0.457583418 seconds time elapsed ( +- 0.04% ) $ perf stat -e branches,branch-misses -r 10 ./BP 10000000 2 Performance counter stats for './BP 10000000 2' (10 runs): 469,331,762 branches ( +- 0.01% ) 15,326,501 branch-misses # 3.27% of all branches ( +- 0.01% ) 0.884858777 seconds time elapsed ( +- 0.30% ) 203
  163. Branch Prediction & Speculative Execution D. Sima, ”Decisive aspects in

    the evolution of microprocessors”, Proceedings of the IEEE, vol. 92, pp. 1896-1926, 2004 213
  164. Block Enlargement Fisher, J. A. (1983). ”Very Long Instruction Word

    architectures and the ELI-512.” Proceedings of the 10th Annual International Symposium on Computer Architecture. 214
  165. Block Enlargement Joseph A. Fisher and John J. O’Donnell, ”VLIW

    Machines: Multiprocessors We Can Actually Program,” CompCon 84 Proceedings, pp. 299-305, IEEE, 1984. 215
  166. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) 216
  167. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) 216
  168. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 216
  169. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 216
  170. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 • what they have in common: 216
  171. Branch Predictability • takenness rate? • transition rate? • compare:

    • 01010101 (i % 2) • 01101101 (i % 3) • 10101010 !(i % 2) • 10010010 !(i % 3) • 00110011 (i / 2) % 2 • 00011100 (i / 3) % 2 • what they have in common: • all predictable! 216
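A minimal sketch (illustrative code, not from the slides) for experimenting with these patterns: every predicate below is deterministic and periodic, so after a short warm-up a modern predictor handles all of them with a near-zero miss rate, regardless of their differing takenness rates; the Bernoulli-driven branch is the genuinely hard case to compare against.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    const std::size_t n = 10000000;
    static std::mt19937 g(1);
    std::bernoulli_distribution coin(0.5);
    std::vector<std::uint8_t> random_bits(n);
    for (std::size_t i = 0; i != n; ++i) random_bits[i] = coin(g);

    std::uint64_t taken[5] = {};
    for (std::size_t i = 0; i != n; ++i) {
        if (i % 2)          ++taken[0]; // 01010101...: 50% taken, period 2
        if (i % 3)          ++taken[1]; // 01101101...: ~67% taken, period 3
        if ((i / 2) % 2)    ++taken[2]; // 00110011...: 50% taken, period 4
        if ((i / 3) % 2)    ++taken[3]; // 00011100...: 50% taken, period 6
        if (random_bits[i]) ++taken[4]; // ~50% taken, aperiodic: the hard one
    }
    for (auto t : taken) std::cout << t << '\n';
}

Running this under perf stat -e branches,branch-misses (or the likwid BRANCH group, as on the earlier slides) should attribute essentially all mispredictions to the last branch.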
  172. Branch Predictability & Marker API https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#using-the-marker-api https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerC g++ -Ofast

    -march=native source.cpp -o application -std=c++14 -DLIKWID_PERFMON -lpthread -llikwid likwid-perfctr -f -C 0-3 -g BRANCH -m ./application #include <likwid.h> // . . . LIKWID_MARKER_START("branch"); // branch code LIKWID_MARKER_STOP("branch"); 217
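A slightly fuller sketch of the marker usage (following the likwid wiki linked above; the surrounding init/close calls are required for likwid-perfctr -m to report the region at all):

#include <likwid.h>

int main()
{
    LIKWID_MARKER_INIT;             // set up the marker environment once per process
    LIKWID_MARKER_START("branch");  // begin the named measurement region
    // ... branch-heavy code under test ...
    LIKWID_MARKER_STOP("branch");   // end the region
    LIKWID_MARKER_CLOSE;            // write out results for likwid-perfctr -m
}

Compiled with -DLIKWID_PERFMON -llikwid as shown above; without the define, the marker macros expand to nothing, so the instrumentation costs nothing in ordinary builds.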
  173. Branch Entropy linear entropy: EL(p) = 2 × min(p, 1

    − p) intuition: miss rate proportional to the probability of the least frequent outcome 218
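A quick numeric check of that intuition (a small helper, not from the slides):

#include <algorithm>
#include <iostream>

// Linear branch entropy: E_L(p) = 2 * min(p, 1 - p), p = takenness probability
double linear_entropy(double p) { return 2.0 * std::min(p, 1.0 - p); }

int main()
{
    std::cout << linear_entropy(0.50) << '\n'; // 1.0: 50/50 branch, hardest case
    std::cout << linear_entropy(0.95) << '\n'; // 0.1: mostly-taken branch, rarely missed
    std::cout << linear_entropy(1.00) << '\n'; // 0.0: always taken, trivially predictable
}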
  174. Branch Takenness Probability Sander De Pestel, Stijn Eyerman and Lieven

    Eeckhout, ”Micro-Architecture Independent Branch Behavior Characterization”, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 219
  175. Branch Entropy & Miss Rate: Linear Relationship Sander De Pestel,

    Stijn Eyerman and Lieven Eeckhout, ”Micro-Architecture Independent Branch Behavior Characterization”, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, March 2015. 220
  176. Branches & Expectations: Code I #include <chrono> #include <cmath> #include

    <cstdint> #include <cstdlib> #include <iostream> #include <iterator> #include <numeric> #include <random> #include <string> #include <vector> #define likely(x) (__builtin_expect(!!(x), 1)) #define unlikely(x) (__builtin_expect(!!(x), 0)) #define unpredictable(x) (__builtin_unpredictable((x))) 221
  177. Branches & Expectations: Code II using T = int; void

    f(T z, T & x, T & y) { ((z < 0) ? x : y) = 5; } void generate_never(std::size_t n, std::vector<T> & zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(10, 19); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } 222
  178. Branches & Expectations: Code III void generate_always(std::size_t n, std::vector<T> &

    zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(-19, -10); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } void generate_random(std::size_t n, std::vector<T> & zs) { zs.reserve(n); static std::mt19937 g(1); std::uniform_int_distribution<T> z(-5, 4); for (std::size_t i = 0; i != n; ++i) zs.push_back(z(g)); return; } 223
  179. Branches & Expectations: Code IV int main(int argc, char *

    argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // takenness predictability type // 0: never; 1: always; 2: random const std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\t'; std::vector<T> xs(n), ys(n), zs; if (type == 0) { std::cout << "never"; generate_never(n, zs); } else if (type == 1) { std::cout << "always"; generate_always(n, zs); } else if (type == 2) { std::cout << "random"; generate_random(n, zs); } endl(std::cout); 224
  180. Branches & Expectations: Code V const auto time_start = std::chrono::steady_clock::now();

    T sum = 0; for (std::size_t i = 0; i != n; ++i) { f(zs[i], xs[i], ys[i]); } const auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; std::cout << "duration: " << duration.count() << '\n'; endl(std::cout); std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n'; std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n'; } 225
  181. Branches & Expectations: Compiling & Timing g++ -ggdb -std=c++14 -march=native

    -Ofast ./branches.cpp -o branches_g clang++ -ggdb -std=c++14 -march=native -Ofast ./branches.cpp -o branches_c time ./branches_g 1000000 0 time ./branches_g 1000000 1 time ./branches_g 1000000 2 time ./branches_c 1000000 0 time ./branches_c 1000000 1 time ./branches_c 1000000 2 226
  182. Branches & Expectations: Timings (GCC) $ time ./branches_g 1000000 0

    n = 1000000 type = 0 never duration: 0.00082991 sum(xs): 0 sum(ys): 5000000 real 0m0.034s user 0m0.033s sys 0m0.003s $ time ./branches_g 1000000 1 n = 1000000 type = 1 always duration: 0.000839488 sum(xs): 5000000 sum(ys): 0 real 0m0.031s user 0m0.030s sys 0m0.000s $ time ./branches_g 1000000 2 n = 1000000 type = 2 random duration: 0.0052968 sum(xs): 2498105 sum(ys): 2501895 real 0m0.038s user 0m0.033s sys 0m0.003s 227
  183. Branches & Expectations: Timings (Clang) $ time ./branches_c 1000000 0

    n = 1000000 type = 0 never duration: 0.00091161 sum(xs): 0 sum(ys): 5000000 real 0m0.036s user 0m0.033s sys 0m0.000s $ time ./branches_c 1000000 1 n = 1000000 type = 1 always duration: 0.000765925 sum(xs): 5000000 sum(ys): 0 real 0m0.036s user 0m0.033s sys 0m0.000s $ time ./branches_c 1000000 2 n = 1000000 type = 2 random duration: 0.00554585 sum(xs): 2498105 sum(ys): 2501895 real 0m0.041s user 0m0.040s sys 0m0.000s 228
  184. So many performance events, so little time ”So many performance

    events, so little time,” Gerd Zellweger, Denny Lin, Timothy Roscoe. Proceedings of the 7th Asia-Pacific Workshop on Systems (APSys, Hong Kong, China, August 2016). 229
  185. Hierarchical cycle accounting Andrzej Nowak, David Levinthal, Willy Zwaenepoel: ”Hierarchical

    cycle accounting: a new method for application performance tuning.” ISPASS 2015. https://github.com/David-Levinthal/gooda 230
  186. Top-down Microarchitecture Analysis Method (TMAM) https://github.com/andikleen/pmu-tools/wiki/toplev-manual https://sites.google.com/site/analysismethods/yasin-pubs ”A Top-Down Method

    for Performance Analysis and Counters Architecture,” Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 231
  187. TMAM: Bottlenecks ”A Top-Down Method for Performance Analysis and Counters

    Architecture,” Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 232
  188. TMAM: Breakdown ”A Top-Down Method for Performance Analysis and Counters

    Architecture,” Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 233
  189. TMAM: Meaning Updates: https://download.01.org/perfmon/ ”A Top-Down Method for Performance Analysis

    and Counters Architecture,” Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014. 234
  190. Branches & Expectations: TMAM, Level 1 (GCC) $ ~/builds/pmu-tools/toplev.py -l1

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu n = 1000000 type = 2 random duration: 0.00523105 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 53.92 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_g 1000000 2 235
  191. Branches & Expectations: TMAM, Level 2 (GCC) $ ~/builds/pmu-tools/toplev.py -l2

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_g 1000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x n = 1000000 type = 2 random duration: 0.00528841 sum(xs): 2498105 sum(ys): 2501895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u, n = 1000000 type = 2 random duration: 0.00550316 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 53.94 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 47.54 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 16.41 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./branches_g 1000000 2 236
  192. Branches & Expectations: TMAM, Level 2, perf (GCC) perf record

    -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period -o perf.data ./branches_g 1000000 2 perf report -Mintel 237
  193. Branches & Expectations: TMAM, Level 1 (Clang) $ ~/builds/pmu-tools/toplev.py -l1

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu n = 1000000 type = 2 random duration: 0.00555177 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 45.53 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./branches_c 1000000 2 238
  194. Branches & Expectations: TMAM, Level 2 (Clang) $ ~/builds/pmu-tools/toplev.py -l2

    --long-desc --no-multiplex --show-sample --single-thread --user ./branches_c 1000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umas n = 1000000 type = 2 random duration: 0.0055571 sum(xs): 2498105 sum(ys): 2501895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instruction n = 1000000 type = 2 random duration: 0.00556777 sum(xs): 2498105 sum(ys): 2501895 FE Frontend_Bound: 45.54 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 39.20 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 15.18 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,p 239
  195. Branches & Expectations: TMAM, Level 2, perf (Clang) perf record

    -g -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./branches_c 1000000 2 perf report -Mintel 240
  196. Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and

    James E. Smith. 2009. ”A mechanistic performance model for superscalar out-of-order processors.” ACM Trans. Comput. Syst. 27, 2, Article 3. 241
  197. Branch Misprediction Penalty Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis and

    James E. Smith, ”A Performance Counter Architecture for Computing Accurate CPI Components”, ASPLOS 2006, pp. 175-184. 242
  198. Virtual Functions & Indirect Branches: Code I #include <chrono> #include

    <cmath> #include <cstdint> #include <cstdlib> #include <iostream> #include <iterator> #include <memory> #include <numeric> #include <random> #include <string> #include <vector> #define str(s) #s #define likely(x) (__builtin_expect(!!(x), 1)) #define unlikely(x) (__builtin_expect(!!(x), 0)) #define unpredictable(x) (__builtin_unpredictable(!!(x))) 243
  199. Virtual Functions & Indirect Branches: Code II using T =

    int; struct base { virtual T f() const { return 0; } }; struct derived_taken : base { T f() const override { return -1; } }; struct derived_untaken : base { T f() const override { return 1; } }; void f(const base & b, T & x, T & y) { ((b.f() < 0) ? x : y) = 119; } void generate_never(std::size_t n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); for (std::size_t i = 0; i != n; ++i) zs.push_back(std::make_unique<derived_untaken>()); return; 244
  200. Virtual Functions & Indirect Branches: Code III } void generate_always(std::size_t

    n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); for (std::size_t i = 0; i != n; ++i) zs.push_back(std::make_unique<derived_taken>()); return; } void generate_random(std::size_t n, std::vector<std::unique_ptr<base>> & zs) { zs.reserve(n); static std::mt19937 g(1); std::bernoulli_distribution z(0.5); for (std::size_t i = 0; i != n; ++i) { if (z(g)) zs.emplace_back(std::make_unique<derived_taken>()); else zs.emplace_back(std::make_unique<derived_untaken>()); 245
  201. Virtual Functions & Indirect Branches: Code IV } return; }

    int main(int argc, char * argv[]) { // sample size const std::size_t n = (argc > 1) ? std::atoll(argv[1]) : 1000; std::cout << "n = " << n << '\n'; // takenness predictability type // 0: never; 1: always; 2: random std::size_t type = (argc > 2) ? std::atoll(argv[2]) : 0; std::cout << "type = " << type << '\t'; std::vector<T> xs(n), ys(n); std::vector<std::unique_ptr<base>> zs; if (type == 0) { std::cout << "never"; generate_never(n, zs); } else if (type == 1) { std::cout << "always"; generate_always(n, zs); } 246
  202. Virtual Functions & Indirect Branches: Code V else if (type

    == 2) { std::cout << "random"; generate_random(n, zs); } endl(std::cout); auto time_start = std::chrono::steady_clock::now(); T sum = 0; for (std::size_t i = 0; i != n; ++i) { f(*zs[i], xs[i], ys[i]); } auto time_end = std::chrono::steady_clock::now(); std::chrono::duration<double> duration = time_end - time_start; std::cout << "duration: " << duration.count() << '\n'; endl(std::cout); std::cout << "sum(xs): " << accumulate(begin(xs), end(xs), T{}) << '\n'; std::cout << "sum(ys): " << accumulate(begin(ys), end(ys), T{}) << '\n'; } 247
  203. Virtual Functions & Indirect Branches: Compiling & Timing g++ -ggdb

    -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_g clang++ -ggdb -std=c++14 -march=native -Ofast ./vbranches.cpp -o vbranches_c time ./vbranches_g 10000000 0 time ./vbranches_g 10000000 1 time ./vbranches_g 10000000 2 time ./vbranches_c 10000000 0 time ./vbranches_c 10000000 1 time ./vbranches_c 10000000 2 248
  204. Virtual Functions & Indirect Branches: Timings (GCC) $ time ./vbranches_g

    10000000 0 n = 10000000 type = 0 never duration: 0.0338749 sum(xs): 0 sum(ys): 1190000000 real 0m0.645s user 0m0.573s sys 0m0.070s $ time ./vbranches_g 10000000 1 n = 10000000 type = 1 always duration: 0.0406144 sum(xs): 1190000000 sum(ys): 0 real 0m0.648s user 0m0.563s sys 0m0.083s $ time ./vbranches_g 10000000 2 n = 10000000 type = 2 random duration: 0.131803 sum(xs): 595154105 sum(ys): 594845895 real 0m0.956s user 0m0.863s sys 0m0.090s 249
  205. Branches & Expectations: Timings (Clang) $ time ./vbranches_c 10000000 0

    n = 10000000 type = 0 never duration: 0.0314749 sum(xs): 0 sum(ys): 1190000000 real 0m0.623s user 0m0.530s sys 0m0.090s $ time ./vbranches_c 10000000 1 n = 10000000 type = 1 always duration: 0.0314727 sum(xs): 1190000000 sum(ys): 0 real 0m0.623s user 0m0.557s sys 0m0.063s $ time ./vbranches_c 10000000 2 n = 10000000 type = 2 random duration: 0.0854935 sum(xs): 595154105 sum(ys): 594845895 real 0m1.863s user 0m1.800s sys 0m0.063s 250
  206. Virtual Functions & Indirect Branches: TMAM, Level 1 (GCC) $

    ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu/event=0x9c,umask=0x1/u,cycles:u} n = 10000000 type = 2 random duration: 0.131386 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 35.96 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. BAD Bad_Speculation: 12.98 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_g 10000000 2 251
  207. Virtual Functions & Indirect Branches: TMAM, Level 2 (GCC) $

    ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umask=0x1,cmask=4/u,cpu/event=0xc5,umask=0x0/u,c n = 10000000 type = 2 random duration: 0.131247 sum(xs): 595154105 sum(ys): 594845895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instructions:u,cycles:u,cpu/event=0xa3,umask=0x4,cmask= n = 10000000 type = 2 random duration: 0.131361 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 36.02 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 17.41 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u BAD Bad_Speculation: 12.92 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. BAD Bad_Speculation.Branch_Mispredicts: 12.75 % [100.00%] This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues.. http://www.intel.com/content/www/us/en/architecture-and- technology/64-ia-32-architectures-optimization-manual.html Sampling events: br_misp_retired.all_branches:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data 252
  208. Virtual Functions & Indirect Branches: TMAM, Level 3 (GCC) $

    ~/builds/pmu-tools/toplev.py -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_g 10000000 2 n = 10000000 type = 2 random duration: 0.13145 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 35.96 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 17.44 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.69 % [100.00%] This metric represents cycles fraction the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path, following all sorts of miss-predicted branches. For example, branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sampling events: br_misp_retired.all_branches:u BAD Bad_Speculation: 12.97 % [100.00%] This category represents slots fraction wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example, wasted work due to miss- predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example. BAD Bad_Speculation.Branch_Mispredicts: 12.82 % [100.00%] This metric represents slots fraction the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path, or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.. Using profile feedback in the compiler may help. Please see the optimization manual for general strategies for addressing branch misprediction issues.. http://www.intel.com/content/www/us/en/architecture-and- technology/64-ia-32-architectures-optimization-manual.html Sampling events: br_misp_retired.all_branches:u Sampling: perf record -g -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u -o perf.data ./vbranches_g 10000000 2 253
  209. Virtual Functions: TMAM, Level 3, perf (GCC) perf record -g

    -e cycles:pp, cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0xc5,umask=0x0,name=Branch_Mispredicts_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/ppu, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_g 10000000 2 perf report -Mintel 254
  210. Virtual Functions & Indirect Branches: TMAM, Level 1 (Clang) $

    ~/builds/pmu-tools/toplev.py -l1 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 Using level 1. RUN #1 of 1 perf stat -x\; -e '{cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cpu/event=0xd,umask=0x3,cmask=1/u,cpu n = 10000000 type = 2 random duration: 0.0858722 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.66 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. Sampling: perf record -g -e cycles:pp:u -o perf.data ./vbranches_c 10000000 2 255
  211. Virtual Functions & Indirect Branches: TMAM, Level 2 (Clang) $

    ~/builds/pmu-tools/toplev.py -l2 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 Using level 2. RUN #1 of 2 perf stat -x\; -e '{cpu/event=0x9c,umask=0x1/u,cpu/event=0xc3,umask=0x1,edge=1,cmask=1/u,cpu/event=0xc2,umask=0x2/u,cpu/event=0xe,umask=0x1/u,cycles:u,cpu/event=0x79,umask=0x30/u,cpu/event=0x9c,umas n = 10000000 type = 2 random duration: 0.0859943 sum(xs): 595154105 sum(ys): 594845895 RUN #2 of 2 perf stat -x\; -e '{cpu/event=0xb1,umask=0x1,cmask=2/u,cpu/event=0xa2,umask=0x8/u,cpu/event=0xb1,umask=0x1,cmask=1/u,cpu/event=0xa3,umask=0x6,cmask=6/u,cpu/event=0x9c,umask=0x1,cmask=4/u,instruction n = 10000000 type = 2 random duration: 0.0861661 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.61 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 26.64 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u RET Retiring.Microcode_Sequencer: 9.04 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u,cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u,cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,p 256
  212. Virtual Functions & Indirect Branches: TMAM, Level 3 (Clang) ~/builds/pmu-tools/toplev.py

    -l3 --long-desc --no-multiplex --show-sample --single-thread --user ./vbranches_c 10000000 2 sum(xs): 595154105 sum(ys): 594845895 FE Frontend_Bound: 37.65 % [100.00%] This category represents slots fraction where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend, a branch predictor predicts the next address to fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into micro-ops (uops). Ideally the Frontend can issue 4 uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example, stalls due to instruction-cache misses would be categorized under Frontend Bound. FE Frontend_Bound.Frontend_Latency: 26.63 % [100.00%] This metric represents slots fraction the CPU was stalled due to Frontend latency issues. For example, instruction- cache misses, iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases, the Frontend eventually delivers no uops for some period. Sampling events: rs_events.empty_end:u FE Frontend_Bound.Frontend_Latency.MS_Switches: 8.40 % [100.00%] This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB or MITE pipelines. Certain operations cannot be handled natively by the execution pipeline, and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID, or uncommon conditions like Floating Point Assists when dealing with Denormals. Sampling events: idq.ms_switches:u RET Retiring.Microcode_Sequencer: 9.04 % [100.00%] This metric represents slots fraction the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings, or CPUID), or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.. Sampling events: idq.ms_uops:u Sampling: perf record -g -e cycles:pp:u, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=200003/u, cpu/event=0x79,umask=0x30,edge=1,cmask=1,name=MS_Switches_IDQ_MS_SWITCHES,period=2000003/u, cpu/event=0x79,umask=0x30,name=Microcode_Sequencer_IDQ_MS_UOPS,period=2000003/u -o perf.data ./vbranches_c 10000000 2 257
  213. Virtual Functions: TMAM, Level 3, perf (Clang) perf record -g

    -e cycles:pp, cpu/event=0x5e,umask=0x1,edge=1,inv=1,cmask=1,name=Frontend_Latency_RS_EVENTS_EMPTY_END,period=2 -o perf.data ./vbranches_cmov_xy_c 10000000 2 perf report -Mintel 258
  214. Compiler-Specific Built-in Functions GCC & Clang: __builtin_expect http://llvm.org/docs/BranchWeightMetadata.html#built-in-expect-instructions https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

    likely & unlikely https://kernelnewbies.org/FAQ/LikelyUnlikely Clang: __builtin_unpredictable http://clang.llvm.org/docs/LanguageExtensions.html#builtin-unpredictable 261
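A minimal usage sketch (hypothetical code, reusing the macro definitions from the earlier listing): the hint only tells the compiler which outcome to treat as the hot, fall-through path when laying out the code; it does not make a genuinely random branch predictable.

#define likely(x)   (__builtin_expect(!!(x), 1))
#define unlikely(x) (__builtin_expect(!!(x), 0))

int process(int value)
{
    if (unlikely(value < 0)) // error path: expected cold, laid out out-of-line
        return -1;
    return value * 2;        // hot path: falls through without a taken branch
}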
  215. Branch Misprediction, Speculation, and Wrong-Path Execution J. Reineke et al.,

    “A Definition and Classification of Timing Anomalies,” Proc. Int’l Workshop Worst Case Execution Time (WCET), 2006. 262
  216. Branch Misprediction Penalty & Wrong-Path Execution Tejas S. Karkhanis and

    James E. Smith. 2004. ”A First-Order Superscalar Processor Model.” In Proceedings of the 31st annual international symposium on Computer architecture (ISCA ’04). 263
  217. The Number of Cycles Sam Van den Steen; Stijn Eyerman;

    Sander De Pestel; Moncef Mechri; Trevor E. Carlson; David Black-Schaffer; Erik Hagersten; Lieven Eeckhout, “Analytical Processor Performance and Power Modeling using Micro-Architecture Independent Characteristics,” Transactions on Computers (TC) 2016. C - #cycles, N - #instructions, Deff - effective dispatch rate, mbpred - #branch mispredictions, cres - branch resolution time, cfe - front-end pipeline depth, mILi - #instruction fetch misses at each level i in the cache hierarchy, cLi - access latency to each cache level, ROB - size of the Reorder Buffer, mLLC - #number of LLC load misses, cmem - memory access time, cbus - memory bus transfer and waiting time, MLP - amount of memory-level parallelism, PhLLC - LLC hit chain penalty 264
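The slide lists the model's inputs only; as a rough orientation (an illustrative first-order, interval-style decomposition in the spirit of the cited mechanistic models, not the paper's exact equation, which also accounts for the ROB size and the LLC hit-chain penalty PhLLC):

\[ C \;\approx\; \frac{N}{D_{\mathit{eff}}} \;+\; m_{\mathit{bpred}} \cdot (c_{\mathit{res}} + c_{\mathit{fe}}) \;+\; \sum_{i} m_{\mathit{IL}_i} \cdot c_{L_i} \;+\; \frac{m_{\mathit{LLC}}}{\mathit{MLP}} \cdot (c_{\mathit{mem}} + c_{\mathit{bus}}) \]

i.e., a base term of useful dispatch plus penalty terms for branch mispredictions, instruction-fetch misses per cache level, and (partially overlappable) long-latency loads.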
  218. Cache-aware Roofline model ”Cache-aware Roofline model: Upgrading the loft.” Aleksandar

    Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 267
  219. Cache-aware Roofline model ”Cache-aware Roofline model: Upgrading the loft.” Aleksandar

    Ilic, Frederico Pratas, Leonel Sousa, IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 1-1, Jan.-June, 2014. 268
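The bound underlying these figures is the classic roofline, which the cache-aware variant evaluates once per level of the memory hierarchy (one roof each for L1, L2, L3, and DRAM):

\[ P_{\mathrm{attainable}}(I) \;=\; \min\bigl(P_{\mathrm{peak}},\; I \cdot B\bigr) \]

where I is the arithmetic (operational) intensity in flops/byte and B the sustainable bandwidth of the memory level under consideration.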
  220. Roofline Model: Microarchitectural Bottlenecks ”Extending the Roofline Model: Bottleneck Analysis

    with Microarchitectural Constraints.” Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 269
  221. Roofline Model: Microarchitectural Bottlenecks ”Extending the Roofline Model: Bottleneck Analysis

    with Microarchitectural Constraints.” Victoria Caparros and Markus Püschel Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014. 270
  222. C++ Standards: C++11 & C++14 Atomic Operations & Concurrent Memory

    Model http://en.cppreference.com/w/cpp/atomic http://github.com/MattPD/cpplinks/blob/master/atomics.lockfree.memory_model.md ”The C11 and C++11 Concurrency Model” by Mark John Batty: http://www.cl.cam.ac.uk/~mjb220/thesis/ Move semantics https://isocpp.org/wiki/faq/cpp11-language#rval http://thbecker.net/articles/rvalue_references/section_01.html http://kholdstare.github.io/technical/2013/11/23/moves-demystified.html scoped_allocator (stateful allocators support) https://isocpp.org/wiki/faq/cpp11-library#scoped-allocator http://en.cppreference.com/w/cpp/header/scoped_allocator https://accu.org/content/conf2012/JonathanWakely-CXX11_allocators.pdf https://accu.org/content/conf2013/Frank_Birbacher_Allocators.r210article.pdf 271
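A minimal sketch of the atomics surface listed above (hypothetical counter, not from the slides): the explicit memory-order arguments are where the C++11 concurrency memory model becomes visible in code.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<long> counter{0};

int main()
{
    auto work = [] {
        for (int i = 0; i != 1000000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed); // atomicity only, no ordering
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    std::cout << counter.load() << '\n'; // 2000000: no updates lost
}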
  223. C++ Standards: C++11, C++14, and C++17 reducing the need for

    conditional compilation via macros and template metaprogramming constexpr https://isocpp.org/wiki/faq/cpp11-language#cpp11-constexpr https://isocpp.org/wiki/faq/cpp14-language#extended-constexpr if constexpr http://en.cppreference.com/w/cpp/language/if#Constexpr_If 272
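A small sketch of that point (hypothetical helper): with C++17 if constexpr the discarded branch is not instantiated, so a single template can replace an #ifdef or a tag-dispatch overload set.

#include <iostream>
#include <type_traits>

template <typename T>
auto describe(T value)
{
    if constexpr (std::is_floating_point<T>::value)
        return value * 0.5; // only instantiated for floating-point T
    else
        return value / 2;   // only instantiated for integral T
}

int main()
{
    std::cout << describe(3.0) << ' ' << describe(3) << '\n'; // prints: 1.5 1
}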
  224. C++17 Standard std::string_view http://en.cppreference.com/w/cpp/string/basic_string_view interoperability with C APIs (e.g., sockets)

    without extra allocations / copies std::aligned_alloc (C11) http://en.cppreference.com/w/cpp/memory/c/aligned_alloc aligned uninitialized storage allocation (vectorization) Hardware interference size http://eel.is/c++draft/hardware.interference http://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size portable cache line size information (e.g., padding to avoid false sharing) Extended allocators & polymorphic memory resources http://en.cppreference.com/w/cpp/memory/polymorphic_allocator http://stackoverflow.com/questions/38010544/polymorphic-allocator-when-and-why-should-i-use-it http://boost.org/doc/libs/release/doc/html/container/extended_functionality.html 273
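A sketch of the false-sharing use case for the interference-size constant (hypothetical; compiler support for the constant was still arriving around the time of the talk, hence the hand-written 64-byte fallback, which assumes a typical x86 cache line):

#include <atomic>
#include <cstddef>
#include <new>

#if defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t cacheline = std::hardware_destructive_interference_size;
#else
constexpr std::size_t cacheline = 64; // assumption: typical x86 cache line size
#endif

// Two counters updated by different threads; the alignment keeps them on separate
// cache lines so the writes do not ping-pong a shared line between cores.
struct counters {
    alignas(cacheline) std::atomic<long> a{0};
    alignas(cacheline) std::atomic<long> b{0};
};

int main()
{
    counters c;
    c.a.fetch_add(1);
    c.b.fetch_add(1);
    return static_cast<int>(c.a.load() + c.b.load()) - 2; // 0
}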
  225. C++ Core Guidelines P: Philosophy • P.9: Don’t waste time

    or space. Per: Performance • Per.3: Don’t optimize something that’s not performance critical. • Per.6: Don’t make claims about performance without measurements. • Per.7: Design to enable optimization • Per.18: Space is time. • Per.19: Access memory predictably. • Per.30: Avoid context switches on the critical path https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-performance https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#S-performance 274
  236. Takeaway: It depends! • Memory access cost: latency / bandwidth?

    • Cache miss cost? • miss-penalty != miss-count x miss-latency • miss-penalty: overlap, independence • prefetch: convert cold misses to pending hits (tradeoffs: BW, MSHR) • Branch cost? • branch-predictability x branch-penalty • branch-penalty != branch-count x branch-resolution-time • branch-penalty: predictability (entropy), interval length / window drain, dependence (latency, critical path, how many ops feed the branch, how many are fed by the branch), instruction locality • Branches vs. Predicated Execution - penalty vs. overhead • misprediction cost vs. useless work • easy-to-predict vs. hard-to-predict • cmov & tradeoffs: converting control dependencies to data dependencies 275
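To make the branches-vs-predication bullets concrete, a small sketch of my own: the second loop converts the control dependency into a data dependency, trading possible mispredictions for unconditional work. Whether the compiler actually emits cmov/setcc for it depends on the compiler and flags.

    #include <cstdint>
    #include <random>
    #include <vector>

    // Branchy version: cost depends on how predictable (x > threshold) is;
    // hard-to-predict data pays the misprediction penalty on many iterations.
    std::int64_t sum_above_branchy(const std::vector<int>& v, int threshold) {
        std::int64_t s = 0;
        for (int x : v)
            if (x > threshold)
                s += x;
        return s;
    }

    // Branch-free version: the condition feeds the data path (multiply by 0 or 1)
    // instead of the control path -- no branch to mispredict, some useless work.
    std::int64_t sum_above_branchless(const std::vector<int>& v, int threshold) {
        std::int64_t s = 0;
        for (int x : v)
            s += static_cast<std::int64_t>(x) * (x > threshold);
        return s;
    }

    int main() {
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> dist(0, 100);
        std::vector<int> v(1 << 20);
        for (int& x : v) x = dist(rng);   // hard-to-predict pattern
        return sum_above_branchy(v, 50) == sum_above_branchless(v, 50) ? 0 : 1;
    }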
  237. Takeaways Principles Data structures & data layout - fundamental part

    of design CPUs & pervasive forms of parallelism • can support each other: PLP, ILP (MLP!), TLP, DLP Balanced design vs. bottlenecks Overlapping latencies Sharing-contention-interference-slowdown Yale Patt’s Phase 2: Break the layers: • break through the hardware/software interface • harness all levels of the transformation hierarchy 276
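A sketch of the ILP/overlapping-latencies point (my illustration, not code from the talk): multiple independent accumulators break a latency-bound dependency chain so the out-of-order core can overlap the adds.

    #include <cstddef>
    #include <vector>

    // One accumulator: every add waits for the previous one (latency-bound chain).
    double sum_serial(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s;
    }

    // Four independent accumulators: the adds can execute in parallel (ILP),
    // moving the loop from add latency toward add throughput. Results may differ
    // in the last bits because floating-point addition is not associative.
    double sum_ilp(const std::vector<double>& v) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0, n = v.size();
        for (; i + 4 <= n; i += 4) {
            s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
        }
        for (; i < n; ++i) s0 += v[i];
        return (s0 + s1) + (s2 + s3);
    }

    int main() {
        std::vector<double> v(1 << 22, 1.0);
        return sum_serial(v) == sum_ilp(v) ? 0 : 1;   // equal here: all elements are 1.0
    }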
  238. Phase 2: Harnessing the Transformation Hierarchy Yale N. Patt, Microprocessor

    Performance, Phase 2: Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 277
  239. Break the Layers Yale N. Patt, Microprocessor Performance, Phase 2:

    Can We Harness the Transformation Hierarchy https://youtube.com/watch?v=0fLlDkC625Q 278
  240. Pigeonholing has to go Yale N. Patt at Yale Patt

    75 Visions of the Future Computer Architecture Workshop: ”Are you a software person or a hardware person?” I’m a person; this pigeonholing has to go. We must break the layers. Abstractions are great - AFTER you understand what’s being abstracted. Yale N. Patt, 2013 IEEE CS Harry H. Goode Award Recipient Interview — https://youtu.be/S7wXivUy-tk Yale N. Patt at Yale Patt 75 Visions of the Future Computer Architecture Workshop — https://youtu.be/x4LH1cJCvxs 279
  241. Out-of-Order Execution: Overlap R.M. Tomasulo, “An Efficient Algorithm for Exploiting

    Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 282
  242. Out-of-Order Execution: Reservation Stations R.M. Tomasulo, “An Efficient Algorithm for

    Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 283
  243. Out-of-Order Execution: Dependencies vs. Overlap R.M. Tomasulo, “An Efficient Algorithm

    for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967. 284
  245. Out-of-Order Execution of Simple Micro-Operations Y.N. Patt, W.M. Hwu, and

    M. Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 286
  246. Out-of-Order Execution: Restricted Dataflow Y.N. Patt, W.M. Hwu, and M.

    Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 287
  247. Out-of-Order Execution: Results Buffer Y.N. Patt, W.M. Hwu, and M.

    Shebanow, “HPS, A New Microarchitecture: Rationale and Introduction,” Proc. 18th Ann. Workshop Microprogramming, 1985, pp. 103–108. 288
  248. Pipelining & Precise Exceptions: Reorder Buffer (ROB) J.E. Smith and

    A.R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. 12th Ann. IEEE/ACM Int’l Symp. Computer Architecture, 1985, pp. 36–44. 289
  249. Execution: Superscalar & Out-Of-Order J.E. Smith and G.S. Sohi, ”The

    Microarchitecture of Superscalar Processors,” Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 290
  250. Superscalar CPU Organization J.E. Smith and G.S. Sohi, ”The Microarchitecture

    of Superscalar Processors,” Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 291
  251. Superscalar CPU: ROB J.E. Smith and G.S. Sohi, ”The Microarchitecture

    of Superscalar Processors,” Proc. IEEE, vol. 83, pp. 1609-1624, Dec. 1995. 292
  252. Computer Architecture: A Science of Tradeoffs ”My tongue in cheek

    phrase to emphasize the importance of tradeoffs to the discipline of computer architecture. Clearly, computer architecture is more art than science. Science, we like to think, involves a coherent body of knowledge, even though we have yet to figure out all the connections. Art, on the other hand, is the result of individual expressions of the various artists. Since each computer architecture is the result of the individual(s) who specified it, there is no such completely coherent structure. So, I opined if computer architecture is a science at all, it is a science of tradeoffs. In class, we keep coming up with design choices that involve tradeoffs. In my view, ”tradeoffs” is at the heart of computer architecture.” — Yale N. Patt 293
  253. Design Points: Dictated by the Application Space The design of a

    microprocessor is about making relevant tradeoffs. We refer to the set of considerations, along with the relevant importance of each, as the “design point” for the microprocessor—that is, the characteristics that are most important to the use of the microprocessor, such that one is willing to be less concerned about other characteristics. In each case, it is usually the problem we are addressing . . . which dictates the design point for the microprocessor, and the resulting tradeoffs that must be made. Patt, Y., & Cockrell, E. (2001). ”Requirements, bottlenecks, and good fortune: Agents for microprocessor evolution.” Proceedings of the IEEE, 89(11), 1553-1559. 294
  254. A Science of Tradeoffs Software Performance Optimization - Analogous! The

    multiplicity of tradeoffs: • Multidimensional • Multiple levels • Costs and benefits 295
  255. Trade-offs - Latency & Bandwidth I

    Intel(R) Memory Latency Checker - v3.1a
    Measuring idle latencies (in ns)...
            Memory node
    Socket       0
         0    60.4
    Measuring Peak Memory Bandwidths for the system
    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using traffic with the following read-write ratios
    ALL Reads        : 24152.0
    3:1 Reads-Writes : 22313.2
    2:1 Reads-Writes : 22050.5
    1:1 Reads-Writes : 21130.4
    Stream-triad like: 21559.4
    296
  256. Trade-offs - Latency & Bandwidth II

    Measuring Memory Bandwidths between nodes within system
    Using Read-only traffic type
            Memory node
    Socket       0
         0   24155.0
    Measuring Loaded Latencies for the system
    Using Read-only traffic type
    Inject   Latency   Bandwidth
    Delay    (ns)      MB/sec
    ==========================
    00000    122.27    24109.6
    00002    121.99    24082.7
    00008    120.60    23952.1
    00015    119.28    23837.6
    00050     70.87    17408.7
    00100     64.59    12496.6
    297
  257. Trade-offs - Latency & Bandwidth III

    Inject   Latency   Bandwidth
    Delay    (ns)      MB/sec
    ==========================
    00200    61.76     8129.1
    00300    60.75     6194.8
    00400    60.63     5085.6
    00500    60.12     4377.0
    00700    60.51     3505.2
    01000    60.60     2812.6
    01300    60.66     2425.3
    01700    60.51     2117.0
    02500    60.36     1789.5
    03500    60.33     1585.4
    05000    60.29     1430.9
    09000    60.31     1267.9
    20000    60.32     1154.7
    298
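The numbers above come from Intel MLC. As a rough, hedged illustration of why dependent loads expose latency rather than bandwidth, a minimal pointer-chasing microbenchmark might look like the sketch below; the working-set size, step count, and the single-cycle (Sattolo) shuffle are arbitrary choices of mine, and the result is only an approximation of what MLC reports.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        // ~256 MiB of 8-byte indices: far larger than the last-level cache.
        constexpr std::size_t n = (256u << 20) / sizeof(std::size_t);
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's shuffle: a single cycle over all n elements, so the chase
        // visits the whole working set in a cache-hostile order.
        std::mt19937_64 rng{42};
        for (std::size_t i = n - 1; i > 0; --i)
            std::swap(next[i], next[rng() % i]);

        constexpr std::size_t steps = std::size_t{1} << 24;
        std::size_t idx = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < steps; ++i)
            idx = next[idx];                  // each load depends on the previous one
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        std::printf("~%.1f ns per dependent load (idx=%zu)\n", ns, idx);
        return 0;
    }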
  258. Trade-offs - Latency & Size I

    Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm.
    RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
    http://www.7-cpu.com/cpu/SandyBridge.html
    Size    Latency       Increase     Description
    32 K    4
    64 K    8             4            + 8 (L2)
    128 K   10            2
    256 K   11            1
    512 K   20            9            + 16 (L3)
    1 M     24            4
    2 M     26            2
    4 M     27 + 18 ns    1 + 18 ns    + 56 ns (RAM)
    8 M     28 + 38 ns    1 + 20 ns
    16 M    28 + 47 ns    9 ns
    32 M    28 + 52 ns    5 ns
    64 M    28 + 54 ns    2 ns
    128 M   36 + 55 ns    8 + 1 ns     + 16 (TLB miss)
    299
  259. Trade-offs - Latency & Size II

    Size      Latency       Increase    Description
    256 M     40 + 56 ns    4 + 1 ns
    512 M     42 + 56 ns    2
    1024 M    43 + 56 ns    1
    2048 M    44 + 56 ns    1
    4096 M    44 + 56 ns    0
    8192 M    53 + 56 ns    9           + 18 (PDPTE cache miss)
    Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm.
    RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
    http://www.7-cpu.com/cpu/SandyBridge.html
    300
  260. Trade-offs - Least Squares Golub & Van Loan (2013) ”Matrix

    Computations” Trade-offs: FLOPs (FLoating-point OPerations) vs. Applicability / Numerical Stability / Speed / Accuracy Example: Catalogue of dense decompositions: http://eigen.tuxfamily.org/dox/group__TopicLinearAlgebraDecompositions.html 301
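A brief sketch (assuming Eigen is available, as in the catalogue linked above) contrasting three least-squares solvers: normal equations via LDLT are cheapest in FLOPs but square the condition number, while column-pivoted QR and SVD cost more and are progressively more robust. Matrix sizes here are arbitrary.

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(200, 4);
        Eigen::VectorXd b = Eigen::VectorXd::Random(200);

        // Fastest, least robust: solve the normal equations A^T A x = A^T b.
        Eigen::VectorXd x_ne  = (A.transpose() * A).ldlt().solve(A.transpose() * b);
        // Middle ground: column-pivoted Householder QR.
        Eigen::VectorXd x_qr  = A.colPivHouseholderQr().solve(b);
        // Most robust (and most expensive): SVD.
        Eigen::VectorXd x_svd = A.jacobiSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(b);

        std::cout << (x_ne - x_svd).norm() << " " << (x_qr - x_svd).norm() << "\n";
        return 0;
    }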
  261. Trade-offs - Multidimensional - Numerical Optimization Ben Recht, Feng Niu,

    Christopher Ré, Stephen Wright. ”Lock-Free Approaches to Parallelizing Stochastic Gradient Descent” OPT 2011: 4th International Workshop on Optimization for Machine Learning http://opt.kyb.tuebingen.mpg.de/slides/opt2011-recht.pdf 302
  262. Trade-offs - Multiple levels - Numerical Optimization Gradient computation -

    accuracy vs. function evaluations f : R^d → R^N • Finite differencing: • forward-difference: O(√ϵ_M) error, d · O(Cost(f)) evaluations • central-difference: O(ϵ_M^(2/3)) error, 2d · O(Cost(f)) evaluations w/ the machine epsilon ϵ_M := inf{ϵ > 0 : 1.0 + ϵ ≠ 1.0} • Algorithmic differentiation (AD): precision - as in hand-coded analytical gradient • rough forward-mode cost d · O(Cost(f)) • rough reverse-mode cost N · O(Cost(f)) 303
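For the scalar case (N = 1), a hedged sketch of forward- vs. central-difference gradients using the usual step-size rules of thumb h ≈ √ϵ_M and h ≈ ϵ_M^(1/3); grad_forward and grad_central are illustrative names, not code from the talk.

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <limits>
    #include <vector>

    using Fn = std::function<double(const std::vector<double>&)>;

    // Forward difference: d extra evaluations of f, O(sqrt(eps_M)) accuracy.
    std::vector<double> grad_forward(const Fn& f, std::vector<double> x) {
        const double h = std::sqrt(std::numeric_limits<double>::epsilon());
        const double f0 = f(x);
        std::vector<double> g(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            const double xi = x[i];
            x[i] = xi + h;
            g[i] = (f(x) - f0) / h;
            x[i] = xi;
        }
        return g;
    }

    // Central difference: 2d evaluations of f, O(eps_M^(2/3)) accuracy.
    std::vector<double> grad_central(const Fn& f, std::vector<double> x) {
        const double h = std::cbrt(std::numeric_limits<double>::epsilon());
        std::vector<double> g(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            const double xi = x[i];
            x[i] = xi + h; const double fp = f(x);
            x[i] = xi - h; const double fm = f(x);
            g[i] = (fp - fm) / (2 * h);
            x[i] = xi;
        }
        return g;
    }

    int main() {
        Fn f = [](const std::vector<double>& x) { return x[0] * x[0] + std::sin(x[1]); };
        auto gf = grad_forward(f, {1.0, 0.5});
        auto gc = grad_central(f, {1.0, 0.5});
        std::printf("forward: %.8f %.8f\ncentral: %.8f %.8f\n", gf[0], gf[1], gc[0], gc[1]);
    }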
  263. Trade-offs: Costs and Benefits Gabriel, Richard P. (1985). ”Performance and

    Evaluation of Lisp Systems.” Cambridge, Mass: MIT Press; Computer Systems Series. 304
  264. Costs and Benefits: Implications • Important to know what to

    focus on • Optimize the optimization: so that it doesn’t always take hours or days or weeks or months... 305
  265. Superscalar CPU Model Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and

    James E. Smith. 2009. ”A mechanistic performance model for superscalar out-of-order processors.” ACM Trans. Comput. Syst. 27, 2, Article 3. 306
  266. NVMs as Storage Class Memories - Bottlenecks: New & Old

    Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, ”A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory” Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, 2013. 307
  267. DBs Execution Cycles: Useful Computation vs. Stall Cycles R. Panda,

    C. Erb, M. LeBeane, J. H. Ryoo and L. K. John, ”Performance Characterization of Modern Databases on Out-of-Order CPUs,” Computer Architecture and High Performance Computing (SBAC-PAD), 2015 27th International Symposium on, Florianopolis, 2015, pp. 114-121. 308
  268. System Calls - Performance Impact Livio Soares and Michael Stumm.

    2010. ”FlexSC: flexible system call scheduling with exception-less system calls.” In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI’10). USENIX Association, Berkeley, CA, USA, 33-46. 309
  269. System Calls, Interrupts, and Asynchronous I/O Jisoo Yang, Dave B.

    Minturn, and Frank Hady. 2012. ”When poll is better than interrupt.” In Proceedings of the 10th USENIX conference on File and Storage Technologies (FAST’12). USENIX Association, Berkeley, CA, USA. 310
  270. System Calls as CPU Exceptions Craig B. Zilles, Joel S.

    Emer, and Gurindar S. Sohi. 1999. ”The use of multithreading for exception handling.” In Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture (MICRO 32). IEEE Computer Society, Washington, DC, USA, 311
  271. Pollution & Context Switch Misses Replaced Miss (D) & Reordered

    Miss (C) F. Liu, F. Guo, Y. Solihin, S. Kim and A. Eker, ”Characterizing and modeling the behavior of context switch misses”, Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008. 312
  272. Beyond Mode Switch Time: Footprint & Pollution Livio Soares and

    Michael Stumm. 2010. ”FlexSC: flexible system call scheduling with exception-less system calls.” In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI’10). USENIX Association, Berkeley, CA, USA, 33-46. 313
  273. Beyond Mode Switch Time: Direct & Indirect Costs Livio Soares

    and Michael Stumm. 2010. ”FlexSC: flexible system call scheduling with exception-less system calls.” In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI’10). USENIX Association, Berkeley, CA, USA, 33-46. 314
  274. Partitioning-Sharing Tradeoffs Butler W. Lampson. 1983. ”Hints for computer system

    design.” In Proceedings of the ninth ACM symposium on Operating systems principles (SOSP ’83). ACM, New York, NY, USA, 33-48. 315
  275. Shared Resource: DRAM Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni.

    ”PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms,” IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), 2014. https://github.com/heechul/palloc 316
  276. Shared Resource: MSHRs Heechul Yun, Rodolfo Pellizzoni, and Prathap Kumar

    Valsan. 2015. ”Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems.” In Proceedings of the 2015 27th Euromicro Conference on Real-Time Systems (ECRTS ’15). 317
  277. Partitioning Multithreading • Thread affinity • POSIX: sched_getcpu, pthread_setaffinity_np •

    http://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/ • https://github.com/RRZE-HPC/likwid/blob/master/groups/skylake/FALSE_SHARE.txt • Local LLC false sharing rate = MEM_LOAD_L3_HIT_RETIRED_XSNP_HITM / MEM_INST_RETIRED_ALL • NUMA: Remote Memory Accesses (RMA), Local Memory Accesses (LMA), RMA/LMA ratio • https://01.org/numatop/ • https://github.com/01org/numatop 318
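A Linux/glibc-specific sketch of the affinity calls named above (pthread_setaffinity_np, sched_getcpu); compile with -pthread. Pinning to core 2 is an arbitrary choice for illustration.

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);   // allow this thread to run on CPU 2 only

        // Pin the calling thread; pinned threads make per-core performance
        // counters and cache behavior easier to reason about.
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0) {
            std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
            return 1;
        }
        std::printf("now running on CPU %d\n", sched_getcpu());
        return 0;
    }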
  278. Cache Partitioning: Index-Based & Way-Based Giovani Gracioli, Ahmed Alhammad, Renato

    Mancuso, Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. ”A Survey on Cache Management Mechanisms for Real-Time Embedded Systems.” ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 319
  279. Cache Partitioning: CPU Support Giovani Gracioli, Ahmed Alhammad, Renato Mancuso,

    Antônio Augusto Fröhlich, and Rodolfo Pellizzoni. 2015. ”A Survey on Cache Management Mechanisms for Real-Time Embedded Systems.” ACM Comput. Surv. 48, 2, Article 32 (November 2015), 36 pages. 320
  280. Cache Partitioning & Intel: CAT & CMT Cache Monitoring Technology

    and Cache Allocation Technology https://github.com/01org/intel-cmt-cat A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, “Cache QoS: From concept to reality in the Intel Xeon processor E5-2600 v3 product family,” in Intl. Symp. on High Performance Computer Architecture (HPCA), Mar. 2016. 321
  281. Cache Partitioning != Cache Access Timing Isolation H. Yun and

    P. Valsan, ”Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms”, International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT) 2015. 322
  284. Cache Partitioning != Cache Access Timing Isolation https://github.com/CSL-KU/IsolBench Prathap Kumar

    Valsan, Heechul Yun, Farzad Farshchi. ”Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems.” IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 325
  285. Cache Partitioning != Cache Access Timing Isolation • Shared: MSHRs

    (Miss information/Status Holding Registers) / LFBs (Line Fill Buffers) • Contention => cache space partitioning != cache access timing isolation Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. ”Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems.” IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 326
  286. Cache Partitioning != Cache Access Timing Isolation • multiple MSHRs

    support multiple outstanding cache-misses • the number of MSHRs determines the MLP of the cache • local MLP - outstanding misses one core can generate • global MLP - parallelism of the entire shared memory hierarchy (i.e., shared LLC and DRAM) • ”the aggregated parallelism of the cores (the sum of local MLP) exceeds the parallelism supported by the shared LLC and DRAM (global MLP) in the out-of-order architectures” Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi. ”Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems.” IEEE Intl. Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS), IEEE, 2016. 327
  287. Shared Resource (MSHRs) & Prefetching: Xeon Phi Zhenman Fang, Sanyam

    Mehta, Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. ”Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking.” ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 328
  288. Shared Resource (MSHRs) & Prefetching: SNB Zhenman Fang, Sanyam Mehta,

    Pen-Chung Yew, Antonia Zhai, James Greensky, Gautham Beeraka, and Binyu Zang. ”Measuring Microarchitectural Details of Multi- and Many-core Memory Systems Through Microbenchmarking.” ACM Transactions on Architecture and Code Optimization (TACO 2015). Volume 11, Issue 4, Article 55, January 2015. 329
  289. Weighted Speedup A. Snavely and D. M. Tullsen, “Symbiotic jobscheduling

    for a simultaneous multithreading processor,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Nov. 2000, pp. 234–244. S. Eyerman and L. Eeckhout, “Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance,” in Computer Architecture Letters, vol. 13, no. 2, 2014. 330
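As a brief summary of the metric in these papers (my wording, not slide text): for n programs co-running on an SMT/CMP machine, weighted speedup is typically computed as WS = sum over i of IPC_i(shared) / IPC_i(alone), i.e., each program's throughput under sharing normalized by its throughput when running alone. WS close to n indicates little interference; WS well below n indicates contention-induced slowdown.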