Slide 1

Slide 1 text

Research Trends in Software Design and Analysis Dr. Keichi Takahashi Software Design and Analysis Laboratory Nara Institute of Science and Technology

Slide 2

Slide 2 text

What you get from this lecture 2 This lecture will teach you what it takes to write FAST code on modern computing systems.

Slide 3

Slide 3 text

Logistics Overall ‣ Four lectures in total: 4th hour (15:10–16:40) on December 8, 10, 15, 17 ‣ Hands-on exercises included, so bring your laptops! Grading ‣ 100% mini-project report ‣ Mini-project will be assigned in the last lecture Ask questions! ‣ Feel free to interrupt during class and ask questions ‣ Or ask via email: [email protected] 3

Slide 4

Slide 4 text

Recommended readings 4 Randal Bryant and David O’Hallaron, “Computer Systems: A Programmer's Perspective”, Pearson John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach”, Morgan Kaufmann.

Slide 5

Slide 5 text

(My) Golden rules for optimization Understand your hardware and software ‣ Know what your hardware is capable of. ‣ Profile your software and understand its bottlenecks. Minimize data movement across the memory hierarchy ‣ Modern computers have very deep memory hierarchies. ‣ Design memory layout and access pattern to maximize data reuse. Utilize parallelism from all levels of the computer ‣ Modern computers have many levels of parallelism. ‣ Exploit all levels of parallelism in your computer(s). 5

Slide 6

Slide 6 text

Know your hardware and software

Slide 7

Slide 7 text

Know your hardware

Slide 8

Slide 8 text

FLOP/s (Floating-point Operations per Second) A measure of computing performance 8 ‣ especially appreciated in scientific computing. ‣ traditionally used for measuring the performance of supercomputers. When measuring/reporting FLOP/s ‣ Beware of the precision (double, single, half, etc.)! ‣ Take extra care with transcendental functions (sqrt, sin, exp, etc.) and divisions.

double x[N], y[N];
for (int i = 0; i < N; i++) {
  y[i] += a * x[i];
}

How many floating-point operations does this code perform?

Slide 9

Slide 9 text

Examples 9 AMD EPYC 7742 DP: 2.3 TFLOPS SP: 4.6 TFLOPS NVIDIA V100 DP: 8.2 TFLOPS SP: 16.4 TFLOPS Intel Xeon Platinum 8284 DP: 1.5 TFLOPS SP: 3.0 TFLOPS Intel i7-8559U DP: 170 GFLOPS SP: 340 GFLOPS Intel i9-9900K DP: 589 GFLOPS SP: 1.2 TFLOPS

Slide 10

Slide 10 text

Moore’s law Moore’s law predicts that: ‣ The number of transistors in a chip doubles every two years. ‣ Notice that he did not predict that performance doubles. Moore’s law is ending soon ‣ Most researchers predict that Moore’s law will end within the next 10 years. ‣ Exotic devices and novel architectures are being investigated for Post-Moore computing (but will not be covered in this lecture). 10 Gordon Moore

Slide 11

Slide 11 text

Dennard scaling Dennard observed the following scaling at every technology generation: 11
‣ Transistor dimension: 1/k
‣ Voltage: 1/k
‣ Capacitance: 1/k
‣ Frequency: k
‣ Power consumption: 1/k^2
‣ Power density: 1
where k = √2, i.e. 1/k ≈ 0.7.
R. H. Dennard et al., "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256-268, Oct. 1974.

Slide 12

Slide 12 text

End of Dennard scaling Dennard ignored the leakage current and threshold voltage ‣ Leakage current: Transistors leak a small current when they are turned off. This current increases as the insulation thickness decreases. ‣ Threshold voltage: The voltage at which the transistor turns on. The power density started increasing ‣ Can’t increase the frequency anymore! ‣ How are we going to use the extra transistors we can put on our chip? 12

Slide 13

Slide 13 text

Single-thread performance has plateaued 13

Slide 14

Slide 14 text

The free lunch is over [1] Until ~2005, Dennard scaling was doing well ‣ Frequency and single-thread performance were steadily increasing. ‣ Just wait for new hardware to make your program run faster (free lunch). Around 2005 (Pentium 4), Dennard scaling failed ‣ Frequency and single-thread performance have plateaued. ‣ Multi-thread performance is still increasing. ‣ Software needs to be written to exploit the parallelism (no free lunch). 14 [1] http://www.gotw.ca/publications/concurrency-ddj.htm

Slide 15

Slide 15 text

Two oxen vs 1,024 chickens 15 If you were plowing a field, which would you rather use: two strong oxen or 1,024 chickens? Seymour Cray

Slide 16

Slide 16 text

Two oxen vs 1,024 chickens 16 We have no choice but to work with 1,024 chickens!

Slide 17

Slide 17 text

Need to exploit multiple levels of parallelism Instruction-level parallelism ‣ A processor executes multiple independent instructions in parallel Data-level parallelism ‣ A processor executes the same instruction on multiple data (vector) Thread-level parallelism ‣ A multiprocessor comprises multiple processors that execute multiple programs in parallel Cluster-level parallelism ‣ A cluster comprises multiple computers that execute multiple programs in parallel 17

Slide 18

Slide 18 text

Example: AMD EPYC Rome 7742 18 One CPU package has 64 cores, and each core has 2 FMA (Fused Multiply-Add) pipelines. Each pipeline operates on 8 single-precision elements per cycle, and an FMA counts as 2 FLOPs, so the EPYC 7742 can perform 64*2*8*2 = 2,048 FLOPs in each cycle! In other words, you are only utilizing 0.05% of the total performance if you don’t use any of the available parallelism. See: https://en.wikichip.org/wiki/amd/epyc/7742 https://en.wikichip.org/wiki/amd/microarchitectures/zen_2

Slide 19

Slide 19 text

Memory performance is lagging behind Memory performance could not keep up with CPU performance ‣ In the old days, memory ran faster than the CPU! ‣ Nowadays, the CPU needs to wait (stall) for data to arrive from memory. 19 John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach”, Morgan Kaufmann. N.B. the plot shows single-core performance.

Slide 20

Slide 20 text

Memory hierarchy Users want an infinite amount of infinitely fast memory ‣ Fast memory (e.g. SRAM) is expensive ‣ Large memory (e.g. DRAM) is slow Form a hierarchy of memories ‣ Higher levels are faster but small ‣ Lower levels are slower but large ‣ Higher levels keep a frequently accessed subset of the lower levels (caching) 20 [Pyramid: registers, L1, L2, L3, DRAM — faster toward the top, larger toward the bottom: a tradeoff!]

Slide 21

Slide 21 text

Typical memory hierarchy in modern CPUs 21 [Diagram: each core has private L1 and L2 caches, cores on a die share an L3 cache, and each die connects to DRAM]

Slide 22

Slide 22 text

Latency numbers every programmer should know 22
What / Latency / Human scale
L1 cache reference / 0.5 ns / 0.5 s
Branch mispredict / 5 ns / 5 s
L2 cache reference / 7 ns / 7 s
Mutex lock/unlock / 25 ns / 25 s
Main memory reference / 100 ns / 1 m 40 s
Compress 1 KB with Zippy / 3,000 ns / 50 m
Send 2 KB over 1 Gbps network / 20,000 ns / 5 h 33 m
Read 1 MB sequentially from memory / 250,000 ns / 2 d 21 h 26 m
Read 1 MB sequentially from SSD / 1,000,000 ns / 11 d 13 h 46 m
The Jeff Dean Facts http://www.informatika.bg/jeffdean
Jeff Dean legends (in Japanese) https://qiita.com/umegaya/items/ef69461d6f4967d5c623

Slide 23

Slide 23 text

AMD EPYC 7742 memory latency* and bandwidth† 23
Latency ‣ Clock cycle is ~0.4 ns (2.25 GHz). ‣ Missing one cache level increases the latency by a factor of 1.4-5x. ‣ If you miss all cache levels, the core stalls for ~200 cycles!
Bandwidth ‣ Becomes scarce at lower levels. ‣ Note that the bandwidth is shared among all cores — DRAM B/W is only 2.4 GB/s per core!
L1 Cache: 1.7-3.5 ns, 6.6 TB/s
L2 Cache: 5 ns, 4.0 TB/s
L3 Cache: 17 ns, 1.6 TB/s
DRAM: 83 ns, 153 GB/s
*https://www.7-cpu.com/cpu/Zen2.html † Measured on actual hardware using the Empirical Roofline Tool (ERT)

Slide 24

Slide 24 text

Deep memory hierarchy Utilizing the cache is crucial for achieving “good” performance ‣ If the core misses a cache level, it needs to wait for the data to arrive from lower levels. ‣ Your code needs to access memory in a way that maximizes cache hits (i.e. is cache-friendly) How to write cache-friendly code? ‣ Understand how the cache works (cache capacity, cache line size, false sharing, bank conflicts, etc.) ‣ Exploit spatial and temporal locality 24

Slide 25

Slide 25 text

Why software optimization matters Individual CPU cores are not becoming faster anymore ‣ Just buying new hardware does not make your software run faster ‣ You need to parallelize your software if you want to utilize your hardware Memory hierarchy is deep ‣ Memory performance is not keeping up with compute performance ‣ You cannot get viable performance without effectively utilizing the cache 25 Your code does not run any faster on your expensive new hardware without optimization.

Slide 26

Slide 26 text

Why can’t we build a 10,000-core CPU? Well, we can. Meet Cerebras CS-1. 26
Metric / Cerebras CS-1 / NVIDIA V100 / CS-1:V100
Die size / 46,225 mm2 / 815 mm2 / 57x
# of cores / 400,000 / 5,120 / 78x
On-chip memory (SRAM) / 18 GB / 6 MB / 3,000x
Memory bandwidth / 9 PB/s / 900 GB/s / 10,000x
Power consumption / 20 kW / 300 W / 67x
https://www.cerebras.net/product/
Kamil Rocki et al., “Fast Stencil-Code Computation on a Wafer-Scale Processor”, SC20.

Slide 27

Slide 27 text

Why can’t we build a 10,000-core CPU? 27 … if you can supply 20kW of power + cooling

Slide 28

Slide 28 text

Know the bottleneck of your software

Slide 29

Slide 29 text

Ask yourself if you really need to optimize 29 “The first rule of optimization is: Don’t do it. The second rule of optimization (for experts only) is: Don’t do it yet.“ Michael A. Jackson

Slide 30

Slide 30 text

Ask yourself if you really need to optimize 30 “Premature optimization is the root of all evil.” Tony Hoare (or Donald Knuth)

Slide 31

Slide 31 text

Performance comes with a cost Optimizing a program takes time: ‣ If you are a researcher, focus on producing results and papers! ‣ If you are working at a company, optimization costs money. Buying an expensive server is probably cheaper than hiring a CS PhD. Optimizing a program reduces readability: ‣ Most optimization techniques reduce the readability of the source code. ‣ Low readability implies even more maintenance cost! ‣ Language and library abstractions that achieve both performance and readability are being studied, but they are not quite there yet… 31

Slide 32

Slide 32 text

Consider these alternatives 32 NVIDIA DGX Station A100 $200,000 (one-off payment) Performance gain: predictable Computer Science Ph.D. $100,000 (yearly payment) Performance gain: unknown

Slide 33

Slide 33 text

Avoid reinventing the wheel There are plenty of high-quality, highly optimized and well-maintained libraries out there! ‣ Basic linear algebra: BLAS and LAPACK implementations such as MKL, AOCL, ScaLAPACK, cuBLAS ‣ Solvers: PETSc, Trilinos, SuperLU, cuSOLVER ‣ FFT: FFTW, cuFFT ‣ File I/O: HDF5, ADIOS2, NetCDF ‣ General purpose: Eigen, Thrust, ArrayFire 33 If none of the existing libraries fits your needs, then consider writing your own code.
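As a hedged illustration (not from the slides), calling into an optimized BLAS is usually both faster and simpler than a hand-rolled triple loop. The sketch below assumes a CBLAS implementation such as OpenBLAS is installed and linked (e.g. g++ -O3 main.cpp -lopenblas):

#include <cblas.h>

// C = A * B for square N x N row-major matrices, using vendor-optimized dgemm.
void matmul_blas(int N, const double *A, const double *B, double *C) {
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              N, N, N,
              1.0, A, N,
              B, N,
              0.0, C, N);
}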

Slide 34

Slide 34 text

Focus on the bottleneck (hot-spot) The Pareto principle (80/20 rule) applies to optimization too 34 ‣ The majority of the runtime is consumed by a very small fraction of the code ‣ Don’t waste your time optimizing functions that don’t matter! ‣ If many functions contribute equally to the runtime, consider giving up [Bar charts: three example distributions of total runtime across functions A-D and the rest]

Slide 35

Slide 35 text

Always measure your code Don’t ever optimize a code without measurement ‣ Measure before attempting any optimization to identify the bottleneck Measure, optimize, and then measure again ‣ Measure after optimization since optimization techniques can often make code slower 35 Optimize Measure Measure Find the bottleneck Evaluate the effect of the optimization

Slide 36

Slide 36 text

Profilers 36 Open-source: ‣ perf: Linux’s builtin performance analysis tool. Recommended! ‣ gprof: Pretty old and outdated GCC profiler. Not recommended. Vendor-specific: ‣ Intel VTune and Advisor: Highly recommended on Intel CPUs. ‣ AMD uProf: Compared to VTune, not so great. ‣ NVIDIA Nsight and nvprof: No other options if working on NVIDIA GPUs. Language-specific: ‣ Python: cProfile and line_profiler ‣ Ruby: stackprof and ruby-prof

Slide 37

Slide 37 text

Intel VTune (Hot-spot Analysis) 37

Slide 38

Slide 38 text

Intel VTune (Hot-spot Analysis) 38

Slide 39

Slide 39 text

Intel VTune (Hot-spot Analysis) 39

Slide 40

Slide 40 text

Intel VTune (Microarchitecture Exploration) 40

Slide 41

Slide 41 text

Linux perf Linux’s builtin performance analysis tool since kernel 2.6 ‣ Can profile the kernel space: useful when analyzing applications that spend significant time in the kernel space (e.g. I/O-intensive applications). ‣ Gives access to hardware performance counters: reveals hardware events such as cache misses, context switches, branch misses, etc. ‣ System-wide profiling: can profile all processes on the system. Useful when analyzing multiple processes interacting with one another. ‣ The kernel config (CONFIG_PERF_EVENTS) is enabled in most Linux distributions. You just need to install the user-space utility (linux-tools-common on Ubuntu). 41 Official wiki: https://perf.wiki.kernel.org/index.php/Main_Page Brendan Gregg’s guide: http://www.brendangregg.com/perf.html

Slide 42

Slide 42 text

perf record & report 42 Sample hardware events (e.g. cache hits and misses) during the execution of a program, then report.

perf record -e cycles ./stress-ng -c 10 -t 3
  -e cycles: list of HW events to sample
  ./stress-ng -c 10 -t 3: program to measure

perf report

Slide 43

Slide 43 text

perf stat 43 [keichi@keichi-ws stress-ng]$ perf stat -e cache-misses,cache-references ./stress-ng -c 10 -t 3 stress-ng: info: [14643] dispatching hogs: 10 cpu stress-ng: info: [14643] successful run completed in 3.04s Performance counter stats for './stress-ng -c 10 -t 3': 10,805,671 cache-misses # 0.640 % of all cache refs 1,687,740,535 cache-references 3.047260141 seconds time elapsed 30.328126000 seconds user 0.007996000 seconds sys Count number of hardware events (e.g. cache hits and misses) during execution of a program. perf stat -e cache-misses,cache-references ./stress-ng -c 10 -t 3 List of HW events to count Program to measure

Slide 44

Slide 44 text

perf list 44 Obtain a list of hardware events that are available on your platform.

Slide 45

Slide 45 text

perf top 45 System-wide profiling tool (think of it as the top command on steroids). Shown here while running stress-ng -c 10 and stress-ng -S 10.

Slide 46

Slide 46 text

Strace: System call tracer 46

Slide 47

Slide 47 text

Strace: System call tracer 47

Slide 48

Slide 48 text

Timers 48 Insert timers before and after the code you want to measure ‣ It’s simple and it always works — just like printf() debugging ‣ If you already know which part of the code is the bottleneck, very effective ‣ Beware of the pitfalls!

// Start timer
double start = read_timer();

// Code to measure
for (…) {
  for (…) {
    …
  }
}

// Stop timer
double end = read_timer();

// Calculate diff
double runtime = end - start;

Slide 49

Slide 49 text

Use high-precision timers Use these: ‣ clock_gettime(CLOCK_MONOTONIC, …) ‣ clock_gettime(CLOCK_PROCESS_CPUTIME_ID, …) ‣ std::chrono::steady_clock (C++11) Don’t use these: ‣ time(): only returns UNIX epoch seconds ‣ gettimeofday(): deprecated in favor of clock_gettime() ‣ clock_gettime(CLOCK_REALTIME, …): can rewind if the system time is changed (e.g. by NTP) 49
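A minimal sketch of the timer pattern above using std::chrono::steady_clock (the read_timer() helper on the previous slide is a placeholder; this is one possible way to realize it):

#include <chrono>
#include <cstdio>

int main() {
  auto start = std::chrono::steady_clock::now();

  // Code to measure (a dummy loop here)
  volatile double sum = 0.0;
  for (int i = 0; i < 100000000; i++) sum += i * 0.5;

  auto end = std::chrono::steady_clock::now();
  // duration<double> converts the tick count to seconds
  double runtime = std::chrono::duration<double>(end - start).count();
  printf("Runtime: %f s\n", runtime);
  return 0;
}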

Slide 50

Slide 50 text

Mind these pitfalls Pin your processes/threads to CPU cores ‣ When using clock_gettime(CLOCK_PROCESS_CPUTIME_ID, …) or the RDTSC instruction Measure multiple times and perform warmup runs ‣ To see the variance of the runtime ‣ It is common that the first few iterations are slower (why?) … and even more pitfalls ‣ System calls and function calls have overheads ‣ Out-of-order execution and the compiler can reorder instructions ‣ Need barriers if measuring multi-threaded or multi-process code 50
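One way to satisfy the first pitfall on Linux (a sketch, assuming a glibc system; the lecture does not prescribe a specific method) is sched_setaffinity; alternatively, launch the whole process with taskset -c 0 ./matvec.

// Sketch: pin the calling thread to a given CPU core on Linux (glibc).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

int pin_to_core(int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  // pid 0 means "the calling thread"
  return sched_setaffinity(0, sizeof(set), &set);
}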

Slide 51

Slide 51 text

Memory optimization

Slide 52

Slide 52 text

How to tame the deep memory hierarchy 52 Maximize the spatial and temporal locality of your memory accesses to improve cache efficiency. …but how?

Slide 53

Slide 53 text

Locality 53 The cache assumes that your software exhibits spatial and temporal locality. Spatial locality: items whose addresses are near one another tend to be referenced close together in time. Temporal locality: recently accessed items are likely to be accessed again in the near future.
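A toy illustration of the two kinds of locality (my own sketch, not from the slides; names and sizes are arbitrary):

#define N 1024

// Spatial locality: a[0], a[1], ... are contiguous, so one 64-byte
// cache line (8 doubles) serves 8 consecutive accesses.
double sum_array(const double *a) {
  double sum = 0.0;
  for (int i = 0; i < N; i++) sum += a[i];
  return sum;
}

// Temporal locality: every element of x is re-read in each outer
// iteration, so a small x stays resident in the cache.
void matvec(const double A[N][N], const double *x, double *y) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      y[i] += A[i][j] * x[j];
}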

Slide 54

Slide 54 text

Hands-on: Matrix-vector multiplication 54
Given an N × N matrix A and an N × 1 vector x, write fast C code that multiplies A and x and stores the result in an N × 1 vector y.

double A[N][N];
double x[N], y[N];
// Implement here

Example:
[1 2 3]   [1]   [1*1 + 2*1 + 3*2]   [ 9]
[4 5 6] * [1] = [4*1 + 5*1 + 6*2] = [21]
[7 8 9]   [2]   [7*1 + 8*1 + 9*2]   [33]

Slide 55

Slide 55 text

Setting up a C++ compiler Linux A. Install gcc or clang using your package manager. macOS A. Install the Xcode command line tools and use its bundled clang. B. Install gcc via Homebrew. Windows A. Install the Windows Subsystem for Linux (WSL). B. Install MinGW or Visual Studio. Online compiler (if none of the above works) A. Wandbox: https://wandbox.org/ 55

Slide 56

Slide 56 text

Compiling and running the example 56 Download the example ‣ Gist URL: https://git.io/JI2DF
Compile (-O3 starts with the letter "O", not zero; -fno-tree-vectorize is not needed if you’re using clang):
g++ -O3 -fno-tree-vectorize -Wall -std=c++11 -o matvec main.cpp
Run:
./matvec
Now implement matrix-vector multiplication in run()

Slide 57

Slide 57 text

Which one is faster? 57

A.
for (int i = 0; i < N; i++) {
  y[i] = 0.0;
  for (int j = 0; j < N; j++) {
    y[i] += A[i][j] * x[j];
  }
}

B.
for (int i = 0; i < N; i++) {
  y[i] = 0.0;
}
for (int j = 0; j < N; j++) {
  for (int i = 0; i < N; i++) {
    y[i] += A[i][j] * x[j];
  }
}

Slide 58

Slide 58 text

Comparison of runtime (N=10,000) 58 [Bar chart: version A outperforms version B by 3.9x on one CPU and 6.6x on the other (Core i9-10900K with g++ 9.3.0 and Core i7-8559U with g++ 10.2.0)]

Slide 59

Slide 59 text

Memory access pattern 59

A. Row-major memory access:
for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    y[i] += A[i][j] * x[j];
  }
}

B. Column-major memory access:
for (int j = 0; j < N; j++) {
  for (int i = 0; i < N; i++) {
    y[i] += A[i][j] * x[j];
  }
}

Slide 60

Slide 60 text

Memory layout: row-major order 60 Elements within the same row are contiguous in memory (C, C++, etc.)
Matrix:
[ 1  2  3]
[ 4  5  6]
[ 7  8  9]
[10 11 12]
Memory (address: value): 0x0000: 1, 0x0008: 2, 0x0010: 3, 0x0018: 4, 0x0020: 5, 0x0028: 6, 0x0030: 7, 0x0038: 8, 0x0040: 9, 0x0048: 10, 0x0050: 11, 0x0058: 12

Slide 61

Slide 61 text

Memory layout: column-major order 61 Elements within the same column are contiguous in memory (Fortran, Julia, MATLAB, etc.)
Matrix:
[ 1  2  3]
[ 4  5  6]
[ 7  8  9]
[10 11 12]
Memory (address: value): 0x0000: 1, 0x0008: 4, 0x0010: 7, 0x0018: 10, 0x0020: 2, 0x0028: 5, 0x0030: 8, 0x0038: 11, 0x0040: 3, 0x0048: 6, 0x0050: 9, 0x0058: 12
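To make the two layouts concrete, here is a small sketch (not from the slides) of how the address of element (i, j) is computed under each convention; C's built-in 2D arrays are row-major:

#include <assert.h>

#define N 4 // hypothetical matrix dimension

int main(void) {
  double A[N][N]; // row-major in C/C++

  // Row-major: element (i, j) lives at offset i*N + j doubles from the start.
  assert(&A[1][2] == &A[0][0] + 1 * N + 2);

  // Column-major (Fortran, Julia, MATLAB): element (i, j) would instead
  // live at offset j*N + i doubles from the start.
  return 0;
}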

Slide 62

Slide 62 text

A exhibits better spatial locality than B 62 A. Contiguous memory access: consecutive iterations touch adjacent addresses (spatial locality is high). B. Strided memory access: consecutive iterations touch addresses one full row apart (spatial locality is low).

Slide 63

Slide 63 text

Why contiguous memory access is faster 63 A cache line is the smallest unit of data movement between cache levels ‣ A cache line is 64 B (i.e. 8 doubles or 16 floats) on recent Intel/AMD CPUs. ‣ Even when reading a single byte, the CPU reads the whole cache line that includes that byte from memory and caches it in the L1-L3 caches. ‣ Use all of the data within a cache line if possible. [Diagram: DRAM addresses grouped into 64-byte cache lines that move between DRAM, L2, and L1]

Slide 64

Slide 64 text

Why contiguous memory access is faster (cont’d) The CPU prefetches cache lines 64 ‣ The CPU predicts future memory accesses and moves cache lines in advance. ‣ How “smart” the prefetcher is depends on the CPU, but it generally fails to predict complex/irregular access patterns. ‣ Prefetching can be explicitly triggered from software (software prefetching), but then the programmer needs to account for the latency of instructions and memory accesses.
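Software prefetching is typically expressed through compiler builtins. A minimal sketch (assuming GCC or Clang; the prefetch distance of 16 elements is a made-up tuning parameter, not a recommendation):

double sum_with_prefetch(const double *a, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i++) {
    // Ask the CPU to start fetching a[i + 16] for reading (0) with high
    // temporal locality (3) while we work on a[i].
    if (i + 16 < n)
      __builtin_prefetch(&a[i + 16], 0, 3);
    sum += a[i];
  }
  return sum;
}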

Slide 65

Slide 65 text

Measuring cache hit ratio using perf 65 A. Row-major memory access: 1 out of 8 loads from A misses the cache (without prefetching). B. Column-major memory access: 8 out of 8 loads from A miss the cache (without prefetching).

Slide 66

Slide 66 text

Measuring cache hit ratio using perf 66 A. Row-major memory access B. Column-major memory access

Slide 67

Slide 67 text

Hands-on: Matrix-matrix multiplication 67
Given two N × N matrices A and B, write C code that multiplies A and B and stores the result in an N × N matrix C.

double A[N][N], B[N][N], C[N][N];
// Implement here

Example:
[1 2]   [5 6]   [(1*5 + 2*7) (1*6 + 2*8)]   [19 22]
[3 4] * [7 8] = [(3*5 + 4*7) (3*6 + 4*8)] = [43 50]

Slide 68

Slide 68 text

Basic algorithm and naming of loop indices 68

for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    C[i][j] = 0.0;
    for (int k = 0; k < N; k++) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}

C[i][j] accumulates the inner product of row i of A and column j of B.

Slide 69

Slide 69 text

Analysis of inner-most loop for version ijk (jik) 69

for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    double sum = 0.0;
    for (int k = 0; k < N; k++) {
      sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
  }
}

Read a row from A and a column from B. Calculate their inner product and write to C.
‣ 2 loads (A and B), 0 stores
‣ A misses: 0.125 (=1/8), B misses: 1 (=8/8), C misses: 0
‣ Total misses: 1.125 per inner-most loop iteration

Slide 70

Slide 70 text

Analysis of inner-most loop for version jki (kji) 70

for (int j = 0; j < N; j++) {
  for (int k = 0; k < N; k++) {
    double r = B[k][j];
    for (int i = 0; i < N; i++) {
      C[i][j] += A[i][k] * r;
    }
  }
}

Read an element from B and multiply it with a column of A. Accumulate the result into a column of C.
‣ 2 loads (A and C), 1 store (C)
‣ A misses: 1 (=8/8), B misses: 0, C misses: 1 (=8/8)
‣ Total misses: 2 per inner-most loop iteration

Slide 71

Slide 71 text

Analysis of inner-most loop for version kij (ikj) 71

for (int k = 0; k < N; k++) {
  for (int i = 0; i < N; i++) {
    double r = A[i][k];
    for (int j = 0; j < N; j++) {
      C[i][j] += r * B[k][j];
    }
  }
}

Read an element from A and multiply it with a row of B. Accumulate the result into a row of C.
‣ 2 loads (B and C), 1 store (C)
‣ A misses: 0, B misses: 0.125 (=1/8), C misses: 0.125 (=1/8)
‣ Total misses: 0.25 per inner-most loop iteration

Slide 72

Slide 72 text

Benchmark result 72 Reference: Randal Bryant and David O’Hallaron, “Computer Systems: A Programmer's Perspective”, Pearson. Simply interchanging loops makes a 40x difference!

Slide 73

Slide 73 text

Gravitational N-body problem 73

d²x_i/dt² = Σ_{j=1..N} G m_j (x_j − x_i) / |x_j − x_i|³

for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    a[i] += calc(x[i], x[j]);
  }
}

[Diagram: particles x1…x5; the term G m_2 (x_2 − x_1)/|x_2 − x_1|³ is the acceleration of x1 due to x2]

Slide 74

Slide 74 text

Analysis of spatial and temporal locality 74 We look into the memory accesses in the inner-most loop: ‣ x[i] and x[j] are read contiguously: high spatial locality ‣ a[i] is read in every inner-most loop iteration: high temporal locality ‣ each x[j] is read only once every N inner-most loop iterations: low temporal locality Cache capacity is limited (L1 < 100 KB, L2 < 1 MB) ‣ The cache “forgets” (evicts) an address that is referenced rarely (LRU policy) ‣ How can we increase the temporal locality of x[j]?

for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    a[i] += calc(x[i], x[j]);
  }
}

Slide 75

Slide 75 text

Increasing temporal locality using cache blocking 75

Original:
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    a[i] += calc(x[i], x[j]);
Each x[j] is re-accessed only after N inner-most loop iterations.

After applying cache blocking:
for (int j = 0; j < N; j += BLOCK)
  for (int i = 0; i < N; i++)
    for (int jj = 0; jj < BLOCK; jj++)
      a[i] += calc(x[i], x[j + jj]);
Each x[j + jj] is re-accessed after only BLOCK inner-most loop iterations (better temporal locality!).

[Diagrams of the access order for N=8 and for N=8, BLOCK=4]
https://software.intel.com/content/www/us/en/develop/articles/cache-blocking-techniques.html

Slide 76

Slide 76 text

The memory mountain 76 Reference: Randal Bryant and David O’Hallaron, “Computer Systems: A Programmer's Perspective”, Pearson

Slide 77

Slide 77 text

Conclusion 77 Memory optimization matters. ‣ Choosing the right memory layout and access pattern can improve performance by orders of magnitude. ‣ Memory optimization is more relevant than ever because of the deep memory hierarchies in modern computer architectures. Time complexity analysis ignores the constant factor. ‣ An O(n²) implementation can be faster than an O(n) implementation. ‣ Two implementations with the same time complexity can have a significant performance difference (e.g. iterating over an array vs. iterating over a linked list are both O(n). Which one is faster?)
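To make the closing example concrete, a small sketch (mine, not from the slides) of the two O(n) traversals; the array walk streams through contiguous cache lines while the list walk chases pointers to scattered heap allocations:

#include <list>
#include <numeric>
#include <vector>

// Contiguous memory: cache-line and prefetcher friendly.
double sum_vector(const std::vector<double> &v) {
  return std::accumulate(v.begin(), v.end(), 0.0);
}

// Same O(n) complexity, but each node is a separate allocation,
// so traversal misses the cache far more often.
double sum_list(const std::list<double> &l) {
  return std::accumulate(l.begin(), l.end(), 0.0);
}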

Slide 78

Slide 78 text

Parallelization

Slide 79

Slide 79 text

Recommended reading 79 Victor Eijkhout, "Introduction to High Performance Scientific Computing", Lulu.com, 2012. Victor Eijkhout, "Parallel Programming in MPI and OpenMP", 2017. Both books are available online for free!

Slide 80

Slide 80 text

Parallel programming meme 80

Slide 81

Slide 81 text

Parallel programming meme 81

Slide 82

Slide 82 text

Basic strategy for parallelization Step 1. Find the bottleneck of your program ‣ Use the profiling tools we discussed earlier in this class. Step 2. Find the right decomposition of your work ‣ Decomposed tasks need to be independent of one another (except for special cases, e.g. reductions). ‣ If multiple decompositions are possible, choose the most coarse-grained decomposition to reduce the parallelization overhead. Step 3. Implement parallelism ‣ Use the programming model (OpenMP, MPI, CUDA, etc.) that suits your hardware. 82

Slide 83

Slide 83 text

Predicting the speedup of a parallel program 83
Suppose a program takes T_s to run serially and a fraction p (0 ≤ p ≤ 1) of the program can be parallelized: p·T_s can be parallelized and (1 − p)·T_s cannot. The runtime of the same program running in parallel on n processors is:

T_p = (p/n)·T_s + (1 − p)·T_s

[Diagram: a serial bar split into p·T_s and (1 − p)·T_s; in the parallel case the p·T_s part shrinks to (p/n)·T_s]

Slide 84

Slide 84 text

Amdahl’s Law 84
The parallel speedup is the ratio between T_s and T_p:

T_s / T_p = T_s / ((p/n)·T_s + (1 − p)·T_s) = 1 / (p/n + (1 − p))

Let's find the limit of the speedup as n approaches infinity:

lim_{n→∞} T_s / T_p = lim_{n→∞} 1 / (p/n + (1 − p)) = 1 / (1 − p)

No matter how many processors you have, this is the maximum attainable speedup!
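A quick worked example (the numbers are illustrative, not from the slides): with p = 0.9 and n = 16,

T_s / T_p = 1 / (0.9/16 + 0.1) ≈ 6.4

and even with infinitely many processors the speedup cannot exceed 1 / (1 − 0.9) = 10.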

Slide 85

Slide 85 text

Example 85

Slide 86

Slide 86 text

Need to exploit multiple levels of parallelism Instruction-level parallelism ‣ A processor executes multiple independent instructions in parallel Data-level parallelism ‣ A processor executes the same instruction on multiple data (vector) Thread-level parallelism ‣ A multiprocessor comprises multiple processors that execute multiple programs in parallel Cluster-level parallelism ‣ A cluster comprises multiple computers that execute multiple programs in parallel 86

Slide 87

Slide 87 text

Shared memory parallelism Processors share the same memory ‣ e.g. cores within a package or a computer Pros ‣ Easier programming thanks to the unified address space. ‣ Sharing data between processors is fast. Cons ‣ Scalability is limited due to access contention and false sharing. ‣ Prone to nasty bugs such as data races. 87 [Diagram: multiple processors attached to a single shared memory]

Slide 88

Slide 88 text

Distributed memory parallelism Each processor has its own memory ‣ e.g. a cluster of computers interconnected via a network Pros ‣ No access contention since each processor has its own dedicated memory. ‣ Scalable to millions of processors. Cons ‣ Need to explicitly describe communication between processors. ‣ Exchanging data between processors is slower than in shared memory systems. 88 [Diagram: processors, each with its own memory, connected by a network]
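For reference, a minimal distributed-memory sketch (not part of the original slides) using MPI; each rank has its own address space, and any data exchange would use explicit calls such as MPI_Send/MPI_Recv. Compile with mpicc/mpic++ and launch with mpirun.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); // this process's ID
  MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of processes

  printf("Hello from rank %d of %d\n", rank, size);

  MPI_Finalize();
  return 0;
}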

Slide 89

Slide 89 text

Hybrid approaches Shared memory programming on distributed memory systems (i.e. distributed shared memory) ‣ Middleware hides communication between processors. ‣ e.g. PGAS (Partitioned Global Address Space) programming models Distributed memory programming on shared memory systems ‣ Partition the shared memory into distributed memories. ‣ e.g. MPI on a single node Hybrid parallelism ‣ Combine shared and distributed memory programming. ‣ e.g. MPI+OpenMP: OpenMP to parallelize within a single node and MPI to parallelize across multiple nodes 89

Slide 90

Slide 90 text

Shared memory parallel programming POSIX threads (a.k.a. pthreads) ‣ Library based (only need to link a library) ‣ Low-level and general-purpose API for creating multi-threaded programs ‣ Part of POSIX -- works on Linux, macOS, FreeBSD, etc. OpenMP ‣ Compiler directive based (compiler needs to support OpenMP) ‣ Higher level abstraction than pthreads -- easier to program than pthreads. ‣ De-facto standard in scientific computing (works on most supercomputers) 90

Slide 91

Slide 91 text

pthreads 91

#include <pthread.h>
#include <stdio.h>

void *thread_func(void *arg) {
  printf("I'm a forked thread!\n");
  return NULL;
}

int main(void) {
  pthread_t thread;
  pthread_create(&thread, NULL, thread_func, NULL);
  printf("I'm the main thread!\n");
  pthread_join(thread, NULL);
  return 0;
}

[Diagram: the main thread calls pthread_create to fork a thread; both threads call printf; the main thread calls pthread_join to wait for the forked thread]

Slide 92

Slide 92 text

OpenMP A compiler directive-based shared memory programming model 92 ‣ Compiler directives (#pragma omp ...) are used to tell the compiler to perform parallelization. ‣ The same source code works as the serial version when the directives are treated as comments (in most cases). ‣ Can incrementally increase the parallel portions during development. Recent versions introduced: ‣ Offloading to co-processors (GPUs) ‣ Exploiting SIMD parallelism

double a[N], b[N], c[N];
#pragma omp parallel for
for (int i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
}

Slide 93

Slide 93 text

OpenMP: Parallel region Most basic OpenMP construct for parallelization 93 ‣ Precede a block with #pragma omp parallel to create a parallel region. ‣ A parallel region runs in parallel on multiple threads (by default, as many as the # of logical cores). ‣ In practice, OpenMP runtimes use thread pools to avoid launching new threads every time.

func_a();
#pragma omp parallel
{
  func_b();
}
func_c();

[Diagram: func_a() runs on one thread, the parallel region forks so func_b() runs on every thread, then the threads join and func_c() runs on one thread]

Slide 94

Slide 94 text

OpenMP: Work sharing 94

#pragma omp parallel
{
  for (int i = 0; i < 10; i++) {
    func_a(i);
  }
}
Every thread executes the whole loop: threads 0-3 each run i=0, 1, 2, 3, …

#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < 10; i++) {
    func_a(i);
  }
}
The iterations are divided among the threads: thread 0 runs i=0-2, thread 1 runs i=3-5, thread 2 runs i=6-7, thread 3 runs i=8-9.

Slide 95

Slide 95 text

OpenMP: Work sharing OpenMP construct for sharing work between threads 95 ‣ The #pragma omp parallel construct runs the same block in parallel (which is not useful by itself). ‣ Need to divide the work and assign different work to different threads (work sharing) to achieve speedup. ‣ #pragma omp for divides the loop and assigns the sub-loops to different threads.

#pragma omp parallel for
for (int i = 0; i < N; i++) {
  func_a();
}

is shorthand for:

#pragma omp parallel
{
  #pragma omp for
  for (...) {
  }
}

Slide 96

Slide 96 text

Numerical integration: first attempt 96 Let's calculate pi using numerical integration ‣ We parallelize the integration loop using the #pragma omp parallel for work-sharing construct. ‣ However, this implementation gives an incorrect result. Why?

π/4 = ∫_0^1 sqrt(1 − x²) dx ≈ Σ_{i=0}^{N−1} sqrt(1 − (iΔx)²) Δx

double sum = 0.0;
double dx = 1.0 / N;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
  double x = i * dx;
  sum += sqrt(1 - x * x) * dx;
}

Slide 97

Slide 97 text

Numerical integration: second attempt 97 Attempt #1 was incorrect because of a data race ‣ Multiple threads read/write the same variable concurrently. ‣ We can use the #pragma omp critical directive to mark a block as a critical section and restrict its execution to one thread at a time. ‣ However, this version is slow because of the serialized execution.

double sum = 0.0;
double dx = 1.0 / N;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
  double x = i * dx;
  #pragma omp critical
  sum += sqrt(1 - x * x) * dx;
}

Slide 98

Slide 98 text

Recap: Data race 98 Suppose you run the following code on two processors:

i = 0;
#pragma omp parallel
{
  i += 1;
}
// What is i now?

What you expect: thread 0 reads i (0) and writes 1; thread 1 then reads i (1) and writes 2; i ends up as 2.
What (may) happen: both threads read i (0) before either writes, and both write 1; i ends up as 1.

Slide 99

Slide 99 text

Numerical integration: third attempt 99 OpenMP provides special help for this pattern called the reduction clause ‣ Add reduction(op:var) to a #pragma omp parallel for directive ‣ The variable var will be stored in thread-private storage and reduced using the operator op after the parallelized loop finishes. ‣ This version is both correct and fast.

double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
  double x = i * dx;
  sum += dx * sqrt(1 - x * x);
}

Slide 100

Slide 100 text

Hands-on: Parallelizing MatVec using OpenMP 100 Start from the code you wrote in the previous hands-on ‣ Template: https://git.io/JI2DF
Compile (-fopenmp enables OpenMP):
g++ -O3 -fopenmp -march=native -Wall -std=c++11 -o matvec matvec.cpp
Run:
./matvec
Add OpenMP directives and see the performance!

Slide 101

Slide 101 text

Setting up gcc Linux 1. Install gcc using your package manager (e.g. apt install g++). macOS 1. Install the Xcode command line tools (xcode-select --install). 2. Install Homebrew (https://brew.sh/). 3. Install gcc via Homebrew (brew install gcc). 4. Use g++-10 instead of g++ (which is actually clang!). Windows 1. Install the Windows Subsystem for Linux (WSL). 2. Install your favorite Linux distribution via Microsoft Store. 3. Install gcc using your package manager. 101

Slide 102

Slide 102 text

Matrix-Vector multiplication (parallel outer loop) 102

void run() {
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    double sum = 0.0;
    for (int j = 0; j < N; j++) {
      sum += A[i][j] * x[j];
    }
    y[i] = sum;
  }
}

Rows of A and elements of y are divided among the threads; x is read by all threads.

Slide 103

Slide 103 text

Matrix-Vector multiplication (parallel inner loop) 103

void run() {
  for (int i = 0; i < N; i++) {
    double sum = 0.0;
    #pragma omp parallel for \
        reduction(+:sum)
    for (int j = 0; j < N; j++) {
      sum += A[i][j] * x[j];
    }
    y[i] = sum;
  }
}

For each row of A, the columns and the elements of x are divided among the threads.

Slide 104

Slide 104 text

Comparison of runtime (N=10,000) 104 [Bar chart: serial vs. parallel outer loop vs. parallel inner loop; the best parallel version achieves speedups of 4.5x on a Core i9-10900K (Linux 5.4.0, g++ 9.3.0) and 3.1x on a Core i7-8559U (macOS 11.0.1, g++ 10.2.0)]

Slide 105

Slide 105 text

Hands-on + Assignment 105 1. Implement a parallel matrix-matrix multiplication using OpenMP. ‣ You can start from my template: https://git.io/JLZYZ ‣ N >= 1000 is recommended 2. Measure and report the attained FLOP/s and efficiency of your code. 3. Discuss how you could improve the performance of your code further.

Attained FLOP/s = (# of FLOPs) / (runtime [s])    (calculate the # of FLOPs by hand or use a profiler)
Efficiency [%] = (Attained FLOP/s) / (Peak FLOP/s) · 100

Slide 106

Slide 106 text

Hands-on + Assignment How to submit: ‣ Open NRSS https://nrss.naist.jp/ and search for “Research Trends in Software Design and Analysis”. ‣ Title: Student Number-Your Name (e.g. 2012345-Keichi Takahashi) ‣ Body: Answer assignments 2 and 3. ‣ Attachment: Source code for assignment 1 (you may push your code to GitHub or Gist and provide a link instead). Deadline: ‣ December 31, 2020 23:59 (JST). 106

Slide 107

Slide 107 text

How to find the peak FLOP/s of your system 107 First of all, check what CPU you are running. Then, use the following equation to calculate the peak FLOP/s value:

Peak FLOP/s = (# of FLOPs / cycle) · (# of cycles / second) · (# of cores / system)

‣ FLOPs per cycle: refer to this table: https://en.wikichip.org/wiki/flops
‣ Cycles per second: use the base clock (not the turbo clock)

Example: Intel Core i9 10900K
592 GFLOP/s = 16 FLOPs/cycle · 3.70 GHz · 10 cores

Slide 108

Slide 108 text

Roofline model

Slide 109

Slide 109 text

The roofline model answers these questions for you When am I done optimizing? ‣ My code took 100 s to run before optimization. Now, after extensive optimization, my code runs in 10 s. How far am I from the optimum? Why don’t I get the peak performance? ‣ My GPU should provide 14 TFLOPS, but the profiler tells me that my code achieves only 1 TFLOPS. Why? 109

Slide 110

Slide 110 text

Example: vector addition 110 Consider the following code:

double a[N], b[N], c[N];
for (int i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
}

[Diagram: a timeline in which communication (loading a[i] and b[i], storing c[i]) overlaps with computation (adding a[i] and b[i])]

If computation and communication overlap, one of them will be the bottleneck.

Slide 111

Slide 111 text

Roofline model The roofline model assumes that the performance is limited by either 111 ‣ Computation (FLOP/s) or ‣ Communication (moving data to/from DRAM)

Runtime = max( # of FLOPs / Peak FLOP/s [computation], # of Bytes / Peak Bytes/s [communication] )

S. Williams et al., ”Roofline: an insightful visual performance model for multicore architectures." Communications of the ACM 52.4 (2009): 65-76.

Slide 112

Slide 112 text

Roofline model 112 Arithmetic Intensity (AI): how many FLOPs a program performs for each byte moved to/from memory, i.e. AI = # of FLOPs / # of Bytes.

Dividing the number of FLOPs by the runtime gives the attainable performance:

FLOP/s = # of FLOPs / Runtime
       = # of FLOPs / max( # of FLOPs / Peak FLOP/s , # of Bytes / Peak Bytes/s )
       = min( Peak FLOP/s , Peak Bytes/s · AI )

Slide 113

Slide 113 text

How to calculate the arithmetic intensity 113

double a[N], b[N], c[N];
for (int i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
}
2 loads, 1 store, 1 floating-point operation → AI = 1 / ((2+1) * 8) ≈ 0.042

double a[N], b[N], sum;
for (int i = 0; i < N; i++) {
  sum += a[i] * a[i] + b[i] * b[i];
}
2 loads, 0 stores, 3 floating-point operations → AI = 3 / (2 * 8) = 0.1875

In practice, use hardware performance counters and measure on actual hardware. You can use Linux perf, Intel VTune, LIKWID, etc.
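As a quick illustration, plugging the two AI values above into the roofline formula for a hypothetical machine with a 1 TFLOP/s compute ceiling and 200 GB/s of DRAM bandwidth (made-up numbers, not from the slides):

Attainable FLOP/s = min( Peak FLOP/s , Peak Bytes/s · AI )
‣ AI ≈ 0.042: min(1000 GFLOP/s, 200 GB/s · 0.042) = 8.4 GFLOP/s (memory-bound)
‣ AI = 0.1875: min(1000 GFLOP/s, 200 GB/s · 0.1875) = 37.5 GFLOP/s (memory-bound)

Both kernels sit far to the left of the ridge point, so only a higher AI (better data reuse) or more bandwidth can speed them up.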

Slide 114

Slide 114 text

Arithmetic intensity of real-world kernels 114 https://crd.lbl.gov/departments/computer-science/par/research/roofline/introduction/

Slide 115

Slide 115 text

Visual representation of the roofline model 115 [Plot: attainable FLOP/s vs. arithmetic intensity; performance is capped by a bandwidth ceiling (a line with slope Peak Bytes/s) at low AI and by the computational ceiling (Peak FLOP/s) at high AI]

Slide 116

Slide 116 text

Predicting performance using roofline analysis 116 Given the arithmetic intensity of a computational kernel and the compute and memory ceilings, we can calculate the attainable FLOP/s. [Plot: the kernel's AI determines which ceiling it hits]

Slide 117

Slide 117 text

Memory-bound vs compute-bound 117 The ridge point, where the bandwidth ceiling meets the computational ceiling, lies at roughly 1-10 FLOPs/Byte on modern processors. Any kernel to the left of the ridge point is memory-bound: it is limited by memory bandwidth. Any kernel to the right is compute-bound: it is limited by the FPUs.

Slide 118

Slide 118 text

Ridge point on modern systems 118 Fujitsu A64FX DP FLOPS=2.7 TFLOPS B/W=1TB/s Ridge AI=2.7 NVIDIA V100 DP FLOPS=7 TFLOPS B/W=900GB/s Ridge AI=7.8 AMD EPYC 7742 DP FLOPS=2.3 TFLOPS B/W=330GB/s Ridge AI=7.7

Slide 119

Slide 119 text

Guiding optimization via roofline analysis (1/2) 119 [Plot: kernels sitting on the ceilings] Kernels that sit on the computational ceiling are highly optimized and cannot be optimized further. Kernels that sit on the bandwidth ceiling are also highly optimized; you may be able to increase their AI via algorithmic optimization or improved data reuse.

Slide 120

Slide 120 text

Guiding optimization via roofline analysis (2/2) 120 [Plot: kernels well below the ceilings] Kernels that fall well below the ceilings are optimization opportunities! Try the optimization techniques at your disposal.