
Introduction to software optimization

Keichi Takahashi
December 19, 2020


These are the slides for my special lecture series on software optimization I taught in fall 2020 at NAIST. It covers the basics of software optimization ranging from basic computing architecture, performance analysis tools, memory optimization, parallelization, roofline analysis, and more.


Transcript

  1. Research Trends in Software Design and Analysis
    Dr. Keichi Takahashi
    Software Design and Analysis Laboratory
    Nara Institute of Science and Technology


  2. What you get from this lecture
    2
    This lecture will teach you what it takes to write
    FAST code on modern computing systems.


  3. Logistics
    Overall
    ‣ Four lectures in total: 4th hour (15:10–16:40) on December 8, 10, 15, 17
    ‣ Hands-on exercises included, so bring your laptops!
    Grading
    ‣ 100% mini-project report
    ‣ Mini-project will be assigned in the last lecture
    Ask questions!
    ‣ Feel free to interrupt during class and ask questions
    ‣ Or ask via email: [email protected]
    3


  4. Recommended readings
    4
    Randal Bryant and David O’Hallaron,
    “Computer Systems: A Programmer's
    Perspective”, Pearson
    John L. Hennessy and David A.
    Patterson, “Computer Architecture: A
    Quantitative Approach”, Morgan
    Kaufmann.


  5. (My) Golden rules for optimization
    Understand your hardware and software
    ‣ Know what your hardware is capable of.
    ‣ Profile your software and understand its bottlenecks.
    Minimize data movement across the memory hierarchy
    ‣ Modern computers have very deep memory hierarchies.
    ‣ Design memory layout and access pattern to maximize data reuse.
    Utilize parallelism from all levels of the computer
    ‣ Modern computers have many levels of parallelism.
    ‣ Exploit all levels of parallelism in your computer(s).
    5


  6. Know your hardware and software


  7. Know your hardware


  8. FLOP/s (Floating-point Operations per Second)
    A measure of computing
    performance
    ‣ especially appreciated in scientific
    computing.
    ‣ traditionally used for measuring the
    performance of supercomputers.
    When measuring/reporting FLOP/s
    ‣ Beware of the precision (double,
    single, half, etc.)!
    ‣ Take extra care for transcendental
    functions (sqrt, sin, exp, etc.) and
    divisions.
    8
    double x[N], y[N];
    for (int i = 0; i < N; i++) {
    y[i] += a * x[i];
    }
    How many floating-point operations does this code perform?


  9. Examples
    9
    AMD EPYC 7742
    DP: 2.3 TFLOPS
    SP: 4.6 TFLOPS
    NVIDIA V100
    DP: 8.2 TFLOPS
    SP: 16.4 TFLOPS
    Intel Xeon Platinum 8284
    DP: 1.5 TFLOPS
    SP: 3.0 TFLOPS
    Intel i7-8559U
    DP: 170 GFLOPS
    SP: 340 GFLOPS
    Intel i9-9900K
    DP: 589 GFLOPS
    SP: 1.2 TFLOPS


  10. Moore’s law
    Moore’s law predicts that:
    ‣ The number of transistors in a chip
    doubles every two years.
    ‣ Notice that he did not predict that the
    performance doubles.
    Moore’s law is ending soon
    ‣ Most researchers predict that Moore’s law will end within the next 10 years.
    ‣ Exciting exotic devices and novel architectures are being investigated for
    Post-Moore computing (but they will not be covered in this lecture).
    10
    Gordon Moore


  11. Dennard scaling
    Dennard observed the following scaling at every technology generation
    (scaling factor k, where 1/k ≈ 1/√2 ≈ 0.7):
    11
    Transistor dimension: 1/k
    Voltage: 1/k
    Capacitance: 1/k
    Frequency: k
    Power consumption: 1/k²
    Power density: 1
    R. H. Dennard et al., "Design of ion-implanted MOSFET's with very small physical dimensions,"
    in IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256-268, Oct. 1974.
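    Putting these factors together shows why the power density stays constant. A short
    derivation (my own summary, using the dynamic power formula P ∝ C·V²·f):
    Power per transistor: P ∝ C·V²·f ∝ (1/k)·(1/k)²·(k) = 1/k²
    Area per transistor:  A ∝ (1/k)² = 1/k²
    Power density:        P/A = constant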


  12. End of Dennard scaling
    Dennard ignored the leakage current and threshold voltage
    ‣ Leakage current ࿙Εిྲྀ: Transistors leak a small current when they are
    turned off. This current increases as the insulation thickness decreases.
    ‣ Threshold voltage ᮢ஋ిѹ: Voltage where the transistor turns on.
    The power density started increasing
    ‣ Can’t increase the frequency anymore!
    ‣ How are we going to use the extra
    transistors we can put in our chip?
    12


  13. Single-thread performance has plateaued
    13


  14. The free lunch is over [1]
    Until ~2005, Dennard Scaling was doing well
    ‣ Frequency and single-thread performance were steadily increasing.
    ‣ Just wait for new hardware to make your program run faster (free lunch).
    Around 2005 (with the Pentium 4), Dennard scaling failed
    ‣ Frequency and single-thread performance have plateaued.
    ‣ Multi-thread performance is still increasing.
    ‣ Software needs to be written to exploit the parallelism (no free lunch).
    14
    1 http://www.gotw.ca/publications/concurrency-ddj.htm


  15. Two oxen vs 1,024 chickens
    15
    If you were plowing a field, which
    would you rather use: two strong oxen
    or 1,024 chickens?
    Seymour Cray


  16. Two oxen vs 1,024 chickens
    16
    We have no choice but to work with 1,024 chickens!


  17. Need to exploit multiple levels of parallelism
    Instruction-level parallelism
    ‣ A processor executes multiple independent instructions in parallel
    Data-level parallelism
    ‣ A processor executes the same instruction on multiple data (vector)
    Thread-level parallelism
    ‣ A multiprocessor comprises multiple processors that execute multiple
    programs in parallel
    Cluster-level parallelism
    ‣ A cluster comprises multiple computers that execute multiple programs
    in parallel
    17


  18. Example: AMD EPYC Rome 7742
    18
    Each core has 2 FMA (Fused
    Multiply-Add) pipelines
    64 cores
    1 CPU package
    Each pipeline can perform
    8 single-precision FLOPs in a cycle
    EPYC 7742 can perform 64*2*8*2=2,048 FLOPs in each cycle!
    In other words, you are only utilizing 0.05% of the total performance
    if you don’t use any of the parallelism available.
    See: https://en.wikichip.org/wiki/amd/epyc/7742
    https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
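    Multiplying the FLOPs per cycle by the clock frequency gives the peak FLOP/s. A rough
    sanity check (assuming the 2.25 GHz base clock quoted later in this deck for the EPYC 7742):
    2,048 FLOPs/cycle · 2.25 GHz ≈ 4.6 TFLOP/s single precision
    which matches the SP number on slide 9 (and half of it, ~2.3 TFLOP/s, in double precision).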


  19. Memory performance is lagging behind
    Memory performance could not keep up with CPU performance
    ‣ In the old days, memory ran faster than the CPU!
    ‣ Nowadays, CPU needs to wait (stall) for data to arrive from memory.
    19
    John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach”,
    Morgan Kaufmann.
    N.B. this is single-
    core performance


  20. Memory hierarchy
    Users want infinite amount of infinitely
    fast memory
    ‣ Fast memory (e.g. SRAM) is expensive
    ‣ Large memory (e.g. DRAM) is slow
    Form a hierarchy of memories
    ‣ Higher levels are faster but small
    ‣ Lower levels are slower but large
    ‣ Higher levels keep frequently accessed
    subset of the lower levels (caching)
    20
    [Pyramid diagram: Register, L1, L2, L3, DRAM from top to bottom; higher levels are
    faster but smaller, lower levels are larger but slower (a tradeoff)]


  21. Typical memory hierarchy in modern CPUs
    21
    [Diagram: each core has private L1 and L2 caches; cores on the same die share an
    L3 cache; multiple dies connect to DRAM]


  22. Latency numbers every programmer should know
    22
    What Latency Human scale
    L1 cache reference 0.5ns 0.5s
    Branch mispredict 5ns 5s
    L2 cache reference 7ns 7s
    Mutex lock/unlock 25ns 25s
    Main memory reference 100ns 1m 40s
    Compress 1K bytes with Zippy 3,000ns 50m
    Send 2 KB over 1 Gbps network 20,000ns 5h 33m
    Read 1 MB sequentially from memory 250,000ns 2d 21h 26m
    Read 1 MB sequentially from SSD 1,000,000ns 11d 13h 46m
    The Jeff Dean Facts http://www.informatika.bg/jeffdean
    The legendary Jeff Dean stories (in Japanese) https://qiita.com/umegaya/items/ef69461d6f4967d5c623


  23. AMD EPYC 7742 memory latency* and bandwidth†
    Latency
    ‣ Clock cycle is ~0.4ns (2.25 GHz).
    ‣ Missing one cache level increases
    the latency by a factor of 1.4-5x.
    ‣ If you miss all cache levels, core
    stalls for ~200 cycles!
    Bandwidth
    ‣ Becomes scarce at lower levels.
    ‣ Note that the bandwidth is shared
    among all cores — DRAM B/W is
    only 2.4 GB/s per core!
    23
    L1 cache: 1.7-3.5 ns, 6.6 TB/s
    L2 cache: 5 ns, 4.0 TB/s
    L3 cache (shared within a die): 17 ns, 1.6 TB/s
    DRAM: 83 ns, 153 GB/s
    *https://www.7-cpu.com/cpu/Zen2.html
    † Measured on actual hardware using Empirical Roofline Tool (ERT)


  24. Deep memory hierarchy
    Utilizing the cache is crucial for achieving “good” performance
    ‣ If the core misses a cache level, it needs to wait for the data to arrive from
    lower levels.
    ‣ Your code needs to access the memory in such a way that maximizes
    cache hit (i.e. cache-friendliness)
    How to write cache-friendly code?
    ‣ Understand how the cache works (cache capacity, cache line size, false
    sharing, bank conflicts, etc.)
    ‣ Exploit spatial and temporal locality
    24


  25. Why software optimization matters
    Individual CPU cores are not becoming faster anymore
    ‣ Just buying new hardware does not make your software run faster
    ‣ You need to parallelize your software if you want to utilize your hardware
    Memory hierarchy is deep
    ‣ Memory performance is not keeping up with compute performance
    ‣ You cannot get decent performance without effectively utilizing the cache
    25
    Your code does not run any faster on your expensive new hardware
    without optimization.


  26. Why can’t we build a 10,000-core CPU?
    Well, we can. Meet Cerebras CS-1.
    26
    Cerebras CS-1 vs. NVIDIA V100 (CS-1:V100 ratio):
    Die size: 46,225 mm² vs. 815 mm² (57x)
    # of cores: 400,000 vs. 5,120 (78x)
    On-chip memory (SRAM): 18 GB vs. 6 MB (3,000x)
    Memory bandwidth: 9 PB/s vs. 900 GB/s (10,000x)
    Power consumption: 20 kW vs. 300 W (67x)
    https://www.cerebras.net/product/
    Kamil Rocki et al., “Fast Stencil-Code Computation on a Wafer-Scale Processor”, SC20.


  27. Why can’t we build a 10,000-core CPU?
    27
    … if you can supply 20kW
    of power + cooling


  28. Know the bottleneck of your software


  29. Ask yourself if you really need to optimize
    29
    “The first rule of optimization is: Don’t do it.
    The second rule of optimization (for experts only) is: Don’t do it yet.”
    Michael A. Jackson


  30. Ask yourself if you really need to optimize
    30
    “Premature optimization is the root of all evil.”
    Tony Hoare (or Donald Knuth)


  31. Performance comes with a cost
    Optimizing a program takes time:
    ‣ If you are a researcher, focus on producing results and papers!
    ‣ If you are working at a company, optimization costs money. Buying an
    expensive server is probably cheaper than hiring a CS PhD.
    Optimizing a program reduces readability:
    ‣ Most optimization techniques reduce the readability of the source code.
    ‣ Low readability implies even more maintenance cost!
    ‣ Language and library abstractions to achieve both performance and
    readability are being studied but not quite there yet…
    31


  32. Consider these alternatives
    32
    NVIDIA DGX Station A100
    $200,000 (one-off payment)
    Performance gain: predictable
    Computer Science Ph.D.
    $100,000 (yearly payment)
    Performance gain: unknown


  33. Avoid reinventing the wheel
    There are plenty of high-quality, highly-optimized and well-
    maintained libraries out there!
    ‣ Basic linear algebra: BLAS and LAPACK implementations such as MKL,
    AOCL, ScaLAPACK, cuBLAS
    ‣ Solvers: PETSc, Trilinos, SuperLU, cuSOLVER
    ‣ FFT: FFTW, cuFFT
    ‣ File I/O: HDF5, ADIOS2, NetCDF
    ‣ General purpose: Eigen, Thrust, ArrayFire
    33
    If none of the existing libraries fits your needs, then consider
    writing your own code


  34. Focus on the bottleneck (hot-spot)
    The Pareto principle (80/20 rule) applies to optimization too
    ‣ Majority of the runtime is consumed by a very small fraction of the code
    ‣ Don’t waste your time by optimizing functions that don’t matter!
    ‣ If many functions equally contribute to the runtime, consider giving up
    34
    [Bar charts: three example breakdowns of the total runtime over functions A-D and the
    rest; in the first, a single function dominates the runtime, while in the others the
    time is spread more evenly across functions]


  35. Always measure your code
    Don’t ever optimize a code without measurement
    ‣ Measure before attempting any optimization to identify the bottleneck
    Measure, optimize, and then measure again
    ‣ Measure after optimization since optimization techniques can often make
    code slower
    35
    [Cycle: Measure (find the bottleneck) -> Optimize -> Measure (evaluate the effect of
    the optimization)]


  36. Profilers
    Open-source:
    ‣ perf: Linux’s builtin performance analysis tool. Recommended!
    ‣ gprof: GCC’s profiler. Pretty old and outdated; not recommended.
    Vendor-specific:
    ‣ Intel VTune and Advisor: Highly recommended on Intel CPUs.
    ‣ AMD uProf: Compared to VTune, not so great.
    ‣ NVIDIA Nsight and Nvprof: No other options if working on NVIDIA GPUs
    Language-specific:
    ‣ Python: cProfile and line_profiler
    ‣ Ruby: stackprof and ruby-prof
    36


  37. Intel VTune (Hot-spot Analysis)
    37


  38. Intel VTune (Hot-spot Analysis)
    38


  39. Intel VTune (Hot-spot Analysis)
    39


  40. Intel VTune (Microarchitecture Exploration)
    40


  41. Linux perf
    Linux’s builtin performance analysis tool since kernel 2.6
    ‣ Can profile the kernel space: useful when analyzing applications that
    spend significant time in the kernel space (e.g. I/O-intensive applications).
    ‣ Gives access to hardware performance counters: reveals hardware events
    such as cache misses, context switches, branch misses, etc.
    ‣ System-wide profiling: can profile all processes on the system. Useful
    when analyzing multiple processes interacting with one another.
    ‣ Kernel config (CONFIG_PERF_EVENTS) is enabled in most Linux
    distributions. Just need to install the user space utility (linux-tools-
    common on Ubuntu).
    41
    Official wiki: https://perf.wiki.kernel.org/index.php/Main_Page
    Brendan Gregg’s guide: http://www.brendangregg.com/perf.html


  42. perf record & report
    42
    Sample the number of hardware events (e.g. cache hits and misses) during
    execution of a program and report.
    perf record -e cycles ./stress-ng -c 10 -t 3
    List of HW events to sample Program to measure
    perf report


  43. perf stat
    43
    [keichi@keichi-ws stress-ng]$ perf stat -e cache-misses,cache-references ./stress-ng -c 10 -t 3
    stress-ng: info: [14643] dispatching hogs: 10 cpu
    stress-ng: info: [14643] successful run completed in 3.04s
    Performance counter stats for './stress-ng -c 10 -t 3':
    10,805,671 cache-misses # 0.640 % of all cache refs
    1,687,740,535 cache-references
    3.047260141 seconds time elapsed
    30.328126000 seconds user
    0.007996000 seconds sys
    Count number of hardware events (e.g. cache hits and misses) during
    execution of a program.
    perf stat -e cache-misses,cache-references ./stress-ng -c 10 -t 3
    List of HW events to count Program to measure


  44. perf list
    44
    Obtain a list of hardware events that are available on your platform.


  45. perf top
    45
    stress-ng -c 10
    stress-ng -S 10
    System-wide profiling tool (think of it as the top command on steroids)


  46. Strace: System call tracer
    46
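    The strace screenshots are not included in the transcript. As a minimal illustration
    (./matvec is just a placeholder for the program you want to trace):
    strace ./matvec                        # print every system call the program makes
    strace -c ./matvec                     # count system calls and print a summary table
    strace -e trace=openat,read ./matvec   # trace only the selected system calls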


  47. Strace: System call tracer
    47


  48. Timers
    Insert timers before and after the
    code you want to measure
    ‣ It’s simple and it always works —
    just like printf() debugging
    ‣ If you already know which part of
    the code is the bottleneck, very
    effective
    ‣ Beware of the pitfalls!
    48
    // Start timer
    double start = read_timer();
    // Code to measure
    for (…) {
    for (…) {

    }
    }
    // Stop timer
    double end = read_timer();
    // Calculate diff
    double runtime = end - start;


  49. Use high-precision timers
    Use these:
    ‣ clock_gettime(CLOCK_MONOTONIC, …)
    ‣ clock_gettime(CLOCK_PROCESS_CPUTIME_ID, …)
    ‣ std::chrono::steady_clock (C++11)
    Don’t use these:
    ‣ time() returns UNIX epoch seconds
    ‣ gettimeofday() deprecated in favor of clock_gettime()
    ‣ clock_gettime(CLOCK_REALTIME, …) can rewind if system time is
    changed (e.g. NTP)
    49
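    A minimal sketch of the read_timer() helper used on the Timers slide, assuming a POSIX
    system (clock_gettime with CLOCK_MONOTONIC, as recommended above):
    #include <time.h>
    // Returns a monotonic timestamp in seconds as a double.
    double read_timer(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }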


  50. Mind these pitfalls
    Pin your processes/threads to CPU cores
    ‣ When using clock_gettime(CLOCK_PROCESS_CPUTIME_ID, …) or
    RDTSC instructions
    Measure multiple times and perform warmup runs
    ‣ To see the variance of the runtime
    ‣ It’s common that the first few iterations are always slow (why?)
    … and even more pitfalls
    ‣ System calls and function calls have overheads
    ‣ Out-of-order execution and the compiler can reorder instructions
    ‣ Need barriers if measuring multi-thread or multi-process code
    50


  51. Memory optimization


  52. How to tame the deep memory hierarchy
    52
    Maximize the spatial and temporal locality of your
    memory accesses to improve cache efficiency.
    …but how?


  53. Locality
    53
    Cache assumes that your software exhibits spatial and temporal locality.
    Spatial locality: Items whose addresses are near one another tend to be referenced
    close together in time.
    Temporal locality: Recently accessed items are likely to be accessed again in the
    near future.
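    A small illustration of both kinds of locality (my own example, not from the slides):
    double sum_array(const double *a, int n)
    {
        double sum = 0.0;          // sum is reused on every iteration: temporal locality
        for (int i = 0; i < n; i++)
            sum += a[i];           // a[0], a[1], ... sit at adjacent addresses: spatial locality
        return sum;
    }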


  54. Hands-on: Matrix-vector multiplication
    54
    double A[N][N];
    double x[N], y[N];
    // Implement here
    Given an N x N matrix A and an N x 1 vector x, write a fast C code that
    multiplies A and x and stores the result to an N x 1 vector y.
    Example:
    [1 2 3]   [1]   [1*1 + 2*1 + 3*2]   [ 9]
    [4 5 6] * [1] = [4*1 + 5*1 + 6*2] = [21]
    [7 8 9]   [2]   [7*1 + 8*1 + 9*2]   [33]


  55. Setting up a C++ compiler
    Linux
    A. Install gcc or clang using your package manager.
    macOS
    A. Install the Xcode command line tools and use its bundled clang.
    B. Install gcc via Homebrew.
    Windows
    A. Install the Windows Subsystem for Linux (WSL).
    B. Install MinGW or Visual Studio.
    Online compiler (if none of the above works)
    A. Wandbox: https://wandbox.org/
    55


  56. Compiling and running the example
    Download the example
    ‣ Gist URL: https://git.io/JI2DF
    Compile
    Run
    56
    g++ -O3 -fno-tree-vectorize -Wall -std=c++11 -o matvec main.cpp
    ./matvec
    Now implement matrix-vector multiplication in run()
    (The “O” in -O3 is the letter O, not zero. The -fno-tree-vectorize flag is not needed
    if you’re using clang.)


  57. Which one is faster?
    57
    for (int i = 0; i < N; i++) {
    y[i] = 0.0;
    for (int j = 0; j < N; j++) {
    y[i] += A[i][j] * x[j];
    }
    }
    for (int i = 0; i < N; i++) {
    y[i] = 0.0;
    }
    for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
    y[i] += A[i][j] * x[j];
    }
    }
    A. B.


  58. Comparison of runtime (N=10,000)
    58
    [Bar chart: runtime of ver. A vs. ver. B for N=10,000 on a Core i9-10900K (g++ 9.3.0)
    and a Core i7-8559U (g++ 10.2.0); ver. A is roughly 3.9x and 6.6x faster, respectively]


  59. Memory access pattern
    59
    A. Row-major memory access:
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            y[i] += A[i][j] * x[j];
        }
    }
    B. Column-major memory access:
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            y[i] += A[i][j] * x[j];
        }
    }
    [Diagram: y = A·x; version A walks A row by row, version B walks A column by column]


  60. Memory layout: row-major order
    60
    1 2 3
    4 5 6
    7 8 9
    10 11 12
    Address Value
    0x0000 1
    0x0008 2
    0x0010 3
    0x0018 4
    0x0020 5
    0x0028 6
    0x0030 7
    0x0038 8
    0x0040 9
    0x0048 10
    0x0050 11
    0x0058 12
    Elements within the same row are contiguous on
    memory (C, C++, etc.)


  61. Memory layout: column-major order
    61
    1 2 3
    4 5 6
    7 8 9
    10 11 12
    Address Value
    0x0000 1
    0x0008 4
    0x0010 7
    0x0018 10
    0x0020 2
    0x0028 5
    0x0030 8
    0x0038 11
    0x0040 3
    0x0048 6
    0x0050 9
    0x0058 12
    Elements within the same column are contiguous
    on memory (Fortran, Julia, numpy, etc.)


  62. A exhibits better spatial locality than B
    62
    A. accesses the elements of A at consecutive addresses: contiguous memory access,
    spatial locality is high.
    B. accesses the elements of A with a stride of one row: strided memory access,
    spatial locality is low.




  63. Why contiguous memory access is faster
    Cache line is the smallest unit of data movement between cache levels
    ‣ A cache line is 64B (i.e. 8 doubles or 16 floats) on recent Intel/AMD CPUs.
    ‣ Even when reading a single byte, the CPU reads the whole cache line that
    includes that byte from memory and caches to L1-L3 cache.
    ‣ Use the whole data within in a cache line if possible.
    63
    [Diagram: memory is divided into 64 B cache lines (8 consecutive doubles each); on a
    miss, the whole cache line containing the requested byte is moved from DRAM through
    L2 into L1]


  64. Why contiguous memory access is faster (cont’d)
    CPU prefetches cache lines
    ‣ CPU predicts future memory accesses and moves cache lines in advance.
    ‣ How “smart” the prefetcher is depends on the CPU, but it generally fails to
    predict complex/irregular access patterns.
    ‣ Prefetching can be explicitly triggered from software (software prefetching), but
    then the programmer needs to reason about the latency of instructions and memory
    accesses.
    64
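    GCC and clang expose software prefetching through the __builtin_prefetch builtin. A
    minimal sketch (the distance of 16 elements is an arbitrary illustrative value, not a rule):
    void scale(double *y, const double *x, double a, int n)
    {
        for (int i = 0; i < n; i++) {
            // Hint to start loading an element needed by a future iteration.
            __builtin_prefetch(&x[i + 16], 0 /* read */, 3 /* keep in cache */);
            y[i] = a * x[i];
        }
    }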


  65. Measuring cache hit ratio using perf
    65
    A. Row-major memory access: 1 out of 8 loads from A misses the cache w/o prefetching.
    B. Column-major memory access: 8 out of 8 loads from A miss the cache w/o prefetching.


  66. Measuring cache hit ratio using perf
    66
    A. Row-major memory access
    B. Column-major memory access


  67. Hands-on: Matrix-matrix multiplication
    67
    double A[N][N], B[N][N], C[N][N];
    // Implement here
    Given two N x N matrices A and B, write a C code that multiplies A and B
    and stores the result to an N x N matrix C.
    Example:
    [1 2]   [5 6]   [(1*5 + 2*7) (1*6 + 2*8)]   [19 22]
    [3 4] * [7 8] = [(3*5 + 4*7) (3*6 + 4*8)] = [43 50]


  68. Basic algorithm and naming of loop indices
    68
    for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
    C[i][j] = 0.0;
    for (int k = 0; k < N; k++) {
    C[i][j] += A[i][k] * B[k][j];
    }
    }
    }
    [Diagram: C = A·B; i indexes the rows of C and A, j indexes the columns of C and B,
    and k runs along the columns of A and the rows of B]


  69. Analysis of inner-most loop for version ijk (jik)
    69
    for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
    double sum = 0.0;
    for (int k = 0; k < N; k++) {
    sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
    }
    }
    Read a row from A and a
    column from B. Calculate
    their inner product and
    write to C.
    2 loads (A and B)
    0 store
    A misses: 0.125 (=1/8)
    B misses: 1 (=8/8)
    C misses: 0
    Total misses: 1.125 per
    inner-most loop iteration


  70. Analysis of inner-most loop for version jki (kji)
    70
    for (int j = 0; j < N; j++) {
    for (int k = 0; k < N; k++) {
    double r = B[k][j];
    for (int i = 0; i < N; i++) {
    C[i][j] += A[i][k] * r;
    }
    }
    }
    Read an element from B and
    multiply it with a column in
    A. Accumulate the result to a
    column in C.
    2 loads (A and C)
    1 store (C)
    A misses: 1 (=8/8)
    B misses: 0
    C misses: 1 (=8/8)
    Total misses: 2 per inner-
    most loop iteration


  71. Analysis of inner-most loop for version kij (ikj)
    71
    for (int k = 0; k < N; k++) {
    for (int i = 0; i < N; i++) {
    double r = A[i][k];
    for (int j = 0; j < N; j++) {
    C[i][j] += r * B[k][j];
    }
    }
    }
    Read an element from A and
    multiply it with a row in B.
    Accumulate the result to a
    row in C.
    2 loads (B and C)
    1 store (C)
    A misses: 0
    B misses: 0.125 (=1/8)
    C misses: 0.125 (=1/8)
    Total misses: 0.25 per
    inner-most loop iteration


  72. Benchmark result
    72
    Reference: Randal Bryant and David O’Hallaron, “Computer Systems: A Programmer's
    Perspective”, Pearson
    Simply interchanging
    loops makes 40x
    difference!


  73. Gravitational N-body problem
    73
    d²x_i/dt² = Σ_{j=1}^{N} G·m_j·(x_j − x_i) / |x_j − x_i|³
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i] += calc(x[i], x[j]);
        }
    }
    [Diagram: particles x1…x5; the arrow from x1 toward x2 represents the term
    G·m_2·(x_2 − x_1)/|x_2 − x_1|³]


  74. Analysis of spatial and temporal locality
    We look into the memory accesses in the inner most loop:
    ‣ x[i] and x[j] are read contiguously: high spatial locality
    ‣ a[i] and x[i] are accessed in every inner-most loop iteration: high temporal locality
    ‣ each x[j] is re-accessed only once every N inner-most loop iterations: low temporal locality
    Cache capacity is limited (L1 < 100KB, L2 < 1MB)
    ‣ The cache “forgets” (evicts) data that is referenced rarely (LRU policy)
    ‣ How can we increase the temporal locality of x[j]?
    74
    for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
    a[i] += calc(x[i], x[j]);
    }
    }


  75. Increasing temporal locality using cache blocking
    75
    Original:
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i] += calc(x[i], x[j]);
    Each x[j] is re-accessed every N inner-most loop iterations.

    After applying cache blocking:
    for (int j = 0; j < N; j += BLOCK)
        for (int i = 0; i < N; i++)
            for (int jj = 0; jj < BLOCK; jj++)
                a[i] += calc(x[i], x[j + jj]);
    Each x[j + jj] is re-accessed every BLOCK inner-most loop iterations, and only BLOCK
    elements of x are touched in between, so they stay in cache (better temporal locality!).

    [Diagram: iteration space of the original loop (N=8) and the blocked loop (N=8, BLOCK=4)]
    https://software.intel.com/content/www/us/en/develop/articles/cache-blocking-techniques.html


  76. The memory mountain
    76
    Reference: Randal Bryant and David O’Hallaron, “Computer Systems: A Programmer's
    Perspective”, Pearson


  77. Conclusion
    Memory optimization matters.
    ‣ Choosing the right memory layout and access pattern can improve performance by
    orders of magnitude.
    ‣ Memory optimization is more relevant than ever because of the deep memory
    hierarchy in modern computer architectures.
    Time complexity analysis ignores the constant factor.
    ‣ An O(n²) implementation can be faster than an O(n) implementation for practical input sizes.
    ‣ Two implementations with the same time complexity can have significant performance
    differences (e.g. iterating over an array vs. iterating over a linked list are both O(n).
    Which one is faster?)
    77


  78. Parallelization


  79. Recommended reading
    79
    Victor Eijkhout, "Introduction to High
    Performance Scientific Computing",
    Lulu.com, 2012.
    Victor Eijkhout, "Parallel Programming in MPI
    and OpenMP", 2017.
    Both books are available online for free!


  80. Parallel programming meme
    80


  81. Parallel programming meme
    81


  82. Basic strategy for parallelization
    Step 1. Find the bottleneck of your program
    ‣ Use the profiling tools we discussed earlier in this class.
    Step 2. Find the right decomposition of your work
    ‣ Decomposed tasks need to be independent of one another (except for
    special cases, i.e. reduction).
    ‣ If multiple decompositions are possible, choose the most coarse-grained
    decomposition to reduce the parallelization overhead.
    Step 3. Implement parallelism
    ‣ Use the programming model (OpenMP, MPI, CUDA, etc.) that suits your
    hardware.
    82


  83. Predicting the speedup of a parallel program
    83
    Suppose a program takes T_s to run serially and a fraction p (0 ≤ p ≤ 1) of the program
    can be parallelized: p·T_s can be parallelized and (1 − p)·T_s cannot.
    The runtime of the same program running in parallel on n processors is:
    T_p = (p/n)·T_s + (1 − p)·T_s
    [Diagram: the serial bar is split into p·T_s and (1 − p)·T_s; in the parallel bar the
    parallelizable part shrinks to (p/n)·T_s while (1 − p)·T_s stays the same]


  84. Amdahl’s Law
    84
    The parallel speedup is the ratio between T_s and T_p:
    T_s / T_p = T_s / ((p/n)·T_s + (1 − p)·T_s) = 1 / (p/n + (1 − p))
    Let’s find the limit of the speedup as n approaches infinity:
    lim_{n→∞} T_s / T_p = lim_{n→∞} 1 / (p/n + (1 − p)) = 1 / (1 − p)
    No matter how many processors you have, this is the attainable maximum speedup!


  85. Example
    85
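    The example slide is an image that is not reproduced in the transcript. As an
    illustrative calculation with made-up numbers: if p = 0.95 of a program can be
    parallelized, then
    Speedup on n = 8 processors: 1 / (0.95/8 + 0.05) ≈ 5.9x
    Maximum speedup (n → ∞):     1 / (1 − 0.95) = 20x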


  86. Need to exploit multiple levels of parallelism
    Instruction-level parallelism
    ‣ A processor executes multiple independent instructions in parallel
    Data-level parallelism
    ‣ A processor executes the same instruction on multiple data (vector)
    Thread-level parallelism
    ‣ A multiprocessor comprises multiple processors that execute multiple
    programs in parallel
    Cluster-level parallelism
    ‣ A cluster comprises multiple computers that execute multiple programs
    in parallel
    86


  87. Shared memory parallelism
    Processors share the same memory
    ‣ e.g. Cores within a package or computer
    Pros
    ‣ Easier programming thanks to the unified address space.
    ‣ Sharing data between processors is fast.
    Cons
    ‣ Scalability is limited due to
    access contention and false sharing.
    ‣ Prone to nasty bugs such as data
    races.
    87
    Memory
    Processor Processor Processor


  88. Distributed memory parallelism
    Each processor has its own memory
    ‣ e.g. Cluster of computers interconnected via a network
    Pros
    ‣ No access contention since each processor has its own dedicated memory.
    ‣ Scalable to millions of processors.
    Cons
    ‣ Need to explicitly describe
    communication between processors.
    ‣ Exchanging data between processors
    is slower than shared memory
    systems.
    88
    Processor
    Memory
    Processor
    Memory
    Processor
    Memory
    Network


  89. Hybrid approaches
    Shared memory programming on distributed memory system
    (i.e. distributed shared memory)
    ‣ Middleware hides communication between processors.
    ‣ e.g. PGAS (Partitioned Global Address Space) programming models
    Distributed memory programming on shared memory system
    ‣ Partition the shared memory into distributed memories.
    ‣ e.g. MPI on a single-node
    Hybrid parallelism
    ‣ Combine shared and distributed memory programming.
    ‣ e.g. MPI+OpenMP: OpenMP to parallelize within a single node and MPI to
    parallelize across multiple nodes
    89
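    A minimal MPI+OpenMP sketch of the hybrid style (my own illustration; typically built
    with something like mpicc -fopenmp and launched with one MPI rank per node):
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);                  // distributed memory: one process per node
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        #pragma omp parallel                     // shared memory: threads within a node
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
        MPI_Finalize();
        return 0;
    }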


  90. Shared memory parallel programming
    POSIX threads (a.k.a. pthreads)
    ‣ Library based (only need to link a library)
    ‣ Low-level and general-purpose API for creating multi-threaded programs
    ‣ Part of POSIX -- works on Linux, macOS, FreeBSD, etc.
    OpenMP
    ‣ Compiler directive based (compiler needs to support OpenMP)
    ‣ Higher level abstraction than pthreads -- easier to program than pthreads.
    ‣ De-facto standard in scientific computing (works on most
    supercomputers)
    90


  91. pthreads
    91
    #include <pthread.h>
    #include <stdio.h>
    void *thread_func(void *arg)
    {
        printf("I'm a forked thread!\n");
        return NULL;
    }
    int main(void)
    {
        pthread_t thread;
        pthread_create(&thread, NULL, thread_func, NULL);
        printf("I'm the main thread!\n");
        pthread_join(thread, NULL);
        return 0;
    }
    [Diagram: the main thread forks a second thread with pthread_create; both threads call
    printf; the main thread then waits for the forked thread with pthread_join]
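    To build the example, pass -pthread to the compiler and linker; a typical command line
    (the file name is just an example):
    gcc -pthread -o pthread_example pthread_example.c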


  92. OpenMP
    A compiler directive-based shared memory programming model
    ‣ Compiler directives (#pragma omp ...) are used to tell the compiler to
    perform parallelization.
    ‣ Same source code works as serial version when treating the directives as
    comments (in most cases).
    ‣ Can incrementally increase the parallel portions during development.
    Recent versions introduced:
    ‣ Offloading to co-processors (GPUs)
    ‣ Exploiting SIMD parallelism
    92
    double a[N], b[N], c[N];
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
    }


  93. OpenMP: Parallel region
    Most basic OpenMP construct for parallelization
    ‣ Precede a block with #pragma omp parallel to create a parallel region.
    ‣ A parallel region runs in parallel on multiple
    threads (by default, same as the # of logical cores).
    ‣ In practice, OpenMP runtimes use thread pools
    to avoid launching new threads every time
    93
    func_a();
    #pragma omp parallel
    {
    func_b();
    }
    func_c();
    [Diagram: func_a() runs on a single thread; at the parallel region the runtime forks,
    func_b() runs on every thread, the threads join, and func_c() runs on a single thread again]


  94. OpenMP: Work sharing
    94
    #pragma omp parallel
    {
    for (int i = 0; i < 10; i++) {
    func_a(i);
    }
    }
    #pragma omp parallel
    {
    #pragma omp for
    for (int i = 0; i < 10; i++) {
    func_a(i);
    }
    }
    [Diagram: with #pragma omp parallel alone (left), every one of the four threads
    redundantly executes all iterations i=0…9; with #pragma omp for added (right), the
    ten iterations are divided among threads 0-3]


  95. OpenMP: Work sharing
    OpenMP construct for sharing work between threads
    ‣ The #pragma omp parallel construct runs the same block in parallel
    (which is not useful by itself).
    ‣ Need to divide the work and assign different work on different threads
    (work sharing) to achieve speedup.
    ‣ #pragma omp for divides the loop and assigns the sub-loops to different
    threads.
    95
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
    func_a();
    }
    Shorthand for:
    #pragma omp parallel
    {
    #pragma omp for
    for (...) {
    }
    }


  96. Numerical integration: first attempt
    Let's calculate pi using numerical integration
    ‣ We parallelize the integration loop using the #pragma omp parallel for work-sharing construct.
    ‣ However, this implementation gives an incorrect result. Why?
    96
    double sum = 0.0;
    double dx = 1.0 / N;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double x = i * dx;
        sum += sqrt(1 - x * x) * dx;
    }
    π/4 = ∫₀¹ √(1 − x²) dx ≈ Σ_{i=0}^{N−1} √(1 − (iΔx)²)·Δx
    [Figure: the quarter of the unit circle in the first quadrant; the integral is its area]


  97. Numerical integration: second attempt
    Attempt #1 was incorrect because of a data race
    ‣ Multiple threads are read/writing the same variable concurrently.
    ‣ We can use the #pragma omp critical directive to mark a block as a critical
    section and restrict its execution by one thread at a time.
    ‣ However, this version is slow because of the serialized execution.
    97
    double sum = 0.0;
    double dx = 1.0 / N;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
    double x = i * dx;
    #pragma omp critical
    sum += sqrt(1 - x * x) * dx;
    }
    [Diagram: the loop iterations are distributed among threads 0-3, but the updates to
    sum inside the critical section execute one thread at a time]


  98. Recap: Data race
    98
    Suppose you run the following code on two processors:
    i = 0;
    #pragma omp parallel
    {
        i += 1;
    }
    // What is i now?
    What you expect: thread 0 reads i=0 and writes 1, then thread 1 reads 1 and writes 2,
    so i ends up as 2.
    What (may) happen: both threads read i=0 before either has written, both write 1, and
    i ends up as 1.


  99. Numerical integration: third attempt
    OpenMP provides special help for this pattern: the reduction clause
    ‣ Add reduction(op:var) to a #pragma omp parallel for directive
    ‣ The variable var will be stored in thread-private storage and reduced using
    the operator op after the parallelized loop finishes.
    ‣ This version is both correct and fast.
    99
    double sum = 0.0;
    double dx = 1.0 / N;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        double x = i * dx;
        sum += dx * sqrt(1 - x * x);
    }
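    For reference, a complete runnable version of the reduction example (my own scaffolding
    around the slide's loop; N, the main() wrapper, and the final 4x factor are my additions):
    #include <math.h>
    #include <stdio.h>
    int main(void)
    {
        const int N = 100000000;
        const double dx = 1.0 / N;
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            double x = i * dx;
            sum += dx * sqrt(1 - x * x);
        }
        printf("pi ~= %.10f\n", 4.0 * sum);  // sum approximates pi/4
        return 0;
    }
    Compile with g++ -O3 -fopenmp (add -lm when building as C with gcc).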


  100. Hands-on: Parallelizing MatVec using OpenMP
    Start from the code you wrote in the previous hands-on
    ‣ Template: https://git.io/JI2DF
    Compile
    Run
    100
    g++ -O3 -fopenmp -march=native -Wall -std=c++11 -o matvec matvec.cpp
    ./matvec
    This flag enables OpenMP
    Add OpenMP directives and see the performance!


  101. Setting up gcc
    Linux
    1. Install gcc using your package manager (e.g. apt install g++).
    macOS
    1. Install the Xcode command line tools (xcode-select --install).
    2. Install Homebrew (https://brew.sh/).
    3. Install gcc via Homebrew (brew install gcc).
    4. Use g++-10 instead of g++ (which is actually clang!).
    Windows
    1. Install the Windows Subsystem for Linux (WSL).
    2. Install your favorite Linux distribution via Microsoft Store.
    3. Install gcc using your package manager.
    101


  102. Matrix-Vector multiplication (parallel outer loop)
    102
    [Diagram: y = A·x; the rows of A and the elements of y are split among threads 0-3,
    while all threads read the whole vector x]
    void run()
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++) {
                sum += A[i][j] * x[j];
            }
            y[i] = sum;
        }
    }


  103. Matrix-Vector multiplication (parallel inner loop)
    103
    [Diagram: y = A·x; for each row i, the columns of A and the elements of x are split
    among threads 0-3 and their partial sums are reduced into y[i]]
    void run()
    {
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int j = 0; j < N; j++) {
                sum += A[i][j] * x[j];
            }
            y[i] = sum;
        }
    }


  104. Comparison of runtime (N=10,000)
    104
    [Bar chart: runtime of the serial, parallel-outer-loop, and parallel-inner-loop
    versions for N=10,000 on a Core i9-10900K (Linux 5.4.0, g++ 9.3.0) and a Core i7-8559U
    (macOS 11.0.1, g++ 10.2.0); parallelizing the outer loop gives roughly 4.5x and 3.1x
    speedups, respectively]


  105. Hands-on + Assignment
    1. Implement a parallel matrix-matrix multiplication using OpenMP.
    ‣ You can start from my template: https://git.io/JLZYZ
    ‣ N >= 1000 is recommended
    2. Measure and report the attained FLOP/s and efficiency of your code.
    3. Discuss how you could improve the performance of your code
    further.
    105
    Attained FLOP/s = (# of FLOPs) / (runtime [s])
    (calculate the # of FLOPs by hand or use a profiler)
    Efficiency [%] = (Attained FLOP/s) / (Peak FLOP/s) · 100
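    For matrix-matrix multiplication the FLOP count is easy to do by hand: each of the N²
    output elements needs N multiplications and N additions, so the kernel performs 2N³ FLOPs.
    A worked example with a hypothetical runtime (the 0.5 s figure is made up):
    N = 1,000  ->  2 · 1,000³ = 2 · 10⁹ FLOPs
    Attained FLOP/s = 2 · 10⁹ / 0.5 s = 4 GFLOP/s
    Efficiency on a 592 GFLOP/s machine (slide 107) = 4 / 592 · 100 ≈ 0.7 %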


  106. Hands-on + Assignment
    How to submit:
    ‣ Open NRSS https://nrss.naist.jp/ and search for “Research Trends in
    Software Design and Analysis”.
    ‣ Title: Student Number-Your Name (e.g. 2012345-Keichi Takahashi)
    ‣ Body: Answer assignments 2 and 3.
    ‣ Attachment: Source code for assignment 1 (you may push your code to
    GitHub or Gist and provide a link instead).
    Deadline:
    ‣ December 31, 2020 23:59 (JST).
    106


  107. How to find the peak FLOP/s of your system
    107
    First of all, check what CPU you are running. Then, use the following equation to
    calculate the peak FLOP/s value:
    Peak FLOP/s = (# of FLOPs per cycle) · (# of cycles per second) · (# of cores in the system)
    Refer to this table for the FLOPs per cycle: https://en.wikichip.org/wiki/flops
    Use the base clock (not the turbo clock).
    Example: Intel Core i9-10900K: 592 GFLOP/s = 16 FLOPs · 3.70 GHz · 10 cores
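    The same formula applied to the AMD EPYC 7742 from earlier slides (assuming 16
    double-precision FLOPs per cycle per Zen 2 core, as listed in the wikichip table, and
    the 2.25 GHz base clock):
    2.3 TFLOP/s ≈ 16 FLOPs · 2.25 GHz · 64 cores (double precision)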


  108. Roofline model


  109. The roofline model answers these questions for you
    When am I done optimizing?
    ‣ My code took 100s to run before optimizing it. Now, after extensive
    optimization, my code runs in 10s. How far am I from the optimum?
    Why don’t I get the peak performance?
    ‣ My GPU should provide 14 TFLOPS but the profiler tells me that my code
    achieves only 1 TFLOPS. Why?
    109


  110. Example: vector addition
    110
    Suppose the following code:
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    [Diagram: a timeline; while the arithmetic for iteration i executes (computation),
    the loads of a[i+1] and b[i+1] and the store of c[i-1] are in flight (communication)]
    If computation and communication overlap, one of them will be the bottleneck.


  111. Roofline model
    The roofline model assumes that the performance is limited by either
    ‣ Computation (FLOP/s) or
    ‣ Communication (moving data to/from DRAM)
    111
    S. Williams et al., ”Roofline: an insightful visual performance model for multicore architectures."
    Communications of the ACM 52.4 (2009): 65-76.
    Runtime = max( (# of FLOPs) / (Peak FLOP/s) , (# of Bytes) / (Peak Bytes/s) )
    The first term is the computation time; the second term is the communication time.


  112. Roofline model
    112
    Runtime = max( (# of FLOPs) / (Peak FLOP/s) , (# of Bytes) / (Peak Bytes/s) )
    Dividing the # of FLOPs by the runtime gives the attainable performance:
    FLOP/s = (# of FLOPs) / Runtime = min( Peak FLOP/s , AI · Peak Bytes/s )
    Arithmetic Intensity (AI) = (# of FLOPs) / (# of Bytes):
    how many FLOPs a program can perform for each byte moved to/from memory.


  113. How to calculate the arithmetic intensity
    113
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    2 loads, 1 store, 1 floating-point operation
    AI = 1 / ((2 + 1) · 8) ≈ 0.042

    double a[N], b[N], sum = 0.0;
    for (int i = 0; i < N; i++) {
        sum += a[i] * a[i] + b[i] * b[i];
    }
    2 loads, 0 stores, 3 floating-point operations
    AI = 3 / (2 · 8) = 0.1875
    In practice, use hardware performance counters and measure
    on actual hardware. You can use Linux perf, Intel VTune, LIKWID, etc.


  114. Arithmetic intensity of real-world kernels
    114
    https://crd.lbl.gov/departments/computer-science/par/research/roofline/introduction/


  115. Visual representation of the roofline model
    115
    [Plot: FLOP/s vs. arithmetic intensity; a horizontal computational ceiling at the
    peak FLOP/s and a sloped bandwidth ceiling determined by the peak Bytes/s]


  116. Predicting performance using roofline analysis
    116
    [Plot: the roofline formed by the peak FLOP/s and peak Bytes/s ceilings; a kernel's
    arithmetic intensity determines where it hits the roof]
    Given the arithmetic intensity of a computational kernel and the
    compute and memory ceilings, we can calculate the attainable FLOP/s.
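    A worked example combining numbers that appear elsewhere in this deck (the vector
    addition with AI ≈ 0.042 from slide 113 and the EPYC 7742 ceilings from slide 118):
    Attainable FLOP/s = min( Peak FLOP/s , AI · Peak Bytes/s )
                      = min( 2.3 TFLOP/s , 0.042 · 330 GB/s )
                      ≈ 14 GFLOP/s
    far below the compute ceiling, so the kernel is memory-bound.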


  117. Memory-bound vs compute-bound
    117
    [Plot: the ridge point divides the roofline into a memory-bound region (left) and a
    compute-bound region (right)]
    Ridge point: 1-10 FLOPs/Byte on modern processors
    Any kernel to the left of the ridge point is memory-bound: its performance is limited
    by memory bandwidth. Any kernel to the right is compute-bound: its performance is
    limited by the FPUs.


  118. Ridge point on modern systems
    118
    Fujitsu A64FX: DP FLOPS = 2.7 TFLOPS, B/W = 1 TB/s, ridge AI = 2.7
    NVIDIA V100: DP FLOPS = 7 TFLOPS, B/W = 900 GB/s, ridge AI = 7.8
    AMD EPYC 7742: DP FLOPS = 2.3 TFLOPS, B/W = 330 GB/s, ridge AI = 7.7


  119. Guiding optimization via roofline analysis (1/2)
    119
    [Roofline plot with example kernels]
    Kernels sitting on the computational ceiling are highly optimized and cannot be
    optimized further.
    Kernels sitting on the bandwidth ceiling are also highly optimized; you may be able
    to increase their AI via algorithmic optimization or improved data reuse.


  120. Guiding optimization via roofline analysis (2/2)
    120
    [Roofline plot with example kernels below both ceilings]
    Kernels below the roofline are optimization opportunities! Try the optimization
    techniques in your hand.
