
Introduction to software optimization

Keichi Takahashi
December 19, 2020


These are the slides for my special lecture series on software optimization I taught in fall 2020 at NAIST. It covers the basics of software optimization ranging from basic computing architecture, performance analysis tools, memory optimization, parallelization, roofline analysis, and more.


Transcript

  1. Research Trends in Software Design and Analysis
    Dr. Keichi Takahashi
    Software Design and Analysis Laboratory
    Nara Institute of Science and Technology


  2. What you get from this lecture
    2
    This lecture will teach you what it takes to write
    FAST code on modern computing systems.


  3. Logistics
    Overall
    ‣ Four lectures in total: 4th hour (15:10–16:40) on December 8, 10, 15, 17
    ‣ Hands-on exercises included, so bring your laptops!
    Grading
    ‣ 100% mini-project report
    ‣ Mini-project will be assigned in the last lecture
    Ask questions!
    ‣ Feel free to interrupt during class and ask questions
    ‣ Or ask via email: [email protected]
    3


  4. Recommended readings
    4
    Randal Bryant and David O’Hallaron,
    “Computer Systems: A Programmer's
    Perspective”, Pearson
    John L. Hennessy and David A.
    Patterson, “Computer Architecture: A
    Quantitative Approach”, Morgan
    Kaufmann.


  5. (My) Golden rules for optimization
    Understand your hardware and software
    ‣ Know what your hardware is capable of.
    ‣ Profile your software and understand its bottlenecks.
    Minimize data movement across the memory hierarchy
    ‣ Modern computers have very deep memory hierarchies.
    ‣ Design memory layout and access pattern to maximize data reuse.
    Utilize parallelism from all levels of the computer
    ‣ Modern computers have many levels of parallelism.
    ‣ Exploit all levels of parallelism in your computer(s).
    5


  6. Know your hardware and software


  7. Know your hardware


  8. FLOP/s (Floating-point Operations per Second)
    A measure of computing
    performance
    ‣ especially appreciated in scientific
    computing.
    ‣ traditionally used for measuring the
    performance of supercomputers.
    When measuring/reporting FLOP/s
    ‣ Beware of the precision (double,
    single, half, etc.)!
    ‣ Take extra care for transcendental
    functions (sqrt, sin, exp, etc.) and
    divisions.
    8
    double x[N], y[N];
    for (int i = 0; i < N; i++) {
    y[i] += a * x[i];
    }
    How many floating-point operations does this code perform?


  9. Examples
    9
    AMD EPYC 7742
    DP: 2.3 TFLOPS
    SP: 4.6 TFLOPS
    NVIDIA V100
    DP: 8.2 TFLOPS
    SP: 16.4 TFLOPS
    Intel Xeon Platinum 8284
    DP: 1.5 TFLOPS
    SP: 3.0 TFLOPS
    Intel i7-8559U
    DP: 170 GFLOPS
    SP: 340 GFLOPS
    Intel i9-9900K
    DP: 589 GFLOPS
    SP: 1.2 TFLOPS


  10. Moore’s law
    Moore’s law predicts that:
    ‣ The number of transistors in a chip
    doubles every two years.
    ‣ Notice that he did not predict that the
    performance doubles.
    Moore’s law is ending soon
    ‣ Most researchers predict that Moore’s law will end within the next 10 years.
    ‣ Exciting exotic devices and novel architectures are being investigated for
    Post-Moore computing (but they will not be covered in this lecture).
    10
    Gordon Moore


  11. Dennard scaling
    Dennard observed the following scaling at every technology generation
    (scaling factor k, where 1/k ≈ 1/√2 ≈ 0.7):
    11
    Transistor dimension: 1/k
    Voltage: 1/k
    Capacitance: 1/k
    Frequency: k
    Power consumption: 1/k²
    Power density: 1
    R. H. Dennard et al., "Design of ion-implanted MOSFET's with very small physical dimensions,"
    in IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256-268, Oct. 1974.
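    Putting these factors together shows why the power density stays constant. A short
    derivation (my own summary, using the dynamic power formula P ∝ C·V²·f):
    Power per transistor: P ∝ C·V²·f ∝ (1/k)·(1/k)²·(k) = 1/k²
    Area per transistor:  A ∝ (1/k)² = 1/k²
    Power density:        P/A = constant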


  12. End of Dennard scaling
    Dennard ignored the leakage current and threshold voltage
    ‣ Leakage current ࿙Εిྲྀ: Transistors leak a small current when they are
    turned off. This current increases as the insulation thickness decreases.
    ‣ Threshold voltage ᮢ஋ిѹ: Voltage where the transistor turns on.
    The power density started increasing
    ‣ Can’t increase the frequency anymore!
    ‣ How are we going to use the extra
    transistors we can put in our chip?
    12


  13. Single-thread performance has plateaued
    13


  14. The free lunch is over [1]
    Until ~2005, Dennard Scaling was doing well
    ‣ Frequency and single-thread performance were steadily increasing.
    ‣ Just wait for new hardware to make your program run faster (free lunch).
    Around 2005 (with the Pentium 4), Dennard scaling failed
    ‣ Frequency and single-thread performance have plateaued.
    ‣ Multi-thread performance is still increasing.
    ‣ Software needs to be written to exploit the parallelism (no free lunch).
    14
    1 http://www.gotw.ca/publications/concurrency-ddj.htm


  15. Two oxen vs 1,024 chickens
    15
    If you were plowing a field, which
    would you rather use: two strong oxen
    or 1,024 chickens?
    Seymour Cray


  16. Two oxen vs 1,024 chickens
    16
    We have no choice but to work with 1,024 chickens!


  17. Need to exploit multiple levels of parallelism
    Instruction-level parallelism
    ‣ A processor executes multiple independent instructions in parallel
    Data-level parallelism
    ‣ A processor executes the same instruction on multiple data (vector)
    Thread-level parallelism
    ‣ A multiprocessor comprises multiple processors that execute multiple
    programs in parallel
    Cluster-level parallelism
    ‣ A cluster comprises multiple computers that execute multiple programs
    in parallel
    17


  18. Example: AMD EPYC Rome 7742
    18
    Each core has 2 FMA (Fused
    Multiply-Add) pipelines
    64 cores
    1 CPU package
    Each pipeline can perform
    8 single-precision FLOPs in a cycle
    EPYC 7742 can perform 64*2*8*2=2,048 FLOPs in each cycle!
    In other words, you are only utilizing 0.05% of the total performance
    if you don’t use any of the parallelism available.
    See: https://en.wikichip.org/wiki/amd/epyc/7742
    https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
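    Multiplying the FLOPs per cycle by the clock frequency gives the peak FLOP/s. A rough
    sanity check (assuming the 2.25 GHz base clock quoted later in this deck for the EPYC 7742):
    2,048 FLOPs/cycle · 2.25 GHz ≈ 4.6 TFLOP/s single precision
    which matches the SP number on slide 9 (and half of it, ~2.3 TFLOP/s, in double precision).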


  19. Memory performance is lagging behind
    Memory performance could not keep up with CPU performance
    ‣ In the old days, memory ran faster than the CPU!
    ‣ Nowadays, CPU needs to wait (stall) for data to arrive from memory.
    19
    John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach”,
    Morgan Kaufmann.
    N.B. this is single-
    core performance


  20. Memory hierarchy
    Users want infinite amount of infinitely
    fast memory
    ‣ Fast memory (e.g. SRAM) is expensive
    ‣ Large memory (e.g. DRAM) is slow
    Form a hierarchy of memories
    ‣ Higher levels are faster but small
    ‣ Lower levels are slower but large
    ‣ Higher levels keep frequently accessed
    subset of the lower levels (caching)
    20
    [Pyramid diagram: Register, L1, L2, L3, DRAM from top to bottom; higher levels are
    faster but smaller, lower levels are larger but slower (a tradeoff)]


  21. Typical memory hierarchy in modern CPUs
    21
    [Diagram: each core has private L1 and L2 caches; cores on the same die share an
    L3 cache; multiple dies connect to DRAM]


  22. Latency numbers every programmer should know
    22
    What Latency Human scale
    L1 cache reference 0.5ns 0.5s
    Branch mispredict 5ns 5s
    L2 cache reference 7ns 7s
    Mutex lock/unlock 25ns 25s
    Main memory reference 100ns 1m 40s
    Compress 1K bytes with Zippy 3,000ns 50m
    Send 2 KB over 1 Gbps network 20,000ns 5h 33m
    Read 1 MB sequentially from memory 250,000ns 2d 21h 26m
    Read 1 MB sequentially from SSD 1,000,000ns 11d 13h 46m
    The Jeff Dean Facts http://www.informatika.bg/jeffdean
    The legendary Jeff Dean stories (in Japanese) https://qiita.com/umegaya/items/ef69461d6f4967d5c623


  23. AMD EPYC 7742 memory latency* and bandwidth†
    Latency
    ‣ Clock cycle is ~0.4ns (2.25 GHz).
    ‣ Missing one cache level increases
    the latency by a factor of 1.4-5x.
    ‣ If you miss all cache levels, core
    stalls for ~200 cycles!
    Bandwidth
    ‣ Becomes scarce at lower levels.
    ‣ Note that the bandwidth is shared
    among all cores — DRAM B/W is
    only 2.4 GB/s per core!
    23
    L1 cache: 1.7-3.5 ns, 6.6 TB/s
    L2 cache: 5 ns, 4.0 TB/s
    L3 cache (shared within a die): 17 ns, 1.6 TB/s
    DRAM: 83 ns, 153 GB/s
    *https://www.7-cpu.com/cpu/Zen2.html
    † Measured on actual hardware using Empirical Roofline Tool (ERT)


  24. Deep memory hierarchy
    Utilizing the cache is crucial for achieving “good” performance
    ‣ If the core misses a cache level, it needs to wait for the data to arrive from
    lower levels.
    ‣ Your code needs to access the memory in such a way that maximizes
    cache hit (i.e. cache-friendliness)
    How to write cache-friendly code?
    ‣ Understand how the cache works (cache capacity, cache line size, false
    sharing, bank conflicts, etc.)
    ‣ Exploit spatial and temporal locality
    24


  25. Why software optimization matters
    Individual CPU cores are not becoming faster anymore
    ‣ Just buying new hardware does not make your software run faster
    ‣ You need to parallelize your software if you want to utilize your hardware
    Memory hierarchy is deep
    ‣ Memory performance is not keeping up with compute performance
    ‣ You cannot get decent performance without effectively utilizing the cache
    25
    Your code does not run any faster on your expensive new hardware
    without optimization.


  26. Why can’t we build a 10,000-core CPU?
    Well, we can. Meet Cerebras CS-1.
    26
    Cerebras CS-1 vs. NVIDIA V100 (CS-1:V100 ratio):
    Die size: 46,225 mm² vs. 815 mm² (57x)
    # of cores: 400,000 vs. 5,120 (78x)
    On-chip memory (SRAM): 18 GB vs. 6 MB (3,000x)
    Memory bandwidth: 9 PB/s vs. 900 GB/s (10,000x)
    Power consumption: 20 kW vs. 300 W (67x)
    https://www.cerebras.net/product/
    Kamil Rocki et al., “Fast Stencil-Code Computation on a Wafer-Scale Processor”, SC20.


  27. Why can’t we build a 10,000-core CPU?
    27
    … if you can supply 20kW
    of power + cooling


  28. Know the bottleneck of your software


  29. Ask yourself if you really need to optimize
    29
    “The first rule of optimization is: Don’t do it.
    The second rule of optimization (for experts only) is: Don’t do it yet.”
    Michael A. Jackson


  30. Ask yourself if you really need to optimize
    30
    “Premature optimization is the root of all evil.”
    Tony Hoare (or Donald Knuth)


  31. Performance comes with a cost
    Optimizing a program takes time:
    ‣ If you are a researcher, focus on producing results and papers!
    ‣ If you are working at a company, optimization costs money. Buying an
    expensive server is probably cheaper than hiring a CS PhD.
    Optimizing a program reduces readability:
    ‣ Most optimization techniques reduce the readability of the source code.
    ‣ Low readability implies even more maintenance cost!
    ‣ Language and library abstractions to achieve both performance and
    readability are being studied but not quite there yet…
    31


  32. Consider these alternatives
    32
    NVIDIA DGX Station A100
    $200,000 (one-off payment)
    Performance gain: predictable
    Computer Science Ph.D.
    $100,000 (yearly payment)
    Performance gain: unknown


  33. Avoid reinventing the wheel
    There are plenty of high-quality, highly-optimized and well-
    maintained libraries out there!
    ‣ Basic linear algebra: BLAS and LAPACK implementations such as MKL,
    AOCL, ScaLAPACK, cuBLAS
    ‣ Solvers: PETSc, Trilinos, SuperLU, cuSOLVER
    ‣ FFT: FFTW, cuFFT
    ‣ File I/O: HDF5, ADIOS2, NetCDF
    ‣ General purpose: Eigen, Thrust, ArrayFire
    33
    If none of the existing libraries fits your needs, then consider
    writing your own code


  34. Focus on the bottleneck (hot-spot)
    The Pareto principle (80/20 rule) applies to optimization too
    ‣ Majority of the runtime is consumed by a very small fraction of the code
    ‣ Don’t waste your time by optimizing functions that don’t matter!
    ‣ If many functions equally contribute to the runtime, consider giving up
    34
    [Bar charts: three example breakdowns of the total runtime over functions A-D and the
    rest; in the first, a single function dominates the runtime, while in the others the
    time is spread more evenly across functions]


  35. Always measure your code
    Don’t ever optimize a code without measurement
    ‣ Measure before attempting any optimization to identify the bottleneck
    Measure, optimize, and then measure again
    ‣ Measure after optimization since optimization techniques can often make
    code slower
    35
    [Cycle: Measure (find the bottleneck) -> Optimize -> Measure (evaluate the effect of
    the optimization)]


  36. Profilers
    Open-source:
    ‣ perf: Linux’s builtin performance analysis tool. Recommended!
    ‣ gprof: GCC’s profiler. Pretty old and outdated; not recommended.
    Vendor-specific:
    ‣ Intel VTune and Advisor: Highly recommended on Intel CPUs.
    ‣ AMD uProf: Compared to VTune, not so great.
    ‣ NVIDIA Nsight and Nvprof: No other options if working on NVIDIA GPUs
    Language-specific:
    ‣ Python: cProfile and line_profiler
    ‣ Ruby: stackprof and ruby-prof
    36


  37. Intel VTune (Hot-spot Analysis)
    37


  38. Intel VTune (Hot-spot Analysis)
    38


  39. Intel VTune (Hot-spot Analysis)
    39


  40. Intel VTune (Microarchitecture Exploration)
    40


  41. Linux perf
    Linux’s builtin performance analysis tool since kernel 2.6
    ‣ Can profile the kernel space: useful when analyzing applications that
    spend significant time in the kernel space (e.g. I/O-intensive applications).
    ‣ Gives access to hardware performance counters: reveals hardware events
    such as cache misses, context switches, branch misses, etc.
    ‣ System-wide profiling: can profile all processes on the system. Useful
    when analyzing multiple processes interacting with one another.
    ‣ Kernel config (CONFIG_PERF_EVENTS) is enabled in most Linux
    distributions. Just need to install the user space utility (linux-tools-
    common on Ubuntu).
    41
    Official wiki: https://perf.wiki.kernel.org/index.php/Main_Page
    Brendan Gregg’s guide: http://www.brendangregg.com/perf.html


  42. perf record & report
    42
    Sample the number of hardware events (e.g. cache hits and misses) during
    execution of a program and report.
    perf record -e cycles ./stress-ng -c 10 -t 3
    List of HW events to sample Program to measure
    perf report


  43. perf stat
    43
    [keichi@keichi-ws stress-ng]$ perf stat -e cache-misses,cache-references ./stress-ng -c 10 -t 3
    stress-ng: info: [14643] dispatching hogs: 10 cpu
    stress-ng: info: [14643] successful run completed in 3.04s
    Performance counter stats for './stress-ng -c 10 -t 3':
    10,805,671 cache-misses # 0.640 % of all cache refs
    1,687,740,535 cache-references
    3.047260141 seconds time elapsed
    30.328126000 seconds user
    0.007996000 seconds sys
    Count number of hardware events (e.g. cache hits and misses) during
    execution of a program.
    perf stat -e cache-misses,cache-references ./stress-ng -c 10 -t 3
    List of HW events to count Program to measure


  44. perf list
    44
    Obtain a list of hardware events that are available on your platform.


  45. perf top
    45
    stress-ng -c 10
    stress-ng -S 10
    System-wide profiling tool (think of it as the top command on steroids)


  46. Strace: System call tracer
    46
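    The strace screenshots are not included in the transcript. As a minimal illustration
    (./matvec is just a placeholder for the program you want to trace):
    strace ./matvec                        # print every system call the program makes
    strace -c ./matvec                     # count system calls and print a summary table
    strace -e trace=openat,read ./matvec   # trace only the selected system calls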


  47. Strace: System call tracer
    47


  48. Timers
    Insert timers before and after the
    code you want to measure
    ‣ It’s simple and it always works —
    just like printf() debugging
    ‣ If you already know which part of
    the code is the bottleneck, very
    effective
    ‣ Beware of the pitfalls!
    48
    // Start timer
    double start = read_timer();
    // Code to measure
    for (…) {
    for (…) {

    }
    }
    // Stop timer
    double end = read_timer();
    // Calculate diff
    double runtime = end - start;


  49. Use high-precision timers
    Use these:
    ‣ clock_gettime(CLOCK_MONOTONIC, …)
    ‣ clock_gettime(CLOCK_PROCESS_CPUTIME_ID, …)
    ‣ std::chrono::steady_clock (C++11)
    Don’t use these:
    ‣ time() returns UNIX epoch seconds
    ‣ gettimeofday() deprecated in favor of clock_gettime()
    ‣ clock_gettime(CLOCK_REALTIME, …) can rewind if system time is
    changed (e.g. NTP)
    49
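    A minimal sketch of the read_timer() helper used on the Timers slide, assuming a POSIX
    system (clock_gettime with CLOCK_MONOTONIC, as recommended above):
    #include <time.h>
    // Returns a monotonic timestamp in seconds as a double.
    double read_timer(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }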


  50. Mind these pitfalls
    Pin your processes/threads to CPU cores
    ‣ When using clock_gettime(CLOCK_PROCESS_CPUTIME_ID, …) or
    RDTSC instructions
    Measure multiple times and perform warmup runs
    ‣ To see the variance of the runtime
    ‣ It’s common that the first few iterations are always slow (why?)
    … and even more pitfalls
    ‣ System calls and function calls have overheads
    ‣ Out-of-order execution and the compiler can reorder instructions
    ‣ Need barriers if measuring multi-thread or multi-process code
    50


  51. Memory optimization


  52. How to tame the deep memory hierarchy
    52
    Maximize the spatial and temporal locality of your
    memory accesses to improve cache efficiency.
    …but how?


  53. Locality
    53
    Cache assumes that your software exhibits spatial and temporal locality.
    Spatial locality: Items whose addresses are near one another tend to be referenced
    close together in time.
    Temporal locality: Recently accessed items are likely to be accessed again in the
    near future.
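    A small illustration of both kinds of locality (my own example, not from the slides):
    double sum_array(const double *a, int n)
    {
        double sum = 0.0;          // sum is reused on every iteration: temporal locality
        for (int i = 0; i < n; i++)
            sum += a[i];           // a[0], a[1], ... sit at adjacent addresses: spatial locality
        return sum;
    }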


  54. Hands-on: Matrix-vector multiplication
    54
    double A[N][N];
    double x[N], y[N];
    // Implement here
    Given an N x N matrix A and an N x 1 vector x, write a fast C code that
    multiplies A and x and stores the result to an N x 1 vector y.
    Example:
    [1 2 3]   [1]   [1*1 + 2*1 + 3*2]   [ 9]
    [4 5 6] * [1] = [4*1 + 5*1 + 6*2] = [21]
    [7 8 9]   [2]   [7*1 + 8*1 + 9*2]   [33]


  55. Setting up a C++ compiler
    Linux
    A. Install gcc or clang using your package manager.
    macOS
    A. Install the Xcode command line tools and use its bundled clang.
    B. Install gcc via Homebrew.
    Windows
    A. Install the Windows Subsystem for Linux (WSL).
    B. Install MinGW or Visual Studio.
    Online compiler (if none of the above works)
    A. Wandbox: https://wandbox.org/
    55


  56. Compiling and running the example
    Download the example
    ‣ Gist URL: https://git.io/JI2DF
    Compile
    Run
    56
    g++ -O3 -fno-tree-vectorize -Wall -std=c++11 -o matvec main.cpp
    ./matvec
    Now implement matrix-vector multiplication in run()
    (The “O” in -O3 is the letter O, not zero. The -fno-tree-vectorize flag is not needed
    if you’re using clang.)


  57. Which one is faster?
    57
    for (int i = 0; i < N; i++) {
    y[i] = 0.0;
    for (int j = 0; j < N; j++) {
    y[i] += A[i][j] * x[j];
    }
    }
    for (int i = 0; i < N; i++) {
    y[i] = 0.0;
    }
    for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
    y[i] += A[i][j] * x[j];
    }
    }
    A. B.


  58. Comparison of runtime (N=10,000)
    58
    [Bar chart: runtime of ver. A vs. ver. B for N=10,000 on a Core i9-10900K (g++ 9.3.0)
    and a Core i7-8559U (g++ 10.2.0); ver. A is roughly 3.9x and 6.6x faster, respectively]


  59. Memory access pattern
    59
    A. Row-major memory access:
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            y[i] += A[i][j] * x[j];
        }
    }
    B. Column-major memory access:
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            y[i] += A[i][j] * x[j];
        }
    }
    [Diagram: y = A·x; version A walks A row by row, version B walks A column by column]


  60. Memory layout: row-major order
    60
    1 2 3
    4 5 6
    7 8 9
    10 11 12
    Address Value
    0x0000 1
    0x0008 2
    0x0010 3
    0x0018 4
    0x0020 5
    0x0028 6
    0x0030 7
    0x0038 8
    0x0040 9
    0x0048 10
    0x0050 11
    0x0058 12
    Elements within the same row are contiguous on
    memory (C, C++, etc.)


  61. Memory layout: column-major order
    61
    1 2 3
    4 5 6
    7 8 9
    10 11 12
    Address Value
    0x0000 1
    0x0008 4
    0x0010 7
    0x0018 10
    0x0020 2
    0x0028 5
    0x0030 8
    0x0038 11
    0x0040 3
    0x0048 6
    0x0050 9
    0x0058 12
    Elements within the same column are contiguous
    on memory (Fortran, Julia, numpy, etc.)


  62. A exhibits better spatial locality than B
    62
    A. accesses the elements of A at consecutive addresses: contiguous memory access,
    spatial locality is high.
    B. accesses the elements of A with a stride of one row: strided memory access,
    spatial locality is low.




  63. Why contiguous memory access is faster
    Cache line is the smallest unit of data movement between cache levels
    ‣ A cache line is 64B (i.e. 8 doubles or 16 floats) on recent Intel/AMD CPUs.
    ‣ Even when reading a single byte, the CPU reads the whole cache line that
    includes that byte from memory and caches to L1-L3 cache.
    ‣ Use the whole data within in a cache line if possible.
    63
    [Diagram: memory is divided into 64 B cache lines (8 consecutive doubles each); on a
    miss, the whole cache line containing the requested byte is moved from DRAM through
    L2 into L1]


  64. Why contiguous memory access is faster (cont’d)
    CPU prefetches cache lines
    ‣ CPU predicts future memory accesses and moves cache lines in advance.
    ‣ How “smart” the prefetcher is depends on the CPU, but it generally fails to
    predict complex/irregular access patterns.
    ‣ Prefetching can be explicitly triggered from software (software prefetching), but
    then the programmer needs to reason about the latency of instructions and memory
    accesses.
    64
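    GCC and clang expose software prefetching through the __builtin_prefetch builtin. A
    minimal sketch (the distance of 16 elements is an arbitrary illustrative value, not a rule):
    void scale(double *y, const double *x, double a, int n)
    {
        for (int i = 0; i < n; i++) {
            // Hint to start loading an element needed by a future iteration.
            __builtin_prefetch(&x[i + 16], 0 /* read */, 3 /* keep in cache */);
            y[i] = a * x[i];
        }
    }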


  65. Measuring cache hit ratio using perf
    65
    A. Row-major memory access: 1 out of 8 loads from A misses the cache w/o prefetching.
    B. Column-major memory access: 8 out of 8 loads from A miss the cache w/o prefetching.


  66. Measuring cache hit ratio using perf
    66
    A. Row-major memory access
    B. Column-major memory access


  67. Hands-on: Matrix-matrix multiplication
    67
    double A[N][N], B[N][N], C[N][N];
    // Implement here
    Given two N x N matrices A and B, write a C code that multiplies A and B
    and stores the result to an N x N matrix C.
    Example:
    [1 2]   [5 6]   [(1*5 + 2*7) (1*6 + 2*8)]   [19 22]
    [3 4] * [7 8] = [(3*5 + 4*7) (3*6 + 4*8)] = [43 50]


  68. Basic algorithm and naming of loop indices
    68
    for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
    C[i][j] = 0.0;
    for (int k = 0; k < N; k++) {
    C[i][j] += A[i][k] * B[k][j];
    }
    }
    }
    [Diagram: C = A·B; i indexes the rows of C and A, j indexes the columns of C and B,
    and k runs along the columns of A and the rows of B]


  69. Analysis of inner-most loop for version ijk (jik)
    69
    for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
    double sum = 0.0;
    for (int k = 0; k < N; k++) {
    sum += A[i][k] * B[k][j];
    }
    C[i][j] = sum;
    }
    }
    Read a row from A and a
    column from B. Calculate
    their inner product and
    write to C.
    2 loads (A and B)
    0 store
    A misses: 0.125 (=1/8)
    B misses: 1 (=8/8)
    C misses: 0
    Total misses: 1.125 per
    inner-most loop iteration


  70. Analysis of inner-most loop for version jki (kji)
    70
    for (int j = 0; j < N; j++) {
    for (int k = 0; k < N; k++) {
    double r = B[k][j];
    for (int i = 0; i < N; i++) {
    C[i][j] += A[i][k] * r;
    }
    }
    }
    Read an element from B and
    multiply it with a column in
    A. Accumulate the result to a
    column in C.
    2 loads (A and C)
    1 store (C)
    A misses: 1 (=8/8)
    B misses: 0
    C misses: 1 (=8/8)
    Total misses: 2 per inner-
    most loop iteration


  71. Analysis of inner-most loop for version kij (ikj)
    71
    for (int k = 0; k < N; k++) {
    for (int i = 0; i < N; i++) {
    double r = A[i][k];
    for (int j = 0; j < N; j++) {
    C[i][j] += r * B[k][j];
    }
    }
    }
    Read an element from A and
    multiply it with a row in B.
    Accumulate the result to a
    row in C.
    2 loads (B and C)
    1 store (C)
    A misses: 0
    B misses: 0.125 (=1/8)
    C misses: 0.125 (=1/8)
    Total misses: 0.25 per
    inner-most loop iteration


  72. Benchmark result
    72
    Reference: Randal Bryant and David O’Hallaron, “Computer Systems: A Programmer's
    Perspective”, Pearson
    Simply interchanging
    loops makes 40x
    difference!


  73. Gravitational N-body problem
    73
    d²x_i/dt² = Σ_{j=1}^{N} G·m_j·(x_j − x_i) / |x_j − x_i|³
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i] += calc(x[i], x[j]);
        }
    }
    [Diagram: particles x1…x5; the arrow from x1 toward x2 represents the term
    G·m_2·(x_2 − x_1)/|x_2 − x_1|³]


  74. Analysis of spatial and temporal locality
    We look into the memory accesses in the inner most loop:
    ‣ x[i] and x[j] are read contiguously: high spatial locality
    ‣ a[i] and x[i] are accessed in every inner-most loop iteration: high temporal locality
    ‣ each x[j] is re-accessed only once every N inner-most loop iterations: low temporal locality
    Cache capacity is limited (L1 < 100KB, L2 < 1MB)
    ‣ The cache “forgets” (evicts) data that is referenced rarely (LRU policy)
    ‣ How can we increase the temporal locality of x[j]?
    74
    for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
    a[i] += calc(x[i], x[j]);
    }
    }


  75. Increasing temporal locality using cache blocking
    75
    Original:
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i] += calc(x[i], x[j]);
    Each x[j] is re-accessed every N inner-most loop iterations.

    After applying cache blocking:
    for (int j = 0; j < N; j += BLOCK)
        for (int i = 0; i < N; i++)
            for (int jj = 0; jj < BLOCK; jj++)
                a[i] += calc(x[i], x[j + jj]);
    Each x[j + jj] is re-accessed every BLOCK inner-most loop iterations, and only BLOCK
    elements of x are touched in between, so they stay in cache (better temporal locality!).

    [Diagram: iteration space of the original loop (N=8) and the blocked loop (N=8, BLOCK=4)]
    https://software.intel.com/content/www/us/en/develop/articles/cache-blocking-techniques.html


  76. The memory mountain
    76
    Reference: Randal Bryant and David O’Hallaron, “Computer Systems: A Programmer's
    Perspective”, Pearson


  77. Conclusion
    Memory optimization matters.
    ‣ Choosing the right memory layout and access pattern can improve performance by
    orders of magnitude.
    ‣ Memory optimization is more relevant than ever because of the deep memory
    hierarchy in modern computer architectures.
    Time complexity analysis ignores the constant factor.
    ‣ An O(n²) implementation can be faster than an O(n) implementation for practical input sizes.
    ‣ Two implementations with the same time complexity can have significant performance
    differences (e.g. iterating over an array vs. iterating over a linked list are both O(n).
    Which one is faster?)
    77


  78. Parallelization


  79. Recommended reading
    79
    Victor Eijkhout, "Introduction to High
    Performance Scientific Computing",
    Lulu.com, 2012.
    Victor Eijkhout, "Parallel Programming in MPI
    and OpenMP", 2017.
    Both books are available online for free!


  80. Parallel programming meme
    80


  81. Parallel programming meme
    81


  82. Basic strategy for parallelization
    Step 1. Find the bottleneck of your program
    ‣ Use the profiling tools we discussed earlier in this class.
    Step 2. Find the right decomposition of your work
    ‣ Decomposed tasks need to be independent of one another (except for
    special cases, i.e. reduction).
    ‣ If multiple decompositions are possible, choose the most coarse-grained
    decomposition to reduce the parallelization overhead.
    Step 3. Implement parallelism
    ‣ Use the programming model (OpenMP, MPI, CUDA, etc.) that suits your
    hardware.
    82


  83. Predicting the speedup of a parallel program
    83
    Suppose a program takes T_s to run serially and a fraction p (0 ≤ p ≤ 1) of the program
    can be parallelized: p·T_s can be parallelized and (1 − p)·T_s cannot.
    The runtime of the same program running in parallel on n processors is:
    T_p = (p/n)·T_s + (1 − p)·T_s
    [Diagram: the serial bar is split into p·T_s and (1 − p)·T_s; in the parallel bar the
    parallelizable part shrinks to (p/n)·T_s while (1 − p)·T_s stays the same]


  84. Amdahl’s Law
    84
    The parallel speedup is the ratio between T_s and T_p:
    T_s / T_p = T_s / ((p/n)·T_s + (1 − p)·T_s) = 1 / (p/n + (1 − p))
    Let’s find the limit of the speedup as n approaches infinity:
    lim_{n→∞} T_s / T_p = lim_{n→∞} 1 / (p/n + (1 − p)) = 1 / (1 − p)
    No matter how many processors you have, this is the attainable maximum speedup!


  85. Example
    85
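    The example slide is an image that is not reproduced in the transcript. As an
    illustrative calculation with made-up numbers: if p = 0.95 of a program can be
    parallelized, then
    Speedup on n = 8 processors: 1 / (0.95/8 + 0.05) ≈ 5.9x
    Maximum speedup (n → ∞):     1 / (1 − 0.95) = 20x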


  86. Need to exploit multiple levels of parallelism
    Instruction-level parallelism
    ‣ A processor executes multiple independent instructions in parallel
    Data-level parallelism
    ‣ A processor executes the same instruction on multiple data (vector)
    Thread-level parallelism
    ‣ A multiprocessor comprises multiple processors that execute multiple
    programs in parallel
    Cluster-level parallelism
    ‣ A cluster comprises multiple computers that execute multiple programs
    in parallel
    86


  87. Shared memory parallelism
    Processors share the same memory
    ‣ e.g. Cores within a package or computer
    Pros
    ‣ Easier programming thanks to the unified address space.
    ‣ Sharing data between processors is fast.
    Cons
    ‣ Scalability is limited due to
    access contention and false sharing.
    ‣ Prone to nasty bugs such as data
    races.
    87
    Memory
    Processor Processor Processor


  88. Distributed memory parallelism
    Each processor has its own memory
    ‣ e.g. Cluster of computers interconnected via a network
    Pros
    ‣ No access contention since each processor has its own dedicated memory.
    ‣ Scalable to millions of processors.
    Cons
    ‣ Need to explicitly describe
    communication between processors.
    ‣ Exchanging data between processors
    is slower than shared memory
    systems.
    88
    Processor
    Memory
    Processor
    Memory
    Processor
    Memory
    Network


  89. Hybrid approaches
    Shared memory programming on distributed memory system
    (i.e. distributed shared memory)
    ‣ Middleware hides communication between processors.
    ‣ e.g. PGAS (Partitioned Global Address Space) programming models
    Distributed memory programming on shared memory system
    ‣ Partition the shared memory into distributed memories.
    ‣ e.g. MPI on a single-node
    Hybrid parallelism
    ‣ Combine shared and distributed memory programming.
    ‣ e.g. MPI+OpenMP: OpenMP to parallelize within a single node and MPI to
    parallelize across multiple nodes
    89
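    A minimal MPI+OpenMP sketch of the hybrid style (my own illustration; typically built
    with something like mpicc -fopenmp and launched with one MPI rank per node):
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);                  // distributed memory: one process per node
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        #pragma omp parallel                     // shared memory: threads within a node
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
        MPI_Finalize();
        return 0;
    }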


  90. Shared memory parallel programming
    POSIX threads (a.k.a. pthreads)
    ‣ Library based (only need to link a library)
    ‣ Low-level and general-purpose API for creating multi-threaded programs
    ‣ Part of POSIX -- works on Linux, macOS, FreeBSD, etc.
    OpenMP
    ‣ Compiler directive based (compiler needs to support OpenMP)
    ‣ Higher level abstraction than pthreads -- easier to program than pthreads.
    ‣ De-facto standard in scientific computing (works on most
    supercomputers)
    90


  91. pthreads
    91
    #include <pthread.h>
    #include <stdio.h>
    void *thread_func(void *arg)
    {
        printf("I'm a forked thread!\n");
        return NULL;
    }
    int main(void)
    {
        pthread_t thread;
        pthread_create(&thread, NULL, thread_func, NULL);
        printf("I'm the main thread!\n");
        pthread_join(thread, NULL);
        return 0;
    }
    [Diagram: the main thread forks a second thread with pthread_create; both threads call
    printf; the main thread then waits for the forked thread with pthread_join]
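    To build the example, pass -pthread to the compiler and linker; a typical command line
    (the file name is just an example):
    gcc -pthread -o pthread_example pthread_example.c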


  92. OpenMP
    A compiler directive-based shared memory programming model
    ‣ Compiler directives (#pragma omp ...) are used to tell the compiler to
    perform parallelization.
    ‣ Same source code works as serial version when treating the directives as
    comments (in most cases).
    ‣ Can incrementally increase the parallel portions during development.
    Recent versions introduced:
    ‣ Offloading to co-processors (GPUs)
    ‣ Exploiting SIMD parallelism
    92
    double a[N], b[N], c[N];
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
    }


  93. OpenMP: Parallel region
    Most basic OpenMP construct for parallelization
    ‣ Precede a block with #pragma omp parallel to create a parallel region.
    ‣ A parallel region runs in parallel on multiple
    threads (by default, same as the # of logical cores).
    ‣ In practice, OpenMP runtimes use thread pools
    to avoid launching new threads every time
    93
    func_a();
    #pragma omp parallel
    {
    func_b();
    }
    func_c();
    [Diagram: func_a() runs on a single thread; at the parallel region the runtime forks,
    func_b() runs on every thread, the threads join, and func_c() runs on a single thread again]


  94. OpenMP: Work sharing
    94
    #pragma omp parallel
    {
    for (int i = 0; i < 10; i++) {
    func_a(i);
    }
    }
    #pragma omp parallel
    {
    #pragma omp for
    for (int i = 0; i < 10; i++) {
    func_a(i);
    }
    }
    [Diagram: with #pragma omp parallel alone (left), every one of the four threads
    redundantly executes all iterations i=0…9; with #pragma omp for added (right), the
    ten iterations are divided among threads 0-3]


  95. OpenMP: Work sharing
    OpenMP construct for sharing work between threads
    ‣ The #pragma omp parallel construct runs the same block in parallel
    (which is not useful by itself).
    ‣ Need to divide the work and assign different work on different threads
    (work sharing) to achieve speedup.
    ‣ #pragma omp for divides the loop and assigns the sub-loops to different
    threads.
    95
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
    func_a();
    }
    Shorthand for:
    #pragma omp parallel
    {
    #pragma omp for
    for (...) {
    }
    }


  96. Numerical integration: first attempt
    Let's calculate pi using numerical integration
    ‣ We parallelize the integration loop using the #pragma omp parallel for work-sharing construct.
    ‣ However, this implementation gives an incorrect result. Why?
    96
    double sum = 0.0;
    double dx = 1.0 / N;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double x = i * dx;
        sum += sqrt(1 - x * x) * dx;
    }
    π/4 = ∫₀¹ √(1 − x²) dx ≈ Σ_{i=0}^{N−1} √(1 − (iΔx)²)·Δx
    [Figure: the quarter of the unit circle in the first quadrant; the integral is its area]


  97. Numerical integration: second attempt
    Attempt #1 was incorrect because of a data race
    ‣ Multiple threads are read/writing the same variable concurrently.
    ‣ We can use the #pragma omp critical directive to mark a block as a critical
    section and restrict its execution by one thread at a time.
    ‣ However, this version is slow because of the serialized execution.
    97
    double sum = 0.0;
    double dx = 1.0 / N;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
    double x = i * dx;
    #pragma omp critical
    sum += sqrt(1 - x * x) * dx;
    }
    [Diagram: the loop iterations are distributed among threads 0-3, but the updates to
    sum inside the critical section execute one thread at a time]


  98. Recap: Data race
    98
    Suppose you run the following code on two processors:
    i = 0;
    #pragma omp parallel
    {
        i += 1;
    }
    // What is i now?
    What you expect: thread 0 reads i=0 and writes 1, then thread 1 reads 1 and writes 2,
    so i ends up as 2.
    What (may) happen: both threads read i=0 before either has written, both write 1, and
    i ends up as 1.


  99. Numerical integration: third attempt
    OpenMP provides special help for this pattern: the reduction clause
    ‣ Add reduction(op:var) to a #pragma omp parallel for directive
    ‣ The variable var will be stored in thread-private storage and reduced using
    the operator op after the parallelized loop finishes.
    ‣ This version is both correct and fast.
    99
    double sum = 0.0;
    double dx = 1.0 / N;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        double x = i * dx;
        sum += dx * sqrt(1 - x * x);
    }
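    For reference, a complete runnable version of the reduction example (my own scaffolding
    around the slide's loop; N, the main() wrapper, and the final 4x factor are my additions):
    #include <math.h>
    #include <stdio.h>
    int main(void)
    {
        const int N = 100000000;
        const double dx = 1.0 / N;
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            double x = i * dx;
            sum += dx * sqrt(1 - x * x);
        }
        printf("pi ~= %.10f\n", 4.0 * sum);  // sum approximates pi/4
        return 0;
    }
    Compile with g++ -O3 -fopenmp (add -lm when building as C with gcc).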


  100. Hands-on: Parallelizing MatVec using OpenMP
    Start from the code you wrote in the previous hands-on
    ‣ Template: https://git.io/JI2DF
    Compile
    Run
    100
    g++ -O3 -fopenmp -march=native -Wall -std=c++11 -o matvec matvec.cpp
    ./matvec
    This flag enables OpenMP
    Add OpenMP directives and see the performance!


  101. Setting up gcc
    Linux
    1. Install gcc using your package manager (e.g. apt install g++).
    macOS
    1. Install the Xcode command line tools (xcode-select --install).
    2. Install Homebrew (https://brew.sh/).
    3. Install gcc via Homebrew (brew install gcc).
    4. Use g++-10 instead of g++ (which is actually clang!).
    Windows
    1. Install the Windows Subsystem for Linux (WSL).
    2. Install your favorite Linux distribution via Microsoft Store.
    3. Install gcc using your package manager.
    101


  102. Matrix-Vector multiplication (parallel outer loop)
    102
    [Diagram: y = A·x; the rows of A and the elements of y are split among threads 0-3,
    while all threads read the whole vector x]
    void run()
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++) {
                sum += A[i][j] * x[j];
            }
            y[i] = sum;
        }
    }


  103. Matrix-Vector multiplication (parallel inner loop)
    103
    [Diagram: y = A·x; for each row i, the columns of A and the elements of x are split
    among threads 0-3 and their partial sums are reduced into y[i]]
    void run()
    {
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int j = 0; j < N; j++) {
                sum += A[i][j] * x[j];
            }
            y[i] = sum;
        }
    }


  104. Comparison of runtime (N=10,000)
    104
    [Bar chart: runtime of the serial, parallel-outer-loop, and parallel-inner-loop
    versions for N=10,000 on a Core i9-10900K (Linux 5.4.0, g++ 9.3.0) and a Core i7-8559U
    (macOS 11.0.1, g++ 10.2.0); parallelizing the outer loop gives roughly 4.5x and 3.1x
    speedups, respectively]


  105. Hands-on + Assignment
    1. Implement a parallel matrix-matrix multiplication using OpenMP.
    ‣ You can start from my template: https://git.io/JLZYZ
    ‣ N >= 1000 is recommended
    2. Measure and report the attained FLOP/s and efficiency of your code.
    3. Discuss how you could improve the performance of your code
    further.
    105
    Attained FLOP/s = (# of FLOPs) / (runtime [s])
    (calculate the # of FLOPs by hand or use a profiler)
    Efficiency [%] = (Attained FLOP/s) / (Peak FLOP/s) · 100
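    For matrix-matrix multiplication the FLOP count is easy to do by hand: each of the N²
    output elements needs N multiplications and N additions, so the kernel performs 2N³ FLOPs.
    A worked example with a hypothetical runtime (the 0.5 s figure is made up):
    N = 1,000  ->  2 · 1,000³ = 2 · 10⁹ FLOPs
    Attained FLOP/s = 2 · 10⁹ / 0.5 s = 4 GFLOP/s
    Efficiency on a 592 GFLOP/s machine (slide 107) = 4 / 592 · 100 ≈ 0.7 %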


  106. Hands-on + Assignment
    How to submit:
    ‣ Open NRSS https://nrss.naist.jp/ and search for “Research Trends in
    Software Design and Analysis”.
    ‣ Title: Student Number-Your Name (e.g. 2012345-Keichi Takahashi)
    ‣ Body: Answer assignments 2 and 3.
    ‣ Attachment: Source code for assignment 1 (you may push your code to
    GitHub or Gist and provide a link instead).
    Deadline:
    ‣ December 31, 2020 23:59 (JST).
    106


  107. How to find the peak FLOP/s of your system
    107
    First of all, check what CPU you are running. Then, use the following equation to
    calculate the peak FLOP/s value:
    Peak FLOP/s = (# of FLOPs per cycle) · (# of cycles per second) · (# of cores in the system)
    Refer to this table for the FLOPs per cycle: https://en.wikichip.org/wiki/flops
    Use the base clock (not the turbo clock).
    Example: Intel Core i9-10900K: 592 GFLOP/s = 16 FLOPs · 3.70 GHz · 10 cores
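    The same formula applied to the AMD EPYC 7742 from earlier slides (assuming 16
    double-precision FLOPs per cycle per Zen 2 core, as listed in the wikichip table, and
    the 2.25 GHz base clock):
    2.3 TFLOP/s ≈ 16 FLOPs · 2.25 GHz · 64 cores (double precision)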


  108. Roofline model


  109. The roofline model answers these questions for you
    When am I done optimizing?
    ‣ My code took 100s to run before optimizing it. Now, after extensive
    optimization, my code runs in 10s. How far am I from the optimum?
    Why don’t I get the peak performance?
    ‣ My GPU should provide 14 TFLOPS but the profiler tells me that my code
    achieves only 1 TFLOPS. Why?
    109


  110. Example: vector addition
    110
    Suppose the following code:
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    [Diagram: a timeline; while the arithmetic for iteration i executes (computation),
    the loads of a[i+1] and b[i+1] and the store of c[i-1] are in flight (communication)]
    If computation and communication overlap, one of them will be the bottleneck.


  111. Roofline model
    The roofline model assumes that the performance is limited by either
    ‣ Computation (FLOP/s) or
    ‣ Communication (moving data to/from DRAM)
    111
    S. Williams et al., ”Roofline: an insightful visual performance model for multicore architectures."
    Communications of the ACM 52.4 (2009): 65-76.
    Runtime = max( (# of FLOPs) / (Peak FLOP/s) , (# of Bytes) / (Peak Bytes/s) )
    The first term is the computation time; the second term is the communication time.


  112. Roofline model
    112
    Runtime = max( (# of FLOPs) / (Peak FLOP/s) , (# of Bytes) / (Peak Bytes/s) )
    Dividing the # of FLOPs by the runtime gives the attainable performance:
    FLOP/s = (# of FLOPs) / Runtime = min( Peak FLOP/s , AI · Peak Bytes/s )
    Arithmetic Intensity (AI) = (# of FLOPs) / (# of Bytes):
    how many FLOPs a program can perform for each byte moved to/from memory.


  113. How to calculate the arithmetic intensity
    113
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    2 loads, 1 store, 1 floating-point operation
    AI = 1 / ((2 + 1) · 8) ≈ 0.042

    double a[N], b[N], sum = 0.0;
    for (int i = 0; i < N; i++) {
        sum += a[i] * a[i] + b[i] * b[i];
    }
    2 loads, 0 stores, 3 floating-point operations
    AI = 3 / (2 · 8) = 0.1875
    In practice, use hardware performance counters and measure
    on actual hardware. You can use Linux perf, Intel VTune, LIKWID, etc.


  114. Arithmetic intensity of real-world kernels
    114
    https://crd.lbl.gov/departments/computer-science/par/research/roofline/introduction/


  115. Visual representation of the roofline model
    115
    [Plot: FLOP/s vs. arithmetic intensity; a horizontal computational ceiling at the
    peak FLOP/s and a sloped bandwidth ceiling determined by the peak Bytes/s]


  116. Predicting performance using roofline analysis
    116
    [Plot: the roofline formed by the peak FLOP/s and peak Bytes/s ceilings; a kernel's
    arithmetic intensity determines where it hits the roof]
    Given the arithmetic intensity of a computational kernel and the
    compute and memory ceilings, we can calculate the attainable FLOP/s.
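    A worked example combining numbers that appear elsewhere in this deck (the vector
    addition with AI ≈ 0.042 from slide 113 and the EPYC 7742 ceilings from slide 118):
    Attainable FLOP/s = min( Peak FLOP/s , AI · Peak Bytes/s )
                      = min( 2.3 TFLOP/s , 0.042 · 330 GB/s )
                      ≈ 14 GFLOP/s
    far below the compute ceiling, so the kernel is memory-bound.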


  117. Memory-bound vs compute-bound
    117
    [Plot: the ridge point divides the roofline into a memory-bound region (left) and a
    compute-bound region (right)]
    Ridge point: 1-10 FLOPs/Byte on modern processors
    Any kernel to the left of the ridge point is memory-bound: its performance is limited
    by memory bandwidth. Any kernel to the right is compute-bound: its performance is
    limited by the FPUs.


  118. Ridge point on modern systems
    118
    Fujitsu A64FX: DP FLOPS = 2.7 TFLOPS, B/W = 1 TB/s, ridge AI = 2.7
    NVIDIA V100: DP FLOPS = 7 TFLOPS, B/W = 900 GB/s, ridge AI = 7.8
    AMD EPYC 7742: DP FLOPS = 2.3 TFLOPS, B/W = 330 GB/s, ridge AI = 7.7


  119. Guiding optimization via roofline analysis (1/2)
    119
    [Roofline plot with example kernels]
    Kernels sitting on the computational ceiling are highly optimized and cannot be
    optimized further.
    Kernels sitting on the bandwidth ceiling are also highly optimized; you may be able
    to increase their AI via algorithmic optimization or improved data reuse.


  120. Guiding optimization via roofline analysis (2/2)
    120
    [Roofline plot with example kernels below both ceilings]
    Kernels below the roofline are optimization opportunities! Try the optimization
    techniques in your hand.
