Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Molecular Dynamics Method: Theory and Implementation / MD2019 Implementations and Optimization

A10e41b0a61d59f2258d7f6172c33479?s=47 kaityo256
PRO
December 20, 2019

Molecular Dynamics Method: Theory and Implementation / MD2019 Implementations and Optimization

Molecular Dynamics Method: Theory and Implementation
「分子動力学法の理論と実装」
金沢大学集中講義 2019年 12月18日~20日

A10e41b0a61d59f2258d7f6172c33479?s=128

kaityo256
PRO

December 20, 2019
Tweet

Transcript

  1. 1 Implementations and Optimization Molecular Dynamics Method: Theory and Implementation

    Department of Applied Physics and Physico-Informatics, Keio University Hiroshi Watanabe
  2. 2 for (int i = 0; i < N -

    1; i++){ for (int j = i + 1; j < N; j++){ calculate_force(i, j); } } i-atom j-atom
  3. 3 https://qiita.com/kaityo256/items/2356fff922938ae3c87c 分子動力学法 qiita I wrote a review to implemente

    MD (written in Japanese)
  4. 4 Implementing MD is EASY! But optimizations are… 1. Architecture

    of Computer 2. Memory Optimization 3. Architecture Dependent Optimization
  5. 5 Architecture of Computer What you should know before optimization

  6. 6 CPU Memory ALU/FPU Data Store to memory Bottleneck is

    Memory Fetch from memory Computer is a machine which 1. fetches data from memory 2. calculates something on ALU/FPU 3. stores the results to memory
  7. 7 Latency Time from requesting data to receiving data Cache

    access: ~ 10 cycle Memory access: 100 ~ 1000 cycle Throughput Rate to fetch the data from memory to CPU Bytes/Flops (B/F) : Ratio of memory bandwidth and computing power Random access to memory severely degrade the performance
  8. 8 Decoder Execution Unit Cache Memory Fetch the data and

    instruction, pass to the execution unit, and store the results. 3 ~ 6 cycles are required for one floating point operation Pipeline Performance (Flops) Frequency = Belt conveyor Still 6 cycles for one calculation, but 1006 cycles for 1000 calculations. → Roughly one cycle for one calculation
  9. 9 Multi instructions in a single cycle But who should

    do it? Pipeline allows us to achieve one calculation in one cycle. CPU Frequency = Performance Increase in frequency of CPU has been saturated. http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/fulltext
  10. 10 Data Fetch Dependency Check Fetch multiple data and instructions,

    pass them to many FPUs simultaneously FPU FPU Dependency check becomes hopelessly difficult Backward Compatibility Superscalar
  11. 11 Compiler arranges data and instructions No check Simple hardware

    Dependency check is unnecessary. VLIW (Very long instruction word) Loose backward compatibility Extremely clever compiler is required
  12. 12 A programmer arranges the data No check Calculate 2

    ~ 8 operations simultaneously Current architecture: Superscalor (2) + SIMD (4 ~ 8) Simple Hardware Backward Compatibility SIMD (Single Instruction Multiple Data) Programming is very hard. Auto vectorization is still not effective.
  13. 13 Memory Optimization

  14. 14 Interaction in long distance is negligibly small. Cutoff length

  15. 15 Pairlist List of atoms whose distance are within cutoff

    length Simple Implementation O(N^2) → Achieve O(N) by grid search Grid Search Divide the simulation box into small cells. Exclusive Grid Method Hard disks High density Non-Exclusive Grid method Soft Core Low density At most one atom for each cell Many atoms in each cell
  16. 16 Data Structure Multi dimensional array int GridParticleNumber[4]; Number of

    atoms in each cell int GridParticleIndex[4][10]; Atom Indeces 0 1 2 3 1 4 9 7 8 5 12 6 11 10 0 2 3 13 GridParticleIndex is sparse → Cache miss 1 4 9 7 5 8 6 0 2 3 10 13 12 11 GridParticleIndex[i][j] Index of j-th atom in i-th cell
  17. 17 Integer Sort Algorithm with O(N) 1. Count the number

    of keys: O(N) 2. Find the position where the data will be stored. 3. Store the data: O(N) 3,4,2,3,5,5,1,2,1,3 1: 2 2: 2 3: 3 4: 1 5: 2 1 2 3 4 5 Very useful algorithm for optimization
  18. 18 One-dimensional Array Implementation 1. Prepare the array with size

    N 2. Count the number of atoms in each cell 3. Place the pointer of each cell 4. Put an atom and increment the pointer 5. Repeat above for all atoms 0 1 2 3 1 4 9 7 8 5 12 6 11 10 0 2 3 13 0 1 2 3 0 1 2 3 0 1 2 0 1 2 3 0 1 2 3 10 13 4 9 5 7 8 12 6 11 Completed Array Densely packed (High Cache Efficiency) (所属セル番号が主キー、粒子番号が副キーのソート)
  19. 19 Construction of pairlist is heavy task. → Reuse the

    pairlist with margin Cutoff Search Length Margin Margin = Lifetime of pairlist Decrease the lifetime by Vmax * h (h is a timestep) Construct pairlist when lifetime < 0 Cutoff Margin Worst Case On-head collision of the fastest atoms Expiration Check Pairlist contains the pair which are actually out of interaction → Conditional branch in the inner most loop
  20. 20 Force Calculation Calculating force with pairlist Store the pair

    in two arrays PairList Arrays of Pair 1 0 0 2 3 3 1 1 2 3 0 2 9 2 1 7 9 5 2 4 8 4 5 6 1 9 0 2 0 1 2 7 3 9 3 5 1 2 1 4 2 8 3 4 0 5 2 6 i-atom j-atom i-atom j-atom Fetch two atoms’ positions (48 bytes) Store two atoms’ momenta (48 bytes)
  21. 21 Grouping by i-atoms Pairlist 1 9 0 2 0

    1 2 7 3 9 3 5 1 2 1 4 2 8 3 4 0 5 2 6 0 1 2 5 Grouped by i-atoms 1 2 5 2 4 9 6 7 8 4 5 9 Sorted List 1 2 4 9 2 6 7 8 3 4 5 9 Array Representation 0 1 2 3 Information of i-atoms are stored on registers. Fetch j-atoms’ positions and store j-atoms’ momenta only i-atom j-atom i粒子 j粒子 Counting Sort by i-atoms
  22. 22 Close in simulation box, far in memory → High

    Cache-miss rate 0 1 2 3 1 4 9 7 8 5 12 6 11 10 0 2 3 13 0 1 2 3 0 2 1 4 5 3 6 7 8 12 9 13 10 11 0 1 2 3 0 1 2 3 10 13 4 9 5 7 8 12 6 11 9 0 10 11 12 13 1 2 3 4 5 6 7 8 Renumbering
  23. 23 L2 Cache (256 KB) L3 Cache (8 MB) With

    Renumbering Without Renumbering Atoms Without Renumbering:Degradation depending on cache size With Renumbering: Performance is virtually depend on # of atoms L2 L3 Higher is Better
  24. 24 Fetch Data Calc r^2 Calc Force Store Data Fetch

    Data Calc r^2 Calc Force Store Data Loop 1 Loop N ・・・ Memory Access Calculation Dependency Fetch Data Calc r^2 Calc Force Store Data Fetch Data Calc r^2 Calc Force Store Data Fetch Data Calc r^2 ・・・ independent Loop 1 Calc Force Store Data Loop N-1 Improving IPC (Instructions per Cycle)
  25. 25 Architecture dependent Optimization

  26. 26 const int N = 100000; void func(double a[N], double

    b[N]){ for(int i=0;i<N;i++){ if(a[i] < 0.0) a[i] +=b[i]; } } Add b[i] to a[i] when a[i] < 0 L1: (1) compare a[i] and 0 (2) if a[i] > 0 then jump to L2 (3) a[i] = a[i] + b[i] L2: (4) i = i + 1 (5) jump to L1 when i < N Pseudo Assembly Code This kind of code can be very inefficient. Since an instruction to be executed next cannot be determined until the comparison completed.
  27. 27 void func2(double a[N], double b[N]){ for(int i=0;i<N;i++){ double tmp

    = b[i]; if(a[i] >= 0.0) tmp = 0.0; a[i] += tmp; } } const int N = 100000; void func(double a[N], double b[N]){ for(int i=0;i<N;i++){ if(a[i] < 0.0) a[i] +=b[i]; } } Before After L1: (1) Store b[i] to tmp (2) if a[i] > 0 then tmp = 0 (3) Move tmp to a[i] (4) i = i + 1 (5) jump to L1 when i < N Conditional branch involving jump is eliminated.
  28. 28 const int pn = particle_number; for (int i =

    0; i < pn; i++) { const int kp = pointer[i]; const int np = number_of_partners[i]; const double qix = q[i][X]; const double qiy = q[i][Y]; const double qiz = q[i][Z]; double pix = 0.0; double piy = 0.0; double piz = 0.0; for (int k = 0; k < np; k++) { const int j = sorted_list[kp + k]; double dx = q[j][X] - qix; double dy = q[j][Y] - qiy; double dz = q[j][Z] - qiz; double r2 = (dx * dx + dy * dy + dz * dz); if (r2 > CL2) continue; double r6 = r2 * r2 * r2; double df = ((24.0 * r6 - 48.0) / (r6 * r6 * r2)) * dt; pix += df * dx; piy += df * dy; piz += df * dz; p[j][X] -= df * dx; p[j][Y] -= df * dy; p[j][Z] -= df * dz; } p[i][X] += pix; p[i][Y] += piy; p[i][Z] += piz; } for i-atoms for j-atoms calculate distance if r2 > cutoff then continue calculate force store momenta end for end for
  29. 29 const int pn = particle_number; for (int i =

    0; i < pn; i++) { const int kp = pointer[i]; const int np = number_of_partners[i]; const double qix = q[i][X]; const double qiy = q[i][Y]; const double qiz = q[i][Z]; double pix = 0.0; double piy = 0.0; double piz = 0.0; for (int k = 0; k < np; k++) { const int j = sorted_list[kp + k]; double dx = q[j][X] - qix; double dy = q[j][Y] - qiy; double dz = q[j][Z] - qiz; double r2 = (dx * dx + dy * dy + dz * dz); //if (r2 > CL2) continue; double r6 = r2 * r2 * r2; double df = ((24.0 * r6 - 48.0) / (r6 * r6 * r2)) * dt; if (r2 > CL2) df = 0.0; pix += df * dx; piy += df * dy; piz += df * dz; p[j][X] -= df * dx; p[j][Y] -= df * dy; p[j][Z] -= df * dz; } p[i][X] += pix; p[i][Y] += piy; p[i][Z] += piz; } for i-atoms for j-atoms calculate distance calculate force if r2 > cutoff then f=0 store momenta end for end for
  30. 30 0 1 2 3 4 5 6 7 8

    条件分岐削除なし 条件分岐削除あり Without elimination With elimination Execution time was reduced by 30% 119164 LJ atoms, Density 1.0, Cutoff 3.0 Xeon E5-2680 v3 @ 2.50GHz Lower is better Execution Time
  31. 31 Theoretical Performance CPU Frequency Instructions per cycle operations per

    instructions = x x Single Instruction Multiple Data 1 5 3 2 3 2 12 9 3 10 36 18 X = Calculate packed data simultaneously by a single instruction Superscalar SIMD It has not increased since around 2000. reached to the limit possibility of development only here
  32. 32 Load/Store D C B A Memory Register D C

    B A Packing is Slow D C B A Memory Register Move is Fast D C B A Other SIMD Instructions D C B A B C A D Shuffle Scatter D C B A C C C C Blending H G F E D C B A H C F A Mask D 0 0 A D C B A ◦ × × ◦ Vector Calculation a1 a2 a3 a4 b1 b2 b3 b4 c1 c2 c3 c4 + =
  33. 33 SIMD strongly depends on the ISA (Instruction Set Architecture).

    → Write with assembly language (or intrinsic functions of C language) v4df vqj_a = _mm256_load_pd((double*)(q + j_a)); v4df vdq_a = (vqj_a - vqi); v4df vd1_a = vdq_a * vdq_a; v4df vd2_a = _mm256_permute4x64_pd(vd1_a, 201); v4df vd3_a = _mm256_permute4x64_pd(vd1_a, 210); v4df vr2_a = vd1_a + vd2_a + vd3_a; We call functions for Load/Store/Shuffle. We can write arithmetic operations and sbstitutions simply. Examples
  34. 34 Simple Grouping Software Pipelin Haswell Skylake ×1.14 ×1.24 ×1.44

    ×1.07 120,000 atoms, density: 1.0, cutoff 3.0 Time required to calculate 100 loops [ms] Consider the SIMD vectorization from here Simple Grouping Software Pipelin
  35. 35 • Unroll the loop by four times. • Calculate

    four j-atoms simultaneously. Simple Implementation Fetch four coordinates and store them to a register ※ Similar for y, z Packed four distances
  36. 36 What happened D C B A D C B

    A D C B A A Memory xmm0 D C B A B A xmm0 vmovsd vmovhpd D C B A C D C B A D C memory vmovsd vmovhpd B A ymm0 xmm1 xmm1 D C D C xmm1 (1) Fetch A (2) Fetch B (3) Fetch C (4) Fetch D (5) Data copy vinsertf128 ※ We have to do them for x, y, z coordinates Hopelessly slow Memory Register Memory
  37. 37 AoS with padding double q[N][4], p[N][4]; Z Y X

    Z Y X Z Y X Z Y X Why? Three coordinates (x,y,z) can be loaded to 256-bit register simultaneously. However, 64-bits are wasted. Z Y X Z Y X Z Y X Z Y X movupd/movapd ymm Z Y X ※Array of Structure
  38. 38 Z Y X i-atom Z Y X - j-atom

    dz dy dx Relative Coodinates dz1 dy1 dx1 dz2 dy2 dx2 dz3 dy3 dx3 dz4 dy4 dx4 calculate 4 pairs dx1 dx2 dx3 dx4 dy1 dy2 dy3 dy4 dz1 dz2 dz3 dz4 transpose r4 r3 r2 r1 Sum of squares packded distance v4df vdq_1 = vq_j1 - vq_i; v4df vdq_2 = vq_j2 - vq_i; v4df vdq_3 = vq_j3 - vq_i; v4df vdq_4 = vq_j4 - vq_i; v4df tmp0 = _mm256_unpacklo_pd(vdq_1, vdq_2); v4df tmp1 = _mm256_unpackhi_pd(vdq_1, vdq_2); v4df tmp2 = _mm256_unpacklo_pd(vdq_3, vdq_4); v4df tmp3 = _mm256_unpackhi_pd(vdq_3, vdq_4); v4df vdx = _mm256_permute2f128_pd(tmp0, tmp2, 0x20); v4df vdy = _mm256_permute2f128_pd(tmp1, tmp3, 0x20); v4df vdz = _mm256_permute2f128_pd(tmp0, tmp2, 0x31); v4df vr2 = vdx * vdx + vdy * vdy + vdz * vdz; Relative Coordinates Transpose Sum of squres
  39. 39 Pairlist contains the pair which are actually out of

    interaction (Bookkeeping Method) → Mask Operation D2 C2 B2 A2 D1 C1 B1 A1 N P P N src1 src2 mask D2 C1 B1 A2 vblendvpd: Choose elements by sign of mask Create mask by interaction length and cutoff length, store zero to force. 0 0 0 0 src1 src2 0 0 Pair A and D are outof interaction
  40. 40 AVX2 Haswell Skylake ×1.14 ×1.24 ×1.63 ×1.44 ×1.07 ×2.31

    SIMD vectorization worked very effectively on Skylake. Simple Grouping Software Pipelin AVX2 Simple Grouping Software Pipelin 120,000 atoms, density: 1.0, cutoff 3.0 Time required to calculate 100 loops [ms]
  41. 41 gather/scatter D C B A Memory Register D C

    B A Packing is hopelessly slow with AVX2 H G F E D C Memory Register H G F E We can gather 8 elements from memory simultaneously with AVX-512 D C B A B A We also have a scatter instruction.
  42. 42 KNL Skylake In KNL, it is faster to leave

    it to the compiler than to use AVX2 Benchmark on KNL AVX2 Simple Grouping Software Pipelin AVX2 Simple Grouping Software Pipelin
  43. 43 for (int k = 0; k < np; k++)

    { const int j = sorted_list2[kp + k]; double dx = q[j][X] - qix; double dy = q[j][Y] - qiy; double dz = q[j][Z] - qiz; double r2 = (dx * dx + dy * dy + dz * dz); double r6 = r2 * r2 * r2; double df = ((24.0 * r6 - 48.0) / (r6 * r6 * r2)) * dt; if (r2 > CL2) df = 0.0; pfx += df * dx; pfy += df * dy; pfz += df * dz; p[j][X] -= df * dx; p[j][Y] -= df * dy; p[j][Z] -= df * dz; } 1. Unroll loop by 8 times 2. Fetch 8 indicies with vmovdqu32 3. gather coordinates and calculate force 4. Collision identification with vpconflictd 5. Scatter if there are no collision Inner most loop i-atom j-atoms Compiler does not know there are no collision.
  44. 44 Write gather/scatter by hand Compiler added the codes for

    collision detection. Since we know there are no collision, we can eliminate them. ×1.93 ×1.09 Performance improvement is still unsatisfactory... Simple Grouping Write gather/scatterby hand Compiler Write by hand
  45. 45 Change the Data Structure AoS Z Y X Z

    Y X Z Y X Z Y X SoA Z Z Z Z Z Y Y Y Y Y X X X X X Remainder Loop Elimination (RLE) We can perform 8 loops simultaneously. Then the reminder loop is up to 7. → Eliminate the remainder loop by masking. Software Pipelining (SWP) Improve IPC by software pipelining optimization. Data becomes continuous. (Bit-shift for gather/scatter becomes unnecessary)
  46. 46 ×1.93 ×1.09 ×1.21 ×1.09 ×1.32 SoA+RLE SoA+ RLE+SWP Results

    Performance is improved by 30% Simple Grouping Write gather/scatterby hand
  47. 47 Z Z Z Z Z Y Y Y Y

    Y X X X X X SoA VZ VZ VZ VZ VZ VY VY VY VY VY VX VX VX VX VX Coordinates Momenta Cache Line Size (512 Byte) AoS 8-packed 0 VZ VY VX 0 Z Y X 0 VZ VY VX 0 Z Y X Information of one atom Cache Line Size (512 Byte) Cache Line Size (512 Byte) Change the data structure from SoA to AoS 8-packed Information of one atom matches to cache line size. 75% usage of cache line
  48. 48 SoA+ RLE+ SWP AoS 8 + RLE+SWP ×1.93 ×1.09

    ×1.21 ×1.09 ×1.05 ×1.39 51% faster than automatic vectorization of compiler Results ×1.51 SoA+RLE Simple Grouping Write gather/scatterby hand
  49. 49 ×1.22 ×1.28 ×1.39 ×1.03 Simple Grouping AVX-512: Write gather/scatterby

    hand AVX-512: AoS 8 + RLE + SWP AVX2: AoS 8 + RLE + SWP Skylake (Xeon) On Skylake (Xeon), AVX2 is faster than AVX-512. Compiler did not use gather/scatter.
  50. 50 Explicit SIMD vectorization may give us substantial increase in

    performance. Effective SIMD vectorization involves change of data structure SIMD vectorization strongly depends on architecture. → Different code for different architecture