Slide 1

Implementations and Optimization
Molecular Dynamics Method: Theory and Implementation
Hiroshi Watanabe
Department of Applied Physics and Physico-Informatics, Keio University

Slide 2

The naive double loop over all i-atom/j-atom pairs:

    for (int i = 0; i < N - 1; i++) {
      for (int j = i + 1; j < N; j++) {
        calculate_force(i, j);
      }
    }

Slide 3

I wrote a review on implementing MD (in Japanese): 「分子動力学法」 ("Molecular Dynamics Method") on Qiita
https://qiita.com/kaityo256/items/2356fff922938ae3c87c

Slide 4

Implementing MD is EASY! But optimizations are…

1. Architecture of Computer
2. Memory Optimization
3. Architecture-Dependent Optimization

Slide 5

Architecture of Computer

What you should know before optimization

Slide 6

A computer is a machine which
1. fetches data from memory,
2. calculates something on the ALU/FPU, and
3. stores the results to memory.

The bottleneck is memory.

Slide 7

Latency: the time from requesting data to receiving it.
- Cache access: ~10 cycles
- Memory access: 100 ~ 1000 cycles

Throughput: the rate at which data is fetched from memory to the CPU.
- Bytes/Flops (B/F): the ratio of memory bandwidth to computing power.

Random access to memory severely degrades performance.

Slide 8

The CPU fetches data and instructions from cache/memory, decodes them, passes them to the execution unit, and stores the results. 3 ~ 6 cycles are required for one floating-point operation.

Pipeline (a belt conveyor): one calculation still takes 6 cycles, but 1000 calculations take only 1006 cycles → roughly one cycle per calculation, i.e., Performance (Flops) ≈ Frequency.

Slide 9

The pipeline allows us to achieve one calculation per cycle: CPU frequency = performance. However, the increase in CPU frequency has saturated.
http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/fulltext

→ Execute multiple instructions in a single cycle. But who should find the instructions that can be executed together?

Slide 10

Superscalar: fetch multiple data and instructions and pass them to many FPUs simultaneously; the hardware checks the dependencies.
- Pro: backward compatibility.
- Con: the dependency check becomes hopelessly difficult.

Slide 11

VLIW (Very Long Instruction Word): the compiler arranges the data and instructions, so no run-time dependency check is necessary.
- Pros: no check, simple hardware.
- Cons: loses backward compatibility; an extremely clever compiler is required.

Slide 12

SIMD (Single Instruction Multiple Data): the programmer arranges the data, so no dependency check is needed; 2 ~ 8 operations are calculated simultaneously.
- Pros: no check, simple hardware, backward compatibility.
- Cons: programming is very hard; auto-vectorization is still not effective.

Current architecture: superscalar (2) + SIMD (4 ~ 8).

Slide 13

Memory Optimization

Slide 14

Interaction at long distance is negligibly small → introduce a cutoff length.

Slide 15

Pairlist: a list of atom pairs whose distance is within the cutoff length. The simple implementation is O(N^2) → achieve O(N) by grid search.

Grid search: divide the simulation box into small cells (see the sketch below).
- Exclusive grid method: at most one atom per cell (hard disks, high density).
- Non-exclusive grid method: many atoms in each cell (soft cores, low density).
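A minimal sketch of the cell assignment, assuming a cubic box of side L and a cell size of at least the search length, so that every interacting pair lies in the same or a neighboring cell (all names and values here are illustrative, not from the original code):

    const double L = 50.0;                   // box length (assumed)
    const double SEARCH_LENGTH = 3.3;        // cutoff + margin (assumed)
    const int M = (int)(L / SEARCH_LENGTH);  // cells per axis; cell size L/M >= SEARCH_LENGTH

    int cell_index(double x, double y, double z) {
      // periodic wrapping omitted for brevity
      int ix = (int)(x / (L / M));
      int iy = (int)(y / (L / M));
      int iz = (int)(z / (L / M));
      return ix + (iy + iz * M) * M;         // flattened 3D cell index
    }

The pair search then only has to examine each cell and its neighbors, which is O(N) at fixed density.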

Slide 16

Data structure: a multi-dimensional array.

    int GridParticleNumber[4];     // number of atoms in each cell
    int GridParticleIndex[4][10];  // atom indices: GridParticleIndex[i][j] is
                                   // the index of the j-th atom in the i-th cell

GridParticleIndex is sparse → cache misses.

Slide 17

Integer sort: an algorithm with O(N) cost (counting sort).
1. Count the number of each key: O(N).
2. Find the position where each datum will be stored.
3. Store the data: O(N).

Example: for the keys 3, 4, 2, 3, 5, 5, 1, 2, 1, 3 the counts are 1: 2, 2: 2, 3: 3, 4: 1, 5: 2.

A very useful algorithm for optimization.
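A minimal sketch of counting sort in C, using the keys from this slide (names are illustrative; keys are assumed to run from 1 to K):

    #include <stdio.h>

    void counting_sort(const int *key, int *sorted, int n, int K) {
      int count[K + 1], pos[K + 1];
      for (int k = 0; k <= K; k++) count[k] = 0;
      for (int i = 0; i < n; i++) count[key[i]]++;      // 1. count each key: O(N)
      pos[1] = 0;                                       // 2. start position of each key
      for (int k = 2; k <= K; k++) pos[k] = pos[k - 1] + count[k - 1];
      for (int i = 0; i < n; i++) sorted[pos[key[i]]++] = key[i];  // 3. store: O(N)
    }

    int main(void) {
      int key[] = {3, 4, 2, 3, 5, 5, 1, 2, 1, 3};       // the example above
      int sorted[10];
      counting_sort(key, sorted, 10, 5);
      for (int i = 0; i < 10; i++) printf("%d ", sorted[i]);  // 1 1 2 2 3 3 3 4 5 5
      return 0;
    }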

Slide 18

One-dimensional array implementation (a sort with the cell index as the primary key and the atom index as the secondary key):
1. Prepare an array of size N.
2. Count the number of atoms in each cell.
3. Place a pointer at the start of each cell.
4. Put an atom and increment the pointer.
5. Repeat the above for all atoms.

The completed array is densely packed (high cache efficiency); see the sketch below.
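A minimal sketch of this construction, reusing the counting-sort idea (illustrative names: cell_of[i] is the cell of atom i, cell_start[c] the offset of cell c in the packed array, atoms_sorted the completed array):

    void build_cell_list(const int *cell_of, int n, int ncell,
                         int *cell_start, int *atoms_sorted) {
      for (int c = 0; c <= ncell; c++) cell_start[c] = 0;
      for (int i = 0; i < n; i++) cell_start[cell_of[i] + 1]++;           // 2. count atoms per cell
      for (int c = 0; c < ncell; c++) cell_start[c + 1] += cell_start[c]; // 3. place the pointers
      int pointer[ncell];
      for (int c = 0; c < ncell; c++) pointer[c] = cell_start[c];
      for (int i = 0; i < n; i++)
        atoms_sorted[pointer[cell_of[i]]++] = i;                          // 4.-5. put each atom
    }

Scanning i in ascending order keeps atoms within each cell ordered by index, which is exactly the primary-key/secondary-key sort described above.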

Slide 19

Construction of the pairlist is a heavy task → reuse the pairlist by searching with a margin: search length = cutoff + margin.

The margin defines the lifetime of the pairlist. Decrease the lifetime by Vmax * h every step (h is the timestep; the worst case is a head-on collision of the fastest atoms), and reconstruct the pairlist when the lifetime < 0.

Expiration check: the pairlist now contains pairs which are actually out of interaction range → a conditional branch in the innermost loop.
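A minimal sketch of this bookkeeping (illustrative names; following the slide, vmax is taken as the worst-case closing speed of a pair, i.e. the head-on collision above):

    static double lifetime = 0.0;   // reset to the margin whenever the pairlist is built

    int pairlist_expired(double vmax, double h) {
      lifetime -= vmax * h;         // the fastest pair may have closed this much
      return (lifetime < 0.0);      // if expired, reconstruct the pairlist
    }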

Slide 20

Force calculation with the pairlist: store the pairs in two arrays, one for the i-atoms and one for the j-atoms. For every pair we fetch the two atoms' positions (48 bytes) and store the two atoms' momenta (48 bytes).

Slide 21

Grouping by i-atoms: apply a counting sort to the pairlist with the i-atom as the key, and keep the result as a sorted array of j-atoms plus, for each i-atom, the number of its partners. The information of the i-atom then stays in registers; we fetch only the j-atoms' positions and store only the j-atoms' momenta.

Slide 22

Renumbering: atoms that are close in the simulation box may be far apart in memory → a high cache-miss rate. Renumber the atoms along the cell-sorted order so that atoms close in space are also close in memory.
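A minimal sketch of the renumbering (illustrative names; atoms_sorted is the cell-sorted array built earlier, q holds the positions):

    #include <string.h>

    enum { N = 100000 };
    double q[N][3], q_new[N][3];

    void renumber(const int *atoms_sorted, int n) {
      for (int i = 0; i < n; i++)   // new index = rank in the cell-sorted order
        memcpy(q_new[i], q[atoms_sorted[i]], sizeof(double) * 3);
      memcpy(q, q_new, sizeof(double) * 3 * n);
      // do the same for momenta and every other per-atom array,
      // then rebuild the cell lists and the pairlist with the new indices
    }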

Slide 23

[Benchmark: performance vs. number of atoms, with the L2 cache (256 KB) and L3 cache (8 MB) sizes marked; higher is better.]

Without renumbering: performance degrades as the working set exceeds each cache size (L2, L3).
With renumbering: performance is virtually independent of the number of atoms.

Slide 24

Improving IPC (Instructions Per Cycle): each loop iteration is the dependency chain Fetch Data → Calc r^2 → Calc Force → Store Data, alternating memory access and calculation. Successive iterations are independent, however, so the memory access of one iteration can overlap with the calculation of the previous one (software pipelining); see the sketch below.
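Not the MD kernel itself, but a minimal illustration of the same idea: splitting one loop-carried dependency chain into two independent ones lets the pipeline work on both at once (all names are illustrative; n is assumed even for brevity):

    double sum_pairwise(const double *a, int n) {
      double sum0 = 0.0, sum1 = 0.0;
      for (int i = 0; i < n; i += 2) {
        sum0 += a[i];       // chain 0
        sum1 += a[i + 1];   // chain 1, independent of chain 0
      }
      return sum0 + sum1;
    }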

Slide 25

Architecture-Dependent Optimization

Slide 26

    const int N = 100000;
    void func(double a[N], double b[N]) {
      for (int i = 0; i < N; i++) {
        if (b[i] < 0.0) a[i] += b[i];
      }
    }

Pseudo assembly code:

    L1:
    (1) load b[i]
    (2) if b[i] > 0 then jump to L2
    (3) a[i] = a[i] + b[i]
    L2:
    (4) i = i + 1
    (5) jump to L1 when i < N

This kind of code can be very inefficient, since the instruction to be executed next cannot be determined until the comparison has completed.

Slide 27

Rewrite the loop body so that the compiler can use a conditional move instead of a jump:

    void func2(double a[N], double b[N]) {
      for (int i = 0; i < N; i++) {
        double tmp = b[i];
        if (tmp >= 0.0) tmp = 0.0;
        a[i] += tmp;
      }
    }

Pseudo assembly code:

    L1:
    (1) load b[i] to tmp
    (2) if tmp > 0 then tmp = 0   (conditional move)
    (3) a[i] = a[i] + tmp
    (4) i = i + 1
    (5) jump to L1 when i < N

The conditional branch involving a jump is eliminated.

Slide 28

The grouped force-calculation kernel, with the expiration check as a branch:

    const int pn = particle_number;
    for (int i = 0; i < pn; i++) {
      const int kp = pointer[i];
      const int np = number_of_partners[i];
      const double qix = q[i][X];
      const double qiy = q[i][Y];
      const double qiz = q[i][Z];
      double pix = 0.0;
      double piy = 0.0;
      double piz = 0.0;
      for (int k = 0; k < np; k++) {
        const int j = sorted_list[kp + k];
        double dx = q[j][X] - qix;
        double dy = q[j][Y] - qiy;
        double dz = q[j][Z] - qiz;
        double r2 = (dx * dx + dy * dy + dz * dz);
        if (r2 > CL2) continue;
        double r6 = r2 * r2 * r2;
        double df = ((24.0 * r6 - 48.0) / (r6 * r6 * r2)) * dt;
        pix += df * dx;
        piy += df * dy;
        piz += df * dz;
        p[j][X] -= df * dx;
        p[j][Y] -= df * dy;
        p[j][Z] -= df * dz;
      }
      p[i][X] += pix;
      p[i][Y] += piy;
      p[i][Z] += piz;
    }

Structure:
for i-atoms
  for j-atoms
    calculate distance
    if r2 > cutoff then continue
    calculate force
    store momenta
  end for
end for

Slide 29

The same kernel with the branch eliminated: always calculate the force, then zero it for pairs beyond the cutoff.

    const int pn = particle_number;
    for (int i = 0; i < pn; i++) {
      const int kp = pointer[i];
      const int np = number_of_partners[i];
      const double qix = q[i][X];
      const double qiy = q[i][Y];
      const double qiz = q[i][Z];
      double pix = 0.0;
      double piy = 0.0;
      double piz = 0.0;
      for (int k = 0; k < np; k++) {
        const int j = sorted_list[kp + k];
        double dx = q[j][X] - qix;
        double dy = q[j][Y] - qiy;
        double dz = q[j][Z] - qiz;
        double r2 = (dx * dx + dy * dy + dz * dz);
        //if (r2 > CL2) continue;
        double r6 = r2 * r2 * r2;
        double df = ((24.0 * r6 - 48.0) / (r6 * r6 * r2)) * dt;
        if (r2 > CL2) df = 0.0;
        pix += df * dx;
        piy += df * dy;
        piz += df * dz;
        p[j][X] -= df * dx;
        p[j][Y] -= df * dy;
        p[j][Z] -= df * dz;
      }
      p[i][X] += pix;
      p[i][Y] += piy;
      p[i][Z] += piz;
    }

Structure:
for i-atoms
  for j-atoms
    calculate distance
    calculate force
    if r2 > cutoff then df = 0
    store momenta
  end for
end for

Slide 30

[Benchmark: execution time without vs. with branch elimination; lower is better. 119,164 LJ atoms, density 1.0, cutoff 3.0, Xeon E5-2680 v3 @ 2.50 GHz.]

Execution time was reduced by 30%.

Slide 31

Theoretical performance = CPU frequency × instructions per cycle × operations per instruction

CPU frequency has not increased since around 2000; it has reached its limit. Instructions per cycle are raised by the superscalar architecture. Operations per instruction are raised by SIMD (Single Instruction Multiple Data): packed data are calculated simultaneously by a single instruction, e.g. (1, 5, 3, 2) × (3, 2, 12, 9) = (3, 10, 36, 18). The only remaining room for development is here.

Slide 32

Other SIMD instructions (beyond vector calculation such as (a1, a2, a3, a4) + (b1, b2, b3, b4) = (c1, c2, c3, c4)):
- Load/Store: moving a contiguous block between memory and a register is fast; packing scattered elements into a register is slow.
- Shuffle: rearrange the elements within a register, e.g. (A, B, C, D) → (B, C, A, D), or broadcast one element to all slots, (A, B, C, D) → (C, C, C, C).
- Blending: merge two registers element by element, e.g. (A, B, C, D) and (E, F, G, H) → (A, F, C, H).
- Mask: zero out selected elements, e.g. (A, B, C, D) → (A, 0, 0, D).

Slide 33

SIMD strongly depends on the ISA (Instruction Set Architecture) → write in assembly language (or with the intrinsic functions of the C language).

Example:

    v4df vqj_a = _mm256_load_pd((double*)(q + j_a));
    v4df vdq_a = (vqj_a - vqi);
    v4df vd1_a = vdq_a * vdq_a;
    v4df vd2_a = _mm256_permute4x64_pd(vd1_a, 201);
    v4df vd3_a = _mm256_permute4x64_pd(vd1_a, 210);
    v4df vr2_a = vd1_a + vd2_a + vd3_a;

We call intrinsic functions for load/store/shuffle, while arithmetic operations and substitutions can be written simply as vector-type expressions.

Slide 34

[Benchmark: time required to calculate 100 loops [ms] on Haswell and Skylake for Simple / Grouping / Software pipelining; 120,000 atoms, density 1.0, cutoff 3.0. Speedups: ×1.14, ×1.24, ×1.44, ×1.07.]

We consider SIMD vectorization starting from here.

Slide 35

Simple implementation:
- Unroll the loop four times.
- Calculate four j-atoms simultaneously.

Fetch four x-coordinates and pack them into one register (similarly for y and z) → four packed distances.

Slide 36

What happened: packing four scattered doubles from memory into one ymm register takes five instructions,
(1) fetch A (vmovsd into xmm0)
(2) fetch B (vmovhpd into xmm0)
(3) fetch C (vmovsd into xmm1)
(4) fetch D (vmovhpd into xmm1)
(5) copy the halves together (vinsertf128 into ymm0)
and we have to do this for the x, y, and z coordinates. Hopelessly slow.

Slide 37

AoS (Array of Structures) with padding:

    double q[N][4], p[N][4];   // X Y Z (padding)

Why? The three coordinates (x, y, z) can then be loaded into a 256-bit register simultaneously with a single movupd/movapd, although 64 bits per atom are wasted.
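A minimal sketch of the resulting load (assuming AVX; the function name is illustrative):

    #include <immintrin.h>

    double q[100000][4];   // x, y, z, padding

    __m256d load_position(int i) {
      return _mm256_loadu_pd(q[i]);   // one movupd replaces the 5-step packing
    }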

Slide 38

Calculate four pairs at once: take the relative coordinates (dx, dy, dz) of four j-atoms with respect to the i-atom, transpose them into the packed registers (dx1..dx4), (dy1..dy4), (dz1..dz4), and take the sum of squares to obtain four packed distances (r1..r4):

    // relative coordinates
    v4df vdq_1 = vq_j1 - vq_i;
    v4df vdq_2 = vq_j2 - vq_i;
    v4df vdq_3 = vq_j3 - vq_i;
    v4df vdq_4 = vq_j4 - vq_i;
    // transpose
    v4df tmp0 = _mm256_unpacklo_pd(vdq_1, vdq_2);
    v4df tmp1 = _mm256_unpackhi_pd(vdq_1, vdq_2);
    v4df tmp2 = _mm256_unpacklo_pd(vdq_3, vdq_4);
    v4df tmp3 = _mm256_unpackhi_pd(vdq_3, vdq_4);
    v4df vdx = _mm256_permute2f128_pd(tmp0, tmp2, 0x20);
    v4df vdy = _mm256_permute2f128_pd(tmp1, tmp3, 0x20);
    v4df vdz = _mm256_permute2f128_pd(tmp0, tmp2, 0x31);
    // sum of squares
    v4df vr2 = vdx * vdx + vdy * vdy + vdz * vdz;

Slide 39

The pairlist contains pairs which are actually out of the interaction range (bookkeeping method) → use a mask operation.

vblendvpd chooses elements by the sign of a mask: with src1 = (A1, B1, C1, D1), src2 = (A2, B2, C2, D2), and mask = (N, P, P, N) (N: negative, P: positive), the result is (A2, B1, C1, D2).

Create the mask from the interaction length and the cutoff length, and store zero to the force of the pairs that are out of range (pairs A and D in the example above).
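A minimal sketch with AVX intrinsics (illustrative function name; vdf holds the packed forces and vr2 the packed squared distances as above):

    #include <immintrin.h>

    __m256d apply_cutoff(__m256d vdf, __m256d vr2, double CL2) {
      __m256d vcl2 = _mm256_set1_pd(CL2);
      __m256d mask = _mm256_cmp_pd(vr2, vcl2, _CMP_GT_OS);      // sign set where r2 > CL2
      return _mm256_blendv_pd(vdf, _mm256_setzero_pd(), mask);  // zero the force there
    }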

Slide 40

[Benchmark: time required to calculate 100 loops [ms] on Haswell and Skylake for Simple / Grouping / Software pipelining / AVX2; 120,000 atoms, density 1.0, cutoff 3.0. Speedups: ×1.14, ×1.24, ×1.63, ×1.44, ×1.07, ×2.31.]

Hand-written SIMD vectorization (AVX2) worked very effectively on Skylake.

Slide 41

gather/scatter: packing scattered elements into a register is hopelessly slow with AVX2. With AVX-512 we can gather 8 elements from memory simultaneously, and a scatter instruction is also available.
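A minimal sketch of an AVX-512 gather (illustrative function name): eight doubles scattered in memory are fetched by a single instruction via eight 32-bit indices.

    #include <immintrin.h>

    __m512d gather8(const double *base, __m256i vindex) {
      return _mm512_i32gather_pd(vindex, base, 8);   // scale 8 = sizeof(double)
    }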

Slide 42

Benchmark on KNL: [time for Simple / Grouping / Software pipelining / AVX2 on KNL and Skylake.]

On KNL, it is faster to leave the vectorization to the compiler than to use hand-written AVX2.

Slide 43

The innermost loop (one i-atom, its j-atoms):

    for (int k = 0; k < np; k++) {
      const int j = sorted_list2[kp + k];
      double dx = q[j][X] - qix;
      double dy = q[j][Y] - qiy;
      double dz = q[j][Z] - qiz;
      double r2 = (dx * dx + dy * dy + dz * dz);
      double r6 = r2 * r2 * r2;
      double df = ((24.0 * r6 - 48.0) / (r6 * r6 * r2)) * dt;
      if (r2 > CL2) df = 0.0;
      pfx += df * dx;
      pfy += df * dy;
      pfz += df * dz;
      p[j][X] -= df * dx;
      p[j][Y] -= df * dy;
      p[j][Z] -= df * dz;
    }

What the compiler does with AVX-512:
1. Unroll the loop 8 times.
2. Fetch 8 indices with vmovdqu32.
3. Gather the coordinates and calculate the forces.
4. Identify collisions with vpconflictd.
5. Scatter if there are no collisions.

The compiler does not know that there are no collisions (the j indices are all distinct).

Slide 44

Write gather/scatter by hand: the compiler added code for collision detection, but since we know there are no collisions, we can eliminate it.

[Benchmark: Simple / Grouping / Compiler / Write by hand. Speedups: ×1.93, ×1.09.]

The performance improvement is still unsatisfactory...

Slide 45

Change the data structure from AoS to SoA:
- AoS: X Y Z | X Y Z | X Y Z | ...
- SoA: X X X X X | Y Y Y Y Y | Z Z Z Z Z | ... — the data of each component become contiguous (the bit shift for gather/scatter becomes unnecessary); see the sketch below.

Remainder Loop Elimination (RLE): we perform 8 loop iterations simultaneously, so the remainder loop runs at most 7 times → eliminate the remainder loop by masking.

Software Pipelining (SWP): improve the IPC by software pipelining.
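A minimal sketch of the SoA layout (illustrative names): every component gets its own contiguous array, so an atom index can be used directly as a gather offset instead of being scaled by the stride of the padded AoS.

    enum { N = 100000 };
    struct Atoms {
      double x[N], y[N], z[N];      // positions
      double px[N], py[N], pz[N];   // momenta
    };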

Slide 46

Results: [Simple / Grouping / Write gather/scatter by hand / SoA+RLE / SoA+RLE+SWP. Speedups: ×1.93, ×1.09, ×1.21, ×1.09, ×1.32.]

Performance is improved by 30% (×1.32 over the hand-written gather/scatter).

Slide 47

Change the data structure once more, from SoA to "AoS 8-packed":
- SoA: X X X X X | Y Y Y Y Y | Z Z Z Z Z | VX VX ... | VY VY ... | VZ VZ ...
- AoS 8-packed: X Y Z 0 VX VY VZ 0 | X Y Z 0 VX VY VZ 0 | ...

The information of one atom (coordinates + momenta + padding = 8 doubles) exactly matches the cache-line size of 64 bytes (512 bits), with 75% of each cache line carrying useful data.
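A minimal sketch of the 8-packed layout (illustrative names):

    struct Atom8 {
      double x, y, z, pad0;      // position + padding
      double px, py, pz, pad1;   // momentum + padding
    };                           // sizeof(struct Atom8) == 64 bytes = one cache line
    struct Atom8 atoms[100000];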

Slide 48

Results: [Simple / Grouping / Write gather/scatter by hand / SoA+RLE / SoA+RLE+SWP / AoS 8+RLE+SWP. Speedups: ×1.93, ×1.09, ×1.21, ×1.09, ×1.05, ×1.39, ×1.51.]

AoS 8-packed + RLE + SWP is 51% faster than the automatic vectorization of the compiler.

Slide 49

Skylake (Xeon): [Simple / Grouping / AVX-512: write gather/scatter by hand / AVX-512: AoS 8 + RLE + SWP / AVX2: AoS 8 + RLE + SWP. Speedups: ×1.22, ×1.28, ×1.39, ×1.03.]

On Skylake (Xeon), AVX2 is faster than AVX-512; the compiler did not use gather/scatter.

Slide 50

- Explicit SIMD vectorization may give us a substantial increase in performance.
- Effective SIMD vectorization involves changing the data structure.
- SIMD vectorization strongly depends on the architecture → different code for different architectures.