Cache access: ~10 cycles. Memory access: 100~1000 cycles. Throughput: the rate at which data is fetched from memory to the CPU. Bytes/Flops (B/F): the ratio of memory bandwidth to computing power. Random access to memory severely degrades performance.
The CPU fetches an instruction, passes it to the execution unit, and stores the results; 3~6 cycles are required for one floating-point operation. A pipeline works like a belt conveyor: a single calculation still takes 6 cycles, but 1000 calculations take 1006 cycles. → Roughly one cycle per calculation, so with the pipeline full, performance (Flops) ≈ frequency.
How can we do it? Pipelining allows us to achieve one calculation per cycle, so CPU frequency = performance. However, the increase in CPU frequency has saturated. http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/fulltext
Instead, CPUs now execute up to ~8 operations simultaneously. Current architecture: superscalar (×2) + SIMD (×4~8). SIMD (Single Instruction Multiple Data) offers simple hardware and backward compatibility, but programming it is very hard, and auto-vectorization is still not effective.
A simple implementation of pairlist construction is O(N^2) → achieve O(N) by grid search. Grid search: divide the simulation box into small cells. Exclusive grid method: at most one atom per cell (hard disks, high density). Non-exclusive grid method: many atoms per cell (soft core, low density). A sketch of the cell assignment follows.
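A minimal sketch of the grid-search idea, not the author's actual code: each atom is assigned to a cell index computed from its coordinates, so interaction candidates can be restricted to neighboring cells. The array names and the non-exclusive layout (a list of atoms per cell) are my assumptions.

#include <vector>

// Assign each atom to a cell (non-exclusive grid: many atoms per cell).
// Searching only the neighboring cells of each atom then reduces the
// pairlist construction from O(N^2) to O(N).
void assign_to_cells(int n, const double* x, const double* y, const double* z,
                     double box_len, int m, // m cells per side; cell size >= search length
                     std::vector<std::vector<int>>& cells) {
  cells.assign(m * m * m, {});
  const double inv_cs = m / box_len; // assumes coordinates in [0, box_len)
  for (int i = 0; i < n; i++) {
    const int ix = static_cast<int>(x[i] * inv_cs);
    const int iy = static_cast<int>(y[i] * inv_cs);
    const int iz = static_cast<int>(z[i] * inv_cs);
    cells[(iz * m + iy) * m + ix].push_back(i);
  }
}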
Counting sort:
1. Count the number of keys: O(N)
2. Find the position where each datum will be stored (prefix sum of the counts).
3. Store the data: O(N)
Example: for the keys 3,4,2,3,5,5,1,2,1,3, the counts are 1:2, 2:2, 3:3, 4:1, 5:2, which directly give the storage positions for keys 1..5. A very useful algorithm for optimization; a sketch follows.
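A minimal C++ sketch of this counting sort (function and variable names are mine, not from the source):

#include <vector>

// Sort integer keys in the range [1, max_key] in O(N).
std::vector<int> counting_sort(const std::vector<int>& keys, int max_key) {
  std::vector<int> count(max_key + 1, 0);
  for (int k : keys) count[k]++;              // 1. count the keys: O(N)
  std::vector<int> pos(max_key + 2, 0);
  for (int k = 1; k <= max_key + 1; k++)
    pos[k] = pos[k - 1] + count[k - 1];       // 2. positions = prefix sums
  std::vector<int> sorted(keys.size());
  for (int k : keys) sorted[pos[k]++] = k;    // 3. store the data: O(N)
  return sorted;
}
// counting_sort({3,4,2,3,5,5,1,2,1,3}, 5) -> {1,1,2,2,3,3,3,4,5,5}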
Bookkeeping method: construct the pairlist with a margin, i.e., search length = cutoff + margin. The margin is the lifetime of the pairlist: decrease the lifetime by Vmax * h every step (h is the timestep; the worst case is a head-on collision of the fastest atoms), and reconstruct the pairlist when the lifetime drops below zero. Expiration check: the pairlist then contains pairs that are actually out of interaction range → a conditional branch in the innermost loop. A sketch of the lifetime check follows.
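A minimal sketch of this expiration check, with hypothetical names (margin, vmax, h, build_pairlist) and assuming Vmax already accounts for the worst-case closing speed:

double margin = 0.5;      // extra search length beyond the cutoff (example value)
double lifetime = margin; // remaining margin, in length units

void build_pairlist();    // rebuild with search length = cutoff + margin

void check_expiration(double vmax, double h) {
  // The fastest atoms consume the margin quickest (worst case:
  // head-on collision), so decrease the lifetime by vmax * h each step.
  lifetime -= vmax * h;
  if (lifetime < 0.0) {
    build_pairlist();     // O(N) with grid search + sorting
    lifetime = margin;
  }
}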
[Figure: performance vs. number of atoms, with and without renumbering; L2/L3 cache sizes marked; higher is better.] Without renumbering: performance degrades once the working set exceeds each cache size (L2, L3). With renumbering: performance is virtually independent of the number of atoms.
[Diagram: each iteration runs Fetch Data → Calc r^2 → Calc Force → Store Data; within an iteration there is a dependency chain of memory access and calculation, but iterations 1..N are independent of each other.] Because the iterations are independent, the memory access of one iteration can overlap with the calculation of another, improving IPC (instructions per cycle). A sketch follows.
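A minimal sketch of this idea written by hand in C++ (the kernel body and names are hypothetical; only the structure matters): the load for iteration i+1 is issued while the arithmetic for iteration i is still in flight, so memory access and calculation overlap.

// Software-pipelined pair loop (sketch). 'pos' holds partner
// coordinates, 'dist2' receives r^2 for each pair.
void calc_r2_swp(int n, const double* pos, double xi, double* dist2) {
  double xj = pos[0];                 // fetch data for loop 1
  for (int i = 0; i < n; i++) {
    const double xj_next = pos[(i + 1 < n) ? i + 1 : i]; // fetch for loop i+1
    const double dx = xj - xi;        // calc r^2 for loop i
    dist2[i] = dx * dx;               // store data for loop i
    xj = xj_next;                     // rotate the pipeline
  }
}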
Add b[i] to a[i] when a[i] < 0:

const int N = 100000;
void func(double a[N], double b[N]) {
  for (int i = 0; i < N; i++) {
    if (a[i] < 0.0) a[i] += b[i];
  }
}

Pseudo assembly code:
L1: (1) compare a[i] and 0
    (2) if a[i] >= 0 then jump to L2
    (3) a[i] = a[i] + b[i]
L2: (4) i = i + 1
    (5) jump to L1 when i < N

This kind of code can be very inefficient, since the instruction to be executed next cannot be determined until the comparison completes.
Before:

const int N = 100000;
void func(double a[N], double b[N]) {
  for (int i = 0; i < N; i++) {
    if (a[i] < 0.0) a[i] += b[i];
  }
}

After:

const int N = 100000;
void func(double a[N], double b[N]) {
  for (int i = 0; i < N; i++) {
    double tmp = b[i];
    if (a[i] >= 0.0) tmp = 0.0;
    a[i] += tmp;
  }
}

Pseudo assembly code of the After version:
L1: (1) store b[i] to tmp
    (2) if a[i] >= 0 then tmp = 0 (conditional move)
    (3) a[i] = a[i] + tmp
    (4) i = i + 1
    (5) jump to L1 when i < N

The conditional branch involving a jump is eliminated.
[Figure: execution time, without vs. with branch elimination; lower is better.] Execution time was reduced by 30%. Benchmark: 119,164 LJ atoms, density 1.0, cutoff 3.0, Xeon E5-2680 v3 @ 2.50GHz.
SIMD (Single Instruction Multiple Data): calculate packed data simultaneously with a single instruction, e.g. (1, 5, 3, 2) × (3, 2, 12, 9) = (3, 10, 36, 18) in one multiply. The superscalar width has not increased since around 2000 (it has reached its limit); SIMD is the only remaining room for development. An intrinsics example follows.
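A minimal sketch of this packed multiplication using AVX2 intrinsics (my example, reproducing the numbers above with four doubles per 256-bit register):

#include <immintrin.h>
#include <cstdio>

int main() {
  // _mm256_set_pd takes its arguments from the highest element to the lowest.
  __m256d a = _mm256_set_pd(2.0, 3.0, 5.0, 1.0);   // (1, 5, 3, 2)
  __m256d b = _mm256_set_pd(9.0, 12.0, 2.0, 3.0);  // (3, 2, 12, 9)
  __m256d c = _mm256_mul_pd(a, b);                 // one vmulpd instruction
  double r[4];
  _mm256_storeu_pd(r, c);
  printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]); // 3 10 36 18
  return 0;
}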
[Diagram: SIMD data movement between memory and registers.] Packing individual elements from memory into a register is slow; moving contiguous data is fast. Other SIMD instructions:
- Shuffle: rearrange elements within a register, e.g. (D, C, B, A) → (B, C, A, D).
- Scatter (shown as a broadcast): copy one element to all positions, (D, C, B, A) → (C, C, C, C).
- Blending: pick each element from one of two sources, (H, G, F, E) and (D, C, B, A) → (H, C, F, A).
- Mask: zero out selected elements, (D, C, B, A) with mask (◦, ×, ×, ◦) → (D, 0, 0, A).
- Vector calculation: (a1, a2, a3, a4) + (b1, b2, b3, b4) = (c1, c2, c3, c4).
A short intrinsics illustration follows.
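For concreteness, here is my mapping of these operations to AVX/AVX2 intrinsics; registers are written low-to-high as (A, B, C, D), so the displays above appear reversed:

#include <immintrin.h>

void simd_ops(__m256d v, __m256d w) {
  // Shuffle: (A,B,C,D) -> (D,A,C,B), i.e. the (D,C,B,A) -> (B,C,A,D) example.
  __m256d sh = _mm256_permute4x64_pd(v, 0x63);
  // Broadcast element C to all positions: (A,B,C,D) -> (C,C,C,C).
  __m256d bc = _mm256_permute4x64_pd(v, 0xAA);
  // Blend: take lanes 0 and 2 from w, lanes 1 and 3 from v.
  __m256d bl = _mm256_blend_pd(v, w, 0x5);
  // Mask: zero lanes 1 and 2 by ANDing with an all-ones/all-zeros pattern.
  __m256d mask = _mm256_castsi256_pd(
      _mm256_set_epi64x(-1, 0, 0, -1)); // (keep, zero, zero, keep)
  __m256d mk = _mm256_and_pd(v, mask);
  (void)sh; (void)bc; (void)bl; (void)mk;
}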
[Figure: time required to calculate 100 loops [ms]; 120,000 atoms, density 1.0, cutoff 3.0.] Software pipelining gives ×1.07 over simple grouping. We consider SIMD vectorization from here.
[Diagram: packing four doubles A, B, C, D from memory into a 256-bit ymm register.] (1) Fetch A into xmm0 (vmovsd), (2) fetch B (vmovhpd), (3) fetch C into xmm1 (vmovsd), (4) fetch D (vmovhpd), (5) copy the two xmm halves into one ymm (vinsertf128). ※ This must be done for each of the x, y, and z coordinates. Hopelessly slow.
Why use an Array of Structures (AoS)? The three coordinates (x, y, z) of one atom can be loaded into a 256-bit register simultaneously with a single movupd/movapd. However, 64 bits of the register are wasted. A sketch follows.
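A minimal sketch of this layout in C++ with AVX intrinsics, assuming (my assumption, common for this trick) that each atom is padded to four doubles so a single 256-bit load works:

#include <immintrin.h>

const int N = 100000;
double q[N][4]; // Array of Structures: x, y, z + one padding double per atom

__m256d load_coords(int i) {
  // One vmovupd loads x, y, z (and the unused 4th element) at once;
  // 64 bits of the register are wasted, but no packing is needed.
  return _mm256_loadu_pd(q[i]);
}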
Due to the margin of the bookkeeping method, the pairlist contains pairs that are out of interaction range → handle them with a mask operation instead of a branch. vblendvpd chooses each element from src1 or src2 by the sign of the mask: e.g. src1 = (D2, C2, B2, A2), src2 = (D1, C1, B1, A1), mask = (N, P, P, N) → (D2, C1, B1, A2). Create the mask by comparing the pair distance with the cutoff length, and blend zero into the force where the pair (here A and D) is out of interaction range. A sketch follows.
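A minimal sketch of this cutoff mask with AVX2 intrinsics (variable names are mine): pairs beyond the cutoff produce an all-zero comparison mask, and vblendvpd substitutes zero for the corresponding force elements.

#include <immintrin.h>

// Zero the force contribution of pairs whose r^2 exceeds the cutoff.
__m256d mask_force(__m256d force, __m256d r2, double cutoff2) {
  // Mask lane = all ones (sign bit set) where r2 < cutoff2.
  __m256d mask = _mm256_cmp_pd(r2, _mm256_set1_pd(cutoff2), _CMP_LT_OS);
  // vblendvpd: take 'force' where the mask sign bit is set, else 0.0.
  return _mm256_blendv_pd(_mm256_setzero_pd(), force, mask);
}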
Packing is hopelessly slow with AVX2. With AVX-512 we can gather 8 elements from memory into a register simultaneously with a single instruction, and there is also a scatter instruction for the reverse direction. A sketch follows.
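A minimal AVX-512 sketch (my names and placeholder update): gather 8 doubles selected by an index vector, then scatter the results back.

#include <immintrin.h>

// Gather the x coordinates of 8 pairlist partners in one instruction,
// then scatter updated values back to the same locations.
void gather_scatter(const int* idx, double* x) {
  __m256i vindex = _mm256_loadu_si256((const __m256i*)idx); // 8 x 32-bit indices
  __m512d vx = _mm512_i32gather_pd(vindex, x, 8);           // scale 8 = sizeof(double)
  vx = _mm512_add_pd(vx, _mm512_set1_pd(1.0));              // (placeholder update)
  _mm512_i32scatter_pd(x, vindex, vx, 8);
}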
The compiler-generated gather/scatter includes collision detection for the indices. Since we know there are no collisions, we can eliminate it by writing the gather/scatter by hand. [Figure: simple grouping vs. compiler vs. hand-written gather/scatter; speedups ×1.09 (compiler) and ×1.93 (by hand).] The performance improvement is still unsatisfactory...
Change the layout from AoS (x, y, z, x, y, z, ...) to SoA (x, x, ..., y, y, ..., z, z, ...): the data becomes contiguous, so the bit-shift of indices for gather/scatter becomes unnecessary. Remainder Loop Elimination (RLE): we can process 8 iterations simultaneously, so the remainder loop is at most 7 iterations → eliminate it by masking (sketch below). Software Pipelining (SWP): improve IPC by software pipelining.
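A minimal sketch of remainder loop elimination with AVX-512 mask registers (my example on a simple array update): the final, partial iteration uses a mask with only the remaining lanes set, so no scalar remainder loop is needed.

#include <immintrin.h>

// Add 1.0 to n doubles, 8 at a time; the remainder (up to 7 elements)
// is handled by a masked load/store instead of a scalar loop.
void add_one(double* a, int n) {
  const __m512d one = _mm512_set1_pd(1.0);
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    _mm512_storeu_pd(a + i, _mm512_add_pd(_mm512_loadu_pd(a + i), one));
  }
  const __mmask8 m = (__mmask8)((1u << (n - i)) - 1); // lanes for the remainder
  __m512d v = _mm512_maskz_loadu_pd(m, a + i);
  _mm512_mask_storeu_pd(a + i, m, _mm512_add_pd(v, one));
}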
[Diagram: SoA stores coordinates (X..., Y..., Z...) and momenta (VX..., VY..., VZ...) in separate arrays; AoS 8-packed stores one atom as (X, Y, Z, 0, VX, VY, VZ, 0).] Change the data structure from SoA to AoS 8-packed: the information of one atom (x, y, z, padding, vx, vy, vz, padding = 8 doubles) exactly matches the cache line size (512 bits = 64 bytes), with 75% of the cache line carrying useful data. A sketch of the layout follows.
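A minimal C++ sketch of this AoS 8-packed layout (field order as in the diagram; the alignment attribute is my addition):

// One atom = 8 doubles = 64 bytes, i.e. exactly one cache line.
// 6 of the 8 doubles are meaningful: 75% cache-line usage.
struct alignas(64) Atom {
  double x, y, z;
  double pad1;       // padding; also lets (x, y, z) load as one 256-bit vector
  double vx, vy, vz;
  double pad2;       // padding to fill the cache line
};

Atom atoms[100000]; // each element touches exactly one cache line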
[Figure: benchmark on Skylake (Xeon); AVX-512: AoS 8-packed + RLE + SWP vs. AVX2: AoS 8-packed + RLE + SWP.] On Skylake (Xeon), the AVX2 version is faster than the AVX-512 version. The compiler did not use gather/scatter instructions.
Effective SIMD vectorization involves changing the data structure, and it strongly depends on the architecture. → Different code is required for different architectures.