Cache access: ~10 cycles. Memory access: 100~1000 cycles. Throughput: the rate at which data is fetched from memory to the CPU. Bytes/Flops (B/F): the ratio of memory bandwidth to computing power. Random access to memory severely degrades performance.
The CPU fetches an instruction, passes it to the execution unit, and stores the results; 3~6 cycles are required for one floating-point operation. A pipeline works like a belt conveyor: a single calculation still takes 6 cycles, but 1000 calculations take 1006 cycles. → Roughly one cycle per calculation, so with the pipeline full, performance (Flops) ≈ frequency.
How can we do it? Pipelining allows us to achieve one calculation per cycle, so CPU frequency = performance. However, the increase in CPU frequency has saturated. http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/fulltext
Instead, CPUs now execute up to ~8 operations simultaneously. Current architecture: superscalar (×2) + SIMD (×4~8). SIMD (Single Instruction Multiple Data) offers simple hardware and backward compatibility, but programming it is very hard, and auto-vectorization is still not effective.
A simple implementation of pairlist construction is O(N^2) → achieve O(N) by grid search. Grid search: divide the simulation box into small cells. Exclusive grid method: at most one atom per cell (hard disks, high density). Non-exclusive grid method: many atoms per cell (soft core, low density). A sketch of the cell assignment follows.
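A minimal sketch of the grid-search idea, not the author's actual code: each atom is assigned to a cell index computed from its coordinates, so interaction candidates can be restricted to neighboring cells. The array names and the non-exclusive layout (a list of atoms per cell) are my assumptions.

#include <vector>

// Assign each atom to a cell (non-exclusive grid: many atoms per cell).
// Searching only the neighboring cells of each atom then reduces the
// pairlist construction from O(N^2) to O(N).
void assign_to_cells(int n, const double* x, const double* y, const double* z,
                     double box_len, int m, // m cells per side; cell size >= search length
                     std::vector<std::vector<int>>& cells) {
  cells.assign(m * m * m, {});
  const double inv_cs = m / box_len; // assumes coordinates in [0, box_len)
  for (int i = 0; i < n; i++) {
    const int ix = static_cast<int>(x[i] * inv_cs);
    const int iy = static_cast<int>(y[i] * inv_cs);
    const int iz = static_cast<int>(z[i] * inv_cs);
    cells[(iz * m + iy) * m + ix].push_back(i);
  }
}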
Counting sort:
1. Count the number of keys: O(N)
2. Find the position where each datum will be stored (prefix sum of the counts).
3. Store the data: O(N)
Example: for the keys 3,4,2,3,5,5,1,2,1,3, the counts are 1:2, 2:2, 3:3, 4:1, 5:2, which directly give the storage positions for keys 1..5. A very useful algorithm for optimization; a sketch follows.
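A minimal C++ sketch of this counting sort (function and variable names are mine, not from the source):

#include <vector>

// Sort integer keys in the range [1, max_key] in O(N).
std::vector<int> counting_sort(const std::vector<int>& keys, int max_key) {
  std::vector<int> count(max_key + 1, 0);
  for (int k : keys) count[k]++;              // 1. count the keys: O(N)
  std::vector<int> pos(max_key + 2, 0);
  for (int k = 1; k <= max_key + 1; k++)
    pos[k] = pos[k - 1] + count[k - 1];       // 2. positions = prefix sums
  std::vector<int> sorted(keys.size());
  for (int k : keys) sorted[pos[k]++] = k;    // 3. store the data: O(N)
  return sorted;
}
// counting_sort({3,4,2,3,5,5,1,2,1,3}, 5) -> {1,1,2,2,3,3,3,4,5,5}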
Bookkeeping method: construct the pairlist with a margin, i.e., search length = cutoff + margin. The margin is the lifetime of the pairlist: decrease the lifetime by Vmax * h every step (h is the timestep; the worst case is a head-on collision of the fastest atoms), and reconstruct the pairlist when the lifetime drops below zero. Expiration check: the pairlist then contains pairs that are actually out of interaction range → a conditional branch in the innermost loop. A sketch of the lifetime check follows.
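A minimal sketch of this expiration check, with hypothetical names (margin, vmax, h, build_pairlist) and assuming Vmax already accounts for the worst-case closing speed:

double margin = 0.5;      // extra search length beyond the cutoff (example value)
double lifetime = margin; // remaining margin, in length units

void build_pairlist();    // rebuild with search length = cutoff + margin

void check_expiration(double vmax, double h) {
  // The fastest atoms consume the margin quickest (worst case:
  // head-on collision), so decrease the lifetime by vmax * h each step.
  lifetime -= vmax * h;
  if (lifetime < 0.0) {
    build_pairlist();     // O(N) with grid search + sorting
    lifetime = margin;
  }
}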
[Figure: performance vs. number of atoms, with and without renumbering; L2/L3 cache sizes marked; higher is better.] Without renumbering: performance degrades once the working set exceeds each cache size (L2, L3). With renumbering: performance is virtually independent of the number of atoms.
[Diagram: each iteration runs Fetch Data → Calc r^2 → Calc Force → Store Data; within an iteration there is a dependency chain of memory access and calculation, but iterations 1..N are independent of each other.] Because the iterations are independent, the memory access of one iteration can overlap with the calculation of another, improving IPC (instructions per cycle). A sketch follows.
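A minimal sketch of this idea written by hand in C++ (the kernel body and names are hypothetical; only the structure matters): the load for iteration i+1 is issued while the arithmetic for iteration i is still in flight, so memory access and calculation overlap.

// Software-pipelined pair loop (sketch). 'pos' holds partner
// coordinates, 'dist2' receives r^2 for each pair.
void calc_r2_swp(int n, const double* pos, double xi, double* dist2) {
  double xj = pos[0];                 // fetch data for loop 1
  for (int i = 0; i < n; i++) {
    const double xj_next = pos[(i + 1 < n) ? i + 1 : i]; // fetch for loop i+1
    const double dx = xj - xi;        // calc r^2 for loop i
    dist2[i] = dx * dx;               // store data for loop i
    xj = xj_next;                     // rotate the pipeline
  }
}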
Add b[i] to a[i] when a[i] < 0:

const int N = 100000;
void func(double a[N], double b[N]) {
  for (int i = 0; i < N; i++) {
    if (a[i] < 0.0) a[i] += b[i];
  }
}

Pseudo assembly code:
L1: (1) compare a[i] and 0
    (2) if a[i] >= 0 then jump to L2
    (3) a[i] = a[i] + b[i]
L2: (4) i = i + 1
    (5) jump to L1 when i < N

This kind of code can be very inefficient, since the instruction to be executed next cannot be determined until the comparison completes.
Before:

const int N = 100000;
void func(double a[N], double b[N]) {
  for (int i = 0; i < N; i++) {
    if (a[i] < 0.0) a[i] += b[i];
  }
}

After:

const int N = 100000;
void func(double a[N], double b[N]) {
  for (int i = 0; i < N; i++) {
    double tmp = b[i];
    if (a[i] >= 0.0) tmp = 0.0;
    a[i] += tmp;
  }
}

Pseudo assembly code of the After version:
L1: (1) store b[i] to tmp
    (2) if a[i] >= 0 then tmp = 0 (conditional move)
    (3) a[i] = a[i] + tmp
    (4) i = i + 1
    (5) jump to L1 when i < N

The conditional branch involving a jump is eliminated.
[Figure: execution time, without vs. with branch elimination; lower is better.] Execution time was reduced by 30%. Benchmark: 119,164 LJ atoms, density 1.0, cutoff 3.0, Xeon E5-2680 v3 @ 2.50GHz.
SIMD (Single Instruction Multiple Data): calculate packed data simultaneously with a single instruction, e.g. (1, 5, 3, 2) × (3, 2, 12, 9) = (3, 10, 36, 18) in one multiply. The superscalar width has not increased since around 2000 (it has reached its limit); SIMD is the only remaining room for development. An intrinsics example follows.
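A minimal sketch of this packed multiplication using AVX2 intrinsics (my example, reproducing the numbers above with four doubles per 256-bit register):

#include <immintrin.h>
#include <cstdio>

int main() {
  // _mm256_set_pd takes its arguments from the highest element to the lowest.
  __m256d a = _mm256_set_pd(2.0, 3.0, 5.0, 1.0);   // (1, 5, 3, 2)
  __m256d b = _mm256_set_pd(9.0, 12.0, 2.0, 3.0);  // (3, 2, 12, 9)
  __m256d c = _mm256_mul_pd(a, b);                 // one vmulpd instruction
  double r[4];
  _mm256_storeu_pd(r, c);
  printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]); // 3 10 36 18
  return 0;
}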
[Diagram: SIMD data movement between memory and registers.] Packing individual elements from memory into a register is slow; moving contiguous data is fast. Other SIMD instructions:
- Shuffle: rearrange elements within a register, e.g. (D, C, B, A) → (B, C, A, D).
- Scatter (shown as a broadcast): copy one element to all positions, (D, C, B, A) → (C, C, C, C).
- Blending: pick each element from one of two sources, (H, G, F, E) and (D, C, B, A) → (H, C, F, A).
- Mask: zero out selected elements, (D, C, B, A) with mask (◦, ×, ×, ◦) → (D, 0, 0, A).
- Vector calculation: (a1, a2, a3, a4) + (b1, b2, b3, b4) = (c1, c2, c3, c4).
A short intrinsics illustration follows.
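For concreteness, here is my mapping of these operations to AVX/AVX2 intrinsics; registers are written low-to-high as (A, B, C, D), so the displays above appear reversed:

#include <immintrin.h>

void simd_ops(__m256d v, __m256d w) {
  // Shuffle: (A,B,C,D) -> (D,A,C,B), i.e. the (D,C,B,A) -> (B,C,A,D) example.
  __m256d sh = _mm256_permute4x64_pd(v, 0x63);
  // Broadcast element C to all positions: (A,B,C,D) -> (C,C,C,C).
  __m256d bc = _mm256_permute4x64_pd(v, 0xAA);
  // Blend: take lanes 0 and 2 from w, lanes 1 and 3 from v.
  __m256d bl = _mm256_blend_pd(v, w, 0x5);
  // Mask: zero lanes 1 and 2 by ANDing with an all-ones/all-zeros pattern.
  __m256d mask = _mm256_castsi256_pd(
      _mm256_set_epi64x(-1, 0, 0, -1)); // (keep, zero, zero, keep)
  __m256d mk = _mm256_and_pd(v, mask);
  (void)sh; (void)bc; (void)bl; (void)mk;
}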
[Figure: time required to calculate 100 loops [ms]; 120,000 atoms, density 1.0, cutoff 3.0.] Software pipelining gives ×1.07 over simple grouping. We consider SIMD vectorization from here.
[Diagram: packing four doubles A, B, C, D from memory into a 256-bit ymm register.] (1) Fetch A into xmm0 (vmovsd), (2) fetch B (vmovhpd), (3) fetch C into xmm1 (vmovsd), (4) fetch D (vmovhpd), (5) copy the two xmm halves into one ymm (vinsertf128). ※ This must be done for each of the x, y, and z coordinates. Hopelessly slow.
Why use an Array of Structures (AoS)? The three coordinates (x, y, z) of one atom can be loaded into a 256-bit register simultaneously with a single movupd/movapd. However, 64 bits of the register are wasted. A sketch follows.
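A minimal sketch of this layout in C++ with AVX intrinsics, assuming (my assumption, common for this trick) that each atom is padded to four doubles so a single 256-bit load works:

#include <immintrin.h>

const int N = 100000;
double q[N][4]; // Array of Structures: x, y, z + one padding double per atom

__m256d load_coords(int i) {
  // One vmovupd loads x, y, z (and the unused 4th element) at once;
  // 64 bits of the register are wasted, but no packing is needed.
  return _mm256_loadu_pd(q[i]);
}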
Due to the margin of the bookkeeping method, the pairlist contains pairs that are out of interaction range → handle them with a mask operation instead of a branch. vblendvpd chooses each element from src1 or src2 by the sign of the mask: e.g. src1 = (D2, C2, B2, A2), src2 = (D1, C1, B1, A1), mask = (N, P, P, N) → (D2, C1, B1, A2). Create the mask by comparing the pair distance with the cutoff length, and blend zero into the force where the pair (here A and D) is out of interaction range. A sketch follows.
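A minimal sketch of this cutoff mask with AVX2 intrinsics (variable names are mine): pairs beyond the cutoff produce an all-zero comparison mask, and vblendvpd substitutes zero for the corresponding force elements.

#include <immintrin.h>

// Zero the force contribution of pairs whose r^2 exceeds the cutoff.
__m256d mask_force(__m256d force, __m256d r2, double cutoff2) {
  // Mask lane = all ones (sign bit set) where r2 < cutoff2.
  __m256d mask = _mm256_cmp_pd(r2, _mm256_set1_pd(cutoff2), _CMP_LT_OS);
  // vblendvpd: take 'force' where the mask sign bit is set, else 0.0.
  return _mm256_blendv_pd(_mm256_setzero_pd(), force, mask);
}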
Packing is hopelessly slow with AVX2. With AVX-512 we can gather 8 elements from memory into a register simultaneously with a single instruction, and there is also a scatter instruction for the reverse direction. A sketch follows.
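A minimal AVX-512 sketch (my names and placeholder update): gather 8 doubles selected by an index vector, then scatter the results back.

#include <immintrin.h>

// Gather the x coordinates of 8 pairlist partners in one instruction,
// then scatter updated values back to the same locations.
void gather_scatter(const int* idx, double* x) {
  __m256i vindex = _mm256_loadu_si256((const __m256i*)idx); // 8 x 32-bit indices
  __m512d vx = _mm512_i32gather_pd(vindex, x, 8);           // scale 8 = sizeof(double)
  vx = _mm512_add_pd(vx, _mm512_set1_pd(1.0));              // (placeholder update)
  _mm512_i32scatter_pd(x, vindex, vx, 8);
}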
The compiler-generated gather/scatter includes collision detection for the indices. Since we know there are no collisions, we can eliminate it by writing the gather/scatter by hand. [Figure: simple grouping vs. compiler vs. hand-written gather/scatter; speedups ×1.09 (compiler) and ×1.93 (by hand).] The performance improvement is still unsatisfactory...
Change the layout from AoS (x, y, z, x, y, z, ...) to SoA (x, x, ..., y, y, ..., z, z, ...): the data becomes contiguous, so the bit-shift of indices for gather/scatter becomes unnecessary. Remainder Loop Elimination (RLE): we can process 8 iterations simultaneously, so the remainder loop is at most 7 iterations → eliminate it by masking (sketch below). Software Pipelining (SWP): improve IPC by software pipelining.
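A minimal sketch of remainder loop elimination with AVX-512 mask registers (my example on a simple array update): the final, partial iteration uses a mask with only the remaining lanes set, so no scalar remainder loop is needed.

#include <immintrin.h>

// Add 1.0 to n doubles, 8 at a time; the remainder (up to 7 elements)
// is handled by a masked load/store instead of a scalar loop.
void add_one(double* a, int n) {
  const __m512d one = _mm512_set1_pd(1.0);
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    _mm512_storeu_pd(a + i, _mm512_add_pd(_mm512_loadu_pd(a + i), one));
  }
  const __mmask8 m = (__mmask8)((1u << (n - i)) - 1); // lanes for the remainder
  __m512d v = _mm512_maskz_loadu_pd(m, a + i);
  _mm512_mask_storeu_pd(a + i, m, _mm512_add_pd(v, one));
}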
[Diagram: SoA stores coordinates (X..., Y..., Z...) and momenta (VX..., VY..., VZ...) in separate arrays; AoS 8-packed stores one atom as (X, Y, Z, 0, VX, VY, VZ, 0).] Change the data structure from SoA to AoS 8-packed: the information of one atom (x, y, z, padding, vx, vy, vz, padding = 8 doubles) exactly matches the cache line size (512 bits = 64 bytes), with 75% of the cache line carrying useful data. A sketch of the layout follows.
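A minimal C++ sketch of this AoS 8-packed layout (field order as in the diagram; the alignment attribute is my addition):

// One atom = 8 doubles = 64 bytes, i.e. exactly one cache line.
// 6 of the 8 doubles are meaningful: 75% cache-line usage.
struct alignas(64) Atom {
  double x, y, z;
  double pad1;       // padding; also lets (x, y, z) load as one 256-bit vector
  double vx, vy, vz;
  double pad2;       // padding to fill the cache line
};

Atom atoms[100000]; // each element touches exactly one cache line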
[Figure: benchmark on Skylake (Xeon); AVX-512: AoS 8-packed + RLE + SWP vs. AVX2: AoS 8-packed + RLE + SWP.] On Skylake (Xeon), the AVX2 version is faster than the AVX-512 version. The compiler did not use gather/scatter instructions.
Effective SIMD vectorization involves changing the data structure, and it strongly depends on the architecture. → Different code is required for different architectures.