Slide 1

Research Paper Introduction #50: "Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction" (overall session #123) @cafenero_777 2024/01/25

Slide 2

Target paper
• "Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction"
• Richard L. Graham, et al.
• Mellanox Technologies, Inc.
• COM-HPC '16: Proceedings of the First Workshop on Optimization of Communication in HPC (at SC16)
• https://dl.acm.org/doi/10.5555/3018058.3018059
• https://network.nvidia.com/sites/default/files/related-docs/solutions/hpc/paperieee_copyright.pdf
• https://dl.acm.org/doi/proceedings/10.5555/3018058

Slide 3

Agenda
• Target paper
• Overview and why I chose to read it
1. INTRODUCTION
2. PREVIOUS WORK
3. AGGREGATION PROTOCOL
4. MPI IMPLEMENTATION
5. EXPERIMENTS
6. MICRO-BENCHMARK RESULTS
7. OPENFOAM PERFORMANCE
8. DISCUSSION
9. CONCLUSIONS

Slide 4

Overview and why I chose to read it
• Overview
  • Explores an architecture for larger-scale simulations
  • Implements a co-processor for large-scale parallel (collective) communication inside the data center, in the switch itself
  • 2x to 3.2x performance improvement
• Why I chose to read it
  • It comes up in discussions of SmartNICs/DPUs
  • I have used MPI in the past
(Figure: large-scale parallel computation, conceptual diagram)

Slide 5

Collective Communication?
• Communication operations used in parallel computation
• Map/Shuffle/Reduce, AllReduce, AllGather, and so on (see the minimal example below)
  • https://www.janog.gr.jp/meeting/janog53/dcqos/ P.35-39
  • https://www.janog.gr.jp/meeting/janog53/ainw/ P.15-17
• NCCL
  • https://developer.nvidia.com/nccl
Note: for brevity, "reduction" is abbreviated as "red." in these slides.
CCO: Collective Communication Operation / CCL: Collective Communication Library
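
To make the collective-operation terminology concrete, here is a minimal C/MPI example (my own, not from the paper) of one CCO, MPI_Allreduce: every rank contributes a value and every rank receives the reduced result. Build with an MPI compiler wrapper (e.g. mpicc) and launch with mpirun.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes one value; Allreduce leaves the global
       sum on every rank (a reduction, "red." in these slides). */
    int local = rank + 1, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: sum over all ranks = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}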

Slide 6

1. INTRODUCTION
• History of HPC performance gains: vector machines -> PC clusters (commodity CPUs) -> multi-core CPUs
• The key is growing the number of compute engines: CPU (x86, Power, ARM), GPGPU, FPGA
  • The network between compute engines matters, but data is strongly local to each compute engine; it needs to be distributed appropriately -> co-design is required
• Mellanox focused on CPU offload, aiming for low latency through pipeline optimization
  • Convenience is provided by exposing an API to MPI
• Scalable Hierarchical Aggregation Protocol (SHArP)
  • Defines a protocol for reduction operations and implements it in the SwitchIB-2 device
  • Using a reduction tree offloads computation from the CPU and also reduces the reduction traffic itself

Slide 7

2. PREVIOUS WORK
• Prior work: improved blocking/non-blocking barrier and reduction algorithms
  • Algorithmic variants: sequential, chain, binary, binomial tree
  • Topology-aware variants: mesh, hypercube
• Performance gains from hardware implementations: broadcast, barrier, reduction; a torus interconnect with all-to-all collective communication gave a 2x performance improvement on IBM Blue Gene

Slide 8

3. AGGREGATION PROTOCOL
• Goals of the network co-processor architecture
  • Minimize CPU utilization by optimizing the completion time of (frequent) communication
  • Initial targets: global reductions and barrier synchronization
• SHArP abstracts the reduction operation
  • AN: Aggregation Node
  • Data climbs the tree, being reduced (red.) at each AN along the way
  • When the root is reached, the global aggregate is complete; the scheme does not depend on the aggregation pattern
  • Processing happens in the network, so data movement is also minimized; benefits from high radix (a shallow reduction tree), as the back-of-the-envelope sketch below illustrates
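
A back-of-the-envelope illustration (my own, not from the paper) of why high radix gives a shallow reduction tree: the number of aggregation levels over N end nodes is roughly ceil(log_radix(N)), so a 36-port switch such as SwitchIB-2 needs far fewer hops than a binary tree.

#include <stdio.h>

/* Smallest number of tree levels needed so that `radix`-ary fan-in
   covers `nodes` leaves. */
static int tree_depth(int nodes, int radix) {
    int depth = 0, span = 1;
    while (span < nodes) {
        span *= radix;
        depth++;
    }
    return depth;
}

int main(void) {
    int nodes = 128;  /* the size of the paper's test cluster */
    printf("radix  2 -> depth %d\n", tree_depth(nodes, 2));   /* 7 levels */
    printf("radix 36 -> depth %d\n", tree_depth(nodes, 36));  /* 2 levels */
    return 0;
}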

Slide 9

3. AGGREGATION PROTOCOL A: Aggregation Nodes.
• AN: a node of the aggregation tree
  • Receives data from its child nodes, reduces it, and forwards the result to its parent (or to each node)
  • Maintains data coherence (barrier synchronization, vector reductions)
  • Possible implementations: a process on a server or a switch, or switch hardware (a sketch of the per-node work follows)
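
As a rough illustration of the reduce-and-forward step, here is a hypothetical C sketch. The function name, data layout, and the surrounding send/receive logic are my assumptions, not SHArP code.

#include <stddef.h>

/* Hypothetical per-element work an AN does for a SUM reduction: combine
   the vectors received from its children into one reduced vector, which
   would then be forwarded to the parent AN. */
void an_reduce_sum(const double *child_data, size_t num_children,
                   size_t vec_len, double *reduced /* out, length vec_len */) {
    for (size_t i = 0; i < vec_len; i++) {
        double acc = 0.0;
        /* A fixed child order keeps floating-point results reproducible,
           matching the predictable ordering discussed in Section 3F. */
        for (size_t c = 0; c < num_children; c++)
            acc += child_data[c * vec_len + i];
        reduced[i] = acc;
    }
}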

Slide 10

3. AGGREGATION PROTOCOL B: Aggregation Tree.
• End nodes are connected to an AN
  • The tree does not have to follow the physical topology (shortest connections)
• ANs are connected to other ANs
• Assumes IB/RoCE or another lossless transport
• Candidate topologies for the tree: fat-tree, Dragonfly+, hypercube

Slide 11

3. AGGREGATION PROTOCOL C: Aggregation Group.
• AG: defines the set of peers an AN communicates with (a group), improving resource utilization
  • Supports multiple jobs and multiple groups
  • Builds (parent-child) sub-trees so that less data has to be forwarded to the root
  • The distributed group-creation algorithm is out of scope for the paper
  • ANs hold buffers to avoid the possibility of deadlock
• Aside: "the peers you communicate with"
  • communicator: the communication peers in MPI terms (processes, nodes, etc.); the MPI_Comm type (see the example below)
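
For the communicator analogy, a small C/MPI example (illustrative, not from the paper): MPI_Comm_split carves MPI_COMM_WORLD into sub-communicators, which is the kind of group that Section 4 maps onto SHArP aggregation groups.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks with the same color land in the same sub-communicator. */
    int color = rank % 2;
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub);

    /* Collectives on `sub` involve only that group's members. */
    MPI_Barrier(sub);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}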

Slide 12

3. AGGREGATION PROTOCOL D: Aggregation Operations.
• AO: initiated by a request from a member of each AG
  • A SHArP request message plus the reduction payload
  • Data type, size, number of elements, operation (e.g. Min); a hypothetical rendering of these fields follows
  • When finished, the result is forwarded to the parent AN
• For MPI_Reduce/MPI_Allreduce, IB multicast over the aggregation tree is also used (e.g. to distribute the result)
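
Purely as a mental model, the request fields listed above could be rendered in C as below. This is a hypothetical layout and naming; the actual SHArP wire format and encodings are not given in the paper.

#include <stdint.h>

/* Hypothetical field and enum names for illustration only. */
enum agg_op   { AGG_SUM, AGG_MIN, AGG_MAX, AGG_MINLOC, AGG_MAXLOC,
                AGG_OR, AGG_AND, AGG_XOR };
enum agg_type { AGG_INT32, AGG_UINT32, AGG_INT64, AGG_UINT64,
                AGG_FLOAT32, AGG_FLOAT64 };

struct agg_request {
    uint32_t group_id;       /* which aggregation group this belongs to */
    uint8_t  data_type;      /* one of enum agg_type */
    uint8_t  operation;      /* one of enum agg_op, e.g. AGG_MIN */
    uint16_t element_count;  /* number of elements in the payload */
    uint32_t payload_bytes;  /* size of the reduction payload that follows */
};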

Slide 13

3. AGGREGATION PROTOCOL E: Fault Handling.
• Error classes: transport, end node, SHArP protocol
  • Release AN resources and notify the host process
  • The application then either recovers from the error or aborts
• The Aggregation Manager (AM) handles errors and releases resources
  • Uses SHArP INT and IB network monitoring; MPI/SHMEM semantics do not allow relying on timeouts

Slide 14

3. AGGREGATION PROTOCOL F: SwitchIB-2-Based Aggregation Support.
• The AN logic is implemented as an IB TCA integrated into the switch ASIC
• Operates according to the SHArP protocol
• AN capabilities
  • ALU: 32/64-bit, signed/unsigned integer and float
  • Sum, Min/Max, MinLoc/MaxLoc, bitwise OR/AND/XOR
• Problem: the reduction cannot simply be computed in request-arrival order
  • Floating-point addition/multiplication is commutative but not associative, so results can differ: 10^20 - (10^20 + ε) = 0, whereas (10^20 - 10^20) + ε = ε (demonstrated below)
  • Compute in a fixed order: SwitchIB-2 implements a predictable operation order so that the computation is reproducible
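
A minimal C demonstration of the associativity problem cited above (illustrative only; SwitchIB-2 avoids it by fixing the evaluation order):

#include <stdio.h>

int main(void) {
    float big = 1e20f, eps = 1.0f;

    /* big + eps rounds back to big, so the small term is lost ... */
    float a = big - (big + eps);   /* 0 */
    /* ... while grouping the large terms first preserves it. */
    float b = (big - big) + eps;   /* eps */

    printf("big - (big + eps) = %g\n", a);
    printf("(big - big) + eps = %g\n", b);
    return 0;
}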

Slide 15

4. MPI IMPLEMENTATION
• SHArP groups are mapped onto MPI communicators
  • Group creation, group trimming, and resource allocation
  • RC QPs to the leaf switch are requested via a non-blocking interface
  • MPI_Barrier() -> SHArP barrier
  • MPI_Reduce(), MPI_Allreduce() -> SHArP allreduce work request (sketch below)
  • Outstanding CCOs and their resources are tracked, and new operations are delayed when necessary
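
A minimal sketch of the non-blocking path mentioned above: MPI_Iallreduce posts a work request that a SHArP-enabled library (e.g. HPC-X) may offload and track until completion. Illustrative only, not the paper's implementation.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Post the reduction; the MPI library tracks it as an outstanding
       collective (and, with SHArP, may offload it to the switch). */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... overlap computation while the reduction is in flight ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}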

Slide 16

5. EXPERIMENTS
• 128 nodes, dual-socket 2.6 GHz Intel E5-2697 v3 (Haswell)
• SwitchIB-2 switches, ConnectX-4 HCAs
• RHEL 7.2, OFED v3.2-1.0.1.1 IB verbs
• Latency stabilization
  • Fixed C-states, disabled THP
  • Disabled SMI interrupts, Hyper-Threading, and Turbo Boost in the BIOS

$ cat /etc/cmdline
intel_pstate=disable isolcpus=1 nohz_full=1 rcu_nocbs=1
$ sysctl -a
...
vm.stat_interval = 1200
kernel.numa_balancing = 0
kernel.nmi_watchdog = 0
...
(and disable THP)

Slide 17

5. EXPERIMENTS
• Libraries
  • Pre-release HPC-X with SHArP: Mellanox's HPC software stack
  • MPI collective FCA library v3.4 with SHArP
  • MVAPICH2 v2.2: an open-source MPI implementation
• Performance tests
  • Low-level verbs test; OSU collective latency test v5.2 for MPI_Barrier() and MPI_Allreduce()
  • Single process per host
  • SHArP barrier messages are posted to the HCA -> the SHArP queue is polled and the reduction is executed
  • A single MPI message is split into multiple SHArP messages that are in flight simultaneously

Slide 18

6. MICRO-BENCHMARK RESULTS A. Native SHArP Measurements
Baseline measurements:
- Latency grows only slightly as the host count increases
- Latency grows only slightly as the message count increases
Comparison with and without OS/BIOS tuning:
- At 64 hosts the difference is about 5%
- At 128 hosts it is as much as 10-20%!

Slide 19

6. MICRO-BENCHMARK RESULTS B. MPI-Level SHArP Measurements
Going through MPI adds roughly 20-40% overhead (200-400 ns), independent of the reduction size.
SHArP with hardware offload is consistently faster than HPC-X (IB with multicast).
Host-based IB MVAPICH2 is far slower.
Beyond the maximum payload size latency does increase, but parallelizing and pipelining the reductions keeps the increase in check.

Slide 20

7. OPENFOAM PERFORMANCE
• Application-level performance evaluation
  • OpenFOAM (a CFD application)
  • 1 million cells, using the icoFoam solver
  • Chosen so that barrier operations stay small (with too much waiting you end up measuring the "wait")
  • Application load imbalance should be smaller than the collective operation duration (otherwise you end up measuring the "imbalance")
  • 8-byte MPI_Allreduce() is the most heavily used operation
Compared with the host-based approach:
- Roughly -0.5% to 15% better performance
- Scales more easily
- The gain tops out at around 5%
- Compared with OpenMPI, a 4-50% improvement!

Slide 21

8. DISCUSSION
• Small and medium-size reductions are used frequently in scientific computing
  • Synchronization happens both explicitly and implicitly
  • This is where SHArP's performance gains can contribute
• SwitchIB-2's 36 ports (high radix) connect ANs directly, optimizing latency cost and fan-in; the measurements used a 2-level fat-tree
• To highlight SHArP's in-network approach, comparisons are made against existing host-based approaches
• Latency is comparable to Blue Gene/P
  • No virtual lane has to be dedicated to reserve switch buffers
  • Wider trees and larger node counts can be supported

Slide 22

9. CONCLUSIONS
• SHArP architecture/implementation in the SwitchIB-2 ASIC
  • Improves performance for small-size data reductions
  • Forms aggregation trees and parallelizes them, which is expected to scale out
  • Pipelining the reduction operation allows the reduction size to grow
  • Compared with an HCA (endpoint) approach, a high-radix IB switch makes high fan-in aggregation (shorter data paths) and high parallelism easier

Slide 23

EoP