
#50 “Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction”


cafenero_777

January 25, 2024


Transcript

  1. Research Paper Introduction #50 “Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction” (overall #123) @cafenero_777 2024/01/25
  2. Target paper
    • Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
    • Richard L. Graham, et al.
    • Mellanox Technologies, Inc.
    • COM-HPC '16: Proceedings of the First Workshop on Optimization of Communication in HPC (held at SC)
    • https://dl.acm.org/doi/10.5555/3018058.3018059
    • https://network.nvidia.com/sites/default/files/related-docs/solutions/hpc/paperieee_copyright.pdf
    • https://dl.acm.org/doi/proceedings/10.5555/3018058
  3. Agenda
    • Target paper
    • Overview and why I picked it
    1. INTRODUCTION 2. PREVIOUS WORK 3. AGGREGATION PROTOCOL 4. MPI IMPLEMENTATION 5. EXPERIMENTS 6. MICRO-BENCHMARK RESULTS 7. OPENFOAM PERFORMANCE 8. DISCUSSION 9. CONCLUSIONS
  4. Collective Communication?
    • Compute operations used in parallel computation
    • Map/Shuffle/Reduce, AllReduce, AllGather, and so on (a minimal AllReduce sketch follows below)
    • https://www.janog.gr.jp/meeting/janog53/dcqos/ P.35-39
    • https://www.janog.gr.jp/meeting/janog53/ainw/ P.15-17
    • NCCL
      • https://developer.nvidia.com/nccl
    For brevity, "reduction" is abbreviated as "red." in these slides.
    CCO: Collective Communication Operation / CCL: Collective Communication Library
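    As a concrete reference for what an AllReduce does (every rank contributes a vector and every rank receives the element-wise reduction), here is a minimal MPI sketch in C. The buffer contents, size, and the use of MPI_SUM are illustrative assumptions, not taken from the paper.

        /* Minimal AllReduce example: every rank contributes a vector and
         * every rank receives the element-wise sum (illustrative only). */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double local[4]  = { rank + 0.0, rank + 1.0, rank + 2.0, rank + 3.0 };
            double global[4];

            /* Collective reduction: all ranks end up with the same result. */
            MPI_Allreduce(local, global, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            printf("rank %d: global[0] = %f\n", rank, global[0]);
            MPI_Finalize();
            return 0;
        }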
  5. 1. INTRODUCTION
    • History of HPC performance gains: vector supercomputers -> clusters of commodity-CPU PCs -> multi-core CPUs
    • The key is the growing number of compute engines: CPU (x86, Power, ARM), GPGPU, FPGA
    • The network between compute engines matters, but data locality at each C.E. is very strong; it should be distributed appropriately -> co-design is needed
    • Mellanox focuses on CPU offload and aims for low latency through pipeline optimization
      • Usability is achieved by exposing an API to MPI
    • Scalable Hierarchical Aggregation Protocol (SHArP)
      • A protocol definition for reduction operations plus an implementation in the SwitchIB-2 device
      • Using a reduction tree offloads computation from the CPU and also reduces the reduction traffic itself
  6. 2. PREVIOUS WORK
    • Prior work: improving blocking/non-blocking barrier and reduction algorithms
      • Algorithmic variants: sequential, chain, binary, binomial tree
      • Topological variants: mesh, hypercube
    • Performance gains from HW implementations: broadcast, barrier, reduction; a torus interconnect with all-to-all collective communication gave a 2x improvement on IBM Blue Gene
  7. 3. AGGREGATION PROTOCOL
    • Goal of the NW co-processor architecture
      • Minimize CPU utilization by optimizing (high-frequency) communication completion time
      • The first targets are global reduction and barrier synchronization
    • SHArP abstracts the reduction processing
      • AN: Aggregation Node
      • Data is reduced at the ANs while climbing the tree (within the tree handling the operation)
      • Reaching the root completes the global aggregate; this does not depend on the aggregation pattern
      • Processing happens inside the network, so data movement is also minimized; a high-radix switch gives a shallow reduction tree
  8. 3. AGGREGATION PROTOCOL A: Aggregation Nodes.
    • AN: a node of the aggregation tree (toy sketch below)
      • Receives data from its child nodes, reduces it, and forwards the result to its parent (or to each node)
      • Maintains data coherence (barrier synchronization and vector reductions)
      • Possible implementations: a process on a server or switch, or switch hardware
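    To make the aggregation-node role concrete, here is a small single-process C sketch that simulates two levels of a reduction tree: each AN sums the partial results of its children and hands one value to its parent. This is only a toy model of the idea under assumed sizes (16 leaves, radix 4), not the SwitchIB-2 implementation.

        /* Toy model of a reduction tree: each aggregation node (AN) reduces
         * the partial sums of its children and forwards one value upward. */
        #include <stdio.h>

        #define RADIX 4  /* children per AN; SwitchIB-2 itself is high-radix (36 ports) */

        /* One AN: reduce the values arriving from its children. */
        static double an_reduce(const double *child_values, int n_children) {
            double acc = 0.0;
            for (int i = 0; i < n_children; i++)
                acc += child_values[i];   /* fixed order => reproducible result */
            return acc;
        }

        int main(void) {
            /* 16 end nodes, each contributing a partial result. */
            double leaves[16];
            for (int i = 0; i < 16; i++) leaves[i] = (double)i;

            /* Level 1: four leaf ANs, each aggregating RADIX end nodes. */
            double level1[4];
            for (int i = 0; i < 4; i++)
                level1[i] = an_reduce(&leaves[i * RADIX], RADIX);

            /* Level 2: the root AN completes the global aggregate. */
            double global = an_reduce(level1, 4);
            printf("global sum = %f\n", global);   /* 120.0 */
            return 0;
        }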
  9. 3. AGGREGATION PROTOCOL C: Aggregation Group.
    • AG: defines the set of peers an AN communicates with (a group), improving resource utilization
      • Multiple jobs and multiple groups are supported
      • (Parent-child) sub-trees are formed, reducing the data forwarded to the root
      • The algorithm for distributed group creation is out of scope of the paper
      • ANs hold buffers to avoid the possibility of deadlock
    • Note on "communication peers"
      • communicator: the communication peers in the MPI sense (processes, nodes, etc.); the MPI_Comm type
  10. 3. AGGREGATION PROTOCOL D: Aggregation Operations.
    • AO: started by a request from a member of each AG
      • A SHArP request message plus the red. payload (hypothetical struct sketch below)
      • Data type, size, number of elements, operation (e.g. Min)
      • When finished, the result is forwarded to the parent AN
    • For MPI Reduce/AllReduce, IB multicast can also be used over the aggregation tree
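    The slide lists the fields carried in a SHArP request (data type, size, number of elements, operation). The struct below is a purely hypothetical C rendering of such a request header, just to visualize those fields; the field names, widths, and layout are assumptions and do not reproduce the actual SHArP wire format.

        /* Hypothetical sketch of an aggregation request header, only to
         * visualize the fields named on the slide. NOT the real SHArP format. */
        #include <stdint.h>

        enum agg_op       { AGG_SUM, AGG_MIN, AGG_MAX, AGG_MINLOC, AGG_MAXLOC,
                            AGG_OR, AGG_AND, AGG_XOR };
        enum agg_datatype { AGG_INT32, AGG_UINT32, AGG_INT64, AGG_UINT64,
                            AGG_FLOAT32, AGG_FLOAT64 };

        struct agg_request {
            uint32_t          group_id;      /* which aggregation group this belongs to */
            enum agg_datatype dtype;         /* data type of the payload elements */
            enum agg_op       op;            /* reduction operation, e.g. AGG_MIN */
            uint32_t          num_elements;  /* number of elements in the payload */
            uint32_t          payload_bytes; /* total payload size */
            /* reduction payload follows the header */
        };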
  11. 3. AGGREGATION PROTOCOL E: Fault Handling.
    • Error classes: transport, end node, SHArP protocol
      • AN resources are released and the host process is notified
      • The application side either recovers from the error or aborts
    • The Aggregation Manager (AM) handles errors and frees resources
      • It uses SHArP INT and IB network monitoring; timeouts cannot be used under the MPI/SHMEM semantics
  12. 3. AGGREGATION PROTOCOL F: SwitchIB-2-Based Aggregation Support.
    • The AN logic implements the AG as an IB TCA integrated into the switch ASIC
    • It operates according to the SHArP protocol
    • AN features
      • ALU: 32/64-bit, un/signed int/float
      • Sum, Min/Max, MinLoc/MaxLoc, bitwise OR/AND/XOR
    • Computing in request-arrival order "does not work"
      • Floating-point addition and multiplication are commutative but not associative, giving a "zero or not" problem:
        10^20 - (10^20 + ε) = 0, whereas (10^20 - 10^20) + ε = ε
      • Compute in a fixed order: SwitchIB-2 implements a predictable operation order so the computation is reproducible (see the check below)
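    The "zero or not" problem on this slide is easy to reproduce on any host: the evaluation order decides whether the small term survives. A small C check, using 1e20 and 1.0 as stand-ins for the slide's 10^20 and ε:

        /* Floating-point addition is not associative: the order of operations
         * decides whether the small term survives. This is why SwitchIB-2
         * reduces in a fixed, predictable order to get reproducible results. */
        #include <stdio.h>

        int main(void) {
            double big = 1e20, eps = 1.0;

            double a = big - (big + eps);   /* eps is absorbed: result is 0 */
            double b = (big - big) + eps;   /* eps survives: result is eps */

            printf("big - (big + eps) = %g\n", a);   /* 0 */
            printf("(big - big) + eps = %g\n", b);   /* 1 */
            return 0;
        }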
  13. 4. MPI IMPLEMENTATION
    • SHArP groups are mapped to MPI communicators
      • Group creation, group-tree trimming operations, resource allocation
      • RC QPs to the leaf switch are requested via the non-blocking interface
    • MPI_Barrier() : SHArP barrier
    • MPI_Reduce(), MPI_Allreduce() : SHArP allreduce work request
    • Outstanding CCOs and their resources are tracked, and new operations are delayed as needed (sketch below)
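    The mapping on this slide is from standard MPI calls to SHArP work requests, so application code does not change. The sketch below shows the two calls involved, using the non-blocking MPI_Iallreduce since the slide mentions the non-blocking interface; whether the operations actually go through SHArP depends on the MPI stack configuration (HPC-X with SHArP in the paper), which is not shown here.

        /* The application keeps using ordinary MPI collectives; with a
         * SHArP-enabled stack these become SHArP barrier / allreduce work
         * requests and are offloaded to the switch. */
        #include <mpi.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            double local = 1.0, sum = 0.0;

            MPI_Barrier(MPI_COMM_WORLD);                 /* -> SHArP barrier */

            /* Non-blocking allreduce: the library can track the outstanding
             * operation and post new SHArP requests as resources allow. */
            MPI_Request req;
            MPI_Iallreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);

            MPI_Finalize();
            return 0;
        }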
  14. 5. EXPERIMENTS
    • 128 nodes, dual-socket 2.6 GHz Intel E5-2697 v3 Haswell
    • SwitchIB-2, ConnectX-4 HCA
    • RHEL 7.2, OFED v3.2-1.0.1.1 IB verbs
    • Stabilizing latency
      • Pin the C-state, disable THP
      • Disable SMI interrupts and HT/TB in the BIOS
      • $ cat /etc/cmdline
        intel_pstate=disable isolcpus=1 nohz_full=1 rcu_nocbs=1
      • $ sysctl -a
        vm.stat_interval=1200, kernel.numa_balancing=0, kernel.nmi_watchdog=0, and disable THP
  15. 5. EXPERIMENTS
    • Libraries
      • HPC-X pre-release version w/ SHArP: Mellanox's HPC software stack
      • MPI collective FCA library v3.4 w/ SHArP
      • MVAPICH-2 v2.2: an open-source MPI implementation
    • Performance tests
      • Low-level verbs test; OSU collective latency test v5.2 for MPI_Barrier() and MPI_Allreduce()
      • Single process per host
      • SHArP barrier messages are posted to the HCA -> the SHArP queue is polled and the red. is executed
      • One MPI message is split into multiple SHArP messages that are in flight simultaneously
  16. 6. MICRO-BENCHMARK RESULTS A. Native SHArP Measurements
    Baseline measurements:
    - latency grows only slightly as the host count increases
    - latency grows only slightly as the message count increases
    Comparison with and without the OS/BIOS optimizations:
    - with 64 hosts the difference is about 5%
    - with 128 hosts the difference reaches 10-20%!
  17. 6. MICRO-BENCHMARK RESULTS B. MPI-Level SHArP Measurements
    Going through MPI adds roughly 20-40% overhead (200-400 ns), independent of the red. size.
    SHArP with HW offload is consistently faster than HPC-X (IB with multicast).
    The likewise host-based IB MVAPICH2 is far slower.
    Beyond the maximum supported size it does slow down, but parallelizing and pipelining the red. limits the latency increase.
  18. 7. OPENFOAM PERFORMANCE
    • Performance evaluation with an application
      • OpenFOAM (a CFD application), icoFoam solver with a one-million-cell case
      • The barrier operations should be small (if waits dominate, you end up measuring the "waiting")
      • The application load imbalance should be smaller than the collective operation duration (otherwise you end up measuring the "imbalance")
      • The 8-byte MPI_Allreduce() is the most heavily used collective
    Compared with the host-based approach: roughly -0.5% to 15% performance gain
    - scales better
    - the gain tops out at around 5% in most cases
    - compared with OpenMPI, a 4-50% improvement!
  19. 8. DISCUSSION
    • Small- and medium-size red. operations are used frequently in scientific computing
      • Synchronization happens both explicitly and implicitly
      • These are the cases where SHArP's performance gains can contribute
    • SwitchIB-2's 36 ports (high radix) connect ANs directly, optimizing latency cost and port utilization; the measurements used a 2-level fat tree
    • To highlight SHArP's in-network approach, the comparison is against existing host-based approaches
    • Latency is comparable to BG/P
      • No need to dedicate a virtual lane to reserve switch buffers
      • Wider trees and larger node counts can be supported
  20. 9. CONCLUSION
    • SwitchIB-2 ASIC SHArP architecture/implementation
      • Performance gains for small-size data reductions
      • Forming an aggregation tree and parallelizing it is expected to scale out
      • Pipelining the reduction operation lets the supported reduction size grow
      • Compared with an HCA (endpoint) approach, a high-radix IB switch makes high aggregation (shorter data paths) and high parallelism easier