

#15 “An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives”

2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
https://dl.acm.org/doi/10.1109/ISCA45697.2020.00085

cafenero_777

June 14, 2023

Transcript

  1. Research Paper Introduction #15 “An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives” (#58 overall) @cafenero_777 2020/11/12
  2. Agenda • Target paper • Overview and why I wanted to read it 1. Introduction 2. BACKGROUND AND MOTIVATION 3. COLLECTIVE COMMUNICATION PRIMITIVES 4. IN-NETWORK REDUCTIONS 5. METHODOLOGY 6. EVALUATION 7. Related Work 8. Conclusion
  3. $ which • An In-Network Architecture for Accelerating Shared-Memory Multiprocessor

    Collectives • Benjamin Klenk, Nan Jiang, Greg Thorson, Larry Dennison • NVIDIA, USA • 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
  4. Overview and why I wanted to read it • Overview • Accelerates ML/DL All-Reduce processing with HW accelerators instead of the CPU • Introduces a shared-memory NW technique and how it is deployed • Demonstrates in simulation up to an 18x All-Reduce speedup using 128 GPUs • Why I wanted to read it • The term "In-Network Architecture" had caught my attention • HW + distributed-computing cases seem to be increasing, so I wanted to know the latest GPU/NW trends • Distributed GPU/RDMA on large-scale Clos NW, edge computing, etc. • A GPU version of Microsoft's Catapult (inter-FPGA communication) that I had looked into before? • https://www.microsoft.com/en-us/research/publication/configurable-cloud-acceleration/
  5. In-network computing? • The idea of using the various resources (other than the CPU) on the path between endpoints for computing as well, and of designing around the data • GPU (intra-node GPUs, inter-node GPUs) • NW endpoint (NIC, SW) • HW acceleration (NW, RDMA, P4, SmartNIC) • Use cases: stream processing, KVS, storage offload, NW offload, high bandwidth (over 100 Gbps) & power efficiency http://nowlab.cse.ohio-state.edu/static/media/workshops/presentations/exacomm17/exacomm17-invited-talk-gilad-shainer.pdf
  6. 1. Introduction • Big-data analytics • Large data volumes -> GPU, FPGA, TPU, etc. • Higher processing performance + maximizing power efficiency • Scalability problem • The same data is broadcast and then processed • A brief explanation of All-Reduce (next section) • NW offload (RDMA?) is effective for some use cases, but with message passing the CPU-driven central processing is the cost • With distributed shared memory, memory operations are short packets and there are too many packets, so the cost is high • Put the shared memory on the accelerators
  7. 3. COLLECTIVE COMMUNICATION PRIMITIVES (1/2) • Review (part 1) • Multicast: send data to multiple processes • Broadcast (a special case of Multicast) • Gather: collect data • All-to-All: a Gather performed by every process • Reduce: the reverse of Multicast • Collect data from multiple processes and compute on it • Reduce-Scatter: Reduce + Scatter • All-Reduce: a Reduce performed by every process • Reduce + Multicast (a minimal sketch of these primitives follows this slide) http://web.cse.ohio-state.edu/~lu.932/hpbdc2018/slides/fox-slides.pdf
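To make the primitives above concrete, here is a minimal single-process Python sketch (my own illustration, not code from the paper or from NCCL/MPI): each "process" is just an entry in a list, and the functions only model the data-movement semantics.

```python
# Minimal single-process sketch of the collective primitives reviewed above.
# "Processes" are entries in a Python list; no real communication happens.
import numpy as np

def broadcast(data, root, n):
    # every process ends up with a copy of the root's buffer
    return [data[root].copy() for _ in range(n)]

def reduce_(data, root, n):
    # the root collects and sums the contributions of all processes
    out = [None] * n
    out[root] = np.sum(data, axis=0)
    return out

def all_reduce(data, n):
    # Reduce followed by Multicast: every process gets the full sum
    total = np.sum(data, axis=0)
    return [total.copy() for _ in range(n)]

def reduce_scatter(data, n):
    # every process keeps one 1/n chunk of the reduced result
    chunks = np.array_split(np.sum(data, axis=0), n)
    return [chunks[i].copy() for i in range(n)]

if __name__ == "__main__":
    n = 4
    data = [np.full(8, rank, dtype=np.float32) for rank in range(n)]
    print(all_reduce(data, n)[0])      # -> [6. 6. ...] (0+1+2+3)
    print(reduce_scatter(data, n)[2])  # -> the third chunk of the sum
```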
  8. 3. COLLECTIVE COMMUNICATION PRIMITIVES (2/2) • Review (part 2) • NCCL adopts the ring algorithm for All-Reduce (see the sketch after this slide) • Bandwidth is optimal; latency grows in proportion to the number of CPUs (processes). • Shared memory: • Small scale: with memory fences, the synchronization cost grows in proportion to the number of CPUs • Large scale: turn the ring into a tree, explored with double binary tree and two-tree algorithms (NCCL, MPI) https://tech.preferred.jp/ja/blog/prototype-allreduce-library/
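A minimal sketch of the ring All-Reduce mentioned above: a reduce-scatter phase followed by an all-gather phase, each taking N-1 steps. This is my serial simulation of the textbook algorithm, not NCCL's implementation.

```python
# Serial simulation of ring All-Reduce. "Ranks" are list indices; a send/recv
# is modeled by reading the left neighbor's chunk from the previous step.
import numpy as np

def ring_all_reduce(data):
    n = len(data)
    bufs = [d.astype(np.float64).copy() for d in data]
    chunks = [np.array_split(b, n) for b in bufs]

    # Phase 1: reduce-scatter. After N-1 steps, rank r owns the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            src = (r - 1) % n
            c = (r - step - 1) % n
            chunks[r][c] += chunks[src][c]        # add the left neighbor's chunk

    # Phase 2: all-gather. Rotate the finished chunks around the ring.
    for step in range(n - 1):
        for r in range(n):
            src = (r - 1) % n
            c = (r - step) % n
            chunks[r][c] = chunks[src][c].copy()  # take the reduced chunk

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    data = [np.arange(8) * (rank + 1) for rank in range(4)]
    out = ring_all_reduce(data)
    assert all(np.allclose(o, np.sum(data, axis=0)) for o in out)
    print(out[0])  # every rank holds the elementwise sum
```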
  9. 2. BACKGROUND AND MOTIVATION (1/2) • Acceleration system: DGX-2 • 16 GPUs w/ 6 NVLinks each + 16-port NVSW x 12 (fat-tree) • A single global memory space • For memory exchanges, existing protocols have too much overhead • Offload multi-dimensional data transfers with a simple DMA (memory or NW) https://www.hotchips.org/hc30/2conf/2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf
  10. 2. BACKGROUND AND MOTIVATION (2/2) • DL training is demanding... first, model it and make estimates • Based on Volta (120 TFLOPS, 150 GB/s NVLink), 2x and 4x versions are assumed; 10G and 100G links are assumed. • Time spent in All-Reduce • DP: on Ethernet, small sub-batches are at a disadvantage • MP: the larger the computation, the more efficient -> more sensitive to bandwidth. (A rough cost-model sketch follows this slide.)
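As a rough illustration of the kind of back-of-the-envelope estimate this slide refers to, the sketch below uses the standard bandwidth-optimal ring All-Reduce cost 2*(N-1)/N * S / B; the model size and link speeds are assumptions for illustration, not the paper's numbers.

```python
# Rough, illustrative ring All-Reduce time: each byte crosses a link about
# 2*(N-1)/N times, so t ~= 2*(N-1)/N * S / B. All figures below are assumptions.
def ring_allreduce_time(model_bytes, gpus, link_bytes_per_s):
    return 2.0 * (gpus - 1) / gpus * model_bytes / link_bytes_per_s

if __name__ == "__main__":
    model_bytes = 100e6                      # ~100 MB of gradients (assumption)
    for name, bw in [("10G Ethernet", 10e9 / 8),
                     ("100G Ethernet", 100e9 / 8),
                     ("NVLink 150GB/s", 150e9)]:
        for gpus in (16, 128):
            t = ring_allreduce_time(model_bytes, gpus, bw)
            print(f"{name:>15}, {gpus:>3} GPUs: {t*1e3:6.2f} ms per All-Reduce")
```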
  11. 4. IN-NETWORK REDUCTIONS (1/3) • Multicast • Create an MCR (multicast region) memory area • The switch designates the destination GPUs together with the (same) address • Lookup & Forwarding • Multicast information is embedded in the packet's address (a lookup-table sketch follows this slide) (Figure labels: register, allocate, set table, store, Lookup/FWD, Lookup/FWD = Router Daemon)
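A minimal sketch of the lookup-and-forward idea above: an address that falls inside a registered multicast region is replicated to the region's output ports. The region bounds, port numbers, and data layout are illustrative assumptions, not the paper's packet format.

```python
# Sketch of a switch's multicast lookup-and-forward step: key on the address,
# and if it hits a registered MCR, replicate the store to every member port.
from dataclasses import dataclass

@dataclass
class MulticastRegion:
    base: int          # first address of the MCR (made-up value)
    size: int          # region size in bytes
    out_ports: tuple   # switch ports leading to the member GPUs

MCR_TABLE = [
    MulticastRegion(base=0x1000_0000, size=0x10_0000, out_ports=(0, 2, 5, 7)),
]

def forward(addr: int, payload: bytes):
    """Replicate a store whose address hits an MCR; otherwise unicast as usual."""
    for region in MCR_TABLE:
        if region.base <= addr < region.base + region.size:
            return [(port, addr, payload) for port in region.out_ports]
    return [("unicast", addr, payload)]

if __name__ == "__main__":
    print(forward(0x1000_0040, b"\x01" * 16))  # hits the MCR -> replicated to 4 ports
    print(forward(0x2000_0000, b"\x02" * 16))  # miss -> normal unicast
```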
  12. 4. IN-NETWORK REDUCTIONS (2/3) • Pull reduction (pull type): have the data multicast to you (you issue loads) • Requires prior synchronization. Safe. • If the switch's reduction table is full, the request is stalled (pulled again) • No possibility of deadlock • Push reduction (push type): you multicast (you issue write requests) • No prior synchronization needed. Comparatively less safe. • If the switch's reduction table overflows, entries are evicted by LRU -> possible data loss and deadlock. • Either multicast the partial intermediate result • NW load increases, but (total) latency improves • Or send the partial result back to the originating GPU, re-reduce there, then multicast the result • Latency gets worse (a reduction-table sketch follows this slide) (Figure notes, pull: R0, R1: request; P1, P2: response; (1), (2): add a reduction entry for the result; (5): the traffic arriving from (4) is reduced at (2) and forwarded; (3): if an entry cannot be allocated, forward as-is without reducing (7), (8), and stall. Figure notes, push: D0, D1: data; (1), (2): add a reduction entry for the result; (3): multicast the result; (4): if a table entry cannot be allocated, return the data to the originating GPU (GPU1), which reduces and forwards it (5))
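A minimal sketch contrasting the two table-overflow behaviors described above: the pull scheme stalls (and re-pulls) when no reduction-table entry is free, while the push scheme evicts an LRU entry and forwards its partial result downstream. The entry layout and table size are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a switch reduction table with pull (stall) vs. push (LRU-evict)
# overflow policies. Entries accumulate a partial sum until all expected
# contributions have arrived, then the result is multicast.
from collections import OrderedDict

class ReductionTable:
    def __init__(self, capacity, policy):
        assert policy in ("pull", "push")
        self.capacity, self.policy = capacity, policy
        self.entries = OrderedDict()   # addr -> (partial_sum, responses_seen)

    def add(self, addr, value, expected):
        if addr in self.entries:
            self.entries.move_to_end(addr)       # refresh LRU order
            acc, seen = self.entries[addr]
            acc, seen = acc + value, seen + 1
            if seen == expected:                 # reduction complete
                del self.entries[addr]
                return ("multicast", acc)
            self.entries[addr] = (acc, seen)
            return ("accumulating", acc)

        if len(self.entries) >= self.capacity:
            if self.policy == "pull":
                return ("stall", None)           # caller retries the pull later
            victim, (partial, _) = self.entries.popitem(last=False)
            # push: evict the LRU entry; its partial sum is forwarded downstream
            print(f"evict {victim:#x}, forward partial result {partial}")
        self.entries[addr] = (value, 1)
        return ("accumulating", value)

if __name__ == "__main__":
    t = ReductionTable(capacity=2, policy="push")
    print(t.add(0x10, 1.0, expected=2))
    print(t.add(0x20, 2.0, expected=2))
    print(t.add(0x30, 3.0, expected=2))  # triggers LRU eviction of 0x10
    print(t.add(0x20, 5.0, expected=2))  # completes -> ("multicast", 7.0)
```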
  13. 4. IN-NETWORK REDUCTIONS (3/3) • Design considerations • Because All-Reduce (= Reduce + Multicast), twice the internal switch bandwidth is needed • Multicast and unicast always take the same path • Error handling at the data-transfer layer (sequence numbers; retransmission on corruption or duplication, etc.) • The data aggregation order is not guaranteed (= floating-point results can differ) • Endpoints • Limit the number of concurrent reductions + split the work into pipelined stages (wave splitting) • Use the DMA engine. • Pull scheme: synchronize on completion • Push scheme: count the responses (a wave-splitting sketch follows this slide)
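A minimal sketch of the endpoint-side wave splitting and response counting mentioned above: a large reduction is cut into waves, only a bounded number is in flight at once, and (for the push scheme) completion is detected by counting responses. The wave count and in-flight limit are assumptions for illustration.

```python
# Sketch of wave splitting with a cap on in-flight reductions; a chunk is
# retired once the expected number of responses has been counted.
def run_waves(total_chunks, expected_responses, max_in_flight=4):
    in_flight = {}                    # chunk id -> responses received so far
    next_chunk, done = 0, 0
    while done < total_chunks:
        # issue new reductions while the in-flight limit allows
        while next_chunk < total_chunks and len(in_flight) < max_in_flight:
            in_flight[next_chunk] = 0
            next_chunk += 1
        # model the arrival of one response for each outstanding chunk
        for chunk in list(in_flight):
            in_flight[chunk] += 1
            if in_flight[chunk] == expected_responses:
                del in_flight[chunk]  # this wave's reduction is complete
                done += 1
    return done

if __name__ == "__main__":
    print(run_waves(total_chunks=10, expected_responses=3))  # -> 10
```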
  14. 5. METHODOLOGY • HW simulated with BookSim2 • The baseline is the ring-based All-Reduce algorithm already implemented in NCCL • In other words, a comparison against the software implementation • Assumes 150 ns, 25 GB/s switches • Two queues, one each for RX/TX • Packet size is 144 B (payload is 128 B) • A GPU memory access takes 180 cycles; the latency is inserted once per 180 accesses • Investigated for DGX-2 (16 GPUs) and for 128 GPUs • The internal switch topology for the 128-GPU case is shown in the figure on the right
  15. 6. EVALUATION (1/4) • All-Reduce (bandwidth) • Pull achieves the better bandwidth • The overhead of Pull w/ sync becomes negligible for large messages • Table size: Push (256 B) vs. Pull (4 kB) • All-Reduce (time) • For short messages / many endpoints, the gain over the ring algorithm is especially large • Up to an 18x performance improvement! At minimum a 2x improvement over NCCL
  16. 6. EVALUATION (2/4) • NW scalability • Same trend as before • Does not reach the theoretical 100 GB/s • Congestion with many endpoints? • Algorithm improvements • Ring + hierarchical tree algorithm? • Reduction table size sensitivity • The more table (cache) size, the better • The push type is less stable • The pull type stalls (waits) • The push type retransmits (consuming bandwidth)
  17. 6. EVALUATION (4/4) • DL training (simulated with the push type and with NCCL (ring)) • Up to 1.8x faster on Ethernet, up to 1.4x faster on NVLink • ResNet-50 has a small model size (50 MB), so the improvement is small • The smaller the sub-batch size, the larger the gain • The difference between Pull and Push is small (Push is slightly better; Pull is easier to implement) • A 4x evolution in HW performance is "to be expected"
  18. 7. RELATED WORK • IBM BlueGene, etc. • Accelerating collectives • Anton 2 • Pull type with a reservation mechanism for GPU/shared memory • Gradient All-Reduce for ML/DL • Handled well with programmable switches (C++/DPDK/P4) • Mellanox SHArP • Handled well inside an InfiniBand switch NW • Acceleration in the switches • Optimized at rack granularity, etc.?
  19. 8. CONCLUSION • ML/DL training and scientific computing • All-Reduce processing is important • Accelerates inter-GPU communication (NVLink/NVSwitch) • Builds a shared-memory system and compares the pull and push types • Scalability • Up to an 18x performance improvement (at minimum 2x) • Training speed improved 1.4x
  20. References • In-Network Computing • https://www.youtube.com/watch?v=ueDOu-2uAXw • Memory barrier (memory fence) • https://ja.wikipedia.org/wiki/%E3%83%A1%E3%83%A2%E3%83%AA%E3%83%90%E3%83%AA%E3%82%A2 • Two-tree broadcast • https://en.wikipedia.org/wiki/Two-tree_broadcast • Master's thesis: Research on efficient simulation of networks-on-chip • https://www.arch.cs.titech.ac.jp/a/thesis/Mthesis-2014-02-ueno.pdf • Technology behind distributed deep learning: the AllReduce algorithm • https://tech.preferred.jp/ja/blog/prototype-allreduce-library/ • NVIDIA NCCL • https://developer.nvidia.com/nccl • NVSWITCH AND DGX-2 NVLINK-SWITCHING CHIP AND SCALE-UP COMPUTE SERVER • https://www.hotchips.org/hc30/2conf/2.01_Nvidia_NVswitch_HotChips2018_DGX2NVS_Final.pdf • map and collect, reduce and inject: how naming reflects a difference in thinking • https://magazine.rubyist.net/articles/0038/0038-MapAndCollect.html • Twister2: A High-Performance Big Data Programming Environment • http://web.cse.ohio-state.edu/~lu.932/hpbdc2018/slides/fox-slides.pdf • Collective communication • http://www.cv.titech.ac.jp/~hiro-lab/study/mpi_reference/chapter3.html
  21. EoP