
Technology-Driven, Highly-Scalable Dragonfly Topology

ISCA '08 (International Symposium on Computer Architecture)
https://dl.acm.org/doi/abs/10.1145/1394608.1382129
https://iscaconf.org/isca2008/

cafenero_777

May 11, 2023


Transcript

  1. Agenda
    • Target paper
    • Overview and why I read it
    • 1. Introduction
    • 2. Technology Model
    • 3. Dragonfly Topology
    • 4. Routing
    • 5. Cost Comparison
    • 6. Related Work
    • 7. Conclusion
  2. Target paper
    • Technology-Driven, Highly-Scalable Dragonfly Topology
      • John Kim, William J. Dally, Steve Scott, Dennis Abts
      • Northwestern University, Stanford University, Cray Inc., Google Inc.
      • ISCA '08 (International Symposium on Computer Architecture)
      • https://dl.acm.org/doi/abs/10.1145/1394608.1382129
      • https://iscaconf.org/isca2008/
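  3. Overview and why I read it
    • Overview
      • Advances in ASICs make high-radix routers feasible, but cable length and cable count drive the cost
      • The paper addresses this with routers, virtual routers, and the Dragonfly topology: about 50% of the cost of a Clos network
      • By distinguishing virtual channels and introducing credits that detect congestion, adaptive routing with near-ideal throughput and latency becomes possible
    • Why I read it
      • I wanted to understand the outline of the Dragonfly topology
      • Aquila at Google also uses a Dragonfly topology
      • Dragonfly topologies (DT) came up at IETF 116 rtgwg and caught my interest // I could not follow the content at all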
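  4. 1. INTRODUCTION
    • The supercomputing world is moving to radix-64 Clos networks (away from the 3D torus)
    • More ports (radix), longer distances, and fiber -> expensive
      • Dragonfly topology: group routers together to increase the effective radix
    • Example: what it takes to reach any destination with one global (long-distance) hop: Fig. 1
      • Required radix is about 2*sqrt(N): radix 64 or 128 is feasible, but N=1M is out of reach!
      • Accommodating 100k hosts would need a 600-port switch?!
    • Routing control over global channels
      • Between directly connected global channels, UGAL (Universal Globally Adaptive Load-balanced routing) can be used
      • The local -> global -> ... case should also be optimized, so the paper proposes:
        • Bandwidth: UGAL_VC_H
        • Latency: UGAL_CR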
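The radix figures on this slide follow from the 2*sqrt(N) rule of thumb. The sketch below only evaluates that expression; the function name is made up for illustration, and rounding up to a whole port count is an assumption.

```python
import math

def radix_for_one_global_hop(n_terminals: int) -> int:
    """Approximate router radix k ~ 2*sqrt(N) quoted on the slide.

    Rounded up to a whole number of ports (an assumption for illustration).
    """
    return math.ceil(2 * math.sqrt(n_terminals))

if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(f"N={n:>9,d} -> radix ~ {radix_for_one_global_hop(n)}")
```

For N = 100,000 this gives roughly 630 ports, which is the "600-port switch" remark above; N = 1M would need a radix of about 2000.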
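  5. 2. Technology Model
    • Interconnect hierarchy
      • Turnaround inside the ASIC
      • Turnaround over the backplane (midplane)
      • Global: turnaround between racks // the dominant cost
    • From copper to fiber (using active optical cables, AOC)
      • The cost crossover is at around 10 m
      • This paper: reduce the number of cables and let them be longer
    • Design the topology around cable length, high radix, and their characteristics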
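  6. 3. Dragonfly Topology 1/2
    • Terminal (server) connections per router: p
    • R: a group contains a routers; a group ~= a virtual router
    • Connections to other routers in the group: a-1
    • Connections to other groups per router: h
    • Radix of each router: k = p + h + a - 1
    • Per group: a*p terminal ports and a*h global ports
    • Radix of a group: k' = a*(p+h), so k' >> k
      • Direct connections become easier (the global diameter shrinks)
    • Limits reachable with one global hop
      • g: up to ah+1 groups
      • N: up to ap(ah+1) terminals
    • Inside a group: fully connected topology
    • Between groups: any topology will do
    • With about 50 ports per router, 100k hosts can be accommodated!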
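The formulas on this slide can be checked numerically. The sketch below is a minimal calculator for k, k', g and N. The class and function names are illustrative only, and the balanced sizing a = 2p = 2h used in `smallest_balanced` is an assumption taken from the paper's load-balancing recommendation (the slide itself does not state it, although p=h=4, a=8 in the evaluation matches it).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dragonfly:
    p: int  # terminals per router
    a: int  # routers per group
    h: int  # global channels per router

    @property
    def router_radix(self) -> int:      # k = p + h + a - 1
        return self.p + self.h + self.a - 1

    @property
    def group_radix(self) -> int:       # k' = a * (p + h)
        return self.a * (self.p + self.h)

    @property
    def max_groups(self) -> int:        # g <= a*h + 1
        return self.a * self.h + 1

    @property
    def max_terminals(self) -> int:     # N <= a*p*(a*h + 1)
        return self.a * self.p * self.max_groups

def smallest_balanced(n_hosts: int) -> Dragonfly:
    """Smallest configuration with a = 2p = 2h (assumed balance rule) that fits n_hosts."""
    h = 1
    while True:
        d = Dragonfly(p=h, a=2 * h, h=h)
        if d.max_terminals >= n_hosts:
            return d
        h += 1

if __name__ == "__main__":
    small = Dragonfly(p=2, a=4, h=2)
    print(small.router_radix, small.group_radix, small.max_terminals)  # 7 16 72
    big = smallest_balanced(100_000)
    print(big, big.router_radix, big.max_terminals)  # radix 51, 114582 terminals
```

Under that balance assumption, a radix-51 router (p=h=13, a=26) already covers more than 100k hosts, which is consistent with the "about 50 ports" remark.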
  7. 3. Dragonfly Topology 2/2
    • Example: p=h=2, a=4, k=7 -> k' = 16
    • Many variations are possible
      • 2D flattened butterfly: same k'=16 as Fig. 5, but with higher locality between router pairs
      • Slightly larger scale, 3D flattened butterfly: k'=32, N=1056
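  8. 4. Routing: Routing on the Dragonfly
    • Shortest path (MIN)
      • If Gs != Gd and Rs is not directly connected to Gd, forward to Ra, the router that holds the global connection, and arrive at Rb
      • If Rb != Rd, route inside Gd to reach Rd
    • For more load balancing -> apply Valiant's algorithm and go through an intermediate group Gi (but the hop count increases, so this is not something to do lightly)
      • If Gs != Gd and Rs is not directly connected to Gi, forward to Ra, the router that holds the global connection to Gi
      • If Gs != Gd, forward from Ra to Rx in Gi
      • If Gi != Gd and Rx is not connected to Gd, forward inside Gi from Rx to Ry, which has the connection to Gd
      • If Gi != Gd, forward from Ry to Rb in Gd
      • If Rb != Rd, route inside Gd from Rb to Rd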
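The two procedures above can be traced in a few lines of code. The following is a sketch only: it uses the small p=h=2, a=4 example, and the `gateway` wiring rule (which router in a group owns the global link to which peer group) is a hypothetical arrangement chosen for illustration, not the paper's wiring.

```python
import random

A, H = 4, 2          # routers per group, global channels per router
G = A * H + 1        # number of groups at maximum size (ah + 1)

def gateway(group: int, peer: int) -> int:
    """Router in `group` owning the single global link to `peer` (hypothetical wiring)."""
    idx = peer if peer < group else peer - 1   # index among the a*h global links
    return idx // H

def minimal_route(src, dst):
    """Routers visited from src to dst, as (group, router) pairs, endpoints included."""
    (gs, rs), (gd, _) = src, dst
    path = [src]
    if gs != gd:
        ra = gateway(gs, gd)
        if ra != rs:
            path.append((gs, ra))              # local hop to Ra
        path.append((gd, gateway(gd, gs)))     # global hop to Rb
    if path[-1] != dst:
        path.append(dst)                       # local hop to Rd inside Gd
    return path

def valiant_route(src, dst, rng=random):
    """Route via a randomly chosen intermediate group Gi when Gs != Gd."""
    gs, gd = src[0], dst[0]
    if gs == gd:
        return minimal_route(src, dst)
    gi = rng.choice([g for g in range(G) if g not in (gs, gd)])
    first = minimal_route(src, (gi, gateway(gi, gs)))   # ends at Rx in Gi
    second = minimal_route(first[-1], dst)              # Rx -> Ry -> Rb -> Rd
    return first + second[1:]

if __name__ == "__main__":
    print(minimal_route((0, 0), (8, 3)))   # [(0, 0), (0, 3), (8, 0), (8, 3)]
    print(valiant_route((0, 0), (8, 3), random.Random(0)))
```

With this wiring a minimal route takes at most three hops (local, global, local) and a Valiant route at most five, which is the extra hop cost the slide warns about.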
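  9. 4. Routing: Evaluation
    • Algorithms: evaluated with Minimal (MIN) and Valiant (VAL)
      • 1k nodes: p=h=4, a=8
    • UGAL: Universal Globally-Adaptive Load-balanced
      • Chooses MIN or VAL from queue occupancy (Q) and hop count
      • UGAL-L: decides using only the local router's Q
      • UGAL-G: decides using the Q of every router in Gs
    • (a) Uniform random (UR) traffic
      • MIN is best
      • VAL is worst, at exactly half the performance
    • (b) Worst-case (WC) traffic
      • MIN is worst: it clogs (only 1/ah of the bandwidth can be delivered)
      • VAL and UGAL-G are high, but top out at 50% of the bandwidth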
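A minimal sketch of the MIN-or-VAL choice. The slide only says that queue occupancy and hop count are used; comparing the products q*H is the usual UGAL formulation and is an assumption here. The function names are illustrative.

```python
def ugal_choose(q_min: int, hops_min: int, q_val: int, hops_val: int) -> str:
    """Pick the path with the smaller (queue occupancy x hop count) product.

    Assumption: the product rule; ties go to the minimal path.
    """
    return "MIN" if q_min * hops_min <= q_val * hops_val else "VAL"

def min_worst_case_throughput(a: int, h: int) -> float:
    """Worst-case throughput bound for MIN quoted on the slide: 1/(a*h)."""
    return 1 / (a * h)

if __name__ == "__main__":
    print(ugal_choose(q_min=1, hops_min=3, q_val=1, hops_val=5))   # MIN
    print(ugal_choose(q_min=2, hops_min=3, q_val=1, hops_val=5))   # VAL
    print(min_worst_case_throughput(a=8, h=4))                     # 0.03125
```

For the evaluated network (a=8, h=4) the MIN bound is 1/32 of the bandwidth under worst-case traffic, against the 50% ceiling of VAL.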
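  10. 4. Routing: Indirect Adaptive Routing
    • The difficulty: what needs to be observed is not the output of a single router but the utilization of the group's outputs (its global channels)
      • In other words, this is an indirect routing problem
      • The router that owns the global channel has to be chosen using local-router information (indirect information)
      • Look at the local router's Q -> infer pressure on the global channel from pressure on the local channel
      • The effect is especially pronounced when the local router is over-provisioned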
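  11. 4. Routing: Problem 1: Limited throughput
    • Example: from R1, q1 and q2 saturate toward gc6 (1 hop) and gc7 (2 hops)
      • The 1-hop path is preferred, so R1 clogs
      • -> gc6/gc7 bandwidth is free but goes unused
    • Improvement: UGAL-L_VC
    • Further improvement: UGAL-L_VC_H
    • 80% performance under UR, slight latency increase under WC
    • Notes on the selection rules (H: hop count, Out: output port)
      • Choose when Q*H per VC is favorable
      • If the output ports differ: choose when Q*H is favorable
      • If the output port is the same: choose when it is favorable per VC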
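The notes above can be written out as two small predicates. This is a sketch under assumptions: the `Candidate` fields and function names are invented for illustration, and "favorable" is read as the minimal path having the smaller or equal Q*H product.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    out_port: int   # Out: output port the path would use at this router
    hops: int       # H: hop count of the path
    q_port: int     # total queue occupancy of that output port
    q_vc: int       # occupancy of the virtual channel the packet would use

def ugal_l_vc(m: Candidate, nm: Candidate) -> str:
    """UGAL-L_VC sketch: compare per-VC occupancy times hop count."""
    return "MIN" if m.q_vc * m.hops <= nm.q_vc * nm.hops else "VAL"

def ugal_l_vc_h(m: Candidate, nm: Candidate) -> str:
    """UGAL-L_VC_H sketch: port totals when ports differ, per-VC when they coincide."""
    if m.out_port != nm.out_port:
        take_min = m.q_port * m.hops <= nm.q_port * nm.hops
    else:
        take_min = m.q_vc * m.hops <= nm.q_vc * nm.hops
    return "MIN" if take_min else "VAL"

if __name__ == "__main__":
    minimal = Candidate(out_port=1, hops=3, q_port=8, q_vc=6)
    nonmin = Candidate(out_port=1, hops=5, q_port=8, q_vc=2)
    print(ugal_l_vc_h(minimal, nonmin))   # VAL: same port, so the per-VC view decides
```

The point of the hybrid is that when both paths leave through the same port, the port total cannot separate them, so only the per-VC occupancy carries information.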
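  12. 4. Routing: Problem 2: Higher intermediate latency 1/3
    • Latency changes with the amount of buffering
      • Especially pronounced for short packets
    • Example: when packets arrive at R1 from gc0 and gc7, which way should they be sent back out?
      • q0 and q3 are not visible from R1 -> q1 and q2 are used instead
      • This is correct for throughput, but it loses on latency
      • That is, traffic does not move to the non-minimal route until Q saturates. This is the cause of the net latency increase
    • The shallower the buffers, the more easily back pressure from the global Q propagates: Fig. 14
      • Throughput is sacrificed, however
    • Latency distribution: short packets see higher latency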
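  13. 4. Routing: Problem 2: Higher intermediate latency 2/3
    • Proposed method for latency: credit round-trip latency
      • Let R0 be upstream and R1, R2 its downstream routers
      • When a packet is forwarded to the downstream router, the credit count decreases (flit)
      • When the downstream router finishes forwarding, it increments the upstream credit count (credit)
      • At zero load the round trip is t_crt0. As load increases, t increases too
    • The congestion state of the global channel is estimated from this t
      • For each output port O: t_d(O) = t_crt(O) - t_crt0
      • Credits are returned after a delay of t_d(O) - min[t_d(o)]
      • Taking the difference from the minimum ~= taking the spread (it carries that router's load information)
    • This looks like the behavior of shallow buffers, but the whole buffer is still usable, so high throughput is possible
      • Congestion is noticed earlier than waiting for Q to fill up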
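The credit-delay computation can be illustrated numerically. The sketch below evaluates t_d(O) = t_crt(O) - t_crt0 and the delay t_d(O) - min[t_d(o)] for a set of ports; the function name, the dictionary layout, and the cycle counts in the example are made up for illustration.

```python
def credit_delays(t_crt: dict[int, int], t_crt0: int) -> dict[int, int]:
    """Per-port credit delay, in the same time unit as the inputs.

    t_crt  : measured credit round-trip latency per output port O
    t_crt0 : zero-load credit round-trip latency
    """
    t_d = {port: t - t_crt0 for port, t in t_crt.items()}   # t_d(O) = t_crt(O) - t_crt0
    floor = min(t_d.values())                                # min over all ports o
    return {port: d - floor for port, d in t_d.items()}      # t_d(O) - min[t_d(o)]

if __name__ == "__main__":
    # Hypothetical measurements in cycles: port 1 sits in front of a congested channel.
    print(credit_delays({0: 12, 1: 20, 2: 15}, t_crt0=10))   # {0: 0, 1: 8, 2: 3}
```

The least-loaded port always gets zero extra delay, so only the relative congestion between ports is fed back upstream.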
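  14. 4. Routing: Problem 2: Higher intermediate latency 3/3
    • Evaluation of the proposed method UGAL-L (cr): see the figure
      • [Figure panels: buffer size 16 (WC, UR) and buffer size 256 (WC, UR)]
      • Improves intermediate latency by 30% and 200% compared with UGAL-L (vc-h)
      • Intermediate latency does not depend on buffer size
      • Worse than UGAL-G (because part of the traffic is routed non-minimally)
    • Required functions
      • t_crt measurement and credit tracking
        • To identify packets, push a timestamp onto a simple queue (CTQ)
        • Pop it when the credit comes back
      • A register that holds the t_d value
        • Simply stores it
      • A delay mechanism for returning credits
        • This one has to be built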
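The timestamp-queue idea for measuring t_crt can be sketched as follows. This is an illustration under assumptions: one queue per output channel, credits returning in the same order the flits were sent, and a class and method names that are invented here rather than taken from the paper.

```python
from collections import deque

class CreditTimer:
    """Measures credit round-trip latency for one output channel (sketch)."""

    def __init__(self) -> None:
        self.ctq: deque[int] = deque()   # timestamps of flits still awaiting a credit
        self.t_crt: int | None = None    # latest measured round-trip latency

    def on_flit_sent(self, now: int) -> None:
        self.ctq.append(now)             # push the send time

    def on_credit_returned(self, now: int) -> int:
        sent = self.ctq.popleft()        # oldest outstanding flit (FIFO assumption)
        self.t_crt = now - sent
        return self.t_crt

if __name__ == "__main__":
    timer = CreditTimer()
    timer.on_flit_sent(now=0)
    timer.on_flit_sent(now=2)
    print(timer.on_credit_returned(now=10))   # 10
    print(timer.on_credit_returned(now=15))   # 13
```

Because credits carry no identifier, matching them to flits by queue order is what keeps the hardware to a simple FIFO of timestamps.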
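  15. 5. Cost Comparison 1/2
    • (Folded) Clos -> flattened butterfly topology: 50% cost reduction
      • Lower cost from removing intermediate routers and channels
    • Dragonfly topology: even more scalable, lower cost
      • Example: 16 routers (= 1 group = 256 nodes) * 16 * 16 = 64k nodes
      • Flattened butterfly (FBT): 16 connections from each group in each of two directions (adds 2 dimensions)
        • 50% of connections are global
        • Scaling means adding a dimension (hard)
      • Dragonfly (DT): one connection from a group to all R (adds 1 dimension)
        • Half as many global connections as FBT
        • 25% of connections are global
        • Scaling means growing the group size (easy)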
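  16. 5. Cost Comparison 2/2
    • Hop counts are about the same
    • Cable length is slightly worse for DT
      • Reduce the number of global cables and move to new signaling technology
    • Cost by number of hosts accommodated
      • Up to 1k: routers are fully connected, so BT and DT are the same
      • At 4k: 10% lower cost than BT (because cables are shorter)
      • Above 4k: 20% lower cost than BT (because global cables are long but few)
      • The 3D torus is expensive because it needs many cables
      • 50% lower cost than folded Clos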
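  17. Related Work
    • ROENet: Scalable Opto-Electronic Network
      • Built from a global switch that connects multiple subnets (groups)
      • Requires intermediate routers (DT has a flat hierarchy)
    • Various hierarchical topologies
      • DT is hierarchical too, but differs in keeping the hop count low while reducing global connections
      • Tree structures -> worse bandwidth and latency
      • Cube-connected cycles -> cannot exploit the increase in effective radix
    • Importance of signaling technology
      • The arrival of cheap, long-reach cable transmission changes the topology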
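  18. Conclusion
    • Dragonfly topology
      • Exploits a high effective radix to optimize network size (diameter), cost, and latency
      • Reduces global cables to optimize cost
        • 20% lower than flattened BT, 50% lower than folded Clos
      • Routing challenges: proposes virtual-channel-based and credit-based congestion control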
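  19. Bonus: the rtgwg discussion at IETF 116
    • Routing in Dragonfly Topologies, IETF 116 Yokohama
    • New Topologies for Data Center
      • Dragonfly+ recap
        • Inside a group: use Clos
        • Between groups: drop the full mesh (allow 2 or more hops)
      • Open issues as I understood them
        • BGP: min+1 works with ordinary routing, but min+3 needs source routing
        • Min for Core, ECMP/WCMP for Pods (?!)
        • One would like to use min and non-min selectively, but ECMP cannot do that
        • Global uses path properties, while local uses Q? That seems unworkable
        • To adjust proactively rather than reactively, one would rather use Q than BGP. RTT is about 10 us, so it is very fast, and there are many of them
        • How about using ECN or the flow label (for path mapping)?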
  20. References
    • Dragonfly
      • Routing in Dragonfly Topologies, IETF 116 Yokohama // IETF 116 material
      • Dragonfly+: Low Cost Topology for Scaling Datacenters // an accessible explanation of Dragonfly+
      • Exascale HPC Fabric Topology // accessible (as above)
      • Aquila: A unified, low-latency fabric for datacenter networks // a use case at Google
    • References
      • Load-Balanced Routing in Interconnection Networks // a PhD thesis!
      • The BlackWidow High-radix Clos Network
      • The Cube-Connected Cycles: A Versatile Network for Parallel Computation
  21. Memo
    • The cube-connected cycles: a versatile network for parallel computation
      • https://dl.acm.org/doi/10.1145/358645.358660