

#4 “Congestion Control for Large-Scale RDMA Deployments”

cafenero_777

June 06, 2023


Transcript

  1. $ which
     • Congestion Control for Large-Scale RDMA Deployments
     • Yibo Zhu (1,3), Haggai Eran (2), Daniel Firestone (1), Chuanxiong Guo (1), Marina Lipshteyn (1), Yehonatan Liron (2), Jitendra Padhye (1), Shachar Raindel (2), Mohamad Haj Yahia (2), Ming Zhang (1)
     • 1 Microsoft, 2 Mellanox, 3 U. C. Santa Barbara
     • SIGCOMM ’15, August 17–21, 2015, London, United Kingdom
  2. Agenda
     • Overview and why I read this paper
     • Abstract
     • Introduction
     • The need for DCQCN
     • The DCQCN algorithm
     • Buffer Settings
     • Analysis of DCQCN
     • Result
     • Discussion
     • Related work
     • Conclusion and future work
  3. Overview and why I read this paper
     • Overview
       • The ordinary TCP/IP stack cannot deliver high throughput and ultra-low latency
       • RoCEv2, an RDMA transport, has problems when used inside a data center (3-tier Clos)
       • DCQCN improves on the above and performs well inside the DC
     • Why I read this paper
       • I was interested in a good congestion control usable in DCs
       • I was interested in large-scale RDMA environments
  4. Introduction
     • Modern DCs run at 40 Gbps+; the ordinary TCP/IP stack burns CPU, and the CPU should go to VMs in the first place
     • So use RDMA
       • The network protocol is implemented in the NIC (bypasses the host stack)
       • Assumes a loss-less network (to keep the implementation simple)
     • RDMA has problems over IP-routed (IP-Clos) networks
       • Congestion control in particular; hence an improvement within the RoCEv2 standard -> DCQCN
     • A quick history of RDMA
       • InfiniBand (IB) L2 does hop-by-hop, credit-based flow control; loss-lessness keeps L4 simple, so it fits in the NIC
       • The server registers memory buffers with the NIC and the client reads through the NIC -> no added load on the server CPU
       • InfiniBand (IB) is not IP/Ethernet; it needs dedicated network gear -> operational burden
     • RDMA over Converged Ethernet (RoCE)
       • IB L3 -> IP and UDP
       • IB L2 -> Ethernet
       • RoCEv2 needs a lossless L2 -> realized with PFC (Priority-based Flow Control)
       • PFC works per port, so congestion "spreads" and performance degrades
  5. Introduction (Cont.)
     • The fundamental fix is congestion control at the flow level rather than PFC's port level
     • Microsoft DC requirements
       • function over a lossless L3 routed DC network
       • incur low CPU overhead on the end host
       • provide hyper-fast start in the common case of no congestion
     • Existing schemes cannot meet these
       • QCN does not work at L3; DCTCP/iWARP require slow start; DCTCP/TCP-Bolt are software implementations with CPU overhead
     • DCQCN
       • An end-to-end congestion control protocol for RoCEv2, used over an L3 Clos; uses RED/ECN; runs in the NIC
       • Fast convergence (fairness), high bandwidth, small queues, stable (non-oscillating) queue size
  6. The need for DCQCN
     • Comparison with the ordinary TCP stack: RDMA is better (figure latencies: 25.4 us vs 1.7 us / 2.8 us)
     • PFC's limits: congestion spreads
       • PFC prevents switch/NIC buffer overflow
       • When a threshold is exceeded, a PAUSE message is sent
       • When a RESUME message comes back, normal forwarding resumes
       • Operates per port (+ 8 priority classes), not per flow
       • Example 1: unfairness. P2 carries 1 flow while P3/P4 carry many, so P2 is favored
       • Example 2: victim flows. Cascading PAUSEs affect even hosts not attached to the congested switch
     • Shortcomings of existing proposals
       • QCN does offer flow-level congestion control
         • The destination messages the source to lower its transmit rate; switches also tune queue sizes etc.
         • But it is L2-only (MAC address + flow ID)
       • DCQCN changes the signaling to use IP and UDP and runs it in the NIC; switches stay as-is (ASIC respins take too long)
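To make the per-port behaviour above concrete, here is a minimal sketch (not from the paper) of PFC's PAUSE/RESUME logic. The class and threshold names are illustrative; real switches use per-priority buffer accounting and headroom reserved for in-flight data.

```python
# Illustrative PFC sketch: one ingress port, one priority class.
# Thresholds (XOFF/XON) are made-up values, not from the paper.

XOFF_THRESHOLD = 100  # occupancy at which we PAUSE the upstream sender
XON_THRESHOLD = 50    # occupancy at which we RESUME it

class IngressPort:
    """Tracks buffer occupancy for one ingress port / priority class."""

    def __init__(self):
        self.occupancy = 0
        self.paused_upstream = False

    def on_packet_arrival(self, size):
        self.occupancy += size
        # PFC acts per port/priority, not per flow: a single hog flow
        # crossing the threshold pauses EVERY flow sharing this port,
        # which is exactly how congestion "spreads".
        if not self.paused_upstream and self.occupancy >= XOFF_THRESHOLD:
            self.paused_upstream = True
            return "PAUSE"
        return None

    def on_packet_drain(self, size):
        self.occupancy = max(0, self.occupancy - size)
        if self.paused_upstream and self.occupancy <= XON_THRESHOLD:
            self.paused_upstream = False
            return "RESUME"
        return None
```

The victim-flow problem in Example 2 falls out of this directly: once a port is paused, its upstream switch starts buffering, crosses its own threshold, and pauses ports one hop further back.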
  7. The DCQCN algorithm: overview
     • CP Algorithm: congestion point, the switch
       • Works like DCTCP: marks ECN when the queue length exceeds a threshold
       • RED-marks with probability proportional to queue size
     • NP Algorithm: notification point, the receiver
       • On seeing ECN marks, the receiver sends the sender a CNP (Congestion Notification Packet) at most once per 50 us
     • RP Algorithm: reaction point, the sender
       • On receiving a CNP it lowers the send rate; no feedback goes back to the NP
       • If no CNP arrives within 55 us, it decays the rate-reduction factor
       • It additionally uses a byte counter (step up every B bytes) and a timer (recover quickly after T)
     • None of this means PFC goes away
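The RP side above can be sketched in a few lines following the paper's update rules. This is a simplified model, not an implementation: the EWMA gain `G` and additive-increase step `R_AI` are assumed values, and the real NIC schedules these updates per flow in hardware.

```python
# Sketch of the DCQCN reaction-point (RP) rate update.
# G and R_AI are illustrative parameter choices, not the paper's defaults.

G = 1 / 256     # EWMA gain for the congestion estimate alpha
R_AI = 40e6     # additive-increase step in bps (assumed value)

class ReactionPoint:
    def __init__(self, line_rate):
        self.rc = line_rate   # current send rate
        self.rt = line_rate   # target rate to recover toward
        self.alpha = 1.0      # estimate of congestion severity

    def on_cnp(self):
        """A CNP arrived: remember the current rate, cut by alpha/2."""
        self.rt = self.rc
        self.rc *= (1 - self.alpha / 2)
        self.alpha = (1 - G) * self.alpha + G

    def on_alpha_timer(self):
        """No CNP within the window (55 us in the paper):
        decay alpha, which shrinks the size of future rate cuts."""
        self.alpha = (1 - G) * self.alpha

    def fast_recovery(self):
        """Byte counter / timer fired: move halfway back to the target."""
        self.rc = (self.rt + self.rc) / 2

    def additive_increase(self):
        """After enough recovery rounds: probe for extra bandwidth."""
        self.rt += R_AI
        self.rc = (self.rt + self.rc) / 2
```

Note the sender never acknowledges a CNP; the NP keeps emitting CNPs while marks keep arriving, and silence is what lets the RP recover.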
  8. Result
     • Microbenchmarks
       • Model validation
       • A sends to B; 10 msec later C starts sending to B
       • The timer gives fast, stable convergence
     • Benchmarks
       • Modeled on cloud storage
         • user traffic
         • rebuild traffic -> incast-like traffic
       • Built representative traffic patterns based on real traffic (480 machines, 10 million flows)
       • Used 20 machines under a Clos; fixed traffic between 20 pairs
       • User traffic
       • disk rebuild traffic (incast flows); figure throughputs: 3.43 Gbps vs 1.17 Gbps
  9. Result (Cont.)
     • Even more incast
       • DCQCN sustains far more incast traffic (16x)
     • Varying the settings
       • Misconfiguration: the case where ECN does NOT fire before PFC
       • With and without PFC
       • Without slow start, congestion drives bandwidth to zero... PFC is essential (the last line of defense)
     • go-back-N loss recovery scheme (on ConnectX-3 Pro)
       • Adequate while the loss rate is small, but...
     • latency: the 95th-percentile queue length
       • 76.6 KB with DCQCN
       • 162.9 KB with DCTCP
     • Coexistence with other TCP traffic
       • Not a goal of DCQCN, but tagging with 802.1p priorities can isolate it (meaning... use VLANs?)
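The go-back-N scheme mentioned above is simple to state: on a gap at sequence n, the receiver drops everything out of order and the sender resends from n. A hedged one-function sketch (the function name is mine, not the NIC's API):

```python
# Go-back-N retransmission sketch: everything at or after the first
# lost sequence number must be resent, even packets that arrived fine.
def go_back_n_retransmit(in_flight, first_lost_seq):
    """Return the sequence numbers the sender must retransmit."""
    return [seq for seq in in_flight if seq >= first_lost_seq]
```

This is why the slide hedges "adequate while the loss rate is small": a single loss early in a large window forces the whole tail of the window to be resent, so goodput collapses as the loss rate grows.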
  10. Related work
     • iWARP
       • A TCP stack in hardware is complex
       • Only just reached 40G, while RoCEv2 has shipped at 100G
     • TCP-Bolt
       • Relaxes PFC's restrictions
       • Software implementation, so CPU overhead
     • TIMELY
       • Interesting (promising)
     • Userland stacks
       • CPU overhead, and we wanted that CPU for VMs...
  11. Conclusion and future work
     • Responsiveness vs. stability, fought out in the DC
     • DCQCN
       • prevents packet loss quickly
       • relieves end-to-end congestion (comparatively slowly)
       • should scale to 100 and 400 Gbps
  12. EoP