Slide 1

Research Paper Introduction #4 “Congestion Control for Large-Scale RDMA Deployments” @cafenero_777 2019/08/13

Slide 2

$ which • Congestion Control for Large-Scale RDMA Deployments • Yibo Zhu (1,3), Haggai Eran (2), Daniel Firestone (1), Chuanxiong Guo (1), Marina Lipshteyn (1), Yehonatan Liron (2), Jitendra Padhye (1), Shachar Raindel (2), Mohamad Haj Yahia (2), Ming Zhang (1) • (1) Microsoft (2) Mellanox (3) U. C. Santa Barbara • SIGCOMM '15, August 17–21, 2015, London, United Kingdom

Slide 3

Agenda • Overview and why I read this paper • Abstract • Introduction • The need for DCQCN • The DCQCN algorithm • Buffer Settings • Analysis of DCQCN • Result • Discussion • Related work • Conclusion and future work

Slide 4

Overview and why I read this paper • Overview • The ordinary TCP/IP stack cannot deliver high throughput and ultra-low latency • RoCEv2, an RDMA transport, has issues when deployed inside a datacenter (3-tier Clos) • DCQCN improves on it and achieves good performance inside the DC • Why I read it • I was interested in good congestion control usable in DCs • I was interested in large-scale RDMA environments

Slide 5

Introduction • Today's DCs run at 40Gbps+; the ordinary TCP/IP stack loads the CPU, and that CPU should go to VMs in the first place • So use RDMA • The network protocol is implemented in the NIC (bypasses the host stack) • Assumes a loss-less network (to keep the implementation simple) • RDMA has deployment issues on IP-routed (IP Clos) networks • Congestion control is the main problem, so it is improved within the RoCEv2 standard -> DCQCN • A quick look back at RDMA history • InfiniBand (IB) L2 does hop-by-hop, credit-based flow control; being loss-less keeps L4 simple enough to implement in the NIC • The server registers memory buffers with the NIC and the client reads via the NIC -> no extra load on the server CPU • InfiniBand (IB) is not IP/Ethernet; it needs dedicated network gear -> operational burden • RDMA over Converged Ethernet (RoCE) • IB L3 -> IP and UDP • IB L2 -> Ethernet • RoCEv2 needs a lossless L2 -> realized with PFC (Priority-based Flow Control) • Because PFC works per port, congestion "spreads" and performance degrades

Slide 6

Introduction (Cont.) • The essential fix is to drive this kind of control at flow level rather than with per-port PFC • Microsoft's DC requirements • function over a lossless L3-routed DC network • incur low CPU overhead on the end host • provide hyper-fast start in the common case of no congestion • Existing schemes cannot meet these • QCN does not work across L3; DCTCP/iWarp need slow start; DCTCP/TCP-Bolt are software implementations, hence CPU overhead • DCQCN • An end-to-end congestion control protocol for RoCEv2, used over an L3 Clos; uses RED/ECN; runs on the NIC • Fast convergence (fairness), high bandwidth, small queues, and stable (non-oscillating) queue size
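The RED/ECN marking that DCQCN relies on at the switch can be sketched as below. This is a minimal illustration of RED-style probabilistic marking, not the paper's exact switch configuration; `KMIN_KB`, `KMAX_KB`, and `PMAX` are made-up illustrative values.

```python
import random

# Illustrative thresholds (assumed values, not the paper's settings)
KMIN_KB = 5.0    # below this queue depth: never mark
KMAX_KB = 200.0  # above this queue depth: always mark
PMAX = 0.01      # marking probability reached at KMAX_KB

def mark_probability(queue_kb):
    """RED-style marking probability: rises linearly with queue depth
    between KMIN_KB and KMAX_KB, and is 1.0 above KMAX_KB."""
    if queue_kb <= KMIN_KB:
        return 0.0
    if queue_kb >= KMAX_KB:
        return 1.0
    return PMAX * (queue_kb - KMIN_KB) / (KMAX_KB - KMIN_KB)

def should_mark_ecn(queue_kb):
    """Decide whether to set the ECN CE bit on a departing packet."""
    return random.random() < mark_probability(queue_kb)
```

Marking probabilistically in proportion to queue depth (rather than with a hard cutoff) is what lets the switch signal congestion early, before PFC thresholds are reached.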

Slide 7

The need for DCQCN • Comparison with the ordinary TCP stack • RDMA comes out ahead (chart residue: latencies 25.4us / 1.7us / 2.8us) • PFC's limit: the congestion-spreading problem • PFC prevents buffer overflow at the switch/NIC • Send a PAUSE message when the queue crosses a threshold • Return to normal when a RESUME message arrives • Done per port (+8 priority classes), not per flow • Example 1: fairness problem; P2 carries 1 flow while P3/P4 carry many, so P2 gains an advantage • Example 2: victim problem; when PAUSEs cascade, even hosts not directly attached to the switch are affected • Shortcomings of existing proposals • QCN does offer flow-level congestion control • The destination sends messages back to the source to lower its transmit rate; the switch also tunes queue sizes etc. • But it is L2-only (MAC address + flow ID) • Changing it to use IP and UDP and running it on the NIC leaves the switch as-is (ASIC modifications take too long)
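The per-port PAUSE/RESUME behaviour described above can be sketched as a pair of watermarks on a switch ingress queue. This is a simplified illustration; the watermark values and the class/port granularity details are assumptions, not the paper's configuration.

```python
class PfcIngressPort:
    """Toy model of PFC on one switch ingress port (one priority class):
    PAUSE upstream when the queue crosses a high watermark, RESUME when
    it drains below a low watermark. Thresholds are illustrative."""

    def __init__(self, xoff_kb=100.0, xon_kb=50.0):
        self.xoff_kb = xoff_kb  # high watermark -> send PAUSE (assumed)
        self.xon_kb = xon_kb    # low watermark -> send RESUME (assumed)
        self.queue_kb = 0.0
        self.paused = False     # whether upstream is currently paused

    def enqueue(self, kb):
        """Account for arriving bytes; emit "PAUSE" on crossing XOFF."""
        self.queue_kb += kb
        if not self.paused and self.queue_kb >= self.xoff_kb:
            self.paused = True
            return "PAUSE"
        return None

    def dequeue(self, kb):
        """Account for departing bytes; emit "RESUME" on draining to XON."""
        self.queue_kb = max(0.0, self.queue_kb - kb)
        if self.paused and self.queue_kb <= self.xon_kb:
            self.paused = False
            return "RESUME"
        return None
```

Note that the PAUSE stops everything sharing the port and priority class, not just the flow that filled the queue; that coarseness is exactly what produces the fairness and victim problems in the two examples above.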

Slide 8

The DCQCN algorithm: overview • CP Algorithm: congestion point, the switch • Works like DCTCP: when queue length crosses a threshold, packets get ECN-marked • RED-style marking in proportion to queue size • NP Algorithm: notification point, the receiver • On receiving ECN marks, the receiver sends the sender a CNP (Congestion Notification Packet) at most once per 50us • RP Algorithm: reaction point, the sender • On receiving a CNP, the sender lowers its transmit rate; there is no feedback back to the NP • If no CNP arrives within 55us, the rate-reduction factor is itself reduced (recovery begins) • It additionally uses a byte counter (steps up every B bytes) and a timer (restores the rate after T seconds) • None of this means PFC goes unused
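The RP (sender) side above can be sketched directly from the paper's rate-update rules: a CNP cuts the current rate by a factor derived from the congestion estimate alpha, the absence of CNPs decays alpha, and the byte counter/timer drive recovery back toward the target rate. The line rate, `G`, and `R_AI` below are illustrative parameters, and the byte-counter/timer bookkeeping itself is elided.

```python
LINE_RATE = 40.0  # Gbps line rate (assumed for illustration)
G = 1.0 / 256.0   # gain for the alpha estimate
R_AI = 0.04       # additive-increase step in Gbps (assumed)

class DcqcnSender:
    """Toy DCQCN reaction point: tracks current rate (rc), target
    rate (rt), and the congestion-extent estimate alpha."""

    def __init__(self):
        self.rc = LINE_RATE   # current sending rate
        self.rt = LINE_RATE   # target rate to recover toward
        self.alpha = 1.0      # congestion-extent estimate

    def on_cnp(self):
        """CNP received: raise alpha, remember the current rate as the
        target, and cut the current rate by alpha/2."""
        self.alpha = (1 - G) * self.alpha + G
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)

    def on_alpha_timer(self):
        """No CNP within the alpha-update period (55us): decay alpha,
        which shrinks future rate cuts."""
        self.alpha = (1 - G) * self.alpha

    def fast_recovery_step(self):
        """Byte counter or timer fired: move halfway back to the target."""
        self.rc = (self.rt + self.rc) / 2

    def additive_increase_step(self):
        """Later recovery stages: probe beyond the old target rate."""
        self.rt = min(LINE_RATE, self.rt + R_AI)
        self.rc = (self.rt + self.rc) / 2
```

The halving-toward-target shape is what gives DCQCN its fast recovery after a transient congestion episode, while alpha decay keeps repeated CNPs from collapsing the rate.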

Slide 9

Result • Microbenchmarks • Model validation • A -> B starts sending; 10msec later C -> B starts • The timer makes convergence fast and stable • Benchmarks • Models a cloud storage workload • user traffic • rebuild traffic -> incast-like traffic • Characteristic traffic patterns were synthesized from real traces (480 machines, 10 million flows) for the benchmark • Used 20 machines under a Clos fabric, with fixed traffic across 20 pairs • User traffic • disk rebuild traffic (incast flows) (chart residue: 3.43Gbps / 1.17Gbps)

Slide 10

Result (Cont.) • Even heavier incast • DCQCN sustains far more traffic (x16) • Trying various settings • Misconfiguration case: when ECN does *not* trigger before PFC • With and without PFC • Without slow start, congestion drives bandwidth to zero.. PFC is indispensable (the last line of defense) • go-back-N loss recovery scheme (on ConnectX-3 Pro) • adequate while the loss rate is small, but.. • latency: the 95th-percentile queue length • 76.6KB with DCQCN • 162.9KB with DCTCP • Friendliness with other TCP traffic • Not a goal of DCQCN, but adding an 802.1p priority tag can isolate it (so, use VLANs...?)

Slide 11

Related work • iWarp • A TCP stack in hardware is complex • Has only just reached 40G, while RoCEv2 ships at 100G • TCP-Bolt • Relaxes PFC's restrictions • Software implementation, hence CPU overhead • TIMELY • Interesting (promising) • Userland stack • CPU overhead, even though we want that CPU for VMs..

Slide 12

Conclusion and future work • The battle between responsiveness and stability in the DC • DCQCN • prevents packet loss quickly • relieves end-to-end congestion (comparatively slowly) • should extend to 100 and even 400Gbps

Slide 13

EoP