Research Paper Introduction #20 Breaking the Transience-Equilibrium Nexus: A New Approach to Datacenter Packet Transport

cafenero_777

June 10, 2021

Transcript

  1. Research Paper Introduction #20 "Breaking the Transience-Equilibrium Nexus: A New Approach to Datacenter Packet Transport" (#70 overall) @cafenero_777 2021/04/26
  2. Agenda
    • Target paper
    • Overview and why I chose to read it
    1. Introduction
    2. Transience–Equilibrium Tension
    3. On-Ramp Design
    4. Implementation
    5. Evaluation
    6. Evaluation in Facebook’s Network
    7. On-Ramp Deep Dive
    8. Related Work
    9. Conclusion and Future Work
  3. Target paper
    • Breaking the Transience-Equilibrium Nexus: A New Approach to Datacenter Packet Transport
    • Shiyu Liu (1), Ahmad Ghalayini (1), Mohammad Alizadeh (2), Balaji Prabhakar (1), Mendel Rosenblum (1), and Anirudh Sivaraman (3)
    • (1) Stanford University, (2) MIT, (3) NYU
    • NSDI 2021
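  4. Overview and why I chose to read it
    • Overview
      • Recent DCTCP/RDMA approaches are too heavyweight, which makes them hard to adopt
      • Next-generation TCP does not solve the incast problem well
      • In equilibrium, the existing congestion control is used as is
      • For transients, the On-Ramp algorithm is developed and used
      • On GCP (VMs) and elsewhere, 2.8-5.6x better than CUBIC and BBR, and the incast problem also improves
    • Why I chose to read it
      • A new contender among next-generation TCPs?
      • It is effective just by applying it to VMs in a public cloud, i.e. inside a DC (no physical switch support or extra HW needed?)
      • The writing is razor-sharp from the abstract onward (my subjective view)
    • (Figure annotation: this part is the "On-Ramp")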
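  5. 1. Introduction
    • Prior DC TCP work aims for high throughput and low latency
      • Use of congestion signals: ECN, queue size, link information
      • Tuning on the application side and the kernel side (packet scheduling), etc.
    • "Can't this be done a little more simply?"
      • Simple is best.
      • Depending on network features restricts where it can be used (e.g. public cloud customers)
    • Problem framing: re-examine stable convergence under large swings in the number of flows, and the incast problem
    • Developed the On-Ramp (OR) protocol
      • Think of it as an add-on to existing stacks (CUBIC, BBR, DCQCN, TIMELY, DCTCP, HPCC)
      • Works end to end (intermediate hosts are not touched)
      • Evaluated on VMs, bare metal, and ns-3: on GCP the 99th-percentile request completion time (RCT) improves 2.8x over CUBIC and 4.1x over BBR, with about 3% more CPU load
    • https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd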
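  6. 2. Transience–Equilibrium Tension (1/2)
    • Congestion control algorithms face a trade-off: the "Transience-Equilibrium Tension"
      • With a large K, the response in transients is fast, but performance in equilibrium degrades
      • With a small K, the response in transients is slow, but performance in equilibrium improves
      • The same holds for existing next-generation TCPs (TIMELY, DCQCN)
    • Handle transients (i.e. incast congestion) and equilibrium separately? What would the branching condition be?
    • On-Ramp algorithm
      • When the one-way delay (OWD) exceeds a threshold T, the sender stops sending (forced congestion relief)
      • This nicely reduces the sensitivity to K
    • $W_{n+1} = f(W_n, \text{congestion signals}, K)$, where $W$ is the window size and $K$ is the gain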
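The gain trade-off on this slide can be illustrated with a toy model. The sketch below is not the paper's model or TIMELY/DCQCN: it assumes a hypothetical update rule W_{n+1} = W_n + K * (signal_n - W_n) with an alternating +1/-1 measurement noise, and the helper names (`simulate`, `steps_to_reach`, `ripple`) are made up for illustration.

```python
def simulate(gain, steps=200, target=10.0):
    """Toy update W_{n+1} = W_n + K * (signal_n - W_n); an assumed rule, not the paper's."""
    w = 0.0
    trajectory = [w]
    for n in range(steps):
        # Assumed measurement noise: alternates between +1 and -1.
        noise = 1.0 if n % 2 == 0 else -1.0
        signal = target + noise
        w = w + gain * (signal - w)
        trajectory.append(w)
    return trajectory


def steps_to_reach(trajectory, target=10.0, tolerance=1.5):
    """Index of the first sample within `tolerance` of the target (transient speed)."""
    for i, w in enumerate(trajectory):
        if abs(w - target) <= tolerance:
            return i
    return None


def ripple(trajectory, tail=20):
    """Peak-to-peak swing over the last `tail` samples (steady-state quality)."""
    last = trajectory[-tail:]
    return max(last) - min(last)


if __name__ == "__main__":
    for gain in (0.8, 0.1):
        t = simulate(gain)
        print(f"K={gain}: steps to target={steps_to_reach(t)}, ripple={ripple(t):.3f}")
```

In this toy setting the larger gain reaches the target in fewer steps but swings more in steady state, which is the tension the slide describes.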
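  7. 2. Transience–Equilibrium Tension (2/2)
    • Add On-Ramp to TIMELY and observe the behavior
      • Case where the load changes from 2 servers -> 1 server to 12 servers -> 1 server (graphs)
      • The congestion control algorithm itself is left unchanged
      • Long-term throughput and delay are left to the existing TCP congestion control
      • Transients are handled by On-Ramp
    • Graph annotations (β: a TIMELY parameter)
      • 😀: transience 😇: equilibrium (β=0.8 is the recommended value)
      • 😇: transience 😀: equilibrium
      • 😀: transience 😀: equilibrium
    • Q. Why not do everything inside a delay-based TCP stack?
    • A. The timescales differ, so it is hard to do. Do not mix transient congestion management (incast, etc.) with long-term congestion management (bandwidth overload, etc.).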
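  8. 3. On-Ramp Design (1/2)
    • Measure OWD end to end, and pause sending when it exceeds the threshold T
    • In effect, queuing-delay control is added on top of CC (a safety device)
    • Naive implementation
      • Sender-side data structure: a queue per 5-tuple, plus the next send time
      • The receiver returns a UDP ACK containing the 5-tuple, seq#, and timestamp (TS)
      • If O exceeds T, "pause" sending for a duration of (O - T)
      • With ACK delay, the sender may end up "waiting" longer than necessary
      • It may even wait longer than the switch buffer needs to drain
    • O: one-way delay, T: threshold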
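The naive rule above (pause for O - T when O exceeds T) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' kernel code: the class `FlowState`, its field names, and the use of microseconds are hypothetical, and keeping the later of two deadlines when pauses overlap is a choice made here for the sketch.

```python
from dataclasses import dataclass


@dataclass
class FlowState:
    """Hypothetical per-flow sender state (one instance per 5-tuple)."""
    threshold_us: float        # T: OWD threshold
    next_send_us: float = 0.0  # earliest time this flow may send again

    def on_or_ack(self, now_us: float, owd_us: float) -> float:
        """Naive rule: if O > T, pause for (O - T). Returns the pause length."""
        pause_us = max(0.0, owd_us - self.threshold_us)
        if pause_us > 0.0:
            # Assumption: overlapping pauses keep the later deadline.
            self.next_send_us = max(self.next_send_us, now_us + pause_us)
        return pause_us

    def can_send(self, now_us: float) -> bool:
        """True once any pending pause has expired."""
        return now_us >= self.next_send_us


if __name__ == "__main__":
    flow = FlowState(threshold_us=50.0)
    print(flow.on_or_ack(now_us=1000.0, owd_us=80.0))  # pause length for O=80, T=50
    print(flow.can_send(1020.0), flow.can_send(1030.0))
```

Because the OWD in an ACK describes a packet sent earlier, a rule like this can keep pausing on stale information, which is the over-waiting problem the slide notes.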
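  9. 3. On-Ramp Design (2/2)
    • Final version: takes ACK delay and sender pauses into account
      • Corrected using an exponentially weighted moving average (EWMA)
      • $g$: gain of EWMA $= 1/16$
      • EWMA inputs: the ratio from the most recent wait (reflects the recent trend) and the EWMA value from before it
      • $\beta_m = (O_B - O_G) / P_{BG}$ if $P_{BG} > 0$, where $P_{BG}$ is the pause time for the BG period (the total pause time of the most recent period)
      • The final version is more stable
    • Accurate OWD measurement is essential
      • Huygens: software-based distributed clock synchronization
      • ~100ns, 2-3us@99%ile
      • 20-40ns@99%ile w/ HW-TS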
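The measurement and smoothing step on this slide can be sketched as below. The slide gives only the formula for β_m and the gain g = 1/16; the update form β <- (1 - g) * β + g * β_m is the textbook EWMA and is an assumption here, as are the function names and the choice to skip the update when P_BG is not positive. How the smoothed β then feeds into the pause length is not stated in the transcript, so it is left out.

```python
def measured_beta(o_b_us, o_g_us, pause_bg_us):
    """beta_m = (O_B - O_G) / P_BG, defined only when P_BG > 0 (per the slide)."""
    if pause_bg_us <= 0:
        return None  # no pause in the BG period, so no measurement
    return (o_b_us - o_g_us) / pause_bg_us


def update_beta(beta, beta_m, g=1 / 16):
    """Standard EWMA with gain g (assumed form); unchanged if there is no sample."""
    if beta_m is None:
        return beta
    return (1 - g) * beta + g * beta_m


if __name__ == "__main__":
    # Illustrative numbers only (microseconds); not taken from the paper.
    sample = measured_beta(o_b_us=120.0, o_g_us=100.0, pause_bg_us=40.0)
    print(sample, update_beta(1.0, sample))
```

With g = 1/16, a single sample moves the estimate by only one sixteenth of the gap, so one noisy measurement cannot swing the pause behavior much.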
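  10. 4. Implementation
    • (Figure: 2 senders -> 1 receiver)
    • Implemented as a kernel module
      • OR-ACKs are newly generated UDP packets carrying timestamps; packets of the existing traffic are not modified
      • Packets are held in the qdisc (FQ); the controller decides when to send based on timestamps
      • Timestamps taken in the driver are more precise than those taken in the network stack
    • No burst is sent after a pause is lifted
      • CUBIC/DCTCP (window-based CC): limited by the TCP/qdisc queue size; queues are released round-robin
      • BBR/DCQCN/TIMELY (rate-based CC): sending resumes at a reduced rate
    • CPU load
      • Caused by OR-ACKs: one OR-ACK is sent per 10 MTU-sized packets (about 0.2% in bandwidth terms)
      • Offloading this to a SmartNIC would remove the CPU load
      • The load from clock synchronization (Huygens) is negligible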
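  11. 5. Evaluation
    • Metrics
      • Application view: RCT for incast, flow completion time (FCT) for non-incast
        • incast (5KB, 500KB), non-incast (Web search, Hadoop, gRPC)
      • Network view: packet drops, TCP timeouts
    • 1. VM environment (GCP)
      • 4 vCPUs, 10Gbps * 50 VMs
      • Timestamp standard deviation: 200ns
    • 2. Bare-metal (BM) environment (CloudLab)
      • 8 cores, 10G * 100 machines (in Clos)
      • Timestamp standard deviation: 100ns
    • 3. ns-3 (simulation)
      • 100Gbps * 320 servers (in a 3-layer Fat-Tree)
      • Random delay is inserted so that the timestamp standard deviation is equivalent to 200ns
    • Next page: Facebook Hadoop
      • Same trend with BBR
      • Same trend on BM
      • Same trend when the background traffic load and type are changed
    • Results tend to improve substantially (HPCC already performs well, so it does not improve)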
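  12. 6. Evaluation in Facebook’s Network
    • Goal: keep storage traffic protected while MapReduce traffic is running
    • 15 compute machines (25Gbps) and 12 storage machines (100Gbps) in a Clos network
    • Storage traffic peaks at 300Gbps (25Gbps*12), so incast occurs (fig. 17)
    • Improves to about the same level as DCTCP (which requires ECN)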
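  13. 7. On-Ramp Deep Dive
    • RCT changes with the length of the OWD
    • RCT changes with the GSO queue size
      • 16KB performs better than 64KB, but sender-side CPU load goes up
    • Coexistence of OR and non-OR
      • Coexistence could dilute the effect
      • Judging from the GCP results, it does not get much worse
      • Evaluated in the BM environment with 50 machines and 50 machines
      • Suggests that OR is suited to incremental deployment
      • Deploying OR (as a side effect) also improves non-OR performance!
    • Threshold T and EWMA gain g
      • RCT and FCT are not overly sensitive to them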
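  14. 9. Conclusion and Future Work
    • Improving the transport layer is important, but
      • it is hard to do inside the protocol stack
      • approaches such as ECN depend on network features
    • The "tension" between equilibrium and transients is the root cause
      • Transient handling is carved out -> On-Ramp
      • Equilibrium handling is delegated to the existing protocol stack
    • About 2.8-5.6x better performance in DC use cases, and incast latency also improves
    • Migration to OR looks feasible while coexisting with existing TCP stacks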
  15. EoP