#2 "Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter"

cafenero_777

June 08, 2023

Transcript

  1. Research Paper Introduction #2: Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter
    @cafenero_777 2019/06/11
  2. $ which
    • Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter
    • Glenn Judd, Morgan Stanley
    • NSDI'15
    • https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd
  3. Agenda
    • Overview and why I read this paper
    • Introduction
    • Setting
    • TCP challenges in the datacenter
    • Delayed ACKs
    • Reducing RTOmin
    • DCTCP
    • DCTCP Deployment Challenges
    • DCTCP+ Performance
    • Receive Buffer Tuning
    • Related Work
    • Conclusion
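  4. Overview and why I read this paper
    • Overview
    • Running ordinary TCP in a datacenter can give poor performance.
    • The paper deploys DCTCP in a large-scale datacenter and discusses the performance gains, the failures, and so on.
    • Why I wanted to read it
    • It looked like a TCP stack feature suited to the Clos generation.
    • Even when leaf/spine provides wide bandwidth, the incast problem remains unsolved.
    • It may be possible to solve the HW switch port buffer capacity problem from the server side (in software).
    • Ideal if it can be combined with stacks such as BBR/MPTCP?
    • A TCP stack is easy to change (especially on VMs).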
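  5. Introduction
    • Large-scale datacenters (warehouse-scale: search/SNS/cloud, etc.)
    • TCP/IP was not designed for the DC in the first place, and its performance falls short there.
    • It does not anticipate latencies of a few msec or large numbers of flows.
    • Internet traffic: propagation delay >>> queueing delay
    • Datacenter traffic: propagation delay <<< queueing delay
    • Use UDP? Rewrite the applications?
    • With DCTCP, performance can be improved while staying on TCP.
    • Covered: a production case study, receive buffer tuning, comparison with existing TCP, and comparison of reduced RTO_min.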
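The two inequalities above can be checked with rough numbers. The sketch below is a back-of-the-envelope calculation, not something taken from the paper: the 100 m and 5,000 km distances, the 1 MB of queued data, and the signal speed in fiber are illustrative assumptions. Only the 10 Gbps link speed appears in the deck.

```python
# Back-of-the-envelope comparison of propagation and queueing delay.
# All distances and the queue size are illustrative assumptions.
LINK_BPS = 10e9          # 10 Gbps link speed mentioned in the deck
SPEED_IN_FIBER = 2e8     # m/s, assumed (roughly two thirds of c)


def propagation_delay_s(distance_m: float) -> float:
    """Time for a signal to travel distance_m through fiber."""
    return distance_m / SPEED_IN_FIBER


def queueing_delay_s(queued_bytes: int, link_bps: float = LINK_BPS) -> float:
    """Time a packet waits behind queued_bytes already sitting in a port buffer."""
    return queued_bytes * 8 / link_bps


dc_prop = propagation_delay_s(100)           # assumed 100 m inside a datacenter
wan_prop = propagation_delay_s(5_000_000)    # assumed 5,000 km Internet path
queue = queueing_delay_s(1_000_000)          # assumed 1 MB waiting in the buffer

if __name__ == "__main__":
    print(f"DC propagation : {dc_prop * 1e6:.1f} us")
    print(f"WAN propagation: {wan_prop * 1e3:.1f} ms")
    print(f"Queueing (1 MB): {queue * 1e3:.1f} ms")
```

Under these assumptions the same 0.8 ms of queueing is negligible next to 25 ms of wide-area propagation, but is more than a thousand times the 0.5 us of propagation inside the datacenter.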
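  6. Setting
    • Not an Internet service
    • (Replica exchange?) Monte Carlo simulation
    • Data analysis
    • Data access
    • KVS -> bulk reads/writes easily trigger large-scale incast
    • Distributed data storage -> simultaneous access to the same block
    • Many different applications run inside the DC, so loose coupling is essential.
    • 10 Gbps, 9k MTU, iPerf; the baseline for comparison is CUBIC.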
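  7. Setting (Cont. 2/2)
    • What traffic actually flows?
    • Measured for 2 minutes. A flow is delimited by the TCP PSH flag, not by the 3-way handshake.
    • Simple requests in a small number of flows use most of the bandwidth.
    • KVS accounts for most of the bandwidth.
    • There are too few flows for conventional congestion control to cope.
    • Especially the incast problem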
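Counting flows by PSH flag rather than by handshake can be sketched as follows. This is one reading of the slide, not the paper's measurement tooling: the packet tuple layout and the `split_flows` name are hypothetical, and it assumes each application message ends with a segment that has PSH set.

```python
from collections import defaultdict


def split_flows(packets):
    """Group packets into logical flows that end on a PSH-flagged segment.

    packets: iterable of (conn_id, payload_bytes, psh_flag) in arrival order.
    Returns a list of (conn_id, total_bytes), one entry per completed flow.
    Bytes still pending when the input ends are not reported.
    """
    pending = defaultdict(int)   # bytes seen since the last PSH, per connection
    flows = []
    for conn_id, payload_bytes, psh_flag in packets:
        pending[conn_id] += payload_bytes
        if psh_flag:
            flows.append((conn_id, pending.pop(conn_id)))
    return flows


if __name__ == "__main__":
    trace = [("a", 1000, False), ("a", 500, True), ("b", 200, True), ("a", 300, True)]
    print(split_flows(trace))
```

A long-lived connection that never repeats its handshake still yields one flow per message this way, which is why the handshake is a poor delimiter for persistent connections.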
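  8. TCP challenges in the datacenter (2/3)
    • TCP incast
    • One receiver receives from N senders within a short amount of time.
    • A very common communication pattern inside the DC, but hard to handle.
    • How does it look from the application?
    • Timeouts, retransmissions, reduced throughput, increased latency, jitter
    • Noisy neighbors
    • Working around it by changing the application implementation is undesirable.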
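The N-to-1 pattern can be illustrated with a deliberately crude model. This is a toy sketch under stated assumptions, not the paper's analysis: all bursts are assumed to arrive at the same instant at a drop-tail port, the queue is assumed not to drain during the burst, and the function name and parameters are hypothetical.

```python
def incast_drops(senders: int, burst_packets: int, buffer_packets: int) -> int:
    """Packets dropped when every sender's burst hits one drop-tail port at once.

    Simplification: nothing drains from the queue while the bursts arrive.
    """
    arriving = senders * burst_packets
    return max(0, arriving - buffer_packets)


if __name__ == "__main__":
    for n in (4, 10, 40):
        print(n, "senders ->", incast_drops(n, burst_packets=10, buffer_packets=100), "drops")
```

The point of the toy is only that drops depend on the product of sender count and burst size, so adding senders overwhelms a fixed port buffer regardless of link speed.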
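  9. Delayed ACKs
    • Countermeasures
    • Turn off delayed ACK (delay = 0 ms)
    • More packets, so the load on the end servers increases.
    • Shorten the timeout
    • How short is acceptable?
    • Set to 1 ms. At 1 ms or above, the load does not change.
    • Figure: receiver's IRQ load (2 senders to 1 receiver)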
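  10. Reducing RTOmin
    • RTO: Retransmission Timeout
    • Incast-induced TCP timeouts
    • They affect application-layer task tail latency. Example: Fig. 9
    • Example with RTOmin reduced to 5 ms: Fig. 10
    • This can mitigate the RTO problem but cannot prevent it.
    • Packet loss actually increases.
    • Delayed-ACK and RTOmin changes alone are not a fundamental fix.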
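Why the RTOmin floor matters so much in a datacenter can be seen from the standard RTO estimator. The sketch below follows the smoothing rules of RFC 6298 with clock granularity omitted; the 100 us RTT samples are an assumption, and the 200 ms floor is the value commonly cited as the Linux default, also treated as an assumption here. Only the 5 ms figure comes from the slide.

```python
def rto_estimator(rtt_samples_s, rto_min_s, alpha=1 / 8, beta=1 / 4, k=4):
    """Yield the RTO after each RTT sample (RFC 6298 style, granularity omitted)."""
    srtt = rttvar = None
    for r in rtt_samples_s:
        if srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2
            srtt, rttvar = r, r / 2
        else:
            # RTTVAR is updated before SRTT, using the previous SRTT
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
        yield max(rto_min_s, srtt + k * rttvar)


if __name__ == "__main__":
    samples = [100e-6] * 5                      # assumed 100 us datacenter RTTs
    print(list(rto_estimator(samples, 0.200))[-1])   # assumed 200 ms floor
    print(list(rto_estimator(samples, 0.005))[-1])   # 5 ms floor from the slide
```

With RTTs of this size the computed value never exceeds 300 us, so the floor alone decides how long a sender stalls after an incast loss. Lowering the floor shortens the stall but does nothing about the loss itself, which matches the "mitigate, not prevent" point above.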
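  11. DCTCP Deployment Challenges
    • Our goals
    • Reduce TCP timeouts as far as possible, reduce latency, and reduce application-dedicated networks.
    • Reduce TCP noisy neighbors.
    • DCTCP looks able to deliver this.
    • No separate HW/SW is needed.
    • https://github.com/myasuda/DCTCP-Linux
    • Datacenter network (its characteristics differ from those of the Internet)
    • Many identical network configurations. Management is in one place, so there is no need to worry about backward compatibility.
    • Reasons DCTCP cannot be rolled out all at once, even though we would like to
    • Cases where it harms existing TCP, or where DCTCP cannot be supported (e.g. file servers)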
  12. DCTCP: 1. Coexistence with TCP
    • They cannot coexist in the first place: Fig. 11. This defeats the goal.
    • ECN (DCTCP) traffic ends up being favored and drives out TCP.
    • This is because TCP's cwnd becomes small.
    • To let TCP and DCTCP coexist, DCTCP is marked with DSCP.
    • DCTCP <- managed with AQM queueing
    • TCP <- managed with drop-tail queueing
    • References: https://www.nic.ad.jp/ja/materials/iw/1999/proceedings/C03.PDF https://milestone-of-se.nesuke.com/nw-basic/ip/ip-format/
    • Figure (excerpt of the IP header): bits 8-13 DSCP (DiffServ Code Point), bit 14 ECT, bit 15 CE
    • ECN: Explicit Congestion Notification
    • ECT: ECN-Capable Transport. Bit indicating that the L4 protocol supports ECN.
    • CE: Congestion Experienced. Bit indicating that congestion is occurring.
    • AQM: Active Queue Management
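The header byte in the figure can be split into its DSCP and ECN parts with two bit operations. This is a minimal sketch: the function names are hypothetical, the two-bit ECN codepoint names follow RFC 3168 (which replaced the separate ECT and CE bits shown on the slide), and the DSCP value 10 used for a DCTCP class is an arbitrary assumption, not a value from the paper.

```python
# ECN codepoints as defined in RFC 3168 (low two bits of the TOS / Traffic Class byte)
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos: int):
    """Split the 8-bit TOS / Traffic Class byte into (DSCP, ECN codepoint name)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    return tos >> 2, ECN_NAMES[tos & 0b11]


def make_tos(dscp: int, ecn_bits: int) -> int:
    """Build the byte from a 6-bit DSCP and a 2-bit ECN codepoint."""
    if not 0 <= dscp <= 0x3F or not 0 <= ecn_bits <= 0b11:
        raise ValueError("DSCP is 6 bits, ECN is 2 bits")
    return (dscp << 2) | ecn_bits


if __name__ == "__main__":
    DCTCP_DSCP = 10                      # hypothetical class chosen for DCTCP traffic
    tos = make_tos(DCTCP_DSCP, 0b10)     # ECT(0): sender declares ECN capability
    print(tos, split_tos(tos))
    print(split_tos(tos | 0b11))         # a switch sets CE by turning both bits on
```

A switch that classifies on the upper six bits can steer the DSCP-marked class into an AQM queue and leave everything else in a drop-tail queue, which is the separation described above.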
  13. DCTCP: 3. Non-technical challenges
    • Reduction in coupling
    • Timing
    • Arrange the required equipment in advance.
    • With conventional TCP and non-ECN-compliant switches, existing applications are not affected.
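  14. DCTCP: 4. Connection Establishment
    • Under high load, SYN and SYN-ACK packets (which carry no ECN) are dropped, so connections cannot be established.
    • Stanford implementation (https://github.com/myasuda/DCTCP-Linux)
    • RFC 3168: "A host MUST NOT set ECT on SYN or SYN-ACK packets"
    • The rule exists for security reasons: an ECN SYN DoS would be a problem.
    • Inside a DC this is not a concern.
    • Changed so that ECN can also be used on SYN and SYN-ACK (DCTCP+).
    • Figure: probability of establishing an ordinary TCP connection while DCTCP connections are active.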
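The reason SYN packets suffer can be sketched as the decision an ECN-enabled queue makes per packet. This is a simplified model, not a switch vendor's algorithm: it assumes a single fixed threshold instead of probabilistic RED, and the function name, return strings, and numbers are hypothetical.

```python
def enqueue_decision(queue_len: int, threshold: int, ect: bool) -> str:
    """Decide what an ECN-enabled queue does with an arriving packet.

    Below the threshold everything is forwarded. At or above it, ECN-capable
    packets are marked CE and forwarded, while non-ECT packets are dropped.
    """
    if queue_len < threshold:
        return "forward"
    return "mark_ce" if ect else "drop"


if __name__ == "__main__":
    K = 65  # hypothetical marking threshold in packets
    print("DCTCP data, ECT set  :", enqueue_decision(80, K, ect=True))
    print("plain SYN, no ECT    :", enqueue_decision(80, K, ect=False))
    print("SYN with ECT (DCTCP+):", enqueue_decision(80, K, ect=True))
```

While DCTCP data holds the queue near the threshold, a SYN sent without ECT keeps falling on the drop side of this branch. Setting ECT on SYN and SYN-ACK moves it to the mark side.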
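  15. DCTCP+ Performance: Incast
    • Incast Throughput and Fairness with Buffer Tuning Active
    • 19 senders -> 1 receiver
    • Receive buffer is auto-tuned.
    • Results
    • DCTCP: fast and reliable
    • TCP: throughput is unstable because of loss and weak receive buffer tuning.
    • With DCTCP, packet latency is two orders of magnitude smaller.
    • This also helps the application layer.
    • Applications do not need logic for coping with unpredictable delays.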
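  16. DCTCP+ Performance: Scale (1/2)
    • Scale test
    • 1 receiver from 100-500 senders
    • Results
    • Even with 500 senders, the bandwidth is fully used and shared equally.
    • RTT is relatively long.
    • The queue overflows at 300 senders, and loss occurs at 400.
    • At this scale it cannot be said to be better than ordinary TCP.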
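  17. DCTCP+ Performance: Scale (2/2)
    • Why?
    • Because DCTCP does not back off aggressively to prevent congestion.
    • Specifically, see the formulas on the right of the slide.
    • They explain the behavior seen in the measurements.
    • It could scale if the cwnd size were reduced (in the calculation).
    • Fewer packets would be in flight, so latency would probably increase.
    • 600 servers ran in production for 1.5 years. The application layer never hit a DCTCP problem.
    • // For the 500-sender case, 25G would probably be fine.
    • Formulas on the slide (not captured in this transcript): in principle; in the actual code; divided by time; rate computed from the measured values.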
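Since the formulas on this slide did not survive extraction, the sketch below falls back on the window rule from the original DCTCP paper: alpha <- (1 - g) * alpha + g * F, then cwnd <- cwnd * (1 - alpha / 2). It is not a reconstruction of the slide. The gain g = 1/16, the two-packet cwnd floor, and the 9000-byte segment size (taken from the 9k MTU) are assumptions, and window growth is left out.

```python
def dctcp_update(cwnd, alpha, marked_fraction, g=1 / 16, min_cwnd=2.0):
    """One per-window DCTCP reaction to ECN marks (growth phase omitted).

    marked_fraction is F, the share of ACKed packets that carried a mark.
    """
    alpha = (1 - g) * alpha + g * marked_fraction
    if marked_fraction > 0:
        # Reduce in proportion to the estimated congestion, never below the floor
        cwnd = max(min_cwnd, cwnd * (1 - alpha / 2))
    return cwnd, alpha


def min_inflight_bytes(senders, min_cwnd=2, mss=9000):
    """Data that stays in flight even when every sender sits at the cwnd floor."""
    return senders * min_cwnd * mss


if __name__ == "__main__":
    cwnd, alpha = 10.0, 0.0
    for _ in range(200):                 # every packet marked, window after window
        cwnd, alpha = dctcp_update(cwnd, alpha, marked_fraction=1.0)
    print(cwnd, round(alpha, 3))
    for n in (100, 300, 500):
        print(n, "senders ->", min_inflight_bytes(n), "bytes at the floor")
```

Under these assumptions, 500 senders pinned at the floor still keep 9 MB in flight toward one port. That is consistent with the observation above that queues overflow at a few hundred senders and that only a smaller effective cwnd would scale further.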
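  18. Receive Buffer Tuning
    • Throughput comparison
    • TCP's instability is amplified further with dynamic RBT. Convergence takes several seconds.
    • A situation where queueing delay is four orders of magnitude smaller than propagation delay
    • The tuning is not designed with this in mind.
    • Difference between dynamic and static RBT
    • A static setting ends up depending on the application and the network. It is hard to tune in the first place.
    • Static loses loose coupling and is not general-purpose.
    • Table:
                  RBT dynamic     RBT static
        TCP       unstable (×)    very stable (◎)
        DCTCP+    stable (○)      stable (○)
    • Dynamic is better.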
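Why a static receive buffer couples the setting to the application and the network can be seen from the bandwidth-delay product. The sketch below is illustrative only: the RTT values are assumptions, the helper names are hypothetical, and the socket call shows the standard `SO_RCVBUF` option rather than anything specific to the paper. On Linux, setting this option is generally understood to switch off auto-tuning for that socket.

```python
import socket


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    """Bandwidth-delay product: bytes needed in flight to keep the pipe full."""
    return round(bandwidth_bps * rtt_s / 8)


def open_with_static_rcvbuf(size_bytes: int) -> socket.socket:
    """Create a TCP socket with a fixed receive buffer request."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return s


if __name__ == "__main__":
    quiet = bdp_bytes(10e9, 100e-6)   # assumed 100 us RTT on an idle 10 Gbps path
    busy = bdp_bytes(10e9, 10e-3)     # assumed 10 ms RTT once queues have built up
    print(quiet, busy)
    sock = open_with_static_rcvbuf(quiet)
    print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    sock.close()
```

The right static value moves by a factor of 100 between the two assumed RTTs (125,000 bytes versus 12.5 MB), so any fixed choice is tied to one workload and one network state. That is the loss of loose coupling noted above.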
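  19. Conclusion
    • TCP is very good, but running modern applications in a modern DC brings many problems.
    • KVS, distributed storage
    • Incast, Delayed ACKs, Receive Buffer Tuning
    • Unstable throughput, high latency, timeouts that reach the application
    • Using DCTCP+ gives stable, low-latency behavior.
    • SYN, SYN-ACK + ECN
    • No changes needed on the application side
    • It can coexist with existing TCP (with suitable tuning on the network equipment).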
  20. Related Work
    • Incast first discussed by Nagle et al.
    • Reducing RTOmin: Phanishayee et al.
    • TCP performance from an ISP perspective: Yu et al.
    • RED/ECN switches dropping non-ECT packets (not DCTCP-specific): Wu et al.
    • Automatic TCP buffer tuning: Semke et al.
    • pFabric: a clean-slate approach to DC communication. Promising, but a long way off.
  21. EoP