Slide 1

Slide 1 text

Research Paper Introduction #2 Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter @cafenero_777 2019/06/11

Slide 2

Slide 2 text

$ which • Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter • Glenn Judd, Morgan Stanley • NSDI’15 • https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd

Slide 3

Slide 3 text

Agenda • Overview and why I read this paper • Introduction • Setting • TCP challenges in the datacenter • Delayed ACKs • Reducing RTOmin • DCTCP • DCTCP Deployment Challenges • DCTCP+ Performance • Receive Buffer Tuning • Related Work • Conclusion

Slide 4

Slide 4 text

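Overview and why I read this paper • Overview • Running ordinary TCP in a datacenter can give poor performance • The paper deploys DCTCP in a large-scale production datacenter and discusses the performance gains, failures, and other findings • Why I read it • It looked like a TCP stack feature well suited to the Clos generation of networks • Even when Leaf/Spine provides ample bandwidth, the incast problem remains unsolved • It may solve the port buffer capacity problem of HW switches from the server side (in software) • Ideally it could be combined with stacks such as BBR/MPTCP? • The TCP stack is easy to change (especially on VMs)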

Slide 5

Slide 5 text

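Introduction • Large-scale datacenters (warehouse-scale: search/SNS/cloud, etc.) • TCP/IP was never designed for the DC, and its performance there is not good enough • It does not assume latencies of a few msec or large numbers of flows • Internet traffic: propagation delay >>> queueing delay • Datacenter traffic: propagation delay <<< queueing delay • Use UDP? Rewrite the applications? • With DCTCP, performance can improve while staying on TCP • Contents: a production deployment case study, receive buffer tuning, comparison with existing TCP, and comparison against reducing RTO_min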
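A rough back-of-the-envelope sketch of the propagation-versus-queueing comparison above. The 10 Gbps link speed matches the Setting slide; the cable length, WAN distance, and queued bytes are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope delays; every input below is an illustrative assumption.
SPEED_IN_FIBER_M_PER_S = 2.0e8   # roughly two thirds of the speed of light


def propagation_delay_s(distance_m: float) -> float:
    """Time for a signal to travel distance_m over fiber."""
    return distance_m / SPEED_IN_FIBER_M_PER_S


def queueing_delay_s(queued_bytes: float, link_bps: float) -> float:
    """Time to drain queued_bytes that are already waiting ahead of a packet."""
    return queued_bytes * 8 / link_bps


dc_prop = propagation_delay_s(100)           # assumed 100 m of cabling inside a DC
wan_prop = propagation_delay_s(5_000_000)    # assumed 5,000 km Internet path
queue = queueing_delay_s(1_000_000, 10e9)    # assumed 1 MB queued on a 10 Gbps port

print(f"DC propagation : {dc_prop * 1e6:.1f} us")
print(f"WAN propagation: {wan_prop * 1e3:.1f} ms")
print(f"Queueing (1 MB): {queue * 1e6:.1f} us")
```

Under these assumptions the same queue that is negligible next to a WAN path is about three orders of magnitude larger than the in-DC propagation delay.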

Slide 6

Slide 6 text

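Setting • Not an Internet service • (Replica exchange?) Monte Carlo simulation • Data analysis • Data access • KVS -> bulk r/w easily triggers large-scale incast • Distributed-Data-Storage -> simultaneous access to the same block • Many different applications run inside the DC, so loose coupling is essential • 10Gbps, 9k MTU, iPerf; the comparison baseline is CUBIC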

Slide 7

Slide 7 text

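Setting (Cont. 2/2) • What traffic actually flows? • Measured for 2 minutes; a flow is delimited by the TCP push flag rather than by the 3-way handshake (3WHS) • Simple requests from a small number of flows use most of the bandwidth • KVS accounts for most of the bandwidth • There are too few flows for conventional congestion control to cope • Especially the incast problem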

Slide 8

Slide 8 text

TCP challenges in the datacenter (1/3) • Stalls caused by delayed ACKs • The default timeout is 10~100 msec • Far too long for a modern DC! • < 1msec • Timeouts at the application level

Slide 9

Slide 9 text

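TCP challenges in the datacenter (2/3) • TCP incast • 1 receiver from N senders in a short amount of time • A very common communication pattern inside the DC, but hard to handle • How does it look from the application? • Timeouts, retransmissions, lower throughput, higher latency, jitter • Noisy neighbors • Working around it by changing the application implementation is not a good answer...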

Slide 10

Slide 10 text

TCP challenges in the datacenter (3/3) • Receive Buffer Tuning • The receive buffer size affects the following • TCP performance • Server RAM usage

Slide 11

Slide 11 text

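Delayed ACKs • Countermeasures • Disable delayed ACKs (delay=0ms) • More packets, so higher load on the end servers • Shorten the timeout • How short is acceptable? • Set to 1ms; at 1ms or longer the load does not change • Figure: Receiver’s IRQ load (2 senders to 1 receiver)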
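The slide describes a host-wide change to the delayed ACK timer. As a loosely related per-socket illustration, the sketch below requests immediate ACKs with the Linux `TCP_QUICKACK` socket option. This is an assumption about how one might experiment with the "delay=0ms" case, not the method used in the paper. The helper name `disable_delayed_ack` is hypothetical. Per the Linux tcp(7) manual page the flag is not permanent, so a real program would typically set it again after reads.

```python
import socket


def disable_delayed_ack(sock: socket.socket) -> bool:
    """Ask the kernel to ACK immediately on this socket.

    Returns True if the option was applied, False if the platform
    does not offer TCP_QUICKACK or rejects the call.
    """
    opt = getattr(socket, "TCP_QUICKACK", None)   # constant exists on Linux only
    if opt is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, 1)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("quick ACK requested:", disable_delayed_ack(s))
```

As the slide notes, ACKing every segment raises the packet rate and the interrupt load on the end servers, which is why the deck settles on a short timeout rather than no delay at all.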

Slide 12

Slide 12 text

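Reducing RTOmin • RTO: Retransmission Timeout • incast-induced TCP timeout • Affects application-layer task tail latency; real example: Fig. 9 • Example with RTOmin reduced to 5ms: Fig. 10 • Can mitigate the RTO problem but cannot prevent it • Packet loss actually increases • Delayed ACK and RTOmin tuning alone are not a fundamental fix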
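To see why RTOmin dominates in a datacenter, here is a minimal sketch of the RFC 6298 style RTO estimate, with the clock-granularity term omitted for brevity. The 100 us RTT is an assumed value. The 200 ms figure is the conventional Linux default minimum, and 5 ms is the reduced value mentioned on the slide.

```python
def rto_after_samples(rtt_samples_s, rto_min_s, alpha=1 / 8, beta=1 / 4):
    """RFC 6298-style RTO (granularity term omitted), clamped below by rto_min_s."""
    srtt = rttvar = None
    for r in rtt_samples_s:
        if srtt is None:
            # First measurement initializes both estimators.
            srtt, rttvar = r, r / 2
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
            srtt = (1 - alpha) * srtt + alpha * r
    return max(rto_min_s, srtt + 4 * rttvar)


RTT_S = 100e-6                  # assumed steady datacenter RTT
samples = [RTT_S] * 20

for rto_min in (0.200, 0.005):
    rto = rto_after_samples(samples, rto_min)
    print(f"RTO_min={rto_min * 1e3:.0f} ms -> RTO={rto * 1e3:.1f} ms "
          f"(about {rto / RTT_S:.0f} RTTs of idle time after a tail loss)")
```

With a sub-millisecond RTT the computed term is far below either floor, so the timeout is simply RTOmin. Lowering it shortens each stall but does nothing to stop the loss that triggers it, which matches the "mitigate, not prevent" point above.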

Slide 13

Slide 13 text

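DCTCP deployment Challenge • Our goals • Reduce TCP timeouts as far as possible, reduce latency, reduce application-dedicated networks • Reduce TCP noisy neighbors • DCTCP looks able to deliver this • No separate HW/SW required • https://github.com/myasuda/DCTCP-Linux • Datacenter networks (different characteristics from Internet networks) • Many copies of the same network design; managed in one place, with no need to worry about backward compatibility • Why DCTCP cannot be rolled out everywhere at once, even though we would like to • Cases where existing TCP is adversely affected, and cases where DCTCP cannot be supported (e.g. file servers)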

Slide 14

Slide 14 text

DCTCP: 1. Coexistence with TCP • Out of the box they cannot coexist: Fig. 11; this runs counter to the goals • ECN (DCTCP) traffic ends up being prioritized and drives out TCP • Because TCP's cwnd becomes small • To let TCP and DCTCP coexist, mark DCTCP traffic with DSCP • DCTCP <- managed with AQM queueing • TCP <- managed with drop-tail queueing • References: https://www.nic.ad.jp/ja/materials/iw/1999/proceedings/C03.PDF https://milestone-of-se.nesuke.com/nw-basic/ip/ip-format/ • Figure (excerpt of the IP header, bits 8-15): DSCP (DiffServ Code Point) in bits 8-13, ECT in bit 14, CE in bit 15 • ECN: Explicit Congestion Notification • ECT: ECN Capable Transport, the bit indicating that the L4 protocol supports ECN • CE: Congestion Experienced, the bit indicating that congestion is occurring • AQM: Active Queue Management
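The header layout in the figure can be made concrete with a small sketch that splits the second byte of the IPv4 header into its DSCP and ECN parts. The slide names the two low bits ECT and CE; RFC 3168 treats them as a two-bit codepoint, which is the convention used here. The DSCP value 10 is a hypothetical class chosen for illustration, not the marking used in the paper.

```python
# Two-bit ECN codepoints as defined in RFC 3168.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos: int) -> tuple:
    """Split the IPv4 TOS / traffic-class byte into (dscp, ecn)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    return tos >> 2, tos & 0b11


def make_tos(dscp: int, ecn: int) -> int:
    """Combine a 6-bit DSCP and a 2-bit ECN codepoint into one byte."""
    if not 0 <= dscp <= 0x3F or not 0 <= ecn <= 0b11:
        raise ValueError("dscp is 6 bits, ecn is 2 bits")
    return (dscp << 2) | ecn


DCTCP_DSCP = 10                              # hypothetical class for DCTCP traffic
sent = make_tos(DCTCP_DSCP, 0b10)            # sender marks the packet ECT(0)
congested = make_tos(DCTCP_DSCP, 0b11)       # a switch rewrites it to CE

for tos in (sent, congested):
    dscp, ecn = split_tos(tos)
    print(f"tos=0x{tos:02x} dscp={dscp} ecn={ECN_NAMES[ecn]}")
```

A switch can then classify on the DSCP value to send DCTCP traffic to an AQM queue and everything else to a drop-tail queue, as the bullets above describe.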

Slide 15

Slide 15 text

DCTCP: 2. Non-compliant switches • Only the ToR switches support ECN • Switches above the ToR are plain drop-tail

Slide 16

Slide 16 text

DCTCP: 3. Non-technical challenges • Reduction in coupling • Timing • Arrange the required equipment in advance • No impact on existing applications that use conventional TCP and non-ECN-compliant switches

Slide 17

Slide 17 text

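DCTCP: 4. Connection Establishment • Under high load, SYN and SYN-ACK packets (sent without ECN) are dropped, so connections cannot be established • Stanford implementation (https://github.com/myasuda/DCTCP-Linux) • RFC 3168: “A host MUST NOT set ECT on SYN or SYN-ACK packets” • This was decided for security reasons: an ECN SYN DoS would be a problem • Inside a DC that is not a concern • Changed the implementation so that ECN is also used on SYN and SYN-ACK (DCTCP+) • Figure: measured probability of establishing an ordinary TCP connection while DCTCP connections are present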
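The connection-establishment failure can be illustrated with a toy queue model. It assumes a single-threshold AQM that marks ECT packets and drops non-ECT packets once the queue is at or above the threshold. The class names, threshold, and capacity are hypothetical, and real switch behavior is more involved.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    kind: str          # e.g. "SYN" or "DATA"
    ect: bool          # True if the IP header carries an ECT codepoint
    ce: bool = False   # set by the queue when congestion is signalled


class EcnQueue:
    """Toy AQM: at or above mark_threshold, mark ECT packets and drop the rest."""

    def __init__(self, mark_threshold: int, capacity: int):
        self.mark_threshold = mark_threshold
        self.capacity = capacity
        self.queue = []

    def enqueue(self, pkt: Packet) -> bool:
        depth = len(self.queue)
        if depth >= self.capacity:
            return False          # completely full: tail drop for everyone
        if depth >= self.mark_threshold:
            if not pkt.ect:
                return False      # non-ECT packet: congestion signalled by a drop
            pkt.ce = True         # ECT packet: congestion signalled by a mark
        self.queue.append(pkt)
        return True


q = EcnQueue(mark_threshold=5, capacity=50)
for _ in range(10):
    q.enqueue(Packet("DATA", ect=True))            # DCTCP data holds the queue above the threshold

plain_syn = q.enqueue(Packet("SYN", ect=False))    # SYN without ECT, as RFC 3168 requires
ect_syn = q.enqueue(Packet("SYN", ect=True))       # SYN with ECT, as in the DCTCP+ change

print("SYN without ECT accepted:", plain_syn)
print("SYN with ECT accepted:   ", ect_syn)
```

In this model ECT data traffic can keep the queue above the marking threshold indefinitely, so a SYN without ECT is dropped every time while a SYN with ECT is only marked.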

Slide 18

Slide 18 text

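DCTCP+ Performance: Incast • Incast Throughput and Fairness with Buffer Tuning Active • 19 senders -> 1 receiver • Receive buffer is auto-tuned • Results • DCTCP: fast and reliable • TCP: throughput is unstable because of loss and poor receive buffer tuning • With DCTCP, packet latency is two orders of magnitude lower • Benefits the application layer too • No need to build logic into applications for coping with unpredictable delays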

Slide 19

Slide 19 text

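DCTCP+ Performance: Scale (1/2) • Scale test • 1 receiver from 100-500 senders • Result • Even with 500 senders, the bandwidth is shared fairly and fully used • RTT is relatively long • The queue overflows at 300 senders, and packet loss appears at 400 senders • At this scale it cannot be called better than ordinary TCP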

Slide 20

Slide 20 text

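DCTCP+ Performance: Scale (2/2) • Why? • Because DCTCP does not fall back aggressively enough to prevent congestion • Specifically, see the equations on the slide • They explain the behavior seen in the measurements • Reducing the cwnd size (in the calculation) would let it scale • Fewer packets would be in flight, so latency would probably increase... • 600 servers ran in production for 1.5 years; the application layer never hit a DCTCP problem • // With 500 senders, 25G would probably be fine • Equation labels on the slide (the equations themselves are not in this transcript): in principle; the actual code; divide by time; compute the rate from measured values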
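The slide's equations are not captured in this transcript, so the sketch below is a reconstruction from the standard DCTCP definitions rather than the slide's own formulas. It assumes the usual update (alpha smoothed with gain g, cwnd reduced by alpha/2), a window floor of 2 segments as in common Linux DCTCP code, an 8960-byte segment for a 9k MTU, and the deck's 10 Gbps link.

```python
MSS_BYTES = 8960          # assumed payload per segment with a 9k MTU
MIN_CWND_SEGMENTS = 2     # assumed window floor
LINK_BPS = 10e9           # link speed from the Setting slide


def dctcp_update(cwnd: float, alpha: float, marked_fraction: float, g: float = 1 / 16):
    """One DCTCP reaction per window: smooth alpha, then cut cwnd by alpha/2, with a floor."""
    alpha = (1 - g) * alpha + g * marked_fraction
    cwnd = max(MIN_CWND_SEGMENTS, cwnd * (1 - alpha / 2))
    return cwnd, alpha


def floor_inflight_bytes(senders: int) -> int:
    """Bytes in flight when every sender sits at the minimum window."""
    return senders * MIN_CWND_SEGMENTS * MSS_BYTES


def rtt_needed_s(senders: int) -> float:
    """RTT at which the floor-sized windows exactly fill the link (rate = cwnd / RTT)."""
    return floor_inflight_bytes(senders) * 8 / LINK_BPS


cwnd, alpha = 10.0, 0.0
for _ in range(200):      # every packet marked, window after window
    cwnd, alpha = dctcp_update(cwnd, alpha, marked_fraction=1.0)
print(f"cwnd settles at {cwnd:.0f} segments, alpha={alpha:.3f}")

for n in (100, 300, 500):
    print(f"{n} senders: {floor_inflight_bytes(n)} bytes at the floor, "
          f"RTT needed {rtt_needed_s(n) * 1e3:.2f} ms")
```

Once every sender is pinned at the floor, more marking cannot reduce the offered load any further. Under these assumptions the excess must sit in the switch queue, which stretches the RTT into the millisecond range or overflows the buffer, consistent with the long RTT and loss reported on the previous slide.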

Slide 21

Slide 21 text

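Receive Buffer Tuning • Throughput comparison • TCP's instability is amplified further with dynamic RBT (receive buffer tuning); convergence takes several seconds • In the DC, propagation delay is four orders of magnitude smaller than queueing delay • TCP's buffer tuning was not designed with this in mind • Difference between dynamic and static RBT • A static setting ends up depending on the application and the network, and is hard to tune in the first place • Static loses loose coupling and is not general-purpose • Stability summary (RBT dynamic / RBT static): TCP: unstable (×) / very stable (◎); DCTCP+: stable (○) / stable (○) • Dynamic is the better choice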
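A small worked example of why buffer sizing is awkward when queueing dominates the RTT. It uses the usual bandwidth-delay product rule of thumb, with the deck's 10 Gbps link. Both RTT values are assumptions chosen for illustration.

```python
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_bps * rtt_s / 8


LINK_BPS = 10e9
idle_rtt_s = 100e-6       # assumed RTT with empty queues
loaded_rtt_s = 10e-3      # assumed RTT once switch queues have built up

idle = bdp_bytes(LINK_BPS, idle_rtt_s)
loaded = bdp_bytes(LINK_BPS, loaded_rtt_s)

print(f"buffer suggested at idle RTT  : {idle / 1e3:.0f} KB")
print(f"buffer suggested at loaded RTT: {loaded / 1e6:.1f} MB")
print(f"ratio: {loaded / idle:.0f}x")
```

A tuner that sizes the buffer from the measured RTT sees its target swing by the same factor as the queueing delay. This is one plausible reading of why dynamic tuning amplifies plain TCP's instability, while DCTCP+ keeps queues short and the estimate steady.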

Slide 22

Slide 22 text

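Conclusion • TCP is very good, but running modern applications in a modern DC causes many problems • KVS, distributed storage • incast, Delayed ACKs, Receive Buffer Tuning • Unstable throughput, high latency, timeouts that reach the application • DCTCP+ delivers stable, low-latency behavior • SYN, SYN-ACK + ECN • No application changes required • Can coexist with existing TCP (given suitable tuning on the network equipment)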

Slide 23

Slide 23 text

Related Work • Incast first discussed by Nagle et al • Reducing RTOmin by Phanishayee et al • TCP performance from an ISP perspective by Yu et al • RED/ECN dropping non-ECT packets on the switch (not DCTCP) by Wu et al • Automatic TCP buffer tuning by Semke et al • pFabric: clean-slate approach to DC communication; promising, but still a long way off

Slide 24

Slide 24 text

EoP