Slide 1

Slide 1 text

Research Paper Introduction #25
“Swift: Delay is Simple and Effective for Congestion Control in the Datacenter”
Cumulative #78
@cafenero_777
2021/07/29

Slide 2

Slide 2 text

Agenda
• Target paper
• Overview and why I read it
1. INTRODUCTION
2. MOTIVATION
3. SWIFT DESIGN & IMPLEMENTATION
4. TAKEAWAYS FROM PRODUCTION
5. EXPERIMENTAL RESULTS
6. RELATED WORK
7. CONCLUSION AND FUTURE DIRECTIONS

Slide 3

Slide 3 text

Target paper
• Swift: Delay is Simple and Effective for Congestion Control in the Datacenter
• Gautam Kumar, Nandita Dukkipati, Keon Jang (MPI-SWS)∗, Hassan M. G. Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, David Wetherall, and Amin Vahdat
• Google LLC
• ACM SIGCOMM ’20, “Awarded best paper”
• https://dl.acm.org/doi/abs/10.1145/3387514.3406591

Slide 4

Slide 4 text

Overview and why I read it
• Overview
  • Swift congestion control reduces latency
  • <50us tail latency @ 100Gbps
  • 10x lower latency than DCTCP, and very robust to incast
• Why I read it, and impressions
  • It was cited in ON-RAMP
  • It was built for datacenter use
  • Best Paper @ SIGCOMM 2020
https://olympic-placard-generater.herokuapp.com

Slide 5

Slide 5 text

Too many things are called “Swift”!
• The name suggests something quick and fast?
• It sounds good for just about anything
https://en.wikipedia.org/wiki/Swift

Slide 6

Slide 6 text

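1. Introduction
• We live in the great age of "distributed systems"; buzzword: disaggregation of X (Storage/Compute/Memory)
• Evolution from TIMELY to Swift
  • Uses accurate E2E delay measurements. Easy to map to SLOs. Separates host delay from fabric delay
  • Does not rely on network (switch) features, so next-generation switch features and settings remain usable
• ECN-based schemes: poorly suited to large-scale incast and IOPS-heavy workloads -> cannot deal with processing delay on the host side. e.g. DCTCP, HPCC
• Packet-scheduling schemes: explicit scheduling together with the switches -> hard to deploy and maintain, and infeasible with multi-tenancy. e.g. pFabric
• Swift's results
  • Reduces server and switch queuing delay even for traffic such as incast; low latency even at 100Gbps; almost zero packet loss
  • Good RPC performance as seen from a variety of application layers
  • Delay is a simple congestion signal that performs well (simpler than TIMELY). Handles host-side congestion. Handles incast and similar patterns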

Slide 7

Slide 7 text

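2. MOTIVATION
• Storage Workloads
  • HDD: O(10)ms -> Flash O(100us) -> …
  • Distributed (access to multiple devices) -> bounded by tail latency
• Host Networking Stacks
  • Kernel changes are costly -> implemented in Snap (OS bypass & microkernel NW module)
  • Processing delay on the host side must be detected as well
• Datacenter Switches
  • Performance and models vary widely, so operating congestion control on each device is difficult
  • Example: tuning ECN buffer sizes
  • Aim to reduce NW delay based on host-side signals
https://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf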

Slide 8

Slide 8 text

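3. SWIFT DESIGN & IMPLEMENTATION (1/4) Design principles
• Swift's principles
  • Provide high bandwidth, low latency, and near-zero loss for diverse workloads in a large-scale DCN
  • Lower E2E and NW fabric latency through congestion control at the host (NIC/software stack)
  • Be CPU-efficient
• AIMD (Additive-Increase Multiplicative-Decrease) algorithm
  • The window grows additively and shrinks multiplicatively
• Delay is separated into a NIC-to-NIC (fabric) part and a host part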

Slide 9

Slide 9 text

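3. SWIFT DESIGN & IMPLEMENTATION (2/4) Organizing the delay signals
(Figure annotations: components of the end-to-end delay)
• Time spent in the queue (time until transmission starts)
• host -> SW total time (serialization, queuing, etc.)
• SW -> host total time (the delays may not be symmetric)
• NIC queue wait time: rises sharply under high load
• Time to process the packet and return the ack
• Time to process the ack packet and complete the exchange
• F is the indicator of NW congestion; E is the indicator of host congestion
• The return path cannot (normally, directly) be controlled -> it can be controlled if the remote side also uses Swift
• The remote side sends back t4-t2
• t6-t1: E2E delay (RTT)
• E: endpoint delay: (t4-t2) + (t6-t5)
• F: fabric delay: (t6-t1) - E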
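
The delay split on this slide can be sketched in a few lines of Python. This is only an illustration: the `Timestamps` class and `split_delay` function are hypothetical names, the meaning attached to each timestamp is an assumption based on the figure labels, and the fabric delay is taken to be the RTT (t6 - t1) minus the endpoint delay.

```python
from dataclasses import dataclass


@dataclass
class Timestamps:
    """Hypothetical container for the timestamps used on the slide (microseconds)."""
    t1: float  # assumed: data packet leaves the sender
    t2: float  # assumed: data packet arrives at the remote NIC
    t4: float  # assumed: remote side sends the ack
    t5: float  # assumed: ack arrives at the local NIC
    t6: float  # assumed: local stack finishes processing the ack


def split_delay(ts: Timestamps) -> tuple[float, float, float]:
    """Return (rtt, endpoint_delay, fabric_delay) for one packet/ack exchange."""
    rtt = ts.t6 - ts.t1                              # E2E delay (RTT)
    endpoint = (ts.t4 - ts.t2) + (ts.t6 - ts.t5)     # E: remote part + local part
    fabric = rtt - endpoint                          # F: what is left is the network
    return rtt, endpoint, fabric


if __name__ == "__main__":
    sample = Timestamps(t1=0.0, t2=10.0, t4=14.0, t5=24.0, t6=27.0)
    rtt, endpoint, fabric = split_delay(sample)
    print(f"RTT={rtt}us endpoint={endpoint}us fabric={fabric}us")
```

Only the difference t4-t2 has to travel back from the remote host, so the two hosts do not need synchronized clocks for this calculation.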

Slide 10

Slide 10 text

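3. SWIFT DESIGN & IMPLEMENTATION (3/4) How cwnd is determined
• cwnd is computed AIMD-style (additive increase, multiplicative decrease) from the difference between the target delay and the measured delay
• Delay is separated per flow
  • fcwnd is computed from the fabric delay
  • ecwnd is computed from the endpoint delay
  • Effective cwnd := min(fcwnd, ecwnd)
  • Think of ecwnd as a higher-performance rwnd; i.e. the plain window only accounts for memory
• To handle large-scale incast, cwnd may be 1 (packet) or less
  • pacing_delay is derived accordingly
• This separation improves tail latency by 2x
(Figures: illustrations of how the two windows vary)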
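
A minimal sketch of the window logic described above, in Python. It is a simplification under stated assumptions, not the production algorithm: the function names and the constants `ai`, `beta`, `max_mdf`, `min_cwnd`, and `max_cwnd` are hypothetical, and details such as limiting decreases to once per RTT or loss handling are left out. The pacing rule for cwnd < 1 (one packet every rtt / cwnd) is likewise an assumption about how a fractional window could be realized.

```python
def aimd_update(cwnd: float, delay: float, target: float,
                ai: float = 1.0, beta: float = 0.8, max_mdf: float = 0.5) -> float:
    """One per-ack AIMD step driven by the gap between delay and target."""
    if delay < target:
        # Additive increase: roughly `ai` packets per RTT. With cwnd < 1 there
        # is less than one ack per RTT, so `ai` is added directly.
        return cwnd + (ai / cwnd if cwnd >= 1.0 else ai)
    # Multiplicative decrease, scaled by how far the delay exceeds the target
    # and capped so that a single step never cuts more than `max_mdf`.
    factor = max(1.0 - beta * (delay - target) / delay, 1.0 - max_mdf)
    return cwnd * factor


def effective_cwnd(fcwnd: float, ecwnd: float,
                   min_cwnd: float = 0.001, max_cwnd: float = 1000.0) -> float:
    """The more congested side (fabric or endpoint) decides the window."""
    return max(min_cwnd, min(min(fcwnd, ecwnd), max_cwnd))


def pacing_delay(cwnd: float, rtt: float) -> float:
    """Gap between packets; only needed when the window is below one packet."""
    return rtt / cwnd if cwnd < 1.0 else 0.0


if __name__ == "__main__":
    fcwnd = aimd_update(10.0, delay=50.0, target=25.0)   # fabric is congested
    ecwnd = aimd_update(10.0, delay=20.0, target=25.0)   # endpoint is fine
    cwnd = effective_cwnd(fcwnd, ecwnd)
    print(fcwnd, ecwnd, cwnd, pacing_delay(0.5, rtt=30.0))
```

Keeping two windows means a slow host and a congested fabric each throttle the flow only as much as their own delay warrants.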

Slide 11

Slide 11 text

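3. SWIFT DESIGN & IMPLEMENTATION (4/4) Scaling the Target Delay
• How the target delay (fabric delay) is determined
• Topology-dependent delay increase
  • Delay differs depending on the path
  • Look at the TTL (hop count) per flow -> convert it into target delay
• Flow-count-dependent delay increase
  • Queue occupancy grows as O(sqrt(N)) with the number of flows N
  • In the fair-share state, cwnd is inversely proportional to the number of flows
• Target delay: t = base_target + #hops × ℏ + max(0, min(α / sqrt(fcwnd) + β, fs_range))
  • #hops × ℏ is the topology-dependent term
  • max(0, min(α / sqrt(fcwnd) + β, fs_range)) is the flow-count-dependent term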
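
The target-delay formula can be tried out with a short Python sketch. It reads the flow-dependent term as α / sqrt(fcwnd) + β, in line with the O(sqrt(N)) bullet. The slide does not define α and β, so the choice below is an assumption: they are set so that the term equals fs_range at a hypothetical `fs_min_cwnd` and falls to 0 at a hypothetical `fs_max_cwnd`. All numeric values in the usage lines are made up for illustration.

```python
import math


def target_delay(fcwnd: float, hops: int, base_target: float, per_hop: float,
                 fs_range: float, fs_min_cwnd: float, fs_max_cwnd: float) -> float:
    """Target delay = base + topology term + flow-count term (all in microseconds)."""
    # Assumed definitions: the flow term is fs_range at fs_min_cwnd, 0 at fs_max_cwnd.
    alpha = fs_range / (1.0 / math.sqrt(fs_min_cwnd) - 1.0 / math.sqrt(fs_max_cwnd))
    beta = -alpha / math.sqrt(fs_max_cwnd)

    topology_term = hops * per_hop  # #hops x per-hop scaling factor
    flow_term = max(0.0, min(alpha / math.sqrt(fcwnd) + beta, fs_range))
    return base_target + topology_term + flow_term


if __name__ == "__main__":
    params = dict(hops=3, base_target=25.0, per_hop=1.0,
                  fs_range=10.0, fs_min_cwnd=0.1, fs_max_cwnd=100.0)
    for fcwnd in (0.1, 1.0, 10.0, 100.0):
        print(fcwnd, round(target_delay(fcwnd, **params), 2))
```

A small fcwnd suggests many competing flows, so the target is raised; a large fcwnd leaves only the base and per-hop terms.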

Slide 12

Slide 12 text

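4. TAKEAWAYS FROM PRODUCTION (1/3) Overview of the production measurements
• Swift was rolled out over four years
  • HDD/SSD read/write (byte-intensive)
  • in-memory KVS (latency-intensive)
  • in-memory FS for BigQuery shuffle (IOPS-intensive)
• Measurement method
  • 1 week, fleet-wide, across a wide range of workloads, scales, and utilization levels
  • Switch statistics (link-util, loss-rate / port)
  • Host-to-host RTT (NIC timestamps for Swift and non-Swift)
  • Impact on applications
  • Compared against a DCTCP equivalent. Note that it uses GCN rather than ECN (delayed ACKs are disabled when ECN is received)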

Slide 13

Slide 13 text

4. TAKEAWAYS FROM PRODUCTION (2/3) Loss & Latency
• Loss
  • Loss rate is 1-2 orders of magnitude lower with Swift
  • Per cluster, GCN shows large variation
• Latency
  • Achieves average latency ~= target delay
  • Outliers: high QoS, heavy incast
• Latency is mitigated end to end
  • cwnd<1
  • Congestion detection at the host
Fig6. Edge (ToR to host) links
Fig9. ClusterEdge (ToR to host) links in the cluster
Fig12. Cluster Swift/GCN RTT

Slide 14

Slide 14 text

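4. TAKEAWAYS FROM PRODUCTION (3/3) Production Experience
• Wire rate is reached yet there is no loss -> this was reported as a bug twice!
  • Look at each QoS class (the non-Swift QoS classes should show loss)
• A bad reputation (!) that it achieves low latency at the expense of throughput
  • Compared against deliberately worse-performing variants, but throughput did not improve; if anything, RTT and loss increased
  • Variants: raised target delay, cwnd<1 disabled, etc.
• Swift's CPU usage is about 2.6% (1.4% for ack processing, 0.31% for timestamp measurement)
• Configuring and tuning ECN thresholds on switches was painful (Swift's E2E latency is controlled on the host side, so it is easy)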

Slide 15

Slide 15 text

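5. EXPERIMENTAL RESULTS
• Test environments
  • T1: 50Gbps host * 60
  • T2: 100Gbps host * 500
(Figure annotations)
• T1: confirmed that RTT ~= target delay under incast; confirmed that throughput is achieved at 25us and above
• T2: confirmed that Swift achieves throughput at an RTT of about 25us
• T1: Swift can withstand even large-scale incast (with GCN, even 1000:1 gives 5.1% loss and 8.6x average RTT)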

Slide 16

Slide 16 text

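5. EXPERIMENTAL RESULTS
(Figure annotations)
• T1: the congestion point differs per workload (app)
  • IOPS-type traffic: large fcwnd, small ecwnd (host delay)
  • byte-type traffic: small fcwnd (fabric delay), large ecwnd
• T1: run incast (64kB & 60B) with Swift-v0 (delay=RTT) as the baseline
  • Swift (not v0) is good on both throughput and RTT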

Slide 17

Slide 17 text

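5. EXPERIMENTAL RESULTS flow-fairness
(Figure annotations)
• T1: fairness confirmed with a small number of flows
• T1: when 50 hosts * 100 flows congest a link
  • flow-based scaling (???) is effective
  • The Jain fairness index (J=0.91) is very good
• T1: when path lengths differ (intra-rack vs. inter-rack)
  • Topology-based scaling is effective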
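
For reference, the Jain fairness index quoted above is J = (Σx)² / (n · Σx²) over per-flow throughputs x. The short Python sketch below computes it; the function name and the sample throughputs are made up for illustration and are not measurements from the paper.

```python
def jain_index(throughputs: list[float]) -> float:
    """Jain fairness index: 1.0 means perfectly equal shares, 1/n means one flow takes all."""
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one flow with non-zero throughput")
    total = sum(throughputs)
    return (total * total) / (n * sum_sq)


if __name__ == "__main__":
    print(jain_index([5.0, 5.0, 5.0, 5.0]))    # equal shares
    print(jain_index([10.0, 0.0, 0.0, 0.0]))   # one flow takes everything
    print(jain_index([4.0, 5.0, 6.0, 5.0]))    # mildly uneven shares
```

On this scale, J = 0.91 across thousands of flows means the per-flow throughputs are close to equal.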

Slide 18

Slide 18 text

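6. RELATED WORK
Evolution from TIMELY to Swift
• Separation of fabric congestion and host congestion
• Uses E2E delay rather than the RTT gradient (accurate timestamps are sent in the ack)
• Scales the target delay according to load and topology
• Handles large-scale incast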

Slide 19

Slide 19 text

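7. CONCLUSION AND FUTURE DIRECTIONS
• The paradox of congestion control
  • We want to build a (generally usable) congestion control
  • (Paradoxically) it ends up needing tuning per switch and host, via central management, and per application
• Lessons learned at Google
  • A simple congestion signal works well
  • NIC timestamps, protocol stack improvements
• Swift: 100Gbps bandwidth @ 30us
  • Looks reusable in existing protocol stacks
  • Supporting <10us will require further new techniques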

Slide 20

Slide 20 text

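Three-line summary
• Swift congestion control: splits NW delay from host processing delay and measures RTT accurately
  • Zero configuration changes on the switch side
• Achieves low latency and high throughput at the same time
• Handles massive incast and also achieves flow fairness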

Slide 21

Slide 21 text

EoP

Slide 22

Slide 22 text

Supplementary material

Slide 23

Slide 23 text

4. TAKEAWAYS FROM PRODUCTION
• Effective across a variety of applications
  • in-memory FS: latency stays stable over long periods
  • SSD storage system: low tail latency

Slide 24

Slide 24 text

4. TAKEAWAYS FROM PRODUCTION
• Example of QoS isolation
  • GCN (high-priority QoS) vs. Swift (low-priority QoS)
• IOPS-heavy cluster (IOPS-sensitive) vs. storage RPC (throughput-intensive)
  • For the IOPS type, the tail latency at the NIC (not the fabric) is large
  • Useful for debugging (host or fabric)

Slide 25

Slide 25 text

3. SWIFT DESIGN & IMPLEMENTATION (1/5) Loss Recovery and ACKs, Coexistence via QoS
• Packet loss
  • Tail latency is good with Swift
  • Minimal handling via SACK and RTO
• Uses QoS so that Swift can coexist with other TCP traffic
  • A subset of the switch QoS queues is allocated to Swift