#25 “Swift: Delay is Simple and Effective for Congestion Control in the Datacenter”

ACM SIGCOMM ’20, “Awarded best paper”
https://dl.acm.org/doi/abs/10.1145/3387514.3406591

cafenero_777

June 19, 2023

Transcript

  1. Research Paper Introduction #25 “Swift: Delay is Simple and Effective for Congestion Control in the Datacenter” (#78 overall) @cafenero_777 2021/07/29
  2. Agenda
    • Target paper
    • Overview and why I read it
    1. INTRODUCTION
    2. MOTIVATION
    3. SWIFT DESIGN & IMPLEMENTATION
    4. TAKEAWAYS FROM PRODUCTION
    5. EXPERIMENTAL RESULTS
    6. RELATED WORK
    7. CONCLUSION AND FUTURE DIRECTIONS
  3. Target paper
    • Swift: Delay is Simple and Effective for Congestion Control in the Datacenter
    • Gautam Kumar, Nandita Dukkipati, Keon Jang (MPI-SWS)∗, Hassan M. G. Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, David Wetherall, and Amin Vahdat
    • Google LLC
    • ACM SIGCOMM ’20, “Awarded best paper”
    • https://dl.acm.org/doi/abs/10.1145/3387514.3406591
  4. Overview and why I read it
    • Overview
      • Swift congestion control reduces latency
      • <50us tail latency at 100Gbps
      • 10x lower latency than DCTCP, and very robust to incast
    • Why I read it, and impressions
      • It was cited in ON-RAMP
      • It was built for datacenter use
      • Best Paper at SIGCOMM 2020
    https://olympic-placard-generater.herokuapp.com
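  5. 1. Introduction
    • This is the great age of "distributed systems"; buzzword: disaggregation of X (Storage/Compute/Memory)
    • Evolution from TIMELY to Swift
      • Uses accurate end-to-end (E2E) delay measurements. Easy to map to SLOs. Separates host delay from fabric delay
      • Does not use in-network features, so the features and settings of next-generation switches remain usable
    • ECN-based schemes (e.g. DCTCP, HPCC): poorly suited to large-scale incast and IOPS-heavy workloads -> they cannot handle processing delay on the host side
    • Packet-scheduling schemes (e.g. pFabric): explicit scheduling together with the switches -> hard to deploy and maintain, and infeasible with multi-tenancy
    • Swift's results
      • Reduces queuing delay at servers and switches even under traffic such as incast; low latency even at 100Gbps; almost zero packet loss
      • RPC performance is good as seen from a variety of application layers
      • Delay is a simple congestion signal with excellent performance (simpler than TIMELY). It handles host-side congestion, and incast as well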
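  6. 2. MOTIVATION
    • Storage Workloads
      • HDD: O(10)ms -> Flash: O(100us) -> …
      • Distributed access (multiple devices per request) -> bottlenecked by tail latency
    • Host Networking Stacks
      • Kernel changes are costly -> implemented in Snap (OS bypass & microkernel NW module)
      • Host-side processing delay must be detected too
    • Datacenter Switches
      • Performance and models vary widely, so operating congestion control on each device is difficult
      • Example: tuning ECN buffer sizes
    • Goal: reduce NW latency based on host-side signals
    https://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf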
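  7. 3. SWIFT DESIGN & IMPLEMENTATION (1/4) Design guidelines
    • Swift's guidelines
      • Provide high bandwidth, low latency and near-zero loss to varied workloads in a large-scale DCN
      • Lower E2E and NW fabric latency through congestion control at the host (NIC/software stack)
      • Be CPU-efficient
    • AIMD (Additive-Increase Multiplicative-Decrease) algorithm
      • The window grows additively and shrinks multiplicatively
    • Delay is split into a NIC-to-NIC (fabric) part and a host part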
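  8. 3. SWIFT DESIGN & IMPLEMENTATION (2/4) Sorting out the delay signals
    [Figure labels: components of the end-to-end delay]
    • Time spent in the queue (time until transmission starts)
    • Wait time in the NIC queue; the delay grows sharply under high load
    • Time to process the packet and return the ACK
    • Total time SW -> host (the delay may not be symmetric)
    • Time to process the ACK packet and complete the exchange
    • Indicator of NW congestion / indicator of host congestion
    • Return-path traffic cannot (normally, directly) be controlled -> it can be if the remote side also uses Swift
    • t4-t2 is sent back to the sender
    • Total time host -> SW (serialization, queuing, etc.)
    • t6-t1: E2E delay (RTT)
    • E: endpoint delay: (t4-t2) + (t6-t5)
    • F: fabric delay: (t6-t1) - E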
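The delay split on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `AckTimestamps` record and the `split_delay` helper are hypothetical names, and the meaning attached to each timestamp in the comments is an assumption read off the slide's labels. Fabric delay is taken as the RTT minus the endpoint delay.

```python
from dataclasses import dataclass


@dataclass
class AckTimestamps:
    """Hypothetical record of the slide's timestamps, all in microseconds."""
    t1: float  # assumed: data packet leaves the sender
    t2: float  # assumed: data packet reaches the remote NIC
    t4: float  # assumed: remote side sends the ACK
    t5: float  # assumed: ACK reaches the local NIC
    t6: float  # assumed: local stack finishes processing the ACK


def split_delay(ts: AckTimestamps) -> tuple[float, float, float]:
    """Return (rtt, endpoint_delay, fabric_delay) following the slide's definitions."""
    rtt = ts.t6 - ts.t1                            # E2E delay
    endpoint = (ts.t4 - ts.t2) + (ts.t6 - ts.t5)   # remote part + local part
    fabric = rtt - endpoint                        # what is left is spent in the network
    return rtt, endpoint, fabric


if __name__ == "__main__":
    sample = AckTimestamps(t1=0.0, t2=10.0, t4=25.0, t5=35.0, t6=40.0)
    print(split_delay(sample))
```

Each difference is taken between two timestamps from the same host, so this sketch does not need the two hosts' clocks to agree.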
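  9. 3. SWIFT DESIGN & IMPLEMENTATION (3/4) How cwnd is decided
    • cwnd is computed in AIMD fashion (additive increase, multiplicative decrease) from the difference between the target delay and the measured delay
    • Delay is separated per flow
      • fcwnd is computed from the fabric delay
      • ecwnd is computed from the endpoint delay
      • Effective cwnd := min(fcwnd, ecwnd)
      • Think of it as a high-performance rwnd, i.e. the plain cwnd accounts only for memory
    • To cope with large-scale incast, cwnd can also be 1 packet or less
      • pacing_delay is set up accordingly
    • This separation improves tail latency by 2x
    [Figures: illustrations of how the window varies]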
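The window logic on this slide can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the function names and the constants `ai`, `beta`, `max_mdf`, `min_cwnd` and `max_cwnd` are placeholders chosen for the example, the update runs once per call rather than per ACK, and details such as limiting how often the window may decrease are left out. Pacing a sub-packet window as RTT / cwnd is likewise an assumption of the sketch.

```python
def aimd_update(cwnd: float, delay: float, target: float,
                ai: float = 1.0, beta: float = 0.8, max_mdf: float = 0.5,
                min_cwnd: float = 0.001, max_cwnd: float = 1000.0) -> float:
    """One AIMD step driven by delay; expects target > 0. Constants are illustrative."""
    if delay < target:
        # Additive increase: roughly `ai` packets per RTT once cwnd >= 1.
        cwnd += ai / cwnd if cwnd >= 1.0 else ai
    else:
        # Multiplicative decrease, scaled by the overshoot and capped at max_mdf.
        factor = max(1.0 - beta * (delay - target) / delay, 1.0 - max_mdf)
        cwnd *= factor
    return min(max(cwnd, min_cwnd), max_cwnd)


def effective_cwnd(fcwnd: float, ecwnd: float) -> float:
    """The tighter of the fabric window and the endpoint window wins."""
    return min(fcwnd, ecwnd)


def pacing_delay(cwnd: float, rtt: float) -> float:
    """Gap between packets when the window is below one packet, else no pacing."""
    return rtt / cwnd if cwnd < 1.0 else 0.0


if __name__ == "__main__":
    fcwnd = aimd_update(10.0, delay=20.0, target=25.0)   # fabric delay under target
    ecwnd = aimd_update(10.0, delay=50.0, target=25.0)   # endpoint delay over target
    print(effective_cwnd(fcwnd, ecwnd), pacing_delay(0.5, rtt=30.0))
```

With a window of 0.5 packets and a 30us RTT, the sketch sends one packet every 60us, which is how a window below one packet can still be honoured.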
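  10. 3. SWIFT DESIGN & IMPLEMENTATION (4/4) Scaling the Target Delay
    • How the target delay (fabric delay) is decided
    • Topology-dependent delay increase
      • Delay differs depending on the path
      • Look at the TTL (hop count) per flow -> convert it into target delay
    • Flow-count-dependent delay increase
      • With N flows, queue occupancy grows as O(sqrt(N))
      • At fair share, cwnd is inversely proportional to the number of flows
    • Target delay: t = base_target + #hops × ℏ + max(0, min(α / sqrt(fcwnd) + β, fs_range))
      • #hops × ℏ is the topology-dependent term
      • max(0, min(α / sqrt(fcwnd) + β, fs_range)) is the flow-count-dependent term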
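A small numeric sketch may help show how the two scaling terms combine. It assumes the flow-dependent term has the form α / sqrt(fcwnd) + β, which follows from the slide's reasoning: queuing grows with sqrt(N) while the fair-share cwnd shrinks as 1/N, so the allowance should grow as 1/sqrt(cwnd). The function name `target_delay` and all numeric values are illustrative placeholders, not values from the paper.

```python
import math


def target_delay(fcwnd: float, hops: int, base_target: float, per_hop: float,
                 alpha: float, beta: float, fs_range: float) -> float:
    """Target fabric delay: base + topology term + flow-count term (same time unit)."""
    topology_term = hops * per_hop
    # Smaller fcwnd implies more competing flows, so allow more queuing delay,
    # but keep the allowance within [0, fs_range].
    flow_term = max(0.0, min(alpha / math.sqrt(fcwnd) + beta, fs_range))
    return base_target + topology_term + flow_term


if __name__ == "__main__":
    params = dict(hops=3, base_target=25.0, per_hop=1.0,
                  alpha=10.0, beta=-1.0, fs_range=20.0)
    for w in (0.01, 4.0, 100.0, 10000.0):
        print(w, target_delay(w, **params))
```

With these placeholder values, a very small window hits the `fs_range` cap and a very large window gets no flow-count allowance at all, so only the base and per-hop terms remain.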
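  11. 4. TAKEAWAYS FROM PRODUCTION (1/3) Overview of the production measurements
    • Swift was rolled out over four years
      • HDD/SSD read/write (byte-intensive)
      • in-memory KVS (latency-intensive)
      • in-memory FS for BigQuery shuffle (IOPS-intensive)
    • Measurement method
      • 1 week, fleet-wide, across a wide range of workloads, scales and utilization levels
      • Switch statistics (link-util, loss-rate / port)
      • Host-to-host RTT (NIC timestamps for Swift and non-Swift)
      • Impact on applications
    • Compared against a DCTCP equivalent. Note that it uses GCN rather than ECN (delayed ACKs are disabled once an ECN mark is received)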
  12. 4. TAKEAWAYS FROM PRODUCTION (2/3) Loss & Latency
    • Loss
      • Swift's loss rate is 1-2 orders of magnitude lower
      • Per cluster, GCN shows large variation
    • Latency
      • Average latency ~= target delay is achieved
      • Outliers: high QoS, heavy incast
      • Latency is eased end to end
        • cwnd<1
        • Congestion detection at the host
    Fig6. Edge (ToR to host) links
    Fig9. ClusterEdge (ToR to host) links in the cluster
    Fig12. Cluster Swift/GCN RTT
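  13. 4. TAKEAWAYS FROM PRODUCTION (3/3) Production Experience
    • No loss even though traffic runs at wire rate -> bug reports came in twice!
      • Look at each QoS class (the non-Swift QoS side should show loss)
    • A bad reputation (!) that Swift achieves low latency at the expense of throughput
      • Compared against a separate version with worse results, but throughput did not improve. If anything, RTT and loss increased
      • That version raised the target delay, disabled cwnd<1, and so on
    • Swift's CPU usage is around 2.6% (1.4% for ACK processing, 0.31% for time measurement)
    • Configuring and tuning ECN thresholds on switches was hard work (Swift's E2E latency is easy because it is controlled on the host side)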
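  14. 5. EXPERIMENTAL RESULTS
    • Test environment
      • T1: 50Gbps host × 60
      • T2: 100Gbps host × 500
    [Figure captions]
    • T1: confirmed that RTT ~= target delay is achieved under incast; confirmed that throughput is achieved at 25us and above
    • T2: confirmed that Swift achieves throughput at an RTT of around 25us
    • T1: Swift can withstand even large-scale incast (with GCN, even at 1000:1, loss is 5.1% and average RTT is 8.6x)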
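  15. 5. EXPERIMENTAL RESULTS
    [Figure captions]
    • T1: the point of congestion differs per workload (application)
      • IOPS-type traffic: fcwnd is large, ecwnd is small (host delay)
      • Byte-type traffic: fcwnd is small (fabric delay), ecwnd is large
    • T1: run incast (64kB & 60B) using Swift-v0 (delay = RTT)
      • Swift (not v0) is good on both throughput and RTT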
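  16. 5. EXPERIMENTAL RESULTS flow-fairness
    [Figure captions]
    • T1: fairness confirmed with a small number of flows
    • T1: when 50 hosts × 100 flows congest a link
      • flow-based scaling (???) is effective
      • The Jain fairness index (J=0.91) is very good
    • T1: when path lengths differ
      • Topology-based scaling is effective
      • intra-rack / inter-rack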
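For reference, Jain's fairness index for n per-flow throughputs x_1..x_n is J = (Σx)^2 / (n · Σx^2). It equals 1 when every flow gets the same share and 1/n when a single flow takes everything. The helper below is a minimal sketch with a hypothetical name and made-up sample inputs; it does not reproduce the measurements behind the slide's J=0.91.

```python
from typing import Sequence


def jain_index(throughputs: Sequence[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one flow with non-zero throughput")
    total = sum(throughputs)
    return (total * total) / (n * sum_sq)


if __name__ == "__main__":
    print(jain_index([5.0, 5.0, 5.0, 5.0]))   # perfectly even shares
    print(jain_index([10.0, 0.0, 0.0, 0.0]))  # one flow takes the whole link
    print(jain_index([6.0, 5.0, 5.0, 4.0]))   # mildly uneven shares
```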
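  17. 6. RELATED WORK
    Evolution from TIMELY to Swift
    • Separates fabric congestion from host congestion
    • Uses E2E delay rather than the RTT gradient (accurate timestamps are sent in the ACK)
    • Scales the target delay according to load and topology
    • Handles large-scale incast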
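  18. 7. CONCLUSION AND FUTURE DIRECTIONS
    • The paradox of congestion control
      • We want to build congestion control that is generally usable
      • Paradoxically, it ends up needing switch and host tuning, central management, and per-application tuning
    • What Google learned
      • A simple congestion signal works well
      • NIC timestamps, improvements to the protocol stack
    • Swift: 100Gbps bandwidth at 30us
      • It looks reusable with existing protocol stacks
      • Supporting <10us will require further new techniques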
  19. 4. TAKEAWAYS FROM PRODUCTION
    • Example of QoS separation
      • Example of GCN (high-priority QoS) vs. Swift (low-priority QoS)
    • IOPS-heavy cluster (IOPS-sensitive) vs. storage RPC (throughput-intensive)
      • The IOPS type has large tail latency at the NIC (not the fabric)
      • Useful for debugging (host or fabric)
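  20. 3. SWIFT DESIGN & IMPLEMENTATION (1/5) Loss Recovery and ACKs, Coexistence via QoS
    • Packet loss
      • Tail latency is good with Swift
      • Handled minimally, with SACK and RTO
    • Aim for coexistence of Swift and other TCP traffic by using QoS
      • Part of the switches' QoS queues is allocated to Swift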