#25 “Swift: Delay is Simple and Effective for Congestion Control in the Datacenter”

ACM SIGCOMM ’20, “Awarded best paper”
https://dl.acm.org/doi/abs/10.1145/3387514.3406591

cafenero_777

June 19, 2023

Transcript

  1. Research Paper Introduction #25

    “Swift: Delay is Simple and Effective for Congestion Control in the Datacenter”

    Cumulative #78
    @cafenero_777

    2021/07/29

  2. Agenda
    • Target paper

    • Overview and why I read it

    1. INTRODUCTION

    2. MOTIVATION

    3. SWIFT DESIGN & IMPLEMENTATION

    4. TAKEAWAYS FROM PRODUCTION

    5. EXPERIMENTAL RESULTS

    6. RELATED WORK

    7. CONCLUSION AND FUTURE DIRECTIONS

  3. Target paper
    • Swift: Delay is Simple and Effective for Congestion Control in the Datacenter

    • Gautam Kumar, Nandita Dukkipati, Keon Jang (MPI-SWS)∗, Hassan M. G.
    Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn,
    Christopher Alfeld, Michael Ryan, David Wetherall, and Amin Vahdat

    • Google LLC

    • ACM SIGCOMM ’20, “Awarded best paper”

    • https://dl.acm.org/doi/abs/10.1145/3387514.3406591

  4. Overview and why I read it
    • Overview

    • Swift congestion control reduces latency

    • <50us tail-latency@100Gbps

    • 10x lower latency than DCTCP, and very robust to incast

    • Why I read it, and impressions

    • It was cited in ON-RAMP

    • It was built for datacenter use

    • Best Paper@SIGCOMM 2020

    https://olympic-placard-generater.herokuapp.com

  5. Too many things are called “Swift”!
    • The name suggests something quick and nimble?

    • It sounds like it would be good at anything

    https://en.wikipedia.org/wiki/Swift

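  6. 1. Introduction
    • We are in the great age of "distributed systems"; buzzword: disaggregation of X (Storage/Compute/Memory)

    • Evolution from TIMELY to Swift

    • Uses accurate E2E delay measurement. Easy to map to SLOs. Separates host delay from fabric delay

    • Uses no in-network features, so the features and settings of next-generation switches remain usable

    • ECN-based approaches: poorly suited to large-scale incast and IOPS-heavy workloads -> cannot deal with host-side processing delay (e.g., DCTCP, HPCC)

    • Packet-scheduling approaches: explicit scheduling together with the switches -> hard to deploy and maintain, infeasible with multi-tenancy (e.g., pFabric)

    • Swift results

    • Reduces server and switch queuing delay even for traffic such as incast; low latency even at 100Gbps; nearly zero packet loss

    • Good RPC performance as seen from a variety of application layers

    • Delay is a simple congestion signal that also performs well (simpler than TIMELY). Handles host-side congestion. Handles incast and similar patterns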

  7. 2. MOTIVATION
    • Storage Workloads

    • HDD: O(10)ms -> Flash O(100us) -> …

    • Distributed (access to multiple devices) -> bounded by tail latency

    • Host Networking Stacks

    • Kernel changes are costly -> implemented in Snap (OS bypass & microkernel NW module)

    • Host-side processing delay must also be detected

    • Datacenter Switches

    • Performance and models vary widely, so operating congestion control on each device is difficult

    • Example: tuning ECN buffer sizes

    • Aim to reduce network delay based on host-side signals

    https://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
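The remark that accessing multiple devices makes a request bound by tail latency can be illustrated with a short calculation. This is only a sketch and is not taken from the paper: the function name `prob_hits_tail` is hypothetical, and it assumes every device access independently exceeds its latency percentile with the same probability.

```python
def prob_hits_tail(fanout: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `fanout` parallel accesses is slower
    than the given per-device latency percentile.

    Assumption (for illustration only): accesses are independent and
    identically distributed.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    if not 0.0 < percentile < 1.0:
        raise ValueError("percentile must be strictly between 0 and 1")
    # P(all accesses are fast) = percentile ** fanout, so take the complement.
    return 1.0 - percentile ** fanout


if __name__ == "__main__":
    for n in (1, 10, 100):
        p = prob_hits_tail(n)
        print(f"fanout={n:3d}: P(at least one access beyond p99) = {p:.3f}")
```

Under these assumptions, a request that fans out to 100 devices sees at least one p99-slow access about 63% of the time, which is why the tail rather than the mean dominates.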

  8. 3. SWIFT DESIGN & IMPLEMENTATION (1/4)

    Design principles

    • Swift's goals

    • Provide high bandwidth, low latency, and near zero loss for diverse workloads in large-scale datacenter networks

    • Lower E2E and network-fabric latency through congestion control at the host (NIC/software stack)

    • Be CPU-efficient

    • AIMD (Additive-Increase Multiplicative-Decrease) algorithm

    • Split delay into a NIC-to-NIC (fabric) part and a host-side part

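  9. 3. SWIFT DESIGN & IMPLEMENTATION (2/4)

    Breaking down the delay signals

    Figure annotations:

    • Time spent in the queue (time until transmission starts)

    • Wait time in the NIC queue: delay surges under high load

    • Total time host -> SW (serialization, queuing, etc.)

    • Total time SW -> host (the delay may not be symmetric)

    • Time to process the packet and return the ACK

    • Time to process the ACK packet and complete the exchange

    • Figure labels: indicator of network congestion; indicator of host congestion

    • The return path cannot (normally, directly) be controlled -> it can be if the remote side also runs Swift; the remote side sends t4-t2

    • t6-t1: E2E delay (RTT)

    • E: endpoint delay: (t4-t2) + (t6-t5)

    • F: fabric delay: (t6-t1) - E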
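The decomposition above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class and function names are hypothetical, and the meaning attached to each timestamp in the comments is an assumption based on the slide's annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timestamps:
    """Timestamps in microseconds, numbered as on the slide.

    The per-field meanings below are assumptions for illustration.
    """
    t1: float  # sender starts the exchange
    t2: float  # packet is received by the remote NIC
    t4: float  # remote side sends the ACK
    t5: float  # ACK is received by the local NIC
    t6: float  # sender finishes processing the ACK


def rtt(ts: Timestamps) -> float:
    """End-to-end delay: t6 - t1."""
    return ts.t6 - ts.t1


def endpoint_delay(ts: Timestamps) -> float:
    """E = (t4 - t2) + (t6 - t5).

    (t4 - t2) is measured entirely on the remote clock and carried back in
    the ACK, so the two hosts do not need synchronized clocks.
    """
    return (ts.t4 - ts.t2) + (ts.t6 - ts.t5)


def fabric_delay(ts: Timestamps) -> float:
    """F = RTT - E, the part of the delay attributed to the network."""
    return rtt(ts) - endpoint_delay(ts)


if __name__ == "__main__":
    sample = Timestamps(t1=0.0, t2=10.0, t4=18.0, t5=28.0, t6=31.0)
    print(rtt(sample), endpoint_delay(sample), fabric_delay(sample))
```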

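  10. 3. SWIFT DESIGN & IMPLEMENTATION (3/4)

    How cwnd is determined

    • cwnd is computed AIMD-style (additive increase, multiplicative decrease) from the difference between the target delay and the measured delay

    • Delay is split per flow

    • fcwnd is computed from the fabric delay

    • ecwnd is computed from the endpoint delay

    • Effective cwnd := min(fcwnd, ecwnd)

    • Think of ecwnd as a high-performance rwnd; i.e., a plain window only accounts for memory

    • To cope with large-scale incast, cwnd may also fall below 1 (packet)

    • pacing_delay is also derived appropriately

    • This separation strategy improves tail latency by 2x

    (Two figures: illustration of variation)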
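The window logic on this slide can be sketched as follows. It is a simplified illustration under stated assumptions, not the paper's algorithm verbatim: all names and parameter values (`AimdParams`, `ai`, `beta`, `max_mdf`, the cwnd bounds) are hypothetical, the once-per-RTT limit on decreases and loss handling are omitted, and pacing below one packet is assumed to be rtt / cwnd.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AimdParams:
    # All values are placeholders chosen for illustration.
    ai: float = 1.0         # additive increase per RTT, in packets
    beta: float = 0.8       # gain of the multiplicative decrease
    max_mdf: float = 0.5    # largest fraction removed by one decrease
    min_cwnd: float = 0.001
    max_cwnd: float = 256.0


DEFAULTS = AimdParams()


def update_cwnd(cwnd: float, delay: float, target: float,
                acked: int = 1, p: AimdParams = DEFAULTS) -> float:
    """One AIMD step driven by (delay - target). Units of delay are arbitrary
    as long as `delay` and `target` use the same one."""
    if delay < target:
        # Additive increase: about `ai` packets per RTT once cwnd >= 1.
        cwnd += (p.ai / cwnd if cwnd >= 1.0 else p.ai) * acked
    else:
        # Multiplicative decrease, proportional to how far delay overshoots.
        factor = max(1.0 - p.beta * (delay - target) / delay, 1.0 - p.max_mdf)
        cwnd *= factor
    return min(max(cwnd, p.min_cwnd), p.max_cwnd)


def effective_cwnd(fcwnd: float, ecwnd: float) -> float:
    """The more constrained of the fabric and endpoint windows wins."""
    return min(fcwnd, ecwnd)


def pacing_delay(cwnd: float, rtt: float) -> float:
    """With cwnd < 1, send one packet every rtt / cwnd; otherwise no pacing."""
    return rtt / cwnd if cwnd < 1.0 else 0.0


if __name__ == "__main__":
    fcwnd = update_cwnd(10.0, delay=20.0, target=25.0)  # below target: grow
    ecwnd = update_cwnd(10.0, delay=50.0, target=25.0)  # above target: shrink
    print(fcwnd, ecwnd, effective_cwnd(fcwnd, ecwnd), pacing_delay(0.5, 30.0))
```

Keeping two windows and taking the minimum lets the same update rule react separately to fabric and host congestion, which is the separation the slide describes.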

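  11. 3. SWIFT DESIGN & IMPLEMENTATION (4/4)

    Scaling the Target Delay

    • How the target delay (for fabric delay) is set

    • Topology-dependent delay increase

    • Delay differs depending on the path

    • Look at the TTL (hop count) per flow -> convert it into a target delay

    • Flow-count-dependent delay increase

    • Queue buildup grows as O(sqrt(N)) with the number of flows N

    • At fair share, cwnd is inversely proportional to the number of flows, so the target is scaled with 1/√fcwnd

    • Target delay: t = base_target + #hops × ℏ + max(0, min(α/√fcwnd + β, fs_range))

    • #hops × ℏ is the topology-dependent term; max(0, min(α/√fcwnd + β, fs_range)) is the flow-count-dependent term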
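The target-delay formula can be evaluated with a short sketch. Everything numeric here is a placeholder, and the way α and β are derived is an assumption made for illustration: they are chosen so that the flow-count term is 0 at `fs_max_cwnd` and equals `fs_range` at `fs_min_cwnd`. The names `per_hop`, `fs_min_cwnd`, and `fs_max_cwnd` do not appear on the slide.

```python
import math


def target_delay(hops: int, fcwnd: float,
                 base_target: float = 25.0,   # placeholder, microseconds
                 per_hop: float = 1.0,        # the per-hop scaling factor ℏ
                 fs_range: float = 100.0,     # cap on the flow-count term
                 fs_min_cwnd: float = 0.1,    # assumed lower cwnd anchor
                 fs_max_cwnd: float = 100.0   # assumed upper cwnd anchor
                 ) -> float:
    """t = base_target + hops * ℏ + max(0, min(α/√fcwnd + β, fs_range))."""
    if fcwnd <= 0:
        raise ValueError("fcwnd must be positive")
    # Assumed derivation: the term spans [0, fs_range] between the two anchors.
    alpha = fs_range / (1.0 / math.sqrt(fs_min_cwnd)
                        - 1.0 / math.sqrt(fs_max_cwnd))
    beta = -alpha / math.sqrt(fs_max_cwnd)
    flow_term = max(0.0, min(alpha / math.sqrt(fcwnd) + beta, fs_range))
    topology_term = hops * per_hop
    return base_target + topology_term + flow_term


if __name__ == "__main__":
    for w in (0.05, 0.1, 1.0, 10.0, 100.0, 1000.0):
        print(f"fcwnd={w:7.2f} -> target={target_delay(hops=3, fcwnd=w):.2f}")
```

A small fcwnd implies many competing flows, so the target is raised to tolerate the larger queue; a large fcwnd leaves only the base and per-hop terms.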

  12. 4. TAKEAWAYS FROM PRODUCTION (1/3)

    Overview of the production measurements

    • Swift was rolled out over four years

    • HDD/SSD read/write (byte-intensive)

    • in-memory KVS (latency-intensive)

    • in-memory FS for BigQuery shuffle (IOPS-intensive)

    • Measurement method

    • 1 week, fleet-wide, across a wide range of workloads, scales, and utilization levels

    • Switch statistics (link-util, loss-rate / port)

    • Host-to-host RTT (NIC timestamps for Swift and non-Swift traffic)

    • Impact on applications

    • Compared against a DCTCP equivalent: the baseline is GCN, which disables delayed ACKs once an ECN mark is received

  13. 4. TAKEAWAYS FROM PRODUCTION (2/3)

    Loss & Latency

    • Loss

    • Loss rate is 1-2 orders of magnitude lower with Swift

    • Per cluster, GCN shows large variation

    • Latency

    • Average latency ~= target delay is achieved

    • Outliers: high QoS, heavy incast

    • Latency is mitigated end to end

    • cwnd<1

    • Congestion detection at the host

    Fig6. Edge (ToR to host) links
    Fig9. ClusterEdge (ToR to host) links in the cluster
    Fig12. Cluster Swift/GCN RTT

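  14. 4. TAKEAWAYS FROM PRODUCTION (3/3)

    Production Experience

    • Running at wire rate yet with no loss -> bug reports came in twice!

    • Look per QoS class (the non-Swift QoS classes should show loss)

    • A bad reputation (!) that low latency is being achieved at the expense of throughput

    • Compared with alternative versions that give worse results: throughput still did not improve, and RTT and loss actually increased

    • The alternative versions: raised target delay, cwnd<1 disabled, etc.

    • Swift's CPU usage is about 2.6% (1.4% for ACK processing, 0.31% for timing measurement)

    • Configuring and tuning ECN thresholds on switches was painful (Swift's E2E latency is controlled at the host, so it is simple)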

  15. 5. EXPERIMENTAL RESULTS
    • Test environment

    • T1: 50Gbps host * 60

    • T2: 100Gbps host * 500

    • T1: confirmed that RTT ~= target delay under incast; confirmed that throughput is reached at 25us and above

    • T2: confirmed that Swift reaches full throughput at an RTT of about 25us

    • T1: Swift can withstand even large-scale incast (with GCN, even at 1000:1: loss 5.1%, average RTT 8.6x)

  16. 5. EXPERIMENTAL RESULTS
    • T1: the congestion point differs per workload (app)

    • IOPS-type traffic: fcwnd is large, ecwnd is small (host delay)

    • Byte-type traffic: fcwnd is small (fabric delay), ecwnd is large

    • T1: incast (64kB & 60B) is run using Swift-v0 (delay=RTT)

    • Swift (not v0) is good on both throughput and RTT

  17. 5. EXPERIMENTAL RESULTS

    flow-fairness

    • T1: fairness confirmed with a small number of flows

    • T1: when 50 hosts * 100 flows congest on a link, flow-based scaling (???) is effective

    • The Jain fairness-index is very good (J=0.91)

    • T1: when path lengths differ, topology-based scaling is effective

    • Figure panels: intra-rack, inter-rack
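For reference, Jain's fairness index quoted above is J = (Σx)² / (n · Σx²), which is 1.0 when all flows get the same throughput and 1/n when a single flow takes everything. A minimal sketch follows; the function name and the sample throughputs are made up for illustration and are not the paper's data.

```python
from typing import Iterable


def jain_index(throughputs: Iterable[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    xs = list(throughputs)
    if not xs:
        raise ValueError("at least one throughput value is required")
    sum_sq = sum(x * x for x in xs)
    if sum_sq == 0:
        raise ValueError("at least one throughput must be non-zero")
    total = sum(xs)
    return (total * total) / (len(xs) * sum_sq)


if __name__ == "__main__":
    print(jain_index([5.0, 5.0, 5.0, 5.0]))  # perfectly fair
    print(jain_index([1.0, 0.0, 0.0, 0.0]))  # one flow takes everything
    print(jain_index([4.0, 5.0, 6.0, 5.0]))  # mildly uneven, made-up values
```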

  18. 6. RELATED WORK
    Evolution from TIMELY to Swift
    • Separation of fabric and host congestion
    • Uses E2E delay rather than the RTT gradient (accurate timestamps are sent in ACKs)
    • Scales the target delay according to load and topology
    • Handles large-scale incast

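  19. 7. CONCLUSION AND FUTURE DIRECTIONS
    • The paradox of congestion control

    • The goal is congestion control that is (generally applicable)

    • (Paradoxically) it ends up needing tuning across switches and hosts, central management, and per-application adjustment

    • Lessons learned at Google

    • A simple congestion signal works well

    • NIC timestamps, protocol stack improvements

    • Swift: 100Gbps bandwidth @30us

    • It looks reusable in existing protocol stacks

    • Supporting <10us will require still newer techniques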

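  20. Three-line summary
    • Swift congestion control: splits network delay from host processing delay and measures RTT accurately

    • Zero configuration changes on the switch side

    • Achieves low latency and high throughput at the same time

    • Also handles heavy incast and achieves flow fairness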

  21. Supplementary material

  22. 4. TAKEAWAYS FROM PRODUCTION
    • Effective across a variety of applications

    • in-memory FS: latency stays stable over the long term

    • SSD storage system: low tail-latency

  23. 4. TAKEAWAYS FROM PRODUCTION
    • Example of QoS separation

    • Example of GCN (high-priority QoS) vs. Swift (low-priority QoS)

    • IOPS-heavy cluster (IOPS-sensitive) vs. storage RPC (throughput-intensive)

    • For IOPS-type workloads, the NIC (not fabric) tail-latency is large

    • Useful for debugging (host or fabric)

  24. 3. SWIFT DESIGN & IMPLEMENTATION (1/5)

    Loss Recovery and ACKs, Coexistence via QoS

    • Packet loss

    • Tail-latency is good with Swift

    • Minimal handling via SACK and RTO

    • Use QoS so that Swift and other TCP traffic can coexist

    • Allocate a portion of the switch QoS queues to Swift