#24 “Ananta: Cloud Scale Load Balancing”

cafenero_777

June 19, 2023

Transcript

  1. Research Paper Introduction #24


    “Ananta: Cloud Scale Load Balancing”

    Cumulative #75
    @cafenero_777

    2021/06/24

  2. Agenda
    • Target paper

    • Overview and why I chose to read it

    1. INTRODUCTION

    2. BACKGROUND

    3. DESIGN

    4. IMPLEMENTATION

    5. MEASUREMENTS

    6. OPERATIONAL EXPERIENCE

    7. RELATED WORK

    8. CONCLUSION

  3. Target paper
    • Ananta: Cloud Scale Load Balancing

    • Parveen Patel, Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert
    Greenberg, David A. Maltz, Randy Kern, Hemant Kumar, Marios Zikos,
    Hongyu Wu, Changhoon Kim, Naveen Karri

    • Microsoft

    • ACM SIGCOMM ’13

    • https://dl.acm.org/doi/10.1145/2534169.2486026

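  4. Overview and why I chose to read it
    • Overview

    • Ananta: Scalable L4LB (DSR/NAT)

    • 100 Gbps for a single VIP, more than 1 Tbps of aggregate bandwidth

    • Runs on Azure

    • Why I chose to read it, and impressions

    • Cited by the Azure VFP paper

    • Also cited quite often by other LB work (Maglev and others)
    https://www.connectedpapers.com/main/5c295df1a7f302c97f6f379eab6abba592811d42/Ananta-cloud-scale-load-balancing/graph
    [Figure: Connected Papers graph. Labeled nodes: Ananta, Maglev, SilkRoad, Beamer, Faild. Labeled clusters: middlebox family; distributed / high-efficiency family]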

  5. 1. Introduction
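    • The spread of cloud computing

    • High uptime (SLA), multi-tenancy, large-scale traffic

    • 100 Gbps per VIP, 1000 hosts per VIP, 6 to 60 operations per minute

    • 36% of all SLA violations/outages are LB related

    • Ananta (Sanskrit for "infinite")

    • Scalable L4 LB (NAT/DSR)

    • D-plane: ECMP (in NW), LB, NAT (in VFP/HV)

    • C-plane: SDN/Paxos, S-NAT coordination

    • Deployed in Azure in 2011/09: 100 instances, 1 Tbps, 100k VIPs

    • Covers L4LB in the cloud, the networked distributed system and its scaling, and presents measurement results and operational results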

  6. 2. BACKGROUND
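    • Data Center Clos NW: 40k servers at 10G, oversubscription 1:4, 400 Gbps at the border

    • Characteristics of VIP traffic

    • 44% of traffic is VIP traffic (intra-DC : outside-DC = 2:1)

    • Inter-DC in/out is 1:1 (data synchronization traffic)

    • Requirements summary

    1. “Scale, Scale and Scale”:

    • Low cost (1% of server cost; 400 machines would be too many)

    • Up to 100 Gbps per VIP and 1M concurrent connections, 100 configuration changes per minute

    2. Reliability: automatic recovery with an N+1 configuration, support for maintenance

    3. “Any Service Anywhere”: must not be tied to L2 domain limits

    4. Tenant isolation: protection against DoS side effects of sharing the LB (e.g. other customers losing their bandwidth)
    [Figure labels: 400Gbps, 100Tbps]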
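
The sizing figures on this slide can be re-derived with a little arithmetic. The sketch below is only a back-of-the-envelope check: the function names are made up for illustration, and reading "oversubscription 1:4" as "divide the aggregate server bandwidth by four" is an assumption that happens to reproduce the 100 Tbps figure label.

```python
# Back-of-the-envelope check of the slide's sizing numbers.
# All names here are illustrative; only the input values come from the slide.

def intra_dc_tbps(servers: int, nic_gbps: int, oversub: int) -> float:
    """Aggregate server bandwidth divided by the oversubscription factor, in Tbps.

    Assumption: 1:4 oversubscription means the fabric carries 1/4 of the
    sum of all server NIC speeds.
    """
    return servers * nic_gbps / oversub / 1000


def max_lb_servers(servers: int, budget_percent: int) -> int:
    """Largest LB fleet allowed if LB cost is capped at a percentage of servers."""
    return servers * budget_percent // 100


if __name__ == "__main__":
    print(intra_dc_tbps(40_000, 10, 4))   # 100.0
    print(max_lb_servers(40_000, 1))      # 400
```

Under these assumptions the 1% budget corresponds to 400 machines for a 40k-server data center, which is the figure the slide calls too many.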

  7. 3. DESIGN (1/5)


    Principles & Architecture
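    • Designed to scale out

    • Like a router, avoid having a mechanism that maintains per-flow state

    • Do not use anything that requires synchronization across all machines

    • i.e. WRR (Weighted Round Robin) and Weighted Random

    • Do the hard processing on the HV side (offload it)

    • ACL, Rate Limit, Metering

    • Ananta Manager (AM), Multiplexer (Mux), Host Agent (HA)

    • Inbound: IP-in-IP, NAT and DSR

    • Outbound: for DIP->VIP, the VIP:sport mapping is kept in sync between the Mux and the HA
    [Figure labels: VIP advertisement/ECMP; Selection/IP-in-IP; L3 Routing; Decap/DNAT; reverse NAT; DSR (no encap); request sport and VIP; set sport and VIP; dispatch to the VM by dport and VIP]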
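
The inbound path on this slide (IP-in-IP at the Mux, decap and NAT at the Host Agent, DSR on the way back) can be sketched as a few pure functions. This is a minimal illustration, not Ananta's implementation: the `Packet` type, the function names and the example addresses are all hypothetical, and port translation is left out for brevity.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    sport: int
    dport: int
    outer_dst: Optional[str] = None  # set only while IP-in-IP encapsulated


def mux_encapsulate(pkt: Packet, dip: str) -> Packet:
    """Mux: leave the inner header untouched and add an outer header to the DIP."""
    return replace(pkt, outer_dst=dip)


def ha_decap_and_dnat(pkt: Packet) -> Packet:
    """Host Agent: strip the outer header and rewrite the destination VIP to the DIP."""
    if pkt.outer_dst is None:
        raise ValueError("expected an encapsulated packet")
    return replace(pkt, dst=pkt.outer_dst, outer_dst=None)


def ha_reverse_nat(reply: Packet, vip: str) -> Packet:
    """Host Agent: rewrite the source DIP back to the VIP.

    The reply then goes straight to the client (DSR), so it is never
    encapsulated and never passes through the Mux.
    """
    return replace(reply, src=vip)


if __name__ == "__main__":
    vip, dip, client = "203.0.113.10", "10.0.0.5", "198.51.100.7"
    inbound = Packet(src=client, dst=vip, sport=50000, dport=80)
    at_vm = ha_decap_and_dnat(mux_encapsulate(inbound, dip))
    reply = ha_reverse_nat(Packet(src=dip, dst=client, sport=80, dport=50000), vip)
    print(at_vm.dst, reply.src)
```

Keeping the inner header intact at the Mux is what lets the Host Agent recover the original VIP and do the NAT work itself.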

  8. 3. DESIGN (2/5)


    Principles & Architecture
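    • Fastpath: for VIP to VIP traffic, bypass the LB and let the HVs talk to each other directly

    • At first, traffic goes through the LB

    • Once the 3-way handshake completes, the DIP mapping information is sent as a redirect

    • The HA makes the two sides talk directly

    • From then on, traffic no longer passes through the LB

    • Protection against hijacking is required
    [Figure labels: "this connection is mapped to DIP2"; bind to DIP1; direct DIP1/DIP2 communication]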
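
The Fastpath steps above can be sketched as a small table kept by a Host Agent. Everything here is hypothetical (class name, method names, the flow key layout). Accepting redirects only from a known set of Mux addresses is just one assumed way to illustrate the hijacking concern, not the mechanism described in the paper.

```python
from typing import Dict, Iterable, Tuple

# (source VIP, source port, destination VIP, destination port)
Flow = Tuple[str, int, str, int]


class FastpathTable:
    """Per-host table of flows that may bypass the Mux (illustrative only)."""

    def __init__(self, trusted_muxes: Iterable[str]) -> None:
        self._trusted = set(trusted_muxes)
        self._direct: Dict[Flow, str] = {}

    def on_redirect(self, sender: str, flow: Flow, peer_dip: str) -> bool:
        """Install a direct mapping, but only if the redirect came from a known Mux."""
        if sender not in self._trusted:
            return False  # ignore redirects from unknown senders
        self._direct[flow] = peer_dip
        return True

    def next_hop(self, flow: Flow) -> str:
        """Peer DIP if the flow was redirected, otherwise the destination VIP (via the Mux)."""
        return self._direct.get(flow, flow[2])


if __name__ == "__main__":
    table = FastpathTable(trusted_muxes={"10.1.0.1"})
    flow = ("203.0.113.10", 40000, "203.0.113.20", 443)
    print(table.next_hop(flow))                      # still via the VIP
    table.on_redirect("10.1.0.1", flow, "10.0.2.7")  # redirect after the handshake
    print(table.next_hop(flow))                      # now direct to the peer DIP
```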

  9. 3. DESIGN (3/5)


    Mux/Host Agent
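    • Mux Pool (a set of Muxes)

    • Mux: BGP speaker. Advertises VIPs; on failure its routes are withdrawn. TCP MD5 authentication

    • AM deploys the VIP/DIP mapping to the Muxes. Selection uses the 5-tuple; the hash function and seed value are shared by all Muxes (so the same decision is guaranteed whichever Mux ECMP delivers a packet to)

    • Once the mapping has been looked up, the flow state is kept. Memory quotas are kept separate, though (SYN-flood countermeasure)

    • Trusted flows: more than one packet seen -> longer timeout

    • Untrusted flows: only one packet seen -> shorter timeout

    • When a Mux goes down, ECMP reshuffles flows across the remaining Muxes

    • If the mapping changes during the outage, flows cannot be preserved -> use a DHT (distributed hash table)

    • Host Agent: present on every HV; handles Fastpath, NAT and health checks (explained on slide 8)

    • Port reuse

    • Health checks are done on the HA side, not on the Mux.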
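
The point about a shared hash function and seed can be made concrete with a small sketch. It assumes a plain "hash modulo number of DIPs" selection, which is a stand-in chosen for brevity and not necessarily the function Ananta uses; `select_dip`, the seed value and the addresses are hypothetical.

```python
import hashlib
from typing import Sequence, Tuple

# (protocol, source IP, source port, destination IP, destination port)
FiveTuple = Tuple[str, str, int, str, int]


def select_dip(flow: FiveTuple, dips: Sequence[str], seed: int) -> str:
    """Pick a DIP from the 5-tuple using a seeded, process-independent hash.

    hashlib is used instead of the built-in hash() because hash() on strings
    is randomized per process, so two Mux processes would disagree.
    """
    key = repr((seed, flow)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return dips[int.from_bytes(digest[:8], "big") % len(dips)]


if __name__ == "__main__":
    dips = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]
    flow = ("tcp", "198.51.100.7", 50000, "203.0.113.10", 80)
    # Two Muxes with the same mapping and seed reach the same decision,
    # so it does not matter which one ECMP happens to deliver the packet to.
    mux_a = select_dip(flow, dips, seed=1234)
    mux_b = select_dip(flow, dips, seed=1234)
    print(mux_a == mux_b)  # True
```

A modulo-based choice like this remaps many flows when the DIP list changes, which is one reason a Mux still keeps per-flow state after the first lookup.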

  10. 3. DESIGN (4/5)


    Ananta Manager/Tenant Isolation
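    • Ananta Manager (AM)

    • Paxos-based distributed controller

    • Runs with 5 replicas; operates normally as long as 3 or more replicas are up

    • S-NAT: port allocation is handled in bulk

    • Tenant isolation

    • It is enough for each Mux to implement tenant isolation independently

    • AM: requests are served FCFS (first-come-first-serve), and new requests similar to one already pending are dropped. (2)

    • Mux: when a VIP exceeds its fair bandwidth, packets are dropped with a probability proportional to the excess bandwidth (drop and rate limit)

    • The top-talker VIP (the one sending the most traffic) is moved off the Mux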
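
The Mux-side rule "drop with a probability proportional to the excess bandwidth" could look like the sketch below. The exact formula is an assumption: here the probability is the excess divided by the observed rate, so that the expected rate after dropping equals the fair share. Function names and units are hypothetical.

```python
import random


def drop_probability(observed_bps: float, fair_share_bps: float) -> float:
    """Fraction of packets to drop so the expected surviving rate equals the fair share.

    Assumed formula: (observed - fair_share) / observed, clamped at 0.
    """
    if observed_bps <= 0 or observed_bps <= fair_share_bps:
        return 0.0
    return (observed_bps - fair_share_bps) / observed_bps


def should_drop(observed_bps: float, fair_share_bps: float, rng=random) -> bool:
    """Randomly decide whether to drop one packet of this VIP."""
    return rng.random() < drop_probability(observed_bps, fair_share_bps)


if __name__ == "__main__":
    print(drop_probability(100.0, 200.0))  # 0.0  (under the fair share)
    print(drop_probability(400.0, 100.0))  # 0.75 (4x over the fair share)
```

Because each Mux can evaluate this from its own counters, it fits the slide's point that tenant isolation can be implemented per Mux without global coordination.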

  11. 3. DESIGN (5/5)


    Alternatives
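    • DNS-based LB

    • Load distribution is hard to predict in advance (requests from clients are skewed) ???

    • It takes time for DNS caches to expire

    • Stateful functions (NAT etc.) are not possible

    • OpenFlow-based LB

    • Commercial OpenFlow devices handle only 2-4k flows (a Mux needs to keep state for ~millions of flows)

    • Tenant isolation functionality

    • Cannot advertise via BGP (leave that to the AM?)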

  12. 4. IMPLEMENTATION
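    • AM: responsiveness matters

    • Lock-free, SEDA-style (staged event-driven architecture) design

    • Shared thread pool (total thread count capped)

    • Priorities (e.g. VIP creation gets a priority slot)

    • Implemented with Paxos SDK + Discovery + Health Monitoring

    • Guarantees that the primary does the processing

    • Guarantees that no more than one AM instance is down at a time during upgrades

    • Mux: packet processing in the kernel (driver) + BGP processing in user mode

    • Uses kernel features as is: IPIP/RSS/IPv6 etc

    • For one VIP: 20k DIPs, 1.6M SNAT port mappings. Keeps state for ~millions of concurrent connections
    [Figure labels: *5, *8, *all]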

  13. 5. MEASUREMENTS


    Micro-benchmark
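    • CPU load with and without Fastpath: 10 VMs * 2 tenants, 1 MB transferred per connection. Host load rises slightly, but Mux load drops sharply

    • SYN-flood Attack Mitigation: 10 VMs * 5 tenants (base traffic + SYN-flood * 10 runs). At medium to high load, DoS traffic becomes harder to tell apart

    • Latency up to S-NAT: nearly all requests finish within 75 ms. When no port can be secured, the extra time shows up as a long tail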

  14. 5. MEASUREMENTS


    Real World Data (1/2)
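    • 1 Tbps in total, 3 years in operation, Internet/intranet traffic, uses: blob, table/queue, storage

    • S-NAT request time: in percentile terms there is almost no problematic scenario (<- 50ms, <- 200ms, <- max 2s)

    • Availability: req/5min @ test tenant, average availability 99.95%. Annotated dips: Mux overload (SYN-flood), network problem, false positive

    • Config completion time: <- 75ms@50%ile, <- max 2s. Depends on the scale of the tenant and the Mux. Stays within the SLA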

  15. 5. MEASUREMENTS


    Real World Data (2/2)
    • Bandwidth and load across 14 Muxes

    • 800Mbps (220Kpps) / core

    • Processing is clearly being spread by ECMP

    • Total 33.6 Gbps: 2.4 Gbps * 14
    [Figure label: (25%)]
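
As a quick sanity check of the numbers on this slide, the snippet below recomputes the total and derives an average packet size from the per-core figures. The average packet size is not stated on the slide; it is only what the two per-core numbers imply if they describe the same traffic. Function names are illustrative.

```python
def total_gbps(mux_count: int, per_mux_gbps: float) -> float:
    """Aggregate bandwidth across identical Muxes."""
    return mux_count * per_mux_gbps


def avg_packet_bytes(core_mbps: float, core_kpps: float) -> float:
    """Average packet size implied by a bit rate and a packet rate."""
    bytes_per_second = core_mbps * 1e6 / 8
    packets_per_second = core_kpps * 1e3
    return bytes_per_second / packets_per_second


if __name__ == "__main__":
    print(f"{total_gbps(14, 2.4):.1f} Gbps")             # 33.6 Gbps
    print(f"{avg_packet_bytes(800, 220):.0f} bytes")     # 455 bytes
```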

  16. 6. OPERATIONAL EXPERIENCE
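    • 3 years of operation in the cloud

    • Reasons for "giving up" on HW LBs: they cannot cope with DoS attacks, lack elasticity, and cannot keep up with bandwidth growth and price pressure

    • Problems actually encountered with the SW LB, and open issues

    • AM dual-primary problem: the old primary keeps sending requests to the Muxes, and the Muxes reject them.

    • Solved by having the primary run a transaction whenever a Mux rejects its request

    • IP-in-IP changes the effective MTU. The HA is supposed to adjust the MSS, but for some reason the adjustment was sometimes lost and packets exceeding the MTU were dropped

    • A bug in a certain home router that forces the MSS to a fixed value

    • A TCP bug in a certain mobile OS that sends full-size segments as is when reconnecting

    • Raised the MTU across the whole network

    • BGP and LB share the same machine, so when bandwidth overflows both go down together. Worse, when one machine fails its traffic shifts to the others, so cascading failures are possible

    • Options: separate interfaces for BGP and LB, or throttle the traffic rate on the router side. Keeping BGP and LB together makes the design simpler

    • The idle connection timeout (60 seconds) problem of HW LBs

    • The SW LB also inherited the protection against state exhaustion (DoS protection). Mobile connections are often left open for a long time -> the VIP mapping already exists, so no per-flow state has to be created -> the timeout could safely be made longer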
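
The MTU problem above comes down to header arithmetic. The sketch below assumes IPv4 without options, TCP without options, and a 20-byte outer header for IP-in-IP; the function names are hypothetical.

```python
IPV4_HEADER = 20  # bytes, no IP options (assumption)
TCP_HEADER = 20   # bytes, no TCP options (assumption)


def clamped_mss(path_mtu: int, encap_overhead: int = IPV4_HEADER) -> int:
    """Largest TCP payload that still fits the MTU after IP-in-IP encapsulation."""
    return path_mtu - encap_overhead - IPV4_HEADER - TCP_HEADER


def fits(payload: int, path_mtu: int, encap_overhead: int = IPV4_HEADER) -> bool:
    """True if a segment with this payload fits the MTU once encapsulated."""
    return payload + TCP_HEADER + IPV4_HEADER + encap_overhead <= path_mtu


if __name__ == "__main__":
    print(clamped_mss(1500))   # 1440
    print(fits(1460, 1500))    # False: a full-size segment overflows after encap
    print(fits(1440, 1500))    # True
```

With a 1500-byte MTU, a sender that ignores the adjusted MSS and keeps sending 1460-byte payloads ends up 20 bytes over after encapsulation, which matches the drops described above and explains why raising the network MTU removes the problem.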

  17. 7. RELATED WORK
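    • HW LBs are scale-up (1+1 model)

    • Cannot scale up or down with demand

    • Cloud environments need N+1 redundancy to meet uptime requirements

    • Virtual appliances and OSS (HAProxy etc.)

    • Cannot do N+1. On a network failure they fail over to a spare IP (interface), which ties them to an L2 domain

    • Cannot scale a single VIP

    • Embrane: runs on the host side; Egi et al. / RouteBricks: high-performance routers built on commodity HW

    • ETTM: every end host does packet processing. Ananta uses dedicated servers for the LB only.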

  18. 8. CONCLUSION
    • Ananta

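    • Distributed L4LB/NAT

    • Designed to meet the requirements of multi-tenancy, high reliability and (Azure's) operations

    • Should be useful outside Azure as well

    • Large-scale use needs a design that is cost-effective and can scale out

    • ECMP, BGP, DSR, Fastpath, host-side NAT, rate limit

    • 100 LB machines, providing VIP service to more than 100,000 users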

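  19. Three-line summary
    • The design philosophy is the same as Maglev (Google) and VPPLB (YNWLB2)

    • BGP/ECMP/Consistent-hash/L4LB/DSR

    • Techniques for scaling

    • Offload the processing an LB needs (NAT, health checks, etc.) to the HV side

    • The 1% rule (the number of LB nodes is at most 1% of the servers in the cluster)

    • LB configuration changes are extremely fast (75 ms)

    • Fastpath (switching mid-connection to communication that bypasses the LB) makes it even more efficient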
