Slide 1

Slide 1 text

Research Paper Introduction #24 “Ananta: Cloud Scale Load Balancing” (#75 overall) @cafenero_777 2021/06/24

Slide 2

Slide 2 text

Agenda
• Target paper
• Overview and why I read it
1. INTRODUCTION
2. BACKGROUND
3. DESIGN
4. IMPLEMENTATION
5. MEASUREMENTS
6. OPERATIONAL EXPERIENCE
7. RELATED WORK
8. CONCLUSION

Slide 3

Slide 3 text

Target paper
• Ananta: Cloud Scale Load Balancing
• Parveen Patel, Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A. Maltz, Randy Kern, Hemant Kumar, Marios Zikos, Hongyu Wu, Changhoon Kim, Naveen Karri
• Microsoft
• ACM SIGCOMM ’13
• https://dl.acm.org/doi/10.1145/2534169.2486026

Slide 4

Slide 4 text

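Overview and why I read it
• Overview
  • Ananta: scalable L4 LB (DSR/NAT)
  • 100 Gbps for a single VIP, more than 1 Tbps of bandwidth in total
  • Runs on Azure
• Why I read it, and impressions
  • Cited by the Azure VFP paper
  • Also cited quite often by other LB work (Maglev and others)
[Figure: Connected Papers graph, https://www.connectedpapers.com/main/5c295df1a7f302c97f6f379eab6abba592811d42/Ananta-cloud-scale-load-balancing/graph. Labeled nodes: Ananta, Maglev, SilkRoad, Beamer, Faild. Clusters: middlebox-related work; distributed, high-efficiency work.]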

Slide 5

Slide 5 text

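1. Introduction
• Spread of cloud computing
  • High availability (SLA), multi-tenancy, large-scale traffic
  • 100 Gbps per VIP, 1,000 hosts per VIP, 6 to 60 operations per minute
  • 36% of all SLA violations/incidents are LB-related
• Ananta (Sanskrit for "infinite")
  • Scalable L4 LB (NAT/DSR)
  • D-plane: ECMP (in the network), LB and NAT (in VFP/HV)
  • C-plane: SDN/Paxos, S-NAT coordination
  • Deployed in Azure in 2011/09: 100 machines, 1 Tbps, 100k VIPs
• Presents measurement and operational results on L4 LB in the cloud, distributed network systems, and scaling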

Slide 6

Slide 6 text

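2. BACKGROUND
• Data center Clos network: 40k servers at 10G each, 1:4 oversubscription, 400 Gbps at the border
• Nature of VIP traffic
  • 44% of traffic is VIP traffic (inside DC : outside DC = 2:1)
  • Inter-DC in/out is 1:1, mostly data synchronization
• Requirements summary
1. “Scale, Scale and Scale”:
  • Low cost (1% of server cost; 400 machines would be too many)
  • Up to 100 Gbps and 1M concurrent connections per VIP, 100 configuration changes per minute
2. Reliability: automatic recovery with an N+1 configuration, maintenance support
3. “Any Service Anywhere”: not bound by L2 domain limits
4. Tenant isolation: countermeasures against DoS impact caused by sharing an LB (e.g., other customers losing bandwidth)
[Figure labels: 400Gbps, 100Tbps]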

Slide 7

Slide 7 text

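3. DESIGN (1/5) Principles & Architecture
• Designed to scale out
  • Like a router, avoid keeping a flow-tracking mechanism
  • Do not use anything that requires synchronization across all nodes
    • i.e., WRR (Weighted Round Robin) and Weighted Random
• Do the hard processing on the HV side (offload it)
  • ACL, rate limiting, metering
• Ananta Manager (AM), Multiplexer (Mux), Host Agent (HA)
  • Inbound: IP-in-IP, NAT and DSR
  • Outbound: for DIP->VIP, the VIP:sport mapping is kept synchronized between the Mux and the HA
[Figure labels: VIP advertisement / ECMP; selection / IP-in-IP; L3 routing; decap / DNAT; reverse NAT; DSR (no encap); request sport and VIP; set sport and VIP; distribute to VMs by dport and VIP]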
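The outbound bullet above (the HA asks for a VIP and source port, and the Mux must know the same mapping for return traffic) can be sketched as a small port-lease table. This is a minimal illustration only: the class `SnatAllocator`, its method names, the port range, and the chunk size of 8 are hypothetical choices for the example, not details taken from the slides.

```python
class SnatAllocator:
    """Toy stand-in for SNAT bookkeeping; every name here is hypothetical."""

    def __init__(self, vip, first_port=1024, last_port=65535, chunk=8):
        self.vip = vip
        self.last_port = last_port
        self.chunk = chunk          # ports leased per request (illustrative value)
        self.next_port = first_port
        self.leases = {}            # source port on the VIP -> DIP

    def lease(self, dip):
        """Lease a contiguous chunk of source ports on the VIP to one DIP."""
        start = self.next_port
        end = start + self.chunk
        if end - 1 > self.last_port:
            raise RuntimeError("no SNAT ports left on " + self.vip)
        for port in range(start, end):
            self.leases[port] = dip
        self.next_port = end
        return range(start, end)

    def dip_for_return_packet(self, dst_port):
        """Lookup a Mux would need for a packet arriving at VIP:dst_port."""
        return self.leases.get(dst_port)


alloc = SnatAllocator("203.0.113.10")
ports_a = alloc.lease("10.0.0.1")
ports_b = alloc.lease("10.0.0.2")
```

Because both sides consult the same table, a reply addressed to VIP:sport can be steered back to the DIP that opened the connection.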

Slide 8

Slide 8 text

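3. DESIGN (2/5) Principles & Architecture
• Fastpath: for VIP-to-VIP communication, bypass the LB and let the HVs talk to each other directly
  • Traffic initially goes through the LB
  • Once the 3-way handshake (3WHS) completes, the DIP mapping information is sent as a redirect
  • The HA makes the endpoints communicate directly
  • From then on, traffic does not pass through the LB
  • Countermeasures against hijacking are required
[Figure labels: "this connection is mapped to DIP2"; DIP1 binding; direct DIP1/DIP2 communication]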
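The Fastpath steps above can be sketched from the host agent's point of view. This is an assumption-laden toy: the class `HostAgent`, its methods, and the "accept redirects only from known Muxes" check are hypothetical, standing in for whatever hijack protection the real system uses.

```python
class HostAgent:
    """Toy host agent that learns direct paths from redirects (hypothetical API)."""

    def __init__(self, trusted_muxes):
        self.trusted_muxes = set(trusted_muxes)
        self.direct = {}  # flow 5-tuple -> peer DIP learned from a redirect

    def on_redirect(self, sender, flow, peer_dip):
        """Accept a redirect only from a known Mux (simplistic hijack guard)."""
        if sender not in self.trusted_muxes:
            return False
        self.direct[flow] = peer_dip
        return True

    def next_hop(self, flow, vip):
        """Send to the peer DIP once redirected; otherwise go via the VIP (the LB)."""
        return self.direct.get(flow, vip)


flow = ("10.0.0.1", 40000, "203.0.113.20", 443, "tcp")
ha = HostAgent(trusted_muxes={"mux-1"})
before = ha.next_hop(flow, "203.0.113.20")          # still through the LB
spoofed = ha.on_redirect("attacker", flow, "10.9.9.9")
accepted = ha.on_redirect("mux-1", flow, "10.0.1.5")
after = ha.next_hop(flow, "203.0.113.20")           # now direct to the peer DIP
```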

Slide 9

Slide 9 text

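3. DESIGN (3/5) Mux/Host Agent
• Mux Pool (a set of Muxes)
  • Mux: BGP speaker. Advertises the VIPs; on failure the routes are withdrawn. TCP MD5 authentication
  • The AM deploys the VIP/DIP mapping to the Muxes. Selection is by 5-tuple, and the hash function and seed value are shared by all Muxes (this guarantees the same handling no matter which Mux ECMP delivers a packet to)
  • Once the mapping has been looked up, the flow is kept. Memory quotas are separate, however (SYN-flood countermeasure)
    • Trusted flows: multiple packets seen -> use a longer timeout
    • Untrusted flows: a single packet seen -> use a shorter timeout
  • When a Mux goes down, ECMP reshuffles everything
    • If the mapping changes during the outage, flows cannot be maintained -> use a DHT (distributed hash table)
• Host Agent: present on every HV; performs Fastpath, NAT, and health checks (explained on p. 8)
  • Port reuse feature
  • Health checks are done on the HA side, not on the Mux.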
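The shared hash function and seed mentioned above can be illustrated with a few lines of code. This is a sketch under assumptions: `select_dip`, the use of SHA-256, and the sample addresses are illustrative choices, not the hash Ananta actually uses.

```python
import hashlib


def select_dip(five_tuple, dips, seed):
    """Pick a DIP for a flow from its 5-tuple, a shared seed, and the DIP list.

    Any Mux holding the same seed and the same DIP list returns the same
    answer, so it does not matter which Mux ECMP happens to pick.
    """
    key = repr((seed, five_tuple)).encode()
    digest = hashlib.sha256(key).digest()
    return dips[int.from_bytes(digest[:8], "big") % len(dips)]


flow = ("198.51.100.7", 40000, "203.0.113.10", 443, "tcp")
dips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
seed = 12345

on_mux_a = select_dip(flow, dips, seed)
on_mux_b = select_dip(flow, list(dips), seed)  # a second Mux with identical state
```

Note that this simple modulo scheme remaps flows whenever the DIP list changes, which is one reason a Mux still keeps per-flow state after the first lookup.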

Slide 10

Slide 10 text

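3. DESIGN (4/5) Ananta Manager/Tenant Isolation
• Ananta Manager (AM)
  • Paxos-based distributed controller
  • Runs with 5 replicas; operates normally with 3 or more replicas
  • S-NAT: port allocation is handled in bulk
• Tenant isolation
  • It is enough for each Mux to implement per-tenant isolation independently
  • AM: requests are served FCFS (first-come-first-serve), and similar new requests are dropped. (2)
  • Mux: when a VIP exceeds its fair bandwidth, packets are dropped and rate limited with a probability proportional to the excess bandwidth
  • The top talker (the VIP sending the most traffic) is moved off the Mux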
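One way to read "drop with a probability proportional to the excess bandwidth" is sketched below. The formula is an assumption made for illustration (excess divided by offered load, so that the expected accepted traffic equals the fair share); the function names and units are hypothetical.

```python
import random


def drop_probability(offered_mbps, fair_share_mbps):
    """Fraction of packets to drop so the expected accepted rate equals the fair share."""
    if offered_mbps <= 0 or offered_mbps <= fair_share_mbps:
        return 0.0
    return (offered_mbps - fair_share_mbps) / offered_mbps


def should_drop(offered_mbps, fair_share_mbps, rng=random):
    """Per-packet decision; rng is injectable so the behaviour can be reproduced."""
    return rng.random() < drop_probability(offered_mbps, fair_share_mbps)


# A VIP offering 4x its fair share loses about three quarters of its packets.
rng = random.Random(0)
drops = sum(should_drop(400.0, 100.0, rng) for _ in range(10000))
```

A VIP at or below its fair share is never penalised, so a well-behaved tenant is unaffected by a noisy neighbour on the same Mux.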

Slide 11

Slide 11 text

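3. DESIGN (5/5) Alternatives
• DNS-based LB
  • Hard to predict the load distribution in advance (requests from clients are skewed) ???
  • It takes time for DNS caches to expire
  • Cannot do stateful processing (NAT, etc.)
• OpenFlow-based LB
  • Commercial OpenFlow devices handle only 2-4k flows (a Mux wants to hold state for ~M flows)
  • Tenant isolation functionality
  • Cannot advertise via BGP (leave it to the AM?)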

Slide 12

Slide 12 text

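4. IMPLEMENTATION
• AM: responsiveness matters
  • Lock-free design in the style of SEDA (Staged Event-Driven Architecture)
    • Shared thread pool (total count is capped)
    • Priorities (e.g., VIP creation gets a priority slot)
  • Implemented with Paxos SDK + Discovery + Health Monitoring
    • Guarantees that the primary does the processing
    • Guarantees that no more than one AM instance goes down during an upgrade
• Mux: packet processing in the kernel (driver) + BGP processing in user mode
  • Uses kernel features as-is: IPIP/RSS/IPv6, etc.
  • 20k DIPs and 1.6M SNAT port mappings per VIP. Holds state for ~M concurrent connections
[Figure labels: *5, *8, *all]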

Slide 13

Slide 13 text

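5. MEASUREMENTS Micro-benchmark
• CPU load with and without Fastpath: 10 VMs * 2 tenants, 1 MB transferred per connection. Host load rises slightly, but Mux load drops sharply.
• SYN-flood attack mitigation: 10 VMs * 5 tenants (base traffic + SYN-flood * 10 runs). At medium to high load, DoS traffic becomes harder to tell apart.
• Latency up to S-NAT: almost all requests finish within 75 ms. When no port can be secured, extra time is incurred with a long tail.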

Slide 14

Slide 14 text

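5. MEASUREMENTS Real World Data (1/2)
• 1 Tbps in total, 3 years in operation, inter/intranet; uses: blob, table/queue, storage
• S-NAT request time: marks at 50 ms, 200 ms, and max 2 s. In percentile terms there are almost no problematic scenarios.
• Availability: req/5min @ test tenant, average availability 99.95%. Dips are attributed to high Mux load (SYN-flood), network problems, and false detections.
• Config completion time: 75 ms at the 50th percentile, max 2 s. Depends on the scale of the tenant and the Muxes. Within the SLA.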

Slide 15

Slide 15 text

5. MEASUREMENTS Real World Data (2/2)
• 800 Mbps (220 Kpps) per core
• Processing is indeed spread by ECMP
• 33.6 Gbps in total: 2.4 Gbps * 14
[Figure: bandwidth and load (25%) of the 14 Muxes]
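The figures on this slide can be cross-checked with a little arithmetic. The sketch below recomputes the aggregate and derives an implied average packet size; that last value is not on the slide and assumes the 800 Mbps and 220 Kpps figures describe the same traffic.

```python
PER_MUX_GBPS = 2.4
MUX_COUNT = 14
CORE_BITS_PER_SEC = 800e6      # 800 Mbps per core
CORE_PACKETS_PER_SEC = 220e3   # 220 Kpps per core

# Aggregate across the pool, rounded to hide floating-point noise.
total_gbps = round(PER_MUX_GBPS * MUX_COUNT, 1)

# Derived value (an inference, not a number from the slide).
avg_packet_bytes = CORE_BITS_PER_SEC / 8 / CORE_PACKETS_PER_SEC

# Cores' worth of throughput per Mux at the per-core rate.
cores_per_mux = PER_MUX_GBPS * 1e9 / CORE_BITS_PER_SEC

print(total_gbps, round(avg_packet_bytes), round(cores_per_mux, 1))
```

Under these assumptions each Mux carries about three cores' worth of traffic, with an average packet of roughly 455 bytes.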

Slide 16

Slide 16 text

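6. OPERATIONAL EXPERIENCE
• Operated in the cloud for 3 years
• Why they "gave up on" HW LBs: cannot handle DoS attacks, no elasticity, cannot keep up with bandwidth growth and price pressure
• Problems and challenges actually encountered with the SW LB
  • AM dual-primary problem: the old primary sends requests to the Mux, and the Mux rejects them.
    • Solved by running a transaction whenever the Mux rejects a request
  • MTU changes because of IP-in-IP: the HA is supposed to adjust the MSS, but for some reason the adjustment was lost and packets over the MTU were dropped
    • A bug in a certain home router that fixes the MSS value
    • A TCP bug in a certain mobile OS that uses full-size segments as-is on TCP reconnection
    • Raised the MTU across the whole network
  • BGP and the LB live on the same machine, so both fail together when bandwidth overflows. Worse, when one machine goes down its traffic shifts to the others, so cascading failure is possible
    • Separate the interfaces for BGP and LB, or throttle the traffic rate on the router side. Colocating BGP and LB keeps the design simpler
  • HW LB idle connection timeout (60 seconds) problem
    • The SW LB also inherited the state-overflow (DoS) countermeasure. Mobile connections are often left open for a long time -> since the VIP mapping exists, no state needs to be created in the first place -> the timeout could safely be made longer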
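The MSS problem above comes down to header arithmetic. The sketch below assumes IPv4 and TCP headers without options (20 bytes each) and a 1500-byte MTU as the example; the function name `clamped_mss` is hypothetical.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options


def clamped_mss(path_mtu, encap_overhead=IPV4_HEADER_BYTES):
    """Largest TCP payload that still fits in path_mtu after IP-in-IP encapsulation.

    encap_overhead is the outer IPv4 header added by the encapsulation.
    """
    return path_mtu - encap_overhead - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


plain_mss = clamped_mss(1500, encap_overhead=0)  # no encapsulation
encap_mss = clamped_mss(1500)                    # with one outer IPv4 header
```

A sender that ignores the lowered MSS and sends 1460-byte segments produces 1520-byte packets after encapsulation, which no longer fit a 1500-byte MTU. Raising the network MTU, as the slide describes, removes the dependency on every endpoint honouring the adjustment.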

Slide 17

Slide 17 text

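7. RELATED WORK
• HW LBs are scale-up designs (1+1 model)
  • Cannot scale up or down with demand
  • Cloud environments need N+1 redundancy to meet availability requirements
• Virtual appliances and OSS (HAProxy, etc.)
  • Cannot do N+1. On network failure a spare IP (I/F) is used, which imposes the L2 domain limit
  • Cannot scale a single VIP
• Embrane: runs on the host side; Egi et al./RouteBricks: high-performance routers built on commodity HW
• ETTM: every end host processes packets. Ananta uses dedicated servers only for the LB.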

Slide 18

Slide 18 text

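8. CONCLUSION
• Ananta
  • Distributed L4 LB/NAT
  • Designed to meet (Azure's) requirements for multi-tenancy, high reliability, and operations
• Should be useful outside Azure as well
  • Large-scale use needs a design that is cost-effective and can scale out
  • ECMP, BGP, DSR, Fastpath, host-side NAT, rate limiting
• 100 LB machines, providing VIP service to more than 100,000 users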

Slide 19

Slide 19 text

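3-line summary
• Same design philosophy as Maglev (Google) and VPPLB (YNWLB2)
  • BGP/ECMP/Consistent-hash/L4LB/DSR
• Techniques for scaling
  • Offload processing the LB needs (NAT, health checks, etc.) to the HV side
  • 1% rule (the number of LB nodes is at most 1% of the servers in the cluster)
• LB configuration changes are blazing fast (75 ms)
  • Fastpath (switching mid-connection to communication that bypasses the LB) improves efficiency further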

Slide 20

Slide 20 text

EoP