#24 “Ananta: Cloud Scale Load Balancing”
cafenero_777
June 19, 2023
Technology
ACM SIGCOMM ’13
https://dl.acm.org/doi/10.1145/2534169.2486026
Transcript
Research Paper Introduction #24 “Ananta: Cloud Scale Load Balancing” (cumulative #75)
@cafenero_777 2021/06/24
Agenda
• Target paper
• Overview and why I read it
  1. INTRODUCTION
  2. BACKGROUND
  3. DESIGN
  4. IMPLEMENTATION
  5. MEASUREMENTS
  6. OPERATIONAL EXPERIENCE
  7. RELATED WORK
  8. CONCLUSION
Target paper
• Ananta: Cloud Scale Load Balancing
• Parveen Patel, Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A. Maltz, Randy Kern, Hemant Kumar, Marios Zikos, Hongyu Wu, Changhoon Kim, Naveen Karri
• Microsoft
• ACM SIGCOMM ’13
• https://dl.acm.org/doi/10.1145/2534169.2486026
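Overview and why I read it
• Overview
  • Ananta: Scalable L4LB (DSR/NAT)
  • 100Gbps for a single VIP, more than 1Tbps of bandwidth in total
  • Runs on Azure
• Why I read it, and impressions
  • Cited in Azure's VFP paper
  • Cited fairly often by other LB papers (Maglev and others)
Figure: citation graph from https://www.connectedpapers.com/main/5c295df1a7f302c97f6f379eab6abba592811d42/Ananta-cloud-scale-load-balancing/graph (nodes: Ananta, Maglev, SilkRoad, Beamer, Faild; clusters: middlebox-type, distributed/high-efficiency type)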
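1. Introduction
• Spread of cloud computing
  • High availability (SLA), multi-tenancy, large-scale traffic
  • 100Gbps per VIP, 1000 hosts/VIP, 6~60 operations/1min
  • 36% of all SLA violations/failures are LB-related
• Ananta (Sanskrit for "infinite")
  • Scalable L4 LB (NAT/DSR)
  • D-plane: ECMP (in NW), LB, NAT (in VFP/HV)
  • C-plane: SDN/Paxos, S-NAT coordination
  • Deployed on Azure in 2011/09: 100 instances, 1Tbps, 100k VIPs
• The paper covers L4LB in the cloud, distributed network systems and scaling, and presents measurement results and operational results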
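2. BACKGROUND
• Data Center Clos NW: 10G servers * 40k, oversub 1:4, 400Gbps@Border
• Nature of VIP traffic
  • 44% is VIP traffic (intra-DC : outside DC = 2:1)
  • Inter-DC in/out is 1:1, mostly data synchronization
• Summary of requirements
  1. “Scale, Scale and Scale”:
    • Cost (1% of server cost; 400 units is too many)
    • Up to 100Gbps & 1M concurrent connections per VIP, 100 configuration changes/[unit illegible]
  2. Reliability: automatic recovery in an N+1 configuration, support for maintenance
  3. “Any Service Anywhere”: must not be bound by L2 domain restrictions
  4. Tenant isolation: countermeasures against DoS impact caused by sharing the LB (other customers losing their bandwidth)
Figure labels: 400Gbps, 100Tbps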
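The figures on this slide can be cross-checked with simple arithmetic. The sketch below is only a reading of the slide, assuming that the "100Tbps" figure label is the aggregate server bandwidth after 1:4 oversubscription and that "400" is 1% of the 40k servers; the function names are hypothetical.

```python
# Back-of-the-envelope check of the slide's numbers (interpretation, not from the paper text).
SERVERS = 40_000          # number of servers in the data center
NIC_GBPS = 10             # per-server NIC speed
OVERSUBSCRIPTION = 4      # 1:4 oversubscription
LB_COST_FRACTION = 0.01   # LB budget: 1% of server cost


def edge_capacity_tbps(servers: int, nic_gbps: int) -> float:
    """Aggregate NIC bandwidth of all servers, in Tbps."""
    return servers * nic_gbps / 1000


def core_capacity_tbps(servers: int, nic_gbps: int, oversub: int) -> float:
    """Bandwidth available above the oversubscribed layer, in Tbps."""
    return edge_capacity_tbps(servers, nic_gbps) / oversub


def lb_server_budget(servers: int, fraction: float) -> int:
    """Number of servers that the given cost fraction corresponds to."""
    return round(servers * fraction)


if __name__ == "__main__":
    print(core_capacity_tbps(SERVERS, NIC_GBPS, OVERSUBSCRIPTION))
    print(lb_server_budget(SERVERS, LB_COST_FRACTION))
```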
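3. DESIGN (1/5) Principles & Architecture
• Make it possible to scale out
  • Like a router, avoid having a per-flow state maintenance mechanism
  • Do not use anything that requires full synchronization
    • i.e. WRR (Weighted Round Robin) vs. Weighted Random
• Do the heavy processing on the HV (hypervisor) side (offload it)
  • ACL, Rate Limit, Metering
• Ananta Manager (AM), Multiplexer (Mux), Host Agent (HA)
  • Inbound: IP-in-IP, NAT and DSR
  • Outbound: for DIP->VIP, the VIP:sport mapping is synchronized between the Mux and the HA
Figure labels (inbound): VIP advertisement / ECMP selection / IP-in-IP; L3 routing; decap/DNAT; reverse NAT; DSR (no encap)
Figure labels (outbound): request sport and VIP; set sport and VIP; distribute to the VM by dport and VIP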
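The inbound path on this slide (Mux encapsulates with IP-in-IP, the Host Agent decapsulates and applies DNAT, and the reply returns via DSR without passing the Mux) can be sketched as below. This is a simplified model under stated assumptions, not Ananta's implementation: the `Packet` type, the function names, and the addresses are all hypothetical, and port translation is omitted.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    sport: int
    dport: int
    outer_dst: Optional[str] = None  # set only while IP-in-IP encapsulated


def mux_encap(pkt: Packet, dip: str) -> Packet:
    """Mux: keep the inner packet untouched and add an outer header to the DIP."""
    return replace(pkt, outer_dst=dip)


def ha_decap_dnat(pkt: Packet) -> Packet:
    """Host Agent: strip the outer header and rewrite the destination VIP to the DIP."""
    return replace(pkt, dst=pkt.outer_dst, outer_dst=None)


def ha_reverse_nat(reply: Packet, vip: str) -> Packet:
    """Host Agent (DSR): rewrite the reply source DIP to the VIP and send it directly."""
    return replace(reply, src=vip)


if __name__ == "__main__":
    vip, dip, client = "198.51.100.10", "10.0.0.5", "203.0.113.7"
    inbound = Packet(src=client, dst=vip, sport=40000, dport=80)
    at_vm = ha_decap_dnat(mux_encap(inbound, dip))
    reply = ha_reverse_nat(Packet(src=dip, dst=client, sport=80, dport=40000), vip)
    print(at_vm, reply)
```

The point of the split is that the Mux only forwards, while all address rewriting happens on the host.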
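3. DESIGN (2/5) Principles & Architecture
• Fastpath: for VIP-to-VIP communication, bypass the LB and let the HVs communicate directly
  • At first, traffic goes through the LB
  • Once the 3WHS (three-way handshake) completes, the DIP mapping information is sent as a redirect
  • The HAs then carry the communication
  • After that, traffic no longer goes through the LB
  • Countermeasures against hijacking are needed
Figure labels: "this flow is mapped to DIP2"; bind to DIP1; DIP1/DIP2 communicate directly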
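3. DESIGN (3/5) Mux/Host Agent
• Mux Pool (a set of Muxes)
  • Mux: BGP speaker: advertises VIPs. On failure the route is withdrawn. TCP MD5 authentication
  • The AM deploys the VIP/DIP mapping to the Muxes. Selection is by 5-tuple; the hash function and seed are common to all Muxes (guarantees the same handling no matter which Mux ECMP delivers the packet to)
  • Once the mapping has been looked up, the flow is retained. However, memory quotas are separate (SYN-flood countermeasure)
    • Trusted flows: multiple packets -> longer timeout
    • Untrusted flows: a single packet -> shorter timeout
  • When a Mux goes down, ECMP reshuffles the flows
    • If the mapping changes while it is down, flows cannot be maintained -> use a DHT (distributed hash table)
• Host Agent: present on every HV; performs Fastpath, NAT and health checks (see the DESIGN (2/5) slide)
  • Port reuse feature
  • Health checks are done on the HA side, not on the Mux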
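The "same hash function and seed on every Mux" idea can be sketched as follows. This is a minimal illustration, assuming SHA-256 over the 5-tuple and a simple modulo over the DIP list; the slide does not say which hash Ananta uses, and `FiveTuple`, `select_dip`, and the seed value are hypothetical names.

```python
import hashlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int


def select_dip(flow: FiveTuple, dips: Sequence[str], seed: bytes) -> str:
    """Pick a DIP for a flow; deterministic for a given (flow, dips, seed)."""
    key = seed + "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(dips)
    return dips[index]


if __name__ == "__main__":
    flow = FiveTuple("203.0.113.7", 40000, "198.51.100.10", 80, 6)
    dips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    # Two independent "Muxes" sharing the seed agree on the DIP.
    print(select_dip(flow, dips, b"shared-seed") == select_dip(flow, dips, b"shared-seed"))
```

Because the choice depends on the DIP list, a mapping change can move existing flows, which is why the slide notes that a Mux retains the flow once the mapping has been looked up.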
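3. DESIGN (4/5) Ananta Manager/Tenant Isolation
• Ananta Manager (AM)
  • Paxos-based distributed controller
  • Runs with 5 replicas; operates normally with 3 or more replicas
  • S-NAT: port allocation is handled in bulk
• Tenant isolation
  • It is enough for each Mux to implement tenant isolation independently
  • AM: requests are served FCFS (first-come-first-serve). In addition, new requests similar to pending ones are dropped. (2)
  • Mux: when a VIP exceeds its fair bandwidth, drop and rate-limit with a probability proportional to the excess bandwidth
  • The top-talker VIP (the one sending the most traffic) is moved off the Mux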
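Two of the rules on this slide lend themselves to a small sketch: the replica majority (3 of 5) and dropping with a probability proportional to the excess bandwidth. The formula below, excess divided by observed rate, is one plausible reading and an assumption of this sketch; the slide does not give the exact formula, and all names are hypothetical.

```python
import random


def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def drop_probability(observed_bps: float, fair_share_bps: float) -> float:
    """Fraction of traffic above the fair share (assumed formula)."""
    if observed_bps <= 0 or observed_bps <= fair_share_bps:
        return 0.0
    return (observed_bps - fair_share_bps) / observed_bps


def should_drop(observed_bps: float, fair_share_bps: float, rng: random.Random) -> bool:
    """Randomly decide whether to drop one packet of an over-limit VIP."""
    return rng.random() < drop_probability(observed_bps, fair_share_bps)


if __name__ == "__main__":
    print(majority(5))
    print(drop_probability(200e6, 100e6))
```

With this formula, the traffic that survives the drops averages out to the fair share, so a VIP at or under its share is never penalized.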
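3. DESIGN (5/5) Alternatives
• DNS-based LB
  • Hard to predict the load distribution in advance (requests from clients become skewed) ???
  • DNS caches take time to expire
  • Stateful features (such as NAT) are not possible
• OpenFlow-based LB
  • Commercial OpenFlow devices support only up to 2-4k flows (a Mux wants to hold on the order of millions of flow states)
  • Tenant isolation features
  • Cannot advertise via BGP (have the AM do it?)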
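4. IMPLEMENTATION
• AM: responsiveness is important
  • Lock-free design in the style of SEDA (Staged Event-Driven Architecture)
    • Shared thread pool (total count is capped)
    • Priorities (e.g. VIP creation is prioritized)
  • Implemented with Paxos SDK + Discovery + Health Monitoring
    • Guarantees that the primary does the processing
    • Guarantees that no more than one AM instance goes down during an upgrade
• Mux: packet processing in the kernel (driver) + BGP processing in user mode
  • Uses kernel features as they are: IPIP/RSS/IPv6 etc.
  • 20k DIPs and 1.6M SNAT port mappings per VIP. Holds state for on the order of millions of simultaneous connections
Figure labels: *5 *8 *all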
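5. MEASUREMENTS Micro-benchmark
• CPU load with and without Fastpath: 10 VMs * 2 tenants, 1MB transferred per connection. Host load increases slightly, but Mux load drops substantially.
• SYN-flood Attack Mitigation: 10 VMs * 5 tenants (base traffic + SYN-flood * 10 runs). Under medium to high load, a DoS becomes harder to tell apart.
• Latency up to S-NAT: almost everything falls within 75ms. When a port cannot be secured, the additional time shows a long tail.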
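5. MEASUREMENTS Real World Data (1/2)
• 1Tbps in total, 3 years in operation, inter/intranet, uses: blob, table/queue, storage
• S-NAT request time: <- 50ms, <- 200ms, <- max 2s. In percentile terms there is almost no problematic scenario.
• Availability: req/5min @ test tenant, average availability 99.95%. Annotations: Mux high load (SYN-flood), NW, false detection.
• Config completion time: <- 75ms@50%ile, <- max 2s. Depends on the scale of the tenant and the Muxes. Within the SLA.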
5. MEASUREMENTS Real World Data (2/2)
• Bandwidth and load of the 14 Muxes (25%)
• 800Mbps (220Kpps) / core
• Processing is indeed being spread by ECMP
• 33.6Gbps in total: 2.4Gbps * 14
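The numbers on this slide are consistent with each other, which a few lines of arithmetic confirm. The average packet size is not stated on the slide; it is derived here from the 800Mbps and 220Kpps figures under the assumption that both describe the same traffic.

```python
MUX_COUNT = 14
PER_MUX_MBPS = 2_400  # 2.4Gbps per Mux, kept in Mbps to avoid float rounding


def total_gbps(mux_count: int, per_mux_mbps: int) -> float:
    """Aggregate bandwidth across all Muxes, in Gbps."""
    return mux_count * per_mux_mbps / 1000


def avg_packet_bytes(bits_per_sec: float, packets_per_sec: float) -> float:
    """Average packet size implied by a bit rate and a packet rate."""
    return bits_per_sec / 8 / packets_per_sec


if __name__ == "__main__":
    print(total_gbps(MUX_COUNT, PER_MUX_MBPS))
    print(round(avg_packet_bytes(800e6, 220e3), 1))
```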
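6. OPERATIONAL EXPERIENCE
• Operated in the cloud for 3 years
• Reasons for "giving up on" HW LBs: cannot handle DoS attacks, no elasticity, cannot keep up with bandwidth growth and price pressure
• Problems that actually occurred with the SW LB, and open issues
  • AM dual primary: the old primary sends requests to the Mux, and the Mux side rejects them.
    • Resolved by running a transaction whenever the Mux side rejects a request
  • IP-in-IP changes the effective MTU. The HA is supposed to adjust the MSS, but for some reason this did not take effect and packets exceeding the MTU were dropped
    • A bug in certain home routers that fixes the MSS to a set value
    • A TCP bug in a certain mobile OS that keeps using full-size segments on TCP reconnection
    • The MTU of the whole network was raised
  • BGP and the LB share the same box, so when bandwidth overflows both go down together. Moreover, when one Mux goes down traffic shifts to the others, so cascading failure is possible
    • Separate the I/Fs for BGP and LB, or throttle the traffic rate on the router side. Keeping BGP and LB together makes the design simpler
  • Idle connection timeout of the HW LB (60 seconds)
    • The SW LB inherited this as a countermeasure against state overflow (DoS countermeasure). Mobile clients often keep connections open -> since the VIP mapping exists anyway, no state needs to be created -> the timeout could safely be lengthened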
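The MTU problem above comes down to header arithmetic: IP-in-IP adds an outer IPv4 header, so a full-size TCP segment no longer fits unless the MSS is lowered or the network MTU is raised. The sketch below assumes 20-byte IPv4 and TCP headers with no options; the function names are hypothetical.

```python
IPV4_HEADER = 20  # bytes, without options (assumption)
TCP_HEADER = 20   # bytes, without options (assumption)


def clamped_mss(path_mtu: int, encap_overhead: int = IPV4_HEADER) -> int:
    """Largest TCP payload that still fits after IP-in-IP encapsulation."""
    return path_mtu - encap_overhead - IPV4_HEADER - TCP_HEADER


def fits(payload: int, path_mtu: int, encap_overhead: int = IPV4_HEADER) -> bool:
    """True if a segment with this payload fits in the MTU once encapsulated."""
    return payload + TCP_HEADER + IPV4_HEADER + encap_overhead <= path_mtu


if __name__ == "__main__":
    print(clamped_mss(1500))   # MSS the Host Agent would need to advertise
    print(fits(1460, 1500))    # a client that ignores the clamp
    print(fits(1460, 1520))    # same segment after raising the network MTU
```

Raising the network MTU by the encapsulation overhead makes the full-size segment fit even when a client or home router ignores the adjusted MSS, which matches the fix described on the slide.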
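7. RELATED WORK
• HW LBs are the scale-up type (1+1 type)
  • Cannot scale up or down according to demand
  • In cloud environments, availability requirements call for N+1 redundancy
• Virtual appliances and OSS (HAProxy etc.)
  • N+1 is not possible. On a network failure a spare IP (I/F) is used, which imposes an L2 domain restriction
  • A single VIP cannot be scaled
• Embrace: runs on the host side. Egi et al./RouteBricks: build high-performance routers on commodity HW
• ETTM: every end host does packet processing. Ananta uses dedicated servers only for the LB.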
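8. CONCLUSION
• Ananta
  • Distributed L4LB/NAT
  • Designed to meet (Azure's) requirements for multi-tenancy, high reliability and operations
• Should be useful outside Azure as well
  • Large-scale use needs a design that is cost-effective and can scale out
  • ECMP, BGP, DSR, Fastpath, host-side NAT, rate limit
• 100 LB instances, providing VIP service to more than 100,000 users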
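3-line summary
• The design philosophy is the same as Maglev (Google) and VPPLB (YNWLB2)
  • BGP/ECMP/Consistent-hash/L4LB/DSR
• Techniques for scaling
  • Offload the processing the LB needs (NAT, health checks, etc.) to the HV side
  • 1% rule (LB nodes may be at most 1% of the servers in the cluster)
  • LB configuration changes are extremely fast (75ms)
• Fastpath (switching mid-connection to communication that does not go through the LB) improves efficiency further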
EoP