#24 “Ananta: Cloud Scale Load Balancing”
cafenero_777
June 19, 2023
ACM SIGCOMM ’13
https://dl.acm.org/doi/10.1145/2534169.2486026
Transcript
Research Paper Introduction #24 “Ananta: Cloud Scale Load Balancing” (#75 overall)
@cafenero_777 2021/06/24 1
Agenda
• Target paper
• Overview and why I read it
1. INTRODUCTION
2. BACKGROUND
3. DESIGN
4. IMPLEMENTATION
5. MEASUREMENTS
6. OPERATIONAL EXPERIENCE
7. RELATED WORK
8. CONCLUSION
2
Target paper
• Ananta: Cloud Scale Load Balancing
• Parveen Patel, Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A. Maltz, Randy Kern, Hemant Kumar, Marios Zikos, Hongyu Wu, Changhoon Kim, Naveen Karri
• Microsoft
• ACM SIGCOMM ’13
• https://dl.acm.org/doi/10.1145/2534169.2486026
3
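Overview and why I read it
• Overview
  • Ananta: Scalable L4LB (DSR/NAT)
  • 100 Gbps for a single VIP, more than 1 Tbps of bandwidth in total
  • Runs on Azure
• Why I read it, and impressions
  • Cited in Azure's VFP paper
  • Cited fairly often by other LB papers (Maglev and others)
4
[Figure: Connected Papers graph for Ananta, https://www.connectedpapers.com/main/5c295df1a7f302c97f6f379eab6abba592811d42/Ananta-cloud-scale-load-balancing/graph. Labeled nodes: Ananta, Maglev, SilkRoad, Beamer, Faild. Cluster labels: middlebox-related; distributed / high-efficiency-related.]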
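1. Introduction
• Cloud computing has become widespread
  • High availability (SLA), multi-tenancy, large-scale traffic
  • 100 Gbps per VIP, 1000 hosts per VIP, 6 to 60 operations per minute
  • 36% of all SLA violations/outages were LB-related
• Ananta (Sanskrit for "infinite")
  • Scalable L4 LB (NAT/DSR)
  • D-plane: ECMP (in the network), LB, NAT (in VFP/HV)
  • C-plane: SDN/Paxos, coordinated with S-NAT
  • Deployed on Azure since 2011/09: 100 instances, 1 Tbps, 100k VIPs
• The paper covers L4 LB in the cloud, distributed network systems and scaling, and presents measurement results and operational experience
5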
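2. BACKGROUND
• Data center Clos network: 40k servers with 10G NICs, 1:4 oversubscription, 400 Gbps at the border
• Nature of VIP traffic
  • 44% of traffic is VIP traffic (intra-DC : outside-DC = 2:1)
  • Inter-DC in/out ratio is 1:1, mostly data synchronization
• Summary of requirements
1. “Scale, Scale and Scale”:
  • Cost (1% of server cost; 400 units is too many)
  • Up to 100 Gbps and 1M concurrent connections per VIP, 100 configuration changes per minute
2. Reliability: automatic recovery with N+1 redundancy, support for maintenance
3. “Any Service Anywhere”: must not be bound by L2 domain limits
4. Tenant isolation: countermeasures against DoS impact from sharing the LB (other customers losing their bandwidth)
6
[Figure labels: 400Gbps, 100Tbps]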
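The scale numbers on this slide can be cross-checked with a little arithmetic. The sketch below is an illustration only: it assumes the 100 Tbps label is the aggregate server NIC bandwidth divided by the 1:4 oversubscription factor, and that the 1% rule is counted in servers. The variable names are made up for this example.

```python
# Back-of-the-envelope check of the BACKGROUND slide (integer Gbps throughout).
SERVERS = 40_000
NIC_GBPS = 10
OVERSUBSCRIPTION = 4      # 1:4, as on the slide
VIP_PERCENT = 44          # share of traffic that is VIP traffic

aggregate_nic_gbps = SERVERS * NIC_GBPS                  # all server NICs combined
fabric_gbps = aggregate_nic_gbps // OVERSUBSCRIPTION     # assumed meaning of "100Tbps"
vip_gbps = fabric_gbps * VIP_PERCENT // 100              # VIP traffic the LB must carry
one_percent_of_servers = SERVERS // 100                  # the "400" on the slide (assumed)

print(aggregate_nic_gbps, fabric_gbps, vip_gbps, one_percent_of_servers)
```

Under these assumptions the load balancer tier has to carry tens of Tbps of VIP traffic while staying within a budget of roughly 1% of the servers, which is why the design leans on scale-out and host offload.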
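3. DESIGN (1/5) Principles & Architecture 7
• Make it possible to scale out
  • Like a router, avoid having a per-flow state mechanism
  • Avoid anything that needs full synchronization
    • e.g. WRR (Weighted Round Robin) versus Weighted Random
• Do the heavy processing on the HV (hypervisor) side, i.e. offload it
  • ACL, Rate Limit, Metering
• Ananta Manager (AM), Multiplexer (Mux), Host Agent (HA)
  • Inbound: IP-in-IP, NAT and DSR
  • Outbound: DIP -> VIP. The VIP:sport mapping is kept in sync between Mux and HA
[Figure annotations: VIP advertisement / ECMP; selection / IP-in-IP; L3 routing; decap / DNAT; reverse NAT; DSR (no encap); request sport and VIP; set sport and VIP; dispatch to the VM by dport and VIP]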
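The point about avoiding full synchronization can be illustrated with weighted random selection: each Mux can choose a DIP independently, with no shared round-robin counter. This is a minimal sketch, not Ananta's implementation; `pick_dip` and the sample DIPs are hypothetical names for illustration.

```python
import random

def pick_dip(dips, weights, rng=random):
    """Pick one DIP with probability proportional to its weight.

    No state is shared between calls or between Mux instances, unlike
    weighted round robin, which needs a counter that all instances agree on.
    """
    return rng.choices(dips, weights=weights, k=1)[0]

dips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
weights = [5, 3, 2]
rng = random.Random(42)            # seeded only to make the demo repeatable
sample = [pick_dip(dips, weights, rng) for _ in range(10_000)]
share = {d: sample.count(d) / len(sample) for d in dips}
print(share)                       # shares should be close to 0.5 / 0.3 / 0.2
```

Because each decision is independent, adding or removing a Mux does not require any coordination of selection state.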
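3. DESIGN (2/5) Principles & Architecture 8
• Fastpath: for VIP-to-VIP traffic, bypass the LB and let the HVs talk to each other
  • At first, traffic goes through the LB
  • Once the 3-way handshake completes, the DIP mapping information is sent as a redirect
  • The HAs then let the hosts communicate directly
  • From then on, traffic no longer passes through the LB
  • Countermeasures against hijacking are needed
[Figure annotations: "this connection is mapped to DIP2"; DIP1 is associated as well; DIP1 and DIP2 communicate directly]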
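The Fastpath steps above can be sketched as a small lookup table kept by the host agent. This is a toy model under stated assumptions: the class, method names and addresses are hypothetical, and the hijack countermeasure is reduced to "accept redirects only from known Muxes", which is one possible form rather than the mechanism described in the paper.

```python
class HostAgent:
    """Toy host agent deciding the next hop for an outbound VIP-to-VIP flow."""

    def __init__(self, trusted_muxes):
        self.trusted_muxes = set(trusted_muxes)
        self.direct = {}                      # flow 5-tuple -> remote DIP

    def handle_redirect(self, sender, flow, remote_dip):
        # Assumed countermeasure: ignore redirects that do not come from a known Mux.
        if sender not in self.trusted_muxes:
            return False
        self.direct[flow] = remote_dip
        return True

    def next_hop(self, flow, vip):
        # Before a redirect the packet is sent to the VIP (through a Mux);
        # after a redirect it goes straight to the remote DIP.
        return self.direct.get(flow, vip)


ha = HostAgent(trusted_muxes={"mux-1"})
flow = ("10.0.0.5", 40000, "vip-2", 443, "tcp")
before = ha.next_hop(flow, "vip-2")                      # still via the LB
spoofed = ha.handle_redirect("unknown-host", flow, "10.9.9.9")
accepted = ha.handle_redirect("mux-1", flow, "10.0.1.7")
after = ha.next_hop(flow, "vip-2")                       # now direct to the DIP
print(before, spoofed, accepted, after)
```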
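3. DESIGN (3/5) Mux/Host Agent 9
• Mux Pool (a set of Muxes)
• Mux: BGP speaker that advertises VIPs. On failure its routes are withdrawn. TCP MD5 authentication
  • AM deploys the VIP/DIP mapping to the Muxes. Selection is by 5-tuple, and the hash function and seed are the same on every Mux (so the same decision is guaranteed whichever Mux ECMP delivers the packet to)
  • Once the mapping has been looked up, the flow is remembered. Memory quotas are kept separate, as a SYN-flood countermeasure
    • Trusted flows: more than one packet seen -> longer timeout
    • Untrusted flows: a single packet seen -> shorter timeout
  • When a Mux goes down, ECMP reshuffles the flows
    • If the mapping changes while a Mux is down, flows cannot be preserved -> use a DHT (Distributed hash table)
• Host Agent: present on every HV; performs Fastpath, NAT and health checks (see the explanation on p.8)
  • Port reuse
  • Health checks are done on the HA side, not by the Mux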
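The shared hash function and seed can be sketched as follows: any Mux that receives a packet computes the same DIP from the 5-tuple, so ECMP may deliver the packet to any of them. This is a minimal sketch; BLAKE2 and plain modulo are stand-ins chosen for this example, not the hash used by Ananta, and the addresses and seed are hypothetical.

```python
import hashlib

def select_dip(five_tuple, dips, seed):
    """Map a 5-tuple to a DIP using a keyed hash shared by all Muxes."""
    key = "|".join(map(str, five_tuple)).encode()
    digest = hashlib.blake2b(key, key=seed, digest_size=8).digest()
    return dips[int.from_bytes(digest, "big") % len(dips)]

SHARED_SEED = b"same-seed-on-every-mux"     # hypothetical value
dips = ["10.1.0.1", "10.1.0.2", "10.1.0.3", "10.1.0.4"]
flow = ("203.0.113.7", 51515, "198.51.100.10", 80, "tcp")

# Two independent Muxes with the same seed and DIP list agree on the target.
choice_on_mux_a = select_dip(flow, dips, SHARED_SEED)
choice_on_mux_b = select_dip(flow, dips, SHARED_SEED)
print(choice_on_mux_a, choice_on_mux_b)
```

Plain modulo remaps most flows when the DIP list changes, which is one reason a Mux also remembers flows after the first lookup.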
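3. DESIGN (4/5) Ananta Manager/Tenant Isolation 10
• Ananta Manager (AM)
  • Paxos-based distributed controller
  • Runs with 5 replicas and operates normally with 3 or more
  • S-NAT: port allocation is processed in bulk
• Tenant isolation
  • It is enough for each Mux to implement tenant isolation independently
  • AM: requests are served FCFS (first-come-first-serve), and new requests similar to an existing one are dropped. (2)
  • Mux: when a VIP exceeds its fair bandwidth, packets are dropped with probability proportional to the excess bandwidth (drop and rate limit)
  • The top-talker VIP (the one sending the most traffic) is moved off the Mux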
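The Mux rule "drop with probability proportional to the excess bandwidth" can be written down in a few lines. This is a sketch under an assumption: the excess is normalized by the observed rate so that the admitted rate converges to the fair share. The slide does not give the exact formula, and the function and parameter names are hypothetical.

```python
def drop_probability(observed_bps, fair_share_bps):
    """Probability of dropping a packet of a VIP that exceeds its fair share.

    Assumed normalization: excess / observed, so a VIP sending at twice its
    share loses half of its packets and ends up at its share.
    """
    if observed_bps <= fair_share_bps:
        return 0.0
    return (observed_bps - fair_share_bps) / observed_bps

for rate in (50, 100, 200, 400):
    p = drop_probability(rate, fair_share_bps=100)
    print(rate, p, rate * (1 - p))      # admitted rate never exceeds 100
```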
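3. DESIGN (5/5) Alternatives 11
• DNS-based LB
  • Hard to predict the load distribution in advance (requests from clients are skewed) (?)
  • It takes time for DNS caches to expire
  • Cannot do stateful processing (NAT and so on)
• OpenFlow-based LB
  • Commercial OpenFlow devices hold only 2-4k flows (a Mux wants to keep state for millions of flows)
  • Limited tenant-isolation features
  • Cannot advertise routes over BGP (leave it to the AM?)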
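4. IMPLEMENTATION
• AM: responsiveness matters
  • Lock-free design in the style of SEDA (Staged event-driven Arch.)
    • Thread pool is shared (total number capped)
    • Priorities (e.g. VIP creation is prioritized)
  • Built from a Paxos SDK + Discovery + Health Monitoring
    • Guarantees that the primary does the processing
    • Guarantees that no more than one AM instance goes down during an upgrade
• Mux: packet processing in the kernel (driver) + BGP processing in user mode
  • Uses kernel features as they are: IPIP/RSS/IPv6 etc
  • 20k DIPs and 1.6M SNAT port mappings per VIP. Holds state for millions of concurrent connections
12
[Figure labels: *5, *8, *all]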
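5. MEASUREMENTS Micro-benchmark 13
• CPU load with and without Fastpath: 10 VMs * 2 tenants, 1 MB transferred per connection. Host load rises slightly, but Mux load drops substantially
• SYN-flood attack mitigation: 10 VMs * 5 tenants (base traffic + SYN flood, 10 runs). At medium to high load it becomes harder to tell DoS traffic apart
• Latency until S-NAT: almost all requests finish within 75 ms. When a port cannot be secured, the extra time shows up as a long tail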
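5. MEASUREMENTS Real World Data (1/2)
• 1 Tbps in total, 3 years of operation, inter/intranet traffic, uses: blob, table/queue, storage
14
• S-NAT request time: annotated at 50 ms, 200 ms and max 2 s. In percentile terms there are almost no problematic scenarios
• Availability: req/5min against a test tenant. Average availability 99.95%. Dips caused by Mux overload (SYN flood), network failure and false positives
• Config completion time: 75 ms at the 50th percentile, max 2 s. Depends on the size of the tenant and the Mux pool. Within the SLA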
5. MEASUREMENTS Real World Data (2/2) 15
• 800 Mbps (220 Kpps) per core
• Load is indeed being spread by ECMP
• 33.6 Gbps in total: 2.4 Gbps * 14
[Figure: bandwidth and load (25%) across the 14 Muxes]
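The figures on this slide are consistent with each other, and they also imply an average packet size. The arithmetic below uses only the slide's numbers; the derived packet size is an inference for this example, not something stated in the source.

```python
# Integer arithmetic in Mbps / bps to avoid floating-point rounding.
MUX_COUNT = 14
PER_MUX_MBPS = 2_400                 # 2.4 Gbps per Mux
total_mbps = MUX_COUNT * PER_MUX_MBPS

PER_CORE_BPS = 800_000_000           # 800 Mbps per core
PER_CORE_PPS = 220_000               # 220 Kpps per core
avg_packet_bytes = PER_CORE_BPS // (8 * PER_CORE_PPS)

print(total_mbps)                    # total across the pool, in Mbps
print(avg_packet_bytes)              # implied average packet size, in bytes
```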
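6. OPERATIONAL EXPERIENCE
• Operated in the cloud for 3 years
• Why they "gave up on" HW LBs: no way to handle DoS attacks, no elasticity, and a poor fit for growing bandwidth and price pressure
• Problems and challenges actually encountered with the SW LB
  • AM dual primary: a stale primary sends requests to a Mux, and the Mux rejects them
    • Resolved by running a transaction whenever the Mux side rejects a request
  • The MTU changes because of IP-in-IP. The HA is supposed to adjust the MSS, but for some reason the adjustment was lost and packets over the MTU were dropped
    • A bug in a certain home router that fixes the MSS
    • A TCP bug in a certain mobile OS that keeps using full-size segments on TCP reconnection
    • The MTU of the whole network was raised
  • BGP and the LB share the same box, so when bandwidth overflows both go down together. Worse, when one box fails its traffic shifts to the others, so cascading failure is possible
    • Separate the interfaces for BGP and LB, and throttle the traffic rate on the router side. Keeping BGP and LB together makes for a simpler design
  • Idle connection timeout of HW LBs (60 seconds)
    • The SW LB inherited this as protection against state exhaustion (a DoS countermeasure). Mobile clients often keep connections open -> since the VIP mapping exists anyway, no state needs to be created -> the idle timeout could be made longer
16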
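The MTU problem comes down to header arithmetic: IP-in-IP adds an outer IPv4 header, so the largest TCP payload that fits in one packet shrinks by the same amount. This is a worked example assuming IPv4 without options and a 1500-byte path MTU; `clamped_mss` is a hypothetical helper, not part of Ananta.

```python
IPV4_HEADER = 20   # bytes, no options
TCP_HEADER = 20    # bytes, no options

def clamped_mss(path_mtu, encap_overhead=IPV4_HEADER):
    """Largest TCP segment that still fits after IP-in-IP encapsulation."""
    return path_mtu - encap_overhead - IPV4_HEADER - TCP_HEADER

plain_mss = clamped_mss(1500, encap_overhead=0)   # no encapsulation
ipip_mss = clamped_mss(1500)                      # one extra IPv4 header
print(plain_mss, ipip_mss)
```

A client that ignores the adjusted MSS and sends segments sized for the unencapsulated path produces packets that exceed the MTU once the outer header is added, which matches the drops described on the slide.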
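7. RELATED WORK
• HW LBs are scale-up (1+1 redundancy)
  • Cannot scale up or down with demand
  • Cloud environments need N+1 redundancy to meet availability requirements
• Virtual appliances and OSS (HAProxy and others)
  • Cannot do N+1. On network failure they take over a spare IP (I/F), so they are bound by L2 domain limits
  • Cannot scale a single VIP
• Embrane: runs on the host side. Egi et al. / RouteBricks: high-performance routers built on commodity HW
• ETTM: every end host processes packets. In Ananta only the LB runs on dedicated servers
17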
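8. CONCLUSION
• Ananta
  • Distributed L4LB/NAT
  • Designed to meet (Azure's) requirements for multi-tenancy, high reliability and operations
• Should be useful outside Azure as well
  • Large-scale use needs a design that is cost-effective and can scale out
  • ECMP, BGP, DSR, Fastpath, host-side NAT, rate limit
• 100 LB instances, providing VIP service to more than 100,000 users
18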
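Three-line summary
• The design philosophy is the same as Maglev (google) and VPPLB (YNWLB2)
  • BGP/ECMP/Consistent-hash/L4LB/DSR
• For scaling:
  • Offload the processing the LB needs (NAT, health checks and so on) to the HV side
  • 1% rule (LB nodes up to 1% of the servers in the cluster)
  • LB configuration changes are very fast (75 ms)
• Fastpath (switching mid-connection to communication that bypasses the LB) makes it even more efficient
19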
EoP 20