Research Paper Introduction #6
cafenero_777
November 25, 2019
Technology
“Balancing on the Edge: Transport Affinity without Network State”
Transcript
Research Paper Introduction #6
“Balancing on the Edge: Transport Affinity without Network State”
@cafenero_777 2019/11/25
$ which
• Balancing on the Edge: Transport Affinity without Network State
• João Taveira Araújo, Lorenzo Saino, Lennert Buytenhek, and Raul Landa
• Fastly
• Networked Systems Design and Implementation (NSDI ’18)
• https://www.usenix.org/conference/nsdi18/presentation/araujo
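Agenda
• Overview and why I read this paper
• Abstract
• Introduction
• Background and motivation
• Design
• Implementation
• Evaluation
• Operational experience
• Related work
• Conclusion

Overview and why I read this paper
• Overview
  • Designed, developed, and implemented a load balancer (LB) for tightly constrained CDN POPs
  • A careful design achieves stateless operation with low latency
  • Four years of production use (as of 2018); shares lessons noticed in operation
• Why I read it
  • Design and implementation of a highly efficient LB for CDN POPs
  • Actually developed and operated at Fastly
  • Miyagawa-san talked about it on rebuild.fm and it caught my interest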
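Introduction
• POPs (Points of Presence)
  • Used for CDN/edge delivery
  • Video/image delivery, AAA, paid content delivery
  • Geographically distributed
  • Tbps and Mrps
• Different from an ordinary DC network
  • Efficiency: serve as many requests as possible within a physically “tight” footprint
  • Resilience: capacity is limited -> build it to withstand DDoS -> stateless
  • Gracefulness: each individual component matters; design so that scale-in/out has no impact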
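Background and motivation
• Requirements that resemble a DC network's but differ in substance
• High request processing density:
  • Put the LB function into the switches and hosts to maximize the cost-effectiveness of host power and space. Minimal number of switches, unlike a Clos topology
  • Conventional HW LBs have poor power and space efficiency; SW LBs have poor throughput and latency
  • 32 hosts @25G, 4 switches, full mesh. 1.28 Tbps = 40G * 32 hosts, 100 Maglev machines!
• Traffic surges:
  • DDoS mitigation (traffic suddenly jumps by several hundred times)
  • SilkRoad (10M connections, ASIC/SRAM); loads exceeding that are received routinely
  • Maglev/Duet: performance degrades as the number of connections grows
• Host churn:
  • With only tens of nodes, scale-in/out has a relatively large impact; draining cannot be ignored
  • Without a clean failover, traffic churns between POPs/providers
  • As a cloud service, software upgrades (and the failovers they trigger) are routine
  • Automatic upgrades stop when faults occur. Switches are also upgraded, coordinated via BGP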
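Design: Faild (1/3)
• Consistent hashing
  • The switch determines the next hop (ECMP VIP-set), the ARP lookup, and the output interface
  • To drain a switch, withdraw its BGP advertisement. All switches hold the same hash
  • The next hop decides the distribution -> the MAC address then spreads traffic further
  • Relies only on lowest-common-denominator switch features; vendor consistent hashing is not used
  • An agent handles all of the above
• This alone would still disrupt existing flows
  • Some extra work is needed on the host side as well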
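The two-stage lookup on this slide (ECMP next hop first, then an ARP-style mapping to a host) can be sketched in Python. This is a minimal illustration and not Faild's code: the bucket count, the CRC32 hash, the host names, and the `drain` helper are all assumptions made for the example. It shows why repointing only the entries of a drained host leaves every other flow-to-host mapping where it was.

```python
import zlib

NUM_BUCKETS = 64  # assumed number of ECMP next hops (hypothetical value)


def bucket_for(flow):
    """Stage 1: hash the 5-tuple onto a fixed set of ECMP next hops."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % NUM_BUCKETS


def build_arp_table(hosts):
    """Stage 2: map every next hop to a host (stand-in for the ARP table)."""
    return {b: hosts[b % len(hosts)] for b in range(NUM_BUCKETS)}


def drain(arp_table, drained, hosts):
    """Repoint only the buckets owned by the drained host (hypothetical helper)."""
    remaining = [h for h in hosts if h != drained]
    new_table = dict(arp_table)
    moved = 0
    for bucket, host in arp_table.items():
        if host == drained:
            # Spread the drained host's buckets round-robin over the others.
            new_table[bucket] = remaining[moved % len(remaining)]
            moved += 1
    return new_table


hosts = ["A", "B", "C", "D"]
before = build_arp_table(hosts)
after = drain(before, "B", hosts)

flow = ("198.51.100.7", 40000, "203.0.113.1", 443, "tcp")
print(before[bucket_for(flow)], "->", after[bucket_for(flow)])
```

Because the number of buckets never changes, the hash result for a given flow is stable; only the second-stage mapping is rewritten.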
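Design: Faild (2/3)
• Encoding failover decisions
  • To preserve L4 state, give the destination MAC address a meaning
    • Current target
    • Previous target
  • On failover, embed Curr/Prev and forward to the server
  • Example: B handled the flow until a moment ago; B is failed over and traffic moves to A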
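As a rough sketch of the idea of carrying the current and previous target in the destination MAC address, the Python below packs two host IDs into a 6-byte address. The layout is hypothetical: the `02:00:00:00` prefix, the one-byte host IDs (which would cap a POP at 256 hosts), and the convention that the previous target equals the current one in steady state are assumptions for illustration, not the format used by Faild.

```python
PREFIX = bytes([0x02, 0x00, 0x00, 0x00])  # assumed locally administered prefix


def encode_mac(current, previous):
    """Pack (current target, previous target) into the last two bytes."""
    if not (0 <= current <= 255 and 0 <= previous <= 255):
        raise ValueError("host IDs must fit in one byte")
    raw = PREFIX + bytes([current, previous])
    return ":".join(f"{b:02x}" for b in raw)


def decode_mac(mac):
    """Recover (current target, previous target) from a virtual MAC."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    return raw[4], raw[5]


HOST_A, HOST_B = 1, 2  # hypothetical host IDs

steady = encode_mac(HOST_B, HOST_B)    # B serves the bucket, no failover pending
failover = encode_mac(HOST_A, HOST_B)  # B drained: A is current, B is previous

print(steady, failover)
print(decode_mac(failover))
```

The point of the encoding is that the receiving host can learn where a flow used to go from the frame itself, without any per-flow state in the switch.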
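Design: Faild (3/3)
• Host-side processing
  • ARP/ND is controlled by the agent/controller
  • On failover, the host can choose between local processing and forwarding
  • Flows stay on their original host! Efficient
  • Implemented as a kernel module
• A: processes the packet if it is a new connection (SYN) or belongs to an existing connection on A; otherwise forwards it to B
• B: keeps the session (socket) and continues processing it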
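The host-side rule on this slide can be written as a small decision function. The sketch below is an assumption-laden illustration in Python (the real logic is described as a kernel module): `decide`, its parameters, and the `LOCAL` marker are hypothetical names, and the last branch, where there is no distinct previous target, is an assumed fallback.

```python
LOCAL = "local"


def decide(is_syn, has_local_socket, me, previous):
    """Return LOCAL to process the packet here, or the host to forward it to."""
    if is_syn or has_local_socket:
        # New connection, or a flow this host already owns: handle it here.
        return LOCAL
    if previous != me:
        # Unknown flow: it may still live on the previous target.
        return previous
    # No other candidate; let the local stack respond (assumed fallback).
    return LOCAL


# Host A has taken over from drained host B.
print(decide(is_syn=True, has_local_socket=False, me="A", previous="B"))
print(decide(is_syn=False, has_local_socket=True, me="A", previous="B"))
print(decide(is_syn=False, has_local_socket=False, me="A", previous="B"))
```

Only packets of flows unknown to A take the detour to B, so the extra hop is limited to connections that predate the failover.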
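Implementation
Switch controller
• Python: 3.5k LoC
• Control plane runs as a userspace daemon on commodity switches
• Uses vendor APIs; could also be ported to OpenFlow/P4/SAI
• Tables managed
  • Routing table: ECMP VIP-set
  • ARP table: virtual MAC mapping
  • Bridging table: specifies which interface traffic addressed to a virtual MAC leaves from
    • Can be configured automatically via LLDP
• Health checks (up/down/disabled)
  • A host that stays down continuously may fall back to ECMP
• FIB lookup (ECMP group) and 5-tuple consistent hash in SRAM
Host agent
• Daemon (Python): 2k LoC
  • VIP configuration
  • Health checks
• Kernel module: 1.2k LoC
  • Handles virtual MACs
  • Decides between local processing and redirect
  • Adds virtual MACs to the NIC unicast filter
    • If the filter table is full, switches to a hash-based filter or promiscuous mode
  • SYN cookie support
    • When the listen queue fills up, SYN cookie validation kicks in (default)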
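Evaluation
• Configuration
  • Smallest POP configuration (2 switches, 8 hosts, half rack), 400 Gbps, 320 Krps
  • Largest POP configuration (4 switches, 64 hosts)
• What is evaluated
  • Draining without any impact on end-to-end traffic
  • Draining in a timely manner
  • No latency impact while draining
  • No CPU overhead while draining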
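Evaluation (1/5): Graceful failover
• Switch drain/refill
  • Traffic shifts to the other switch, then returns
  • Server load is unchanged -> graceful failover
  • Both switches need the same hash configuration
• Host drain/refill
  • Load spreads across the other 7 hosts
  • Requests on the drained host tail off rapidly
    • Depends on flow length
  • No impact on flows (resets, retransmissions) in the process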
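Evaluation (2/5): Switch reconfiguration time
• Measured the ARP table update time
• Execution time is proportional to the number of simultaneous updates
• Worst case
  • Switch with vendor A's ASIC: 119 ms at the 95th percentile
  • Switch with vendor B's ASIC: 134 ms at the 95th percentile
• Fast enough (no real problem)
• ARP updates are atomic, so there is no service impact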
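Evaluation (3/5): Detour-induced latency
• ping/traceroute never hit the host socket table, so they cannot be used for this measurement
• Measurement method
  • Deliberately send non-SYN packets to a drained host and make it respond with a reset (f:r)
  • Compare the round-trip time against the normal case
• Results
  • 14 us at the 50th percentile
  • 14.6 us at the 95th percentile
  • 19.52 us at the 99th percentile
• The delay only applies to packets that pass through a host being drained
• Drain latency of typical software LBs (Maglev, Duet) is 50 us to 1 ms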
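Evaluation (4/5): Host overhead
• Measured the overhead of the Faild kernel module
  • Counted the total number of stack traces and normalized it to estimate CPU utilization
  • Each measurement ran for 2 minutes
• Results
  • Overhead during drain/refill is very small
  • 0.22% on average, 0.5% increase at most
• Notes
  • In the experiment: flows finish within 2 minutes of a drain
  • In practice: about 70% finish in under 10 seconds and about 85% in under 1 minute. Fast!
  • The overhead is small, and the number of flows drops sharply over time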
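Evaluation (5/5): Load balancing accuracy
• ECMP is implemented in hardware and must distribute load evenly
• Verification
  • Spread traffic across 2 hosts with switch ECMP
  • Measured the max/avg ratio
  • Both vendor A and vendor B converge to almost 1: good quality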
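The max/avg ratio used here is easy to state in code. The helper below is a small Python sketch; the function name and the sample counts are made up for illustration and are not measurements from the paper. A value of 1.0 means perfectly even distribution, and larger values mean the busiest host carries proportionally more than its share.

```python
def imbalance(counts):
    """Ratio of the busiest host's load to the mean load (1.0 = perfectly even)."""
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one host with non-zero load")
    return max(counts) / (sum(counts) / len(counts))


print(imbalance([50, 50]))  # even split across 2 hosts
print(imbalance([75, 25]))  # skewed split across 2 hosts
```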
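Operational experience (1/2)
• Faild was built assuming simple network behavior and traffic
• The assumptions are re-examined in the light of operational experience
• Recursive draining and POP upgrades
  • A host that is already draining cannot itself be drained. If the host standing by for a drain dies, the flows are lost (a double failure?)
  • It is not really needed anyway, since a drain completes in a few minutes. Implementing recursive draining would add complexity, so they prefer not to
• Scalability challenges
  • Scaling up requires a multi-layer topology (spine/leaf style). Failed-host state must be synchronized with switches the host is not attached to
  • The MAC address encoding limits a POP to 256 hosts (a workaround is available). So far no POP has come close to 256 hosts
  • Moving to IPv6 is easy. L3 operation is also possible with IP-in-IP encapsulation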
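Operational experience (2/2)
• ECMP hashing assumptions
  • Some ECMP implementations show up to a 6x difference in load
  • Some ASICs only allow group sizes of {1, 2, .., 15} * 2^n
    • Example: configuring 63 next hops yields only 60 (15 * 2^2); the remaining 3 next hops receive no traffic at all
  • Some use the interface number in the ECMP hash computation (so the computation differs per host)
  • For some vendors, the boot order of the line cards determines the hash seed!
• Protocol assumptions
  • Fragmented packets are hashed differently from the 5-tuple hash, so they are forwarded to a different host
    • However, almost all IPv4 packets have the Don’t Fragment bit set (personal opinion). IPv6 has no in-network fragmentation in the first place
  • Reset packets for ECN-marked packets increased (coinciding with ECN being enabled by default in iOS/OS X in 2015). Observed at one US operator and several operators outside the US
    • Such packets may take a different path -> stopped ECN negotiation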
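The group-size restriction can be checked with a few lines of Python. This is a sketch under the slide's stated constraint only (sizes of the form m * 2^n with m between 1 and 15); `usable_next_hops` is a hypothetical helper, and real ASICs may round differently.

```python
def usable_next_hops(requested, max_mantissa=15):
    """Largest m * 2**n <= requested with 1 <= m <= max_mantissa and n >= 0."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    best = 0
    n = 0
    while (1 << n) <= requested:
        # For this power of two, take the biggest allowed multiplier that fits.
        m = min(max_mantissa, requested >> n)
        best = max(best, m << n)
        n += 1
    return best


requested = 63
usable = usable_next_hops(requested)
print(usable, "usable,", requested - usable, "next hops left idle")
```

For 63 requested next hops the best fit is 15 * 2^2 = 60, which matches the example on the slide: 3 next hops never receive traffic.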
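Related work
• ECMP on hardware switches alone cannot cope with adding hosts
• Ananta, Maglev: keep per-flow state
• Duet, Rubik: combine HW and SW. Porting HW ECMP to software is impractical
• SilkRoad: stores compressed flow state in hardware SRAM. Vulnerable to DDoS
• Beamer: an approach close to Faild, but it needs dedicated LB hosts and a controller
• Faild: each individual technique is already used elsewhere for a different purpose. They were combined into this design based on operational experience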
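Conclusion
• Runs in POPs (edge cloud) that are an order of magnitude denser than DC clouds
• A stateless LB that supports transport affinity
  • Graceful failover
  • Low overhead
  • Less susceptible to DDoS (resource-optimized and scalable)
  • Low drain latency
• Reflects lessons from the past four years
  • Experience serving 7 Mrps with Faild
  • HW limitations and non-intuitive protocol interactions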
EoP