Research Paper Introduction #6
cafenero_777
November 25, 2019
“Balancing on the Edge: Transport Affinity without Network State”
Transcript
Research Paper Introduction #6 “Balancing on the Edge: Transport Affinity without Network State” @cafenero_777 2019/11/25
$ which
• Balancing on the Edge: Transport Affinity without Network State
• João Taveira Araújo, Lorenzo Saino, Lennert Buytenhek, and Raul Landa
• Fastly
• Networked Systems Design and Implementation (NSDI ’18)
• https://www.usenix.org/conference/nsdi18/presentation/araujo
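Agenda
• Overview and why I read this paper
• Abstract
• Introduction
• Background and motivation
• Design
• Implementation
• Evaluation
• Operational experience
• Related work
• Conclusion
Overview and why I read this paper
• Overview
  • Designed, developed, and implemented a load balancer (LB) for heavily constrained CDN POPs
  • Careful design yields an LB that is stateless and low-latency
  • Four years of production use (as of 2018); shares lessons learned in operation
• Why I read it
  • Design and implementation of a highly efficient LB inside a CDN POP
  • Actually developed and operated at Fastly
  • Miyagawa-san brought it up on rebuild.fm and it caught my interest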
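Introduction
• POPs (Points of Presence)
  • Used for CDN/edge
  • Video/image delivery, AAA, paid content delivery
  • Geographically distributed
  • Tbps and Mrps
• Different from a regular DC network
  • Efficiency: serve as many requests as possible within a physically "cramped" footprint
  • Resilience: capacity is limited -> build it to withstand DDoS -> stateless
  • Gracefulness: every individual component matters; design so that scale-in/out has no impact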
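Background and motivation
• Requirements that resemble a DC network's but are not the same
• High request processing density:
  • Push LB functions into the switches/hosts and maximize what the hosts deliver per unit of power and space. Keep switches to a minimum, unlike a Clos topology
  • Conventional hardware LBs are inefficient in power and space; software LBs have poor throughput and latency
  • 32 hosts @25G, 4 switches, full mesh. 1.28 Tbps = 40G * 32 hosts, i.e. on the order of 100 Maglev instances!
• Traffic surges:
  • DDoS protection (traffic can suddenly jump a hundredfold)
  • SilkRoad (10M connections, ASIC/SRAM): Fastly routinely receives more than that
  • Maglev/Duet: performance degrades as the number of connections grows
• Host churn:
  • Only tens of nodes, so scale-in/out has a relatively large impact. Draining cannot be ignored.
  • Without clean failover, traffic churns between POPs/providers
  • It is a cloud service, so software upgrades (and the failovers they cause) are routine
  • Automated upgrades halt when faults occur. Switches are upgraded too, coordinated through BGP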
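Design: Faild (1/3)
• Consistent hashing
  • The switch picks the next hop (ECMP VIP set), does the ARP lookup, and selects the output interface
  • To drain a switch, withdraw its BGP advertisements. All switches hold the same hash
  • The next hop decides the distribution -> the MAC address distributes further
  • Relies only on the common denominator of switch features; vendor consistent-hashing implementations are not used
  • An agent handles all of the above
• This alone would still disrupt existing flows
  • Host-side handling is needed as well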
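A minimal sketch of the idea on this slide, under stated assumptions: the switch hashes each flow onto a fixed set of ECMP next hops ("buckets"), and a separate ARP-like table maps each bucket to a host. Draining a host rewrites only the table entries that pointed at it, so flows on other hosts do not move. The names `bucket_for`, `build_arp_table`, and `drain` are hypothetical, and SHA-256 merely stands in for whatever hash the switch ASIC uses.

```python
import hashlib

def bucket_for(flow, n_buckets):
    """Stand-in for the switch's ECMP hash over the 5-tuple (assumption)."""
    key = "|".join(map(str, flow)).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_buckets

def build_arp_table(hosts, n_buckets):
    """Map every ECMP next hop (bucket) to a host, round-robin."""
    return {b: hosts[b % len(hosts)] for b in range(n_buckets)}

def drain(arp_table, host, remaining):
    """Rewrite only the entries that pointed at the drained host."""
    new_table = dict(arp_table)
    moved = [b for b, h in arp_table.items() if h == host]
    for i, b in enumerate(moved):
        new_table[b] = remaining[i % len(remaining)]
    return new_table

hosts = ["A", "B", "C", "D"]
table = build_arp_table(hosts, n_buckets=16)
after = drain(table, "B", [h for h in hosts if h != "B"])
changed = [b for b in table if table[b] != after[b]]
print(changed)  # only buckets that used to map to B
```

The number of next hops never changes, so the flow-to-bucket hash is stable; only the bucket-to-host mapping is edited.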
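Design: Faild (2/3)
• Encoding failover decisions
  • To preserve L4 connections, give the destination MAC address a meaning
    • Current target
    • Previous target
  • On failover, embed the current/previous targets and forward the packet to the server
  • Example: the flow was handled by B until a moment ago. B is failed over and traffic moves to A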
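To make the "current target / previous target" encoding concrete, here is a small sketch. The byte layout is an assumption for illustration only (a locally administered prefix followed by one byte per host ID); the slides do not give the actual format. `encode_mac` and `decode_mac` are hypothetical helper names.

```python
# Hypothetical layout: 4-byte locally administered prefix,
# then 1 byte for the current target and 1 byte for the previous target.
PREFIX = bytes([0x02, 0x00, 0x00, 0x00])

def encode_mac(current, previous):
    """Pack two host IDs into a virtual destination MAC address."""
    if not (0 <= current < 256 and 0 <= previous < 256):
        raise ValueError("host IDs must fit in one byte")
    raw = PREFIX + bytes([current, previous])
    return ":".join(f"{b:02x}" for b in raw)

def decode_mac(mac):
    """Recover (current, previous) host IDs from a virtual MAC."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    return raw[4], raw[5]

# Example from the slide: B (ID 2) handled the flow, traffic now goes to A (ID 1).
mac = encode_mac(current=1, previous=2)
print(mac, decode_mac(mac))
```

With one byte per host ID, a layout like this would cap a POP at 256 hosts, which matches the limit mentioned later under scalability challenges.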
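Design: Faild (3/3)
• Host-side processing
  • ARP/ND is controlled by the agent/controller
  • On failover, the host chooses between processing locally and forwarding
  • Flows stay on their original host! Efficient
  • Implemented as a kernel module
• A: processes the packet if it is a new connection (SYN) or belongs to an existing connection on A; otherwise forwards it to B
• B: keeps the session (socket) and processes the packet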
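The A/B decision above can be sketched in a few lines. This is an illustrative model only, not the kernel module: `handle_packet` is a hypothetical function, `local_sockets` stands in for the host's socket table, and the final fallback (handle locally when there is no distinct previous target) is an assumption.

```python
def handle_packet(is_syn, flow, local_sockets, me, previous):
    """Decide whether the receiving host processes or forwards a packet.

    me/previous are the current and previous targets decoded from the
    destination MAC address.
    """
    if is_syn or flow in local_sockets:
        # New connection, or a connection this host already owns.
        return ("local", me)
    if previous != me:
        # Unknown flow: it may still live on the previous target.
        return ("forward", previous)
    # Steady state (no failover in progress): let the local stack answer.
    return ("local", me)

flow_old = ("198.51.100.7", 40000, "192.0.2.1", 443)
flow_new = ("198.51.100.8", 40001, "192.0.2.1", 443)
sockets_on_a = {flow_new}

print(handle_packet(False, flow_old, sockets_on_a, "A", "B"))  # forwarded to B
print(handle_packet(True, flow_old, sockets_on_a, "A", "B"))   # SYN handled on A
print(handle_packet(False, flow_new, sockets_on_a, "A", "B"))  # known flow on A
```

The switch stays stateless in this model: the only per-flow state consulted is the socket table the host already maintains.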
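Implementation
• Switch controller
  • Python: 3.5k LoC
  • Control plane in a userspace daemon on commodity switches
  • Uses vendor APIs. Could be ported to OpenFlow/P4/SAI
  • Tables it manages
    • Routing table: ECMP VIP set
    • ARP table: virtual MAC mapping
    • Bridging table: specifies which interface traffic addressed to a virtual MAC leaves from
      • Can be configured via LLDP
  • Health checks (up/down/disabled)
    • May fall back to plain ECMP when hosts go down in succession
  • FIB lookup (ECMP group) and 5-tuple consistent hash in SRAM
• Host agent
  • Daemon (Python): 2k LoC
    • VIP configuration
    • Health checks
  • Kernel module: 1.2k LoC
    • Handles virtual MACs
      • Process locally or redirect
    • Adds the virtual MACs to the NIC unicast filter
      • If the table is full, switch to a hash-based filter or promiscuous mode
    • SYN cookie support
      • When the listen queue fills up, SYN cookie validation kicks in (default)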
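Evaluation
• Setup
  • Smallest POP configuration (2 switches, 8 hosts, half rack), 400 Gbps, 320 Krps
  • Largest POP configuration (4 switches, 64 hosts)
• What is evaluated
  • Draining without affecting end-to-end traffic
  • Draining in a timely manner
  • No latency impact while draining
  • No CPU overhead while draining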
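Evaluation (1/5): Graceful failover
• Switch drain/refill
  • Traffic shifts to the other switch, then comes back
  • Server load is unchanged -> graceful failover
  • Both switches need the same hash configuration
• Host drain/refill
  • Load spreads across the other 7 hosts
  • Requests on the drained host taper off rapidly
    • How fast depends on the flows
  • No impact on flows (resets, retransmissions) in the process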
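Evaluation (2/5): Switch reconfiguration time
• Measured the time to update the ARP table
  • Execution time is proportional to the number of simultaneous updates
• Worst case
  • Switch with vendor A's ASIC: 119 ms at the 95th percentile
  • Switch with vendor B's ASIC: 134 ms at the 95th percentile
  • Fast enough (not a problem in practice)
• ARP updates are atomic, so there is no service impact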
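Evaluation (3/5): Detour-induced latency
• ping/traceroute never hit the host socket table, so they cannot be used to measure this
• Measurement method
  • Deliberately send non-SYN packets to a drained host so that a reset is sent back (f:r)
  • Compare the round-trip time with the normal case
• Results
  • 14 us at the 50th percentile
  • 14.6 us at the 95th percentile
  • 19.52 us at the 99th percentile
  • The delay applies only to packets that pass through a host being drained
  • Drain delay of typical software LBs (Maglev, Duet): 50 us to 1 ms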
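Evaluation (4/5): Host overhead
• Measured the overhead of the Faild kernel module
  • Count the total number of stack traces and normalize to estimate CPU utilization
  • Each measurement runs for 2 minutes
• Results
  • Overhead during drain/refill is very small
  • +0.22% on average, +0.5% at most
• Notes
  • Experiment: flows finished within 2 minutes of the drain
  • In practice: about 70% of flows last under 10 seconds and about 85% under 1 minute. Short!
  • The overhead is small, and the number of flows drops sharply over time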
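Evaluation (5/5): Load balancing accuracy
• ECMP is implemented in hardware and must distribute load evenly
• Verification
  • Distribute traffic across 2 hosts with switch ECMP
  • Measure max/avg
  • For both vendor A and vendor B the ratio converges to nearly 1, so the quality is good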
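The max/avg figure used on this slide is easy to state in code. A minimal sketch, assuming per-host load is available as a list of counts; `imbalance` is a hypothetical helper name and the sample numbers are made up.

```python
def imbalance(loads):
    """Ratio of the busiest host's load to the average load.

    1.0 means perfectly even; larger values mean one host is hotter.
    """
    avg = sum(loads) / len(loads)
    return max(loads) / avg

print(imbalance([10, 10, 10, 10]))  # even spread
print(imbalance([30, 10, 10, 10]))  # one hot host
```

A ratio close to 1 is what the slide means by the hardware ECMP hash being "good quality".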
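Operational experience (1/2)
• Faild was built assuming simple network behavior and traffic
• Those assumptions are re-examined against operational experience
• Recursive draining and POP upgrades
  • A host that is itself a drain target cannot be drained. If the machine taking over dies, you are out of luck (double failure?)
  • Not really needed in practice, since drains complete anyway. Implementing recursive draining would add complexity, so the authors would rather not
• Scalability challenges
  • Scaling up requires a multi-layer topology (spine/leaf style). Host failure state has to be synchronized with switches the host is not directly attached to
  • The MAC address encoding caps a POP at 256 hosts (a fix is ready). So far no POP is anywhere near 256 hosts
  • Moving to IPv6 is easy. IP-in-IP encapsulation can turn this into an L3 scheme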
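Operational experience (2/2)
• ECMP hashing assumptions
  • In some cases ECMP load differs by up to 6x
  • Some ASICs only accept group sizes of {1, 2, .., 15} * 2^n
    • Example: configure 63 next hops and only 60 (15 * 2^2) are used; the remaining 3 hops receive no traffic at all
  • Some use the interface number in the ECMP hash computation (so the result differs per host)
  • One vendor even seeds the hash from the line card boot order!
• Protocol assumptions
  • When fragmented packets arrive, the 5-tuple hash comes out differently, so they are forwarded to a different host
    • However, nearly all IPv4 packets have the Don't Fragment bit set (personal view), and IPv6 does not fragment in the network to begin with
  • Resets for packets carrying ECN increased (at the same time ECN became enabled by default in iOS/OS X in 2015). Observed with one US operator and several non-US operators
    • Packets may be taking different paths -> Fastly stopped negotiating ECN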
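The {1, 2, .., 15} * 2^n restriction above can be checked with a short calculation. This sketch assumes the ASIC rounds a requested group size down to the largest representable value, which is what the 63 -> 60 example suggests; `usable_group_size` is a hypothetical name.

```python
def usable_group_size(requested):
    """Largest m * 2**n <= requested with m in 1..15 and n >= 0."""
    best = 0
    for m in range(1, 16):
        size = m
        # Double while the result still fits in the requested size.
        while size * 2 <= requested:
            size *= 2
        if size <= requested:
            best = max(best, size)
    return best

requested = 63
usable = usable_group_size(requested)
print(usable, requested - usable)  # usable next hops, next hops left idle
```

For 63 requested next hops this gives 60 usable ones (15 * 2^2), leaving 3 next hops that never receive traffic, as in the slide's example.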
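Related work
• ECMP on hardware switches alone: hosts cannot be added
• Ananta, Maglev: they keep per-flow state
• Duet, Rubik: combine hardware and software. Porting hardware ECMP to software is impractical
• SilkRoad: stores compressed flow state in hardware SRAM. Weak against DDoS
• Beamer: an approach close to Faild, but it needs dedicated LB hosts and a controller
• Faild: each individual technique is already used elsewhere for a different purpose. Operational experience led to combining them in this design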
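Conclusion
• Runs in POPs (edge cloud) that are an order of magnitude denser than a DC cloud
• A stateless LB that supports transport affinity
  • Graceful failover
  • Low overhead
  • Resistant to DDoS (resource-optimized and scalable)
  • Low drain delay
• Reflects lessons learned over the past 4 years
  • Faild has served 7 Mrps
  • Hardware limitations and non-intuitive protocol interactions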
EoP