Research Paper Introduction #6
“Balancing on the Edge: Transport Affinity without Network State”
cafenero_777
November 25, 2019
Transcript
Research Paper Introduction #6
“Balancing on the Edge: Transport Affinity without Network State”
@cafenero_777 2019/11/25
$ which
• Balancing on the Edge: Transport Affinity without Network State
• João Taveira Araújo, Lorenzo Saino, Lennert Buytenhek, and Raul Landa
• Fastly
• Networked Systems Design and Implementation (NSDI ’18)
• https://www.usenix.org/conference/nsdi18/presentation/araujo
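Agenda
• Overview and why I read this paper
• Abstract
• Introduction
• Background and motivation
• Design
• Implementation
• Evaluation
• Operational experience
• Related work
• Conclusion
Overview and why I read this paper
• Overview
  • Designed, developed, and deployed a load balancer (LB) for heavily constrained CDN POPs
  • A careful design makes it stateless and low-latency
  • Four years of production use (as of '18); shares lessons learned from operations
• Why I read it
  • Design and implementation of a highly efficient LB for CDN POPs
  • Actually developed and operated at Fastly
  • Miyagawa-san talked about it on rebuild.fm and it caught my interest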
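Introduction
• POPs (Points of Presence)
  • Used for CDN/edge delivery
  • Video/image delivery, AAA, paid content delivery
  • Geographically distributed
  • Tbps and Mrps
• Different from an ordinary DC network
  • Efficiency: handle as many requests as possible within a physically "tight" footprint
  • Resilience: capacity is limited -> build it to withstand DDoS -> stateless
  • Gracefulness: every individual component matters; design so that scale-in/out has no impact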
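Background and motivation
• Requirements look similar to a DC network but are not the same
• High request processing density:
  • Put the LB function into the switches and hosts to maximize the cost-effectiveness of host power and space. Use as few switches as possible, unlike a Clos topology
  • Conventional hardware LBs are inefficient in power and space; software LBs have poor throughput and latency
  • 32 hosts @ 25G, 4 switches, fully meshed. 1.28 Tbps = 40G × 32 hosts, equivalent to about 100 Maglev instances!
• Traffic surges:
  • DDoS protection (traffic can suddenly jump by a factor of hundreds)
  • SilkRoad (10M connections, ASIC/SRAM): they routinely receive more than that
  • Maglev/Duet: performance degrades as the number of connections grows
• Host churn:
  • With only tens of nodes, scale-in/out has a relatively large impact. Draining cannot be ignored.
  • If failover is not clean, traffic churns between POPs/providers
  • As a cloud service, software upgrades (and the failover they involve) are routine
  • Automatic upgrades stop when faults occur. Switches are upgraded in coordination with BGP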
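Design: Faild (1/3)
• Consistent hashing
  • The switch selects the next hop (ECMP VIP-set), does the ARP lookup, and picks the output interface
  • When draining a switch, withdraw its BGP advertisement. All switches hold the same hash
  • The next hop decides the distribution -> the MAC address distributes further
  • Relies only on lowest-common-denominator switch features; vendor consistent-hash features are not used
  • An agent handles all of the above
• This alone would still affect existing flows
  • Extra work is needed on the host side
[Figure labels: Eth1, Eth2, Eth3, Eth4, Eth5, Eth6, port]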
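
The two-level mapping on this slide (ECMP hash to a next hop, then ARP to a host MAC) can be sketched roughly as below. This is only an illustration under assumptions, not Faild's code: `NUM_NEXT_HOPS`, `bucket_for`, `build_arp_table`, and `drain` are hypothetical names, SHA-256 stands in for the switch ASIC's hash, and the round-robin assignment is an arbitrary choice.

```python
import hashlib

NUM_NEXT_HOPS = 64  # assumed size of the ECMP next-hop (virtual IP) set


def bucket_for(flow):
    """Stand-in for the switch's ECMP hash over the 5-tuple (assumption)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NEXT_HOPS


def build_arp_table(hosts):
    """Map every next-hop bucket to a host, round-robin (bucket -> host)."""
    return {b: hosts[b % len(hosts)] for b in range(NUM_NEXT_HOPS)}


def drain(arp_table, host):
    """Return a new table in which only the drained host's buckets move."""
    remaining = sorted(set(arp_table.values()) - {host})
    new_table = dict(arp_table)
    moved = [b for b, h in arp_table.items() if h == host]
    for i, b in enumerate(moved):
        new_table[b] = remaining[i % len(remaining)]
    return new_table


hosts = ["A", "B", "C", "D"]
before = build_arp_table(hosts)
after = drain(before, "B")
flow = ("198.51.100.7", 40312, "203.0.113.1", 443, "tcp")
print(bucket_for(flow), before[bucket_for(flow)], after[bucket_for(flow)])
```

Because the flow-to-bucket hash never changes, draining a host only rewrites the ARP entries of that host's buckets; flows that hash to other buckets keep their host.
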
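Design: Faild (2/3)
• Encoding failover decisions
  • To keep L4 connections alive, give the destination MAC address a meaning
    • Current target
    • Previous target
  • On failover, the switch embeds both current and previous target and forwards to the server
  • Example: the flow was handled by B until a moment ago; B is failed over and the flow moves to A
[Figure labels: Eth5, Eth1]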
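
One way to picture "a destination MAC that carries the current and previous target" is the sketch below. The byte layout is an assumption made for illustration (a fixed 4-byte prefix followed by one byte per host ID); `PREFIX`, `encode_mac`, and `decode_mac` are hypothetical names, not taken from the paper.

```python
PREFIX = bytes([0x02, 0x00, 0x00, 0x00])  # hypothetical locally administered prefix


def encode_mac(previous, current):
    """Pack previous and current target host IDs into a virtual MAC string."""
    if not (0 <= previous <= 255 and 0 <= current <= 255):
        raise ValueError("host IDs must fit in one byte in this sketch")
    raw = PREFIX + bytes([previous, current])
    return ":".join(f"{b:02x}" for b in raw)


def decode_mac(mac):
    """Recover (previous, current) host IDs from a virtual MAC string."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    if len(raw) != 6 or raw[:4] != PREFIX:
        raise ValueError("not a virtual MAC in this sketch's scheme")
    return raw[4], raw[5]


# Example from the slide: B (ID 2) served the flow until now, A (ID 1) takes over.
mac = encode_mac(previous=2, current=1)
print(mac, decode_mac(mac))
```

With this layout the receiving host can tell from the frame alone which host used to own the bucket, so no per-flow state is needed in the switch.
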
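Design: Faild (3/3)
• Host-side processing
  • ARP/ND is controlled by the agent/controller
  • On failover, the host chooses between processing locally and forwarding
  • Flows stay on their original host! Efficient
  • Implemented as a kernel module
• A: processes the packet if it is a new connection (SYN) or belongs to a connection A already has; otherwise forwards it to B
• B: keeps the session (socket) and continues processing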
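
The A/B rule on this slide can be sketched as a small decision function. This is a simplified user-space illustration under assumptions, not the kernel module: `Host`, `handle`, and the set-based socket table are hypothetical, and the case with no previous target is resolved by processing locally.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    sockets: set = field(default_factory=set)  # flows with a local socket


def handle(host, flow, is_syn, previous_target):
    """Return the name of the host that should process this packet."""
    if is_syn:
        host.sockets.add(flow)  # new connection: accept on the current target
        return host.name
    if flow in host.sockets:
        return host.name  # existing local connection: process locally
    if previous_target is not None and previous_target != host.name:
        return previous_target  # unknown flow: detour to the previous target
    return host.name  # nothing to detour to: let the local stack deal with it


a = Host("A")
b = Host("B", sockets={"flow-1"})  # B owned flow-1 before being drained

print(handle(a, "flow-1", is_syn=False, previous_target="B"))  # detoured to B
print(handle(a, "flow-2", is_syn=True, previous_target="B"))   # new flow stays on A
print(handle(a, "flow-2", is_syn=False, previous_target="B"))  # now local to A
```

Existing flows keep reaching the socket on the drained host while every new connection lands on the new target, so the drained host empties out as its flows finish.
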
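Implementation
Switch controller
• Python: 3.5k LoC
• Control plane runs as a userspace daemon on commodity switches
• Uses vendor APIs; portable to OpenFlow/P4/SAI
• Tables it manages
  • Routing table: ECMP VIP-set
  • ARP table: virtual MAC mapping
  • Bridging table: specifies which interface traffic to a virtual MAC leaves from
    • Can be configured via LLDP
• Health checks (up/down/disabled)
  • If a host stays down continuously, it may fall back to plain ECMP
• FIB lookup (ECMP group) and 5-tuple consistent hash in SRAM
Host agent
• daemon (Python): 2k LoC
  • VIP configuration
  • Health checks
• kernel module: 1.2k LoC
  • Handles virtual MACs
    • Process locally or redirect
  • Adds virtual MACs to the NIC unicast filter
    • If the filter table is full, switch to a hash-based filter or promiscuous mode
  • SYN cookie support
    • SYN cookie validation kicks in when the listen queue fills up (default behavior)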
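Evaluation
• Setup
  • Smallest POP configuration (2 switches, 8 hosts, half-rack), 400Gbps, 320Krps
  • Largest POP configuration (4 switches, 64 hosts)
• What is evaluated
  • Drain without any end-to-end traffic impact
  • Drain in a timely manner
  • No latency impact while draining
  • No CPU overhead while draining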
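Evaluation (1/5): Graceful failover
• Switch drain/refill
  • Traffic shifts to the other switch, then comes back
  • Server load does not change -> graceful failover
  • Both switches need the same hash configuration
• Host drain/refill
  • Load is spread over the other 7 hosts
  • Requests on the drained host fall off rapidly
    • Depends on flow length
  • No impact on flows (resets, retransmissions) in the process
[Figure labels: X, X]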
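Evaluation (2/5): Switch reconfiguration time
• Measured the time to update the ARP table
• Execution time is proportional to the number of simultaneous updates
• Worst case
  • Switch with vendor A's ASIC: 119ms @ 95th percentile
  • Switch with vendor B's ASIC: 134ms @ 95th percentile
  • Fast enough (no real problem)
• ARP updates are atomic, so there is no service impact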
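Evaluation (3/5): Detour-induced latency
• ping/traceroute never hit the host socket table, so they cannot measure this
• Method
  • Deliberately send non-SYN packets to a drained host and make it raise a reset (f:r)
  • Compare the round-trip time against the normal case
• Results
  • 14us @ 50th percentile
  • 14.6us @ 95th percentile
  • 19.52us @ 99th percentile
• The delay only applies when a packet passes through a host that is being drained
• Typical software LBs (Maglev, Duet) add a drain delay of 50us to 1ms
[Figure labels: Reset, drained host]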
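Evaluation (4/5): Host overhead
• Measured the overhead of the Faild kernel module
  • Counted the total number of stack traces and normalized to estimate CPU utilization
  • Each measurement ran for 2 minutes
• Results
  • Overhead during drain/refill is very small
  • +0.22% on average, +0.5% at most
• Notes
  • Experiment: flows finish within 2 minutes of the drain
  • In practice: about 70% of flows last under 10 seconds and about 85% under 1 minute. Short!
  • Overhead is small, and the number of flows drops quickly over time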
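Evaluation (5/5): Load balancing accuracy
• ECMP is implemented in hardware and must spread load evenly
• Verification
  • Switch ECMP distributing across 2 hosts
  • Measured max/avg load
  • For both vendor A and vendor B the ratio converges to almost 1; quality is good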
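
The max/avg metric mentioned above is simple enough to write down. A minimal sketch, with `max_avg_ratio` as a hypothetical helper name and made-up load numbers: a value of 1.0 means perfectly even load, and larger values mean the busiest host is carrying more than its share.

```python
def max_avg_ratio(per_host_load):
    """Ratio of the busiest host's load to the mean load across hosts."""
    loads = list(per_host_load)
    if not loads or sum(loads) == 0:
        raise ValueError("need at least one host with non-zero load")
    return max(loads) / (sum(loads) / len(loads))


print(max_avg_ratio([100, 100]))  # perfectly balanced
print(max_avg_ratio([150, 50]))   # one host takes 1.5x the mean
```
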
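Operational experience (1/2)
• Faild was built assuming simple network behavior and traffic
• Those assumptions are revisited in light of operational experience
• Recursive draining and POP upgrades
  • A host that is itself draining cannot be drained onto. If the host being drained dies, those flows are lost (a double failure?)
  • It is not really needed, since the flows drain away eventually. Implementing recursive draining would add complexity, so they prefer not to.
• Scalability challenges
  • Scaling up needs a multi-layer topology (Spine/Leaf style). Failed hosts must be kept in sync with switches they are not attached to
  • The MAC address encoding caps a POP at 256 hosts (a workaround is prepared). So far no POP is close to 256 hosts
  • Moving to IPv6 is easy. IP-in-IP encapsulation can make it work at L3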
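Operational experience (2/2)
• ECMP hashing assumptions
  • Some ECMP implementations show up to a 6x load difference
  • Some ASICs only accept group sizes of the form {1, 2, .., 15} * 2^n
    • Example: configure 63 next hops and only 60 (15*2^2) are used; the remaining 3 next hops receive no traffic at all
  • Some use the interface number in the ECMP hash calculation (so the calculation differs per host)
  • For some vendors the line card boot order determines the hash seed!
• Protocol assumptions
  • Fragmented packets produce a different 5-tuple hash, so they get forwarded to a different host
  • However, almost all IPv4 packets have the Don't Fragment bit set (personal opinion). IPv6 has no in-network fragmentation in the first place
  • Resets on ECN-marked packets increased (at the same time ECN became enabled by default in iOS/OS X in 2015). Observed at one US operator and several non-US operators
  • Such packets may take a different path -> stopped ECN negotiation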
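
The group-size restriction can be checked with a few lines of arithmetic. The sketch below assumes the ASIC rounds down to the largest representable size, which matches the slide's 63 -> 60 example but is otherwise an assumption; `usable_next_hops` is a hypothetical helper name.

```python
def usable_next_hops(requested, max_mantissa=15):
    """Largest m * 2**n <= requested, with 1 <= m <= max_mantissa and n >= 0."""
    if requested < 1:
        raise ValueError("need at least one next hop")
    best = 0
    n = 0
    while (1 << n) <= requested:
        m = min(max_mantissa, requested >> n)  # biggest mantissa that still fits
        best = max(best, m << n)
        n += 1
    return best


for size in (15, 17, 63, 64):
    used = usable_next_hops(size)
    print(f"configured={size} used={used} idle={size - used}")
```

For 63 configured next hops the largest representable size is 15 * 2^2 = 60, so three next hops would sit idle, as described on the slide.
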
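Related work
• ECMP on hardware switches alone: hosts cannot be added
• Ananta, Maglev: keep per-flow state
• Duet, Rubik: combine hardware and software. Porting hardware ECMP into software is unrealistic
• SilkRoad: stores compressed flow state in hardware SRAM. Weak against DDoS
• Beamer: an approach close to Faild, but needs dedicated LB hosts and a controller
• Faild: each individual technique is already used elsewhere for a different purpose. The design combines them based on operational experience.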
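Conclusion
• Runs in POPs (edge cloud) that are orders of magnitude denser than a DC cloud
• A stateless LB that supports transport affinity
  • Graceful failover
  • Low overhead
  • Hard to take down with DDoS (resource-optimized and scalable)
  • Low drain latency
• Reflects lessons learned over the past 4 years
  • Experience serving 7 Mrps with Faild
  • Hardware limitations and non-intuitive protocol interactions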
EoP