Research Paper Introduction #6
cafenero_777
November 25, 2019
“Balancing on the Edge: Transport Affinity without Network State”
Transcript
Research Paper Introduction #6 “Balancing on the Edge: Transport Affinity without Network State” @cafenero_777 2019/11/25
$ which
• Balancing on the Edge: Transport Affinity without Network State
• João Taveira Araújo, Lorenzo Saino, Lennert Buytenhek, and Raul Landa
• Fastly
• Networked Systems Design and Implementation (NSDI ’18)
• https://www.usenix.org/conference/nsdi18/presentation/araujo
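Agenda
• Overview and why I read this paper
• Abstract
• Introduction
• Background and motivation
• Design
• Implementation
• Evaluation
• Operational experience
• Related work
• Conclusion
Overview and why I read this paper
• Overview
  • The authors designed, developed, and deployed a load balancer (LB) for CDN POPs, which operate under tight constraints
  • Careful design yields an LB that is both stateless and low-latency
  • (As of 2018) four years of production use; the paper shares lessons learned in operation
• Why I read it
  • It covers the design and implementation of a highly efficient LB for CDN POPs
  • It is actually developed and operated at Fastly
  • Miyagawa-san talked about it on rebuild.fm, which got me curious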
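Introduction
• POPs (Points of Presence)
  • Used at the CDN/edge
  • Video/image delivery, AAA, paid content delivery
  • Geographically distributed
  • Tbps and Mrps
• Different from an ordinary DC network
  • Efficiency: serve as many requests as possible within a physically "tight" space
  • Resilience: capacity is limited -> build it to withstand DDoS -> stateless
  • Gracefulness: every individual component matters; design so that scale-in/out has no impact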
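Background and motivation
• Requirements that resemble a DC network's but are not the same
• High request processing density:
  • Put the LB function into the switches and hosts to maximize the cost-effectiveness of host power and space. Switches are kept to a minimum, unlike a Clos topology
  • Conventional HW LBs are power- and space-inefficient; SW LBs have poor throughput and latency
  • 32 hosts @ 25G, 4 switches, full mesh. 1.28 Tbps = 40G × 32 hosts, roughly 100 Maglev instances' worth!
• Traffic surges:
  • DDoS protection (traffic can suddenly jump by a factor of hundreds)
  • SilkRoad (10M connections, ASIC/SRAM): the POPs routinely receive more than that
  • Maglev/Duet: performance degrades as the number of connections grows
• Host churn:
  • With only tens of nodes, scale-in/out has a relatively large impact; draining cannot be ignored
  • Without a clean failover, traffic churns between POPs/providers
  • As a cloud service, software upgrades (and the failovers they trigger) are routine
  • Automatic upgrades stop when faults occur. Switches are upgraded too, coordinated via BGP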
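Design: Faild (1/3)
• Consistent hashing
  • The switch picks the next hop (ECMP VIP-set), does the ARP lookup, and selects the output interface
  • When a switch is drained, its BGP advertisement is withdrawn. All switches hold the same hash
  • The next hop decides the distribution -> the MAC address spreads it further
  • Relies only on features common to switches; vendor consistent hashing (C-hash) is not used
  • An agent handles all of the above
• This alone would still disrupt existing flows
  • Host-side handling is needed as well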
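The switch-side idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `NUM_NEXT_HOPS`, the SHA-256 stand-in for the ASIC's ECMP hash, and the round-robin slot assignment are all hypothetical and not taken from the paper. The point it shows is that the flow-to-slot hash never changes, and draining a host rewrites only the table entries that pointed at that host.

```python
import hashlib

NUM_NEXT_HOPS = 8  # hypothetical size of the ECMP VIP-set


def next_hop_index(five_tuple, num_next_hops=NUM_NEXT_HOPS):
    """Stand-in for the switch ECMP hash: map a flow to a fixed next-hop slot."""
    key = "|".join(map(str, five_tuple)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_next_hops


def build_arp_table(hosts, num_next_hops=NUM_NEXT_HOPS):
    """Assign next-hop slots to hosts round-robin (plays the role of the ARP table)."""
    return {slot: hosts[slot % len(hosts)] for slot in range(num_next_hops)}


def drain(arp_table, drained, replacement):
    """Rewrite only the slots owned by the drained host; leave the rest alone."""
    return {slot: (replacement if host == drained else host)
            for slot, host in arp_table.items()}


hosts = ["A", "B", "C", "D"]
arp = build_arp_table(hosts)
drained_arp = drain(arp, "B", "A")

flow = ("198.51.100.7", 40000, "203.0.113.1", 443, "tcp")
slot = next_hop_index(flow)
print(slot, arp[slot], drained_arp[slot])
```

Flows whose slot did not belong to B keep exactly the same target after the drain, which is why the number of slots is held constant instead of shrinking the ECMP group.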
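Design: Faild (2/3)
• Encoding failover decisions
  • To preserve L4 connections, the destination MAC address is given meaning
    • Current target
    • Previous target
  • On failover, the current and previous targets are embedded and the packet is forwarded to the server
  • Example: B was handling the flow until a moment ago; B is failed over and traffic moves to A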
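A minimal sketch of carrying the current and previous target in the destination MAC. The byte layout here is an assumption made for illustration (a locally administered prefix `02:00:00:00`, then one byte for the previous host and one for the current host); the slides only say that both targets are embedded, not where. `encode_mac` and `decode_mac` are hypothetical helper names.

```python
def encode_mac(current, previous):
    """Pack two host IDs (0-255) into a locally administered unicast MAC."""
    for host_id in (current, previous):
        if not 0 <= host_id <= 255:
            raise ValueError("host ID must fit in one byte")
    # 0x02 sets the locally-administered bit and keeps the address unicast.
    octets = [0x02, 0x00, 0x00, 0x00, previous, current]
    return ":".join(f"{octet:02x}" for octet in octets)


def decode_mac(mac):
    """Return (current, previous) host IDs from a MAC built by encode_mac."""
    octets = [int(part, 16) for part in mac.split(":")]
    return octets[5], octets[4]


# Host B (ID 2) is being failed over to host A (ID 1).
mac = encode_mac(current=1, previous=2)
print(mac, decode_mac(mac))
```

With one byte per host ID, a scheme like this tops out at 256 hosts, which lines up with the POP size limit mentioned later under operational experience.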
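Design: Faild (3/3)
• Host-side processing
  • ARP/ND is controlled by the agent/controller
  • On failover, the host chooses between processing locally and forwarding
  • Flows stay on their original host! Efficient
  • Implemented as a kernel module
• A: processes the packet if it is a new connection (SYN) or belongs to an existing connection with A; otherwise forwards it to B
• B: keeps the session (socket) and goes on processing it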
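The A/B rule on this slide can be written as a small decision function. This is a userspace sketch, not the kernel module: the packet dictionary, the `local_flows` set standing in for the host socket table, and the string return values are all hypothetical. The behaviour when no previous target exists is likewise an assumption.

```python
def handle_packet(packet, local_flows, previous_target):
    """Decide whether the current target processes a packet or hands it back.

    packet: dict with "flow" (any hashable flow key) and "syn" (bool).
    local_flows: flow keys with a socket on this host (stand-in for the socket table).
    previous_target: host that owned this slot before the drain, or None.
    """
    if packet["syn"] or packet["flow"] in local_flows:
        return "local"                        # new connection, or one we already own
    if previous_target is not None:
        return f"forward:{previous_target}"   # probably an old flow of the drained host
    return "local"                            # assumed: nobody to forward to


local = {("198.51.100.7", 40000)}
print(handle_packet({"flow": ("198.51.100.9", 1234), "syn": False}, local, "B"))
```

Because the check only consults state the host already keeps for its own sockets, no per-flow state is needed in the network.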
Implementation
Switch controller
• Python: 3.5k LoC
• Control plane runs as a userspace daemon on commodity switches
• Uses vendor APIs; portable to OpenFlow/P4/SAI
• Tables managed
  • Routing table: ECMP VIP-set
  • ARP table: virtual MAC mapping
  • Bridging table: specifies which interface traffic addressed to a virtual MAC leaves from
    • Can be configured via LLDP
• Health checks (up/down/disabled)
  • When a host stays down, it may fall back to ECMP
• FIB lookup (ECMP group) and 5-tuple C-hash in SRAM
Host agent
• Daemon (Python): 2k LoC
  • VIP configuration
  • Health checks
• Kernel module: 1.2k LoC
  • Handles virtual MACs
    • Process locally or redirect
  • Adds virtual MACs to the NIC unicast filter
    • If the table limit is reached, switches to a hash-based filter or promiscuous mode
  • SYN cookie support
    • When the listen queue fills up, SYN cookie validation kicks in (the default)
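Evaluation
• Setup
  • Smallest POP configuration (2 switches, 8 hosts, half rack): 400 Gbps, 320 Krps
  • Largest POP configuration (4 switches, 64 hosts)
• What is evaluated
  • Draining without any end-to-end traffic impact
  • Draining in a timely manner
  • No latency impact while draining
  • No CPU overhead while draining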
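Evaluation (1/5): Graceful failover
• Switch drain/refill
  • Traffic shifts to one side, then returns
  • Server load is unchanged -> graceful failover
  • Both switches need the same hash configuration
• Host drain/refill
  • Load is spread across the other 7 hosts
  • Requests on the drained host fall off rapidly
    • Depends on the flows
  • Flows are unaffected in the process (no resets or retransmissions)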
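Evaluation (2/5): Switch reconfiguration time
• Measured the time to update the ARP table
• Execution time is proportional to the number of simultaneous updates
• Worst case
  • Switch with vendor A's ASIC: 119 ms at the 95th percentile
  • Switch with vendor B's ASIC: 134 ms at the 95th percentile
  • Fast enough (no real problem)
• ARP updates are atomic, so there is no service impact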
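Evaluation (3/5): Detour-induced latency
• ping/traceroute never reach the host socket table, so they cannot measure this
• Method
  • Deliberately send non-SYN packets to a drained host and have a reset sent back (f:r)
  • Compare the round-trip time with the normal case
• Results
  • 14 µs at the 50th percentile
  • 14.6 µs at the 95th percentile
  • 19.52 µs at the 99th percentile
  • The delay occurs only when a packet passes through a drained host
  • Ordinary software LBs (Maglev, Duet) add 50 µs to 1 ms of drain delay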
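Evaluation (4/5): Host overhead
• Measured the overhead of the Faild kernel module
  • Counted the total number of stack traces and normalized it to estimate CPU utilization
  • Each measurement ran for 2 [unit missing]
• Results
  • Overhead during drain/refill is very small
  • 0.22% increase on average, 0.5% at most
• Notes
  • In the experiment: flows finish within 2 [unit missing] of the drain
  • In practice: ~70% of flows last under 10 seconds and ~85% under 1 [unit missing]. Short!
  • The overhead is small, and the number of flows drops sharply over time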
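Evaluation (5/5): Load balancing accuracy
• ECMP has to be implemented in hardware and has to spread load evenly
• Verification
  • Spread traffic across 2 hosts with switch ECMP
  • Measured max/avg
  • Converges to almost 1 for both vendor A and vendor B; the quality is good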
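The max/avg figure used on this slide is easy to compute. The sketch below assumes the inputs are per-host request or byte counts over the same interval; the function name `imbalance` and the sample numbers are made up for illustration.

```python
from statistics import mean


def imbalance(per_host_load):
    """Return max/avg of per-host load; 1.0 means a perfectly even split."""
    if not per_host_load:
        raise ValueError("need at least one host")
    return max(per_host_load) / mean(per_host_load)


print(imbalance([100, 100]))  # even split across 2 hosts
print(imbalance([150, 50]))   # one host takes three quarters of the load
```

A value converging to 1 means the busiest host carries about as much as the average host, which is the property the slide reports for both vendors.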
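Operational experience (1/2)
• Faild was built assuming simple network behavior and traffic
• The assumptions are revisited in light of operational experience
• Recursive draining and POP upgrades
  • A host that is already draining cannot itself be drained. If the machine involved in the drain dies, that is game over (a double failure?)
  • It is not really needed in the first place, as everything gets drained. Implementing recursive draining would add complexity, so the authors would rather not
• Scalability challenges
  • Scaling up requires a multi-layer topology (spine/leaf-like). Failed hosts then have to be synchronized with switches they are not attached to
  • The MAC address encoding caps a POP at 256 hosts (an improvement is ready). So far no POP is anywhere near 256 hosts
  • Moving to IPv6 is easy. IP-in-IP encapsulation can turn this into an L3 scheme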
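Operational experience (2/2)
• ECMP hashing assumptions
  • Some ECMP implementations show up to a 6x imbalance
  • Some ASICs only allow group sizes of {1, 2, .., 15} * 2^n
    • Example: configure 63 next hops and only 60 (15*2^2) are used; the remaining 3 hops receive no traffic at all
  • Some use the interface number in the ECMP hash computation (so the computation differs per host)
  • One vendor seeds the hash from the line-card boot order!
• Protocol assumptions
  • Fragmented packets hash differently on the 5-tuple, so they get forwarded to a different host
    • However, nearly all IPv4 packets have the Don’t Fragment bit set (personal view). IPv6 has none in the first place
  • Resets for ECN-marked packets increased (coinciding with ECN being enabled by default in iOS/OS X in 2015). Observed at one US operator and several non-US operators
    • The packets may take a different path -> ECN negotiation was disabled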
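The group-size constraint above can be checked with a short calculation. This sketch assumes the rule is exactly "mantissa in 1..15 times a power of two" as the slide states, and that the ASIC uses the largest valid size not exceeding the request; `usable_ecmp_size` is a hypothetical helper, not a vendor API.

```python
def usable_ecmp_size(requested, max_mantissa=15):
    """Largest m * 2**n (1 <= m <= max_mantissa, n >= 0) that is <= requested."""
    best = 0
    for mantissa in range(1, max_mantissa + 1):
        size = mantissa
        while size * 2 <= requested:   # grow by powers of two while it still fits
            size *= 2
        if size <= requested:
            best = max(best, size)
    return best


requested = 63
usable = usable_ecmp_size(requested)
print(usable, requested - usable)  # usable entries, and next hops left without traffic
```

For 63 requested next hops this gives 60 usable entries and 3 idle ones, matching the example on the slide.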
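Related work
• ECMP on hardware switches alone cannot handle adding hosts
• Ananta, Maglev: end up keeping per-flow state
• Duet, Rubik: combine HW and SW. Porting HW ECMP to software is impractical
• SilkRoad: stores flows compressed in hardware SRAM. Vulnerable to DDoS
• Beamer: an approach close to Faild, but it needs dedicated LB hosts and a controller
• Faild: each individual technique is already used in some other role; operational experience guided how they were combined in the design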
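Conclusion
• Runs in POPs (edge cloud) that are an order of magnitude denser than DC clouds
• A stateless LB that supports transport affinity
  • Graceful failover
  • Low overhead
  • Hard to hurt with DDoS (resource-optimized and scalable)
  • Low drain latency
• Reflects lessons from the past 4 years
  • Experience handling 7 Mrps with Faild
  • Hardware limits and counterintuitive protocol interactions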
EoP