Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Research Paper Introduction #6

cafenero_777
November 25, 2019

Research Paper Introduction #6

“Balancing on the Edge: Transport Affinity without Network State”

cafenero_777

November 25, 2019
Tweet

More Decks by cafenero_777

Other Decks in Technology

Transcript

  1. Research Paper Introduction #6 “Balancing on the Edge: Transport Affinity

    without Network State” @cafenero_777 2019/11/25
  2. $ which • Balancing on the Edge: Transport Affinity without

    Network State • João Taveira Araújo, Lorenzo Saino, Lennert Buytenhek, and Raul Landa • Fastly • Networked Systems Design and Implementation (NSDI ’18) • https://www.usenix.org/conference/nsdi18/presentation/araujo
  3. Agenda • ֓ཁͱಡ΋͏ͱͨ͠ཧ༝ • Abstract • Introduction • Background and

    motivation • Design • Implementation • Evaluation • Operational experience • Related work • Conclusion
  4. ֓ཁͱಡ΋͏ͱͨ͠ཧ༝ • ֓ཁ • ੍໿ͷେ͖͍CDN POPͰͷLBΛઃܭɾ։ൃɾ࣮૷ͨ͠ • ઃܭΛ޻෉ͯ͠stale-less͔ͭ௿ϨΠςϯγʔΛ࣮ݱ • ʢ’18ͷ࣌఺Ͱʣ4೥ͷ࣮੷ɻӡ༻্ؾ͍ͮͨ͜ͱͳͲΛڞ༗

    • ಡ΋͏ͱͨ͠ཧ༝ • CDNͷPODͰߴޮ཰ͳLBͷઃܭɾ࣮૷ • ࣮ࡍʹFastlyͰ։ൃɾӡ༻͞Ε͍ͯΔ • rebuild.fmͰMiyagawa͞Μ͕࿩ͯͯ͠ؾʹͳͬͨ
  5. Introduction • POPs (Point of Presence) • CDN/EdgeͰͷར༻ • video/image഑৴ɺAAA,

    ༗ྉ഑৴౳ • ஍Ҭ෼ࢄ • Tbps and Mrps • ௨ৗͷDC NWͱ͸ҧ͏ • Efficiency: ෺ཧతʹ”ڱ͍”தͰ࠷େݶϦΫΤετΛ͞͹͘ • Resilience: ੑೳ͕ݶΒΕΔ->DDoSʹڧ͘࡞Δ -> stateless • Gracefulness: ݸʑͷίϯϙʔωϯτ΋ॏཁɻscale-in/out࣌ʹӨڹ͕ແ͍Α͏ʹઃܭ
  6. Background and motivation • DCNWͱ͸ࣅͯඇͳΔཁ݅ • High request processing density:

    • LBͷػೳΛSW/hostʹೖΕɺhost΁ͷిྗͱεϖʔεͷඅ༻ରޮՌΛ࠷େԽɻSW਺͸࠷খ, Closߏ੒ͱ͸ҧ͏ • ैདྷͷHW-LB͸ిྗɾεϖʔεޮ཰ѱ͍ɺ SW-LB͸thrughput, latency͕ѱ͍ • 32 host @25G, 4 SW w/ full-meshed. 1.28Tbps = 40G*32host, 100Maglev ! • Traffic surges: • DDoSରࡦʢ͍͖ͳΓ਺ඦഒͷτϥϑΟοΫ͕ൃੜʣ • SilkRoad (10M-conn, ASIC/SRAM)ɺఆৗతʹͦΕΛ௒͑Δ΋ͷΛड͚͍ͯΔ • Magrev/Duet, ઀ଓ਺͕૿͑ΔͱύϑΥʔϚϯε͕௿Լ • Host churn: • ਺ेnodeɺscale-in/outӨڹ͕૬ରతʹେ͖͍ɻdrainແࢹͰ͖ͳ͍ɻ • ਖ਼ৗʹfailover͠ͳ͍ͱPOP/ProviderؒͰτϥϑΟοΫ͕churn͞ΕΔ • cloud serviceͳͷͰsoftware upgrade(࣌ͷfailover)͸౰ͨΓલʹߦΘΕΔ • Faults࣌͸ࣗಈupgradeࢭ·ΔɻBGPௐ੔ͰSW΋upgrade͞ΕΔ
  7. Design: Faild (1/3) • Consistent hashing • SW಺Ͱnext-hop(ECMP VIP-set), ARP

    lookup, output I/FΛࢦఆ • SWυϨΠϯ࣌͸BGP-adΛൈ͘ɻશSW͸ಉҰͷhashΛ࣋ͭ • next-hop਺Ͱ෼ࢄ౓͕ܾఆ->MACͰߋʹ෼ࢄ • ࠷େ਺͸εΠον࢓༷ɺϕϯμʔC-hash͸࢖Θͳ͍ • ্هΛagent͕ϋϯυϦϯά • ͜Ε͚ͩͰ͸طଘϑϩʔʹӨڹग़ͯ͠·͏ • ϗετଆͰ΋޻෉͕ඞཁ Eth1 Eth2 Eth3 Eth4 Eth5 Eth6 port
  8. Design: Faild (1/3) • Consistent hashing • SW಺Ͱnext-hop(ECMP VIP-set), ARP

    lookup, output I/FΛࢦఆ • SWυϨΠϯ࣌͸BGP-adΛൈ͘ɻશSW͸ಉҰͷhashΛ࣋ͭ • next-hop਺Ͱ෼ࢄ౓͕ܾఆ->MACͰߋʹ෼ࢄ • ࠷େ਺͸εΠον࢓༷ɺϕϯμʔC-hash͸࢖Θͳ͍ • ্هΛagent͕ϋϯυϦϯά • ͜Ε͚ͩͰ͸طଘϑϩʔʹӨڹग़ͯ͠·͏ • ϗετଆͰ΋޻෉͕ඞཁ
  9. Design: Faild (2/3) • Encoding failover decisions • L4ͷҡ࣋͢ΔͨΊɺѼઌMACʹҙຯΛ࣋ͨͤΔ •

    Current target • Previous target • failover࣌ɺCurr/PrevΛຒΊࠐΜͰαʔό΁సૹ͢Δ • ྫɿ͖ͬ͞·ͰBͰॲཧɻBΛfailover͠AʹҠߦ Eth5 Eth1
  10. Design: Faild (3/3) • Host-side processing • ARP/ND͸agent/controllerͰ੍ޚ • failover࣌ɺlocalॲཧ͔సૹ͔બ΂Δ

    • ϑϩʔ͸ͦͷ··ϗετʹ࢒Δʂޮ཰త • ΧʔωϧϞδϡʔϧͱ࣮ͯ͠૷ A: ৽ن௨৴(SYN) or AͱͷطଘͳΒAॲཧ͢Δ ɹͦ͏Ͱͳ͚Ε͹Bʹసૹ͢Δ B: ηογϣϯʢsocketʣΛҡ࣋ͯ͠ॲཧ
  11. Implementation • Python: 3.5k LoC • control-plane in userspace daemon

    on ൚༻εΠον • ϕϯμʔAPI. OpenFlow/P4/SAIͰ΋Ҡ২Մೳ • ର৅table • Routing table: ECMP VIP-set • ARP table: Ծ૝MAC mapping • Bridging table: Ծ૝MACѼ௨৴Λ”ͲͷI/F͔Βग़͔͢”Λࢦఆɻ • LLDPͰ΋ߏ੒Մೳ • ϔϧενΣοΫʢup/down/disabledʣ • ࿈ଓతʹdownͷ৔߹͸ECMPʹϑΥʔϧόοΫ͢Δ৔߹΋͋Δ • FIB lookup (ECMPάϧʔϓ)ɺͱ5taple C-hash on SRAM • daemon (Python): 2k LoC • VIPઃఆ • ϔϧενΣοΫ • kernel module 1.2k LOC • Ծ૝MACͷϋϯυϦϯά • ϩʔΧϧॲཧ͔ϦμΠϨΫτ͔ • NIC unicast filterʹԾ૝MACΛ௥Ճ • table਺ݶքͳΒhash-base filter or ϓϩϛεΩϟεϞʔυʹҠߦ • SYN-Cookieαϙʔτ • listenΩϡʔ͕͍ͬͺ͍ʹͳΔͱSYN-CookieݕূൃಈʢσϑΥϧτʣ Switch controller Host agent
  12. Evaluation • ߏ੒ • ࠷খPOPߏ੒ (2 SW, 8 host, half-rack),

    400Gbps, 320Krps • ࠷େPOPߏ੒(4 SW, 64 host) • ධՁ • end2endͷτϥϑΟοΫӨڹΛग़ͣ͞ʹdrain • λΠϜϦʔʹdrain • drain࣌ͷlatencyӨڹͳ͠ • drain࣌ͷCPUΦʔόʔϔουͳ͠
  13. Evaluation (1/5): Graceful failover • εΠονͷdrain/refill • τϥϑΟοΫ͕ภΔɺ໭Δ • αʔόͷෛՙ͸มΘΒͣ

    -> graceful failover • ྆SWʹಉҰhashઃఆ͕ඞཁ • ϗετͷdrain/refill • ଞ7hostʹ෼ࢄ • drainedϗετ΁ͷϦΫΤετ͕ٸܹʹऩଋ • ϑϩʔ਺ʹґଘ • ͦͷࡍʹϑϩʔ΁ͷӨڹ(reset, retrans)ͳ͠ X X
  14. Evaluation (2/5): Switch reconfiguration time • ARPςʔϒϧߋ৽࣌ؒΛଌఆ • ࣮ߦ࣌ؒ͸ಉ࣌ߋ৽਺ʹൺྫ •

    ϫʔετέʔε • AࣾͷASIC౥ࡌSW: 119ms@95%ile • BࣾͷASIC౥ࡌSW: 134ms@95%ile • े෼௿͍஋ʢಛʹࠔΒͳ͍ʣ • ARPߋ৽͸ΞτϛοΫॲཧɺαʔϏεӨڹͳ͠
  15. Evaluation (3/5): Detour-induced latency • ping/traceroute౳Ͱ͸host socket tableʹ౰ͨΒͣʹଌఆͰ͖ͳ͍ • ଌఆํ๏

    • ඇSYNύέοτΛΘ͟ͱdrainedϗετʹྲྀ͠ɺresetΛ౤͛ͤ͞Δ (f:r) • ௨ৗ࣌ͱroundtrip࣌ؒΛൺֱ • ݁Ռ • 14us@50%ile • 14.6us@95%ile • 19.52us@99%ile • drain͞Ε͍ͯΔϗετΛ௨ͬͨͱ͖ͷΈ஗Ԇ • ௨ৗͷιϑτ΢ΣΞLB(Maglev, Duet)ͷυϨΠϯ஗Ԇ͸50us-1ms Reset drainedϗετ
  16. Evaluation (4/5): Host overhead • FaildͷΧʔωϧϞδϡʔϧΦʔόʔϔουΛଌఆ • ελοΫτϨʔεͷ૯਺ΛΧ΢ϯτɺਖ਼نԽͯ͠CPU࢖༻཰Λਪఆ • ֤2෼ؒଌఆ

    • ݁Ռ • drain/refill࣌ͷΦʔόʔϔου͸ඇৗʹগͳ͍ • ฏۉͰ0.22%, ࠷େͰ΋0.5%૿ • ิ଍ • ࣮ݧɿdrain࣌2෼Ҏ಺ʹϑϩʔ͕ऴྃ • ࣮ࡍɿ-70%͕10ඵະຬɺ-85%͕̍෼ະຬɻ୹͍ʂ • Φʔόʔϔου͕খ͍͞ɺ͔ͭɺ࣌ؒͱͱ΋ʹϑϩʔ਺΋ٸݮ
  17. Evaluation (5/5): Load balancing accuracy • ECMP͕HW࣮૷ɺ͔ͭۉҰʹෛՙ෼ࢄ͞ΕΔඞཁ͋Γ • ݕূ •

    SW ECMPͰ2ϗετ෼ࢄ • MAX/AvgΛଌఆ • AࣾBࣾڞʹ΄΅1ʹʹऩଋɺ඼࣭ྑ͍
  18. Operational experience (1/2) • Faild͸γϯϓϧͳNWಈ࡞ɾτϥϑΟοΫΛԾఆͯ͠࡞ΒΕͨ • ӡ༻ܦݧʹরΒ͠߹ΘͤͯԾఆΛ࠶ݕ౼ • Recursive draining

    and POP upgrades • drain͍ͯ͠ΔϗετΛdrainͰ͖ͳ͍ɻdrainػ͕ࢮ͵ͱΞ΢τʢ̎ॏো֐ʁʣ • ͦ΋ͦ΋ඞཁͳ͍ɺ਺෼଴ͯ͹drain͞ΕΔɻ࠶ؼతdrainΛ࣮૷͢Δͱෳࡶੑ͕૿͢ͷͰ΍Γͨ͘ͳ͍ɻ • Scalability challenges • େن໛Խʹ͸ෳ਺ϨΠϠʔߏ੒ʢSpine/Leafతͳʣ͕ඞཁɻো֐ϗετͱ௚ऩ͞Εͯͳ͍SWΛಉظͤ͞Δඞཁ͋Γ • MACΞυϨεΤϯίʔυ͔ΒPOP࠷େαΠζ͸256ϗετʢվળ͸༻ҙʣɻ͍·ͷ͜ͱ256ϗετʹ͸͍͍ۙͮͯͳ͍ • IPv6Խ΋༰қɻIP-in-IPΧϓηϦϯάͰL3Խ΋Ͱ͖Δ
  19. Operational experience (2/2) • ECMP hashing assumptions • ECMPͰ࠷େ6ഒ͕ࠩग़Δ΋ͷ •

    {1, 2, .., 15} * 2^n͔͠ઃఆͰ͖ͳ͍ASICɻ • ྫɿ63ݸͷnext-hopΛઃఆͯ͠΋60ݸ (15*2^2)ɺ࢒Γ3hop͸શ͘సૹ͞Εͳ͍ • I/F൪߸ΛECMPͷhashܭࢉʹ࢖͏ʢͭ·Γϗετ͝ͱʹܭࢉ͕ҧ͏ʣ • ϥΠϯΧʔυͷbootॱ͕hashͷseedͷϕϯμʔ΋ʂ • Protocol assumptions • ϑϥάϝϯτύέοτ͕དྷͨ৔߹ɺ5taple hashܭࢉ஋͕ҟͳΔͨΊҧ͏ϗετ΁సૹ͞Εͯ͠·͏ɻɻ • ͕ɺIPv4ύέοτ͸΄΅શͯDon’t Fragment bitཱ͕͍ͬͯΔʢݸਓͷݟղʣɻPv6͸ͦ΋ͦ΋໰୊ͳ͠ • ECN෇͖ύέοτͷϦηοτύέοτ͕૿Ճʢ2015೥ͷiOS/OSXͰσϑΥϧτ༗ޮͱಉ࣌ʣɻถࠃҰࣾɺٴ ͼถࠃҎ֎ͷෳ਺ΦϖϨʔλͰ؍ଌ • ҟͳΔpathΛ௨ΔՄೳੑ -> ECNωΰγΤʔγϣϯΛఀࢭ
  20. Related work • HW SWͷΈͷECMPͰ͸host௥ՃͰ͖ͳ͍ • Ananta, Maglev͸ϑϩʔຖͷstateΛ࣋ͬͯ͠·͏ • Duet,

    Rubik: HW/SWΛ૊Έ߹ΘͤɻHW ECMPΛSWʹҠ২͢Δͷ͸ඇݱ࣮త • SilkRoad: HW SRAMʹϑϩʔΛѹॖ֨ೲɻDDoSʹऑ͍ • Beamer: Faildʹ͍ۙΞϓϩʔνɻ͕ɺઐ༻LBϗετͱίϯτϩʔϥ͕ඞཁ • Faildɿ ҰͭҰͭ͸ٕज़͸ผͷ໾ׂͰ΋࢖ΘΕΔɻӡ༻ܦݧʹΑͬͯͦΕΒΛ߹Θͤͯઃܭɻ •
  21. Conclusion • DC಺Ϋϥ΢υΑΓ਺ܻߴີ౓ͳPOP(ΤοδΫϥ΢υ)Ͱಈ࡞ • τϥϯεϙʔτΞϑΟχςΟΛαϙʔτ͢Δstale-less LB • Graceful failover •

    Φʔόʔϔουগͳ͍ • DDoSड͚ʹ͍͘ʢϦιʔε࠷దԽͱεέʔϥϒϧʣ • drain஗Ԇগ • աڈ4೥ؒͷܦݧଇΛ൓ө • faildͰ7Mrps͞͹͍ͨܦݧ • HW੍ݶ΍ඇ௚ײతͳϓϩτίϧ૬ޓ࡞༻
  22. EoP