Research Paper Introduction #08: NetBouncer: Active Device and Link Failure Localization in Data Center Networks

cafenero_777

May 04, 2021

Transcript

  1. Research Paper Introduction #8 “NetBouncer: Active Device and Link Failure Localization in Data Center Networks” @cafenero_777 2020/02/18
  2. $ which • NetBouncer: Active Device and Link Failure Localization in Data Center Networks • Cheng Tan1, Ze Jin2, Chuanxiong Guo3, Tianrong Zhang4, Haitao Wu5, Karl Deng4, Dongming Bi4, and Dong Xiang4 • 1New York University, 2Cornell University, 3Bytedance, 4Microsoft, 5Google • Networked Systems Design and Implementation (NSDI ’19)
  3. Agenda • Overview and why I wanted to read this paper • Abstract • Introduction • NetBouncer overview • Path probing via packet bouncing • Probing plan and device failure detection • Link failure inference • Simulation studies • Implementation and evaluation • Deployment experiences • Conclusion
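  4. Overview and why I wanted to read this paper • Overview • Pinpoints failure locations (devices and links) in large-scale Clos networks • Three years of production track record (as of ’19); shares lessons noticed during operation • Why I wanted to read it • Locating failures in large-scale Clos networks (and in non-Clos ones too) is painful, and I want to improve it • Reportedly better than Pingmesh • In operation at Azure for about three years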
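  5. Introduction • Modern data centers and availability • Large-scale Clos topologies (tens of thousands of switches) and servers (millions), with redundancy via ECMP • A very large number of potential failure points (devices, links, etc.) • Some failures (especially half-dead ones: gray failures) can spread to the whole network, and the root cause is hard to find during an incident • Existing approaches • SNMP and NetFlow cannot detect gray failures • Special hardware, hypervisor (HV) modifications, or tweaking protocol bits are hard to roll out in production • Goal: locate failures accurately (in terms of probability) and precisely (down to the failing point), without FP or FN • NetBouncer • A framework that can localize device failures and link failures • Path probing based on IP-in-IP, probing with a set of paths that makes links identifiable, with inconsistent data handled by an inference algorithm • Already operated at Azure for three years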
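  6. NetBouncer overview • Probing plan design • The controller designs the paths • Probing every path is impractical in reality • Assumes most of the network is healthy • Efficient path probing via IP-in-IP • Encapsulate with IP-in-IP and send to a top-layer switch (Spine) • Processed in switch hardware • Results are forwarded to the Processor • NetBouncer’s targets and limitations • Assumes non-transient failures (state does not change during probing) • FP and FN can still occur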
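  7. Path probing via packet bouncing • Requirements for probing • Must be able to pin down the routing path (under ECMP) • Must keep CPU and network resource usage low (Pingmesh!!!) • IP-in-IP basics • The encapsulated packet reaches the intended dst • Processed in hardware inside (modern) switches • Packet bouncing • IP-in-IP up to the Spine; in practice only a few levels of encapsulation • The return leg is decapsulated and bounces back, so both directions are evaluated (close to real traffic) • Sender and receiver are the same server, which keeps things simple (no need to consider a separate failure on a second node)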
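The bouncing idea on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Packet` class, `make_probe`, `switch_receive`, and the host and switch names are hypothetical and model headers as plain Python objects rather than real IP-in-IP packets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    """Toy header: only source, destination, and an optional encapsulated packet."""
    src: str
    dst: str
    inner: Optional["Packet"] = None  # set when this is an IP-in-IP outer header


def make_probe(server: str, spine: str) -> Packet:
    """The server wraps a packet addressed to itself inside one addressed to the spine."""
    inner = Packet(src=server, dst=server)
    return Packet(src=server, dst=spine, inner=inner)


def switch_receive(switch: str, pkt: Packet) -> Packet:
    """Decapsulate if the outer header targets this switch; otherwise pass it on unchanged."""
    if pkt.dst == switch and pkt.inner is not None:
        return pkt.inner
    return pkt


probe = make_probe("server-a", "spine-1")
at_leaf = switch_receive("leaf-3", probe)     # not the target: still encapsulated
bounced = switch_receive("spine-1", at_leaf)  # target spine: inner packet heads back
```

After the spine decapsulates, `bounced.dst` is `"server-a"`, so the probe returns to the server that sent it, which is why a single agent can act as both sender and receiver.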
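  8. Probing plan and device failure detection • Underlying model • The success probability of path j (Yj) is the product of the probabilities of all links on it (Xi) • Real-world challenges for path selection • Sometimes there is no unique solution • Link-identifiable probing plan • Hard to solve in general • Probe all paths (from hosts up to the top-level Spines) for identification • If nearly everything is assumed healthy, the estimate is accurate • Device failure detection • If the same kind of paths are all failing across the full results, treat it as a device failure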
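The two ideas on this slide can be sketched in a few lines. The function names, link and device names, and probe outcomes below are hypothetical, and the "every path through the device failed" rule is a simplified reading of the slide, not necessarily the exact rule used by NetBouncer.

```python
from math import prod


def path_success(link_success, path):
    """Yj: product of the success probabilities Xi of the links on the path."""
    return prod(link_success[link] for link in path)


def suspect_devices(path_devices, path_ok):
    """Return devices for which every probed path through them failed."""
    outcomes = {}
    for path, devices in path_devices.items():
        for device in devices:
            outcomes.setdefault(device, []).append(path_ok[path])
    return sorted(d for d, oks in outcomes.items() if not any(oks))


# Hypothetical link success probabilities (Xi).
link_success = {"host-leaf1": 1.0, "leaf1-spine1": 0.5, "leaf1-spine2": 1.0}
y_via_spine1 = path_success(link_success, ["host-leaf1", "leaf1-spine1"])

# Hypothetical probe outcomes per path, and the switches each path crosses.
path_devices = {
    "p1": ["leaf1", "spine1"],
    "p2": ["leaf1", "spine2"],
    "p3": ["leaf2", "spine1"],
    "p4": ["leaf2", "spine2"],
}
path_ok = {"p1": False, "p2": True, "p3": False, "p4": True}
suspects = suspect_devices(path_devices, path_ok)
```

Here `y_via_spine1` is 0.5, and `suspects` is `["spine1"]` because both paths through spine1 failed while each leaf still has one working path.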
  9. None
  10. None
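  11. Link failure inference • Link failure detection is the problem of inferring the per-link Xi from all end-to-end path observations Yj (it cannot be solved with linear algebra or least squares) • Data inconsistency • Simultaneous measurement is impossible! Accidental errors occur -> a cause of false detections. A mitigation is needed • NetBouncer’s latent factor model • Algorithm for link failure inference • Uses Coordinate Descent • An order of magnitude better performance than Stochastic Gradient Descent (hundreds of GB/h) • Tends to polarize into up/down • Converges easily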
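A small coordinate descent sketch may make the inference step more concrete. The slide does not give the formula, so the objective below is an assumption: squared error between each observed Yj and the product of its link estimates, plus a regularizer `lam * x * (1 - x)` that pushes each estimate toward 0 or 1. The function names and the tiny topology are hypothetical.

```python
from math import prod


def objective(x, paths, y, lam):
    """Assumed objective: squared fit error plus lam * sum(x_i * (1 - x_i))."""
    fit = sum((y[j] - prod(x[i] for i in links)) ** 2 for j, links in enumerate(paths))
    reg = lam * sum(xi * (1.0 - xi) for xi in x)
    return fit + reg


def coordinate_descent(paths, y, n_links, lam=1.0, sweeps=50):
    """Minimize the objective one link at a time, keeping each x_i in [0, 1]."""
    x = [1.0] * n_links  # start from "every link is healthy"
    for _ in range(sweeps):
        for i in range(n_links):
            r = t = 0.0
            for j, links in enumerate(paths):
                if i in links:
                    c = prod(x[k] for k in links if k != i)  # product of the other links
                    r += c * y[j]
                    t += c * c
            # With the other links fixed, the objective is quadratic in x_i, so the
            # minimum over [0, 1] is an endpoint or the (clipped) stationary point.
            candidates = [0.0, 1.0]
            denom = 2.0 * t - 2.0 * lam
            if denom > 0:
                candidates.append(min(1.0, max(0.0, (2.0 * r - lam) / denom)))
            x[i] = min(
                candidates,
                key=lambda v: objective(x[:i] + [v] + x[i + 1:], paths, y, lam),
            )
    return x


# Hypothetical data: link 0 is shared, link 1 drops 80% of probes, link 2 is healthy.
paths = [[0, 1], [0, 2]]
y = [0.2, 1.0]
estimate = coordinate_descent(paths, y, n_links=3, lam=1.0)
```

On this input the estimate settles at `[1.0, 0.0, 1.0]`: the lossy link is driven to 0 and the others to 1, which matches the slide's remark that the result tends to polarize into up/down.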
  12. Simulation studies (1) • Simulation setup • 2.8k switches (48-port), 27.6k servers, 82.9k links, 3-tier Clos • loss (0.2 to 1), non-loss (0 to 0.001) • 10 randomly chosen devices are failed • Probing plan • Compare NetBouncer with hop-by-hop (probing every switch) • Device failure detection • With versus without detection enabled • NetBouncer design choices • Comparison against the non-convex form, tuning of λ • Comparison with existing systems • DeTector, NetScope, KDD14
  13. Simulation studies (2) • CD is an order of magnitude faster than SGD • A λ of around 1 works well • Better than prior work on both FN and FP

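  14. Implementation and evaluation (1/2) • Implementation • Controller: made redundant with multiple replicas • Agent: receives the Probing Plan (path, packet size, UDP destination port, TTL, ToS value, etc.) from the controller and actually sends the packets • Processor: • Front end: collects results from the agents • Back end: runs the inference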
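As an illustration of what an agent might receive, the probing plan fields named on this slide can be modeled as a small record. The class name, field names, example values, and the use of JSON as a wire format are all assumptions made for this sketch; the slide lists only the kinds of fields.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ProbePlanEntry:
    """One probe description; field names are hypothetical."""
    path: List[str]    # switches to bounce through, in order
    packet_size: int   # bytes
    udp_dst_port: int
    ttl: int
    tos: int


# Hypothetical values, chosen only for illustration.
entry = ProbePlanEntry(
    path=["leaf-3", "spine-1"],
    packet_size=128,
    udp_dst_port=50000,
    ttl=64,
    tos=0,
)

wire = json.dumps(asdict(entry))               # controller side: serialize
restored = ProbePlanEntry(**json.loads(wire))  # agent side: parse
```

The round trip leaves `restored` equal to `entry`, so the agent sees the same plan the controller produced.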
  15. Implementation and evaluation (2/2) • Data processor runtime evaluation • 2.4 GHz Xeon, 24 cores (48 HT), 128 GB • Windows Server 2016! • Data for the whole world: 130 GB/h • 1 thread = 1 region, about 30 in parallel
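  16. Deployment experiences (1/2) • Time to finish handling a gray failure went from days to minutes! • Toward a deeper understanding (silent packet drop, congestion, link/route flapping, reboot) • Case 1: spine router gray failure • The Spine was half-dead • Problem in a line card, with wide impact • Pingmesh could only detect it • 15% loss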
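  17. Deployment experiences (2/2) • Case 2: polarized traffic • A firmware bug skewed ECMP and caused congestion (35% affected) • Fixed by regenerating the key of the ECMP hash function • Case 3: miscounting TTL • Found that TTL was being decremented by 2 at a time; this was also a firmware bug • FN, FP • FN: DHCP OFFER could not be received. The cause was a faulty NIC. Not detected (FN) • A switch ACL misconfiguration was dropping only specific IPs. Not detected (FN)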
  18. Conclusion • NetBouncer • Infers device failures and link failures from path probing • Better performance (Error/FP/FN, computation time) than existing work • Has run at Azure for three years and works well in practice
  19. EoP