Research Paper Introduction #21 Running BGP in Data Centers at Scale

cafenero_777

May 14, 2021

Transcript

  1. Research Paper Introduction #21 “Running BGP in Data Centers at Scale” (#71 overall) @cafenero_777 2021/05/13
  2. Agenda
    • Target paper
    • Overview and why I read it
    1. Introduction
    2. Routing Design
    3. Routing Policies
    4. BGP in DCs versus the Internet
    5. Software Implementation
    6. Testing and Deployment
    7. Future Work
    8. Related Work
  3. Target paper
    • Running BGP in Data Centers at Scale
    • Anubhavnidhi Abhashkumar#*†, Kausik Subramanian#*, Alexey Andreyev◇, Hyojeong Kim◇, Nanda Kishore Salem◇, Jingyi Yang◇, Petr Lapukhov◇, Aditya Akella#, Hongyi Zeng◇
      • # University of Wisconsin - Madison
      • ◇ Facebook
      • * Work done while at Facebook. Authors contributed equally to this work.
      • † Currently works at ByteDance.
    • NSDI 2021
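  4. Overview and why I read it
    • Overview
      • BGP inside the DC is operated differently from BGP on the Internet
      • Reliability is improved through the design of ASNs, route summarization, and the BGP policy set
      • Introduces the BGP agent, the CD pipeline, and the past 2 years of operations and incidents
    • Why I read it
      • Curious how BGP is actually operated on a Clos network
      • How backup paths are set up (BGP community attribute: tag)
      • Looking for hints on handling IP Anycast traffic well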
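  5. 1. Introduction
    • Until now
      • L2 tree topologies: BUM traffic, and port blocking limits usable bandwidth
      • SDN: centralized, so deploying to tens of thousands of nodes at once and detecting/handling failures is hard
    • From now on
      • DC BGP (L3) Clos topology: there is no information on large-scale, practical design and operation -> this paper provides it
      • BGP: BGP itself has a 25+ year track record with ISPs
        • Highly scalable, hop-by-hop policy, easy to debug because it runs over TCP, available on vendor equipment, usable by any network engineer
      • Presents FB's approach and results
        • Uniform design (operations and configuration) is the foundation: ASN/topology/summary scheme
        • Routing policies: route propagation, backup paths, failure isolation, behavior on route addition and removal
        • Weaknesses of BGP as used on the Internet, and how they are overcome when BGP is used in the DC
        • in-house BGP agent: implements only the minimum necessary RFCs, CI/CD
        • Even so, incidents cannot be avoided: based on real experience, how testing was expanded and the BGP design improved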
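  6. 2. Routing Design (1/3)
    • eBGP only
      • Not combined with any IGP (cf. the failure blast radius when OSPF is used)
      • Migrated from BGPd on vendor equipment to custom HW (FBOSS) and an in-house BGPd
    • Topology design
      • 48 RSWs per pod, 16 FSWs per pod
      • SSWs scale out with the number of pods
    • Routing design principles: uniformity & simplicity
      • Configuration is uniform within each tier
      • An abstract configuration is translated into device-specific configuration and pushed to the target devices
      • Who does this? -> FB: Robotron, Google: Espresso
      • https://research.fb.com/publications/robotron-top-down-network-management-at-facebook-scale/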
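  7. 2. Routing Design (2/3)
    • Peering
      • peer-groups are used to keep settings and parameters uniform
      • Point-to-point, single-hop eBGP sessions
      • Each session maps 1:1 to a physical interface, which makes fault isolation easy
    • Load sharing
      • ECMP is used. WCMP is not (to keep the FIB as small as possible)
      • BGP best-path selection: paths that tie through step 8 are used for ECMP (on Cisco)
      • https://www.cisco.com/c/ja_jp/support/docs/ip/border-gateway-protocol-bgp/13753-25.html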
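The ECMP bullet above can be illustrated with a minimal sketch of hash-based next-hop selection. This is not FB's implementation: the hash function, the 5-tuple flow key, and the names `ecmp_next_hop` and `fsws` are assumptions chosen for illustration only.

```python
import hashlib


def ecmp_next_hop(flow, next_hops):
    """Pick one of several equal-cost next hops for a flow.

    flow: hypothetical 5-tuple (src_ip, dst_ip, proto, src_port, dst_port).
    next_hops: non-empty list of equal-cost next hops.
    Hashing the flow key keeps every packet of a flow on the same path.
    """
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical example: an RSW with four equal-cost uplinks.
fsws = ["fsw1", "fsw2", "fsw3", "fsw4"]
flow = ("10.0.0.1", "10.0.1.9", 6, 40000, 443)
print(ecmp_next_hop(flow, fsws))
```

With ECMP every next hop has the same weight, so the forwarding table needs one entry per next hop. WCMP would replicate entries in proportion to weights, which is consistent with the FIB-size concern noted on the slide.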
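  8. 2. Routing Design (3/3)
    • 2-byte private AS numbering
      • Every SSW within a given spine plane uses the same, unique private ASN
      • SSWs do not peer directly with each other (so the SSW AS does not loop)
      • Each pod is a confederation AS
      • FSWs and RSWs use sub-ASes (from outside the pod looks like a single AS)
      • Do the FSWs and RSWs in every pod use the same ASNs?
    • Route summarization
      • Fine-grained routes inside a pod should not leak outside the pod
      • Example: per-rack prefixes are advertised only within the pod. Other pods receive the pod-level aggregate prefix
      • Device routing tables hold on the order of a few thousand routes, not hundreds of thousands
      • More efficient use of routing table space on commodity ASICs, and faster convergence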
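The summarization example above (rack prefixes stay inside the pod, one aggregate goes to other pods) can be sketched with Python's standard `ipaddress` module. The address plan (`10.1.0.0/16` for the pod, one `/24` per rack) and the function name `summarize_for_export` are hypothetical and not taken from the paper.

```python
import ipaddress


def summarize_for_export(rack_prefixes, pod_aggregate):
    """Split routes into what is advertised inside vs. outside a pod.

    Inside the pod every rack prefix is visible; outside the pod only
    the single pod-level aggregate is advertised.
    """
    pod = ipaddress.ip_network(pod_aggregate)
    racks = [ipaddress.ip_network(p) for p in rack_prefixes]
    # Every rack prefix must be covered by the aggregate, otherwise
    # advertising only the aggregate would blackhole traffic.
    uncovered = [r for r in racks if not r.subnet_of(pod)]
    if uncovered:
        raise ValueError(f"not covered by {pod}: {uncovered}")
    return {"inside_pod": racks, "outside_pod": [pod]}


# Hypothetical pod with 48 racks, one /24 per rack.
racks = [f"10.1.{i}.0/24" for i in range(48)]
routes = summarize_for_export(racks, "10.1.0.0/16")
print(len(routes["inside_pod"]), len(routes["outside_pod"]))
```

Under these assumptions a remote pod carries 1 route for this pod instead of 48, which is the kind of reduction that keeps tables in the thousands of routes rather than the hundreds of thousands.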
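  9. 3. Routing Policies
    • Use uniform policies, which are hard to achieve with inter-ISP BGP, to improve the stability of the DC network
    • Reliability
      • Use tags and distribute backup paths in advance
      • Example: if the fsw1-rsw1 link goes down, nothing is advertised (no withdrawal) to the SSWs. Traffic goes via rsw2 instead. The SSWs "never notice" this failure
      • Each device has about 30 match/action rules across inbound and outbound
      • Communities and AS_PATH regexes; prefixes are not written directly!
    • Maintainability
      • Policies change while a device is in the WARM state (traffic being diverted), e.g. changes to route priority or ECMP groups
      • drain <-> undrain happens 242 times/day, 36 seconds each
    • Scalability
      • Routes are summarized per pod to minimize routing table size (and to avoid large-scale advertisements)
    • Service reachability
      • Each instance uses the injector library to advertise its VIP to the RSW, and the VIP is routed as Anycast
    • The amount of change is small but the impact is large, so peer review and testing are mandatory!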
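The tag-based match/action rules described above can be sketched as follows. The evaluation order (first match wins, implicit deny at the end), the community names `RACK_PREFIX`, `POD_AGGREGATE`, and `BACKUP_PATH`, and the local-preference values are all assumptions made for illustration; the slides do not give FB's actual rule set.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    prefix: str
    communities: frozenset
    local_pref: int = 100


@dataclass(frozen=True)
class Rule:
    match_community: str  # community tag to match on (no literal prefixes)
    action: str           # "permit", "deny", or "set_local_pref"
    value: int = 0        # only used by "set_local_pref"


def apply_policy(route, rules):
    """Return the (possibly modified) route, or None if it is denied.

    Assumption: first matching rule wins, and unmatched routes are denied.
    """
    for rule in rules:
        if rule.match_community in route.communities:
            if rule.action == "deny":
                return None
            if rule.action == "set_local_pref":
                return replace(route, local_pref=rule.value)
            return route
    return None


# Hypothetical outbound policy on an FSW towards the SSWs:
# rack-level prefixes stay in the pod, the pod aggregate is exported.
fsw_to_ssw = [Rule("RACK_PREFIX", "deny"), Rule("POD_AGGREGATE", "permit")]

# Hypothetical inbound policy on an FSW: backup paths are accepted in
# advance but with a lower preference, so they are used only on failure.
fsw_inbound = [Rule("BACKUP_PATH", "set_local_pref", 50), Rule("RACK_PREFIX", "permit")]

rack = Route("10.1.7.0/24", frozenset({"RACK_PREFIX"}))
backup = Route("10.1.7.0/24", frozenset({"RACK_PREFIX", "BACKUP_PATH"}))
print(apply_policy(rack, fsw_to_ssw))
print(apply_policy(backup, fsw_inbound).local_pref)
```

Because the rules match on tags rather than on literal prefixes, the same rule list can be applied unchanged to every device in a tier, which fits the uniformity principle from the routing design slides.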
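  10. 4. BGP in DCs versus the Internet (Was BGP really that stable?)
    • BGP convergence
      • Mechanisms that keep the number of routes from growing (rack prefixes only inside the pod, backup paths advertised in advance, policies applied during drain)
    • Routing instability: frequent updates put heavy load on BGPd. Handled through configuration, implementation, and operations
      • AADiff: an additional, different path is announced. Also triggered by MED/LOCAL_PREF settings. Fix: LOCAL_PREF is fixed and MED is not used.
      • WWDup: a withdrawal is sent for a route that is already unreachable. Fix: changed to a stateful implementation.
      • AADup: an existing route is replaced by a (semantically) identical route. Fix: stateful implementation.
      • TUp/TDown: advertisement oscillation caused by interface down/up. Fix: monitoring triggers an automatic drain.
    • BGP misconfigurations
      • Misconfigurations are detected and handled with monitoring and audit tools
      • They affect 1% of the world's prefixes and load the control plane heavily (cf. BGP hijacking)
      • Filtering is applied on both inbound and outbound. Redistribution is not used
      • BGP sessions are not established until the configuration has been deployed (no automatic start)
    • References:
      • https://tex2e.github.io/rfc-translater/html/rfc4098.html
      • http://library.naist.jp/mylimedio/dllimedio/show.cgi?bookid=100019126&oldid=24345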
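The "stateful implementation" fix for AADup and WWDup above can be sketched as a sender that remembers what it last advertised per prefix and suppresses redundant updates. The class name `AdjRibOut` and the attribute tuple are hypothetical; this is one plausible reading of "stateful", not the in-house agent's actual design.

```python
class AdjRibOut:
    """Tracks what was last advertised to one peer, per prefix."""

    def __init__(self):
        self._advertised = {}  # prefix -> last advertised attributes

    def announce(self, prefix, attrs):
        """Return True if an UPDATE must be sent, False if it is a duplicate."""
        if self._advertised.get(prefix) == attrs:
            return False  # AADup: same route announced again, suppress it
        self._advertised[prefix] = attrs
        return True

    def withdraw(self, prefix):
        """Return True if a withdrawal must be sent, False if it is redundant."""
        if prefix not in self._advertised:
            return False  # WWDup: nothing was advertised, suppress it
        del self._advertised[prefix]
        return True


rib = AdjRibOut()
attrs = ("65001 65101", 100)  # hypothetical (AS_PATH, LOCAL_PREF) pair
print(rib.announce("10.1.0.0/16", attrs))  # first announcement is sent
print(rib.announce("10.1.0.0/16", attrs))  # duplicate is suppressed
print(rib.withdraw("10.1.0.0/16"))         # real withdrawal is sent
print(rib.withdraw("10.1.0.0/16"))         # duplicate withdrawal is suppressed
```

A stateless sender would emit all four messages; keeping per-peer state costs memory but removes the redundant two.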
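  11. 5. Software Implementation
    • With OSS, extending an existing implementation is hard and upstreaming takes time -> written in C++.
    • Features are deliberately limited: only BGP (RFC4271, 5065, 4724) and match/action
    • Multithreaded: RIB thread, peer threads, etc.
    • Policy execution: batched and cached per peer group
    • Service VIP advertisement: supports multiple BGP sessions with the same peer address
    • ODS/Thrift: monitoring and statistical analysis of convergence time, number of service peers, etc.
      • The state of each BGPd is visible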
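  12. 6. Testing and Deployment
    • Emulation: link flap, BGPd restart, and config upgrade are covered.
    • Canary: BGPd/config upgrade, downgrade, graceful-restart; observed for about a day
    • Deployment
      • Non-disruptive deployments use GR. Disruptive implementation changes are pooled and applied together at drain time
      • Deployed gradually, phase by phase, while BGPMonitor watches route convergence
      • 9 times a year (1-2 weeks each); upgrades are in progress 52% of the year (!). Deployment completion is over 99%. The remainder is picked up at the next opportunity.
    • Failures still happen
      • Configuration and implementation mistakes. Latent bugs
      • Anomaly monitoring: ODS, net sonar (server-to-server packet loss monitoring), netnorad (latency measurement)
      • 14 incidents in the past 2 years
        • Wrong order when applying in/out policies (change the community first, then change the behavior on the receiving side)
        • Implementation bug in the maximum-route limit; GR wait time mismatch between versions
      • Expand the emulation scenarios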
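The phase-by-phase deployment gated on convergence monitoring can be sketched as a loop that stops as soon as a check fails. The function `phased_rollout` and its `upgrade` and `is_converged` callbacks are hypothetical stand-ins; how BGPMonitor is actually queried is not described in the slides.

```python
def phased_rollout(phases, upgrade, is_converged):
    """Upgrade devices phase by phase, halting if routes stop converging.

    phases: list of lists of device names, smallest phase first.
    upgrade: callable that upgrades one device.
    is_converged: callable returning True when routes have converged.
    Returns (upgraded_devices, completed_all_phases).
    """
    upgraded = []
    for phase in phases:
        for device in phase:
            upgrade(device)
            upgraded.append(device)
        # Gate the next, larger phase on a healthy network.
        if not is_converged():
            return upgraded, False
    return upgraded, True


# Hypothetical run: the convergence check fails after the second phase.
checks = iter([True, False, True])
log = []
result = phased_rollout(
    [["rsw1"], ["rsw2", "rsw3"], ["fsw1", "fsw2"]],
    upgrade=log.append,
    is_converged=lambda: next(checks),
)
print(result)
```

Starting with small phases limits how many devices run a bad release before monitoring catches it; devices left untouched by a halted rollout correspond to the "picked up at the next opportunity" remainder on the slide.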
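  13. 7. Future Work
    • Policy verification: cannot be computed at FB scale..
      • Intent-based approaches cannot model FB's environment and cannot be extended either..
      • (Incidentally, Google's Orion is intent-based)
    • Expand the emulation scenarios
      • Protocol validation and fuzz testing look feasible, but combined HW/SW scenarios are hard..
    • During failures WCMP would be preferable to ECMP (but will it really be used?)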
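  14. 8. Related Work
    • DC BGP: differences from RFC7938 (Use of BGP for Routing in Large-Scale Data Centers)
      • Uses BGP confederations for pods (= clusters), and makes heavy use of route summarization and policy
      • Handling this with SDN is something only Google (in both scale and technology) can do.
    • Operation framework
      • MS's CrystalNet validates all operational procedures for SONiC
      • FB built and uses something like Janus (a risk-based update planner) in-house
      • Most incidents occur during operational work
    • EdgeFabric, Espresso: BGP at the edge, mainly for CDN management?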