cafenero_777
September 26, 2021

Research Paper Introduction #27 “Sailfish: Accelerating Cloud-Scale Multi-Tenant Multi-Service Gateways with Programmable Switches”

https://dl.acm.org/doi/10.1145/3452296.3472889
Alibaba Cloud's Paper on Luoshen Cloud Network Management Has Entered SIGCOMM Again

Transcript

  1. Research Paper Introduction #27 “Sailfish: Accelerating Cloud-Scale Multi-Tenant Multi-Service Gateways with Programmable Switches”
     • #82 overall, @cafenero_777, 2021/09/09
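  2. Three-line summary
     • Sailfish: an ultra-fast cloud GW built with P4 and programmable switch ASICs
     • In production at Alibaba Cloud for the past two years
     • Works around the memory capacity limits of programmable ASICs with several techniques (memory compression, running x86-based cloud GWs alongside, etc.)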
  3. Agenda
     • Target paper
     • Overview and why I read it
     1. INTRODUCTION
     2. BACKGROUND AND MOTIVATION
     3. HARDWARE GATEWAY AND CHALLENGES
     4. DESIGN AND IMPLEMENTATION
     5. EVALUATION
     6. EXPERIENCES
     7. RELATED WORK
     8. CONCLUSION AND FUTURE WORK
  4. Target paper
     • Sailfish: Accelerating Cloud-Scale Multi-Tenant Multi-Service Gateways with Programmable Switches
     • Tian Pan†, Nianbing Yu†, Chenhao Jia†, Jianwen Pi†, Liang Xu†, Yisong Qiao†, Zhiguo Li†, Kun Liu†, Jie Lu†, Jianyuan Lu†□, Enge Song†, Jiao Zhang∗, Tao Huang∗, Shunmin Zhu⋆†□
     • †Alibaba Group, ∗Purple Mountain Laboratories, ⋆Tsinghua University, □Co-corresponding authors
     • ACM SIGCOMM ’21
     • https://dl.acm.org/doi/10.1145/3452296.3472889
     • Alibaba Cloud's Paper on Luoshen Cloud Network Management Has Entered SIGCOMM Again
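  5. Overview and why I read it
     • Overview
       • Sailfish: a cloud GW built with P4 and a programmable switch ASIC
       • Achieves 2 us latency, 3.2 Tbps, and 1.8 Gpps
       • In operation for more than two years; it also withstood the Double 11 peak
     • Why I read it
       • A production use case of P4
       • Design and implementation of a public cloud gateway
     • Notes: Sailfish is the fish of that name (https://en.wikipedia.org/wiki/Sailfish); Double 11 is Alibaba's Tmall Double Eleven (Singles' Day)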
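  6. 1. INTRODUCTION
     • Public cloud: provides computing resources (server/storage/network) to each tenant
     • Cloud gateway:
       • GW for inter-tenant (inter-VPC) traffic and for traffic to the internet or to customer DCs (VXLAN termination)
       • Alibaba's e-commerce traffic: N Tbps
       • Identifies tenants and VMs and forwards statefully or statelessly (on the order of millions of entries)
       • Throughput, table size, and table variety -> a scale-out strategy
       • A distributed system of this scale costs far too much in CapEx and OpEx; "heavy-hitter flows" overload individual CPU cores; per-core CPU performance growth is slowing
     • Sailfish: a multi-tenant, multi-service gateway on a P4 (programmable) switch ASIC
       • Reports lessons from developing, designing, and configuring the world's first multi-tenant/multi-service GW on programmable switches
       • Table compression (SRAM: 38% reduction, TCAM: 96% reduction)
       • Per node, compared with x86: 95% lower latency, 20x bps (3.2 Tbps), 71x pps (1.8 Gpps)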
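  7. 2. BACKGROUND AND MOTIVATION (1/3) Gateway for Cloud Networks
     • Each customer tenant is given (what looks like) its own independent NW/SDN
       • Tenants (VPCs) are isolated with VXLAN or similar and identified by VNI; the underlay forwards packets normally
     • Cloud Gateway
       • Forwards east-west, north-south, and IDC/cross-region traffic
       • Example: IDC -> VM traffic
     • Inside the Cloud Gateway
       • Finds the forwarding target from the destination (VM) address, then forwards
       • Selects the scope (region/IDC/VPC) from DIP/VNI
       • Rewrites the outer IP (= HV IP) according to DIP
     • Figure notes: NC = Next-hop Compute-node; which tenant is the packet for, and which node is it on?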
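  8. 2. BACKGROUND AND MOTIVATION (2/3) Evolution of Software Gateways
     • Durable architecture
       • Stable forwarding (vs. bugs and transient crashes)
       • Continuously and easily upgradable as traffic grows
     • XGW-x86: DPDK based, 1 Mpps/core
       • Deployed as separate clusters: skewed utilization raises CapEx, and skewed usage raises OpEx
       • Addressed by consolidating feature modules (a general-purpose GW) and by hardware advances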
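  9. 2. BACKGROUND AND MOTIVATION (3/3) Limitations of Software Gateways
     • Rising CapEx and OpEx
       • Handling 15 Tbps takes at least 150 nodes even with 100G NICs
       • O($10k)/node -> ~O($10M)/region
       • Scale-out mode: synchronizing forwarding tables across all GWs is costly, and troubleshooting is hard
       • ECMP is limited to 64 or fewer paths, so clusters must be small and numerous
     • Traffic bursts (Double 11, etc.) load the cores -> packet loss
     • Core unbalancing: "heavy hitters" eat up the traffic capacity
       • Run-to-completion model: 5-tuple hash -> RSS -> distribution across cores
       • Pipeline model: on x86 CPUs, L3 cache misses degrade performance
     • CPU performance growth is slowing relative to network ASICs (40x vs. 4x)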
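The sizing argument on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the best case in which every x86 node forwards at the full line rate of one 100 Gbps NIC.

```python
import math

def min_nodes(total_tbps: float, nic_gbps: float) -> int:
    """Lower bound on gateway nodes if each node forwards at NIC line rate."""
    return math.ceil(total_tbps * 1000 / nic_gbps)

def min_clusters(nodes: int, ecmp_limit: int = 64) -> int:
    """ECMP fans out to at most `ecmp_limit` next hops, so a single
    cluster cannot contain more gateway nodes than that."""
    return math.ceil(nodes / ecmp_limit)

nodes = min_nodes(15, 100)       # 15 Tbps over 100G NICs
clusters = min_clusters(nodes)   # split to respect the ECMP limit
print(nodes, clusters)
```

Under these assumptions the 15 Tbps region needs at least 150 nodes, and the 64-way ECMP limit forces them into at least three clusters, which is where the table-synchronization and operations cost comes from.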
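  10. 3. HARDWARE GATEWAY AND CHALLENGES (1/3) Hardware Options
     • Fixed-function ASIC
       • Good performance, power consumption, and price, but a poor fit for a cloud gateway
       • No programmability; cannot support in-house features (vtrace, etc.); long time to market
     • FPGA
       • Programmable, but the learning curve is too steep; better suited to host-side use (~400 Gbps)
     • Programmable ASIC
       • Combines the advantages of ASICs and FPGAs
       • An XGW-x86 GW and a Tofino switch have the same unit price -> replacing one with the other boosts forwarding performance dramatically
       • Tofino was chosen as of the end of 2017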
  11. 3. HARDWARE GATEWAY AND CHALLENGES (2/3) Programmable Switching ASICs
     • Four independent pipelines
     • Each stage has its own independent SRAM/TCAM
     • SRAM: O(10MB); TCAM: far smaller
     • Metadata (action/lookup results) is shared across a whole ingress pipeline or a whole egress pipeline
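  12. 3. HARDWARE GATEWAY AND CHALLENGES (3/3) Technical Challenges
     • Scale of per-tenant table entries in a public cloud
       • Entry count: VXLAN routing table: O(1M), VM-NC mapping table: O(1M)
       • Size: each entry grows by the VNI (24 bits)
       • On-chip memory is only about O(10MB), which makes this hard
       • Other tables: O(100M) entries for SNAT, LB, ACL, QoS, metering, …
       • Growing IPv6 use: more bits per key, and the ratio of v4 to v6 tables has to be chosen
     • Hardware architecture constraints (using Tofino as the example)
       • How memory is allocated per stage and per pipeline
       • Metadata is not shared between ingress and egress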
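  13. 4. DESIGN AND IMPLEMENTATION (1/4) Design Overview
     • Approaches to the on-chip memory shortage
     • (a) HW/SW co-design
       • The bulk of the traffic, which hits a small number of table entries, goes to XGW-H
       • Everything else (traffic that hits the huge tables) goes to XGW-x86
     • (b) Table splitting across clusters
       • One cluster serves only a subset of the tenants
       • Several clusters together serve one region
     • (c) Table compression on a single node
       • Pipeline folding, table splitting/mapping across pipelines, memory resource pooling, TCAM conservation, table entry compression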
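  14. 4. DESIGN AND IMPLEMENTATION (2/4) Hardware and Software Co-Design
     • Programmable ASIC (XGW-H) + Software GW (XGW-x86)
     • In practice, 5% of the table entries carry 95% of the traffic (the 80/20 rule)
     • XGW-H sits in front (most of the traffic, small tables)
       • On a miss, the packet is forwarded to XGW-x86 under a rate limit
     • XGW-x86 sits behind it (everything else: long-tail/stateful, frequently updated, and large tables)
       • Smaller services are also handled here
       • Example: SNAT
     • For an environment of tens of Tbps: hundreds of XGW-x86 nodes -> 10 XGW-H nodes + 4 XGW-x86 nodes
     • Figure labels: VNI identification, SNAT required, NAT processing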
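The two-tier lookup described on this slide can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and function names are hypothetical, the hardware table is modeled as a plain dict keyed by (VNI, destination IP), and dropping packets that exceed the rate limit is an assumption (the slide only says the miss path is rate-limited).

```python
class TokenBucket:
    """Token bucket; time is passed in explicitly so the sketch is deterministic."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate        # tokens (packets) refilled per second
        self.burst = burst      # bucket capacity
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def forward(vni, dst_ip, hw_table, bucket, now):
    """Decide the path of one packet: XGW-H fast path, XGW-x86 fallback, or drop."""
    next_hop = hw_table.get((vni, dst_ip))
    if next_hop is not None:
        return ("XGW-H", next_hop)     # hit in the small on-chip table
    if bucket.allow(now):
        return ("XGW-x86", None)       # miss: hand over to the software GW
    return ("drop", None)              # assumption: excess miss traffic is dropped


hw_table = {(100, "10.0.0.5"): "192.168.1.7"}   # hypothetical hot entry
bucket = TokenBucket(rate=1.0, burst=2.0)
print(forward(100, "10.0.0.5", hw_table, bucket, now=0.0))
print(forward(200, "10.0.0.9", hw_table, bucket, now=0.0))
```

The rate limit protects the much smaller XGW-x86 tier from being flooded when a large flow misses the hardware tables.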
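  15. 4. DESIGN AND IMPLEMENTATION (3/4) Table Splitting Among XGW-H Clusters
     • Benefits of splitting clusters by VNI
       • Fault isolation
       • Scalability: capacity planning per VNI (tenant)
       • Easier load balancing: traffic is easier to track than with flow-based hashing
       • Lower complexity: every node in a cluster holds the same tables, so maintenance is also easier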
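  16. 4. DESIGN AND IMPLEMENTATION (4/4) Single-Node Table Compression (1/2)
     • Goal: more table entries per node: O(1M), for the sake of CapEx/OpEx
     • Pipeline folding
       • Bandwidth is plentiful: 6.4 Tbps with 4 pipelines in parallel, times N switches
       • Trade bandwidth for memory
       • Run each packet through the pipeline twice -> 2x memory, 2x latency, half the bandwidth
       • Metadata is attached at ingress/egress and shared between pipes (this costs some bandwidth)
     • Table splitting across pipelines
       • On the first ingress -> egress pass, hash the packet to decide which processing it gets
       • Example: split by VNI, Dst IP, etc.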
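The folding trade-off on this slide is simple proportional arithmetic, sketched below. The class and function names are hypothetical; the 6.4 Tbps figure comes from the slide, while the 10 MB table memory and 1.0 us latency are placeholder values chosen only to make the scaling visible.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SwitchBudget:
    bandwidth_tbps: float    # usable forwarding bandwidth
    table_memory_mb: float   # table memory each packet can be matched against
    latency_us: float        # per-packet forwarding latency

def fold(budget: SwitchBudget, passes: int = 2) -> SwitchBudget:
    """Model pipeline folding: each packet traverses `passes` pipelines,
    so it sees `passes` times the table memory, takes `passes` times as
    long, and the switch carries 1/`passes` of the traffic."""
    return SwitchBudget(
        bandwidth_tbps=budget.bandwidth_tbps / passes,
        table_memory_mb=budget.table_memory_mb * passes,
        latency_us=budget.latency_us * passes,
    )

unfolded = SwitchBudget(bandwidth_tbps=6.4, table_memory_mb=10.0, latency_us=1.0)
print(fold(unfolded))
```

Halving 6.4 Tbps gives the 3.2 Tbps per-node throughput quoted in the overview slides.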
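  17. 4. DESIGN AND IMPLEMENTATION (4/4) Single-Node Table Compression (2/2)
     • Large table mapping across pipelines
       • Example: a huge Table D is split and placed across stages
     • IPv4/v6 pooling
       • Allocating the two families separately wastes memory -> pool them
       • Extend IPv4 keys to 128 bits so they share a table with IPv6: VXLAN routing table (LPM)
       • Compress IPv6 keys to 32 bits so they share a table with IPv4: VM-NC mapping (exact match)
     • Saving TCAM: algorithmic LPM (ALPM)
       • A trade-off between lookup efficiency and TCAM capacity
     • Compressing long entries
       • IPv6 -> 32 bits, and the reverse (details omitted)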
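The idea of pooling IPv4 and IPv6 routes in one 128-bit LPM table can be sketched as below. The slide does not say how the IPv4 key is extended, so zero-extension into the low 32 bits is an assumption, as are the function names and the linear-scan lookup (real hardware would use TCAM or ALPM). A production table would also need the VNI or an address-family bit in the key so that a short IPv6 prefix such as `::/0` cannot match IPv4 traffic.

```python
import ipaddress

def to_128bit_prefix(cidr: str):
    """Return (value, prefix_len) of a route in a shared 128-bit key space."""
    net = ipaddress.ip_network(cidr)
    if net.version == 4:
        # Assumption: the IPv4 address occupies the low 32 bits, so an
        # IPv4 /n prefix becomes a /(96 + n) prefix of the 128-bit key.
        return int(net.network_address), 96 + net.prefixlen
    return int(net.network_address), net.prefixlen

def lpm(table, addr: str):
    """Longest-prefix match by linear scan over the pooled table."""
    key = int(ipaddress.ip_address(addr))  # IPv4 is zero-extended implicitly
    best = None
    for (value, plen), next_hop in table.items():
        mask = ((1 << plen) - 1) << (128 - plen)
        if key & mask == value and (best is None or plen > best[0]):
            best = (plen, next_hop)
    return best[1] if best else None

# Hypothetical routes; "nc-*" stand for next-hop compute nodes.
table = {
    to_128bit_prefix("10.1.0.0/16"): "nc-a",
    to_128bit_prefix("10.1.2.0/24"): "nc-b",
    to_128bit_prefix("2001:db8::/32"): "nc-c",
}
print(lpm(table, "10.1.2.3"), lpm(table, "2001:db8::1"))
```

With a single table, memory does not have to be statically divided between the two address families, which is the waste the slide refers to.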
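  18. 5. EVALUATION
     • XGW-H: implemented in P4-16, 1k LoC, on Tofino 6.4T
     • Figure notes:
       • Measured at 128B-1024B/pkt
       • Even under Double 11 traffic (tens of Tbps at peak), the loss rate is six orders of magnitude lower
       • Table compression: going from "initial" to (a) sacrifices bandwidth and latency
       • A large improvement in network performance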
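  19. 6. EXPERIENCES
     • Deployment
     • Cluster construction
       • The controller installs table entries on the GWs and also verifies them
       • Probe packets cover the test scenarios
       • If everything passes, production traffic is steered to the GWs
     • Cluster management
       • Triggered when table size exceeds memory (easy to predict) or when traffic exceeds capacity (hard to predict)
       • Cluster expansion is decided from the average traffic (plus a safety margin)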
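  20. 6. EXPERIENCES (1/2) Deployment Experience
     • Cluster construction
       • The controller installs table entries on the GWs and also verifies them
       • Test scenarios are run with probe packets before production traffic is steered to the GWs
     • Cluster management
       • Triggered when table size exceeds memory (easy to predict) or when traffic exceeds capacity (hard to predict)
       • Cluster expansion is decided from the average traffic (plus a safety margin)
       • For Double 11, the safety margin is deliberately lowered
     • Disaster Recovery
       • Two identical clusters are prepared (1:1 hot standby), and failover is done by rerouting traffic
       • On a node failure, the other nodes in the cluster take over; if capacity is about to be exceeded, a cold-standby machine is started
       • A port failure is handled simply by shutting the port down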
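  21. 6. EXPERIENCES (2/2) Lessons Learned
     • Is the Sailfish approach useful outside Alibaba?
       • In the short term, the XGW-x86 approach works; with XGW-H, memory may not even run short
     • How long will Sailfish's memory and bandwidth hold out?
       • The trend is that traffic per VM grows faster than the number of VMs
       • SRAM/TCAM usage per Gbps is falling, which works in Sailfish's favor
     • How hard is development?
       • Compared with C (XGW-x86), development in P4 (XGW-H) is "even slightly simpler"
       • Hardware knowledge is required; the test toolchain is lacking -> a great many test cases had to be written
     • Is managing both XGW-H and XGW-x86 (heterogeneous cluster management) difficult?
       • An abstraction layer on each node hides the difference, so it needs no special attention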
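  22. 7. RELATED WORK
     • Software-implemented GWs: do not scale
     • Clustering of the above: core imbalance and overload
     • VPN GW on Azure: an approach similar to XGW-x86 (uses large memory)
     • LBs, etc.: tables can be compressed by hashing, but a cloud GW uses LPM, so compression alone is not enough
     • TEA: cache synchronization problem (on-chip memory acts as a cache, backed by host-side DRAM)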
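  23. 8. CONCLUSION AND FUTURE WORK
     • Sailfish
       • A multi-tenant, multi-service GW on programmable switch ASICs for Alibaba Cloud
       • Solves the memory shortage of the Tofino chip
       • By clustering, pipeline-aware table compression, and use of XGW-x86
     • Next: reduce the node count itself with an N+1 configuration
       • N layers of clusters + 1 layer of a cluster serving all tenants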