
#28 “Understanding Host Network Stack Overheads”

cafenero_777

June 06, 2023

Transcript

  1. Agenda
     • Target paper
     • Overview, and why I read it
     • 1. INTRODUCTION / 2. PRELIMINARIES / 3. LINUX NETWORK STACK OVERHEADS (the long part) / 4. FUTURE DIRECTIONS / 5. CONCLUSION
  2. Target paper
     • Understanding Host Network Stack Overheads
     • Qizhe Cai, Shubham Chaudhary, Midhul Vuppalapati, Jaehyun Hwang, Rachit Agarwal
     • Cornell University
     • ACM SIGCOMM '21
     • https://dl.acm.org/doi/10.1145/3452296.3472888
  3. Overview, and why I read it
     • Overview
       • The rapid speed-up of DC networks exposes a variety of overheads
       • Investigates the bottlenecks of the Linux network stack at 100Gbps
       • Discusses future OS/HW/protocol designs
     • Why I read it
       • I want to know where the network bottlenecks are at 100Gbps+
       • What are the best practices at 100Gbps+?
     • Q1. In Linux, which is harder: sending or receiving?
     • Q2. What about Slack?
  4. 1. INTRODUCTION
     • DC network bandwidth grows 4-10x every few years vs. other host resources stagnate
     • Approaches so far: optimizing the existing network stack, HW-offload, RDMA, entirely new OS designs, special-purpose network HW
     • So what is the bottleneck?
       1. Data copies are heavy (not protocol processing)
          • ~42Gbps/core, consuming ~50% of CPU cycles; HW-offload?
       2. Mismatch between BDP and cache size degrades performance
          • DCA (DDIO) can DMA directly into the L3 cache, but the data is overwritten by other traffic before it gets copied out
       3. Host resources are shared
          • Throughput diverges when different traffic patterns are processed together
       4. Layer/packet processing needs a rethink
          • Mixing long and short flows degrades performance
     > Toward analyzing and understanding the trends and general principles
  5. 2. PRELIMINARIES (1/2) Linux is hard, so some background (the E2E data path; a minimal send/receive sketch in C follows below)
     • TX: User: write system call -> copy into an skb -> TCP/IP processing -> GSO (TSO) chunks the skb into MTU-sized segments -> DMA mapping onto the TX queue
     • RX: DMA into the buffers mapped for the RX queue (DDIO: DMA into the L3 cache) -> IRQ, then polling via NAPI -> GRO (LRO) reduces the number of RX descriptors -> TCP/IP processing referencing the skb -> copy out of the skb -> User: read system call
     • GSO (Generic Segmentation Offload), TSO (TCP Segmentation Offload), GRO (Generic Receive Offload), LRO (Large Receive Offload), DCA (Direct Cache Access), DDIO (Intel Data Direct I/O)
     • References:
       https://www.kernel.org/doc/Documentation/networking/segmentation-offloads.txt
       https://engineering.linecorp.com/ja/blog/tso-problems-srv6-based-multi-tenancy-environment
       https://cloud.watch.impress.co.jp/docs/column/hwtrend/516386.html
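A minimal C sketch of the two copies in that data path: a sender whose write() copies the user buffer into skbs (TCP/IP, GSO/TSO and the DMA mapping to the TX queue then take over), and a receiver whose read() copies data back out of the GRO-coalesced skbs. The port number and chunk size are arbitrary assumptions, not values from the paper.

```c
/* Minimal sketch of the E2E path above (assumed port/size, not from the paper). */
#include <arpa/inet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT  5201        /* arbitrary, iperf-like */
#define CHUNK (1 << 20)   /* 1 MiB application buffer */

static char buf[CHUNK];

/* TX side: each write() copies buf into skbs; TCP/IP, GSO/TSO and the
 * DMA mapping onto the TX queue all happen below this call. */
static int sender(const char *ip)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET, .sin_port = htons(PORT) };
    inet_pton(AF_INET, ip, &a.sin_addr);
    if (connect(fd, (struct sockaddr *)&a, sizeof(a)) < 0) {
        perror("connect");
        return 1;
    }
    while (write(fd, buf, sizeof(buf)) > 0)
        ;
    return close(fd);
}

/* RX side: by the time read() runs, DMA (DDIO), NAPI and GRO have already
 * built large skbs; read() copies them out into buf. */
static int receiver(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0), one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    struct sockaddr_in a = { .sin_family = AF_INET, .sin_port = htons(PORT),
                             .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(fd, (struct sockaddr *)&a, sizeof(a));
    listen(fd, 1);
    int c = accept(fd, NULL, NULL);
    while (read(c, buf, sizeof(buf)) > 0)
        ;
    close(c);
    return close(fd);
}

int main(int argc, char **argv)
{
    return argc > 1 ? sender(argv[1]) : receiver();
}
```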
  6. 2. PRELIMINARIES (2/2) Linux is hard, so some background (measurement methodology)
     • Testbed setup
       • Two machines at 100Gbps (CX-5), connected back-to-back
       • Ubuntu 16.04 (kernel 5.4.43) on Xeon Gold 6128 3.4GHz (6 cores) x 4 sockets
       • 32KB (L1), 1MB (L2), 20MB (L3)
     • Experiment scenarios
       • https://github.com/Terabit-Ethernet/terabit-network-stack-profiling
       • 5 traffic patterns (figure on the right) + long and/or short flows
       • iperf (long), netperf (short)
       • CUBIC compared with BBR/DCTCP
     • Performance metrics
       • CPU utilization (sysstat, profiling) and throughput (a rough sketch of the throughput-per-core arithmetic follows below)
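As a rough illustration of the throughput-per-core metric, the sketch below samples aggregate CPU busy time from /proc/stat around a transfer window, converts the busy fraction into "cores consumed", and divides an externally measured rate by it. The 10-second window and the 100Gbps figure are placeholders; the deck itself relies on iperf/netperf plus sysstat.

```c
/* Sketch: derive throughput-per-core from /proc/stat around a transfer. */
#include <stdio.h>
#include <unistd.h>

static void read_cpu(unsigned long long *busy, unsigned long long *total)
{
    unsigned long long u, n, s, i, w, irq, sirq;
    FILE *f = fopen("/proc/stat", "r");
    fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
           &u, &n, &s, &i, &w, &irq, &sirq);
    fclose(f);
    *busy  = u + n + s + irq + sirq;
    *total = *busy + i + w;
}

int main(void)
{
    unsigned long long b0, t0, b1, t1;
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    double gbps = 100.0;               /* plug in the rate reported by iperf */

    read_cpu(&b0, &t0);
    sleep(10);                         /* measurement window while iperf runs */
    read_cpu(&b1, &t1);

    /* The "cpu" line aggregates all CPUs, so busy fraction * ncpu = cores busy. */
    double cores_used = ncpu * (double)(b1 - b0) / (double)(t1 - t0);
    printf("throughput/core ~= %.1f Gbps/core (%.2f cores busy)\n",
           gbps / cores_used, cores_used);
    return 0;
}
```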
  7. 3. LINUX NETWORK STACK OVERHEADS (1/9) single flow
     1. ~42Gbps with a single core
        • Jumbo frames and TSO/GRO are effective (larger skb payloads mean fewer packets to process)
     2. The receiver-side CPU is the bottleneck: data copy and skb allocation (on receive the driver allocates MTU-sized skbs, and the GRO layer merges them)
     3. CPU-cycle breakdown
        • TSO grows the skb and lets the NIC do the segmentation
        • GRO grows the skb handed to the upper layers, but it still costs CPU
        • aRFS (accelerated Receive Flow Steering; keeps processing on the same NUMA node) makes the L3$ hit
     4. $ misses occur even for a single flow
        • Increasing the TCP window size or the RxQ (= ring buffer size) increases misses: if the BDP exceeds the L3$, the next incoming data evicts it -> $ miss; a larger RX queue -> suboptimal $ use -> misses become more likely
        • The optimum here was about 3MB (L3 20MB, 6 cores); see the sketch below
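One knob the slide points at is the receive buffer, which bounds the advertised TCP window. A hedged sketch, assuming the ~3MB figure above: cap SO_RCVBUF so the BDP stays within what the L3 cache can hold. The kernel doubles the value passed in and clamps it to net.core.rmem_max, and the right number is machine- and workload-specific.

```c
/* Sketch: cap the receive buffer near the ~3MB value reported as optimal. */
#include <stdio.h>
#include <sys/socket.h>

int cap_rcvbuf(int fd)
{
    int bytes = 3 * 1024 * 1024;   /* ~3MB target from the slide; tune per machine */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    socklen_t len = sizeof(bytes);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, &len);
    printf("effective rcvbuf: %d bytes\n", bytes);   /* kernel reports 2x the request */
    return 0;
}
```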
  8. 3. LINUX NETWORK STACK OVERHEADS (2/9) one-to-one traffic
     • cores/flows 1~24: throughput saturates as the cores saturate (NUMA saturates as well)
     • L3$ locality drops (the cache ends up shared)
     • Scheduling overhead grows; time spent waiting on frees drops
  9. 3. LINUX NETWORK STACK OVERHEADS (3/9) incast traffic
     • throughput/core degrades as the flow count grows (1~24)
     • The CPU overhead breakdown is unchanged
     • Flows contend on the receiver -> L3$ misses -> bandwidth/core degrades (receiver side)
     • With TCP the receiver is fundamentally at a disadvantage
     • This is why receiver-driven designs such as pHost (pFabric) and Homa are at an advantage
  10. 3. LINUX NETWORK STACK OVERHEADS (4/9) outcast traffic
      • bandwidth/core as the number of flows per core at the sender grows (1~24)
      • Max 89Gbps; per-core performance is 2x that of incast
      • TSO runs in HW (GRO is SW)
      • aRFS -> warm $ -> effective
      • The receiver sees fewer misses (compared with incast)
  11. 3. LINUX NETWORK STACK OVERHEADS (5/9) all-to-all traffic
      • N:N traffic (N=1~24)
      • throughput/core drops sharply (down 67%)
      • remote-NUMA accesses, $ misses
      • Low per-core packet rate -> poor GRO aggregation -> small skbs -> more overhead
      • The skb size distribution does in fact shift
  12. 3. LINUX NETWORK STACK OVERHEADS (6/9) traffic under congestion
      • A single flow with loss induced at the switch
      • throughput/core drops 24%
      • Receiver-side CPU utilization drops -> per-core efficiency improves (but only superficially); the $ hit rate improves
      • Receiver-side ACK generation and transmission
        • Roughly 5x more CPU cycles
      • Sender-side congestion control and retransmission processing
        • Even with heavy loss, its load does not drop (relative to the receiver)
  13. 3. LINUX NETWORK STACK OVERHEADS (7/9) flow size
      • short: 16TX -> 1RX (incast RPC traffic)
        • Bandwidth grows with message size
        • Data-copy cost has little impact
        • Even with few $ misses, the gain is small
        • More packets -> more protocol and scheduler work -> the load impact grows
      • long + short mix
        • At 16 flows, both long and short flows lose more than 40% of their throughput
      • (Figure annotations: impact small/large; short-only vs. short/long mixed cases)
  14. 3. LINUX NETWORK STACK OVERHEADS (8/9) impact of DCA/IOMMU
      • Tested with a single flow
      • Disabling DCA
        • The NIC can no longer DMA into the L3$
      • Enabling the IOMMU: the 26% performance drop comes from two extra steps
        • When pages are allocated for DMA, page-table entries are inserted into the IOMMU
        • After the DMA completes, the page-table entries are unmapped
  15. 3. LINUX NETWORK STACK OVERHEADS (9/9) impact of the protocol
      • BBR and DCTCP also measured with a single flow (pacing via qdisc); see the sketch below for switching algorithms per socket
      • Almost no impact on per-core performance
      • The receiver remains the bottleneck
      • Sender-side congestion control cannot fix it
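For reference, the congestion control algorithm can be switched per socket with TCP_CONGESTION, as one would for a CUBIC vs. BBR/DCTCP comparison. A minimal sketch, assuming the chosen algorithm is built/loaded and listed in net.ipv4.tcp_allowed_congestion_control:

```c
/* Sketch: select the congestion control algorithm on one socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int set_cc(int fd, const char *algo)      /* e.g. "cubic", "bbr", "dctcp" */
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0) {
        perror("setsockopt(TCP_CONGESTION)");
        return -1;
    }
    char cur[16] = {0};
    socklen_t len = sizeof(cur);
    getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cur, &len);
    printf("congestion control: %s\n", cur);
    return 0;
}
```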
  16. 4. FUTURE DIRECTIONS
      • zero-copy
        • TX: pin memory at system-call time -> let the NIC DMA from it (see the MSG_ZEROCOPY sketch below)
        • RX: the socket works on virtual addresses (mmap the physical addresses the NIC can DMA into) = requires application changes
        • XDP / AF_XDP sockets: zero-copy, but the network/protocol stack must be reimplemented
      • Transport protocol design
        • Consider host metrics (cores/$/DMA), not only the traditional ones (latency, throughput)?
        • Receiver-side orchestration?
      • Host stack redesign
        • It should be possible to change the processing pipeline even after socket creation (mid-connection): buffers / protocol processing / CPU resources / scheduling
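As one concrete instance of the TX direction above, Linux already offers MSG_ZEROCOPY (kernel 4.14+): send() pins the user pages so the NIC can DMA from them, and a completion arrives on the socket error queue once the buffer may be reused. The sketch below illustrates that mechanism, not the paper's own proposal; the constants are guarded in case the libc headers predate them.

```c
/* TX zero-copy illustration using MSG_ZEROCOPY; an assumption-level sketch. */
#include <errno.h>
#include <linux/errqueue.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60            /* in case libc headers predate it */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static char payload[1 << 20];     /* must stay untouched until completion */

int send_zerocopy(int fd)
{
    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;                /* kernel too old, or not permitted */

    /* send() pins these pages; the NIC DMAs directly from them. */
    if (send(fd, payload, sizeof(payload), MSG_ZEROCOPY) < 0)
        return -1;

    /* Wait for the completion on the error queue before reusing payload. */
    for (;;) {
        char ctrl[128];
        struct msghdr msg = { .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN)
                continue;         /* busy-wait only for illustration */
            return -1;
        }
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        if (!cm)
            continue;
        struct sock_extended_err *e = (struct sock_extended_err *)CMSG_DATA(cm);
        if (e->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
            return 0;             /* pages released; payload reusable */
    }
}
```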
  17. 5. CONCLUSION
      • Using 100Gbps in the DC hits a variety of bottlenecks on the host side
      • Throughput, cores, buffers, and cache measured and analyzed across various traffic patterns
      • Host OS design/implementation and HW improvements are needed
  18. Three-line summary
      • Using 100Gbps in the DC hits a variety of bottlenecks on the host side
      • Throughput, cores, buffers, and cache measured and analyzed across various traffic patterns
      • Host OS design/implementation and HW improvements are needed