Slide 1


Research Paper Introduction #28 “Understanding Host Network Stack Overheads” (#83 overall) @cafenero_777 2021/09/30

Slide 2


Agenda
• Target paper
• Overview and why I read it
1. INTRODUCTION
2. PRELIMINARIES
3. LINUX NETWORK STACK OVERHEADS (the long part)
4. FUTURE DIRECTIONS
5. CONCLUSION

Slide 3


Target paper
• Understanding Host Network Stack Overheads
• Qizhe Cai, Shubham Chaudhary, Midhul Vuppalapati, Jaehyun Hwang, Rachit Agarwal
• Cornell University
• ACM SIGCOMM ’21
• https://dl.acm.org/doi/10.1145/3452296.3472888

Slide 4


Overview and why I read it
• Overview
  • The rapid speedup of data-center networks causes a variety of overheads
  • Investigates the bottlenecks of the Linux network stack at 100Gbps
  • Discusses future OS / hardware / protocol designs
• Why I read it
  • I want to know where the network bottlenecks are at 100Gbps+
  • What are the best practices at 100Gbps+?
Q1. On Linux, which is harder: send or receive?
Q2. What about at Slack?

Slide 5


1. INTRODUCTION
• DC network bandwidth grew 4-10x over a few years, vs. stagnating growth in other host resources
• Prior efforts: optimizing the existing network stack, HW offload, RDMA, clean-slate OS designs, specialized network hardware
• So what is the bottleneck? Four findings:
1. Data copy is the heavy part (not protocol processing)
  • ~42Gbps/core; the copy consumes ~50% of CPU cycles. HW offload?
2. Mismatch between BDP and cache size degrades performance
  • DCA (DDIO) can DMA straight into the L3 cache, but the data is overwritten by other traffic before it gets copied out
3. Host resources are shared
  • Different traffic patterns yield different per-core throughput
4. Layer/packet processing needs rethinking
  • Mixing long and short flows degrades performance
→ toward analyzing and understanding the trends and general principles

Slide 6


2. PRELIMINARIES (1/2): a primer on the E2E data path (Linux is complicated)
TX path:
• User: write system call; data is copied into skbs
• GSO (TSO) chunks the skb into MTU-sized segments
• DMA mapping onto the TX queue
RX path:
• DMA into the buffers mapped to the RX queue (DDIO: DMA into the L3 cache)
• IRQ, then polling via NAPI
• GRO (LRO) reduces RX descriptors by coalescing them into larger skbs
• TCP/IP processing (by reference to the skb)
• User: read system call; data is copied out of the skb
Terms: GSO (Generic Segmentation Offload), TSO (TCP Segmentation Offload), GRO (Generic Receive Offload), LRO (Large Receive Offload), DCA (Direct Cache Access), DDIO (Intel Data Direct I/O)
References:
https://www.kernel.org/doc/Documentation/networking/segmentation-offloads.txt
https://engineering.linecorp.com/ja/blog/tso-problems-srv6-based-multi-tenancy-environment
https://cloud.watch.impress.co.jp/docs/column/hwtrend/516386.html
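The copy semantics of this path can be seen from user space. A toy sketch (not from the paper): once `send()` returns, the payload has been copied into a kernel skb, so mutating the user buffer afterwards does not change what the receiver reads.

```python
import socket

# send() copies the payload from user space into a kernel buffer (skb),
# so a later change to the user buffer is invisible to the receiver.
tx, rx = socket.socketpair()

buf = bytearray(b"payload-v1")
tx.send(buf)            # user buffer -> kernel copy happens here
buf[8:10] = b"v2"       # overwrite the user buffer after the copy

received = rx.recv(64)  # still sees the original bytes
print(received)         # → b'payload-v1'
tx.close()
rx.close()
```

This per-byte copy on both the write and read sides is exactly the cost the paper later identifies as dominant.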

Slide 7


2. PRELIMINARIES (2/2): measurement methodology
• Testbed setup
  • Two machines at 100Gbps (ConnectX-5), back-to-back
  • Ubuntu 16.04 (kernel 5.4.43) on Xeon Gold 6128 3.4GHz (6 cores) x 4 sockets
  • Caches: 32KB (L1), 1MB (L2), 20MB (L3)
• Experiment scenarios
  • https://github.com/Terabit-Ethernet/terabit-network-stack-profiling
  • Five traffic patterns (figure on the right), with long and/or short flows
  • iperf (long flows), netperf (short flows)
  • CUBIC, compared against BBR/DCTCP
• Performance metrics
  • CPU utilization (sysstat, profiling) and throughput
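As a hypothetical, much-simplified stand-in for the iperf-style long-flow measurement (the paper runs real iperf/netperf across two 100Gbps-connected hosts), the core loop is just "blast bytes, time it, divide":

```python
import socket
import threading
import time

def measure_throughput(total_bytes: int = 64 * 1024 * 1024) -> float:
    """Push total_bytes through a local socketpair and return Gbps."""
    tx, rx = socket.socketpair()
    chunk = b"\x00" * 65536

    def sink() -> None:
        # Drain everything the sender pushes.
        remaining = total_bytes
        while remaining > 0:
            remaining -= len(rx.recv(65536))

    t = threading.Thread(target=sink)
    t.start()
    start = time.perf_counter()
    sent = 0
    while sent < total_bytes:
        sent += tx.send(chunk)
    t.join()
    elapsed = time.perf_counter() - start
    tx.close()
    rx.close()
    return sent * 8 / elapsed / 1e9  # bits per second -> Gbps

gbps = measure_throughput()
print(f"{gbps:.2f} Gbps over an AF_UNIX socketpair")
```

This loopback number measures memory-copy speed rather than a NIC, which is itself a hint of where the 100Gbps bottleneck sits.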

Slide 8


3. LINUX NETWORK STACK OVERHEADS (1/9): single flow
1. ~42Gbps with one core
  • Jumbo frames and TSO/GRO help: larger skb payloads mean fewer packets to process
2. The receiver CPU is the bottleneck: data copy and skb allocation (on receive, the driver allocates MTU-sized skbs, which the GRO layer then merges)
3. CPU-cycle breakdown
  • TSO grows the skb handed to the NIC, which does the segmentation
  • GRO grows the skb passed up the stack, but costs CPU
  • aRFS (accelerated Receive Flow Steering; processing on the same NUMA node) keeps the L3 cache hitting
4. Cache misses occur even with a single flow
  • Increasing the TCP window size or the RX queue (= ring buffer) size increases misses
  • BDP larger than the L3 cache → the next data arrives and evicts, causing misses
  • Suboptimal caching: more RX descriptors → misses become more likely
  • The optimal window here was about 3MB (20MB L3, 6 cores)
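The BDP-vs-cache argument above can be checked with back-of-the-envelope arithmetic. The 20MB L3 and 6 cores are from the testbed slide; the 30µs RTT below is an assumed figure purely for illustration:

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bits in flight, converted to bytes."""
    return bandwidth_bps * rtt_s / 8

# Per-core share of the 20MB L3 on the 6-core testbed CPU:
l3_per_core_mb = 20 / 6                   # ~3.33 MB, near the ~3MB optimum above

# In-flight data for 100Gbps at an assumed 30us RTT:
bdp_mb = bdp_bytes(100e9, 30e-6) / 1e6    # 0.375 MB

print(f"BDP: {bdp_mb:.3f} MB, L3 share per core: {l3_per_core_mb:.2f} MB")
```

Once the configured TCP window grows past the per-core L3 share, incoming DDIO data starts evicting bytes that were never copied out, which is the miss pattern described above.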

Slide 9


3. LINUX NETWORK STACK OVERHEADS (2/9): one-to-one traffic
• 1~24 cores/flows: throughput saturates; cores saturate (NUMA as well)
• L3 cache locality drops (flows end up sharing the cache)
• Scheduling overhead rises; cycles spent waiting on frees fall

Slide 10


3. LINUX NETWORK STACK OVERHEADS (3/9): incast traffic
• throughput/core degrades as the flow count (1~24) grows
• CPU overhead itself is unchanged
• Flows contend at the receiver → L3 cache misses → bandwidth/core degrades
• With TCP, the receiver is fundamentally at a disadvantage
• This is why designs like pHost (pFabric) and HOMA have an edge
(figure: receiver side)

Slide 11


3. LINUX NETWORK STACK OVERHEADS (4/9): outcast traffic
• bandwidth/core with 1~24 flows per core at the sender
• Max 89Gbps; per-core performance is 2x incast
• TSO is done in hardware (GRO is software)
• aRFS → warm cache → effective
• The receiver sees fewer misses (compared with incast)

Slide 12


3. LINUX NETWORK STACK OVERHEADS (5/9): all-to-all traffic
• N:N traffic (N = 1~24)
• throughput/core drops sharply (down 67%)
• Remote-NUMA usage, cache misses
• Few packets per core → poor GRO aggregation → small skbs → overhead
• The measured skb size distribution indeed shifts
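The "small skbs → overhead" step follows from simple arithmetic: per-skb processing cost is roughly fixed, so when GRO aggregation breaks down, the stack handles many more skbs for the same byte rate. The skb sizes below are illustrative assumptions, not the paper's measured distribution:

```python
def skbs_per_second(throughput_bps: float, skb_bytes: int) -> float:
    """How many skbs/s the stack must process at a given byte rate."""
    return throughput_bps / 8 / skb_bytes

# At 100Gbps, compare well-aggregated vs. poorly-aggregated skbs:
well_aggregated = skbs_per_second(100e9, 64 * 1024)  # GRO builds ~64KB skbs
poorly_aggregated = skbs_per_second(100e9, 4096)     # aggregation breaks down

print(f"{well_aggregated:,.0f} vs {poorly_aggregated:,.0f} skbs/s")
```

A 16x jump in skbs/s means roughly 16x the protocol and bookkeeping work per delivered byte, matching the throughput/core collapse seen in all-to-all.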

Slide 13


3. LINUX NETWORK STACK OVERHEADS (6/9): traffic with congestion loss
• A single flow with loss induced at the switch
• throughput/core drops 24%
• Receiver CPU utilization drops → per-core efficiency looks better (cache hit rate improves)
• CPU cycles grow roughly 5x for:
  • receiver-side ACK generation and transmit processing
  • sender-side congestion control and retransmission
• Even with heavy loss, the load (relative to the receiver's) does not drop

Slide 14


3. LINUX NETWORK STACK OVERHEADS (7/9): flow size
• Short flows: 16 TX → 1 RX (incast RPC traffic)
• Bandwidth grows with flow size
• Data-copy overhead has little impact here
• Few cache misses, yet little performance gain
  • More packets → more protocol and scheduler invocations → overhead grows
• Long/short mix
  • With 16 flows, long and short flows each lose over 40% throughput
(figures: short-only vs. short/long mixed; copy impact small vs. large)

Slide 15


3. LINUX NETWORK STACK OVERHEADS (8/9): impact of DCA / IOMMU
• Tested with a single flow
• Disabling DCA
  • The NIC can no longer DMA into the L3 cache
• Enabling the IOMMU: the 26% performance loss comes from two extra steps
  • Inserting a page-table entry into the IOMMU when allocating pages for DMA
  • Unmapping the page-table entry after the DMA completes

Slide 16


3. LINUX NETWORK STACK OVERHEADS (9/9): impact of the transport protocol
• BBR and DCTCP also measured with a single flow
• Almost no per-core performance difference
• The receiver remains the bottleneck
• Sender-side congestion control cannot solve this
(figure note: pacing via qdisc)

Slide 17


4. FUTURE DIRECTIONS
• Zero-copy
  • TX: pin memory at system-call time so the NIC can DMA directly from it
  • RX: sockets use virtual addresses (mmap physical addresses the NIC can DMA into) = application changes required
  • XDP / AF_XDP sockets: zero-copy, but require reimplementing the network/protocol stack
• Transport protocol design
  • Consider host metrics (cores / cache / DMA) in addition to the traditional ones (latency, throughput)?
  • Receiver-side orchestration?
• Host stack redesign
  • It would be good to be able to change the processing pipeline even after socket creation, mid-connection (buffers / protocol processing / CPU resources / scheduling)
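One zero-copy TX path already exists in the standard API and gives a feel for the idea: `sendfile(2)` moves file data to a socket inside the kernel, skipping the user-space copy a `read()`+`send()` pair would incur. Note this is only the closest widely available analogue; the paper's proposals target ordinary socket writes (e.g. pinning user memory), not file transmission.

```python
import os
import socket
import tempfile

# sendfile(2): the kernel moves bytes from the file to the socket
# directly; the payload never passes through a user-space buffer.
payload = b"x" * 8192
with tempfile.NamedTemporaryFile() as f:
    f.write(payload)
    f.flush()
    tx, rx = socket.socketpair()
    sent = os.sendfile(tx.fileno(), f.fileno(), 0, len(payload))
    tx.close()
    received = rx.recv(len(payload), socket.MSG_WAITALL)
    rx.close()

print(sent, len(received))
```

The trade-off noted above applies here too: the zero-copy APIs that generalize this to arbitrary socket data (MSG_ZEROCOPY, AF_XDP) demand application changes or protocol reimplementation.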

Slide 18


5. CONCLUSION
• Using 100Gbps in the data center hits a variety of host-side bottlenecks
• Measured and analyzed throughput, cores, buffers, and caches across various traffic patterns
• Host OS design/implementation and hardware will need rework

Slide 19


Three-line summary
• Using 100Gbps in the data center hits a variety of host-side bottlenecks
• Measured and analyzed throughput, cores, buffers, and caches across various traffic patterns
• Host OS design/implementation and hardware will need rework

Slide 20


EoP