
DPDK MEETUP TOKYO #1 DPDK INTRODUCTION


Linux Foundation Japan

October 28, 2019


Transcript

  1. 2019/10/28 @ Yahoo! JAPAN LODGE, hosted by the Linux Foundation
    https://www.meetup.com/osn-tokyo/events/265399278/
    IAGS/CPDP/CEE AE
    Naoyuki Mori


  2. 2
    Legal Disclaimers
    Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system
    configuration. Check with your system manufacturer or retailer or learn more at intel.com.
    No computer system can be absolutely secure.
    Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources
    of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.
    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
    measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
    information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more
    complete information visit http://www.intel.com/benchmarks.
    Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
    SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured
    by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
    Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
    Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost
    savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
    No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
    Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any
    warranty arising from course of performance, course of dealing, or usage in trade.
    All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
    Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are
    accurate.
    © 2019 Intel Corporation.
    Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
    *Other names and brands may be claimed as property of others.


  3. 3
    Agenda
    • DPDK Overview: Positioning, History & Current Status; DPDK with VMware* & Microsoft*
    • DPDK Performance Concepts: Performance Challenge
    • DPDK Roadmap: New Roadmap Items


  4. This section gives a high-level overview of the overall DPDK framework, including the software libraries,
    APIs and drivers that it contains, and the sample applications that can be used to demonstrate how it
    works and test performance.


  5. 5
    Network Workloads
    • Application Processing (L7): Firewall, ADC, Call Processing Stack, BSS/OSS
    • Control Plane Processing (L3-4): BGP, OSPF, RIP, SCTP, SIP, MGCP, RANAP, RNSAP (example: BGP Routing)
    • Packet Processing (L2-3): Classification, Traffic Management, QoS Scheduler, NW Overlays, IPSEC/DTLS (example: QoS Scheduling)
    • Signal Processing (L1): Video Transcoding, Baseband Processing, Speech Codecs (example: Video Transcode)


  6. 6
    DPDK Components
    Core libraries: core functions such as memory management, software rings, timers etc.
    Packet classification: software libraries for hash/exact match, LPM, ACL, etc.
    Accelerated SW libraries: common functions such as fragmentation, reassembly, reordering etc.
    Stats: libraries for collecting and reporting statistics & metrics.
    QoS: libraries for QoS scheduling and metering/policing.
    Packet Framework: libraries for creating complex pipelines in software.
    Device classes:
    ETHDEV: Flow API, MTR API, TM API; PMDs for physical and virtual Ethernet devices
    CRYPTODEV: PMDs for HW and SW crypto accelerators
    COMPRESSDEV: PMDs for HW and SW compression accelerators
    EVENTDEV: event-driven PMDs
    BBDEV: PMDs for HW and SW wireless accelerators
    RAWDEV: PMDs for raw devices
    Security: HW and SW acceleration of security protocols
    Kernel interfaces: IGB_UIO, KNI, UIO_PCI_GENERIC, VFIO, AF_XDP
    DPDK Applications - Network Functions (Cloud, Enterprise, Telco)
    DPDK Fundamentals
    • Implements run-to-completion and pipeline models
    • No scheduler - all devices accessed by polling (see the sketch below)
    • Supports 32-bit and 64-bit OSs, with and without NUMA
    • Scales from Intel® Atom® to Intel® Xeon® processors
    • Number of cores and processors is not limited
    • Optimal packet allocation across DRAM channels
    • Use of 2M & 1G hugepages and cache-aligned structures
    • Uses bulk concepts - processing ‘n’ packets simultaneously
    • Open source and BSD licensed
    • Ease of development - quick prototyping with samples, debugging (gdb), analysis (VTune™, Intel® Performance Counter Monitor (Intel® PCM), PROX)
    * Other names and brands may be claimed as the property of others.
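
    To make the polling and bulk fundamentals concrete, here is a minimal sketch of a DPDK
    run-to-completion loop (not from the deck; the port/queue ids and BURST_SIZE are
    illustrative, and EAL, port and mempool setup are assumed to have been done already):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32  /* illustrative bulk size */

    /* Poll one RX queue and echo packets back out: no interrupts,
     * no scheduler - one tight loop on a dedicated lcore. */
    static void lcore_fwd_loop(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Bulk receive: up to BURST_SIZE packets per call,
             * amortizing PCIe/memory access cost across packets. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
            if (nb_rx == 0)
                continue;              /* nothing yet - keep polling */

            /* Bulk transmit the same packets back out (L2 echo). */
            uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);

            /* Free anything the TX ring could not accept. */
            for (uint16_t i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);
        }
    }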


  7. 7
    Open Source Timeline
    2010-12: Initial DPDK releases from Intel, provided under open source BSD license.
    2013: DPDK.org open source community established; helps to facilitate an increase in the use of, and contributions to, DPDK.
    2014: First fully open source release (1.7). First multi-vendor CPU and NIC support. First OS distro packaging of DPDK (Fedora, FreeBSD etc.). First DPDK Summit community event held.
    2015: Continued increase in multi-vendor CPU and NIC support. Technical Board created to aid technical decision making.
    2016: Community decision to adopt a more formal governance structure.
    2017: Rapid increase in multi-vendor CPU and NIC support. Increased OS distro packaging (RHEL, CentOS, Ubuntu etc.). DPDK Summits extended to PRC and Europe. Support for hardware and software accelerators added.
    2018: DPDK transitions to The Linux Foundation. DPDK Summits extended to India. Strong focus on generic APIs to abstract applications from the underlying platform. Continued expansion of accelerator support (FPGAs, smart NICs, cryptodev, bbdev, compressdev, eventdev etc.).


  8. 8
    Open Source Stats
    A fully open source software project with a strong development community:
    • BSD licensed
    • Website: http://dpdk.org; Git: http://dpdk.org/browse/dpdk/
    Major Contributors: [company logos]
    DPDK Release Stats:
    [Chart: Total Contributors (scale 0-200) and Total Commits (scale 0-1800) per release, from 1.7 through 18.11.]
    * Other names and brands may be claimed as the property of others.


  9. 9
    Multi-Architecture/Multi-Vendor Support
    2014: First non-IA contributions.
    2015: Non-Intel NIC support.
    2016: Significant ARM vendor engagement.
    2017: SoC enhancements. Non-Intel crypto. Event API. Enhanced ARM support.
    CPU Architectures: POWER 8, TILE-Gx, ARM v7/v8
    Poll Mode Drivers: ENIC, BNX2X, MLX4/MLX5, NFP, CXGBE, SZEDATA2, ThunderX, QEDE, ENA, BNXT, DPAA2, OcteonTX, LiquidIO, SFC, AVP, ARK
    Crypto: ARMv8 Crypto
    * Other names and brands may be claimed as the property of others.


  10. DPDK for VMware and Microsoft, presented in 2017
    https://dpdksummit.com/Archive/pdf/2017USA/Accelerating%20NFV%20with%2
    0VMware%27s%20Enhanced%20Network%20Stack%20(ENS)%20and%20Intel%
    27s%20Poll%20Mode%20Drivers%20(PMD).pdf
    https://dpdksummit.com/Archive/pdf/2017USA/Making%20networking%20apps
    %20scream%20on%20Windows%20with%20DPDK.pdf
    DPDK Summit North America 2017 - November 14 - 15, 2017
    10


  11. Why is DPDK fast?
    How do Intel hardware features help performance?


  12. 12
    Packet Performance and Today’s Needs
    Performance advantage of DPDK over Linux* - more free cores for customers’ applications:
    • Linux: < 2 Mpps/core @ 64B [1] (data not touched)
    • DPDK testpmd: < 52 Mpps/core @ 64B [2] (data not touched)
    • DPDK L3Fwd, 1C1T: 32.6 Mpps @ 64B, 50 Gbps @ 256B [3] (data touched)
    • DPDK L3Fwd, 1C2T: 41.6 Mpps @ 64B, 50 Gbps @ 256B [3] (data touched)
    Network bandwidth needs, today through 2020 (400+ Gbps):
    • 1 Gbps: Residential CPE
    • 10 Gbps: Enterprise CPE
    • 10-40 Gbps: CO, Cable Headend
    • 40+ Gbps: National-Regional DC, Core Networks
    • 100+ Gbps: High End
    * Other names and brands may be claimed as the property of others.


  13. Differences between the kernel network stack and user-space DPDK
    Kernel bypass: DPDK runs in user space, accelerated by poll-mode drivers.
    [Diagram, left: kernel-space network driver]
    The application in user space reaches the hardware through system calls and the kernel
    protocol stack: the NIC device raises an interrupt (1), the kernel driver handles the
    descriptors and config via the ring/CSR and socket buffers (2), and packet data is copied
    from kernel memory up to the application (3).
    [Diagram, right: user-space network driver (DPDK)]
    A UIO driver in kernel space exposes the device; the descriptors and config registers are
    mapped straight into user space (1), and the DPDK PMD in the application polls the NIC
    ring/CSR and accesses packet data in user-space memory directly (2) - no interrupts, no
    copies, no system calls on the data path.
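
    As an illustration of the "mapping" step (not from the deck, and not something applications
    normally do by hand - DPDK's EAL performs it internally), here is a hypothetical sketch of
    mapping a device's registers through the Linux UIO framework, assuming the NIC is bound to
    a UIO driver and exposed as /dev/uio0:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR);   /* device bound to e.g. igb_uio */
        if (fd < 0) { perror("open"); return 1; }

        /* UIO exposes memory map N at offset N * page size;
         * the 4096-byte length here is illustrative. */
        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0 * getpagesize());
        if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* The process can now poll device registers and descriptor rings
         * directly - no system call or copy per packet. */
        printf("first register word: 0x%08x\n", regs[0]);

        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }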


  14. 14
    PCIe* Connectivity and Core Usage
    Using run-to-completion or pipeline software models.
    Run-to-Completion Model
    • I/O and application workload can be handled on a single core
    • I/O can be scaled over multiple cores
    • Each core runs PMD packet I/O (Rx/Tx) plus its packet/flow work; NIC RSS mode spreads flows across the cores
    • Can handle more I/O on fewer cores with vectorization; vectorization is good for some applications to boost performance
    Pipeline Model
    • The I/O application disperses packets to other cores (e.g. via a hash stage after PMD packet Rx)
    • Application work (App A, B, C) is performed on other cores
    • A further core runs PMD packet I/O with flow classification for App A, B, C and handles Tx
    [Diagram: dual-socket platform (Processor 0 and 1 linked by QPI/UPI), 10 GbE NICs attached over PCIe, Linux* control plane on physical core 0, per-NUMA-node mempools with queues/rings, buffers and pool caches.]
    GRUB settings are important for performance - see the performance reports on www.dpdk.org:
    default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=2048 isolcpus=1-11,22-33 nohz_full=1-11,22-33 rcu_nocbs=1-11,22-33
    Note: nohz_full and rcu_nocbs are used to disable Linux* kernel interrupts on the isolated cores, which is important for zero-packet-loss tests. Generally, 1G hugepages are used for performance testing.
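
    The per-core pinning described above is handled by DPDK's EAL rather than by the
    application itself. A minimal, hypothetical sketch (the core list and channel count are
    illustrative and should match the isolcpus setting above):

    #include <stdio.h>
    #include <rte_eal.h>
    #include <rte_lcore.h>

    int main(int argc, char **argv)
    {
        /* The EAL parses its own arguments, e.g.:
         *   ./app -l 1-3 -n 4
         * pins one thread to each of cores 1-3 (ideally cores from
         * the isolcpus list) and assumes 4 memory channels. */
        if (rte_eal_init(argc, argv) < 0)
            return 1;

        unsigned int lcore_id;
        /* Each lcore thread is already affinitized to its core. */
        RTE_LCORE_FOREACH(lcore_id)
            printf("lcore %u pinned and ready\n", lcore_id);

        rte_eal_cleanup();
        return 0;
    }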


  15. 15
    High Performance Challenges
    The system can’t keep up with the number of interrupts for packet Rx:
    • Switch from an interrupt-driven network device driver to a poll-mode driver.
    The Linux scheduler causes too much overhead for task switches:
    • Bind a single software thread to a logical core.
    Memory and PCIe access is really slow compared to CPU operations:
    • Process a batch of packets during each software iteration and amortize the access cost over multiple packets.
    Data doesn’t seem to be near the CPU when it needs to be:
    • For memory access, use HW- or SW-controlled prefetching. For PCIe access, use Data Direct I/O to write data directly into cache.
    Access to shared data structures is a bottleneck:
    • Use access schemes that reduce the amount of sharing, e.g. lockless queues for message passing (see the sketch below).
    Page tables are constantly evicted (DTLB thrashing):
    • Allow Linux to use huge pages (2MB, 1GB).
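
    To illustrate the lockless message-passing point above, here is a minimal sketch (not from
    the deck) using DPDK's lock-free rte_ring; the ring name, size and single-producer/
    single-consumer flags are illustrative:

    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_ring.h>

    #define BURST_SIZE 32  /* illustrative */

    /* Create a lock-free SP/SC ring for handing packets from an RX
     * core to a worker core. Size must be a power of two. */
    static struct rte_ring *make_ring(void)
    {
        return rte_ring_create("rx_to_worker", 1024, rte_socket_id(),
                               RING_F_SP_ENQ | RING_F_SC_DEQ);
    }

    /* Producer side (RX core): hand off a burst without taking locks. */
    static void producer(struct rte_ring *r, struct rte_mbuf **bufs,
                         unsigned int nb_rx)
    {
        unsigned int sent = rte_ring_enqueue_burst(r, (void **)bufs,
                                                   nb_rx, NULL);
        while (sent < nb_rx)            /* drop what didn't fit */
            rte_pktmbuf_free(bufs[sent++]);
    }

    /* Consumer side (worker core): take up to BURST_SIZE packets. */
    static unsigned int consumer(struct rte_ring *r, struct rte_mbuf **bufs)
    {
        return rte_ring_dequeue_burst(r, (void **)bufs, BURST_SIZE, NULL);
    }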


  16. 16
    Achieving Performance on the Processor
    HW Concepts
    • Vector instruction set: the CPU core supports vector instructions (SSE: 128-bit integer; AVX: 128-bit integer; AVX2: 256-bit integer; AVX3/AVX-512: 512-bit integer).
    • Huge pages: Intel CPUs support 4K, 2MB and 1GB page sizes. Picking the right page size for the data structure minimizes TLB thrashing. DPDK uses hugetlbfs to manage the physically mapped huge-page area.
    • Hardware prefetch: Intel CPUs support prefetching data into all levels of the cache hierarchy (L1, L2, LLC).
    • Cache and memory alignment: DPDK aligns all its data structures to 64B multiples. This avoids elements straddling cache lines and DDR memory lines, fulfilling requests with single read cycles.
    • Intel Data Direct I/O (DDIO): on Intel® Xeon® E5, E7 and Scalable processors, packet I/O data is placed directly in the LLC on ingress and sourced from the LLC on egress.
    • Cache QoS: allows way-allocation control of the LLC between multiple applications, controlled by software.
    • CPU frequency scaling/Turbo: allows the core to temporarily boost its frequency for higher single-threaded performance.
    • NUMA: Non-Uniform Memory Architecture - DPDK tries to allocate memory as close as possible to the core where the code is executing.
    SW Concepts
    • Complete user-space implementation: allows quick prototyping and development; the compiler can aggressively optimize to use the complete instruction set.
    • Software prefetch: DPDK also uses SW prefetch instructions to limit the effect of memory latency for software pipelining (see the sketch below).
    • Core-thread affinity: threads are affinitized to a particular core and dedicated to certain functions. This prevents reloading L1/L2 with instructions/data when threads hop from core to core.
    • Use of vector instructions: the code implements algorithms using as much of the instruction set as possible - vector (SSE, AVX) paths provide significant speed-up.
    • Function in-lining: DPDK implements a number of performance-critical functions in header files for easier compiler in-lining.
    • Algorithmic optimizations: to implement functions common in network processing, e.g. n-tuple lookups, wildcards, ACLs etc.
    • Hardware offload libraries: hardware offloads can complement the software implementation when the required hardware capability is available, e.g. 5-tuple lookups can be done on most modern NICs, acting in conjunction with a software classifier.
    • Bulk functions: most functions support a “bulk” mode - processing ‘n’ packets simultaneously, which allows software pipelining to overcome memory latency.
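
    To illustrate the software-prefetch and bulk concepts together, here is a minimal sketch
    (not from the deck) of the lookahead pattern used in burst processing; rte_prefetch0() is
    DPDK's prefetch-into-L1 helper, handle_packet() is a hypothetical placeholder, and the
    lookahead distance of 3 is illustrative and workload-dependent:

    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    #define PREFETCH_OFFSET 3  /* illustrative lookahead distance */

    static void handle_packet(struct rte_mbuf *m);  /* application-specific */

    static void process_burst(struct rte_mbuf **bufs, uint16_t nb_rx)
    {
        uint16_t i;

        /* Warm up: start pulling the first few packets into cache. */
        for (i = 0; i < PREFETCH_OFFSET && i < nb_rx; i++)
            rte_prefetch0(rte_pktmbuf_mtod(bufs[i], void *));

        for (i = 0; i < nb_rx; i++) {
            /* Prefetch a packet a few slots ahead into L1 ... */
            if (i + PREFETCH_OFFSET < nb_rx)
                rte_prefetch0(rte_pktmbuf_mtod(bufs[i + PREFETCH_OFFSET],
                                               void *));
            /* ... while processing the current one, whose data
             * should already be resident in cache. */
            handle_packet(bufs[i]);
        }
    }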


  17. PERFORMANCE BENCHMARK DISCLOSURES
    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
    Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
    functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to
    assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
    For more complete information visit www.intel.com/benchmarks.
    Results are based on internal testing and are provided to you for informational purposes. Any differences in your system hardware, software
    or configuration may affect your actual performance.
    1: Linux <2 Mpps / core @ 64 Byte based on Intel® Xeon® E5-2658v2 Processor (DP)/Intel® C600 Series Chipset:
    Linux* Layer 3 IPv4 Forwarding using 10GbE. Baseline config: Intel® Xeon® E5-2658v2 processors M0 Stepping, 2.40GHz, 10 cores, 8 GT/s QPI,
    25MB L3 cache, Dual-Processor configuration, Intel® C600 Series Chipset (C0 stepping), Crown Pass Platform (W2600CR), DDR3 1867MHz, 8 x dual
    rank registered ECC 16GB (total 128GB), 4 memory channels per socket Configuration, 1 DIMM per channel, 4 x Intel® X520-DA4 PCI-Express Gen3
    x8 10 Gb Ethernet NIC (40G/card) (PLX rev. ca) Source: http://cat.intel.com/LaunchLink.aspx?LinkID=3023
    2: DPDK < 52 Mpps / core @ 64 Byte based on Intel® Xeon® Processor Platinum 8180(DP): DPDK testpmd. Baseline Config: based on Intel® Xeon®
    Processor Platinum 8180 (38.5 M Cache, 2.50 GHz, 28 core), Dual-Processor configuration, 98304 MBs over 12 channels @ 2666 MHz, 2 x Intel®
    Ethernet Converged Network Adapter XL710-QDA2 (2X40G) or 4 x Intel® Ethernet Converged Network Adapter 82599ES. Source: DPDK 18.11 Intel
    NIC Performance Report
    3: DPDK < 41 Mpps / core @ 64 Byte based on Intel® Xeon® Processor Platinum 8180(DP): DPDK L3Fwd. Baseline Config: based on Intel® Xeon®
    Platinum 8160 Processor 24C @2.10GHz, Dual-Processor Configuration, Supermicro* platform, Micron* DDR4 2666MHz RDIMMs 6x 16GB(96 GB), 6
    Channels/Socket, 3x Intel® Ethernet Controller XXV710 (4x25G/card). Source: http://cat.intel.com/LaunchLink.aspx?LinkID=4188
    Performance on 2. and 3. is equivalent to Intel® Xeon® Gold 6254 Processor Intel® Ethernet Adapter E810-CQDA2:
    Source: http://cat.intel.com/LaunchLink.aspx?LinkID=4274


  18. This section presents the DPDK roadmap.


  19. 19
    2019 & 2020 Releases
    2019: 19.02 (Feb), 19.05 (May), 19.08 (Aug), 19.11 (Nov, LTS)
    2020: 20.02 (Feb), 20.05 (May), 20.08 (Aug), 20.11 (Nov, LTS)
    Since 16.04, releases use the Ubuntu* numbering scheme of YY.MM.
    * Other names and brands may be claimed as the property of others.


  20. 20
    Long-Term Support (DPDK ##.11 LTS releases)
    • The purpose of Long-Term Support (LTS) is to maintain a stable release of DPDK with back-ported bug fixes over an extended period of time. This provides downstream consumers with a stable target on which to base applications or packages.
    • LTS releases are maintained for 2 years.
    • Bug fixes that do not change the ABI will be back-ported.
    • In general, new features will not be back-ported. There may be occasional exceptions where the following criteria are met:
      • There is a justifiable use case (for example a new PMD).
      • The change is non-invasive.
      • The work of preparing the back-port is done by the proposer.
      • There is support within the community.
    • Releases 16.11, 17.11, 18.11 and 19.11 are LTS releases.


  21. DPDK API Stability
    https://www.dpdk.org/blog/2019/10/10/why-is-abi-stability-important/
    • ABI stability will run for one year following the v19.11 release.
    • ABI breakage windows are aligned with LTS releases.
    • The ABI policy will then be reviewed after this initial year, with the intention of lengthening the stability period, and the period between ABI breakages, to two years.
    21


  22. 22
    DPDK Roadmap - https://core.dpdk.org/roadmap/
    Version 19.11 (November 2019)
    • configurability of Rx offloads
    • rte_flow patterns for NSH, IGMP, AH
    • rte_flow actions for mirroring and multicast
    • Rx metadata in mbuf, with rte_flow API and mlx5 implementation
    • hairpin forwarding offload, with mlx5 implementation
    • VF configuration from host via representor port id
    • Arm N1 platform config
    • Arm optimizations in i40e and ixgbe
    • ice support of DDP, multi-process and flexible descriptor
    • ice rte_flow updates to support RSS, high/low priority flows, DDP profiles
    • ice and iavf avx2 vector path
    • ipn3ke graceful shutdown
    • mlx5 HW support of VLAN id update and push/pop, VF LAG, flow metering
    and EEPROM module
    • virtio packed ring performance optimizations
    • use C11 atomic functions in memif
    • Arm WFE/SEV instructions in spinlock and ring library
    • integrate RCU library with LPM and hash libraries
    • optimized algorithm for resizeable hash table
    • lock-free stack mempool handler
    • lock-free l3fwd algorithms
    • ntb FIFO ring for Rx/Tx
    • eventdev examples in l2fwd-event, l3fwd and ipsec-secgw
    • cryptodev session-less asymmetric crypto
    • Nitrox cryptodev
    • OCTEON TX asymmetric crypto
    • OCTEON TX2 cryptodev
    • OCTEON TX2 inline IPsec using rte_security
    • rte_security support of inline crypto statistics
    • rte_security improved performance for IPsec with software crypto
    • IPsec add security association database
    • ipsec-secgw support of multiple sessions for the same SA
    • QAT stateful decompression
    • regexdev
    • template based ring API
    • sched library configuration more flexible
    • eBPF arm64 JIT
    • KNI IOVA as VA
    • sample application for ioat
    • UBSan in build
    Nice to have - Future
    • multi-process rework
    • automatic UIO/VFIO binding
    • infiniband driver class (ibdev)
    • default configuration from files
    • generic white/blacklisting
    • libedit integration


  23. [Image-only slide; no transcript text.]

  24. OPTIMIZATION NOTICE
    Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel
    microprocessors for optimizations that are not unique to Intel microprocessors. These
    optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does
    not guarantee the availability, functionality, or effectiveness of any optimization on
    microprocessors not manufactured by Intel.
    Microprocessor-dependent optimizations in this product are intended for use with Intel
    microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
    microprocessors. Please refer to the applicable product User and Reference Guides for more
    information regarding the specific instruction sets covered by this notice.
    Notice revision #20110804
