
DPDK MEETUP TOKYO #1 DPDK INTRODUCTION


Linux Foundation Japan

October 28, 2019


Transcript

  1. 2019/10/28 @ Yahoo! JAPAN LODGE, hosted by the Linux Foundation
    https://www.meetup.com/osn-tokyo/events/265399278/
    IAGS/CPDP/CEE AE
    Naoyuki Mori


  2. 2
    Legal Disclaimers
    Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system
    configuration. Check with your system manufacturer or retailer or learn more at intel.com.
    No computer system can be absolutely secure.
    Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources
    of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.
    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
    measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other
    information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more
    complete information visit http://www.intel.com/benchmarks.
    Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
    SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured
    by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
    Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
    Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost
    savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
    No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
    Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any
    warranty arising from course of performance, course of dealing, or usage in trade.
    All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
    Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are
    accurate.
    © 2019 Intel Corporation.
    Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
    *Other names and brands may be claimed as property of others.


  3. 3
    Agenda
    • DPDK Overview: Positioning, History & Current Status; DPDK with VMware* & Microsoft*
    • DPDK Performance Concepts: Performance Challenge
    • DPDK Roadmap: New Roadmap Items


  4. This section gives a high-level overview of the overall DPDK framework, including the software libraries,
    APIs and drivers that it contains, and the sample applications that can be used to demonstrate how it
    works and test performance.


  5. 5
    Network Workloads
    • Application Processing (L7): Firewall, ADC, Call Processing Stack, BSS/OSS
    • Control Plane Processing (L3-4): BGP, OSPF, RIP, SCTP, SIP, MGCP, RANAP, RNSAP (example: BGP Routing)
    • Packet Processing (L2-3): Classification, Traffic Management, QoS Scheduler, NW Overlays, IPSEC/DTLS (example: QoS Scheduling)
    • Signal Processing (L1): Video Transcoding, Baseband Processing, Speech Codecs (example: Video Transcode)


  6. 6
    DPDK Components
    Core libraries: core functions such as memory management, software rings, timers etc.
    Packet classification: software libraries for hash/exact match, LPM, ACL, etc.
    Accelerated SW libraries: common functions such as fragmentation, reassembly, reordering etc.
    Stats: libraries for collecting and reporting statistics & metrics.
    QoS: libraries for QoS scheduling and metering/policing.
    Packet Framework: libraries for creating complex pipelines in software.
    Device classes:
    ETHDEV: Flow API, MTR API, TM API; PMDs for physical and virtual Ethernet devices
    CRYPTODEV: PMDs for HW and SW crypto accelerators
    COMPRESSDEV: PMDs for HW and SW compression accelerators
    EVENTDEV: event-driven PMDs
    BBDEV: PMDs for HW and SW wireless accelerators
    RAWDEV: PMDs for raw devices
    Security: HW and SW acceleration of security protocols
    Kernel interfaces: IGB_UIO, KNI, UIO_PCI_GENERIC, VFIO, AF_XDP
    DPDK Applications - Network Functions (Cloud, Enterprise, Telco)
    DPDK Fundamentals
    • Implements run-to-completion and pipeline models
    • No scheduler - all devices accessed by polling (see the sketch below)
    • Supports 32-bit and 64-bit OSs, with and without NUMA
    • Scales from Intel® Atom® to Intel® Xeon® processors
    • Number of cores and processors is not limited
    • Optimal packet allocation across DRAM channels
    • Use of 2M & 1G hugepages and cache-aligned structures
    • Uses bulk concepts - processing ‘n’ packets simultaneously
    • Open source and BSD licensed
    • Ease of development - quick prototyping with samples, debugging (gdb), analysis (VTune™, Intel® Performance Counter Monitor (Intel® PCM), PROX)
    * Other names and brands may be claimed as the property of others.
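
    To make the polling and bulk fundamentals concrete, here is a minimal sketch of a DPDK
    run-to-completion loop (not from the deck; the port/queue ids and BURST_SIZE are
    illustrative, and EAL, port and mempool setup are assumed to have been done already):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32  /* illustrative bulk size */

    /* Poll one RX queue and echo packets back out: no interrupts,
     * no scheduler - one tight loop on a dedicated lcore. */
    static void lcore_fwd_loop(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Bulk receive: up to BURST_SIZE packets per call,
             * amortizing PCIe/memory access cost across packets. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
            if (nb_rx == 0)
                continue;              /* nothing yet - keep polling */

            /* Bulk transmit the same packets back out (L2 echo). */
            uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);

            /* Free anything the TX ring could not accept. */
            for (uint16_t i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);
        }
    }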


  7. 7
    Open Source Timeline
    2010-12: Initial DPDK releases from Intel, provided under open source BSD license.
    2013: DPDK.org open source community established; helps to facilitate an increase in the use of, and contributions to, DPDK.
    2014: First fully open source release (1.7). First multi-vendor CPU and NIC support. First OS distro packaging of DPDK (Fedora, FreeBSD etc.). First DPDK Summit community event held.
    2015: Continued increase in multi-vendor CPU and NIC support. Technical Board created to aid technical decision making.
    2016: Community decision to adopt a more formal governance structure.
    2017: Rapid increase in multi-vendor CPU and NIC support. Increased OS distro packaging (RHEL, CentOS, Ubuntu etc.). DPDK Summits extended to PRC and Europe. Support for hardware and software accelerators added.
    2018: DPDK transitions to The Linux Foundation. DPDK Summits extended to India. Strong focus on generic APIs to abstract applications from the underlying platform. Continued expansion of accelerator support (FPGAs, smart NICs, cryptodev, bbdev, compressdev, eventdev etc.).


  8. 8
    Open Source Stats
    A fully open source software project with a strong development community:
    • BSD licensed
    • Website: http://dpdk.org; Git: http://dpdk.org/browse/dpdk/
    Major Contributors: [company logos]
    DPDK Release Stats:
    [Chart: Total Contributors (scale 0-200) and Total Commits (scale 0-1800) per release, from 1.7 through 18.11.]
    * Other names and brands may be claimed as the property of others.


  9. 9
    Multi-Architecture/Multi-Vendor Support
    2014: First non-IA contributions.
    2015: Non-Intel NIC support.
    2016: Significant ARM vendor engagement.
    2017: SoC enhancements. Non-Intel crypto. Event API. Enhanced ARM support.
    CPU Architectures: POWER 8, TILE-Gx, ARM v7/v8
    Poll Mode Drivers: ENIC, BNX2X, MLX4/MLX5, NFP, CXGBE, SZEDATA2, ThunderX, QEDE, ENA, BNXT, DPAA2, OcteonTX, LiquidIO, SFC, AVP, ARK
    Crypto: ARMv8 Crypto
    * Other names and brands may be claimed as the property of others.


  10. DPDK for VMware and Microsoft, presented in 2017
    https://dpdksummit.com/Archive/pdf/2017USA/Accelerating%20NFV%20with%2
    0VMware%27s%20Enhanced%20Network%20Stack%20(ENS)%20and%20Intel%
    27s%20Poll%20Mode%20Drivers%20(PMD).pdf
    https://dpdksummit.com/Archive/pdf/2017USA/Making%20networking%20apps
    %20scream%20on%20Windows%20with%20DPDK.pdf
    DPDK Summit North America 2017 - November 14 - 15, 2017
    10


  11. Why is DPDK fast?
    How do Intel hardware features help performance?


  12. 12
    Packet Performance and Today’s Needs
    Performance advantage of DPDK over Linux* - more free cores for customers’ applications:
    • Linux: < 2 Mpps/core @ 64B [1] (data not touched)
    • DPDK testpmd: < 52 Mpps/core @ 64B [2] (data not touched)
    • DPDK L3Fwd, 1C1T: 32.6 Mpps @ 64B, 50 Gbps @ 256B [3] (data touched)
    • DPDK L3Fwd, 1C2T: 41.6 Mpps @ 64B, 50 Gbps @ 256B [3] (data touched)
    Network bandwidth needs, today through 2020 (400+ Gbps):
    • 1 Gbps: Residential CPE
    • 10 Gbps: Enterprise CPE
    • 10-40 Gbps: CO, Cable Headend
    • 40+ Gbps: National-Regional DC, Core Networks
    • 100+ Gbps: High End
    * Other names and brands may be claimed as the property of others.


  13. Differences between the kernel network stack and user-space DPDK
    Kernel bypass: DPDK runs in user space, accelerated by poll-mode drivers.
    [Diagram, left: kernel-space network driver]
    The application in user space reaches the hardware through system calls and the kernel
    protocol stack: the NIC device raises an interrupt (1), the kernel driver handles the
    descriptors and config via the ring/CSR and socket buffers (2), and packet data is copied
    from kernel memory up to the application (3).
    [Diagram, right: user-space network driver (DPDK)]
    A UIO driver in kernel space exposes the device; the descriptors and config registers are
    mapped straight into user space (1), and the DPDK PMD in the application polls the NIC
    ring/CSR and accesses packet data in user-space memory directly (2) - no interrupts, no
    copies, no system calls on the data path.
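
    As an illustration of the "mapping" step (not from the deck, and not something applications
    normally do by hand - DPDK's EAL performs it internally), here is a hypothetical sketch of
    mapping a device's registers through the Linux UIO framework, assuming the NIC is bound to
    a UIO driver and exposed as /dev/uio0:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR);   /* device bound to e.g. igb_uio */
        if (fd < 0) { perror("open"); return 1; }

        /* UIO exposes memory map N at offset N * page size;
         * the 4096-byte length here is illustrative. */
        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0 * getpagesize());
        if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* The process can now poll device registers and descriptor rings
         * directly - no system call or copy per packet. */
        printf("first register word: 0x%08x\n", regs[0]);

        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }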


  14. 14
    PCIe* Connectivity and Core Usage
    Using run-to-completion or pipeline software models.
    Run-to-Completion Model
    • I/O and application workload can be handled on a single core
    • I/O can be scaled over multiple cores
    • Each core runs PMD packet I/O (Rx/Tx) plus its packet/flow work; NIC RSS mode spreads flows across the cores
    • Can handle more I/O on fewer cores with vectorization; vectorization is good for some applications to boost performance
    Pipeline Model
    • The I/O application disperses packets to other cores (e.g. via a hash stage after PMD packet Rx)
    • Application work (App A, B, C) is performed on other cores
    • A further core runs PMD packet I/O with flow classification for App A, B, C and handles Tx
    [Diagram: dual-socket platform (Processor 0 and 1 linked by QPI/UPI), 10 GbE NICs attached over PCIe, Linux* control plane on physical core 0, per-NUMA-node mempools with queues/rings, buffers and pool caches.]
    GRUB settings are important for performance - see the performance reports on www.dpdk.org:
    default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=2048 isolcpus=1-11,22-33 nohz_full=1-11,22-33 rcu_nocbs=1-11,22-33
    Note: nohz_full and rcu_nocbs are used to disable Linux* kernel interrupts on the isolated cores, which is important for zero-packet-loss tests. Generally, 1G hugepages are used for performance testing.
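
    The per-core pinning described above is handled by DPDK's EAL rather than by the
    application itself. A minimal, hypothetical sketch (the core list and channel count are
    illustrative and should match the isolcpus setting above):

    #include <stdio.h>
    #include <rte_eal.h>
    #include <rte_lcore.h>

    int main(int argc, char **argv)
    {
        /* The EAL parses its own arguments, e.g.:
         *   ./app -l 1-3 -n 4
         * pins one thread to each of cores 1-3 (ideally cores from
         * the isolcpus list) and assumes 4 memory channels. */
        if (rte_eal_init(argc, argv) < 0)
            return 1;

        unsigned int lcore_id;
        /* Each lcore thread is already affinitized to its core. */
        RTE_LCORE_FOREACH(lcore_id)
            printf("lcore %u pinned and ready\n", lcore_id);

        rte_eal_cleanup();
        return 0;
    }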


  15. 15
    High Performance Challenges
    The system can’t keep up with the number of interrupts for packet Rx:
    • Switch from an interrupt-driven network device driver to a poll-mode driver.
    The Linux scheduler causes too much overhead for task switches:
    • Bind a single software thread to a logical core.
    Memory and PCIe access is really slow compared to CPU operations:
    • Process a batch of packets during each software iteration and amortize the access cost over multiple packets.
    Data doesn’t seem to be near the CPU when it needs to be:
    • For memory access, use HW- or SW-controlled prefetching. For PCIe access, use Data Direct I/O to write data directly into cache.
    Access to shared data structures is a bottleneck:
    • Use access schemes that reduce the amount of sharing, e.g. lockless queues for message passing (see the sketch below).
    Page tables are constantly evicted (DTLB thrashing):
    • Allow Linux to use huge pages (2MB, 1GB).
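
    To illustrate the lockless message-passing point above, here is a minimal sketch (not from
    the deck) using DPDK's lock-free rte_ring; the ring name, size and single-producer/
    single-consumer flags are illustrative:

    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_ring.h>

    #define BURST_SIZE 32  /* illustrative */

    /* Create a lock-free SP/SC ring for handing packets from an RX
     * core to a worker core. Size must be a power of two. */
    static struct rte_ring *make_ring(void)
    {
        return rte_ring_create("rx_to_worker", 1024, rte_socket_id(),
                               RING_F_SP_ENQ | RING_F_SC_DEQ);
    }

    /* Producer side (RX core): hand off a burst without taking locks. */
    static void producer(struct rte_ring *r, struct rte_mbuf **bufs,
                         unsigned int nb_rx)
    {
        unsigned int sent = rte_ring_enqueue_burst(r, (void **)bufs,
                                                   nb_rx, NULL);
        while (sent < nb_rx)            /* drop what didn't fit */
            rte_pktmbuf_free(bufs[sent++]);
    }

    /* Consumer side (worker core): take up to BURST_SIZE packets. */
    static unsigned int consumer(struct rte_ring *r, struct rte_mbuf **bufs)
    {
        return rte_ring_dequeue_burst(r, (void **)bufs, BURST_SIZE, NULL);
    }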


  16. 16
    Achieving Performance on the Processor
    HW Concepts
    • Vector instruction set: the CPU core supports vector instructions (SSE: 128-bit integer; AVX: 128-bit integer; AVX2: 256-bit integer; AVX3/AVX-512: 512-bit integer).
    • Huge pages: Intel CPUs support 4K, 2MB and 1GB page sizes. Picking the right page size for the data structure minimizes TLB thrashing. DPDK uses hugetlbfs to manage the physically mapped huge-page area.
    • Hardware prefetch: Intel CPUs support prefetching data into all levels of the cache hierarchy (L1, L2, LLC).
    • Cache and memory alignment: DPDK aligns all its data structures to 64B multiples. This avoids elements straddling cache lines and DDR memory lines, fulfilling requests with single read cycles.
    • Intel Data Direct I/O (DDIO): on Intel® Xeon® E5, E7 and Scalable processors, packet I/O data is placed directly in the LLC on ingress and sourced from the LLC on egress.
    • Cache QoS: allows way-allocation control of the LLC between multiple applications, controlled by software.
    • CPU frequency scaling/Turbo: allows the core to temporarily boost its frequency for higher single-threaded performance.
    • NUMA: Non-Uniform Memory Architecture - DPDK tries to allocate memory as close as possible to the core where the code is executing.
    SW Concepts
    • Complete user-space implementation: allows quick prototyping and development; the compiler can aggressively optimize to use the complete instruction set.
    • Software prefetch: DPDK also uses SW prefetch instructions to limit the effect of memory latency for software pipelining (see the sketch below).
    • Core-thread affinity: threads are affinitized to a particular core and dedicated to certain functions. This prevents reloading L1/L2 with instructions/data when threads hop from core to core.
    • Use of vector instructions: the code implements algorithms using as much of the instruction set as possible - vector (SSE, AVX) paths provide significant speed-up.
    • Function in-lining: DPDK implements a number of performance-critical functions in header files for easier compiler in-lining.
    • Algorithmic optimizations: to implement functions common in network processing, e.g. n-tuple lookups, wildcards, ACLs etc.
    • Hardware offload libraries: hardware offloads can complement the software implementation when the required hardware capability is available, e.g. 5-tuple lookups can be done on most modern NICs, acting in conjunction with a software classifier.
    • Bulk functions: most functions support a “bulk” mode - processing ‘n’ packets simultaneously, which allows software pipelining to overcome memory latency.
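
    To illustrate the software-prefetch and bulk concepts together, here is a minimal sketch
    (not from the deck) of the lookahead pattern used in burst processing; rte_prefetch0() is
    DPDK's prefetch-into-L1 helper, handle_packet() is a hypothetical placeholder, and the
    lookahead distance of 3 is illustrative and workload-dependent:

    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    #define PREFETCH_OFFSET 3  /* illustrative lookahead distance */

    static void handle_packet(struct rte_mbuf *m);  /* application-specific */

    static void process_burst(struct rte_mbuf **bufs, uint16_t nb_rx)
    {
        uint16_t i;

        /* Warm up: start pulling the first few packets into cache. */
        for (i = 0; i < PREFETCH_OFFSET && i < nb_rx; i++)
            rte_prefetch0(rte_pktmbuf_mtod(bufs[i], void *));

        for (i = 0; i < nb_rx; i++) {
            /* Prefetch a packet a few slots ahead into L1 ... */
            if (i + PREFETCH_OFFSET < nb_rx)
                rte_prefetch0(rte_pktmbuf_mtod(bufs[i + PREFETCH_OFFSET],
                                               void *));
            /* ... while processing the current one, whose data
             * should already be resident in cache. */
            handle_packet(bufs[i]);
        }
    }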


  17. PERFORMANCE BENCHMARK DISCLOSURES
    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
    Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
    functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to
    assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
    For more complete information visit www.intel.com/benchmarks.
    Results are based on internal testing and are provided to you for informational purposes. Any differences in your system hardware, software
    or configuration may affect your actual performance.
    1: Linux <2 Mpps / core @ 64 Byte based on Intel® Xeon® E5-2658v2 Processor (DP)/Intel® C600 Series Chipset:
    Linux* Layer 3 IPv4 Forwarding using 10GbE. Baseline config: Intel® Xeon® E5-2658v2 processors M0 Stepping, 2.40GHz, 10 cores, 8 GT/s QPI,
    25MB L3 cache, Dual-Processor configuration, Intel® C600 Series Chipset (C0 stepping), Crown Pass Platform (W2600CR), DDR3 1867MHz, 8 x dual
    rank registered ECC 16GB (total 128GB), 4 memory channels per socket Configuration, 1 DIMM per channel, 4 x Intel® X520-DA4 PCI-Express Gen3
    x8 10 Gb Ethernet NIC (40G/card) (PLX rev. ca) Source: http://cat.intel.com/LaunchLink.aspx?LinkID=3023
    2: DPDK < 52 Mpps / core @ 64 Byte based on Intel® Xeon® Processor Platinum 8180(DP): DPDK testpmd. Baseline Config: based on Intel® Xeon®
    Processor Platinum 8180 (38.5 M Cache, 2.50 GHz, 28 core), Dual-Processor configuration, 98304 MBs over 12 channels @ 2666 MHz, 2 x Intel®
    Ethernet Converged Network Adapter XL710-QDA2 (2X40G) or 4 x Intel® Ethernet Converged Network Adapter 82599ES. Source: DPDK 18.11 Intel
    NIC Performance Report
    3: DPDK < 41 Mpps / core @ 64 Byte based on Intel® Xeon® Processor Platinum 8180(DP): DPDK L3Fwd. Baseline Config: based on Intel® Xeon®
    Platinum 8160 Processor 24C @2.10GHz, Dual-Processor Configuration, Supermicro* platform, Micron* DDR4 2666MHz RDIMMs 6x 16GB(96 GB), 6
    Channels/Socket, 3x Intel® Ethernet Controller XXV710 (4x25G/card). Source: http://cat.intel.com/LaunchLink.aspx?LinkID=4188
    Performance on 2. and 3. is equivalent to Intel® Xeon® Gold 6254 Processor Intel® Ethernet Adapter E810-CQDA2:
    Source: http://cat.intel.com/LaunchLink.aspx?LinkID=4274


  18. This section presents the DPDK roadmap.


  19. 19
    2019 & 2020 Releases
    2019: 19.02 (Feb), 19.05 (May), 19.08 (Aug), 19.11 (Nov, LTS)
    2020: 20.02 (Feb), 20.05 (May), 20.08 (Aug), 20.11 (Nov, LTS)
    Since 16.04, releases use the Ubuntu* numbering scheme of YY.MM.
    * Other names and brands may be claimed as the property of others.


  20. 20
    Long-Term Support (DPDK ##.11 LTS releases)
    • The purpose of Long-Term Support (LTS) is to maintain a stable release of DPDK with back-ported bug fixes over an extended period of time. This provides downstream consumers with a stable target on which to base applications or packages.
    • LTS releases are maintained for 2 years.
    • Bug fixes that do not change the ABI will be back-ported.
    • In general, new features will not be back-ported. There may be occasional exceptions where the following criteria are met:
      • There is a justifiable use case (for example a new PMD).
      • The change is non-invasive.
      • The work of preparing the back-port is done by the proposer.
      • There is support within the community.
    • Releases 16.11, 17.11, 18.11 and 19.11 are LTS releases.


  21. DPDK API Stability
    https://www.dpdk.org/blog/2019/10/10/why-is-abi-stability-important/
    • ABI stability will run for one year following the v19.11 release.
    • ABI breakage windows are aligned with LTS releases.
    • The ABI policy will then be reviewed after this initial year, with the intention of lengthening the stability period, and the period between ABI breakages, to two years.
    21


  22. 22
    DPDK Roadmap - https://core.dpdk.org/roadmap/
    Version 19.11 (November 2019)
    • configurability of Rx offloads
    • rte_flow patterns for NSH, IGMP, AH
    • rte_flow actions for mirroring and multicast
    • Rx metadata in mbuf, with rte_flow API and mlx5 implementation
    • hairpin forwarding offload, with mlx5 implementation
    • VF configuration from host via representor port id
    • Arm N1 platform config
    • Arm optimizations in i40e and ixgbe
    • ice support of DDP, multi-process and flexible descriptor
    • ice rte_flow updates to support RSS, high/low priority flows, DDP profiles
    • ice and iavf avx2 vector path
    • ipn3ke graceful shutdown
    • mlx5 HW support of VLAN id update and push/pop, VF LAG, flow metering
    and EEPROM module
    • virtio packed ring performance optimizations
    • use C11 atomic functions in memif
    • Arm WFE/SEV instructions in spinlock and ring library
    • integrate RCU library with LPM and hash libraries
    • optimized algorithm for resizeable hash table
    • lock-free stack mempool handler
    • lock-free l3fwd algorithms
    • ntb FIFO ring for Rx/Tx
    • eventdev examples in l2fwd-event, l3fwd and ipsec-secgw
    • cryptodev session-less asymmetric crypto
    • Nitrox cryptodev
    • OCTEON TX asymmetric crypto
    • OCTEON TX2 cryptodev
    • OCTEON TX2 inline IPsec using rte_security
    • rte_security support of inline crypto statistics
    • rte_security improved performance for IPsec with software crypto
    • IPsec add security association database
    • ipsec-secgw support of multiple sessions for the same SA
    • QAT stateful decompression
    • regexdev
    • template based ring API
    • sched library configuration more flexible
    • eBPF arm64 JIT
    • KNI IOVA as VA
    • sample application for ioat
    • UBSan in build
    Nice to have - Future
    • multi-process rework
    • automatic UIO/VFIO binding
    • infiniband driver class (ibdev)
    • default configuration from files
    • generic white/blacklisting
    • libedit integration


  23. [Image-only slide; no transcript text.]

  24. OPTIMIZATION NOTICE
    Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel
    microprocessors for optimizations that are not unique to Intel microprocessors. These
    optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does
    not guarantee the availability, functionality, or effectiveness of any optimization on
    microprocessors not manufactured by Intel.
    Microprocessor-dependent optimizations in this product are intended for use with Intel
    microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
    microprocessors. Please refer to the applicable product User and Reference Guides for more
    information regarding the specific instruction sets covered by this notice.
    Notice revision #20110804
