WHO AM I
• Hung-Wei Chiu ()
• [email protected]
• hwchiu.com
• Experience
• Software Engineer at Linker Networks
• Software Engineer at Synology (2014~2017)
• Co-Founder of SDNDS-TW
• Open Source experience
• SDN related projects (mininet, ONOS, Floodlight, awesome-sdn)
WHAT WE DISCUSS TODAY
• The Drawbacks of the Current Network Stack
• High Performance Network Models
• DPDK
• RDMA
• Case Study
DRAWBACK OF CURRENT NETWORK STACK
• Linux Kernel Stack
• TCP Stack
• Packets Processing in Linux Kernel
LINUX KERNEL TCP/IP NETWORK STACK
• Have you ever imagined how applications communicate over the network?
[Diagram: Chrome on one Linux host exchanges packets over the network with a www-server on another Linux host]
IN YOUR APPLICATION (CHROME).
User-Space
• Create a socket
• Connect to Aurora-Server (we use TCP)
• Send/receive packets
Kernel-Space
• Copy data from user space
• Handle TCP
• Handle IPv4
• Handle Ethernet
• Handle Physical
• Handle Driver/NIC
Have you ever written a socket program before?
FOR GO LANGUAGE
FOR PYTHON
FOR C LANGUAGE
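The original slide shows a code screenshot; as a stand-in, here is a minimal sketch of the same idea in C with the BSD socket API (the server address and port are placeholders, not values from the slides):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* 1. Create a socket */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        /* 2. Connect to the server (address/port are placeholders) */
        struct sockaddr_in srv = {0};
        srv.sin_family = AF_INET;
        srv.sin_port   = htons(80);
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);
        if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
            perror("connect"); return 1;
        }

        /* 3. Send/receive packets */
        const char *req = "GET / HTTP/1.0\r\n\r\n";
        send(fd, req, strlen(req), 0);

        char buf[4096];
        ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) { buf[n] = '\0'; printf("%s", buf); }

        close(fd);
        return 0;
    }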
Have you ever imagined how the kernel handles those operations?
HOW ABOUT THE KERNEL ?
SEND MESSAGE
• User Space -> send(data….)
• SYSCALL_DEFINE3(….) ← kernel space
• vfs_write
• do_sync_write
• sock_aio_write
• do_sock_write
• __sock_sendmsg
• security_socket_sendmsg(…)
HOW ABOUT THE KERNEL ?
RECEIVE MESSAGE
• User Space -> read(data….)
• SYSCALL_DEFINE3(….) ← kernel space
• …..
WHAT IS THE PROBLEM
• TCP
• Linux Kernel Network Stack
• How Linux processes packets
THE PROBLEM OF TCP
• Designed for WAN network environments
• The hardware is very different now than it was then
• Modify the implementation of TCP to improve its performance
• DCTCP (Data Center TCP)
• MPTCP (Multi Path TCP)
• Google BBR (modified congestion-control algorithm; see the sketch after this list)
• New Protocol
• []
• Re-architecting datacenter networks and stacks for low latency and high performance
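As a concrete illustration of the BBR item above: on Linux the congestion-control algorithm can also be chosen per socket through the TCP_CONGESTION option. A minimal sketch, assuming the tcp_bbr module is available on the host:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        /* Ask the kernel to use BBR for this connection; this fails if the
         * tcp_bbr module is not loaded/allowed on this host. */
        const char *cc = "bbr";
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
            perror("setsockopt(TCP_CONGESTION)");
        return 0;
    }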
THE PROBLEM OF LINUX NETWORK STACK
• Increasing network speeds: 10G → 40G → 100G
• Time between packets gets smaller
• For 1538-byte packets:
• 10 Gbit/s == 1230.4 ns
• 40 Gbit/s == 307.6 ns
• 100 Gbit/s == 123.0 ns
• Refer to http://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
• Network stack challenges at increasing speeds: The 100Gbit/s challenge
THE PROBLEM OF LINUX NETWORK STACK
• For the smallest frame size, 84 bytes:
• At 10 Gbit/s == 67.2 ns per packet (14.88 Mpps, packets per second)
• For a 3 GHz CPU, about 201 CPU cycles per packet
• System call overhead
• 75.34 ns (Intel CPU E5-2630)
• Spinlock + unlock
• 16.1 ns
THE PROBLEM OF LINUX NETWORK STACK
• A single cache-miss:
• 32 ns
• Atomic operations
• 8.25 ns
• Basic sync mechanisms
• Spin (16ns)
• IRQ (2 ~ 14 ns)
SO..
• For the smallest frame size, 84 bytes:
• At 10 Gbit/s we have 67.2 ns per packet (14.88 Mpps)
• But the overheads above already sum to 75.34 + 16.1 + 32 + 8.25 + 14 = 145.69 ns, more than twice the per-packet budget
PACKET PROCESSING
• Let's look at the diagram again
PACKET PROCESSING
• When a network card receives a packet:
• It places the packet in its receive queue (RX)
• The kernel needs to know a packet has arrived and copy the data into an allocated buffer
• Polling/Interrupt
• Allocate an sk_buff for the packet
• Copy the data to user space
• Free the sk_buff
PACKETS PROCESSING IN LINUX
[Diagram: NIC TX/RX queue, driver ring buffer and socket in kernel space; the application in user space]
PROCESSING MODE
• Polling Mode
• Busy Looping
• CPU overloading
• High Network Performance/Throughput
PROCESSING MODE
• Interrupt Mode
• Read the packet when an interrupt arrives
• Reduces CPU overhead
• We didn't have many CPU cores back then
• Worse network performance than polling mode
MIX MODE
• Polling + Interrupt mode (NAPI, "New API")
• Interrupt first, then poll to fetch packets
• Combines the advantages of both modes
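To make the mixed mode concrete, below is a schematic sketch of the NAPI pattern for a hypothetical driver; the mynic structure and its helpers are invented for illustration, and exact kernel signatures differ between kernel versions:

    /* Schematic NAPI pattern; "mynic" and its helpers are hypothetical. */
    #include <linux/interrupt.h>
    #include <linux/kernel.h>
    #include <linux/netdevice.h>

    struct mynic {
        struct napi_struct napi;
        /* ... RX ring, device registers, ... */
    };

    /* Hypothetical hardware helpers (stubs for the sketch). */
    static void mynic_mask_rx_irq(struct mynic *nic)   { /* write device register */ }
    static void mynic_unmask_rx_irq(struct mynic *nic) { /* write device register */ }
    static struct sk_buff *mynic_next_rx_skb(struct mynic *nic) { return NULL; }

    /* Interrupt mode: the IRQ handler does not touch packets; it only masks
     * further RX interrupts and schedules the poll loop. */
    static irqreturn_t mynic_irq(int irq, void *data)
    {
        struct mynic *nic = data;

        mynic_mask_rx_irq(nic);
        napi_schedule(&nic->napi);
        return IRQ_HANDLED;
    }

    /* Polling mode: fetch up to "budget" packets from the RX ring. */
    static int mynic_poll(struct napi_struct *napi, int budget)
    {
        struct mynic *nic = container_of(napi, struct mynic, napi);
        int done = 0;

        while (done < budget) {
            struct sk_buff *skb = mynic_next_rx_skb(nic);
            if (!skb)
                break;
            napi_gro_receive(napi, skb);   /* hand the packet to the stack */
            done++;
        }

        /* Ring drained before the budget ran out: re-enable interrupts and
         * fall back to interrupt mode until the next packet arrives. */
        if (done < budget && napi_complete_done(napi, done))
            mynic_unmask_rx_irq(nic);

        return done;
    }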
SUMMARY
• Linux Kernel Overhead (System calls, locking, cache)
• Context switching on blocking I/O
• Interrupt handling in kernel
• Data copy between user space and kernel space.
• Too many unused network stack features
• Additional overhead for each packet
HOW TO SOLVE THE PROBLEM
• Out-of-tree network stack bypass solutions
• Netmap
• PF_RING
• DPDK
• RDMA
HOW TO SOLVE THE PROBLEM
• How do those models handle a packet in 67.2 ns?
• Batching, preallocation, prefetching
• Staying CPU/NUMA local, avoiding locking
• Reducing syscalls
• Faster, cache-optimal data structures
HOW TO SOLVE.
• Nowadays there are more and more CPU cores in a server
• We can dedicate some CPU cores to handling network packets
• Polling mode
• Zero-Copy
• Copy to user space only if the application needs to modify the data
• sendfile(…) (see the sketch after this list)
• UIO (User Space I/O)
• mmap (memory mapping)
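A minimal sketch of the sendfile(…) item above: the file content flows from the page cache straight to the socket with no round trip through a user-space buffer. The already-connected socket and the file path are assumptions for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send a whole file over an already-connected socket without copying
     * its contents into user space. */
    static int send_file(int sock_fd, const char *path)
    {
        int file_fd = open(path, O_RDONLY);
        if (file_fd < 0) { perror("open"); return -1; }

        struct stat st;
        fstat(file_fd, &st);

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
            if (n <= 0) { perror("sendfile"); break; }
        }
        close(file_fd);
        return (offset == st.st_size) ? 0 : -1;
    }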
HIGH PERFORMANCE NETWORKING
• DPDK (Data Plane Development Kit)
• RDMA (Remote Direct Memory Access)
DPDK
• Supported by Intel
• Only Intel NICs were supported at first
• Processor affinity / NUMA
• UIO
• Polling Mode
• Batch packet handling
• Kernel Bypass
• …etc
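A minimal sketch of the DPDK poll-mode pattern, with the port/queue and mbuf-pool setup omitted; port 0 and the burst size are assumptions for illustration:

    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        /* Initialize the EAL: hugepages, dedicated cores, device binding. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* A real application would create an mbuf pool and configure the
         * port and its queues here (rte_pktmbuf_pool_create,
         * rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_dev_start). */

        const uint16_t port_id = 0;            /* assumption: first DPDK port */
        struct rte_mbuf *bufs[BURST_SIZE];

        /* Busy-poll the RX queue: no interrupts, no syscalls, no copy into
         * the kernel -- batches of packets are handed to us in user space. */
        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
            for (uint16_t i = 0; i < nb_rx; i++) {
                /* ... parse/forward bufs[i] here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }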
PACKETS PROCESSING IN DPDK
[Diagram: NIC TX/RX queue; UIO (user-space I/O) driver and ring buffer; DPDK and the application in user space]
COMPARE
[Diagram: side-by-side comparison of two packet paths (network interface card, network driver, network stack, application), annotated with the kernel-space/user-space boundary]
WHAT’S THE PROBLEM.
• Without the Linux Kernel Network Stack
• How do we know what kind of packet we received?
• Layer 2 (MAC/VLAN)
• Layer 3 (IPv4, IPv6)
• Layer 4 (TCP, UDP, ICMP)
USER SPACE NETWORK STACK
• We need to build a user-space network stack
• For each application, we need to handle the following issues:
• Parse packets
• MAC/VLAN
• IPv4/IPv6
• TCP/UDP/ICMP
• For TCP, we need to handle the three-way handshake
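As an illustration of the parsing work this implies, here is a minimal sketch that classifies a raw frame by Ethernet type and IPv4 protocol; a real user-space stack must also handle VLAN tags, IPv6, IP options, reassembly, and the TCP state machine:

    #include <arpa/inet.h>
    #include <net/ethernet.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Classify a raw Ethernet frame received from the NIC / DPDK mbuf. */
    static void classify(const uint8_t *frame, size_t len)
    {
        if (len < sizeof(struct ether_header) + sizeof(struct iphdr))
            return;

        const struct ether_header *eth = (const struct ether_header *)frame;
        if (ntohs(eth->ether_type) != ETHERTYPE_IP) {
            printf("non-IPv4 frame (type 0x%04x)\n", ntohs(eth->ether_type));
            return;
        }

        const struct iphdr *ip =
            (const struct iphdr *)(frame + sizeof(struct ether_header));
        switch (ip->protocol) {
        case IPPROTO_TCP:  printf("TCP segment\n");  break;  /* then: ports, handshake, ... */
        case IPPROTO_UDP:  printf("UDP datagram\n"); break;
        case IPPROTO_ICMP: printf("ICMP message\n"); break;
        default:           printf("other IP protocol %u\n", ip->protocol);
        }
    }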
FOR ALL EXISTING NETWORK APPLICATIONS
• Rewrite all socket-related APIs to the DPDK API
• DIY
• Find some OSS to help you
• dpdk-ans (C)
• mTCP (C)
• yanff (Go)
• These projects provide a BSD-like socket interface
SUPPORT DPDK?
• Storage
• Ceph
• Software Switch
• BESS
• FD.io
• Open vSwitch
• ..etc
A USE CASE
• Software switch
• Application
• Combine both of the above (run the application as a VM or container)
[Diagram: one host runs Open vSwitch (DPDK) in user space on top of DPDK NICs; another host runs "My Application" directly on a DPDK NIC in user space]
[Diagram: Open vSwitch (DPDK) in user space with DPDK NICs and two containers]
How do the containers connect to the Open vSwitch?
PROBLEMS OF CONNECTION
• Use veth
• Back into kernel space again
• Performance degradation
• virtio_user (see the sketch below)
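A minimal sketch of the virtio_user approach referenced above, following the pattern in DPDK's virtio_user documentation: the containerized DPDK application creates a virtio_user vdev that attaches to the vhost-user socket exposed by OVS-DPDK. The socket path is an assumption for illustration:

    /* Sketch: attach a containerized DPDK app to OVS-DPDK via virtio_user.
     * The socket path and vdev parameters are illustrative assumptions. */
    #include <stdio.h>
    #include <rte_eal.h>

    int main(void)
    {
        /* EAL arguments: create a virtio_user virtual device backed by the
         * vhost-user socket that Open vSwitch (DPDK) listens on. */
        char *eal_argv[] = {
            "app",
            "--no-pci",                                   /* no physical NICs inside the container */
            "--vdev=virtio_user0,path=/var/run/openvswitch/vhostuser0",
            NULL
        };
        int eal_argc = 3;

        if (rte_eal_init(eal_argc, eal_argv) < 0) {
            fprintf(stderr, "EAL init failed\n");
            return 1;
        }
        /* From here the virtio_user port shows up as a normal DPDK ethdev
         * (port 0) and can be polled with rte_eth_rx_burst() as usual. */
        return 0;
    }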
RDMA
• Remote Direct Memory Access
• Originated from DMA (Direct Memory Access)
• Access memory without involving the CPU
ADVANTAGES
• Zero-Copy
• Kernel bypass
• No CPU involvement
• Message based transactions
• Scatter/Gather entries support.
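To make "zero-copy, kernel bypass" concrete, here is a partial sketch of a one-sided RDMA WRITE with libibverbs; it assumes the protection domain pd, a connected queue pair qp, and the peer's remote_addr/rkey were already set up and exchanged out of band:

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Push a local buffer into the peer's memory with a one-sided RDMA WRITE. */
    static int rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                          void *buf, size_t len,
                          uint64_t remote_addr, uint32_t rkey)
    {
        /* Register the local buffer so the NIC can DMA from it directly. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: no CPU on the remote host */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion entry */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        /* Post the work request; completion is later reaped from the CQ
         * with ibv_poll_cq(), without per-byte kernel involvement. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }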
WHAT IT PROVIDES
• Low CPU usage
• High throughput
• Low-latency
• You can't have all of those at the same time
• Refer to: Tips and tricks to optimize your RDMA code
SUPPORT RDMA
• Storage
• Ceph
• DRBD (Distributed Replicated Block Device)
• Tensorflow
• Case Study - Towards Zero Copy Dataflows using RDMA
CASE STUDY
• Towards Zero Copy Dataflows using RDMA
• SIGCOMM 2017 poster
• Introduction
• What is the problem?
• How is it solved?
• How is it implemented?
• Evaluation
INTRODUCTION
• Based on TensorFlow
• Distributed
• Based on RDMA
• Zero Copy
• Copy problem
• Contributed to TensorFlow (merged)
WHAT ARE THE PROBLEMS?
• Dataflow
• Directed Acyclic Graph
• Large data
• Hundreds of MB
• Some data is unmodified
• Too many copy operations
• User Space <-> User Space
• User Space <-> Kernel Space
• Kernel Space -> Physical devices
WHY DATA COPY IS THE BOTTLENECK
• The data buffer is bigger than the system's L1/L2/L3 caches
• Too many cache misses (increased latency)
• A single application is unlikely to saturate the network bandwidth
• The authors say:
• 20-30 GB/s for 4 KB data buffers
• 2-4 GB/s for data buffers > 4 MB
• Too many cache misses
HOW TO SOLVE
• Too many data copy operations
• Same device:
• Use DMA to pass data
• Different devices:
• Use RDMA
• In order to read/write the remote GPU:
• GPUDirect RDMA (developed by NVIDIA)
HOW TO IMPLEMENT
• Implement a memory allocator
• Parse the computational graph/distributed graph partition
• Register the memory for RDMA/DMA according to the node's type
• In TensorFlow
• Replace the original gRPC transport with RDMA
EVALUATION (TARGET)
• TensorFlow v1.2
• Based on gRPC
• RDMA zero-copy TensorFlow
• Yahoo's open-source RDMA TensorFlow (still has some copy operations)
EVALUATION (RESULT)
• RDMA (zero copy) vs. gRPC
• 2.43x
• RDMA (zero copy) vs. Yahoo version
• 1.21x
• Number of GPUs: 16 vs. 1
• 13.8x
Q&A?
EVALUATION (HARDWARE)
• Server * 4
• Dual 6-core Intel Xeon E5-2603 v4 CPUs
• 4 Nvidia Tesla K40m GPUs
• 256 GB DDR4-2400MHz
• Mellanox MT27500 40GbE NIC
• Switch
• 40GbE RoCE switch
• Priority Flow Control
EVALUATION (SOFTWARE)
• VGG16 CNN Model
• Model parameter size is 528 MB
• Synchronous
• Number of parameter servers (PS) == number of workers
• Workers
• Use CPU+GPU
• Parameter Server
• Only CPU