WHO AM I
• Hung-Wei Chiu ()
• [email protected]
• hwchiu.com
• Experience
• Software Engineer at Linker Networks
• Software Engineer at Synology (2014~2017)
• Co-Founder of SDNDS-TW
• Open Source experience
• SDN related projects (mininet, ONOS, Floodlight, awesome-sdn)
WHAT WE DISCUSS TODAY
• The Drawbacks of the Current Network Stack
• High Performance Network Models
• DPDK
• RDMA
• Case Study
DRAWBACK OF CURRENT NETWORK STACK
• Linux Kernel Stack
• TCP Stack
• Packets Processing in Linux Kernel
LINUX KERNEL TCP/IP NETWORK STACK
• Have you ever imagined how applications communicate over the network?
[Diagram: Chrome on one Linux host exchanges packets over the network with a www-server on another Linux host]
IN YOUR APPLICATION (CHROME).
User-Space
• Create a socket
• Connect to Aurora-Server (we use TCP)
• Send/receive packets
Kernel-Space
• Copy data from user space
• Handle TCP
• Handle IPv4
• Handle Ethernet
• Handle Physical
• Handle Driver/NIC
Have you ever written a socket program before?
FOR GO LANGUAGE
FOR PYTHON
FOR C LANGUAGE
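The original slide shows a code screenshot; as a stand-in, here is a minimal sketch of the same idea in C with the BSD socket API (the server address and port are placeholders, not values from the slides):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* 1. Create a socket */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        /* 2. Connect to the server (address/port are placeholders) */
        struct sockaddr_in srv = {0};
        srv.sin_family = AF_INET;
        srv.sin_port   = htons(80);
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);
        if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
            perror("connect"); return 1;
        }

        /* 3. Send/receive packets */
        const char *req = "GET / HTTP/1.0\r\n\r\n";
        send(fd, req, strlen(req), 0);

        char buf[4096];
        ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) { buf[n] = '\0'; printf("%s", buf); }

        close(fd);
        return 0;
    }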
Have you ever imagined how the kernel handles those operations?
HOW ABOUT THE KERNEL ?
SEND MESSAGE
• User Space -> send(data….)
• SYSCALL_DEFINE3(….) ← kernel space
• vfs_write
• do_sync_write
• sock_aio_write
• do_sock_write
• __sock_sendmsg
• security_socket_sendmsg(…)
HOW ABOUT THE KERNEL ?
RECEIVE MESSAGE
• User Space -> read(data….)
• SYSCALL_DEFINE3(….) ← kernel space
• …..
WHAT IS THE PROBLEM
• TCP
• Linux Kernel Network Stack
• How Linux processes packets
THE PROBLEM OF TCP
• Designed for WAN network environments
• The hardware is very different now than it was then
• Modify the implementation of TCP to improve its performance
• DCTCP (Data Center TCP)
• MPTCP (Multi Path TCP)
• Google BBR (modified congestion-control algorithm; see the sketch after this list)
• New Protocol
• []
• Re-architecting datacenter networks and stacks for low latency and high performance
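As a concrete illustration of the BBR item above: on Linux the congestion-control algorithm can also be chosen per socket through the TCP_CONGESTION option. A minimal sketch, assuming the tcp_bbr module is available on the host:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        /* Ask the kernel to use BBR for this connection; this fails if the
         * tcp_bbr module is not loaded/allowed on this host. */
        const char *cc = "bbr";
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
            perror("setsockopt(TCP_CONGESTION)");
        return 0;
    }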
THE PROBLEM OF LINUX NETWORK STACK
• Increasing network speeds: 10G → 40G → 100G
• Time between packets gets smaller
• For 1538-byte packets:
• 10 Gbit/s == 1230.4 ns
• 40 Gbit/s == 307.6 ns
• 100 Gbit/s == 123.0 ns
• Refer to http://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
• Network stack challenges at increasing speeds: The 100Gbit/s challenge
THE PROBLEM OF LINUX NETWORK STACK
• For the smallest frame size, 84 bytes:
• At 10 Gbit/s == 67.2 ns per packet (14.88 Mpps, packets per second)
• For a 3 GHz CPU, about 201 CPU cycles per packet
• System call overhead
• 75.34 ns (Intel CPU E5-2630)
• Spinlock + unlock
• 16.1 ns
THE PROBLEM OF LINUX NETWORK STACK
• A single cache-miss:
• 32 ns
• Atomic operations
• 8.25 ns
• Basic sync mechanisms
• Spin (16ns)
• IRQ (2 ~ 14 ns)
SO..
• For the smallest frame size, 84 bytes:
• At 10 Gbit/s we have 67.2 ns per packet (14.88 Mpps)
• But the overheads above already sum to 75.34 + 16.1 + 32 + 8.25 + 14 = 145.69 ns, more than twice the per-packet budget
PACKET PROCESSING
• Let's look at the diagram again
PACKET PROCESSING
• When a network card receives a packet:
• It places the packet in its receive queue (RX)
• The kernel needs to know a packet has arrived and copy the data into an allocated buffer
• Polling/Interrupt
• Allocate an sk_buff for the packet
• Copy the data to user space
• Free the sk_buff
PACKETS PROCESSING IN LINUX
[Diagram: NIC TX/RX queue, driver ring buffer and socket in kernel space; the application in user space]
PROCESSING MODE
• Polling Mode
• Busy Looping
• CPU overloading
• High Network Performance/Throughput
PROCESSING MODE
• Interrupt Mode
• Read the packet when an interrupt arrives
• Reduces CPU overhead
• We didn't have many CPU cores back then
• Worse network performance than polling mode
MIX MODE
• Polling + Interrupt mode (NAPI, "New API")
• Interrupt first, then poll to fetch packets
• Combines the advantages of both modes
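To make the mixed mode concrete, below is a schematic sketch of the NAPI pattern for a hypothetical driver; the mynic structure and its helpers are invented for illustration, and exact kernel signatures differ between kernel versions:

    /* Schematic NAPI pattern; "mynic" and its helpers are hypothetical. */
    #include <linux/interrupt.h>
    #include <linux/kernel.h>
    #include <linux/netdevice.h>

    struct mynic {
        struct napi_struct napi;
        /* ... RX ring, device registers, ... */
    };

    /* Hypothetical hardware helpers (stubs for the sketch). */
    static void mynic_mask_rx_irq(struct mynic *nic)   { /* write device register */ }
    static void mynic_unmask_rx_irq(struct mynic *nic) { /* write device register */ }
    static struct sk_buff *mynic_next_rx_skb(struct mynic *nic) { return NULL; }

    /* Interrupt mode: the IRQ handler does not touch packets; it only masks
     * further RX interrupts and schedules the poll loop. */
    static irqreturn_t mynic_irq(int irq, void *data)
    {
        struct mynic *nic = data;

        mynic_mask_rx_irq(nic);
        napi_schedule(&nic->napi);
        return IRQ_HANDLED;
    }

    /* Polling mode: fetch up to "budget" packets from the RX ring. */
    static int mynic_poll(struct napi_struct *napi, int budget)
    {
        struct mynic *nic = container_of(napi, struct mynic, napi);
        int done = 0;

        while (done < budget) {
            struct sk_buff *skb = mynic_next_rx_skb(nic);
            if (!skb)
                break;
            napi_gro_receive(napi, skb);   /* hand the packet to the stack */
            done++;
        }

        /* Ring drained before the budget ran out: re-enable interrupts and
         * fall back to interrupt mode until the next packet arrives. */
        if (done < budget && napi_complete_done(napi, done))
            mynic_unmask_rx_irq(nic);

        return done;
    }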
SUMMARY
• Linux Kernel Overhead (System calls, locking, cache)
• Context switching on blocking I/O
• Interrupt handling in kernel
• Data copy between user space and kernel space.
• Too many unused network stack features
• Additional overhead for each packet
HOW TO SOLVE THE PROBLEM
• Out-of-tree network stack bypass solutions
• Netmap
• PF_RING
• DPDK
• RDMA
HOW TO SOLVE THE PROBLEM
• How do those models handle a packet in 67.2 ns?
• Batching, preallocation, prefetching
• Staying CPU/NUMA local, avoiding locking
• Reducing syscalls
• Faster, cache-optimal data structures
HOW TO SOLVE.
• Nowadays there are more and more CPU cores in a server
• We can dedicate some CPU cores to handling network packets
• Polling mode
• Zero-Copy
• Copy to user space only if the application needs to modify the data
• sendfile(…) (see the sketch after this list)
• UIO (User Space I/O)
• mmap (memory mapping)
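A minimal sketch of the sendfile(…) item above: the file content flows from the page cache straight to the socket with no round trip through a user-space buffer. The already-connected socket and the file path are assumptions for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send a whole file over an already-connected socket without copying
     * its contents into user space. */
    static int send_file(int sock_fd, const char *path)
    {
        int file_fd = open(path, O_RDONLY);
        if (file_fd < 0) { perror("open"); return -1; }

        struct stat st;
        fstat(file_fd, &st);

        off_t offset = 0;
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
            if (n <= 0) { perror("sendfile"); break; }
        }
        close(file_fd);
        return (offset == st.st_size) ? 0 : -1;
    }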
HIGH PERFORMANCE NETWORKING
• DPDK (Data Plane Development Kit)
• RDMA (Remote Direct Memory Access)
DPDK
• Supported by Intel
• Only Intel NICs were supported at first
• Processor affinity / NUMA
• UIO
• Polling Mode
• Batch packet handling
• Kernel Bypass
• …etc
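A minimal sketch of the DPDK poll-mode pattern, with the port/queue and mbuf-pool setup omitted; port 0 and the burst size are assumptions for illustration:

    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        /* Initialize the EAL: hugepages, dedicated cores, device binding. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* A real application would create an mbuf pool and configure the
         * port and its queues here (rte_pktmbuf_pool_create,
         * rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_dev_start). */

        const uint16_t port_id = 0;            /* assumption: first DPDK port */
        struct rte_mbuf *bufs[BURST_SIZE];

        /* Busy-poll the RX queue: no interrupts, no syscalls, no copy into
         * the kernel -- batches of packets are handed to us in user space. */
        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
            for (uint16_t i = 0; i < nb_rx; i++) {
                /* ... parse/forward bufs[i] here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }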
PACKETS PROCESSING IN DPDK
[Diagram: NIC TX/RX queue; UIO (user-space I/O) driver and ring buffer; DPDK and the application in user space]
COMPARE
[Diagram: side-by-side comparison of two packet paths (network interface card, network driver, network stack, application), annotated with the kernel-space/user-space boundary]
WHAT’S THE PROBLEM.
• Without the Linux Kernel Network Stack
• How do we know what kind of packet we received?
• Layer 2 (MAC/VLAN)
• Layer 3 (IPv4, IPv6)
• Layer 4 (TCP, UDP, ICMP)
USER SPACE NETWORK STACK
• We need to build a user-space network stack
• For each application, we need to handle the following issues:
• Parse packets
• MAC/VLAN
• IPv4/IPv6
• TCP/UDP/ICMP
• For TCP, we need to handle the three-way handshake
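As an illustration of the parsing work this implies, here is a minimal sketch that classifies a raw frame by Ethernet type and IPv4 protocol; a real user-space stack must also handle VLAN tags, IPv6, IP options, reassembly, and the TCP state machine:

    #include <arpa/inet.h>
    #include <net/ethernet.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Classify a raw Ethernet frame received from the NIC / DPDK mbuf. */
    static void classify(const uint8_t *frame, size_t len)
    {
        if (len < sizeof(struct ether_header) + sizeof(struct iphdr))
            return;

        const struct ether_header *eth = (const struct ether_header *)frame;
        if (ntohs(eth->ether_type) != ETHERTYPE_IP) {
            printf("non-IPv4 frame (type 0x%04x)\n", ntohs(eth->ether_type));
            return;
        }

        const struct iphdr *ip =
            (const struct iphdr *)(frame + sizeof(struct ether_header));
        switch (ip->protocol) {
        case IPPROTO_TCP:  printf("TCP segment\n");  break;  /* then: ports, handshake, ... */
        case IPPROTO_UDP:  printf("UDP datagram\n"); break;
        case IPPROTO_ICMP: printf("ICMP message\n"); break;
        default:           printf("other IP protocol %u\n", ip->protocol);
        }
    }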
FOR ALL EXISTING NETWORK APPLICATIONS
• Rewrite all socket-related APIs to the DPDK API
• DIY
• Find some OSS to help you
• dpdk-ans (C)
• mTCP (C)
• yanff (Go)
• These projects provide a BSD-like socket interface
SUPPORT DPDK?
• Storage
• Ceph
• Software Switch
• BESS
• FD.io
• Open vSwitch
• ..etc
A USE CASE
• Software switch
• Application
• Combine both of the above (run the application as a VM or container)
[Diagram: one host runs Open vSwitch (DPDK) in user space on top of DPDK NICs; another host runs "My Application" directly on a DPDK NIC in user space]
[Diagram: Open vSwitch (DPDK) in user space with DPDK NICs and two containers]
How do the containers connect to the Open vSwitch?
PROBLEMS OF CONNECTION
• Use veth
• Back into kernel space again
• Performance degradation
• virtio_user (see the sketch below)
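A minimal sketch of the virtio_user approach referenced above, following the pattern in DPDK's virtio_user documentation: the containerized DPDK application creates a virtio_user vdev that attaches to the vhost-user socket exposed by OVS-DPDK. The socket path is an assumption for illustration:

    /* Sketch: attach a containerized DPDK app to OVS-DPDK via virtio_user.
     * The socket path and vdev parameters are illustrative assumptions. */
    #include <stdio.h>
    #include <rte_eal.h>

    int main(void)
    {
        /* EAL arguments: create a virtio_user virtual device backed by the
         * vhost-user socket that Open vSwitch (DPDK) listens on. */
        char *eal_argv[] = {
            "app",
            "--no-pci",                                   /* no physical NICs inside the container */
            "--vdev=virtio_user0,path=/var/run/openvswitch/vhostuser0",
            NULL
        };
        int eal_argc = 3;

        if (rte_eal_init(eal_argc, eal_argv) < 0) {
            fprintf(stderr, "EAL init failed\n");
            return 1;
        }
        /* From here the virtio_user port shows up as a normal DPDK ethdev
         * (port 0) and can be polled with rte_eth_rx_burst() as usual. */
        return 0;
    }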
RDMA
• Remote Direct Memory Access
• Originated from DMA (Direct Memory Access)
• Access memory without involving the CPU
ADVANTAGES
• Zero-Copy
• Kernel bypass
• No CPU involvement
• Message based transactions
• Scatter/Gather entries support.
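To make "zero-copy, kernel bypass" concrete, here is a partial sketch of a one-sided RDMA WRITE with libibverbs; it assumes the protection domain pd, a connected queue pair qp, and the peer's remote_addr/rkey were already set up and exchanged out of band:

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Push a local buffer into the peer's memory with a one-sided RDMA WRITE. */
    static int rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                          void *buf, size_t len,
                          uint64_t remote_addr, uint32_t rkey)
    {
        /* Register the local buffer so the NIC can DMA from it directly. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: no CPU on the remote host */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion entry */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        /* Post the work request; completion is later reaped from the CQ
         * with ibv_poll_cq(), without per-byte kernel involvement. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }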
WHAT IT PROVIDES
• Low CPU usage
• High throughput
• Low-latency
• You can't have all of those at the same time
• Refer to: Tips and tricks to optimize your RDMA code
SUPPORT RDMA
• Storage
• Ceph
• DRBD (Distributed Replicated Block Device)
• Tensorflow
• Case Study - Towards Zero Copy Dataflows using RDMA
CASE STUDY
• Towards Zero Copy Dataflows using RDMA
• SIGCOMM 2017 poster
• Introduction
• What is the problem?
• How is it solved?
• How is it implemented?
• Evaluation
INTRODUCTION
• Based on TensorFlow
• Distributed
• Based on RDMA
• Zero Copy
• Copy problem
• Contributed to TensorFlow (merged)
WHAT ARE THE PROBLEMS?
• Dataflow
• Directed Acyclic Graph
• Large data
• Hundreds of MB
• Some data is unmodified
• Too many copy operations
• User Space <-> User Space
• User Space <-> Kernel Space
• Kernel Space -> Physical devices
WHY DATA COPY IS THE BOTTLENECK
• The data buffer is bigger than the system's L1/L2/L3 caches
• Too many cache misses (increased latency)
• A single application is unlikely to saturate the network bandwidth
• The authors say:
• 20-30 GB/s for 4 KB data buffers
• 2-4 GB/s for data buffers > 4 MB
• Too many cache misses
HOW TO SOLVE
• Too many data copy operations
• Same device:
• Use DMA to pass data
• Different devices:
• Use RDMA
• In order to read/write the remote GPU:
• GPUDirect RDMA (developed by NVIDIA)
HOW TO IMPLEMENT
• Implement a memory allocator
• Parse the computational graph/distributed graph partition
• Register the memory for RDMA/DMA according to the node's type
• In TensorFlow
• Replace the original gRPC transport with RDMA
EVALUATION (TARGET)
• TensorFlow v1.2
• Based on gRPC
• RDMA zero-copy TensorFlow
• Yahoo's open-source RDMA TensorFlow (still has some copy operations)
EVALUATION (RESULT)
• RDMA (zero copy) vs. gRPC
• 2.43x
• RDMA (zero copy) vs. Yahoo version
• 1.21x
• Number of GPUs: 16 vs. 1
• 13.8x
Q&A?
EVALUATION (HARDWARE)
• Server * 4
• Dual 6-core Intel Xeon E5-2603 v4 CPUs
• 4 Nvidia Tesla K40m GPUs
• 256 GB DDR4-2400MHz
• Mellanox MT27500 40GbE NIC
• Switch
• 40GbE RoCE switch
• Priority Flow Control
EVALUATION (SOFTWARE)
• VGG16 CNN Model
• Model parameter size is 528 MB
• Synchronous
• Number of parameter servers (PS) == number of workers
• Workers
• Use CPU+GPU
• Parameter Server
• Only CPU