• Hardware today is very different from when TCP was designed.
• Modify the implementation of TCP to improve its performance
  • DCTCP (Data Center TCP)
  • MPTCP (Multi Path TCP)
  • Google BBR (modified congestion control algorithm) (see the sketch below)
• New Protocol
  • []
  • Re-architecting datacenter networks and stacks for low latency and high performance
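As a small illustration of swapping the congestion control algorithm, the sketch below asks the Linux kernel to use BBR for one TCP socket via the TCP_CONGESTION socket option; it assumes the tcp_bbr module is available and simply reports whatever algorithm the kernel actually attached.

/* Minimal sketch: select BBR as the congestion control algorithm for one
 * TCP socket. Assumes a Linux kernel with the tcp_bbr module available;
 * the socket keeps the system default if the option fails. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    const char *algo = "bbr";
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0)
        perror("setsockopt(TCP_CONGESTION)");   /* e.g. bbr module not loaded */

    /* Verify which congestion control algorithm the socket is using now. */
    char buf[16];
    socklen_t len = sizeof(buf);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
        printf("congestion control in use: %.*s\n", (int)len, buf);

    close(fd);
    return 0;
}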
• The NIC places the packet in its receive queue (RX).
• The system (kernel) needs to know that a packet has arrived and pass the data to an allocated buffer (see the sketch below).
  • Polling / interrupt
  • Allocate an sk_buff for the packet
  • Copy the data to user space
  • Free the sk_buff
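The sketch below shows the user-space end of this path, assuming a simple blocking UDP receiver (the port number is an arbitrary example): the recvfrom() call is the point at which the kernel copies the payload out of its sk_buff into the application's buffer.

/* Minimal sketch of the user-space end of the receive path: the kernel
 * copies each packet's payload out of its sk_buff into buf when
 * recvfrom() returns. Port 9000 is an arbitrary example. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); return 1;
    }

    char buf[2048];
    for (;;) {
        /* Blocks until the kernel has received a packet (interrupt or
         * polling), allocated an sk_buff, and queued it on this socket;
         * the payload is then copied into buf and the sk_buff is freed. */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0) { perror("recvfrom"); break; }
        printf("received %zd bytes\n", n);
    }
    close(fd);
    return 0;
}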
• Context switching on blocking I/O
• Interrupt handling in the kernel
• Data copies between user space and kernel space
• Too many unused network stack features
• Additional overhead for each packet in the server
• We can dedicate some CPU cores to handling network packets.
  • Polling mode
• Zero-copy
  • Copy to user space only if the application needs to modify the data.
  • sendfile(…) (see the sketch below)
  • UIO (User-space I/O)
  • mmap (memory mapping)
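As one zero-copy example, the sketch below uses Linux's sendfile() to push a whole file to an already-connected socket without staging it in a user-space buffer; the socket descriptor and file path are placeholders supplied by the caller.

/* Minimal zero-copy sketch, assuming Linux: sendfile() moves file data to a
 * connected socket without bouncing it through a user-space buffer. */
#include <stdio.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send an entire file over an already-connected socket descriptor. */
int send_whole_file(int sock_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(file_fd, &st) < 0) { perror("fstat"); close(file_fd); return -1; }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel forwards pages from the page cache to the socket;
         * no copy into this process's address space is needed. */
        ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (n < 0) { perror("sendfile"); close(file_fd); return -1; }
    }
    close(file_fd);
    return 0;
}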
• User-space network stack
• For each application, we need to handle the following issues.
  • Parse packets (see the sketch below)
    • MAC / VLAN
    • IPv4 / IPv6
    • TCP / UDP / ICMP
  • For TCP, we also need to handle the three-way handshake
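The sketch below shows the kind of per-frame classification such a stack has to do, assuming raw Ethernet frames (e.g. pulled from a DPDK queue or a raw socket) and standard Linux header definitions; IPv6 handling and the TCP state machine are omitted.

/* Minimal sketch of the packet classification a user-space stack must do
 * for every raw frame it pulls off the NIC. */
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>

void classify_frame(const uint8_t *frame, size_t len) {
    if (len < sizeof(struct ethhdr)) return;

    const struct ethhdr *eth = (const struct ethhdr *)frame;
    uint16_t ethertype = ntohs(eth->h_proto);
    size_t offset = sizeof(struct ethhdr);

    /* Skip a single 802.1Q VLAN tag if present. */
    if (ethertype == ETH_P_8021Q && len >= offset + 4) {
        ethertype = (uint16_t)((frame[offset + 2] << 8) | frame[offset + 3]);
        offset += 4;
    }

    if (ethertype != ETH_P_IP) return;          /* IPv6 handling omitted */
    if (len < offset + sizeof(struct iphdr)) return;

    const struct iphdr *ip = (const struct iphdr *)(frame + offset);
    switch (ip->protocol) {
    case IPPROTO_TCP:  printf("TCP segment\n");  break;  /* + handshake state machine */
    case IPPROTO_UDP:  printf("UDP datagram\n"); break;
    case IPPROTO_ICMP: printf("ICMP message\n"); break;
    default:           printf("other IP protocol %u\n", (unsigned)ip->protocol);
    }
}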
• Data
  • Hundreds of MB
  • Some of the data is never modified.
• Too many copy operations (see the sketch below)
  • User space <-> user space
  • User space <-> kernel space
  • Kernel space -> physical devices
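For contrast with the sendfile() sketch above, the function below is a sketch of the conventional read()/write() relay that incurs exactly these copies: from the kernel page cache into a user buffer, then from the user buffer back into a kernel socket buffer, even when the data is never modified. Both descriptors are assumed to be already open.

/* Sketch of the conventional copy chain described in the bullets above. */
#include <stdio.h>
#include <unistd.h>

int relay_with_copies(int file_fd, int sock_fd) {
    char buf[64 * 1024];                 /* user-space staging buffer */
    ssize_t n;
    while ((n = read(file_fd, buf, sizeof(buf))) > 0) {     /* kernel -> user copy */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(sock_fd, buf + off, n - off); /* user -> kernel copy */
            if (w < 0) { perror("write"); return -1; }
            off += w;
        }
    }
    if (n < 0) { perror("read"); return -1; }
    return 0;
}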
• The data is larger than the system L1/L2/L3 caches.
  • Too many cache misses (increased latency)
• A single application alone is unlikely to saturate the network bandwidth.
• The authors report:
  • 20-30 GBs with 4 KB data buffers
  • 2-4 GBs with data buffers > 4 MB
  • Too many cache misses.
• The computational graph / distributed graph partition
  • Register the memory with RDMA/DMA according to the node's type (see the sketch below).
• In TensorFlow
  • Replace the original gRPC transfer path with RDMA
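The sketch below shows only the memory-registration step with libibverbs, assuming an RDMA-capable NIC and using the first device found; the buffer size, access flags, and the omitted queue-pair/connection setup are illustrative and are not TensorFlow's actual verbs code.

/* Minimal sketch of registering a buffer for RDMA with libibverbs.
 * Only the memory-registration step is shown; queue pairs and
 * connection setup are omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    size_t len = 1 << 20;                           /* 1 MB buffer (example size) */
    void *buf = malloc(len);

    /* Pin the buffer and make it visible to the NIC; remote peers can then
     * read/write it directly, bypassing the kernel and extra copies. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}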