
Improved System Call Batching for Network I/O

Anmol Sarma

March 21, 2019

Transcript

  1. Improving syscall batching for network IO
     Rahul Jadhav, Zhen Cao, Anmol Sarma
     Huawei, 2012 Labs
     NetdevConf, Prague, March 2019
  2. Why?
     • Gaming scenario
       • Multiple small-sized packets across multiple sockets
       • Both TCP and UDP sockets
     • Packet-rate characteristics
       • Less throughput (and packet rate) per connection
       • But thousands of such connections
     • Reverse proxies
       • Handling thousands of connections in parallel (NGINX)
     • Multipath transports
       • Send different/same packets on multiple sockets
  3. Recv progression
     • recv() – user-supplied single buffer
     • recvmsg() – enables SG (scatter-gather) operation via a msgvec
     • recvmmsg() – enables SG on multiple packets
     • And now recvmmmsg() – enables SG on multiple packets across multiple sockets
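     To make the progression concrete, here is a minimal sketch of the existing recvmmsg() step (a standard Linux call, not part of the proposal): one syscall pulls up to VLEN datagrams from a single socket, each into its own scatter-gather buffer. Socket setup is assumed to happen elsewhere; names like drain_socket() are illustrative.

        #define _GNU_SOURCE
        #include <sys/socket.h>
        #include <sys/uio.h>
        #include <stdio.h>

        #define VLEN   8
        #define BUF_SZ 1500

        /* Drain up to VLEN datagrams from one socket with a single syscall.
         * Returns the number of messages received, or -1 on error. */
        static int drain_socket(int fd)
        {
            struct mmsghdr msgs[VLEN];
            struct iovec   iovecs[VLEN];
            char           bufs[VLEN][BUF_SZ];

            for (int i = 0; i < VLEN; i++) {
                iovecs[i].iov_base = bufs[i];
                iovecs[i].iov_len  = BUF_SZ;
                msgs[i] = (struct mmsghdr){ 0 };
                msgs[i].msg_hdr.msg_iov    = &iovecs[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
            }

            /* One syscall, many packets -- but still only one socket. */
            int n = recvmmsg(fd, msgs, VLEN, MSG_DONTWAIT, NULL);
            for (int i = 0; i < n; i++)
                printf("msg %d: %u bytes\n", i, msgs[i].msg_len);
            return n;
        }

     recvmmmsg(), introduced next, removes the remaining "one socket per call" restriction.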
  4. Comparison (diagram: 3 sockets s1–s3, 5 incoming packets, SG buffers buf1/buf2 per packet)
     • recvmsg():   5 packets, 5 system calls (one call per packet)
     • recvmmsg():  5 packets, 3 system calls (one call per ready socket)
     • recvmmmsg(): 5 packets, 1 system call  (one call across all ready sockets)
  5. recvmmmsg() … user-space changes
     • Group together all the sockets on which POLLIN is received
     • Pull all the messages from all the sockets in the same syscall
     • In the kernel, recvmmmsg() internally wraps recvmmsg()

        /**** recvmmmsg() ****/
        #define MAX_FD_ARR       20
        #define MAX_EPOLL_EVENTS 64
        #define MAX_VLEN         100

        void epoll_thread(void)
        {
            struct mmep        eps[MAX_FD_ARR];
            struct epoll_event epoll_ev[MAX_EPOLL_EVENTS];
            struct mmsghdr     msgs[MAX_VLEN];
            int n, cnt = 0, msgcnt;

            /* epollfd is set up elsewhere */
            n = epoll_wait(epollfd, epoll_ev, MAX_EPOLL_EVENTS, -1);
            for (int i = 0; i < n && cnt < MAX_FD_ARR; i++) {
                if (epoll_ev[i].events & EPOLLIN)
                    eps[cnt++].sockfd = epoll_ev[i].data.fd;
            }
            if (cnt > 0) {
                msgcnt = recvmmmsg(eps, cnt, msgs, MAX_VLEN, 0);
                /* Handle returned messages */
            }
        }
  6. Implementation
     • Handling multiple sockets in the same call
       • Output/errnos per socket
       • Need a socket-to-messages mapping: use of a message offset
     • *mmmsg() is always non-blocking
       • MSG_DONTWAIT is implicitly added to flags

        int recvmmmsg(struct mmep *epvec, int eplen, struct mmsghdr *msgvec, int vlen, int flags);
        int sendmmmsg(struct mmep *epvec, int eplen, struct mmsghdr *msgvec, int vlen, int flags);

        struct mmep {
            int sockfd;  /* socket file descriptor */
            int num;     /* write num or return value */
            int offset;  /* starting msgvec offset */
        };

     Message-vector mapping example (from the diagram, msgvec slots 0–8):
       sockfd=8, num=3, offset=0  -> msgvec slots 0–2
       sockfd=7, num=2, offset=3  -> msgvec slots 3–4
       sockfd=9, num=4, offset=5  -> msgvec slots 5–8
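     As an illustration of how an application might walk the per-socket results of the slide-5 call, here is a hedged sketch using the field semantics above. The negative-errno convention for per-socket errors is an assumption, and handle_results() is an illustrative name, not part of the proposal.

        #include <stdio.h>
        /* struct mmep and struct mmsghdr as declared above */

        /* Sketch: after recvmmmsg() returns, eps[i].num is assumed to hold the
         * per-socket count (or a negative error code), and that socket's
         * messages start at msgs[eps[i].offset]. */
        static void handle_results(struct mmep *eps, int cnt, struct mmsghdr *msgs)
        {
            for (int i = 0; i < cnt; i++) {
                if (eps[i].num < 0) {
                    fprintf(stderr, "fd %d: error %d\n", eps[i].sockfd, -eps[i].num);
                    continue;
                }
                for (int j = 0; j < eps[i].num; j++) {
                    struct mmsghdr *m = &msgs[eps[i].offset + j];
                    printf("fd %d: msg of %u bytes\n", eps[i].sockfd, m->msg_len);
                }
            }
        }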
  7. sendmmmsg()
     • Implementation similar to recvmmmsg()
     • Can send the same/different messages from different sockets
     • Why? Consider the gaming scenario using MPUDP with a redundant scheduler (sketch below)

     Mapping example (from the diagram, msgvec slots 0–10):
       Socket=1, num=3, offset=0
       Socket=2, num=3, offset=0   (same messages as socket 1, i.e. a redundant send)
       Socket=3, num=4, offset=3
       Socket=4, num=4, offset=7
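     A short sketch of that redundant case, using the proposed sendmmmsg() prototype from the previous slide. The subflow sockets sd1/sd2 and the 3-entry msgvec msgs are assumed to be set up already.

        /* Send the same 3-message burst on two subflow sockets
         * (redundant MPUDP scheduler): both mmep entries point at
         * msgs[0..2] via the same offset. */
        struct mmep eps[2] = {
            { .sockfd = sd1, .num = 3, .offset = 0 },
            { .sockfd = sd2, .num = 3, .offset = 0 },
        };
        int sent = sendmmmsg(eps, 2, msgs, 3, 0);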
  8. A word about syscall overhead
     • Direct vs. indirect cost
     • Direct cost
       • Mode-switching cost (CPU protection ring 3 to ring 0 and vice versa)
       • = Time(issuing the syscall in user space + resuming execution in kernel space + returning control to user space)
     • Indirect cost
       • Processor state pollution
       • Caused by L1/L2 cache and TLB updates during the switches
     Ref: FlexSC: Flexible System Call Scheduling with Exception-Less System Calls, Soares et al., USENIX
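     The direct cost is easy to see for yourself. The following is an illustrative micro-benchmark, not taken from the slides: it times a deliberately cheap syscall in a loop, so almost all of the measured time is the mode switch plus kernel entry/exit. Absolute numbers vary with the CPU and with speculation mitigations.

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        int main(void)
        {
            enum { N = 1000000 };
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < N; i++)
                syscall(SYS_getpid);   /* forced into the kernel each time */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("~%.0f ns per syscall (direct cost only)\n", ns / N);
            return 0;
        }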
  9. Numbers: direct overhead
     • Server with 512 UDP ports
     • Each port receives 10K UDP datagrams with an inter-packet time of 1 ms
     • Batching: N = epoll_wait(…)

     Batching   Syscall        # of syscalls   Pkts rcvd   Cycles
     N = ~8     recvmsg()      5120000         5120000     25238216185
                recvmmsg()     5120000                     26109759149
                recvmmmsg()    646188                      23398099931 (-7.3%)
     N = ~30    recvmsg()      5120000         5120000     13583450955
                recvmmsg()     5119211                     14913732738
                recvmmmsg()    156289                      11593181642 (-14%)

     Note that recvmmsg() was not able to get any batching improvement at this rate.
  10. Are there any existing alternatives to achieve similar results?
     • IOCB_CMD_POLL – a new kernel polling interface
       • Based on the Linux AIO mechanism
       • io_setup() -> set up an aio_context
       • io_submit() -> submit a vector of I/O control blocks ('struct iocb')
       • io_getevents() -> completion notification; block on a vector of 'struct io_event'
     • Impacts application code substantially
     • Interfaces are difficult to use
     • But can group read/write syscalls together
     Ref:
       1. https://lwn.net/Articles/743714/
       2. https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/

        /* AIO fragment: one iocb writes to sd2, the other reads from sd1.
         * ctx, sd1, sd2, buf, BUF_SZ and r are set up elsewhere. */
        struct iocb cb[2] = {
            { .aio_fildes = sd2, .aio_lio_opcode = IOCB_CMD_PWRITE,
              .aio_buf = (uint64_t)&buf[0], .aio_nbytes = 0 },
            { .aio_fildes = sd1, .aio_lio_opcode = IOCB_CMD_PREAD,
              .aio_buf = (uint64_t)&buf[0], .aio_nbytes = BUF_SZ }
        };
        struct iocb *list_of_iocb[2] = { &cb[0], &cb[1] };

        while (1) {
            r = io_submit(ctx, 2, list_of_iocb);

            struct io_event events[2] = {};
            r = io_getevents(ctx, 2, 2, events, NULL);
            /* next write sends as many bytes as the read just returned */
            cb[0].aio_nbytes = events[1].res;
        }
  11. SO_RCVLOWAT
     • Receive low watermark
     • Threshold used for buffering packets in kernel space
     • POLLIN is generated only when the threshold is crossed
     • Good for applications not sensitive to latency
       • Cloud synchronization
       • Background transfers
     • Applies to video client apps as well
       • Apply SO_RCVLOWAT when the video playout buffer is relatively full
     (Diagram: WiFi -> kernel-space TCP rmem (max = 6 MB) -> user-space app via epoll()/recv(); context switching. Let packets buffer in kernel space and then pull multiple packets together.)
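     For reference, setting the existing watermark is a single setsockopt() call; the byte value below is an illustrative example, not taken from the slides.

        #include <sys/socket.h>

        /* Ask the kernel to buffer at least 'bytes' before the socket
         * is reported readable. */
        static int set_rcvlowat(int fd, int bytes)
        {
            return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &bytes, sizeof(bytes));
        }

        /* e.g. set_rcvlowat(fd, 64 * 1024); */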
  12. Problem we faced
     • SO_RCVLOWAT blocks until the threshold is reached
       • Cannot set LOWAT aggressively
     • Problem of SO_RCVLOWAT with an epoll_wait()-based timeout
       • The epoll_wait() timeout operates across all sockets
       • The SO_RCVTIMEO option does not work with epoll_wait()
  13. SO_RCVLOWAT_TIMEO
     • Adds a timeout parameter alongside LOWAT
     • If the timeout expires and there is any pending data, POLLIN is generated
       • A subsequent recv() fetches whatever data is available
     • If the threshold is crossed before the timeout, the behavior is the same as SO_RCVLOWAT
     • Advantage
       • Can afford to have more aggressive watermarks
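     A minimal usage sketch of the idea only: the slides do not show the actual option interface, so the struct layout, option name spelling, and values below are hypothetical.

        /* Hypothetical sketch: assume the proposed option takes a watermark
         * plus a timeout in one option value. Not the actual patch API. */
        struct rcvlowat_timeo {      /* assumed layout */
            int lowat_bytes;         /* aggressive watermark, e.g. 256 KB */
            int timeout_ms;          /* report POLLIN after this long if any data is pending */
        };

        struct rcvlowat_timeo opt = { .lowat_bytes = 256 * 1024, .timeout_ms = 20 };
        setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT_TIMEO, &opt, sizeof(opt));
        /* POLLIN fires when 256 KB is buffered OR 20 ms elapse with data pending. */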
  14. Concluding remarks
     • Amortizing per-packet cost by syscall batching
     • *mmmsg(): easy-to-use interfaces that gel with existing syscalls
       • Reduces system overhead for user-space multipath transports
     • SO_RCVLOWAT_TIMEO
       • Allows more aggressive watermark settings