
On the way to io_uring networking


io_uring has set a remarkably high bar for storage performance, so attention has naturally turned to networking as the next frontier. While io_uring has provided basic primitives such as send and recv since its early days, on their own they were not quite enough to compete with traditional networking approaches in real-life scenarios. In this talk, I will walk through the problems we had and the changes we have made, elaborate on the rationale behind them, and finally discuss how userspace can best be designed to take full advantage of io_uring.

Pavel Begunkov

Kernel Recipes

September 30, 2023

Transcript

  1. io_uring meets network
    Kernel Recipes 2023
    Pavel Begunkov


  2. ● IORING_OP_SENDMSG
    ● IORING_OP_RECVMSG

    /* submission */
    struct msghdr msg = { … };
    unsigned msg_flags = MSG_WAITALL;
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_sendmsg(sqe, sockfd, &msg, msg_flags);
    sqe->user_data = tag;
    io_uring_submit(&ring);

    /* completion / waiting */
    ret = io_uring_wait_cqe(&ring, &cqe);
    assert(cqe->user_data == tag);
    result = cqe->res;


  3. Early days execution
    [Flow: submit request → execute with MSG_DONTWAIT → success? → complete;
    otherwise the request is punted to the worker pool]


  4. Polling: IORING_OP_POLL_ADD
    ● Asynchronous, as it should be
    ● Polls a single file
    ● Terminates after the first desired event
    ○ The user has to send another request to continue polling
    ● Can be cancelled by IORING_OP_POLL_REMOVE or IORING_OP_ASYNC_CANCEL
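    A minimal liburing sketch of a one-shot poll request (assuming an already
    initialised struct io_uring named ring, a socket sockfd and a tag value):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, sockfd, POLLIN);  /* one-shot by default */
    sqe->user_data = tag;
    io_uring_submit(&ring);
    /* On completion, cqe->res holds the triggered poll mask; to keep
     * watching the file, submit another poll request (or see multishot
     * poll later in the talk). */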


  5. ● What if we combine IO with polling?
    ● The kernel internally polls when the MSG_DONTWAIT attempt fails
    ● Transparent, the uapi stays the same
    ● Check support with IORING_FEAT_FAST_POLL (sketched below)
    [Flow: submit request → execute nowait → success? → complete;
    on failure → poll → poll event → retry the execution]
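    A sketch of how support can be detected at ring setup (QUEUE_DEPTH is a
    placeholder for whatever queue size the application uses):

    struct io_uring ring;
    struct io_uring_params p = { 0 };

    if (io_uring_queue_init_params(QUEUE_DEPTH, &ring, &p) < 0)
        /* handle the error */;
    if (!(p.features & IORING_FEAT_FAST_POLL))
        /* old kernel: nowait failures get punted to the worker pool */;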


  6. Tip 1: use IORING_RECVSEND_POLL_FIRST with receive requests
    ● Starts with polling, skips the first nowait attempt
    ● Useful when the request is likely to have to wait
    ● Usually not useful for sends
    [Flow: submit → poll → poll event → execute nowait → complete;
    poll again if the attempt fails]
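    A sketch of a poll-first receive; for send/recv requests the flag lives
    in sqe->ioprio (ring, sockfd, buf and len are assumed to exist):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, sockfd, buf, len, 0);
    sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;  /* poll before the first attempt */
    io_uring_submit(&ring);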


  7. Tip 2: io_uring supports MSG_WAITALL and retries short IO
    ● Works with recv as well as sends
    ● Ignored by io_uring unless it’s a streaming socket like TCP

    /* simplified retry loop */
    do {
        left = total_len - done;
        ret = do_io(buf + done, left);
        done += ret;
        // poll_wait();
    } while (done < total_len && (msg_flags & MSG_WAITALL));
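    From userspace, MSG_WAITALL is simply passed in the request’s flags, for
    example (ring, sockfd, buf and len assumed):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, sockfd, buf, len, MSG_WAITALL);
    io_uring_submit(&ring);
    /* cqe->res is len on full success, or a short count / error if the
     * retry loop was cut short, e.g. by the peer closing the connection. */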


  8. Memory consumption
    ● Each recv takes and holds a buffer
    ● Buffers can’t be reused before the recv completes
    ● Many (slow) connections may lock up too much memory
    [Flow: submit recv → execute nowait → failed? → poll → poll event → retry
    → complete; the buffer can’t be reused anywhere along the way]


  9. Provided buffers
    Let the kernel have a buffer pool!
    [Flow: submit with addr=NULL → get a buffer from the pool → execute nowait
    → complete (buffer handed to the user); if the attempt fails, the buffer
    is put back and the request polls until the next event]


  10. Provided buffers: overview
    ● In-kernel buffer pool
    ○ The user can register multiple pools
    ○ Each pool has an ID to refer to it by
    ○ Usually, buffers in a pool are the same size
    ● Don’t set a buffer at submission, e.g. sqe->addr = NULL;
    ○ sqe->flags |= IOSQE_BUFFER_SELECT
    ○ And specify the buffer pool ID to use
    ● The request grabs a buffer on demand
    ○ Requests don’t hold a buffer while polling
    ○ The buffer is grabbed right before attempting to execute
    ● The buffer ID is returned in cqe->flags
    ● The user should keep refilling the pool
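    A sketch of the submission and completion sides (ring, sockfd, len, a
    registered buffer group ID bgid, a tag and a reaped cqe are assumed):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, sockfd, NULL, len, 0);  /* no buffer given here */
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = bgid;                          /* which pool to pick from */
    sqe->user_data = tag;
    io_uring_submit(&ring);

    /* on completion, the chosen buffer ID comes back in cqe->flags */
    if (cqe->flags & IORING_CQE_F_BUFFER) {
        unsigned buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
        /* the data sits in buffer buf_id of group bgid */
    }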


  11. Provided buffers: returning buffers
    ● V1: IORING_OP_PROVIDE_BUFFERS
    ○ Buffers are returned by sending a special request
    ○ Slow and inefficient
    ● V2: IORING_REGISTER_PBUF_RING
    ○ Another kernel/user shared ring
    ○ The user returns buffers by putting them in the ring
    ○ Nicely wrapped by liburing (sketched below)
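    A sketch of the v2 flow with recent liburing (NR_BUFS, BUF_SIZE, bufs[],
    buf_id and bgid are placeholder names):

    int err;
    struct io_uring_buf_ring *br =
        io_uring_setup_buf_ring(&ring, NR_BUFS, bgid, 0, &err);

    /* fill the pool initially */
    for (int i = 0; i < NR_BUFS; i++)
        io_uring_buf_ring_add(br, bufs[i], BUF_SIZE, i,
                              io_uring_buf_ring_mask(NR_BUFS), i);
    io_uring_buf_ring_advance(br, NR_BUFS);

    /* later, return a buffer once userspace is done with it */
    io_uring_buf_ring_add(br, bufs[buf_id], BUF_SIZE, buf_id,
                          io_uring_buf_ring_mask(NR_BUFS), 0);
    io_uring_buf_ring_advance(br, 1);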


  12. Provided buffers v2


  13. Back to polling
    Why do poll requests terminate after the first event?
    [Flow: submit → poll → poll event → complete]


  14. Multishot poll
    [Flow: submit → poll → on each poll event → post a CQE, keep polling]


  15. Multishot accept
    [Flow: submit → poll → poll event → do accept → post a CQE, keep polling]
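    A sketch of arming it (ring, listen_fd and tag assumed); every accepted
    connection posts a CQE with the new file descriptor in cqe->res:

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
    sqe->user_data = tag;
    io_uring_submit(&ring);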


  16. Multishot recv
    [Flow: submit → poll → poll event → get a buffer from the pool → do recv
    → post a CQE, keep polling]
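    A sketch of one multishot recv backed by a provided buffer group
    (ring, sockfd, bgid and tag assumed):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;   /* buffers come from the pool */
    sqe->buf_group = bgid;
    sqe->user_data = tag;
    io_uring_submit(&ring);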


  17. Notes on multishot…
    ● Requests can be cancelled via IORING_OP_ASYNC_CANCEL
    ○ Or by shutting down the socket
    ● Requests can fail…
    ○ Resubmit if the failure is recoverable: out of buffers, CQ is full, -ENOMEM, etc. (sketched below)
    ● The Completion Queue is finite
    ○ io_uring will save overflowed CQEs, but it’s slow
    ■ The user has to enter the kernel to flush overflowed CQEs
    ○ Multishot requests will be terminated
    ● Linked requests don’t work well with multishots
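    On the completion side, a multishot request stays armed only while
    IORING_CQE_F_MORE is set; a sketch of handling termination (handle_data,
    recoverable and rearm_recv are hypothetical application helpers):

    struct io_uring_cqe *cqe;
    unsigned head, count = 0;

    io_uring_for_each_cqe(&ring, head, cqe) {
        if (cqe->res > 0)
            handle_data(cqe);
        if (!(cqe->flags & IORING_CQE_F_MORE)) {
            /* the request terminated: error, no buffers, CQ overflow, ... */
            if (recoverable(cqe->res))
                rearm_recv(cqe->user_data);
        }
        count++;
    }
    io_uring_cq_advance(&ring, count);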


  18. Fixed files
    IOSQE_FIXED_FILE optimises per-request file refcounting
    ● Makes a lot of sense for send requests
    ● Not recommended for potentially time-unbounded requests
    ○ May cause problems
    ● Doesn’t benefit multishots, where the cost is already amortised
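    A sketch of registering one socket and sending through its fixed slot
    (ring, sockfd, buf and len assumed):

    int fds[1] = { sockfd };
    io_uring_register_files(&ring, fds, 1);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    /* with IOSQE_FIXED_FILE, the fd argument is an index into the table */
    io_uring_prep_send(sqe, 0, buf, len, 0);
    sqe->flags |= IOSQE_FIXED_FILE;
    io_uring_submit(&ring);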


  19. Connection management
    IORING_OP_CLOSE - closes a file descriptor
    ● Interoperable with close(2) for regular (non-IOSQE_FIXED_FILE) files
    Close doesn’t kill a connection with in-flight requests
    ● Either cancel the requests
    ● Or IORING_OP_SHUTDOWN / shutdown(2) it first
    There are also IORING_OP_ACCEPT, IORING_OP_CONNECT and IORING_OP_SOCKET
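    A sketch of tearing down a connection with in-flight requests by linking
    a shutdown to a close (ring and sockfd assumed):

    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_shutdown(sqe, sockfd, SHUT_RDWR);
    sqe->flags |= IOSQE_IO_LINK;       /* run the close only after shutdown */

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_close(sqe, sockfd);

    io_uring_submit(&ring);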


  20. Zerocopy
    Zerocopy send
    ● IORING_OP_SEND_ZC: 2 CQEs, “queued” and “completed”
    ● Vectored IO support still needs to be added
    Zerocopy receive
    ● An RFC is out, look for updates
    ● Multishot recv applications are already half prepared
    ● https://lore.kernel.org/io-uring/[email protected]/
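    A sketch of a zerocopy send and its two completions; the first CQE
    carries the send result with IORING_CQE_F_MORE set, the second is the
    notification (IORING_CQE_F_NOTIF) that the buffer may be reused
    (ring, sockfd, buf, len, tag and a reaped cqe assumed):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
    sqe->user_data = tag;
    io_uring_submit(&ring);

    /* completion handling */
    if (cqe->flags & IORING_CQE_F_NOTIF) {
        /* “completed”: the kernel no longer references the buffer */
    } else if (cqe->flags & IORING_CQE_F_MORE) {
        /* “queued”: cqe->res has the send result, notification still to come */
    }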


  21. Task execution
    [Flow: submit request → execute nowait → success? → complete; on failure
    → poll; an IRQ delivers the poll event and notifies the submitter task,
    which retries the execution]


  22. Task work overview
    ● The poll event arrives in an IRQ* context
    ● We wake up the submitter task to execute the IO
    ● task_work is similar to signals, but in-kernel
    ○ Wakes the task if it’s sleeping
    ○ Interrupts any syscall
    ○ Forces userspace into the kernel
    ● The hot path is generally executed by the submitter task


  23. IORING_SETUP_COOP_TASKRUN


  24. IORING_SETUP_COOP_TASKRUN
    ● Doesn’t interrupt running userspace
    ● Still aborts running syscalls
    ● Task work will be executed with the next syscall
    ○ Hence the app has to eventually make a syscall
    ● The user should not busy-poll the CQ
    ○ It’s almost never a good idea regardless


  25. IORING_SETUP_DEFER_TASKRUN


  26. IORING_SETUP_DEFER_TASKRUN
    ● Task work is executed only in the io_uring_enter(2) syscall
    ● The user has to enter the kernel to wait for events
    ● Requires IORING_SETUP_SINGLE_ISSUER (setup sketched below)
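    A sketch of a ring set up this way (QUEUE_DEPTH is a placeholder):

    struct io_uring ring;
    struct io_uring_params p = { 0 };

    p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
    if (io_uring_queue_init_params(QUEUE_DEPTH, &ring, &p) < 0)
        /* older kernel: fall back, e.g. to IORING_SETUP_COOP_TASKRUN */;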


  27. Performance
    Performance highly depends on batching
    ● submission batching
    ● as well as completion batching
    Be prepared for tradeoffs
    ● Wait longer until there is more to submit
    ● Wait for multiple completions, possibly with a timeout (sketched below)
    ● Throughput vs latency
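    One way to get both with liburing; wait_nr and the timeout are the knobs
    trading latency for batching:

    struct io_uring_cqe *cqe;
    struct __kernel_timespec ts = { .tv_nsec = 100 * 1000 };  /* e.g. 100us */

    /* flush pending SQEs and wait for up to wait_nr completions or the timeout */
    io_uring_submit_and_wait_timeout(&ring, &cqe, wait_nr, &ts, NULL);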


  28. Gluing together
    ● One io_uring instance per process
    ○ No need to share, no synchronisation around the queues
    ○ Add IORING_SETUP_SINGLE_ISSUER and IORING_SETUP_DEFER_TASKRUN
    ● Processes communicate via IORING_OP_MSG_RING (sketched below)
    ● Each process serves multiple sockets
    ○ The more sockets per process the better; it improves batching
    ● Simple IORING_OP_SEND[MSG] requests are usually fine
    ○ They often complete by the time the submission syscall returns
    ● One recv request for each socket
    ○ Needs a provided buffer pool
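    A sketch of poking another process’s ring (peer_ring_fd and token are
    placeholder names; the value shows up on the target ring as a CQE with
    user_data == token):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_msg_ring(sqe, peer_ring_fd, 0, token, 0);
    io_uring_submit(&ring);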


  29. Timeouts
    ● CQ waiting with a timeout, see io_uring_wait_cqe_timeout(), etc.
    ● IORING_OP_TIMEOUT - a timeout request, supports multishot
    ● IORING_OP_LINK_TIMEOUT - a per-request timeout (sketched below)
    ○ There is a cost; the app might want to implement it in userspace via
    IORING_OP_TIMEOUT + IORING_OP_ASYNC_CANCEL
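    A sketch of a per-request timeout via a linked timeout (ring, sockfd,
    buf and len assumed):

    struct io_uring_sqe *sqe;
    struct __kernel_timespec ts = { .tv_sec = 5 };

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, sockfd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;            /* the next SQE is its timeout */

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_link_timeout(sqe, &ts, 0);

    io_uring_submit(&ring);
    /* if the recv doesn’t complete within 5s it is cancelled (-ECANCELED) */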


  30. References
    ● liburing - the io_uring userspace library
    github.com/axboe/liburing/
    git://git.kernel.dk/liburing.git
    ● Write-up about networking
    https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023
    ● Benchmarking
    https://github.com/dylanZA/netbench
    ● io_uring mailing list
    io-uring@vger.kernel.org
    ● Zerocopy receive
    https://lore.kernel.org/io-uring/[email protected]/
    ● Folly library: supports io_uring with all modern features
    https://github.com/facebook/folly.git
