
[HiPINEB 2016] Exploring Low-latency Interconnect for Scaling-Out Software Routers

Joongi Kim
March 12, 2016

About building an optimized interconnection scheme for multi-node software routers using RDMA-capable Ethernet NICs and our own hardware-assisted batching technique

Transcript

  1. HiPINEB 2016
    Exploring Low-latency Interconnect
    for Scaling Out Software Routers
    Sangwook Ma, Joongi Kim, Sue Moon
    School of Computing, KAIST
    2016. 3. 12.


  2. Motivations for Software Router
    • Packet processing on commodity x86 servers
    • Low-cost alternatives for HW routers
    • Programmability & cost-effectiveness
    • Lower performance than HW routers
    • Commercial products: Vyatta (Brocade), 1000V (Cisco)
    • Well-known principles for high performance
    • Batching, pipelining, and parallelization

  3. Scaling Out Software Routers
    • Why scale out? Because single boxes have:
    • Limited port density as a router
    (e.g., 40 PCIe lanes on the latest single-socket Xeon CPUs)
    • RouteBricks [SOSP09], ScaleBricks [SIGCOMM15], E2 [SOSP15]
    • Interconnect medium: Ethernet
    • Topology: full mesh / central switch
    (Topology diagrams: RouteBricks — full mesh; ScaleBricks & E2 — central switch)

  4. Our Problem: Latency Penalty
    • Two factors of latency increase
    • Multiple hops among router nodes
    • Aggressive I/O and computation batching
    • Our goal: allow maximum flexibility in future router
    cluster designs by minimizing interconnect overheads.

  5. Our Solutions
    • RDMA (Remote Direct Memory Access) as the interconnect medium
    • Needs to keep throughput high
    • Hardware-assisted I/O batching
    • Offers lower latency compared to software-side batching
    • Enhances RDMA throughput with small packets

  6. RDMA as Interconnect
    • It is an unexplored design choice for routers.
    • All external connections need to be compatible: Ethernet.
    • Scaling out opens a new design space: interconnect.
    • RDMA provides low latency and high throughput.
    • It reduces the burden on host CPUs by offloading most
    functionality of the network stack to hardware.

  7. Hardware-assisted Batching
    • NICs often provide HW-based segmentation.
    • In Ethernet: for jumbo frames that do not fit in a single page
    • In RDMA: for fast access to remote pages
    • Batching reduces per-packet overheads.
    • It saves bandwidth for repeated protocol headers and
    computations for parsing/generating them.
    • Doing it in HW leaves more CPU cycles for SW.

  8. Our Contributions
    • We compare throughput & latency of:
    • Combinations of different RoCE transport/operation modes
    • RoCE (RDMA over Converged Ethernet) vs. Ethernet
    • Result Highlights
    • In RoCE, UC transport type and SEND/RECV ops offer the
    maximum performance.
    • RoCE latency is consistently lower than Ethernet.
    • RoCE throughput is lower than Ethernet in small packets.
    • HW-assisted batching improves RoCE throughput for small
    packets to be comparable to Ethernet.

  9. Experiment Setup
    Two commodity servers (Intel Xeon E5-2670v3 @ 2.6 GHz, 32 GB RAM),
    one as packet generator and one as packet forwarder, each equipped
    with RDMA-capable NICs (Mellanox ConnectX-3, 40 Gbps per port) and
    connected through a switch (Mellanox SX1036).
    Software stack:
    Ubuntu 14.04 / kernel 3.16 / Mellanox OFED 3.0.2 / Intel DPDK 2.1

  10. Latency Measurement
    • For both RoCE & Ethernet: API-level granularity
    • 1-hop latency = ( (T4 – T1) – (T3 – T2) ) / 2
    • (T4 – T1): round-trip time between TX/RX APIs in sender
    • (T3 – T2): elapsed time between RX/TX APIs in receiver
    (Timing diagram: the sender records T1 when calling the TX API and T4 when
    its RX API returns; the receiver records T2 when its RX API returns and T3
    when it calls the TX API.)
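
    A minimal C sketch of this computation, assuming nanosecond timestamps
    taken around the TX/RX API calls (the function name and units are ours,
    not from the talk):

    #include <stdint.h>

    /* 1-hop latency = ((T4 - T1) - (T3 - T2)) / 2, converted to usec.
     * t1/t4 are captured in the sender around its TX call and RX return;
     * t2/t3 are captured in the receiver at its RX return and TX call. */
    static double one_hop_latency_usec(uint64_t t1, uint64_t t2,
                                       uint64_t t3, uint64_t t4)
    {
        uint64_t round_trip = t4 - t1;   /* sender-side round trip (ns)   */
        uint64_t receiver   = t3 - t2;   /* receiver processing time (ns) */
        return (double)(round_trip - receiver) / 2.0 / 1000.0;
    }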

  11. Throughput Measurement
    • Bidirectional throughput through 40GbE
    • Protocol overhead included
    • Ethernet = 14 bytes / packet
    • RDMA over Ethernet = 72 bytes / message
    • Packet size: 64 to 1500 bytes
    • Typical network traffic for routers
    • Throughput: 10 sec average
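
    A small C sketch of this overhead accounting, assuming the per-packet and
    per-message overheads quoted above (helper names are ours; physical-layer
    overheads are ignored):

    #include <stdint.h>

    #define ETH_OVERHEAD_BYTES   14U  /* per Ethernet packet */
    #define ROCE_OVERHEAD_BYTES  72U  /* per RoCE message    */

    /* On-wire bytes for npkts Ethernet packets of pkt_size bytes each. */
    static uint64_t eth_wire_bytes(uint64_t npkts, uint32_t pkt_size)
    {
        return npkts * (uint64_t)(pkt_size + ETH_OVERHEAD_BYTES);
    }

    /* On-wire bytes for nmsgs RoCE messages of msg_size bytes each. */
    static uint64_t roce_wire_bytes(uint64_t nmsgs, uint32_t msg_size)
    {
        return nmsgs * (uint64_t)(msg_size + ROCE_OVERHEAD_BYTES);
    }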

  12. RoCE vs. Ethernet: 1-hop Latency (w/o batching)
    Median 1-hop latency (usec) vs. packet size:
    Packet size (bytes):   64    128   256   512   1024  1500
    Ethernet:              2.9   3.3   4.0   5.3   7.9   10.3
    RoCE:                  1.5   1.5   1.5   2.0   2.0   2.5
    Deviations at the 10%/90% quantiles are within 0.5 usec for RoCE.

  13. Impact of I/O Batching in Ethernet
    Median latency (usec) vs. I/O batch size, packet size fixed to 1500 bytes:
    Batch size (packets):  1     2     4     8     16    32
    Ethernet:              10.3  18.0  31.4  53.8  62.1  73.0
    At batch size 32, this is nearly 30x the 2.5 usec of RoCE.

  14. RoCE vs. Ethernet: Throughput
    (Chart: TX/RX throughput in Gbps vs. packet size from 64 to 1500 bytes,
    for RoCE and Ethernet)
    Ethernet performs better than RoCE at all packet sizes.
    The throughput gap is worse at smaller packets.

  15. Mixed Results of RoCE for Routers
    • RDMA keeps latency under 3 usec in all packet sizes.
    • Up to 30x lower than Ethernet under the same conditions
    • RDMA throughput < Ethernet throughput
    when packet size ≤ 1500B.
    Our Breakthrough: Exploit HW-assisted Batching!

  16. Potential Benefits of RoCE Batching
    (Charts: 1-way latency in usec and throughput in Gbps vs. RoCE message size;
    median latency rises from 1.5 usec at the smallest message sizes to 16.5 usec
    at the largest)
    • With packets ≥ 1500 bytes, RoCE achieves line rate & keeps latency
    under 17 usec.

  17. How HW-assisted Batching Works
    (Diagram: the sender application hands a combined RoCE message to its RoCE
    NIC, bypassing the OS network stack; the NIC adds the RoCE header and the
    message crosses the interconnect as Ethernet packets; the receiver's RoCE
    NIC delivers the combined message to the application, again bypassing the
    kernel.)
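
    A minimal C sketch of the sender-side packing, assuming a 2-byte length
    prefix per packet as the intra-message framing (the framing and names are
    ours; the talk does not specify the exact on-wire format):

    #include <stdint.h>
    #include <string.h>

    /* Pack up to npkts Ethernet frames into one RoCE message body.
     * Each frame is prefixed with its little-endian 16-bit length so
     * the receiver can split the combined message back into packets. */
    static size_t pack_batch(uint8_t *msg, size_t msg_cap,
                             uint8_t *const pkts[], const uint16_t lens[],
                             int npkts)
    {
        size_t off = 0;
        for (int i = 0; i < npkts; i++) {
            if (off + 2 + lens[i] > msg_cap)
                break;                        /* message buffer is full */
            msg[off]     = (uint8_t)(lens[i] & 0xff);
            msg[off + 1] = (uint8_t)(lens[i] >> 8);
            memcpy(msg + off + 2, pkts[i], lens[i]);
            off += 2 + lens[i];
        }
        return off;   /* payload size of the combined RoCE message */
    }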

  18. HW-assisted Batching: Throughput
    (Chart: throughput in Gbps vs. packet size from 64 to 1500 bytes, for
    non-batching, batch sizes of 2 to 32 packets, and plain Ethernet)
    3.7~4.8x improvements for small packets
    Generally best batch size: 16

  19. HW-assisted Batching: Latency
    (Chart: median 1-way latency in usec vs. packet size from 64 to 1500 bytes,
    for non-batching and batch sizes of 2 to 32 packets; non-batching stays at
    1.5-2.5 usec, while the largest batch size reaches 13.5 usec at 1500 bytes)
    (Deviations at the 10%/90% quantiles are within 0.5 usec from the median)
    5.4x lower than Ethernet with the same batch size

  20. Summary & Conclusion
    • RDMA is a valid alternative interconnect for scaled-out SW routers.
    • It reduces I/O latency by up to 30x compared to Ethernet.
    • The challenge is its low throughput at packet sizes ≤ 1500 bytes.
    • We exploit HW-assisted batching to enhance throughput.
    • It batches multiple Ethernet packets into a single RoCE message.
    • Our scheme achieves throughput higher than or close to Ethernet
    while still keeping 1-hop latency under 14 usec.

  21. Q & A


  22. Transfer Operations and Connection Types
    • 4 types of RDMA transfer operations
    • READ, WRITE, SEND, and RECV
    • We use SEND & RECV, which are more suitable for latency-
    critical applications like packet processing.
    • 3 transport types for RDMA connections
    • RC (Reliable Connection), UC (Unreliable Connection),
    UD (Unreliable Datagram)
    • We choose the UC type, which shows the highest throughput of
    all types.

  23. 4 Types of RDMA Transfer Operations
    • READ & WRITE operations
    • One-sided operations: the receiving side's CPU is unaware of
    the transfer.
    • READ "pulls" data from remote memory and WRITE "pushes"
    data into remote memory.
    • SEND & RECV operations
    • Two-sided operations: the CPUs on both sides are involved.
    • The sender transmits data using SEND; the receiver posts a
    RECV to accept it.
    • We use SEND & RECV, which are more suitable for latency-
    critical applications like packet processing.
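
    A minimal libibverbs sketch of posting one message as a SEND, assuming
    the QP is already connected and the buffer was registered with
    ibv_reg_mr() (error handling trimmed; names are illustrative):

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Post 'len' bytes starting at mr->addr as a single SEND work request.
     * A matching RECV must already be posted on the remote QP. */
    static int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,        /* two-sided operation  */
            .send_flags = IBV_SEND_SIGNALED,  /* request a completion */
        };
        struct ibv_send_wr *bad_wr = NULL;
        return ibv_post_send(qp, &wr, &bad_wr);
    }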

  24. 3 Types of RDMA Connections
    • 3 transport types for RDMA connections
    • RC (Reliable Connection), UC (Unreliable Connection), UD (Unreliable
    Datagram)
    • Connected types (RC & UC) support message sizes up to 2 GB
    but require a fixed sender-receiver connection.
    • The ACK/NACK protocol of RC guarantees lossless transfer
    but consumes link bandwidth.
    • The UD type does not require a fixed connection, but its message
    size is limited to the MTU and it incurs an additional 40-byte
    protocol overhead.
    • The UC type shows the highest throughput of all types, so we use
    it in this work.
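
    A minimal libibverbs sketch of creating a UC queue pair, assuming a
    protection domain and completion queue are already set up (queue depths
    are illustrative):

    #include <infiniband/verbs.h>

    /* Create a queue pair with the UC transport: connected like RC,
     * but without the ACK/NACK protocol and its bandwidth cost. */
    static struct ibv_qp *create_uc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap = {
                .max_send_wr  = 256,  /* send queue depth    */
                .max_recv_wr  = 256,  /* receive queue depth */
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
            .qp_type = IBV_QPT_UC,    /* Unreliable Connection */
        };
        return ibv_create_qp(pd, &attr);
    }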

  25. Related Work
    • Implementations of distributed key-value stores
    • Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14]
    • Acceleration of existing applications
    • MPI [ICS '03], HBase [IPDPS '12], HDFS [SC '12], Memcached
    [ICPP '11]
    • These replace the socket interface with RDMA transfer operations.
    • RDMA-like interconnects for rack-scale computing
    • Scale-out NUMA [ASPLOS '14], R2C2 [SIGCOMM '15], Marlin
    [ANCS '14]

  26. Future Work
    • Examine the effect of the number of RDMA
    connections on performance
    • Measure throughput and latency using real traffic
    traces
    • Implement a scaled-out SW router prototype using the
    RDMA interconnect
    • A cluster with Ethernet ports for the external interface
    and RoCE ports for the interconnect

  27. etc.
    • The "Barcelona" icon in the title slide is by Adam
    Whitcroft, sponsored by OffScreen.