
[HiPINEB 2016] Exploring Low-latency Interconnect for Scaling-Out Software Routers

About building an optimized interconnection scheme for multi-node software routers using RDMA-capable Ethernet NICs and our own hardware-assisted batching technique

Joongi Kim

March 12, 2016

  1. HiPINEB 2016: Exploring Low-latency Interconnect for Scaling Out Software Routers
     Sangwook Ma, Joongi Kim, Sue Moon
     School of Computing, KAIST
     2016. 3. 12.
  2. Motivations for Software Routers
     • Packet processing on commodity x86 servers
     • A low-cost alternative to HW routers
     • + Programmability & cost-effectiveness
     • - Lower performance than HW routers
     • Commercial products: Vyatta (Brocade), 1000V (Cisco)
     • Well-known principles for high performance: batching, pipelining, and parallelization
  3. Scaling Out Software Routers
     • Why scale out? Because single boxes have limited port density as a router (e.g., 40 PCIe lanes for the latest single Xeon CPU)
     • Prior work: RouteBricks [SOSP09], ScaleBricks [SIGCOMM15], E2 [SOSP15]
     • Interconnect medium: Ethernet
     • Topology: full mesh (RouteBricks) / central switch (ScaleBricks & E2)
  4. Our Problem: Latency Penalty
     • Two factors of latency increase:
     • Multiple hops among router nodes
     • Aggressive I/O and computation batching
     • Our goal: allow maximum flexibility in future router cluster designs by minimizing interconnect overheads.
  5. Our Solutions
     • RDMA (Remote Direct Memory Access) as the interconnect medium
     • Needs to keep throughput high
     • Hardware-assisted I/O batching
     • Offers lower latency than software-side batching
     • Enhances RDMA throughput for small packets
  6. RDMA as Interconnect
     • An unexplored design choice for routers: all external connections need to be Ethernet-compatible, but scaling out opens a new design space for the interconnect.
     • RDMA provides low latency and high throughput.
     • It reduces the burden on host CPUs by offloading most of the network stack's functionality to hardware.
  7. Hardware-assisted Batching
     • NICs often provide HW-based segmentation.
     • In Ethernet: for jumbo frames that do not fit in a page
     • In RDMA: for fast access to remote pages
     • Batching reduces per-packet overheads: it saves the bandwidth spent on repeated protocol headers and the computation spent parsing/generating them.
     • Doing it in HW leaves more CPU cycles for SW.
  8. Our Contributions
     • We compare the throughput & latency of:
     • Combinations of different RoCE transport/operation modes
     • RoCE (RDMA over Converged Ethernet) vs. Ethernet
     • Result highlights:
     • In RoCE, the UC transport type and SEND/RECV ops offer the maximum performance.
     • RoCE latency is consistently lower than Ethernet's.
     • RoCE throughput is lower than Ethernet's for small packets.
     • HW-assisted batching improves RoCE throughput for small packets to be comparable to Ethernet.
  9. Experiment Setup
     • Packet generator and packet forwarder: commodity servers (Intel Xeon E5-2670v3 @ 2.6 GHz, 32 GB RAM)
     • RDMA-capable NICs on both servers: Mellanox ConnectX-3, 40 Gbps per port
     • Switch: Mellanox SX1036
     • Software stack: Ubuntu 14.04 / kernel 3.16 / Mellanox OFED 3.0.2 / Intel DPDK 2.1
  10. Latency Measurement
     • For both RoCE & Ethernet: API-level granularity
     • 1-hop latency = ((T4 - T1) - (T3 - T2)) / 2
     • (T4 - T1): round-trip time between the TX/RX API calls on the sender
     • (T3 - T2): elapsed time between the RX/TX API calls on the receiver
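Below is a minimal sketch of this computation in C (names are illustrative; T2/T3 would be reported back to the sender along with the reply):

```c
#include <time.h>

/* Microseconds between two CLOCK_MONOTONIC readings. */
static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

/* 1-hop latency = ((T4 - T1) - (T3 - T2)) / 2, where T1/T4 wrap the
 * sender's TX/RX API calls and T2/T3 wrap the receiver's RX/TX calls. */
double one_hop_latency_us(struct timespec t1, struct timespec t2,
                          struct timespec t3, struct timespec t4)
{
    double rtt        = elapsed_us(t1, t4);  /* (T4 - T1), sender side   */
    double turnaround = elapsed_us(t2, t3);  /* (T3 - T2), receiver side */
    return (rtt - turnaround) / 2.0;         /* one direction of the hop */
}
```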
  11. Throughput Measurement
     • Bidirectional throughput over 40GbE, protocol overhead included:
     • Ethernet: 14 bytes / packet
     • RDMA over Ethernet: 72 bytes / message
     • Packet sizes: 64 to 1500 bytes (typical network traffic for routers)
     • Throughput: 10-second average
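As a worked example of these overheads (a sketch based only on the per-unit costs quoted above, not the paper's measurement code): the payload fraction for a 64-byte packet is 64/(64+14) ≈ 0.82 on Ethernet but only 64/(64+72) ≈ 0.47 when each RoCE message carries one packet, and batching several packets per message amortizes the 72-byte cost:

```c
#include <stdio.h>

/* Payload fraction of bytes on the wire for 64-byte packets, using the
 * overheads quoted above: 14 B per Ethernet packet, 72 B per RoCE
 * message. Batching b packets per RoCE message amortizes the 72 B. */
int main(void)
{
    const double pkt = 64.0;
    printf("Ethernet: %.2f\n", pkt / (pkt + 14.0));
    for (int b = 1; b <= 32; b *= 2)
        printf("RoCE, %2d pkts/msg: %.2f\n",
               b, (b * pkt) / (b * pkt + 72.0));
    return 0;
}
```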
  12. RoCE vs. Ethernet: 1-hop Latency (w/o batching)

     Packet size (bytes)       64    128   256   512   1024  1500
     Ethernet (median, usec)   2.9   3.3   4.0   5.3   7.9   10.3
     RoCE (median, usec)       1.5   1.5   1.5   2.0   2.0   2.5

     Deviations at the 10%/90% quantiles are within 0.5 usec for RoCE.
  13. Impact of I/O Batching in Ethernet (packet size fixed at 1500 bytes)

     I/O batch size (packets)    1     2     4     8     16    32
     Latency (median, usec)      10.3  18.0  31.4  53.8  62.1  73.0

     At batch size 32, this is nearly 30x the 2.5 usec of RoCE.
  14. RoCE vs. Ethernet: Throughput
     [Charts: TX/RX throughput (Gbps) of RoCE and Ethernet across packet sizes 64-1500 bytes]
     • Ethernet performs better than RoCE at all packet sizes.
     • The throughput gap is worse for smaller packets.
  15. Mixed Results of RoCE for Routers
     • + RDMA keeps latency under 3 usec at all packet sizes: up to 30x lower than Ethernet under the same conditions.
     • - RDMA throughput < Ethernet throughput when packet size ≤ 1500 B.
     • Our breakthrough: exploit HW-assisted batching!
  16. Potential Benefits of RoCE Batching
     [Charts: 1-way latency (usec) and throughput (Gbps) vs. RoCE message size; median latency grows from 1.5 usec for the smallest messages to 16.5 usec for the largest]
     • With messages ≥ 1500 bytes, RoCE achieves line rate and keeps latency under 17 usec.
  17. How HW-assisted Batching Works
     [Diagram: the sender application combines multiple Ethernet packets into a single RoCE message with one RoCE header; the RoCE NICs transfer it over the interconnect, bypassing the OS network stack (kernel) on both the sender and receiver hosts.]
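A minimal ibverbs sketch of the sender side (not the authors' implementation; qp, mr, and the frame buffers are assumed to be set up elsewhere, and n must not exceed the QP's max_send_sge): a scatter/gather list lets a single SEND work request carry several Ethernet frames as one RoCE message.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Combine n Ethernet frames (already placed in memory registered as mr)
 * into one RoCE message by posting a single SEND work request with an
 * n-entry scatter/gather list. */
int send_batched(struct ibv_qp *qp, struct ibv_mr *mr,
                 void **frames, uint32_t *lens, int n)
{
    struct ibv_sge sge[32];
    for (int i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t)frames[i];
        sge[i].length = lens[i];
        sge[i].lkey   = mr->lkey;
    }
    struct ibv_send_wr wr = {
        .sg_list    = sge,
        .num_sge    = n,                /* n frames, one RoCE header */
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The receiver then sees one completion per message and must re-split the frames, e.g., by prefixing each frame with its length (a framing choice of this sketch, not necessarily the paper's).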
  18. HW-assisted Batching: Throughput
     [Chart: throughput (Gbps) vs. packet size (64-1500 bytes) for non-batching and batch sizes 2-32 packets, with Ethernet as reference]
     • 3.7-4.8x improvement for small packets
     • Generally best batch size: 16
  19. HW-assisted Batching: Latency
     [Chart: 1-way latency (usec, median) vs. packet size (64-1500 bytes) for non-batching and batch sizes 2-32 packets; medians range from 1.5 usec without batching to 13.5 usec with the largest batch and packet sizes]
     • Deviations at the 10%/90% quantiles are within 0.5 usec from the median.
     • Up to 5.4x lower than Ethernet with the same batch size.
  20. Summary & Conclusion • RDMA is a valid alternative as

    an interconnect of scaled-out SW routers. • It reduces I/O latency up to 30x compared to Ethernet • Challenge is its low throughput in packet sizes ≤ 1500 bytes. • We exploit HW-assisted batching to enhance throughput. • It batches multiple Ethernet packets in a single RoCE message. • Our scheme achieves throughput higher or close to Ethernet while still keeps 1-hop latency under 14 usec. 20
  21. Transfer Operations and Connection Types
     • 4 types of RDMA transfer operations: READ, WRITE, SEND, and RECV
     • We use SEND & RECV, which are more suitable for latency-critical applications like packet processing.
     • 3 transport types for RDMA connections: RC (Reliable Connection), UC (Unreliable Connection), UD (Unreliable Datagram)
     • We choose the UC type, which shows the highest throughput of all three.
  22. 4 Types of RDMA Transfer Operations
     • READ & WRITE operations
     • One-sided: the receive side's CPU is unaware of the transfer.
     • READ "pulls" data from remote memory; WRITE "pushes" data into remote memory.
     • SEND & RECV operations
     • Two-sided: the CPUs on both sides are involved.
     • The sender posts a SEND; the receiver posts a RECV to receive the data (see the sketch after this list).
     • We use SEND & RECV, which are more suitable for latency-critical applications like packet processing.
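For illustration, a minimal sketch of the two-sided pattern in ibverbs (assuming a connected QP qp and a registered memory region mr already exist; the receiver must post its RECV before the matching SEND arrives):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Receiver: post a buffer so the NIC has somewhere to place the next
 * incoming SEND. */
int post_recv_buffer(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr;
    return ibv_post_recv(qp, &wr, &bad_wr);
}

/* Sender: post a SEND carrying the buffer's contents. */
int post_send_buffer(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```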
  23. 3 Types of RDMA Connections
     • 3 transport types for RDMA connections: RC (Reliable Connection), UC (Unreliable Connection), UD (Unreliable Datagram)
     • Connected types (RC & UC) support message sizes up to 2 GB but require a fixed sender-receiver connection.
     • RC's ACK/NACK protocol guarantees lossless transfer but consumes link bandwidth.
     • The UD type does not require a fixed connection, but its messages are limited to the MTU and carry an additional 40-byte protocol overhead.
     • The UC type shows the highest throughput of all three, so we use it in this work (see the sketch after this list).
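For illustration, choosing the UC transport in ibverbs comes down to a single field at QP creation time (a sketch: pd and the completion queues are assumed to exist, and the queue depths are arbitrary):

```c
#include <infiniband/verbs.h>

/* Create a queue pair using the UC (Unreliable Connection) transport. */
struct ibv_qp *create_uc_qp(struct ibv_pd *pd,
                            struct ibv_cq *send_cq, struct ibv_cq *recv_cq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = send_cq,
        .recv_cq = recv_cq,
        .qp_type = IBV_QPT_UC,      /* no ACK/NACK protocol, unlike RC */
        .cap = {
            .max_send_wr  = 256,    /* illustrative queue depths */
            .max_recv_wr  = 256,
            .max_send_sge = 16,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);
}
```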
  24. Related Work
     • Implementations of distributed key-value stores: Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14]
     • Acceleration of existing applications: MPI [ICS '03], HBase [IPDPS '12], HDFS [SC '12], Memcached [ICPP '11]
     • These replace the socket interface with RDMA transfer operations.
     • RDMA-like interconnects for rack-scale computing: Scale-Out NUMA [ASPLOS '14], R2C2 [SIGCOMM '15], Marlin [ANCS '14]
  25. Future Work
     • Examine the effect of the number of RDMA connections on performance
     • Measure throughput and latency using real traffic traces
     • Implement a scaled-out SW router prototype using an RDMA interconnect: a cluster with Ethernet ports for the external interfaces and RoCE ports for the interconnect
  26. etc.
     • The "Barcelona" icon in the title slide is by Adam Whitcroft, sponsored by OffScreen.