
[HiPINEB 2016] Exploring Low-latency Interconnect for Scaling-Out Software Routers

Joongi Kim
March 12, 2016

About building an optimized interconnection scheme for multi-node software routers using RDMA-capable Ethernet NICs and our own hardware-assisted batching technique

Transcript

  1. HiPINEB 2016
    Exploring Low-latency Interconnect
    for Scaling Out Software Routers
    Sangwook Ma, Joongi Kim, Sue Moon
    School of Computing, KAIST
    2016. 3. 12.


  2. Motivations for Software Router
    • Packet processing on commodity x86 servers
    • Low-cost alternatives for HW routers
    • Programmability & cost-effectiveness
    • Lower performance than HW routers
    • Commercial products: Vyatta (Brocade), 1000V (Cisco)
    • Well-known principles for high performance
    • Batching, pipelining, and parallelization

  3. Scaling Out Software Routers
    • Why scale out? Because single boxes have:
    • Limited port density as a router
    (e.g., 40 PCIe lanes on the latest single-socket Xeon CPUs)
    • RouteBricks [SOSP09], ScaleBricks [SIGCOMM15], E2 [SOSP15]
    • Interconnect medium: Ethernet
    • Topology: full mesh / central switch
    (Topology diagrams: RouteBricks — full mesh; ScaleBricks & E2 — central switch)

  4. Our Problem: Latency Penalty
    • Two factors of latency increase
    • Multiple hops among router nodes
    • Aggressive I/O and computation batching
    • Our goal: allow maximum flexibility in future router
    cluster designs by minimizing interconnect overheads.

  5. Our Solutions
    • RDMA (Remote Direct Memory Access) as the interconnect medium
    • Needs to keep throughput high
    • Hardware-assisted I/O batching
    • Offers lower latency compared to software-side batching
    • Enhances RDMA throughput with small packets

  6. RDMA as Interconnect
    • It is an unexplored design choice for routers.
    • All external connections need to be compatible: Ethernet.
    • Scaling out opens a new design space: interconnect.
    • RDMA provides low latency and high throughput.
    • It reduces the burden on host CPUs by offloading most
    functionality of the network stack to hardware.

  7. Hardware-assisted Batching
    • NICs often provide HW-based segmentation.
    • In Ethernet: for jumbo frames that do not fit in a single page
    • In RDMA: for fast access to remote pages
    • Batching reduces per-packet overheads.
    • It saves bandwidth for repeated protocol headers and
    computations for parsing/generating them.
    • Doing it in HW leaves more CPU cycles for SW.

  8. Our Contributions
    • We compare throughput & latency of:
    • Combinations of different RoCE transport/operation modes
    • RoCE (RDMA over Converged Ethernet) vs. Ethernet
    • Result Highlights
    • In RoCE, UC transport type and SEND/RECV ops offer the
    maximum performance.
    • RoCE latency is consistently lower than Ethernet.
    • RoCE throughput is lower than Ethernet in small packets.
    • HW-assisted batching improves RoCE throughput for small
    packets to be comparable to Ethernet.

  9. Experiment Setup
    Two commodity servers (Intel Xeon E5-2670v3 @ 2.6 GHz, 32 GB RAM),
    one as packet generator and one as packet forwarder, each equipped
    with RDMA-capable NICs (Mellanox ConnectX-3, 40 Gbps per port) and
    connected through a switch (Mellanox SX1036).
    Software stack:
    Ubuntu 14.04 / kernel 3.16 / Mellanox OFED 3.0.2 / Intel DPDK 2.1

  10. Latency Measurement
    • For both RoCE & Ethernet: API-level granularity
    • 1-hop latency = ( (T4 – T1) – (T3 – T2) ) / 2
    • (T4 – T1): round-trip time between TX/RX APIs in sender
    • (T3 – T2): elapsed time between RX/TX APIs in receiver
    (Timing diagram: the sender records T1 when calling the TX API and T4 when
    its RX API returns; the receiver records T2 when its RX API returns and T3
    when it calls the TX API.)
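
    A minimal C sketch of this computation, assuming nanosecond timestamps
    taken around the TX/RX API calls (the function name and units are ours,
    not from the talk):

    #include <stdint.h>

    /* 1-hop latency = ((T4 - T1) - (T3 - T2)) / 2, converted to usec.
     * t1/t4 are captured in the sender around its TX call and RX return;
     * t2/t3 are captured in the receiver at its RX return and TX call. */
    static double one_hop_latency_usec(uint64_t t1, uint64_t t2,
                                       uint64_t t3, uint64_t t4)
    {
        uint64_t round_trip = t4 - t1;   /* sender-side round trip (ns)   */
        uint64_t receiver   = t3 - t2;   /* receiver processing time (ns) */
        return (double)(round_trip - receiver) / 2.0 / 1000.0;
    }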

  11. Throughput Measurement
    • Bidirectional throughput through 40GbE
    • Protocol overhead included
    • Ethernet = 14 bytes / packet
    • RDMA over Ethernet = 72 bytes / message
    • Packet size: 64 to 1500 bytes
    • Typical network traffic for routers
    • Throughput: 10 sec average
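
    A small C sketch of this overhead accounting, assuming the per-packet and
    per-message overheads quoted above (helper names are ours; physical-layer
    overheads are ignored):

    #include <stdint.h>

    #define ETH_OVERHEAD_BYTES   14U  /* per Ethernet packet */
    #define ROCE_OVERHEAD_BYTES  72U  /* per RoCE message    */

    /* On-wire bytes for npkts Ethernet packets of pkt_size bytes each. */
    static uint64_t eth_wire_bytes(uint64_t npkts, uint32_t pkt_size)
    {
        return npkts * (uint64_t)(pkt_size + ETH_OVERHEAD_BYTES);
    }

    /* On-wire bytes for nmsgs RoCE messages of msg_size bytes each. */
    static uint64_t roce_wire_bytes(uint64_t nmsgs, uint32_t msg_size)
    {
        return nmsgs * (uint64_t)(msg_size + ROCE_OVERHEAD_BYTES);
    }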

  12. RoCE vs. Ethernet: 1-hop Latency (w/o batching)
    Median 1-hop latency (usec) vs. packet size:
    Packet size (bytes):   64    128   256   512   1024  1500
    Ethernet:              2.9   3.3   4.0   5.3   7.9   10.3
    RoCE:                  1.5   1.5   1.5   2.0   2.0   2.5
    Deviations at the 10%/90% quantiles are within 0.5 usec for RoCE.

  13. Impact of I/O Batching in Ethernet
    Median latency (usec) vs. I/O batch size, packet size fixed to 1500 bytes:
    Batch size (packets):  1     2     4     8     16    32
    Ethernet:              10.3  18.0  31.4  53.8  62.1  73.0
    At batch size 32, this is nearly 30x the 2.5 usec of RoCE.

  14. RoCE vs. Ethernet: Throughput
    (Chart: TX/RX throughput in Gbps vs. packet size from 64 to 1500 bytes,
    for RoCE and Ethernet)
    Ethernet performs better than RoCE at all packet sizes.
    The throughput gap is worse at smaller packets.

  15. Mixed Results of RoCE for Routers
    • RDMA keeps latency under 3 usec in all packet sizes.
    • Up to 30x lower than Ethernet under the same conditions
    • RDMA throughput < Ethernet throughput
    when packet size ≤ 1500B.
    Our Breakthrough: Exploit HW-assisted Batching!

  16. Potential Benefits of RoCE Batching
    (Charts: 1-way latency in usec and throughput in Gbps vs. RoCE message size;
    median latency rises from 1.5 usec at the smallest message sizes to 16.5 usec
    at the largest)
    • With packets ≥ 1500 bytes, RoCE achieves line rate & keeps latency
    under 17 usec.

  17. How HW-assisted Batching Works
    (Diagram: the sender application hands a combined RoCE message to its RoCE
    NIC, bypassing the OS network stack; the NIC adds the RoCE header and the
    message crosses the interconnect as Ethernet packets; the receiver's RoCE
    NIC delivers the combined message to the application, again bypassing the
    kernel.)
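
    A minimal C sketch of the sender-side packing, assuming a 2-byte length
    prefix per packet as the intra-message framing (the framing and names are
    ours; the talk does not specify the exact on-wire format):

    #include <stdint.h>
    #include <string.h>

    /* Pack up to npkts Ethernet frames into one RoCE message body.
     * Each frame is prefixed with its little-endian 16-bit length so
     * the receiver can split the combined message back into packets. */
    static size_t pack_batch(uint8_t *msg, size_t msg_cap,
                             uint8_t *const pkts[], const uint16_t lens[],
                             int npkts)
    {
        size_t off = 0;
        for (int i = 0; i < npkts; i++) {
            if (off + 2 + lens[i] > msg_cap)
                break;                        /* message buffer is full */
            msg[off]     = (uint8_t)(lens[i] & 0xff);
            msg[off + 1] = (uint8_t)(lens[i] >> 8);
            memcpy(msg + off + 2, pkts[i], lens[i]);
            off += 2 + lens[i];
        }
        return off;   /* payload size of the combined RoCE message */
    }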

  18. HW-assisted Batching: Throughput
    (Chart: throughput in Gbps vs. packet size from 64 to 1500 bytes, for
    non-batching, batch sizes of 2 to 32 packets, and plain Ethernet)
    3.7~4.8x improvements for small packets
    Generally best batch size: 16

  19. HW-assisted Batching: Latency
    (Chart: median 1-way latency in usec vs. packet size from 64 to 1500 bytes,
    for non-batching and batch sizes of 2 to 32 packets; non-batching stays at
    1.5-2.5 usec, while the largest batch size reaches 13.5 usec at 1500 bytes)
    (Deviations at the 10%/90% quantiles are within 0.5 usec from the median)
    5.4x lower than Ethernet with the same batch size

  20. Summary & Conclusion
    • RDMA is a valid alternative interconnect for scaled-out SW routers.
    • It reduces I/O latency by up to 30x compared to Ethernet.
    • The challenge is its low throughput at packet sizes ≤ 1500 bytes.
    • We exploit HW-assisted batching to enhance throughput.
    • It batches multiple Ethernet packets into a single RoCE message.
    • Our scheme achieves throughput higher than or close to Ethernet
    while still keeping 1-hop latency under 14 usec.

  21. Q & A


  22. Transfer Operations and Connection Types
    • 4 types of RDMA transfer operations
    • READ, WRITE, SEND, and RECV
    • We use SEND & RECV, which are more suitable for latency-
    critical applications like packet processing.
    • 3 transport types for RDMA connections
    • RC (Reliable Connection), UC (Unreliable Connection),
    UD (Unreliable Datagram)
    • We choose the UC type, which shows the highest throughput of
    all types.

  23. 4 Types of RDMA Transfer Operations
    • READ & WRITE operations
    • One-sided operations: the receiving side's CPU is unaware of
    the transfer.
    • READ "pulls" data from remote memory and WRITE "pushes"
    data into remote memory.
    • SEND & RECV operations
    • Two-sided operations: the CPUs on both sides are involved.
    • The sender transmits data using SEND; the receiver posts a
    RECV to accept it.
    • We use SEND & RECV, which are more suitable for latency-
    critical applications like packet processing.
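
    A minimal libibverbs sketch of posting one message as a SEND, assuming
    the QP is already connected and the buffer was registered with
    ibv_reg_mr() (error handling trimmed; names are illustrative):

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Post 'len' bytes starting at mr->addr as a single SEND work request.
     * A matching RECV must already be posted on the remote QP. */
    static int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,        /* two-sided operation  */
            .send_flags = IBV_SEND_SIGNALED,  /* request a completion */
        };
        struct ibv_send_wr *bad_wr = NULL;
        return ibv_post_send(qp, &wr, &bad_wr);
    }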

  24. 3 Types of RDMA Connections
    • 3 transport types for RDMA connections
    • RC (Reliable Connection), UC (Unreliable Connection), UD (Unreliable
    Datagram)
    • Connected types (RC & UC) support message sizes up to 2 GB
    but require a fixed sender-receiver connection.
    • The ACK/NACK protocol of RC guarantees lossless transfer
    but consumes link bandwidth.
    • The UD type does not require a fixed connection, but its message
    size is limited to the MTU and it incurs an additional 40-byte
    protocol overhead.
    • The UC type shows the highest throughput of all types, so we use
    it in this work.
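
    A minimal libibverbs sketch of creating a UC queue pair, assuming a
    protection domain and completion queue are already set up (queue depths
    are illustrative):

    #include <infiniband/verbs.h>

    /* Create a queue pair with the UC transport: connected like RC,
     * but without the ACK/NACK protocol and its bandwidth cost. */
    static struct ibv_qp *create_uc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap = {
                .max_send_wr  = 256,  /* send queue depth    */
                .max_recv_wr  = 256,  /* receive queue depth */
                .max_send_sge = 1,
                .max_recv_sge = 1,
            },
            .qp_type = IBV_QPT_UC,    /* Unreliable Connection */
        };
        return ibv_create_qp(pd, &attr);
    }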

  25. Related Work
    • Implementations of distributed key-value stores
    • Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14]
    • Acceleration of existing applications
    • MPI [ICS '03], HBase [IPDPS '12], HDFS [SC '12], Memcached
    [ICPP '11]
    • These replace the socket interface with RDMA transfer operations.
    • RDMA-like interconnects for rack-scale computing
    • Scale-out NUMA [ASPLOS '14], R2C2 [SIGCOMM '15], Marlin
    [ANCS '14]

  26. Future Work
    • Examine the effect of the number of RDMA
    connections on performance
    • Measure throughput and latency using real traffic
    traces
    • Implement a scaled-out SW router prototype using the
    RDMA interconnect
    • A cluster with Ethernet ports for the external interface
    and RoCE ports for the interconnect

  27. etc.
    • The "Barcelona" icon in the title slide is by Adam
    Whitcroft, sponsored by OffScreen.