
[EuroSys 2015] Network Balancing Act

Joongi Kim
April 23, 2015


A High-Performance Packet Processing Framework for Heterogeneous Processors. Received the Best Student Paper award.
The NBA framework is the world's first 80 Gbps-grade generic packet processing framework, although application-specific prototypes had previously reached that performance. The framework's API resembles the Click modular router, while it transparently supports composition of offloadable elements (e.g., GPU-accelerated ones) with an adaptive load balancer.


Transcript

  1. Network Balancing Act: A High-performance Packet Processing Framework for Heterogeneous Processors
     Joongi Kim, Keon Jang, Keunhong Lee, Sangwook Ma, Junhyun Shim, Sue Moon (KAIST)
     ACM EuroSys 2015
  2. 4/23/2015, Joongi Kim, KAIST (The NBA logo is a trademark of the National Basketball Association in the United States.)
  3. Network Balancing Act
     § Network packet processing requires balancing:
     • Matching the speed of I/O and computation
     • Flexibility vs. high performance
     • Hardware vs. software
     • Workload distribution to multi-cores & heterogeneous processors
     § NBA is a slogan towards a world where:
     • We do not need to manually balance them any more!
     • We do not need to trade off one for the other!
  4. A Brief History
     § Flexibility vs. Performance
     [Chart: performance (single machine) vs. flexibility; the Click Modular Router [Kohler et al., 2000] reaches 1-3 Gbps]
  5. The Click modular router [Kohler et al., 2000]
     § Fine-grained modular abstraction of processing pipelines
     • Elements: per-packet processing functions as composable modules
     [Pipeline diagram with elements: FromDevice, CheckIPHeader, IPLookup, IPsecESPEncap, AES128Encrypt, HMACSHA1, ToDevice, Discard]
     Pros: flexibility to create, combine, and reuse elements
     Cons: not designed for multi-cores
  6. A Brief History
     § Flexibility vs. Performance
     [Chart: Click Modular Router [Kohler et al., 2000] (1-3 Gbps), RouteBricks [Dobrescu et al., 2009] (8.7 Gbps), PacketShader [Han et al., 2010] (40 Gbps), DoubleClick [Kim et al., 2012] and Snap [Sun et al., 2013] (28-30 Gbps); goal: a next-generation packet processing framework]
  7. Flexibility + Performance = NBA
     § Useful ideas from existing work
     • Modular, reusable abstraction of packet processing functions
     • Multi-core scalable + NUMA support
     • GPU acceleration
     § Our contributions
     • Optimized batched packet processing pipeline while preserving intuitive per-packet modular abstractions
     • New offloading abstraction improving reusability of GPU buffers with more optimization opportunities
     • Adaptive load balancer eliminating manual optimization efforts
  8. The NBA Framework
     § Packet I/O and multi-core structure implemented using Intel DPDK
  9. The NBA Framework
     § Worker threads running elements like the Click modular router (pinned to individual CPU cores)
  10. The NBA Framework
     § Device threads that handle GPU offloading
  11. How NBA Makes Your Life Easier
     § A top-down example
     • A middlebox that encrypts all incoming packets and forwards them.
     § We want to make it:
     • Fast (i.e., NUMA- and multi-core-scalable)
     • Accelerated by GPUs, but without manual optimization
     § …for free!
     FromQueue -> MyEncrypt(key 0x123456789) -> L2Forward() -> ToQueue;
  12. Writing CPU Element in Click
     § You write a per-packet function.
     class MyEncrypt : Element {
     protected:
       long key;
     public:
       const char *class_name() const { return "MyEncrypt"; }
       const char *port_count() const { return "1/1"; }
       int configure(Vector<String> &args) { key = …; }
       void push(int in_port, Packet *p) {
         WritablePacket *q = p->uniqueify();
         my_great_encrypt_func(key, q->data(), q->length());
         output(0).push(q);
       }
     };
  13. Writing CPU Element in NBA
     § You still write a per-packet function! (with extensions)
     class MyEncrypt : Element {
     protected:
       long key;
     public:
       const char *class_name() const { return "MyEncrypt"; }
       const char *port_count() const { return "1/1"; }
       int configure(ThreadContext *ctx, vector<string> &args) { key = …; }
       int initialize_global() { return 0; }
       int initialize_per_node() { return 0; }
       int initialize() { return 0; }
       int process(int in_port, Packet *p, Annotation *anno) {
         my_great_encrypt_func(key, p->data(), p->length());
         return 0; // out_port
       }
     };
     Using NBA gives the advantages of multi-core architectures! (up to 80 Gbps)
  14. Inside NBA: Computation Batching
     § Two solutions
     • Wrap process() in a for-loop over the batch of input packets.
     • Use a branch prediction technique to mitigate batch-split problems.
     for (int i = 0; i < batch_size; i++) {
       elem->process(batch->in_port, batch->packets[i]);
       // → inlined: my_great_encrypt_func(key, pkt->data(), pkt->length());
     }
  15. Inside NBA: Simple Branch Prediction
     § Exploit the statistics that most packets take the same path.
     [Diagram: a branch splits the input batch into Path A and Path B. The "majority" path reuses the batch object and masks the branched packets; the "minority" path allocates a new batch.]
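The majority/minority batch split described in this slide can be sketched in a few lines. This is a minimal standalone illustration, not NBA's actual implementation; the names `PacketBatch` and `split_batch` are hypothetical:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical simplified batch: a set of packets plus an exclusion mask.
// (Real frameworks hold packet buffer pointers; ints stand in here.)
struct PacketBatch {
    std::vector<int> packets;
    std::vector<bool> excluded;   // masked-out packets stay in place
    explicit PacketBatch(std::vector<int> p)
        : packets(std::move(p)), excluded(packets.size(), false) {}
};

// Split by a per-packet branch decision. The input batch object is reused
// for the majority path (minority packets are only masked out), and a new,
// usually small batch is allocated for the minority path.
template <typename Pred>
PacketBatch split_batch(PacketBatch &batch, Pred takes_minority_path) {
    std::vector<int> minority;
    for (std::size_t i = 0; i < batch.packets.size(); i++) {
        if (!batch.excluded[i] && takes_minority_path(batch.packets[i])) {
            batch.excluded[i] = true;           // mask in the majority batch
            minority.push_back(batch.packets[i]);
        }
    }
    return PacketBatch(std::move(minority));
}
```

The point of the mask is that the common case (all packets take the majority path) touches no allocator at all.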
  16. Inside NBA: Effects of Branch Prediction
     § The branching overhead drops from 38% to 10% when 1% of packets take the minority path.
  17. Accelerating (offloading) with GPU
     § What you need to do:
     Concept: CPU Element → GPU Element → CPU Element
     Reality: the batch pipeline between the CPU elements involves preparing buffers, host-to-device copies, launching the GPU kernel, synchronization, device-to-host copies, and dispatch.
  18. Accelerating (offloading) with GPU
     § What NBA does for you:
     NBA: CPU Element → GPU Element → CPU Element, with the offloading pipeline managed by the framework
     Snap [Sun et al., 2013], an attempt to do it on Click: the pipeline is composed of explicit Batcher, Dispatcher, H2DCopy, D2HCopy, CompletionQ, and Debatcher elements around the GPU element
  19. Writing GPU Elements in NBA
     § What you need to do:
     • Specify which devices the element supports.
     • Specify what data will be copied to/from the device.
     • Write your device kernel function.
     • Register your device init/launch handlers to the framework.
  20. Copying Packets to/from GPU
     § You declare "datablocks", and NBA will handle the rest.
     § A datablock defines byte ranges to copy (fixed or variable per packet)…
     § …and the direction of each range (host-to-device / device-to-host).
     NBA aggregates datablocks from multiple batches and copies them using a single API call, all automatically and efficiently.
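A datablock declaration along these lines might look like the following minimal sketch. All types and names here are hypothetical illustrations of the concept; NBA's real datablock API differs:

```cpp
#include <cstddef>

// Hypothetical copy direction for a declared byte range.
enum class Dir { HostToDevice, DeviceToHost, Both };

// Hypothetical datablock descriptor: each datablock names a byte range of
// every packet and how it travels between host and device.
struct RangeDesc {
    bool variable;    // fixed-length (header fields) vs. variable-length (payload)
    std::size_t offset;    // byte offset within the packet
    std::size_t length;    // used only when variable == false
    Dir direction;
};

// e.g., an IPsec-style payload datablock: read+write the variable-length
// payload that starts after a 14-byte Ethernet header.
constexpr RangeDesc payload_block{true, 14, 0, Dir::Both};

// e.g., a router-style header datablock: read-only fixed 20-byte IPv4 header.
constexpr RangeDesc ipv4_header_block{false, 14, 20, Dir::HostToDevice};
```

Given such descriptors, the framework (not the element author) can gather the ranges from every packet in every batch and issue one aggregated copy per direction.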
  21. Benefits of Datablocks (1/2)
     § The datablock abstraction covers common packet processing applications.
     Example App                  | Datablock Types
     Router                       | Fixed-length byte range to read header fields
     IPsec Encryption & MyEncrypt | Variable-length byte range to read/write payloads
     NAT                          | Fixed-length byte range to read/write header fields
     WAN Optimizer                | Variable-length byte range to read payloads
     Datablocks are highly customizable.
  22. Benefits of Datablocks (2/2)
     § Different GPU elements may share the same datablocks.
     § This enables the following optimizations:
     • Reuse of datablocks shared by subsequent GPU elements (ongoing future work)
     • Coalesced copy of different datablocks (not included in NBA) [Sun et al., 2013]
     [Diagram: AES128Encrypt and HMACSHA1 sharing the FlowIDs, Payloads, and InitialVectorByFlow datablocks]
  23. Is the GPU always faster?

  24. A Pitfall of GPU Offloading
     § Sub-optimal throughputs when offloading everything
  25. Adaptive Load Balancer
     § Find the optimal offloading weight using a feedback loop.
     § Challenges
     • Jitter → lengthen the history of the low-pass filter
     • Local maxima → periodically perturb the weight
     [Diagram: the load balancer splits work between the CPU function (weight 1 − w) and the GPU function (weight w) of a GPU element, adjusting w by ±δ depending on whether system throughput is increasing or decreasing.]
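The feedback loop on this slide can be sketched as a simple hill climber. This is an illustrative toy, not NBA's implementation; the class name `AdaptiveLB` and the update rule details are assumptions:

```cpp
#include <deque>

// Hypothetical sketch of the adaptive load balancer idea: adjust the GPU
// offloading weight w by +/- delta depending on whether the smoothed system
// throughput improved since the last adjustment.
class AdaptiveLB {
public:
    AdaptiveLB(double delta, std::size_t history)
        : w_(0.5), delta_(delta), dir_(+1.0), history_(history) {}

    // Feed one throughput sample; returns the new offloading weight in [0,1].
    double update(double throughput, bool perturb = false) {
        samples_.push_back(throughput);
        if (samples_.size() > history_) samples_.pop_front();
        double avg = 0.0;
        for (double s : samples_) avg += s;
        avg /= samples_.size();            // low-pass filter: a longer history damps jitter

        if (avg < last_avg_) dir_ = -dir_; // throughput dropped: reverse direction
        last_avg_ = avg;
        if (perturb) dir_ = -dir_;         // periodic perturbation to escape local maxima

        w_ += dir_ * delta_;
        if (w_ < 0.0) w_ = 0.0;
        if (w_ > 1.0) w_ = 1.0;
        return w_;
    }

private:
    double w_, delta_, dir_;
    double last_avg_ = 0.0;
    std::size_t history_;
    std::deque<double> samples_;
};
```

The two knobs the Q&A slides call "heuristic" (update interval and δ) appear here directly: a small δ converges slowly but surely, a large one reacts fast but oscillates.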
  26. Performance of Adaptive Load Balancer
     § The ALB achieves over 92% of manually tuned throughputs.
     NBA's adaptive load balancer does the optimization for you!
  27. Further Details
     § Performance evaluation
     • Multi-core scalability
     • Per-app CPU-only / GPU-only throughputs
     • Latency
     § Please refer to the paper!
  28. Conclusion
     § The world's first 80 Gbps packet processing framework
     • As well as a next-generation Click
     § NBA is now open-source!
     • https://github.com/ANLAB-KAIST/NBA
     § Future work
     • (ongoing) Optimization of datablocks
     • "Advanced" adaptive CPU/GPU load balancer
     • Extension to Intel Xeon Phi
     § NBA is an evolving framework!
  29. Q&A. Thanks!

  30. Back-up Slides

  31. Performance Gains by Computation Batching
     § Optimal batch size for most configurations: 64 packets/batch
     § In this experiment, the I/O batch size was fixed to 64.
  32. Performance Gains by GPU
     § You know the conclusion: "it depends."
     § Our applications
     • For the IPv4 router: the CPU is better. (Offloading overheads are larger than the performance benefits, because the two memory lookups are fast enough on CPUs.)
     • For the IPv6 router: the GPU is better.
     • For IPsec: it depends on the packet sizes & workloads. (Mostly due to buffer copy overheads.)
     • For IDS: the GPU is better.
     § Refer to the paper for exact numbers!
  33. Writing GPU Kernels in NBA (1/3)
     § We said "it is straightforward" in Slide #19.
     • Yes, it is!
     • The complexity of the parallelized algorithm itself is out of our scope.
     § What you need to do:
     • Find the reference to the data item for each GPU thread.
     • Run your function on it.
     • Notify the framework on completion.
  34. Writing GPU Kernels in NBA (2/3)
     § Find the reference to datablock items from multiple batches.
     ex) 4 batches with 64 packets + 1 batch with 14 packets = 270 items
     [Diagram: conceptual mapping of items in batches 1-5 to GPU threads, e.g., batch_idx = 4, item_idx = 20, count = 270]
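The flattening of per-batch items into one global index space (as in the 4 × 64 + 14 = 270 example) could be built on the host side roughly like this. The names `ItemIndex` and `build_index` are hypothetical, not NBA's API:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch: flatten items from multiple batches into a
// single global index space so each GPU thread i can find its (batch, item)
// pair via two lookup arrays.
struct ItemIndex {
    std::vector<int> batch_ids;  // batch_ids[i] = which batch item i came from
    std::vector<int> item_ids;   // item_ids[i]  = index within that batch
};

ItemIndex build_index(const std::vector<int> &batch_sizes) {
    ItemIndex idx;
    for (std::size_t b = 0; b < batch_sizes.size(); b++) {
        for (int j = 0; j < batch_sizes[b]; j++) {
            idx.batch_ids.push_back(static_cast<int>(b));
            idx.item_ids.push_back(j);
        }
    }
    return idx;
}
```

These two arrays correspond to the `batch_ids` / `item_ids` kernel parameters shown on the next slide: thread i simply reads `batch_ids[i]` and `item_ids[i]`.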
  35. Writing GPU Kernels in NBA (3/3)
     § GPU threads can access datablocks via the GPU-side API.
     // Without the API, you would have to locate the buffers yourself:
     __global__ void my_encrypt_kernel( ??? ) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       my_great_encrypt_func(key, ???, ???);
     }
     // With the datablock API:
     __global__ void my_encrypt_kernel(datablocks, batch_ids, item_ids, …) {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       int batch_idx = batch_ids[i];
       int item_idx = item_ids[i];
       char *buffer = datablocks[0]->buffer_bases_in[batch_idx][item_idx];
       int offset = datablocks[0]->item_offsets_in[batch_idx][item_idx];
       int len = datablocks[0]->item_sizes_in[batch_idx][item_idx];
       my_great_encrypt_func(key, &buffer[offset], len);
       notify_completion();
     }
     Still, you do not have to worry about buffers! (datablock reuse / coalescing done automatically)
  36. Used Configurations
     § Highlighted elements are GPU-offloadable ones.
     § We placed the load balancer in front of the first GPU element.
  37. Latency (CPU)
     § The CPU versions have state-of-the-art latencies (compared to Snap, PacketShader, etc.).
  38. Latency (GPU)
     § The GPU versions have high variance.
     • The minimum values correspond to the minimum possible latency calculated using nvprof (the profiler for CUDA).
     • Additional batching on batches may be the cause of the prolonged latency, but we are still investigating and optimizing.
  39. Other Anticipated Questions
     § Why not put all elements into the GPU? (Or: have you read GASPP?)
     • GASPP is one of our references. Using GPUs incurs high latency in general, and GASPP also has high latency like our GPU-only cases. (See Figure 11 of the GASPP paper.)
     • We want to leave the choice of CPU or GPU to the users (and the adaptive load balancer), as the two have strengths in different applications.
  40. Other Anticipated Questions
     § How do you solve packet reordering when using the ALB?
     • We do not. We assume that flow control is now common in high-speed links, such as converged Ethernet, so there will be no or minimal packet drops. Then, reordering within our latency ranges would not impact the end-to-end performance significantly.
     • In NBA you can write "queue" elements. Imagine two elements: one that annotates incoming packets with a monotonically increasing counter, and another that stores incoming packets and sorts them according to the annotated counter value before passing them to the next element. You may change the length of the queue as you want.
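The pair of queue elements described in this answer could be sketched as follows. This is an illustrative toy under assumed names (`SeqAnnotator`, `Reorderer`), not code from NBA:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical first element: stamps each packet with a monotonically
// increasing sequence number.
struct SeqAnnotator {
    std::uint64_t next_seq = 0;
    std::uint64_t annotate() { return next_seq++; }
};

// Hypothetical second element: buffers packets and releases them strictly
// in sequence order, undoing any reordering introduced upstream.
class Reorderer {
public:
    // Insert a packet (an int stands in for the packet) with its annotated
    // sequence number; returns every packet that is now releasable in order.
    std::vector<int> push(std::uint64_t seq, int packet) {
        pending_[seq] = packet;
        std::vector<int> out;
        while (!pending_.empty() && pending_.begin()->first == next_release_) {
            out.push_back(pending_.begin()->second);
            pending_.erase(pending_.begin());
            next_release_++;
        }
        return out;
    }
private:
    std::map<std::uint64_t, int> pending_;  // ordered by sequence number
    std::uint64_t next_release_ = 0;
};
```

A production version would also bound the buffer (the "length of queue" mentioned above) and time out missing sequence numbers so one dropped packet cannot stall the pipeline.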
  41. Other Anticipated Questions
     § I saw the poster. Your current datablock performance is not better than that of the paper version. Why?
     • We do not have the exact answer yet. However, we found that reserving more memory space for datablocks significantly degrades the performance, so we suspect the main bottleneck is related to memory management; we are fairly sure it is not extra computation overhead. It is under optimization.
     § Your ALB uses "heuristic" configurations such as update intervals and the size of delta. Would it work generally?
     • The ALB presented in the paper is configured very conservatively, so that it converges for sure, but over a long time (a few minutes).
     • We have some hints to make this better: the estimated per-packet processing cost is one candidate measure for a future load balancer, instead of the total system throughput.
  42. Other Anticipated Questions
     § Snap is compatible with Click. What about NBA? / Why not add DPDK support to Click directly?
     • Full support for computation batching requires interface changes: calling push()/pull() as in Click directly does not work!
     • We plan to develop an automatic converter for legacy Click elements and improve API compatibility.
     § Why is the IDS CPU performance so low?
     • The CPU version is not well optimized. (The code we used is from Kargus [Jamshed et al., 2013], but a version still under refactoring.)
     • Our point is that regardless of which application NBA runs, our ALB can find the optimal offloading weight even for such extreme cases. You can just write a faster IDS and NBA will work well with it.