[EuroSys 2015] Network Balancing Act

Joongi Kim
April 23, 2015

A High-Performance Packet Processing Framework for Heterogeneous Processors. Received the best student paper award.
The NBA framework is the world's first generic packet processing framework to reach 80 Gbps, although application-specific prototypes had reached that performance before. Its API resembles that of the Click modular router, and it transparently supports composing offloadable (e.g., GPU-accelerated) elements with an adaptive load balancer.



  1. Network Balancing Act: A High-performance Packet Processing Framework for Heterogeneous Processors

    Joongi Kim, Keon Jang, Keunhong Lee, Sangwook Ma, Junhyun Shim, Sue Moon (KAIST). ACM EuroSys 2015.
  2. (The NBA logo is a trademark of the National Basketball Association in the United States.)

    4/23/2015 Joongi Kim, KAIST
  3. Network Balancing Act

    • Network packet processing requires a balance between:
      ◦ Matching the speed of I/O and computation
      ◦ Flexibility vs. high performance
      ◦ Hardware vs. software
      ◦ Workload distribution across multiple cores and heterogeneous processors
    • NBA is a slogan for a world where:
      ◦ We no longer need to balance these manually!
      ◦ We do not need to trade one off for the other!
  4. A Brief History

    • Flexibility vs. Performance

    [Figure: performance (single machine) plotted against flexibility. Click Modular Router [Kohler et al., 2000]: 1-3 Gbps.]
  5. The Click modular router [Kohler et al., 2000]

    • Fine-grained modular abstraction of processing pipelines
    • Elements: per-packet processing functions as composable modules
    • Example elements: FromDevice, ToDevice, CheckIPHeader, IPLookup, AES128Encrypt, HMACSHA1, Discard, IPsecESPEncap
    • (+) Flexibility to create, combine, and reuse elements
    • (-) Not designed for multi-cores
  6. A Brief History

    • Flexibility vs. Performance

    [Figure: performance (single machine) plotted against flexibility. Systems shown: Click Modular Router [Kohler et al., 2000], RouteBricks [Dobrescu et al., 2009], PacketShader [Han et al., 2010], DoubleClick [Kim et al., 2012], Snap [Sun et al., 2013]. Throughput labels: 1-3 Gbps, 8.7 Gbps, 28 ~ 30 Gbps, 40 Gbps. Target region: next-generation packet processing framework.]
  7. Flexibility + Performance = NBA

    • Useful ideas from existing work
      ◦ Modular, reusable abstraction of packet processing functions
      ◦ Multi-core scalable + NUMA support
      ◦ GPU acceleration
    • Our contributions
      ◦ Optimized batched packet processing pipeline while preserving intuitive per-packet modular abstractions
      ◦ New offloading abstraction improving reusability of GPU buffers with more optimization opportunities
      ◦ Adaptive load balancer eliminating manual optimization efforts
  8. The NBA Framework

    Packet I/O and multi-core structure implemented using Intel DPDK
  9. The NBA Framework

    Worker threads running elements like the Click modular router (pinned to individual CPU cores)
  10. The NBA Framework

    Device threads that handle GPU offloading
  11. How NBA Makes Your Life Easier

    • A top-down example
      ◦ A middlebox that encrypts all incoming packets and forwards them.
    • We want to make it:
      ◦ Fast (i.e., NUMA and multi-core scalable)
      ◦ Accelerated by GPUs, but without manual optimization
    • …for free!

    Pipeline: FromQueue, MyEncrypt, L2Forward, ToQueue

        FromQueue -> MyEncrypt(key 0x123456789) -> L2Forward() -> ToQueue;
  12. Writing CPU Element in Click

    • You write a per-packet function.

        class MyEncrypt : public Element {
        protected:
            long key;
        public:
            const char *class_name() const { return "MyEncrypt"; }
            const char *port_count() const { return "1/1"; }
            int configure(Vector<String> &args) { key = …; return 0; }
            // Called once per packet: encrypt in place, then push downstream.
            void push(int in_port, Packet *p) {
                WritablePacket *q = p->uniqueify();
                my_great_encrypt_func(key, q->data(), q->length());
                output(0).push(q);
            }
        };
  13. Writing CPU Element in NBA

    • You still write a per-packet function! (with extensions)

        class MyEncrypt : public Element {
        protected:
            long key;
        public:
            const char *class_name() const { return "MyEncrypt"; }
            const char *port_count() const { return "1/1"; }
            int configure(ThreadContext *ctx, vector<string> &args) { key = …; return 0; }
            int initialize_global() { return 0; }
            int initialize_per_node() { return 0; }
            int initialize() { return 0; }
            // Called once per packet; the return value selects the output port.
            int process(int in_port, Packet *p, Annotation *anno) {
                my_great_encrypt_func(key, p->data(), p->length());
                return 0; // out_port
            }
        };

    • Using NBA gives you the advantages of multi-core architectures! (up to 80 Gbps)
  14. Inside NBA: Computation Batching

    • Two solutions
      ◦ Wrap process() in a for-loop over the batch of input packets.
      ◦ Use a branch prediction technique to mitigate batch-split problems.

        for (int i = 0; i < batch_size; i++) {
            elem->process(batch->in_port, batch->packets[i]);
            // => my_great_encrypt_func(key, pkt->data(), pkt->length());
        }
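The wrapping idea on this slide can be sketched in plain C++. This is only an illustration under stated assumptions: `Packet`, `PacketBatch`, `Element`, `XorEncrypt`, and `process_batch` are hypothetical stand-ins and not NBA's actual types or signatures. The element author writes a per-packet `process()`, and a framework-side loop applies it to every packet in a batch while recording each packet's output port.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for framework types; not NBA's real definitions.
struct Packet {
    std::vector<uint8_t> payload;
};

struct PacketBatch {
    int in_port = 0;
    std::vector<Packet*> packets;  // packets that travel through the pipeline together
    std::vector<int> out_ports;    // per-packet results, filled in by process_batch()
};

// The element author only implements the per-packet function.
struct Element {
    virtual ~Element() = default;
    virtual int process(int in_port, Packet* p) = 0;  // returns the output port
};

// Toy element: XOR every payload byte with a one-byte key.
struct XorEncrypt : Element {
    uint8_t key;
    explicit XorEncrypt(uint8_t k) : key(k) {}
    int process(int /*in_port*/, Packet* p) override {
        for (auto& b : p->payload) b ^= key;
        return 0;  // always leave through port 0
    }
};

// Framework-side wrapper: one loop over the batch, one element, many packets.
void process_batch(Element& elem, PacketBatch& batch) {
    batch.out_ports.resize(batch.packets.size());
    for (std::size_t i = 0; i < batch.packets.size(); ++i) {
        assert(batch.packets[i] != nullptr);
        batch.out_ports[i] = elem.process(batch.in_port, batch.packets[i]);
    }
}
```

The recorded `out_ports` are what a later stage would inspect to decide whether the batch has to be split.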
  15. Inside NBA: Simple Branch Prediction

    • Exploit the observation that most packets take the same path.

    [Figure: an input batch reaches a branch with Path A and Path B and yields output batches. “Majority” path: mask the branched packets and reuse the batch object. “Minority” path: allocate a new batch.]
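A minimal sketch of the batch-split handling described above, assuming hypothetical types (`PacketBatch` with an `excluded` mask, `SplitResult`, `split_batch`) that are not taken from the NBA code base. For simplicity it counts the output ports after the fact to find the majority path, whereas the slide speaks of predicting it; either way, the original batch object is reused for the majority path with minority packets masked out, and new batches are allocated only for the minority paths.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

struct Packet { int id; };

struct PacketBatch {
    std::vector<Packet*> packets;
    std::vector<bool> excluded;  // true = slot is masked out and must be skipped
};

struct SplitResult {
    int majority_port = -1;
    // One newly allocated batch per minority output port.
    std::map<int, std::unique_ptr<PacketBatch>> minority;
};

// out_ports[i] is the output port chosen for batch.packets[i].
SplitResult split_batch(PacketBatch& batch, const std::vector<int>& out_ports) {
    assert(!out_ports.empty() && out_ports.size() == batch.packets.size());

    // Find the port taken by most packets (lowest port wins ties).
    std::map<int, std::size_t> counts;
    for (int port : out_ports) ++counts[port];
    int majority = out_ports.front();
    std::size_t best = 0;
    for (const auto& kv : counts) {
        if (kv.second > best) { best = kv.second; majority = kv.first; }
    }

    SplitResult result;
    result.majority_port = majority;
    batch.excluded.assign(batch.packets.size(), false);
    for (std::size_t i = 0; i < out_ports.size(); ++i) {
        if (out_ports[i] == majority) continue;  // stays in the reused batch
        std::unique_ptr<PacketBatch>& dst = result.minority[out_ports[i]];
        if (!dst) dst.reset(new PacketBatch());  // allocate only when needed
        dst->packets.push_back(batch.packets[i]);
        dst->excluded.push_back(false);
        batch.excluded[i] = true;                // mask it in the original batch
    }
    return result;
}
```

When every packet takes the same path, `minority` stays empty and no allocation happens, which is the common case the slide relies on.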
  16. Inside NBA: Effects of Branch Prediction

    • The branching overhead is reduced from 38% to 10% when 1% of packets take the minority path.
  17. Accelerating (offloading) with GPU

    • What you need to do:
      ◦ Concept: CPU Element, GPU Element, CPU Element
      ◦ Reality: between the two CPU elements, the steps shown are batch, pipeline, prepare buffers, host-to-device copy, launch GPU kernel, synchronize, device-to-host copy, dispatch
  18. Accelerating (offloading) with GPU

    • What NBA does for you:
      ◦ NBA: CPU Element, GPU Element, CPU Element (the steps around the GPU element are managed by the framework)
      ◦ Snap [Sun et al., 2013], an attempt to do it on Click: CPU Element, GPU Element, CPU Element, plus explicit Batcher, H2DCopy, D2HCopy, CompletionQ, Debatcher, and Dispatcher elements
  19. Writing GPU Elements in NBA

    • What you need to do:
      ◦ Specify which devices the element supports.
      ◦ Specify what data will be copied to/from the device.
      ◦ Write your device kernel function.
      ◦ Register your device init/launch handlers with the framework.
  20. Copying Packets to/from GPU

    • You declare “datablocks”, and NBA will handle the rest.
    • A datablock defines the byte ranges to copy (fixed or variable byte ranges of packets)…
    • …and the direction of each range (host-to-device / device-to-host).
    • NBA aggregates datablocks from multiple batches and copies them using a single API call, automatically and efficiently.
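The aggregation step can be illustrated with a host-only C++ sketch. Everything here (`Datablock`, `FixedRange`, `PayloadRange`, `Staging`, `gather`) is a hypothetical model written for this note and not NBA's datablock API. Each datablock reports which byte range of a packet it needs, and `gather` packs those ranges from several batches into one contiguous staging buffer so that a single copy call could move them all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Packet { std::vector<uint8_t> data; };
using Batch = std::vector<Packet>;

struct ByteRange { std::size_t offset; std::size_t length; };

// Hypothetical datablock: says which bytes of a packet should be copied.
struct Datablock {
    virtual ~Datablock() = default;
    virtual ByteRange range_of(const Packet& p) const = 0;
};

// Fixed byte range, e.g. header fields at a known position.
struct FixedRange : Datablock {
    std::size_t offset, length;
    FixedRange(std::size_t o, std::size_t l) : offset(o), length(l) {}
    ByteRange range_of(const Packet&) const override { return {offset, length}; }
};

// Variable byte range, e.g. everything after the header (the payload).
struct PayloadRange : Datablock {
    std::size_t header_len;
    explicit PayloadRange(std::size_t h) : header_len(h) {}
    ByteRange range_of(const Packet& p) const override {
        assert(p.data.size() >= header_len);
        return {header_len, p.data.size() - header_len};
    }
};

struct Staging {
    std::vector<uint8_t> bytes;             // one contiguous buffer for all items
    std::vector<std::size_t> item_offsets;  // start of each item inside bytes
    std::vector<std::size_t> item_sizes;    // length of each item
};

// Pack the declared ranges of every packet in every batch back to back.
Staging gather(const Datablock& db, const std::vector<Batch>& batches) {
    Staging s;
    for (const Batch& batch : batches) {
        for (const Packet& p : batch) {
            ByteRange r = db.range_of(p);
            assert(r.offset + r.length <= p.data.size());
            s.item_offsets.push_back(s.bytes.size());
            s.item_sizes.push_back(r.length);
            s.bytes.insert(s.bytes.end(),
                           p.data.begin() + r.offset,
                           p.data.begin() + r.offset + r.length);
        }
    }
    return s;
}
```

In a real offloading path the `bytes` buffer would be handed to one host-to-device copy; that part is deliberately left out so the sketch stays runnable without a GPU.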
  21. Benefits of Datablocks (1/2)

    • The datablock abstraction covers common packet processing applications.

        Example App                    Datablock Types
        Router                         Fixed-length byte range to read header fields
        IPsec Encryption & MyEncrypt   Variable-length byte range to read/write payloads
        NAT                            Fixed-length byte range to read/write header fields
        WAN Optimizer                  Variable-length byte range to read payloads

    • Datablocks are highly customizable.
  22. Benefits of Datablocks (2/2)

    • Different GPU elements may share the same datablocks.
    • This enables the following optimizations:
      ◦ Reuse of datablocks shared by subsequent GPU elements (ongoing future work)
      ◦ Coalesced copy of different datablocks (not included in NBA) [Sun et al., 2013]

    [Figure: the AES128Encrypt and HMACSHA1 elements with the datablocks FlowIDs, Payloads, and InitialVectorByFlow; shared datablocks are marked.]
  23. Is the GPU always faster?

  24. A Pitfall of GPU Offloading

    • Sub-optimal throughput when offloading everything
  25. Adaptive Load Balancer

    • Find the optimal offloading weight with a feedback loop.
    • Challenges
      ◦ Jitter → lengthen the history of the low-pass filter
      ◦ Local maxima → perturb periodically

    [Figure: the load balancer splits traffic between the GPU function and the CPU function of a GPU element with weights w and 1 − w, adjusts w by ±δ, and asks: is system throughput increasing or decreasing?]
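The feedback loop can be sketched as a simple hill climber. This is an illustration only and not the algorithm from the paper: the class name, the windowed mean used as a low-pass filter, the starting weight of 0, and the reading of w as the offloaded fraction are all assumptions, and the separate periodic perturbation mentioned on the slide is not modeled. The balancer holds w for a window of throughput samples, averages them to damp jitter, and then moves w by ±δ, reversing direction whenever the averaged throughput drops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical hill-climbing balancer; w is assumed to be the offloaded fraction.
class AdaptiveLoadBalancer {
public:
    AdaptiveLoadBalancer(double delta, std::size_t window)
        : delta_(delta), window_(window) {
        assert(delta > 0.0 && window > 0);
    }

    double weight() const { return w_; }

    // Feed one throughput sample measured while running at weight().
    // Returns the weight to use for the next measurement.
    double update(double throughput) {
        sum_ += throughput;
        if (++count_ < window_) return w_;  // keep measuring at the current weight

        // Low-pass filter: compare window means instead of single samples.
        double mean = sum_ / static_cast<double>(window_);
        sum_ = 0.0;
        count_ = 0;

        if (mean < last_mean_) direction_ = -direction_;  // got worse: turn around
        last_mean_ = mean;

        w_ = std::min(1.0, std::max(0.0, w_ + direction_ * delta_));
        return w_;
    }

private:
    double delta_;
    std::size_t window_;
    double w_ = 0.0;          // start with everything on the CPU
    double direction_ = 1.0;  // +1: offload more, -1: offload less
    double sum_ = 0.0;
    std::size_t count_ = 0;
    double last_mean_ = 0.0;
};
```

With a noise-free, single-peaked throughput curve this settles into a small oscillation of about ±δ around the peak; a longer window trades convergence time for robustness to jitter, which matches the trade-off named on the slide.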
  26. Performance of Adaptive Load Balancer

    • ALB achieves over 92% of manually tuned throughput.
    • NBA’s adaptive load balancer does the optimization for you!
  27. Further Details

    • Performance evaluation
      ◦ Multi-core scalability
      ◦ Per-app CPU-only / GPU-only throughput
      ◦ Latency
    • Please refer to the paper!
  28. Conclusion

    • The world’s first 80 Gbps packet processing framework
      ◦ As well as a next-generation Click
    • NBA is now open-source!
      ◦ https://github.com/ANLAB-KAIST/NBA
    • Future work
      ◦ (ongoing) Optimization of datablocks
      ◦ “Advanced” adaptive CPU/GPU load balancer
      ◦ Extension to Intel Xeon Phi
    • NBA is an evolving framework!
  29. Q&A Thanks!

  30. Back-up Slides

  31. Performance Gains by Computation Batching

    • Optimal batch size for most configurations: 64 packets/batch
    • In this experiment, the I/O batch size was fixed at 64.
  32. Performance Gains by GPU

    • You know the conclusion: “it depends.”
    • Our applications
      ◦ For the IPv4 router: CPU is better. (The offloading overhead is larger than the performance benefit, because two memory lookups are fast enough on CPUs.)
      ◦ For the IPv6 router: GPU is better.
      ◦ For IPsec: it depends on the packet sizes and workloads. (Mostly due to buffer copy overheads.)
      ◦ For IDS: GPU is better.
    • Refer to the paper for exact numbers!
  33. Writing GPU Kernels in NBA (1/3)

    • We said “it is straightforward” in Slide #19.
      ◦ Yes, it is!
      ◦ The complexity of the parallelized algorithm itself is outside our scope!
    • What you need to do:
      ◦ Find the reference to the data item for each GPU thread.
      ◦ Run your function on it.
      ◦ Notify the framework on completion.
  34. Writing GPU Kernels in NBA (2/3)

    • Find the reference to datablock items from multiple batches.
    • Example: 4 batches with 64 packets + 1 batch with 14 packets = 270 items.

    [Figure: conceptual mapping of items and GPU threads across the batches, with batch_idx = 4, item_idx = 20, count = 270.]
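The mapping from a flat GPU thread index to a (batch, item) pair can be precomputed on the host as two lookup arrays. The sketch below is a hypothetical helper, not NBA code; `ItemIndex` and `build_item_index` are names invented here, and it uses 0-based indices, which may differ from the numbering in the slide's figure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lookup tables indexed by the flat thread index i.
struct ItemIndex {
    std::vector<int> batch_ids;  // batch_ids[i]: which batch thread i works on
    std::vector<int> item_ids;   // item_ids[i]: which item inside that batch
};

// batch_sizes[b] is the number of items (packets) in batch b.
ItemIndex build_item_index(const std::vector<int>& batch_sizes) {
    ItemIndex idx;
    for (std::size_t b = 0; b < batch_sizes.size(); ++b) {
        assert(batch_sizes[b] >= 0);
        for (int k = 0; k < batch_sizes[b]; ++k) {
            idx.batch_ids.push_back(static_cast<int>(b));
            idx.item_ids.push_back(k);
        }
    }
    return idx;
}
```

For the slide's example of four 64-packet batches plus one 14-packet batch, the tables hold 270 entries, so 270 threads each find their own item with two array reads.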
  35. Writing GPU Kernels in NBA (3/3)

    • GPU threads can access datablocks via the GPU-side API.

        // The open question: what are the arguments, and where is the data?
        __global__ void my_encrypt_kernel( ??? ) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            my_great_encrypt_func(key, ???, ???);
        }

        // With datablocks: look up this thread's batch and item, then its bytes.
        __global__ void my_encrypt_kernel(datablocks, batch_ids, item_ids, …) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            int batch_idx = batch_ids[i];
            int item_idx = item_ids[i];
            char *buffer = datablocks[0]->buffer_bases_in[batch_idx][item_idx];
            int offset = datablocks[0]->item_offsets_in[batch_idx][item_idx];
            int len = datablocks[0]->item_sizes_in[batch_idx][item_idx];
            my_great_encrypt_func(key, &buffer[offset], len);
            notify_completion();
        }

    • Still, you do not have to worry about buffers! (datablock reuse / coalescing done automatically)
  36. Used Configurations

    • Highlighted elements are the GPU-offloadable ones.
    • We placed the load balancer in front of the first GPU element.
  37. Latency (CPU)

    • CPU versions have “state-of-the-art” latencies. (compared to Snap, PacketShader, etc.)
  38. Latency (GPU)

    • GPU versions have high variance.
      ◦ Minimum values correspond to the minimum possible latency, as calculated using nvprof (the CUDA profiler).
      ◦ Additional batching on top of batches may be the cause of the prolonged latency, but we are still investigating and optimizing.
  39. Other Anticipated Questions

    • Why not put all elements into the GPU? (Or, have you read GASPP?)
      ◦ GASPP is one of our references. Using GPUs incurs high latency in general, and GASPP also has high latency, like our GPU-only cases. (Figure 11 of the GASPP paper.)
      ◦ We want to leave the choice of CPU or GPU to users (and the adaptive load balancer), as each has strengths in different applications.
  40. Other Anticipated Questions

    • How do you solve packet reordering when using ALB?
      ◦ We do not. We assume that flow control is now common on high-speed links, such as converged Ethernet, so there will be no or minimal packet drops. Reordering within our latency ranges would then not impact end-to-end performance significantly.
      ◦ In NBA you can write “queue” elements. Imagine two elements: one that annotates incoming packets with a monotonically increasing counter, and another that stores incoming packets and sorts them according to the annotated counter value before passing them to the next element. You may change the length of the queue as you want.
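The two "queue" elements described in the answer can be modeled in a few lines of C++. This is a sketch under assumptions: `Sequencer` and `ReorderQueue` are hypothetical classes rather than NBA elements, and a bounded min-heap is one possible way to realize "store and sort by the annotated counter". The sequencer stamps each packet with an increasing counter, and the reorder queue holds up to `capacity` packets and always releases the one with the smallest counter first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct Packet {
    uint64_t seq = 0;  // annotation written by the Sequencer
    int id = 0;        // stand-in for the packet contents
};

// First element: annotate packets with a monotonically increasing counter.
class Sequencer {
public:
    void annotate(Packet& p) { p.seq = next_++; }
private:
    uint64_t next_ = 0;
};

// Second element: buffer packets and release them in counter order.
class ReorderQueue {
public:
    explicit ReorderQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Store a packet; once over capacity, emit the smallest counter to out.
    void push(const Packet& p, std::vector<Packet>& out) {
        heap_.push(p);
        if (heap_.size() > capacity_) {
            out.push_back(heap_.top());
            heap_.pop();
        }
    }

    // Drain whatever is still buffered, smallest counter first.
    void flush(std::vector<Packet>& out) {
        while (!heap_.empty()) {
            out.push_back(heap_.top());
            heap_.pop();
        }
    }

private:
    struct BySeq {
        bool operator()(const Packet& a, const Packet& b) const {
            return a.seq > b.seq;  // makes the priority_queue a min-heap on seq
        }
    };
    std::size_t capacity_;
    std::priority_queue<Packet, std::vector<Packet>, BySeq> heap_;
};
```

The capacity plays the role of the queue length mentioned above: a longer queue repairs larger displacements at the cost of extra latency, and a packet displaced by more than the capacity still comes out late.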
  41. Other Anticipated Questions

    • I saw the poster. Your current datablock performance is not better than that of the paper version. Why?
      ◦ We don’t have the exact answer yet. However, we found that reserving more memory space for datablocks significantly degrades performance, so we suspect that the main bottleneck is related to memory management. I am fairly sure that it is not extra computation overhead. It is being optimized.
    • Your ALB uses “heuristic” configurations such as update intervals and the size of delta. Would it work generally?
      ◦ The ALB presented in the paper is configured very conservatively, so it is sure to converge, but it takes a long time (a few minutes).
      ◦ We have some hints for improving this: the estimated per-packet processing cost is one candidate measure for a future load balancer, instead of the total system throughput.
  42. Other Anticipated Questions

    • Snap is compatible with Click. What about NBA? / Why not add DPDK support to Click directly?
      ◦ Full support for computation batching requires interface changes: calling push()/pull() directly, as in Click, does not work!
      ◦ We plan to develop an automatic converter for legacy Click elements and improve API compatibility.
    • Why is IDS CPU performance so low?
      ◦ The CPU version is not well optimized. (The code we used is from Kargus [Jamshed et al., 2013], but from a version still undergoing refactoring.)
      ◦ Our point is that whatever application NBA runs, our ALB can find the optimal offloading weight, even in such extreme cases. You can just write a faster IDS and NBA will work well with it.