
Building a Power-Proportional Software Router


We aim at improving the power efficiency of network routers without compromising their performance. Using server-based software routers as our prototyping vehicle, we investigate the design of a router that consumes power in proportion to the rate of incoming traffic. We start with an empirical study of power consumption in current software routers, decomposing the total power consumption into its component causes. Informed by this analysis, we develop software mechanisms that exploit the underlying hardware's power management features for more energy-efficient packet processing. We incorporate these mechanisms into Click and demonstrate a router that matches the peak performance of the original (unmodified) router while consuming up to half the power at low loads, with negligible impact on the packet forwarding latency.


Luca Niccolini

June 14, 2012


Transcript

  1. Motivation
     Networking devices:
     ≫ Provisioned for peak load
     ≫ Underutilized on average: ~5% in enterprise networks, 30-40% for ISPs, 5x load variability in ADSL networks
     ≫ Highly inefficient at low load: 80-90% of peak power with no traffic
     Large deployments of x86-based network appliances:
     ≫ WAN optimizers, firewalls, ...
     ≫ Approximately 2 appliances for every 3 routers in enterprises [Sekar - HotNets'11]
  2. Challenge
     How to build an energy-efficient software router that:
     ≫ Adapts dynamically to the incoming rate
     ≫ Consumes power in proportion to the incoming rate
     ≫ Still achieves peak packet-forwarding performance
     Our solution: reduce energy by up to 50%, with a latency increase of ~10 µs.
  3. HW/SW Platform
     General-purpose x86 servers, Linux + Click modular router (kernel mode), 10 Gbps NICs
     ≫ Fast enough: RouteBricks, PacketShader, netmap
     ≫ Open platform: can use OS primitives for low power
  4. Multiqueue operation
     [Diagram: incoming packets are hashed (RSS); the 7 least-significant bits of the hash index a redirect table that selects an RX queue (1..n); each queue is served by one active core, while idle cores sit in C6.]
     Traffic is split among multiple HW queues via Receive Side Scaling (RSS)
     ≫ Each queue is managed by one core (no contention)
     ≫ How many queues/cores to use?
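A minimal sketch of this RSS dispatch path (the hash function and helper names are illustrative; real NICs compute a Toeplitz hash over the packet's flow 5-tuple):

```python
# Sketch of NIC Receive Side Scaling (RSS) dispatch: the NIC hashes each
# packet's flow identifier, takes the 7 least-significant bits of the hash,
# and looks up the target RX queue in a 128-entry redirect table. Each
# queue is served by exactly one core, so the fast path needs no locking.
import zlib

REDIRECT_ENTRIES = 128  # indexed by the 7 lsb of the hash

def build_redirect_table(active_queues):
    """Spread the 128 table slots round-robin over the active queues."""
    return [active_queues[i % len(active_queues)]
            for i in range(REDIRECT_ENTRIES)]

def rx_queue_for(flow_tuple, table):
    """Map a flow (src, dst, sport, dport, proto) to an RX queue."""
    h = zlib.crc32(repr(flow_tuple).encode())  # stand-in for the Toeplitz hash
    return table[h & (REDIRECT_ENTRIES - 1)]   # 7 lsb select the table slot

table = build_redirect_table(active_queues=[0, 1, 2, 3])
q = rx_queue_for(("10.0.0.1", "10.0.0.2", 1234, 80, 6), table)
assert 0 <= q <= 3
```

Because all packets of one flow hash identically, they stay on one queue (and one core), which preserves per-flow packet order.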
  5. Primitives for low power
     Sleep states / C-states:
     • C0 - Active, executing instructions
     • C1 - Active, not executing instructions (clock-gated)
     • ...
     • Cn - Deepest sleep state (power-gated)
     ≫ Idle-power vs. exit-latency tradeoff
     DVFS / P-states:
     • P0 - Max operating frequency
     • P1, P2, P3, ...
     • Pn - Min operating frequency
  6. Power Consumption Breakdown
     ≫ High idle power
     ≫ Memory and NICs contribute little
     ≫ CPUs are the most power-hungry components, with a high dynamic range
     [Figure: system power (W), 0-250, broken down into motherboard, fans, NICs, memory, and CPUs, for three cases: idle without Click; Click with zero traffic; Click at 40 Gbps.]
  7. System Idle Power Trend
     [Figure: idle power (W), 0-200, of single- and dual-processor systems, 2008-2011; SPECpower data.]
  8. Addressing SW inefficiency with NAPI
     [Figures: latency (µs) vs. packet rate (pps) for NAPI-Click and unmodified Click; system power (W), 0-250, when idle, at 40 Gbps with 1024 B packets, and at 29 Mpps with 64 B packets.]
     NAPI introduced a modest (~5 µs) increase in latency and enables power savings.
  9. Power Saving algorithms: design space
     1. How many cores to allocate?
     2. At what frequency should they run?
     3. Which sleep states to use?
        • For active but underutilized cores
        • For inactive cores
  10. Power Saving algorithms: design space
      Single core:
      ≫ Race-to-idle: process packets as fast as possible, maximize sleep time
      ≫ Just-in-time: process packets as slowly as we can, never sleep
      Multicore:
      ≫ #cores vs. operating-frequency tradeoff
  11. Single Core Case
      [Figure: system power (W), 120-240, at three frequency settings (1.6 GHz with no idling, 2.4 GHz, 3.3 GHz) for 1 core at 1.9 Mpps, 6 cores at 10 Mpps, and 12 cores at 19 Mpps.]
      ≫ Just-in-time vs. race-to-idle
      • I/O-bound workload: doubling the frequency does not halve the processing time
      • Race-to-idle drawbacks: idle power, exit latency
  12. Single Core Case (cont.)
      [Same figure as slide 11.]
      Takeaway: run at the minimum frequency that keeps up with the incoming rate.
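The single-core takeaway can be sketched as a frequency-selection policy. The frequency levels match those in the deck, but the per-frequency capacities and the headroom factor are hypothetical:

```python
# Pick the lowest P-state whose forwarding capacity still exceeds the
# offered load (plus some headroom), per the "just-in-time" guideline.
# Capacities are illustrative: for this I/O-bound workload, throughput
# grows sublinearly with frequency.
CAPACITY_MPPS = {1.6: 1.2, 2.4: 1.6, 3.3: 1.9}  # per-core, hypothetical

def pick_frequency(load_mpps, headroom=1.1):
    """Lowest frequency (GHz) that keeps up with the load, else the max."""
    for freq in sorted(CAPACITY_MPPS):
        if CAPACITY_MPPS[freq] >= load_mpps * headroom:
            return freq
    return max(CAPACITY_MPPS)

assert pick_frequency(0.5) == 1.6   # light load -> minimum frequency
assert pick_frequency(1.5) == 3.3   # near per-core limit -> maximum frequency
```

The headroom factor leaves slack so short bursts do not immediately overflow the RX queue before the next frequency adjustment.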
  13. Multicore case - # cores
      Use k cores at frequency f, or use n*k cores at frequency f/n?
      [Figure: energy efficiency (Mpkts/J, higher is better), 0.02-0.12, vs. k, comparing k cores at f = 3.2 GHz with 2k cores at f = 1.6 GHz.]
  14. Multicore case - # cores
      Why not run all the cores all the time?
      ≫ Limited number of frequency levels available
      ≫ The lowest frequency level is typically high: half the maximum in our case (1.6 GHz vs. 3.3 GHz)
  15. Multicore case - # cores (cont.)
      Takeaway: run the maximum number of cores that can be kept fully utilized.
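Combined with the single-core takeaway, this suggests a cores-before-frequency allocation policy. The sketch below is ours: the 1.6/3.3 GHz range and the 12-core count come from the deck, while the per-core capacity and the linear frequency-scaling assumption are hypothetical:

```python
# Given an offered load, first add cores at the minimum frequency until
# all active cores are fully utilized; only raise the frequency once we
# have run out of cores.
import math

F_MIN_GHZ, F_MAX_GHZ = 1.6, 3.3
PER_CORE_MPPS_AT_FMIN = 1.2   # hypothetical per-core capacity at f_min
MAX_CORES = 12

def pick_cores_and_freq(load_mpps):
    """Prefer more cores at f_min over fewer cores at a higher frequency."""
    if load_mpps <= 0:
        return 1, F_MIN_GHZ              # keep one core awake for arrivals
    cores = min(MAX_CORES, math.ceil(load_mpps / PER_CORE_MPPS_AT_FMIN))
    if cores < MAX_CORES:
        return cores, F_MIN_GHZ          # all active cores fully utilized
    # Out of cores: scale frequency (crudely assuming linear capacity).
    freq = min(F_MAX_GHZ,
               F_MIN_GHZ * load_mpps / (MAX_CORES * PER_CORE_MPPS_AT_FMIN))
    return MAX_CORES, round(freq, 2)

assert pick_cores_and_freq(2.0) == (2, 1.6)   # 2 fully-utilized cores at f_min
```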
  16. Multicore case - Sleep states
      How to operate inactive and underutilized cores?

      C-state | System power (12 cores) | Exit latency
      --------|-------------------------|-------------
      C1      | 133 W                   | < 1 µs
      C3      | 120 W                   | ~60 µs
      C6      | 115 W                   | ~87 µs

      C6 is best for inactive cores.
  17. Multicore case - Sleep states (cont.)
      [Same table as slide 16.]
      Takeaway: let underutilized cores take quick, light naps (C1); C6 is best for inactive cores.
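The resulting sleep-state policy is simple enough to state as code (exit-latency numbers from the table above; the function name is ours):

```python
# Underutilized-but-active cores take shallow C1 naps: sub-microsecond
# wakeup, so packets arriving on their queues see no visible latency hit.
# Fully inactive cores go to deep C6 for maximum power savings; the
# controller wakes them *before* redirecting traffic to their queues,
# hiding C6's ~87 µs exit latency from the data path.
C_STATE_EXIT_US = {"C1": 1, "C3": 60, "C6": 87}  # from the table above

def sleep_state(core_is_active):
    """C-state for a core: light nap if active, deep sleep if inactive."""
    return "C1" if core_is_active else "C6"

assert sleep_state(core_is_active=True) == "C1"
assert sleep_state(core_is_active=False) == "C6"
```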
  18. [Figure: power envelope - system power (W), 120-260, vs. offered load (Mpps), 0-30, stepping from 1 core up to 12 cores as the load increases.]
  19.-20. [Animation frames of the power-envelope figure as the load increases.]
  21. [Figure: the same power envelope, traversed as the load decreases.]
  22.-23. [Animation frames of the power-envelope figure as the load decreases.]
  24. Implementation
      [Diagram: the multiqueue RSS architecture of slide 4, with the redirect table (indexed by the 7 lsb of the hash) mapping hash buckets 1..n onto RX queues 1..n; active cores serve their queues while idle cores sit in C6.]
  25.-29. [Animation frames of the implementation diagram: a Controller component updates the RSS redirect table, remapping hash buckets onto the queues of the currently active cores.]
  30. Power consumption
      [Figure: system power (W) over time (s), 0-300 s, for Click, NAPI-Click, and PowerSave, running four applications: IPv4 routing, NetFlow, IPsec, and a WAN optimizer.]
      Savings compared to NAPI-Click: IPv4 routing 28%, NetFlow 24%, IPsec 12%, WAN optimizer 20%.
  31. Latency / Loss / Reordering
      Latency ≫ ~10 µs increase on average compared to polling
      No packet loss
      No reordering ≫ could happen when waking up a queue
      [Figures: latency (µs) over time, 0-300 s, for Click and PowerSave, and the input-rate (Gbps) traffic profile.]
  32. Conclusion
      Algorithm guidelines:
      • Run the smallest number of cores at the minimum frequency
      • Increase the number of cores before increasing the frequency
      ≫ Make the best use of power-hungry resources
      On-line algorithm implementation:
      ≫ Monitor queue length and react quickly
      • Make sure that queues can absorb traffic during sleep-state transitions
      Up to 50% savings are possible, depending on the application.
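Putting the guidelines together, the on-line controller might look roughly like this hysteresis loop (all constants and names are illustrative; the real controller acts on NIC queue-occupancy counters and the RSS redirect table):

```python
# Simplified controller tick: watch RX-queue occupancy and grow/shrink the
# active-core set quickly. The watermarks leave enough queue headroom to
# absorb the traffic that arrives during sleep-state transitions
# (~87 µs to exit C6).
HIGH_WATER = 0.75   # average occupancy above which we add a core
LOW_WATER = 0.20    # average occupancy below which we remove a core
MAX_CORES = 12

def controller_tick(active_cores, queue_fill_fractions):
    """Return the new number of active cores given per-queue occupancy."""
    avg = sum(queue_fill_fractions) / len(queue_fill_fractions)
    if avg > HIGH_WATER and active_cores < MAX_CORES:
        return active_cores + 1   # wake a core (C6 -> C0), then remap RSS
    if avg < LOW_WATER and active_cores > 1:
        return active_cores - 1   # remap RSS, drain its queue, park in C6
    return active_cores

assert controller_tick(4, [0.9, 0.8, 0.85, 0.9]) == 5   # queues filling up
assert controller_tick(4, [0.1, 0.05, 0.1, 0.1]) == 3   # load has dropped
```

The gap between the two watermarks provides hysteresis, so the controller does not oscillate when the load sits near a core's capacity.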