Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

DAC 2011

Yuhao Zhu

June 07, 2011

Transcript

  1. 1.

    Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

    Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
    * Electrical and Computer Engineering, University of Texas at Austin
    ‡ Institute of Microelectronics, Tsinghua University
  2. 4.

    Motivation: IP Routing

    - Challenges in IP router design
      - Internet traffic is still increasing
      - New network services and protocols keep appearing
      - Throughput & QoS! Programmability
    - Traditional router solutions
      - Hardware routers: ASICs, network processors
      - PC-based software routers
    - How about GPUs?
      - High computing power
      - Mass market with strong development support
  12. 14.

    GPU-Based Software Router

    - Related work
      - Smith et al. [ISPASS2009]
      - Mu et al. [DATE2010]
      - Han et al. [SIGCOMM2010]
    - Restrictions
      - CPU/GPU communication overhead hurts overall throughput (Mu et al. [DATE2010])
      - Batch (warp) processing hurts QoS
      - Worst-case delay: batch_transfer_granularity / line-card_rate
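The worst-case delay expression on this slide can be made concrete with a small worked example. This is only a sketch: the function name, the units (bytes and bits per second), and the sample numbers are assumptions chosen for illustration, not values from the talk.

```python
def worst_case_batch_delay_s(batch_bytes: int, line_rate_bps: float) -> float:
    """Seconds needed to fill one batch at the line-card rate.

    Mirrors batch_transfer_granularity / line-card_rate: the first packet
    of a batch may wait until the whole batch has arrived before it is
    transferred to the GPU.
    """
    if batch_bytes <= 0 or line_rate_bps <= 0:
        raise ValueError("batch size and line rate must be positive")
    return batch_bytes * 8 / line_rate_bps


# Hypothetical numbers: 4096 packets of 64 bytes on a 10 Gbps line card.
delay = worst_case_batch_delay_s(4096 * 64, 10e9)
print(f"{delay * 1e6:.1f} us")  # prints "209.7 us"
```

Larger batches amortize the transfer cost but raise this bound linearly, which is the throughput versus QoS tension the slide points at.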
  18. 24.

    Hermes Microarchitecture

    - Throughput wins!
      - Shared memory mitigating communication overhead
    - QoS wins!
      - Adaptive warp scheduler through Task FIFO and DCQ
  19. 26.

    Shared Memory Hierarchy

    - How?
      - CPU/GPU connected to the shared, centralized memory
      - Execution model compatible with traditional CPU/GPU systems
    - Why?
      - Besides throughput…
      - Serves as a large packet buffer, which is impractical in traditional routers!
      - Avoids consistency issues in shared memory systems
  23. 32.

    Adaptive Warp Scheduler

    - Basic idea
      - Deliver packets in an agile way
    - Mechanism
      - One GPU thread for one packet
      - CPU passes the number of available packets to the GPU through the Task FIFO
      - GPU monitors the FIFO and starts processing whenever possible
      - Tradeoffs in choosing the updating/fetching frequency
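The Task FIFO mechanism can be sketched in software as follows. This is a simplified model, not the Hermes hardware: `TaskFifo`, `schedule_warps`, and `free_warp_slots` are hypothetical names, and launching a partially filled warp rather than waiting for 32 packets is an assumption made to illustrate "starts processing whenever possible".

```python
from collections import deque
from typing import List, Tuple

WARP_SIZE = 32  # 32-wide warp, matching the simulated configuration


class TaskFifo:
    """CPU writes counts of newly available packets; GPU drains them."""

    def __init__(self) -> None:
        self._counts = deque()

    def push(self, n_packets: int) -> None:
        # CPU side: publish how many new packets are ready.
        if n_packets > 0:
            self._counts.append(n_packets)

    def drain(self) -> int:
        # GPU side: fetch everything published so far.
        total = sum(self._counts)
        self._counts.clear()
        return total


def schedule_warps(fifo: TaskFifo, free_warp_slots: int,
                   pending: int = 0) -> Tuple[List[int], int]:
    """Return (active thread count per launched warp, packets left over)."""
    pending += fifo.drain()
    warps = []
    while pending > 0 and len(warps) < free_warp_slots:
        active = min(WARP_SIZE, pending)  # one thread per packet
        warps.append(active)
        pending -= active
    return warps, pending


fifo = TaskFifo()
fifo.push(40)
fifo.push(10)
print(schedule_warps(fifo, free_warp_slots=4))  # prints ([32, 18], 0)
```

How often the CPU calls `push` and the GPU calls `drain` corresponds to the updating/fetching frequency tradeoff: frequent updates lower latency but produce more partially filled warps.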
  32. 41.

    Adaptive Warp Scheduler

    - Enforce in-order commit
      - Some protocols (UDP, etc.) require in-order packet committing
      - ROB-like structure called DCQ
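A reorder-buffer-style commit queue can be sketched as below. The slides do not describe the DCQ's internals, so this is only an illustration of the general ROB idea under stated assumptions: the class and method names are hypothetical, and entries are tracked per sequence number (the real structure may track warps rather than individual packets).

```python
from collections import OrderedDict
from typing import List


class CommitQueue:
    """ROB-like queue: work finishes in any order, commits in arrival order."""

    def __init__(self) -> None:
        self._entries = OrderedDict()  # seq -> finished flag, in arrival order

    def allocate(self, seq: int) -> None:
        # Reserve a slot when the unit of work is issued.
        self._entries[seq] = False

    def mark_done(self, seq: int) -> None:
        # Record completion; completion order may differ from issue order.
        if seq not in self._entries:
            raise KeyError(f"unknown sequence number {seq}")
        self._entries[seq] = True

    def commit(self) -> List[int]:
        # Release finished entries from the head only, preserving order.
        committed = []
        while self._entries:
            seq, done = next(iter(self._entries.items()))
            if not done:
                break
            self._entries.popitem(last=False)
            committed.append(seq)
        return committed


q = CommitQueue()
for seq in (0, 1, 2):
    q.allocate(seq)
q.mark_done(2)
print(q.commit())  # prints [] because entry 0 is still in flight
q.mark_done(0)
print(q.commit())  # prints [0]
```

An entry that finishes early simply waits at its position until everything ahead of it has committed.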
  33. 43.

    Methodology

    - Benchmarks: hand-coded complete software router in CUDA
      - Checking IP header
      - Packet classification
      - Routing table lookup
      - Decrementing TTL
      - IP fragmentation and deep packet inspection
    - Various packet traces with both burst and sparse patterns
    - gpgpu-sim: cycle-accurate, CUDA-compatible GPU simulator
      - 8 shader cores
      - 32-wide SIMD, 32-wide warp
      - 1000 MHz shader core frequency
      - 16768 registers per shader core
      - 16 KByte shared memory per shader core
    - Maximally allowed concurrent warps (MCW) per core
      - They compete for hardware resources
      - They affect the updating/fetching frequency
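Two of the benchmark kernels listed above, checking the IP header and decrementing the TTL, can be illustrated on the CPU in a few lines. This is a sketch of the standard IPv4 operations, not the authors' CUDA code; the function names are hypothetical, and the checksum is fully recomputed for clarity where a real router would more likely apply an incremental update.

```python
import struct
from typing import Optional


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, complemented."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def check_header(packet: bytes) -> bool:
    """Validate version, header length, and checksum of an IPv4 header."""
    if len(packet) < 20:
        return False
    version, ihl = packet[0] >> 4, packet[0] & 0x0F
    if version != 4 or ihl < 5 or len(packet) < ihl * 4:
        return False
    # A valid header, summed with its checksum field included, yields 0.
    return ipv4_checksum(packet[: ihl * 4]) == 0


def decrement_ttl(packet: bytes) -> Optional[bytes]:
    """Return the packet with TTL - 1 and a fresh checksum, or None if expired."""
    ihl = (packet[0] & 0x0F) * 4
    ttl = packet[8]
    if ttl <= 1:
        return None  # packet would expire; a router drops it here
    header = bytearray(packet[:ihl])
    header[8] = ttl - 1
    header[10:12] = b"\x00\x00"  # zero the checksum field before recomputing
    struct.pack_into("!H", header, 10, ipv4_checksum(bytes(header)))
    return bytes(header) + packet[ihl:]


def build_header(ttl: int, src: bytes, dst: bytes) -> bytes:
    """Hypothetical helper: minimal 20-byte IPv4 header with a valid checksum."""
    h = bytearray(struct.pack("!BBHHHBBH4s4s",
                              0x45, 0, 20, 0, 0, ttl, 17, 0, src, dst))
    struct.pack_into("!H", h, 10, ipv4_checksum(bytes(h)))
    return bytes(h)
```

In the one-thread-per-packet model, each GPU thread would run the equivalent of `check_header` and `decrement_ttl` on its own packet.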
  36. 47.

    Evaluations

    - Throughput
    - [Figure: throughput (Gbps) for DPI, Classifier, RTL, and DecTTL; series: Line-card Rate, CPU/GPU, Hermes-8, Hermes-16, Hermes-32]
    - [Figure: throughput (Gbps) for DPI, Classifier, RTL, and DecTTL; series: Line-card, CPU/GPU, Hermes]
  37. 48.

    Evaluations

    - Delay
    - [Figure: delay (1K cycles) for DPI, Classifier, RTL, and DecTTL; series: CPU/GPU, Hermes-8, Hermes-16, Hermes-32]
  38. 49.

    Evaluations

    - Scalability
    - [Figure: throughput (Gbps) for DPI, Classifier, RTL, and DecTTL; series: Line-card Rate, Hermes 8 cores, Hermes 17 cores, Hermes 28 cores]
  39. 52.

    Conclusion

    - Ever-growing need for high-quality IP routers
      - Throughput, QoS, and programmability are important metrics but are often not guaranteed at the same time
    - Hermes: GPU-based software router
      - Meets all three at the same time
      - Leverages the huge and mature GPU market
      - Minimal hardware extensions
    - Come to my poster to learn more!