Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

DAC 2011

Yuhao Zhu

June 07, 2011

Transcript

  1. Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

    Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡ (* Electrical and Computer Engineering, University of Texas at Austin; ‡ Institute of Microelectronics, Tsinghua University)
  13. Motivation: IP Routing

    Challenges in IP router design
      - Internet traffic is still increasing: throughput and QoS!
      - New network services and protocols keep appearing: programmability
    Traditional router solutions
      - Hardware routers: ASIC, network processors
      - PC-based software routers
    How about GPUs?
      - High computing power
      - Mass market with strong development support
  19. GPU-based Software Router

    Related work
      - Smith et al. [ISPASS2009]
      - Mu et al. [DATE2010]
      - Han et al. [SIGCOMM2010]
    Restrictions
      - CPU/GPU communication overhead hurts overall throughput (Mu et al. [DATE2010])
      - Batch (warp) processing hurts QoS
        Worst-case delay: batch_transfer_granularity / line-card_rate
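The worst-case delay bound above can be made concrete with a little arithmetic. The sketch below is only illustrative: it reads the bound as the time a batch takes to fill at line rate, and the batch size, packet size, line-card rate, and function name are all assumed values, not figures from the talk.

```python
def worst_case_batch_delay(batch_packets: int, packet_bytes: int,
                           line_rate_bps: float) -> float:
    """Seconds needed to accumulate one full batch at the given line rate.

    Models batch_transfer_granularity / line-card_rate: the first packet of
    a batch may wait this long before the batch is handed to the GPU.
    """
    batch_bits = batch_packets * packet_bytes * 8
    return batch_bits / line_rate_bps


# Assumed example: batches of 1024 packets, 64 bytes each, 10 Gbps line card.
delay = worst_case_batch_delay(1024, 64, 10e9)
print(f"{delay * 1e6:.2f} us")  # 52.43 us
```

Under these assumptions, halving the batch size halves the bound, which is the motivation for delivering packets to the GPU in smaller, more frequent groups.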
  24. Hermes Microarchitecture

    Throughput wins!
      - Shared memory mitigates communication overhead
    QoS wins!
      - Adaptive warp scheduler through Task FIFO and DCQ
  29. Shared Memory Hierarchy

    How?
      - CPU and GPU are connected to the shared, centralized memory
      - Execution model is compatible with traditional CPU/GPU systems
    Why?
      - Besides throughput…
      - Serves as a large packet buffer, which is impractical in traditional routers!
      - Avoids consistency issues in shared memory systems
  40. Adaptive Warp Scheduler

    Basic idea
      - Deliver packets in an agile way
    Mechanism
      - One GPU thread for one packet
      - CPU passes the number of available packets to the GPU through the Task FIFO
      - GPU monitors the FIFO and starts processing whenever possible
      - Tradeoffs in choosing the updating/fetching frequency
    Enforce in-order commit
      - Some protocols (UDP, etc.) require in-order packet committing
      - ROB-like structure called DCQ
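The Task FIFO mechanism can be sketched as a small software model. This is a hypothetical illustration, not the hardware design from the talk: the class and function names are invented, and the only details taken from the slides are that the CPU passes packet counts through a FIFO, that each packet gets one GPU thread, and that warps are 32 wide.

```python
from collections import deque

WARP_SIZE = 32  # threads per warp, matching the 32-wide warps in the talk


class TaskFifo:
    """Hypothetical model: the CPU enqueues counts of newly available packets."""

    def __init__(self):
        self._counts = deque()

    def push(self, n_packets: int) -> None:
        """CPU side: announce that n_packets more packets are ready."""
        if n_packets > 0:
            self._counts.append(n_packets)

    def drain(self) -> int:
        """GPU side: read and clear everything announced so far."""
        total = sum(self._counts)
        self._counts.clear()
        return total


def schedule_warps(fifo: TaskFifo, next_packet_id: int):
    """Form warps from whatever is available; the last warp may be partial."""
    available = fifo.drain()
    warps = []
    while available > 0:
        n = min(WARP_SIZE, available)
        # One thread per packet: each warp is a list of consecutive packet ids.
        warps.append(list(range(next_packet_id, next_packet_id + n)))
        next_packet_id += n
        available -= n
    return warps, next_packet_id


fifo = TaskFifo()
fifo.push(40)  # CPU update 1
fifo.push(5)   # CPU update 2
warps, next_id = schedule_warps(fifo, 0)
print([len(w) for w in warps])  # [32, 13]
```

In this model the GPU does not wait for a full batch. How often the CPU pushes and the GPU drains is the updating/fetching frequency tradeoff the slide mentions.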
  41. Adaptive Warp Scheduler

    Enforce in-order commit
      - Some protocols (UDP, etc.) require in-order packet committing
      - ROB-like structure called DCQ
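A reorder-buffer-style commit queue can be sketched as follows. The slides describe the DCQ only as "ROB-like", so this is an assumed software analogue with a hypothetical class name: warps may finish in any order, but they commit strictly in issue order.

```python
from collections import deque


class InOrderCommitQueue:
    """ROB-like sketch: warps retire strictly in the order they were issued."""

    def __init__(self):
        self._order = deque()  # warp ids in issue order
        self._done = set()     # warps that finished but have not committed yet

    def issue(self, warp_id: int) -> None:
        """Record a warp at the tail of the queue when it is issued."""
        self._order.append(warp_id)

    def complete(self, warp_id: int) -> list:
        """Mark a warp finished and return the warps that can now commit."""
        self._done.add(warp_id)
        committed = []
        # Only the head of the queue may commit; later warps wait behind it.
        while self._order and self._order[0] in self._done:
            head = self._order.popleft()
            self._done.remove(head)
            committed.append(head)
        return committed


dcq = InOrderCommitQueue()
for w in (0, 1, 2):
    dcq.issue(w)
print(dcq.complete(2))  # []      warp 2 finished early and must wait
print(dcq.complete(0))  # [0]     head commits
print(dcq.complete(1))  # [1, 2]  warp 1 unblocks warp 2
```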
  45. Methodology

    Benchmarks: hand-coded complete software router in CUDA
      - Checking IP header
      - Packet classification
      - Routing table lookup
      - Decrementing TTL
      - IP fragmentation and deep packet inspection
    Various packet traces with both burst and sparse patterns
    gpgpu-sim: cycle-accurate, CUDA-compatible GPU simulator
      - 8 shader cores
      - 32-wide SIMD, 32-wide warp
      - 1000 MHz shader core frequency
      - 16768 registers per shader core
      - 16 KByte shared memory per shader core
    Maximally allowed concurrent warps (MCW) per core
      - They compete for hardware resources
      - They affect the updating/fetching frequency
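Two of the benchmark stages, checking the IP header and decrementing the TTL, can be illustrated with a short sketch. This is not the authors' CUDA code: it is a plain Python model of standard IPv4 header handling, with hypothetical function names, and it recomputes the full checksum where a production router would more likely update it incrementally.

```python
import struct


def internet_checksum(header: bytes) -> int:
    """16-bit one's-complement sum over the header, as used by IPv4."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def check_ipv4_header(header: bytes) -> bool:
    """Basic sanity checks: version 4, IHL >= 5, and a valid checksum."""
    if len(header) < 20:
        return False
    version, ihl = header[0] >> 4, header[0] & 0x0F
    if version != 4 or ihl < 5 or len(header) < ihl * 4:
        return False
    # A header with a correct checksum field sums to zero.
    return internet_checksum(header[: ihl * 4]) == 0


def decrement_ttl(packet: bytes) -> bytes:
    """Return a copy with TTL reduced by one and the checksum recomputed."""
    ihl = (packet[0] & 0x0F) * 4
    hdr = bytearray(packet[:ihl])
    if hdr[8] <= 1:
        raise ValueError("TTL expired; a router would drop the packet")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"  # zero the checksum field before recomputing
    hdr[10:12] = struct.pack("!H", internet_checksum(bytes(hdr)))
    return bytes(hdr) + packet[ihl:]


# Assumed sample header: UDP, TTL 64, 192.168.0.1 -> 192.168.0.199.
sample = bytearray.fromhex("450000730000400040110000c0a80001c0a800c7")
sample[10:12] = struct.pack("!H", internet_checksum(bytes(sample)))
forwarded = decrement_ttl(bytes(sample))
print(check_ipv4_header(forwarded), forwarded[8])  # True 63
```

In a one-thread-per-packet design, each GPU thread would run this kind of logic on its own packet.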
  47. Evaluations: Throughput

    [Figure: throughput (Gbps) for DPI, Classifier, RTL, and DecTTL; series: line-card rate, CPU/GPU, Hermes-8, Hermes-16, Hermes-32. A second chart plots line-card, CPU/GPU, and Hermes.]
  48. Evaluations: Delay

    [Figure: delay (1K cycles) for DPI, Classifier, RTL, and DecTTL; series: CPU/GPU, Hermes-8, Hermes-16, Hermes-32.]
  49. Evaluations: Scalability

    [Figure: throughput (Gbps) for DPI, Classifier, RTL, and DecTTL; series: line-card rate, Hermes with 8, 17, and 28 cores.]
  57. Conclusion

    Ever-growing demand for high-quality IP routers
      - Throughput, QoS, and programmability are important metrics but often not guaranteed at the same time
    Hermes: GPU-based software router
      - Meets all three at the same time
      - Leverages the huge and mature GPU market
      - Minimal hardware extensions
    Come to my poster to learn more!