Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

DAC 2011

Yuhao Zhu

June 07, 2011

Transcript

  1. Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

    Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
    * Electrical and Computer Engineering, University of Texas at Austin
    ‡ Institute of Microelectronics, Tsinghua University
  11. Motivation: IP Routing

    Challenges in IP router design
      - Internet traffic is still increasing: Throughput & QoS!
      - New network services and protocols keep appearing: Programmability
    Traditional router solutions
      - Hardware routers: ASIC, network processors
      - PC-based software routers
    How about GPUs?
      - High computing power
      - Mass market with strong development support
  17. GPU based Software Router

    Related work
      - Smith et al. [ISPASS2009]
      - Mu et al. [DATE2010]
      - Han et al. [SIGCOMM2010]
    Restrictions
      - CPU/GPU communication overhead hurts overall throughput (Mu et al. [DATE2010])
      - Batch (warp) processing hurts QoS
        Worst-case delay: batch_transfer_granularity / line-card_rate
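
The worst-case delay bound above can be made concrete with a small calculation. This is only a sketch: the packet size, batch size, and line-card rate below are hypothetical values chosen for illustration, not figures from the slides.

```python
def worst_case_batch_delay_s(batch_bits: int, line_rate_bps: float) -> float:
    """Time the line card needs to deliver one full batch: granularity / rate."""
    if line_rate_bps <= 0:
        raise ValueError("line rate must be positive")
    return batch_bits / line_rate_bps


# Hypothetical values, for illustration only.
PACKET_BYTES = 64        # assumed minimum-size packets
BATCH_PACKETS = 4096     # assumed batch transfer granularity, in packets
LINE_RATE_BPS = 10e9     # assumed 10 Gbps line card

batch_bits = BATCH_PACKETS * PACKET_BYTES * 8
delay_s = worst_case_batch_delay_s(batch_bits, LINE_RATE_BPS)
print(f"worst-case batching delay: {delay_s * 1e6:.1f} microseconds")
```

Under these assumptions, the first packet of a batch can wait for the whole batch to arrive before any processing starts, so the bound grows linearly with the batch size.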
  18. Hermes Microarchitecture

    - Throughput wins! Shared memory mitigates communication overhead
    - QoS wins! Adaptive warp scheduler through Task FIFO and DCQ
  22. Shared Memory Hierarchy

    How?
      - CPU/GPU connected to the shared, centralized memory
      - Execution model compatible with traditional CPU/GPU systems
    Why? Besides throughput...
      - Serves as a large packet buffer, which is impractical in traditional routers!
      - Avoids consistency issues in shared memory systems
  31. Adaptive Warp Scheduler

    Basic idea
      - Deliver packets in an agile way
    Mechanism
      - One GPU thread for one packet
      - CPU passes the number of available packets to the GPU through the Task FIFO
      - GPU monitors the FIFO and starts processing whenever possible
      - Tradeoffs in choosing the updating/fetching frequency
    Enforce in-order commit
      - Some protocols (UDP, etc.) require in-order packet committing
      - ROB-like structure called DCQ
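
The mechanism above can be sketched in software. This is a simplified model, not the hardware design from the talk: the names `CommitQueue` and `issue_warps`, and the policy of forming one warp per group of up to 32 packets, are assumptions made for illustration. The queue stands in for the ROB-like DCQ: warps may finish in any order, but they commit only from the head.

```python
from collections import deque

WARP_SIZE = 32  # one GPU thread per packet, 32 threads per warp


class CommitQueue:
    """ROB-like queue (a stand-in for the DCQ): commit strictly in issue order."""

    def __init__(self):
        self._entries = deque()  # [warp_id, done] pairs, oldest first

    def allocate(self, warp_id):
        self._entries.append([warp_id, False])

    def mark_done(self, warp_id):
        for entry in self._entries:
            if entry[0] == warp_id:
                entry[1] = True
                return
        raise KeyError(warp_id)

    def commit_ready(self):
        """Pop finished warps from the head; stop at the first unfinished one."""
        committed = []
        while self._entries and self._entries[0][1]:
            committed.append(self._entries.popleft()[0])
        return committed


def issue_warps(task_fifo, commit_queue, next_warp_id=0):
    """Drain packet counts from the task FIFO and issue warps of <= 32 packets."""
    issued = []
    while task_fifo:
        available = task_fifo.popleft()
        while available > 0:
            n = min(WARP_SIZE, available)
            commit_queue.allocate(next_warp_id)
            issued.append((next_warp_id, n))
            next_warp_id += 1
            available -= n
    return issued


# The CPU side reports 40 packets, then 5 more (hypothetical arrivals).
task_fifo = deque([40, 5])
cq = CommitQueue()
issued = issue_warps(task_fifo, cq)

cq.mark_done(1)               # warp 1 finishes first...
early = cq.commit_ready()     # ...but cannot commit ahead of warp 0
cq.mark_done(0)
in_order = cq.commit_ready()  # now warps 0 and 1 commit together, in order
```

In this model a warp does not wait for a full batch: it starts with however many packets the FIFO reports, which is the "agile" delivery the slide describes, while the commit queue preserves packet order on the way out.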
  35. Methodology

    Benchmarks: hand-coded complete software router in CUDA
      - Checking IP header
      - Packet classification
      - Routing table lookup
      - Decrementing TTL
      - IP fragmentation and deep packet inspection
    Various packet traces with both burst and sparse patterns
    gpgpu-sim: cycle-accurate, CUDA-compatible GPU simulator
      - 8 shader cores
      - 32-wide SIMD, 32-wide warp
      - 1000 MHz shader core frequency
      - 16768 registers per shader core
      - 16 KByte shared memory per shader core
    Maximally allowed concurrent warps (MCW) per core
      - They compete for hardware resources
      - They affect the updating/fetching frequency
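
To illustrate how concurrent warps compete for hardware resources, the sketch below estimates an upper bound on warps per shader core from the register file and shared memory sizes listed above. The per-thread register count and per-warp shared memory usage are hypothetical kernel properties, not values from the talk, and the function name is invented for this example.

```python
REGISTERS_PER_CORE = 16768     # from the simulator configuration above
SHARED_MEM_BYTES = 16 * 1024   # 16 KByte per shader core, from the slide
WARP_SIZE = 32                 # 32-wide warp, from the slide


def max_concurrent_warps(regs_per_thread: int, shared_bytes_per_warp: int) -> int:
    """Upper bound on resident warps, limited by registers and shared memory."""
    by_regs = REGISTERS_PER_CORE // (regs_per_thread * WARP_SIZE)
    if shared_bytes_per_warp == 0:
        return by_regs  # kernel uses no shared memory, so only registers limit it
    by_smem = SHARED_MEM_BYTES // shared_bytes_per_warp
    return min(by_regs, by_smem)


# Hypothetical kernel: 20 registers per thread, 1 KByte of shared memory per warp.
limit = max_concurrent_warps(regs_per_thread=20, shared_bytes_per_warp=1024)
print(f"at most {limit} concurrent warps per core under these assumptions")
```

A kernel that needs more registers or shared memory per warp lowers this bound, which is one way the MCW setting interacts with the hardware budget.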
  36. Evaluations: Throughput

    [Figure: throughput (Gbps) for the DPI, Classifier, RTL, and DecTTL benchmarks. Series: line-card rate, CPU/GPU, Hermes-8, Hermes-16, Hermes-32. A second panel plots the same benchmarks with series: line-card, CPU/GPU, Hermes.]
  37. Evaluations: Delay

    [Figure: delay (1K cycles) for the DPI, Classifier, RTL, and DecTTL benchmarks. Series: CPU/GPU, Hermes-8, Hermes-16, Hermes-32.]
  38. Evaluations: Scalability

    [Figure: throughput (Gbps) for the DPI, Classifier, RTL, and DecTTL benchmarks. Series: line-card rate, Hermes with 8, 17, and 28 cores.]
  44. Conclusion

    - Ever-demanding need for high-quality IP routers
    - Throughput, QoS and programmability are important metrics but often not guaranteed at the same time
    - Hermes: GPU-based software router
      - Meets all three at the same time
      - Leverages the huge and mature GPU market
      - Minimal hardware extensions
    Come to my poster to learn more!