
Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing


DAC 2011

Yuhao Zhu

June 07, 2011
Transcript

  1. Hermes: An Integrated CPU/GPU
     Microarchitecture for IP Routing
     Yuhao Zhu*, Yangdong Deng‡, Yubei Chen‡
     * Electrical and Computer Engineering, University of Texas at Austin
     ‡ Institute of Microelectronics, Tsinghua University

  2. Motivation: IP Routing
     • Challenges in IP router design
       – Internet traffic is still increasing → throughput & QoS matter
       – New network services and protocols keep appearing → programmability matters
     • Traditional router solutions
       – Hardware routers: ASICs, network processors
       – PC-based software routers
     • How about GPUs?
       – High computing power
       – Mass market with strong development support

  3. GPU-based Software Router
     • Related work
       – Smith et al. [ISPASS 2009]
       – Mu et al. [DATE 2010]
       – Han et al. [SIGCOMM 2010]
     • Restrictions
       – CPU/GPU communication overhead hurts overall throughput
       – Batch (warp) processing hurts QoS:
         worst-case delay = batch_transfer_granularity / line-card_rate
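The worst-case delay bound above can be made concrete with a quick back-of-the-envelope calculation; the batch size and line-card rate below are illustrative assumptions, not figures from the paper:

```python
# Worst-case delay a packet suffers under batched CPU->GPU transfer:
# the first packet to arrive must wait until the whole batch fills up.
#   worst_case_delay = batch_transfer_granularity / line_card_rate
# Illustrative numbers (assumed, not from the paper):
batch_granularity_bits = 1024 * 84 * 8   # 1024 minimum-size 84-byte Ethernet frames
line_card_rate_bps = 10e9                # a 10 Gbps line card

worst_case_delay_s = batch_granularity_bits / line_card_rate_bps
print(f"{worst_case_delay_s * 1e6:.1f} microseconds")  # ~68.8 microseconds
```

Even a modest batch at 10 Gbps adds tens of microseconds of queueing delay before processing begins; that queueing delay is the QoS cost the slide refers to.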

  4. Hermes Microarchitecture
     • Throughput wins!
       – Shared memory mitigates the CPU/GPU communication overhead
     • QoS wins!
       – Adaptive warp scheduler built on a Task FIFO and a DCQ

  5. Shared Memory Hierarchy
     • How?
       – CPU and GPU connected to a shared, centralized memory
       – Execution model stays compatible with traditional CPU/GPU systems
     • Why?
       – Besides throughput…
       – Serves as a large packet buffer – impractical in traditional routers!
       – Avoids the consistency issues of conventional shared-memory systems

  6. Adaptive Warp Scheduler
     • Basic idea
       – Deliver packets in an agile way
     • Mechanism
       – One GPU thread per packet
       – CPU passes the number of available packets to the GPU through a Task FIFO
       – GPU monitors the FIFO and starts processing whenever possible
       – Tradeoffs in choosing the updating/fetching frequency
     • Enforce in-order commit
       – Some protocols (UDP, etc.) require in-order packet commit
       – An ROB-like structure, the DCQ, provides it
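To visualize the mechanism, here is a toy Python model of the two structures; this is a pure software sketch under assumed semantics, not the hardware or CUDA implementation, and every name besides "Task FIFO" and "DCQ" is made up:

```python
from collections import deque

class TaskFIFO:
    """Toy model of the Task FIFO: the CPU side posts how many packets are
    ready; the GPU side fetches work as soon as any exists, instead of
    waiting for a full batch to accumulate."""
    def __init__(self):
        self.entries = deque()
    def push(self, n_packets):          # CPU: announce n newly arrived packets
        self.entries.append(n_packets)
    def fetch(self):                    # GPU: grab whatever work is available
        return self.entries.popleft() if self.entries else 0

class DCQ:
    """ROB-like delayed-commit queue: packets may finish out of order
    (threads in different warps), but commit strictly in arrival order."""
    def __init__(self):
        self.next_to_commit = 0
        self.done = set()
        self.committed = []
    def complete(self, seq):            # a GPU thread finished packet `seq`
        self.done.add(seq)
        while self.next_to_commit in self.done:   # retire in order
            self.done.remove(self.next_to_commit)
            self.committed.append(self.next_to_commit)
            self.next_to_commit += 1

fifo = TaskFIFO()
fifo.push(3)                            # CPU announces 3 packets
assert fifo.fetch() == 3                # GPU starts immediately, no batch wait

dcq = DCQ()
for seq in [2, 0, 1]:                   # packets complete out of order
    dcq.complete(seq)
print(dcq.committed)                    # committed strictly in order: [0, 1, 2]
```

The DCQ's retire loop is exactly the reorder-buffer idea from out-of-order CPUs applied to packets: completion is unordered, commit is ordered.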

  7. Methodology
     • Benchmarks: a hand-coded, complete software router in CUDA
       – Checking IP header → packet classification → routing-table lookup →
         decrementing TTL → IP fragmentation and deep packet inspection
       – Various packet traces with both bursty and sparse patterns
     • GPGPU-Sim: a cycle-accurate, CUDA-compatible GPU simulator
       – 8 shader cores
       – 32-wide SIMD, 32-wide warps
       – 1000 MHz shader core frequency
       – 16768 registers per shader core
       – 16 KB shared memory per shader core
     • Maximally allowed concurrent warps (MCW) per core
       – Warps compete for hardware resources
       – MCW affects the updating/fetching frequency
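For illustration, the benchmarked pipeline stages might be sketched per packet like this; plain Python with an assumed packet layout, whereas the actual benchmarks are CUDA kernels running one thread per packet:

```python
# Minimal per-packet sketch of the benchmarked pipeline stages
# (illustrative only; field names and the toy routing table are assumptions).

def process_packet(pkt):
    """pkt: dict with 'ttl', 'checksum_ok', 'dst' fields (hypothetical layout)."""
    # 1. Check IP header: drop malformed or expired packets
    if not pkt["checksum_ok"] or pkt["ttl"] <= 0:
        return None
    # 2. Packet classification (here: trivially by destination prefix)
    pkt["class"] = "internal" if pkt["dst"].startswith("10.") else "external"
    # 3. Routing-table lookup (toy longest-prefix match over two routes)
    routes = {"10.": "port0", "": "port1"}        # "" = default route
    pkt["out_port"] = next(v for k, v in routes.items() if pkt["dst"].startswith(k))
    # 4. Decrement TTL
    pkt["ttl"] -= 1
    # (IP fragmentation and deep packet inspection omitted in this sketch)
    return pkt

out = process_packet({"ttl": 64, "checksum_ok": True, "dst": "10.1.2.3"})
print(out["out_port"], out["ttl"])     # port0 63
```

Each stage is independent per packet, which is what makes the one-thread-per-packet mapping on the GPU natural.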

  8. Evaluations
     • Throughput
       [Bar charts: throughput (Gbps) on DPI, Classifier, RTL, and DecTTL,
        comparing the line-card rate, a discrete CPU/GPU system, and
        Hermes-8/16/32]
     • Delay
       [Bar chart: delay (1K cycles) on DPI, Classifier, RTL, and DecTTL for
        CPU/GPU and Hermes-8/16/32]
     • Scalability
       [Bar chart: throughput (Gbps) vs. the line-card rate for Hermes with
        8, 17, and 28 shader cores]

  9. Conclusion
     • An ever-growing demand for high-quality IP routers
       – Throughput, QoS, and programmability are all important metrics, but
         they are rarely guaranteed at the same time
     • Hermes: a GPU-based software router
       – Meets all three at once
       – Leverages the huge, mature GPU market
       – Requires only minimal hardware extensions

     Come to my poster to learn more!