Slide 1

Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing

Yuhao Zhu*  Yangdong Deng‡  Yubei Chen‡
* Electrical and Computer Engineering, University of Texas at Austin
‡ Institute of Microelectronics, Tsinghua University

Slide 2

Motivation: IP Routing
- Challenges in IP router design
  - Internet traffic is still increasing: throughput & QoS!
  - New network services and protocols keep appearing: programmability!
- Traditional router solutions
  - Hardware routers: ASICs, network processors
  - PC-based software routers
- How about GPUs?
  - High computing power
  - Mass market with strong development support

Slide 3

GPU-based Software Routers
- Related work
  - Smith et al. [ISPASS 2009]
  - Mu et al. [DATE 2010]
  - Han et al. [SIGCOMM 2010]
- Restrictions
  - CPU/GPU communication overhead hurts overall throughput
  - Batch (warp) processing hurts QoS: worst-case delay = batch_transfer_granularity / line-card_rate

Slide 4

Hermes Microarchitecture
- Throughput wins!
  - A shared memory hierarchy mitigates CPU/GPU communication overhead
- QoS wins!
  - An adaptive warp scheduler built on a Task FIFO and a DCQ

Slide 5

Shared Memory Hierarchy
- How?
  - CPU and GPU are connected to a shared, centralized memory
  - The execution model stays compatible with traditional CPU/GPU systems
- Why? Beyond throughput...
  - It serves as a large packet buffer, which is impractical in traditional routers!
  - It avoids the consistency issues of shared-memory systems

Slide 6

Adaptive Warp Scheduler
- Basic idea: deliver packets in an agile way
- Mechanism
  - One GPU thread handles one packet
  - The CPU passes the number of available packets to the GPU through a Task FIFO
  - The GPU monitors the FIFO and starts processing whenever possible
  - There are tradeoffs in choosing the updating/fetching frequency
- Enforcing in-order commit
  - Some protocols (UDP, etc.) require in-order packet committing
  - An ROB-like structure called the DCQ restores packet order

Slide 7

Methodology
- Benchmarks: a hand-coded, complete software router in CUDA
  - IP header checking
  - Packet classification
  - Routing table lookup
  - TTL decrementing
  - IP fragmentation and deep packet inspection
  - Various packet traces with both bursty and sparse patterns
- GPGPU-Sim: a cycle-accurate, CUDA-compatible GPU simulator
  - 8 shader cores
  - 32-wide SIMD, 32-wide warps
  - 1000 MHz shader core frequency
  - 16768 registers per shader core
  - 16 KB shared memory per shader core
- Maximally allowed concurrent warps (MCW) per core
  - Concurrent warps compete for hardware resources
  - MCW affects the updating/fetching frequency

Slide 8

Evaluations
- Throughput [figure: throughput (Gbps) on DPI, Classifier, RTL, and DecTTL; series: line-card rate, CPU/GPU baseline, Hermes-8/16/32]
- Delay [figure: delay (1K cycles) on the same benchmarks; series: CPU/GPU baseline, Hermes-8/16/32]
- Scalability [figure: throughput (Gbps) vs. line-card rate for Hermes with 8, 17, and 28 shader cores]

Slide 9

Conclusion
- Ever-growing demand for high-quality IP routers
  - Throughput, QoS, and programmability are all important metrics, but are often not guaranteed at the same time
- Hermes: a GPU-based software router
  - Meets all three at the same time
  - Leverages the huge, mature GPU market
  - Requires only minimal hardware extensions

Come to my poster to learn more!