Precise Runahead Execution

Ajeya Naithani
February 25, 2020

IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2020
Ajeya Naithani, Josué Feliu, Almutaz Adileh, and Lieven Eeckhout
Ghent University, Belgium, and Universitat Politecnica de Valencia, Spain

Transcript

  5. Full-Window Stalls Degrade Performance

    [Figure: timeline of the ROB holding loads L1, L2, and L3. L1 is the stalling load; the memory accesses of L1, L2, and L3 overlap (Memory-Level Parallelism, MLP). The core makes no progress during the full-window stall, which lasts until the stalling load returns.]
  12. Runahead Execution Prefetches under a Full-Window Stall

    [Figure: the same ROB timeline with loads L1, L2, and L3. During the full-window stall (the runahead interval), memory accesses are also issued for loads L4 and L5 before the stalling load returns, giving Increased Memory-Level Parallelism (MLP).]
  20. Runahead Execution Re-Executes All Instructions

    [Figure: timeline of the re-executed instructions. Loads L1 to L5 each get a cache hit. The re-executed instructions pass through the whole pipeline: fetch, decode, rename, dispatch, issue, execute, commit.]
  24. Runahead Buffer Finds Blocking Chain in the ROB

    [Figure: the full-window stall timeline with loads L1, L2, and L3 and their overlapping memory accesses (MLP). The chain leading to the stalling load L1 is identified in the ROB through its producers (Producer A): A1, A2, L1.]
  29. Runahead Buffer Executes Blocking Chain Speculatively

    [Figure: the same timeline with loads L1, L2, and L3. The chain A1, A2, L1 (Producer A and load L1) is executed again during the full-window stall, before the stalling load returns, giving Increased Memory-Level Parallelism (MLP).]
  31. Runahead Buffer Re-Executes the Window

    [Figure: timeline of the re-executed instructions. Loads L1, L2, and L3 each get a cache hit. The re-executed instructions pass through fetch, decode, rename, dispatch, issue, execute, commit.]
  35. Runahead Techniques Relative to OoO Core

                   Runahead execution*    Runahead buffer**
    Flush ROB      ✓                      ✓

    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15]
  38. Flushing and Re-Filling Incur High Overhead

    ▪ Front-end refill = 8 cycles
    ▪ ROB = 192, width = 4: ROB fill time = 48 cycles
    ▪ Total overhead = 56 cycles
    Runahead causes a pipeline bubble of 56 cycles per invocation
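
The overhead figures on this slide follow from simple arithmetic, which the sketch below reproduces. The function and parameter names are illustrative and do not come from the talk, and the sketch assumes the ROB refills at the full pipeline width every cycle.

```python
def runahead_flush_overhead(rob_entries: int, width: int, frontend_refill_cycles: int) -> int:
    """Cycles lost per runahead invocation when the ROB is flushed and refilled.

    Assumption (not stated on the slide): the ROB refills at exactly
    `width` instructions per cycle once the front-end has been refilled.
    """
    rob_fill_cycles = -(-rob_entries // width)  # ceiling division
    return frontend_refill_cycles + rob_fill_cycles


# Values from the slide: ROB = 192 entries, width = 4, front-end refill = 8 cycles.
overhead = runahead_flush_overhead(rob_entries=192, width=4, frontend_refill_cycles=8)
print(overhead)  # 192 / 4 = 48 fill cycles, plus 8 refill cycles = 56
```
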
  43. Flushing and Re-Filling Incur High Overhead

    [Figure: bar chart of normalized IPC (0.0 to 1.5) for OoO, RA, and RA-no-overhead. Benchmarks: bwave, cactus, fotonik, Gems, lbm, leslie, libqua, mcf, milc, omnet, parest, roms, soplex, sphinx, wrf, zeusm, plus HMean.]
    runahead: 15.9%
    runahead without flushing: 22.7%
    Gap between the two: 6.8%
  50. Runahead Techniques Relative to OoO Core

                             Runahead execution*    Runahead buffer**
    Flush ROB                ✓                      ✓
    Short intervals          ✗                      ✗
    Instructions executed    All                    Only one slice

    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15]
  51. Runahead Techniques Provide Limited Prefetch Coverage

    ▪ Runahead execution: Executes useless instructions
    ▪ Runahead buffer: High coverage for only one slice
  53. Only One Load Does Not Lead to the Majority of Memory Accesses

    [Figure: stacked bar chart of LLC misses (0% to 100%) per benchmark, split into misses identical to the stalling load and misses distinct from the stalling load. Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik.]
    Most of the long-latency loads during runahead differ from the stalling load
  55. Applications Access Memory through Multiple Slices

    [Figure: stacked bar chart (0% to 100%) per benchmark of the number of unique load instructions, in bins 1, 2-4, 5-7, and 8+. Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik.]
    There are more than eight unique load instructions accessing memory during each runahead interval
  69. Runahead Techniques Relative to OoO Core

                             Runahead execution*    Runahead buffer**    Precise runahead***
    Flush ROB                ✓                      ✓                    ✗
    Short intervals          ✗                      ✗                    ✓
    Instructions executed    All                    Only one slice       All slices
    Performance              High                   High                 Very high
    Energy-Efficiency        Low                    Same                 High

    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15] ***[Naithani et al. HPCA’20]
  73. Precise Runahead Execution (PRE)

    Key insight: There are sufficient resources to (start) run ahead without flushing the ROB
    When running ahead:
    1. Executes only useful instructions in runahead mode
    2. Efficiently manages microarchitectural resources
  80. Processor Resources at Full-Window Stall

    [Figure: bar chart of % availability (0 to 100) of GP registers, FP registers, and IQ entries at a full-window stall. Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, plus avg.]
    GP registers: 52%
    FP registers: 56%
    IQ entries: 37%
    There are sufficient resources to start runahead without flushing the ROB
  111. Processor Resources During Runahead

    [Animated figure: timeline of the current ROB during a full-window stall, with loads L1, L2, and L3 and their memory accesses, the issue queue, the register file, and commit. A legend distinguishes Normal from Speculative instructions, and the first build is labeled "no progress". Later builds add future instructions beyond the ROB: load L4 with a fourth memory access, and its producer chain (Producer A) A1, A2, A3, which are then removed again one at a time. The last builds add a label B, the annotation "200 Cycles", and finally one ✓ and several ? marks.]
  114. Two Key Questions

    1. How to identify only useful instructions? → Iterative Backward Dependency Analysis (IBDA)
    2. How to recycle (physical) registers? → Runahead Register Reclamation
  138. Iteratively Identifying the Stalling Slices

    Instructions (source → destination):
      A1: r1 → r2
      A2: r2 → r3
      A3: r3 → r4
      L4: r4 → r5

    Register Allocation Table (RAT):
      Arch. register    Phy. register    Last-writer instruction
      r1                P1               A0
      r2                P2               A1
      r3                P3               A2
      r4                P4               A3
      r5                P5               L4

    Stalling Slice Table (SST), filled over successive iterations:
      Iteration-1: L4 stalls the window. SST: L4
      Iteration-2: L4 hits in the SST. While renaming source r4, read A3. SST: L4, A3
      Iteration-3: A3 hits in the SST. While renaming source r3, read A2. SST: L4, A3, A2
      Iteration-4: A2 hits in the SST. While renaming source r2, read A1. SST: L4, A3, A2, A1
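
The iteration-by-iteration walk-through on these slides can be mimicked with a small software model. The sketch below is only an illustration under stated assumptions: the names `Instr`, `rename_pass`, and `LOOP_BODY` are hypothetical, each instruction is assumed to read one source register and write one destination register as in the slide example, and the SST is modelled as a plain ordered list rather than a hardware table.

```python
from collections import namedtuple

# pc stands in for the instruction address; srcs/dest are architectural registers.
Instr = namedtuple("Instr", ["pc", "srcs", "dest"])

# Dependence chain from the slides: A1 -> A2 -> A3 -> L4.
LOOP_BODY = [
    Instr("A1", ("r1",), "r2"),
    Instr("A2", ("r2",), "r3"),
    Instr("A3", ("r3",), "r4"),
    Instr("L4", ("r4",), "r5"),
]


def rename_pass(body, last_writer, sst, stalling_pc=None):
    """Rename one iteration of the loop body and grow the SST backwards."""
    for instr in body:
        if instr.pc in sst:
            # The instruction hits in the SST, so the last writer of each of
            # its source registers belongs to the stalling slice as well.
            for src in instr.srcs:
                producer = last_writer.get(src)
                if producer is not None and producer not in sst:
                    sst.append(producer)
        # Record this instruction as the last writer of its destination.
        last_writer[instr.dest] = instr.pc
    if stalling_pc is not None and stalling_pc not in sst:
        # A load that stalls the window seeds the SST.
        sst.append(stalling_pc)


last_writer = {"r1": "A0"}  # RAT last-writer field; A0 wrote r1 before the loop
sst = []                    # Stalling Slice Table, modelled as an ordered list
history = []
for iteration in range(1, 5):
    rename_pass(LOOP_BODY, last_writer, sst,
                stalling_pc="L4" if iteration == 1 else None)
    history.append(list(sst))
    print(f"Iteration-{iteration}: SST = {sst}")
```

In this model the SST gains one instruction per iteration and ends as L4, A3, A2, A1 after four iterations, matching the slides. A fifth pass of this sketch would also add A0, because A1 then hits in the SST and the last writer of r1 is A0.
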
  145. Runahead Register Reclamation

    instruction             dest   src1   src2   OldPhy register
    I1  add r1 ← r2, r3     P1     P2     P3     P0
    I2  mul r2 ← r1, r4     P5     P1     P4     P2
    I3  ld  r1 ← mem[x]     P6     -      -      P1
    I4  add r2 ← r1, r3     P7     P6     P3     P5
    I5  add r2 ← r4, r5     P9     P4     P8     P7
    I6  sub r1 ← r2, r6     P11    P9     P10    P6
    Labels: normal mode, runahead mode.
    OldPhy register column, extracted on its own: I1 P0, I2 P2, I3 P1, I4 P5, I5 P7, I6 P6.
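The OldPhy register column follows from ordinary register renaming: each destination gets a fresh physical register, and the mapping it displaces is recorded as the old one. The sketch below reproduces the column for I1 to I6. The initial RAT contents and the free-list order are assumptions chosen to match the slide, and `rename` is a hypothetical helper, not code from the paper.

```python
def rename(program, rat, free_list):
    """Rename in program order; return (name, dest_phys, src_phys, old_phys) rows."""
    free = list(free_list)
    rows = []
    for name, dest, srcs in program:
        src_phys = [rat[s] for s in srcs]  # sources read the current mapping
        old_phys = rat[dest]               # mapping displaced by this instruction
        new_phys = free.pop(0)             # allocate a fresh physical register
        rat[dest] = new_phys
        rows.append((name, new_phys, src_phys, old_phys))
    return rows


# Instruction stream from the slide: (name, destination, sources).
PROGRAM = [
    ("I1", "r1", ["r2", "r3"]),
    ("I2", "r2", ["r1", "r4"]),
    ("I3", "r1", []),              # load from mem[x], no register sources
    ("I4", "r2", ["r1", "r3"]),
    ("I5", "r2", ["r4", "r5"]),
    ("I6", "r1", ["r2", "r6"]),
]

# Assumed starting state, picked so the output matches the slide.
INITIAL_RAT = {"r1": "P0", "r2": "P2", "r3": "P3", "r4": "P4", "r5": "P8", "r6": "P10"}
FREE_LIST = ["P1", "P5", "P6", "P7", "P9", "P11"]

rows = rename(PROGRAM, dict(INITIAL_RAT), FREE_LIST)
old_phys_column = [row[3] for row in rows]
print(old_phys_column)  # ['P0', 'P2', 'P1', 'P5', 'P7', 'P6']
```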
  156. Runahead Register Reclamation

    Precise Register Deallocation Queue (PRDQ), runahead mode
    instruction   OldPhy register
    I1            P0
    I2            P2
    I3            P1
    I4            P5
    I5            P7
    I6            P6
    Pipeline labels: dispatch, execute.
    Executed? column over successive builds: 0 0 0 0 0 0, then 0 0 0 0 0 1, then 0 0 0 0 1 1, then 0 0 0 1 1 1. The final builds mark entries with ✓ ✓.
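A queue like the PRDQ can be modeled as a FIFO of (instruction, OldPhy register, Executed?) entries. The sketch below assumes that an entry is allocated at dispatch, that its bit is set at execute, and that old physical registers are freed strictly from the head of the queue once the head has executed. The slides only show the dispatch and execute labels, so the in-order freeing rule and the class and method names are assumptions of this example.

```python
from collections import deque


class PRDQ:
    """Toy model of a precise register deallocation queue (assumed behavior)."""

    def __init__(self):
        self.entries = deque()  # each entry: [instr, old_phys, executed]

    def dispatch(self, instr, old_phys):
        """Allocate an entry in program order with Executed? = 0."""
        self.entries.append([instr, old_phys, False])

    def execute(self, instr):
        """Set the Executed? bit of the matching entry."""
        for entry in self.entries:
            if entry[0] == instr:
                entry[2] = True
                return
        raise KeyError(instr)

    def reclaim(self):
        """Free old registers from the head while the head has executed."""
        freed = []
        while self.entries and self.entries[0][2]:
            _, old_phys, _ = self.entries.popleft()
            freed.append(old_phys)
        return freed


q = PRDQ()
for instr, old in [("I1", "P0"), ("I2", "P2"), ("I3", "P1"),
                   ("I4", "P5"), ("I5", "P7"), ("I6", "P6")]:
    q.dispatch(instr, old)

q.execute("I2")
print(q.reclaim())  # [] because I1 at the head has not executed yet
q.execute("I1")
print(q.reclaim())  # ['P0', 'P2']
```

Freeing only from the head keeps deallocation in program order even though instructions execute out of order, which is why an executed I2 frees nothing until I1 has executed too.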
  172. Putting it All Together

    [Pipeline diagram. Blocks: I-Cache, Fetch, Decode, Micro-op Queue, Rename (RAT), Dispatch, Register Read, Issue, Execute, Commit. Legend: New, Modified, Existing. Modes: Normal Mode, Runahead Mode.]
    Stalling Slice Table: 402ed2, 4287fd, 428809, ….
    PRDQ: I1 P5 0, I2 P3 1, I3 0, ….
  173. Evaluation

    Simulator: Sniper 6.0, McPAT
    Workloads: SPEC CPU2006/CPU2017, 1B SimPoints
    Baseline: ROB=192, issue queue=92, load/store queue=64, register file=168/168
  178. Evaluation

    OoO: Baseline out-of-order core
    RA: Runahead execution*
      -- No short runahead intervals
      -- No overlapping intervals
    RA-buffer: Runahead buffer**
    RA-hybrid: Better-performing mechanism between RA-buffer and RA
    PRE: Precise runahead execution***
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15] ***[Naithani et al. HPCA’20]
  183. Evaluation – Performance

    [Bar chart: normalized IPC (axis 0.0 to 2.5) for OoO, RA, RA-buffer, RA-hybrid, and PRE. Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, and HMean.]
    Callouts: RA: 15.9%, RA-buffer: 13.3%, RA-hybrid: 20%, PRE: 38.2%
  189. Evaluation – Performance

    [Bar chart: normalized IPC (axis 0.0 to 2.5) for OoO, RA, RA-buffer, RA-hybrid, and PRE, with a second axis for % ROB stalled (0 to 100). Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, and HMean.]
    Callouts: RA: 15.9%, RA-buffer: 13.3%, RA-hybrid: 20%, PRE: 38.2%
  194. Evaluation – Memory-Level Parallelism

    [Bar chart: normalized MLP (axis 0.0 to 3.5) for OoO, RA, RA-buffer, RA-hybrid, and PRE. Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, and avg.]
    Callouts: RA: 1.5X, RA-buffer: 1.3X, RA-hybrid: 1.6X, PRE: 2.0X
  199. Evaluation – LLC Miss Count Reduction

    [Bar chart: normalized LLC misses (axis 0.0 to 1.0) for OoO, RA, RA-buffer, RA-hybrid, and PRE. Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, and avg.]
    Callouts: RA: 26.4%, RA-buffer: 27.7%, RA-hybrid: 31%, PRE: 50.2%
  204. Evaluation – Energy

    [Bar chart: normalized energy (axis 0.80 to 1.15) for RA, RA-Buffer, and PRE. Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, and avg. Data labels shown on the chart: 0.71, 0.76, 0.8, 1.16.]
    Callouts: RA: +2.4%, RA-buffer: Same, RA-hybrid: Same, PRE: -6.2%
  207. Conclusions

    1. Never flushes the ROB
    2. Executes only useful instructions in runahead mode
    3. Efficiently manages microarchitectural resources
    18.2% better performance, 6.2% better energy