Precise Runahead Execution

Ajeya Naithani
February 25, 2020

IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2020
Ajeya Naithani, Josué Feliu, Almutaz Adileh, and Lieven Eeckhout
Ghent University, Belgium, and Universitat Politecnica de Valencia, Spain

Transcript

  1. Precise Runahead Execution. Ajeya Naithani, Josué Feliu, Almutaz Adileh, Lieven Eeckhout
  14. Full-Window Stalls Degrade Performance

    [Figure: timeline of the ROB. Loads L1, L2 and L3 each issue a memory access; L1 is the stalling load. Once the ROB fills, a full-window stall begins and there is no progress until the stalling load returns. The overlapping memory accesses of L1, L2 and L3 are labeled Memory-Level Parallelism (MLP).]
  21. Runahead Execution Prefetches under a Full-Window Stall

    [Figure: same ROB timeline. During the full-window stall, in the runahead interval that ends when the stalling load returns, runahead execution issues memory accesses for the additional loads L4 and L5 on top of L1, L2 and L3. Label: Increased Memory-Level Parallelism (MLP).]
  31. Runahead Execution Re-Executes All Instructions

    [Figure: timeline after runahead. The re-executed instructions, including loads L1 to L5, are now cache hits, but each one passes through the whole pipeline again: fetch, decode, rename, dispatch, issue, execute, commit.]
  35. Runahead Buffer Finds Blocking Chain in the ROB

    [Figure: same full-window stall timeline (loads L1, L2, L3; stalling load L1; no progress until the stalling load returns). The runahead buffer locates, in the ROB, the chain of producer instructions A1 and A2 that leads to load L1.]
  40. Runahead Buffer Executes Blocking Chain Speculatively

    [Figure: same full-window stall timeline. During the stall the chain A1, A2, L1 is executed speculatively and repeatedly, issuing further memory accesses. Label: Increased Memory-Level Parallelism (MLP).]
  44. Runahead Buffer Re-Executes the Window

    [Figure: timeline after runahead. The re-executed instructions, including loads L1, L2 and L3, are cache hits, but they pass through the whole pipeline again: fetch, decode, rename, dispatch, issue, execute, commit.]
  49. Runahead Techniques Relative to OoO Core

    Flush ROB: runahead execution* ✓, runahead buffer** ✓
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15]
  54. Flushing and Re-Filling Incur High Overhead

    ▪ Front-end refill = 8 cycles
    ▪ ROB = 192, width = 4: ROB fill time = 48 cycles
    ▪ Total overhead = 56 cycles
    Runahead causes a pipeline bubble of 56 cycles per invocation
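
The overhead arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the function and parameter names are made up for illustration, and it assumes the ROB refills at exactly the pipeline width every cycle, which is the simplification the slide's numbers imply.

```python
def flush_refill_overhead(rob_entries, width, frontend_refill_cycles):
    """Return (rob_fill_cycles, total_overhead_cycles) for one runahead invocation.

    Assumption: after the flush, the ROB refills at `width` instructions
    per cycle with no further stalls.
    """
    # Ceiling division: cycles needed to dispatch rob_entries instructions.
    rob_fill_cycles = -(-rob_entries // width)
    total = frontend_refill_cycles + rob_fill_cycles
    return rob_fill_cycles, total


# Values taken from the slide: ROB = 192, width = 4, front-end refill = 8.
fill, total = flush_refill_overhead(rob_entries=192, width=4, frontend_refill_cycles=8)
print(fill, total)  # 48 56
```

With the slide's parameters this gives a ROB fill time of 48 cycles and a total bubble of 56 cycles per runahead invocation.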
  59. Flushing and Re-Filling Incur High Overhead

    [Chart: normalized IPC (axis 0.0 to 1.5) per benchmark (bwave, cactus, fotonik, Gems, lbm, leslie, libqua, mcf, milc, omnet, parest, roms, soplex, sphinx, wrf, zeusm, HMean) for OoO, RA and RA-no-overhead. Annotations: runahead: 15.9%; runahead without flushing: 22.7%; difference: 6.8%.]
  66. Runahead Techniques Relative to OoO Core

    Flush ROB: runahead execution* ✓, runahead buffer** ✓
    Short intervals: runahead execution* ✗, runahead buffer** ✗
    Instructions executed: runahead execution* All, runahead buffer** Only one slice
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15]
  68. Runahead Techniques Provide Limited Prefetch Coverage

    ▪ Runahead execution: Executes useless instructions
    ▪ Runahead buffer: High coverage for only one slice
  70. Only One Load does not Lead to Majority of Memory Accesses

    Most of the long-latency loads during runahead differ from the stalling load
    [Chart: share of LLC misses (0% to 100%) per benchmark (zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik), split into misses identical to the stalling load and misses distinct from the stalling load.]
  72. Applications Access Memory through Multiple Slices

    There are more than eight unique load instructions accessing memory during each runahead interval
    [Chart: distribution (0% to 100%) of the number of unique load instructions per benchmark (zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik), in buckets 1, 2-4, 5-7 and 8+.]
  86. Runahead Techniques Relative to OoO Core

    Flush ROB: runahead execution* ✓, runahead buffer** ✓, precise runahead*** ✗
    Short intervals: runahead execution* ✗, runahead buffer** ✗, precise runahead*** ✓
    Instructions executed: runahead execution* All, runahead buffer** Only one slice, precise runahead*** All slices
    Performance: runahead execution* High, runahead buffer** High, precise runahead*** Very high
    Energy-Efficiency: runahead execution* Low, runahead buffer** Same, precise runahead*** High
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15] ***[Naithani et al. HPCA’20]
  91. Precise Runahead Execution (PRE)

    Key insight: There are sufficient resources to (start) run ahead without flushing the ROB
    When running ahead:
    1. Executes only useful instructions in runahead mode
    2. Efficiently manages microarchitectural resources
  98. Processor Resources at Full-Window Stall

    [Chart: average % availability (axis 0 to 100) per benchmark (zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik) for GP registers, FP registers and IQ entries. Annotations: GP registers: 52%; FP registers: 56%; IQ entries: 37%.]
    There are sufficient resources to start runahead without flushing the ROB
  107. Processor Resources During Runahead

    [Figure, built up over several steps: timeline of the current ROB in a full-window stall, with loads L1, L2 and L3 waiting on memory. Future instructions beyond the stalled window, the producer chain A1, A2, A3 and load L4, are shown entering the issue queue and register file as speculative (as opposed to normal) instructions, and L4 issues a further memory access. Other labels: time, commit.]
  121. Processor Resources During Runahead

    [Figure, continued: the speculative instructions A1, A2, A3 and finally L4 disappear from the issue queue and register file one by one.]
  129. Processor Resources During Runahead

    [Figure, continued: same timeline with an instruction B marked ✗ and annotated "200 Cycles"; the final build marks the entries with one ✓ and five ? symbols.]
  134. Two Key Questions

    1. How to identify only useful instructions? Iterative Backward Dependency Analysis (IBDA)
    2. How to recycle (physical) registers? Runahead Register Reclamation
  135. Iteratively Identifying the Stalling Slices 25

  136. Iteratively Identifying the Stalling Slices 25 L4 A1 A2 A3

    r1 r2 r3 r4 r2 r3 r4 r5
  137. Iteratively Identifying the Stalling Slices 25 L4 A1 A2 A3

    Register Allocation Table (RAT) r1 r2 r3 r4 r2 r3 r4 r5
  138. Iteratively Identifying the Stalling Slices 25 L4 A1 A2 A3

    Register Allocation Table (RAT) Arch. register r1 r2 r3 r4 r2 r3 r4 r5
  139. Iteratively Identifying the Stalling Slices 25 L4 A1 A2 A3

    Phy. register Register Allocation Table (RAT) Arch. register r1 r2 r3 r4 r2 r3 r4 r5
  140. Iteratively Identifying the Stalling Slices 25 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register r1 r2 r3 r4 r2 r3 r4 r5
  141. Iteratively Identifying the Stalling Slices 25 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5
  142. Iteratively Identifying the Stalling Slices 25 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A0
  143. Iteratively Identifying the Stalling Slices 25 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A0
  144. Iteratively Identifying the Stalling Slices 25 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A0
  145. Iteratively Identifying the Stalling Slices 25 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 A0
  146. Iteratively Identifying the Stalling Slices 25 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 A0
  147. Iteratively Identifying the Stalling Slices 25 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) A0
  148. Iteratively Identifying the Stalling Slices 26 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) Iteration-1: L4 stalls the window A0
  149. Iteratively Identifying the Stalling Slices 26 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) L4 Iteration-1: L4 stalls the window A0
  150. Iteratively Identifying the Stalling Slices 27 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) L4 A0
  151. Iteratively Identifying the Stalling Slices 27 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) L4 Iteration-2: L4 hits in the SST A0
  152. Iteratively Identifying the Stalling Slices 27 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) L4 While renaming source r4, read A3 Iteration-2: L4 hits in the SST A0
  153. Iteratively Identifying the Stalling Slices 27 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) L4 While renaming source r4, read A3 Iteration-2: L4 hits in the SST A3 A0
  154. Iteratively Identifying the Stalling Slices 28 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) L4 Iteration-3: A3 hits in the SST A3 A0
  155. Iteratively Identifying the Stalling Slices 28 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) L4 While renaming source r3, read A2 Iteration-3: A3 hits in the SST A3 A0
  156. Iteratively Identifying the Stalling Slices 28 r1 r2 r3 r4

    r5 P1 P2 P3 P4 P5 L4 A1 A2 A3 Phy. register Register Allocation Table (RAT) Arch. register Last-writer instruction r1 r2 r3 r4 r2 r3 r4 r5 A1 A2 A3 L4 Stalling Slice Table (SST) L4 While renaming source r3, read A2 Iteration-3: A3 hits in the SST A3 A2 A0
  159. Iteratively Identifying the Stalling Slices

    Iteration-4: A2 hits in the SST. While renaming source r2, read A1.
    Stalling Slice Table (SST): L4, A3, A2, A1.
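The iterations on these slides can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware: the instruction encoding, the `run_iteration` helper, and the use of instruction names instead of PCs as SST keys are all assumptions made for the example. The dependence chain (A1 to L4) is read off the slide, and the SST is assumed to already hold the stalling load L4 after the first iteration.

```python
# Hypothetical model of the slide's example; names are illustrative only.
# Each instruction maps to (destination arch register, source arch registers).
program = {
    "A1": ("r2", ["r1"]),
    "A2": ("r3", ["r2"]),
    "A3": ("r4", ["r3"]),
    "L4": ("r5", ["r4"]),  # the stalling load
}


def run_iteration(program, sst, last_writer):
    """Rename one pass of the loop body in program order.

    On an SST hit, the last writer of each source register is read from
    the RAT model and appended to the SST. Returns the newly added entries.
    """
    added = []
    for name, (dest, sources) in program.items():
        if name in sst:
            for src in sources:
                producer = last_writer.get(src)
                if producer is not None and producer not in sst:
                    sst.append(producer)
                    added.append(producer)
        # Renaming the destination records this instruction as its last writer.
        last_writer[dest] = name
    return added


sst = ["L4"]                 # assumed state after iteration 1
last_writer = {"r1": "A0"}   # A0 wrote r1 before the loop, as on the slide
history = [run_iteration(program, sst, last_writer) for _ in range(3)]

print(history)  # [['A3'], ['A2'], ['A1']]
print(sst)      # ['L4', 'A3', 'A2', 'A1']
```

Each pass adds exactly one producer, so a slice of depth n needs n passes over the loop before it is fully recorded, which matches the one-entry-per-iteration growth shown in the builds.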
  171. Runahead Register Reclamation

    Normal mode (instruction: dest, src1, src2 physical registers; OldPhy register):
    I1 add r1 ← r2, r3: dest P1, src1 P2, src2 P3; OldPhy P0
    I2 mul r2 ← r1, r4: dest P5, src1 P1, src2 P4; OldPhy P2
    I3 ld r1 ← mem[x]: dest P6; OldPhy P1
    I4 add r2 ← r1, r3: dest P7, src1 P6, src2 P3; OldPhy P5
    I5 add r2 ← r4, r5: dest P9, src1 P4, src2 P8; OldPhy P7
    I6 sub r1 ← r2, r6: dest P11, src1 P9, src2 P10; OldPhy P6
    Runahead mode (instruction, OldPhy register): I1 P0; I2 P2; I3 P1; I4 P5; I5 P7; I6 P6.
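The OldPhy column can be reproduced with a small renaming sketch. The initial RAT mapping and the free-list order below are not stated on the slide; they are back-derived from the physical register numbers in the table and should be treated as assumptions, as should the `rename` helper itself.

```python
# Illustrative renaming model; initial state is inferred from the slide's table.
rat = {"r1": "P0", "r2": "P2", "r3": "P3", "r4": "P4", "r5": "P8", "r6": "P10"}
free_list = ["P1", "P5", "P6", "P7", "P9", "P11"]  # assumed allocation order

# (name, destination arch register, source arch registers)
instrs = [
    ("I1", "r1", ["r2", "r3"]),
    ("I2", "r2", ["r1", "r4"]),
    ("I3", "r1", []),            # ld r1 <- mem[x]: no register sources shown
    ("I4", "r2", ["r1", "r3"]),
    ("I5", "r2", ["r4", "r5"]),
    ("I6", "r1", ["r2", "r6"]),
]


def rename(instrs, rat, free_list):
    """Return (name, new dest, source physical registers, OldPhy) per instruction."""
    rows = []
    for name, dest, sources in instrs:
        src_phys = [rat[s] for s in sources]  # read sources before remapping dest
        old_phy = rat[dest]                   # previous mapping of the destination
        new_phy = free_list.pop(0)
        rat[dest] = new_phy
        rows.append((name, new_phy, src_phys, old_phy))
    return rows


rows = rename(instrs, rat, free_list)
old_phys = [row[3] for row in rows]
print(old_phys)  # ['P0', 'P2', 'P1', 'P5', 'P7', 'P6']
```

OldPhy is the register that becomes dead once the renaming instruction is no longer speculative, which is why it is the only column the runahead-mode table keeps.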
  183. Runahead Register Reclamation

    Runahead mode: Precise Register Deallocation Queue (PRDQ).
    Entries (instruction, OldPhy register): I1 P0; I2 P2; I3 P1; I4 P5; I5 P7; I6 P6.
    Each entry carries an "Executed?" bit. After dispatch all bits are 0 (0 0 0 0 0 0); as instructions execute the bits are set (0 0 0 0 0 1, then 0 0 0 0 1 1, then 0 0 0 1 1 1). The final builds add ✓ ✓ marks.
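A minimal sketch of how such a queue could behave is given below. It assumes that registers are reclaimed in program order from the head of the queue once the head entry's Executed bit is set; the slide does not spell out this policy, so that rule, the `PRDQ` class, and the execution order in the example are all assumptions for illustration.

```python
from collections import deque


class PRDQ:
    """Toy FIFO of [instruction, OldPhy register, executed] entries."""

    def __init__(self):
        self.entries = deque()

    def dispatch(self, instr, old_phy):
        # Entries are allocated in program order with the Executed bit cleared.
        self.entries.append([instr, old_phy, False])

    def mark_executed(self, instr):
        for entry in self.entries:
            if entry[0] == instr:
                entry[2] = True
                return
        raise KeyError(instr)

    def reclaim(self):
        """Free OldPhy registers from the head while the head has executed."""
        freed = []
        while self.entries and self.entries[0][2]:
            _, old_phy, _ = self.entries.popleft()
            freed.append(old_phy)
        return freed


prdq = PRDQ()
for instr, old_phy in [("I1", "P0"), ("I2", "P2"), ("I3", "P1"),
                       ("I4", "P5"), ("I5", "P7"), ("I6", "P6")]:
    prdq.dispatch(instr, old_phy)

# Hypothetical out-of-order execution: I3 and I2 finish before I1.
prdq.mark_executed("I3")
prdq.mark_executed("I2")
first = prdq.reclaim()    # head (I1) has not executed, so nothing is freed
prdq.mark_executed("I1")
second = prdq.reclaim()   # I1, I2 and I3 now drain in program order

print(first)   # []
print(second)  # ['P0', 'P2', 'P1']
```

Under this assumed policy a register is never freed while an older runahead instruction could still read it, at the cost of holding a few registers slightly longer than strictly necessary.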
  200. Putting it All Together

    Pipeline blocks shown: I-Cache, Fetch, Decode, Micro-op Queue, Rename (RAT), Dispatch, Register Read, Issue, Execute, Commit.
    Legend: New, Modified, Existing.
    Stalling Slice Table: 402ed2, 4287fd, 428809, ….
    PRDQ (instruction, OldPhy register, Executed?): I1 P5 0; I2 P3 1; I3 0; ….
    Modes: Normal Mode, Runahead Mode.
  204. Evaluation

    Simulator: Sniper 6.0, McPAT
    Workloads: SPEC CPU2006/CPU2017, 1B SimPoints
    Baseline: ROB=192, issue queue=92, load/store queue=64, register file=168/168
  210. Evaluation

    OoO: Baseline out-of-order core
    RA: Runahead execution* -- No short runahead intervals -- No overlapping intervals
    RA-buffer: Runahead buffer**
    RA-hybrid: Better performing mechanism between RA-buffer and RA
    PRE: Precise runahead execution***
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15] ***[Naithani et al. HPCA’20]
  221. Evaluation – Performance

    [Bar chart: normalized IPC (axis 0.0 to 2.5) with % ROB stalled (axis 0 to 100) for OoO, RA, RA-buffer, RA-hybrid and PRE across zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik and HMean.]
    Callouts: RA: 15.9%; RA-buffer: 13.3%; RA-hybrid: 20%; PRE: 38.2%.
  226. Evaluation – Memory-Level Parallelism

    [Bar chart: normalized MLP (axis 0.0 to 3.5) for OoO, RA, RA-buffer, RA-hybrid and PRE across zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik and avg.]
    Callouts: RA: 1.5X; RA-buffer: 1.3X; RA-hybrid: 1.6X; PRE: 2.0X.
  231. Evaluation – LLC Miss Count Reduction

    [Bar chart: normalized LLC misses (axis 0.0 to 1.0) for OoO, RA, RA-buffer, RA-hybrid and PRE across zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik and avg.]
    Callouts: RA: 26.4%; RA-buffer: 27.7%; RA-hybrid: 31%; PRE: 50.2%.
  236. Evaluation – Energy

    [Bar chart: normalized energy (axis 0.80 to 1.15) for RA, RA-Buffer and PRE across zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik and avg. Off-scale bar labels: 0.71, 0.76, 0.8, 1.16.]
    Callouts: RA: +2.4%; RA-buffer: Same; RA-hybrid: Same; PRE: -6.2%.
  242. Conclusions

    1. Never flushes the ROB
    2. Executes only useful instructions in runahead mode
    3. Efficiently manages microarchitectural resources
    18.2% better performance
    6.2% better energy
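As a quick arithmetic check, the 18.2% figure matches the gap, in percentage points over the OoO baseline, between the PRE (38.2%) and RA-hybrid (20%) callouts on the performance slide. This is a reading of the slide numbers rather than a statement from the authors; the variable names are illustrative, and the ratio of the two speedups is shown only for comparison.

```python
# Callout values from the performance slide (improvement over OoO, in %).
pre_gain = 38.2
hybrid_gain = 20.0

# Gap in percentage points relative to the OoO baseline.
gap_points = round(pre_gain - hybrid_gain, 1)

# The same comparison expressed as a ratio of the two speedups.
relative_gain = round(((1 + pre_gain / 100) / (1 + hybrid_gain / 100) - 1) * 100, 1)

print(gap_points)     # 18.2
print(relative_gain)  # 15.2
```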
  243. Precise Runahead Execution Ajeya Naithani, Josue Feliu, Almutaz Adileh, Lieven Eeckhout