Runahead Execution Re-Executes All Instructions
[Figure: timeline. Labels: time; re-executed instructions L1, L2, L3, L4, L5, each marked "cache hit"; pipeline stages fetch, decode, rename, dispatch, issue, execute, commit.]

Runahead Buffer Finds Blocking Chain in the ROB
[Figure: timeline. Labels: time; ROB; full-window stall ("no progress") until the stalling load returns; stalling load L1; loads L (L1, L2, L3), each with a memory access; Memory-Level Parallelism (MLP); producer A (A1, A2). The animation adds the chain backward from the load: L1, then A2, then A1.]

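The backward search for a load's producer chain in the ROB can be illustrated with a minimal sketch. It assumes a simplified ROB in which each entry names one destination register and its source registers; the `Uop` type, the register names, and the function `blocking_chain` are hypothetical and chosen only for illustration, not taken from the runahead buffer design itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Uop:
    """Hypothetical, simplified ROB entry (illustration only)."""
    name: str
    dst: Optional[str]
    srcs: Tuple[str, ...]


def blocking_chain(rob: List[Uop], load_index: int) -> List[str]:
    """Walk the ROB backward from a load and collect its producers."""
    needed = set(rob[load_index].srcs)   # registers still waiting for a producer
    chain = [rob[load_index].name]
    for uop in reversed(rob[:load_index]):
        if uop.dst in needed:
            chain.append(uop.name)
            needed.discard(uop.dst)      # this producer satisfies the register
            needed.update(uop.srcs)      # now its own inputs must be produced
    chain.reverse()                      # report in program order
    return chain


# Assumed example window: A1 and A2 compute the address used by load L1.
rob = [
    Uop("A1", "r1", ("r1",)),  # r1 = r1 + 8
    Uop("X", "r5", ("r6",)),   # unrelated instruction
    Uop("A2", "r2", ("r1",)),  # r2 = r1 << 2
    Uop("L1", "r3", ("r2",)),  # r3 = load [r2]
]
print(blocking_chain(rob, 3))
```

In this example window the walk visits A2 and then A1 and skips the unrelated instruction X, so the reported chain is A1, A2, L1 in program order.
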
Runahead Buffer Re-Executes the Window
[Figure: timeline. Labels: time; re-executed instructions L1, L2, L3, each marked "cache hit"; pipeline stages fetch, decode, rename, dispatch, issue, execute, commit.]

Flushing and Re-Filling Incur High Overhead
▪ Front-end refill = 8 cycles
▪ ROB = 192, width = 4: ROB fill time = 48 cycles
▪ Total overhead = 56 cycles
Runahead causes a pipeline bubble of 56 cycles per invocation

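The overhead figures above follow from simple arithmetic, reproduced here as a small sketch. It assumes the ROB fill time is the ROB size divided by the pipeline width (which matches 192 / 4 = 48); the function and parameter names are hypothetical.

```python
import math


def flush_refill_overhead(rob_entries: int, width: int, frontend_refill: int) -> int:
    """Cycles lost per runahead invocation under the assumed model."""
    # Assumption: the ROB refills at `width` instructions per cycle.
    rob_fill = math.ceil(rob_entries / width)
    return frontend_refill + rob_fill


# Slide values: 192-entry ROB, width 4, 8-cycle front-end refill.
print(flush_refill_overhead(192, 4, 8))  # 56
```

Under this model the bubble grows linearly with ROB size, so a larger window makes each flush more expensive.
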
Runahead Techniques Relative to OoO Core
                        Runahead execution*   Runahead buffer**
Flush ROB               ✓                     ✓
Short intervals
Instructions executed   All                   Only one slice
*[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15]

Runahead Techniques Provide Limited Prefetch Coverage
▪ Runahead execution: Executes useless instructions
▪ Runahead buffer: High coverage for only one slice

A Single Load Does Not Lead to the Majority of Memory Accesses
Most of the long-latency loads during runahead differ from the stalling load
[Figure: stacked bar chart of LLC misses per benchmark, 0% to 100%, split into misses identical to the stalling load and misses distinct from the stalling load. Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik.]

Runahead Techniques Relative to OoO Core
                        Runahead execution*   Runahead buffer**   Precise runahead***
Flush ROB               ✓                     ✓
Short intervals                                                   ✓
Instructions executed   All                   Only one slice      All slices
Performance             High                  High                Very high
Energy-Efficiency       Low                   Same                High
*[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15] ***[Naithani et al. HPCA’20]

Precise Runahead Execution (PRE)
Key insight: There are sufficient resources to (start) run ahead without flushing the ROB
When running ahead:
1. Executes only useful instructions in runahead mode
2. Efficiently manages microarchitectural resources

Processor Resources During Runahead
[Figure: three animated timelines of a full-window stall. Common labels: time; current ROB; loads L (L1, L2, L3, L4); producer A (A1, A2, A3); future instructions; issue queue; register file; commit; memory accesses; Normal and Speculative. The first timeline starts from a stalled window ("no progress") and adds future instructions, load L4 with its memory access, and A1, A2, A3. In the second, A1, A2, and A3 are removed one at a time, then L4. The third adds a label B, a 200-cycle interval, and a final ✓.]

Two Key Questions
1. How to identify only useful instructions? Iterative Backward Dependency Analysis (IBDA)
2. How to recycle (physical) registers? Runahead Register Reclamation

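The slide names IBDA without describing it. As a rough illustration of the general idea behind iterative backward dependency analysis, the sketch below marks a target load as part of a slice and, each time a slice instruction is dispatched, also marks the most recent writers of its source registers. Repeated over loop iterations, the slice grows backward one producer level at a time. The class `IbdaSketch`, the `is_target_load` flag, the PCs, and the register names are all hypothetical; this is not the hardware structure used in PRE.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Instr = Tuple[int, Optional[str], Tuple[str, ...], bool]  # (pc, dst, srcs, is_target_load)


class IbdaSketch:
    """Simplified software model of iterative backward dependency analysis."""

    def __init__(self) -> None:
        self.slice_pcs: Set[int] = set()       # PCs known to feed a target load
        self.last_writer: Dict[str, int] = {}  # register -> PC of its latest writer

    def dispatch(self, pc: int, dst: Optional[str],
                 srcs: Tuple[str, ...], is_target_load: bool = False) -> None:
        if is_target_load:
            self.slice_pcs.add(pc)
        if pc in self.slice_pcs:
            # Pull the producers of this instruction's inputs into the slice.
            for reg in srcs:
                producer = self.last_writer.get(reg)
                if producer is not None:
                    self.slice_pcs.add(producer)
        if dst is not None:
            self.last_writer[dst] = pc


def run_iteration(ibda: IbdaSketch, body: Iterable[Instr]) -> None:
    """Dispatch one pass over a loop body."""
    for pc, dst, srcs, is_load in body:
        ibda.dispatch(pc, dst, srcs, is_load)


# Assumed loop body: two address-generating instructions, one unrelated, one load.
LOOP_BODY = [
    (0x10, "r1", ("r1",), False),  # A1: r1 = r1 + 8
    (0x14, "r2", ("r1",), False),  # A2: r2 = r1 << 2
    (0x18, "r5", ("r6",), False),  # unrelated
    (0x1C, "r3", ("r2",), True),   # L: r3 = load [r2]
]

ibda = IbdaSketch()
for _ in range(3):
    run_iteration(ibda, LOOP_BODY)
print(sorted(hex(pc) for pc in ibda.slice_pcs))
```

In this model the first iteration finds the load and its direct producer (0x1c and 0x14), the second adds 0x10, and the unrelated instruction at 0x18 never enters the slice.
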
Putting it All Together
[Figure: pipeline diagram, Normal Mode. Blocks: I-Cache, Fetch, Decode, Micro-op Queue, Rename (RAT), Dispatch, Issue, Register Read, Execute, Commit. Legend: New, Modified, Existing.]

Evaluation
OoO: Baseline out-of-order core
RA: Runahead execution*
-- No short runahead intervals
-- No overlapping intervals
RA-buffer: Runahead buffer**
RA-hybrid: Better performing mechanism between RA-buffer and RA
PRE: Precise runahead execution***
*[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15] ***[Naithani et al. HPCA’20]

Conclusions
1. Never flushes the ROB
2. Executes only useful instructions in runahead mode
3. Efficiently manages microarchitectural resources
18.2% better performance
6.2% better energy