Precise Runahead Execution

Ajeya Naithani
February 25, 2020

IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2020
Ajeya Naithani, Josué Feliu, Almutaz Adileh, and Lieven Eeckhout
Ghent University, Belgium, and Universitat Politecnica de Valencia, Spain

Transcript

  1. Precise Runahead Execution
    Ajeya Naithani, Josué Feliu,
    Almutaz Adileh, Lieven Eeckhout

  2-14. Full-Window Stalls Degrade Performance
    [Timeline diagram, built up step by step across these slides]
    Axis: time
    Legend: L = load
    The ROB holds loads L1, L2, L3; each issues a memory access
    L1 is the stalling load; the full-window stall lasts until the stalling load returns
    No progress is made during the full-window stall
    The memory accesses of L1, L2, L3 are labeled Memory-Level Parallelism (MLP)
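
The stall pictured on these slides can be reproduced with a toy model. The sketch below is an illustration only, not the authors' simulator. It assumes a hypothetical core that dispatches `width` instructions per cycle into a ROB of `rob_size` entries and commits in order, with one load at the ROB head that takes `miss_latency` cycles to return. The function name and the 200-cycle latency are assumptions; the ROB size of 192 and width of 4 are the values quoted later in the deck.

```python
def cycles_stalled(rob_size: int, width: int, miss_latency: int) -> int:
    """Cycles in which dispatch is blocked behind a missing load at the ROB head.

    Toy model (assumption): the load sits at the ROB head from cycle 0,
    younger instructions cannot commit past it, and dispatch adds `width`
    instructions per cycle until the ROB is full.
    """
    occupied = 1   # the stalling load itself
    stalled = 0
    for _ in range(miss_latency):
        if occupied >= rob_size:
            stalled += 1   # full-window stall: nothing new can enter the ROB
        else:
            occupied = min(rob_size, occupied + width)
    return stalled


# Hypothetical 200-cycle miss on a 192-entry, 4-wide core.
print(cycles_stalled(rob_size=192, width=4, miss_latency=200))  # 152
```

In this model the ROB fills in 48 cycles and the remaining 152 cycles of the miss are spent making no progress, which is the window runahead techniques try to use.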

  15-21. Runahead Execution Prefetches
    under a Full-Window Stall
    [Timeline diagram, built up step by step across these slides]
    Axis: time
    Legend: L = load
    The ROB holds loads L1, L2, L3; each issues a memory access
    L1 is the stalling load; the full-window stall lasts until the stalling load returns
    The stall, first marked "no progress", becomes the runahead interval
    During the runahead interval, loads L4 and L5 also issue memory accesses
    Result: Increased Memory-Level Parallelism (MLP)
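
The benefit shown on these slides can be put into a back-of-the-envelope formula. The sketch below is a deliberately crude model, not taken from the deck: it assumes every group of overlapped misses costs one full memory latency, and that runahead lets `runahead_extra` additional misses join each group. The function name, parameters, and the 200-cycle latency are all assumptions.

```python
import math


def memory_stall_cycles(num_misses: int, misses_per_window: int,
                        latency: int, runahead_extra: int = 0) -> int:
    """Total stall time if misses overlap in groups (toy MLP model).

    Each group of overlapped misses is assumed to cost one `latency`.
    `runahead_extra` is the number of additional misses that runahead
    prefetches during each full-window stall.
    """
    overlapped = misses_per_window + runahead_extra
    return math.ceil(num_misses / overlapped) * latency


# Five loads as on the slides: L1..L3 fit in the window, L4 and L5 do not.
print(memory_stall_cycles(5, 3, 200))                    # 400
print(memory_stall_cycles(5, 3, 200, runahead_extra=2))  # 200
```

With L4 and L5 prefetched during the stall, all five misses overlap and the second stall disappears in this model.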

  22-31. Runahead Execution Re-Executes
    All Instructions
    [Timeline diagram, built up step by step across these slides]
    Axis: time
    Re-executed instructions: loads L1, L2, L3, L4, L5, each now a cache hit
    Pipeline stages traversed again: fetch, decode, rename, dispatch, issue, execute, commit

  32-35. Runahead Buffer Finds
    Blocking Chain in the ROB
    [Timeline diagram, built up step by step across these slides]
    Axis: time
    Legend: L = load, A = producer
    The ROB holds loads L1, L2, L3; each issues a memory access (Memory-Level Parallelism, MLP)
    L1 is the stalling load; full-window stall with no progress until the stalling load returns
    Blocking chain found in the ROB, traced back from the load: L1, then A2, then A1
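
The backward walk shown on these slides (L1, then A2, then A1) resembles a generic backward dependence-slice search. The sketch below illustrates that idea only; the slides do not describe the actual runahead buffer hardware at this level, and the `Instr` type, register names, and `blocking_chain` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str]       # destination register, if any
    srcs: Tuple[str, ...]     # source registers


def blocking_chain(rob: List[Instr], load_index: int) -> List[str]:
    """Names of the instructions the load at `load_index` depends on, oldest first.

    Walks the ROB from the load toward older entries and keeps every
    instruction that produces a register still needed by the chain.
    """
    needed = set(rob[load_index].srcs)
    chain = [rob[load_index]]
    for instr in reversed(rob[:load_index]):
        if instr.dest is not None and instr.dest in needed:
            chain.append(instr)
            needed.discard(instr.dest)
            needed.update(instr.srcs)
    chain.reverse()
    return [i.name for i in chain]


# Hypothetical ROB contents: X is unrelated to the load's address.
rob = [
    Instr("A1", "r1", ("r1",)),   # r1 = r1 + 8
    Instr("X",  "r5", ("r6", "r7")),
    Instr("A2", "r2", ("r1",)),   # r2 = r1 << 2
    Instr("L1", "r3", ("r2",)),   # r3 = load [r2]
]
print(blocking_chain(rob, 3))  # ['A1', 'A2', 'L1']
```

The unrelated instruction `X` is skipped, which mirrors how only one slice ends up in the chain.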

  36-40. Runahead Buffer Executes
    Blocking Chain Speculatively
    [Timeline diagram, built up step by step across these slides]
    Axis: time
    Legend: L = load, A = producer
    The ROB holds loads L1, L2, L3; each issues a memory access
    L1 is the stalling load; the full-window stall lasts until the stalling load returns
    Blocking chain: A1, A2, L1
    During the stall, the chain A1, A2, L1 is executed speculatively (shown twice)
    Result: Increased Memory-Level Parallelism (MLP)

  41-44. Runahead Buffer Re-Executes the Window
    [Timeline diagram, built up step by step across these slides]
    Axis: time
    Re-executed instructions: loads L1, L2, L3, each now a cache hit
    Pipeline stages traversed again: fetch, decode, rename, dispatch, issue, execute, commit

  45-49. Runahead Techniques Relative to OoO Core
    [Table, built up step by step across these slides]
                 Runahead execution*   Runahead buffer**
    Flush ROB    ✓                     ✓
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15]

  50-54. Flushing and Re-Filling Incur High Overhead
    ▪ Front-end refill = 8 cycles
    ▪ ROB = 192 entries, width = 4
      ROB fill time = 192 / 4 = 48 cycles
    ▪ Total overhead = 8 + 48 = 56 cycles
    Runahead causes a pipeline bubble of 56 cycles per invocation
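
The overhead arithmetic on these slides can be checked in a few lines. This is a minimal sketch that takes the slide's numbers as defaults; the function name and the assumption that the ROB refills at exactly `width` instructions per cycle are illustrative.

```python
import math


def flush_refill_overhead(rob_size: int = 192, width: int = 4,
                          frontend_refill: int = 8):
    """Return (ROB fill time, total overhead) in cycles for one runahead invocation."""
    rob_fill = math.ceil(rob_size / width)   # cycles to refill the ROB at full width
    return rob_fill, frontend_refill + rob_fill


print(flush_refill_overhead())  # (48, 56)
```

Changing `rob_size` or `width` shows how the bubble scales: a larger window at the same width takes proportionally longer to refill.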

  55-59. Flushing and Re-Filling Incur High Overhead
    [Bar chart, built up step by step across these slides]
    Y-axis: normalized IPC (0.0 to 1.5)
    X-axis: bwave, cactus, fotonik, Gems, lbm, leslie, libqua, mcf, milc, omnet, parest, roms, soplex, sphinx, wrf, zeusm, HMean
    Series: OoO, RA, RA-no-overhead
    runahead: 15.9%
    runahead without flushing: 22.7%
    Difference: 6.8%
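
The chart summarizes per-benchmark results with a harmonic mean (the HMean bar) and reports a 6.8% gap between the two runahead variants. The sketch below shows both calculations under the assumption that the callouts are average improvements over the OoO baseline. The helper names are hypothetical, and the two-element IPC list is a placeholder, not data from the deck.

```python
from statistics import harmonic_mean


def hmean_normalized_ipc(normalized_ipc):
    """Harmonic mean of per-benchmark IPC values normalized to the baseline."""
    return harmonic_mean(normalized_ipc)


def gap_in_points(with_flush_pct: float, without_flush_pct: float) -> float:
    """Improvement lost to flushing and refilling, in percentage points."""
    return round(without_flush_pct - with_flush_pct, 1)


print(gap_in_points(15.9, 22.7))          # 6.8
print(hmean_normalized_ipc([1.0, 2.0]))   # placeholder inputs, not measured values
```

The harmonic mean is the usual choice for averaging rates such as IPC, because it weights slow benchmarks more heavily than an arithmetic mean would.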

  60-66. Runahead Techniques Relative to OoO Core
    [Table, built up step by step across these slides]
                            Runahead execution*   Runahead buffer**
    Flush ROB               ✓                     ✓
    Short intervals         ✗                     ✗
    Instructions executed   All                   Only one slice
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15]

  67-68. Runahead Techniques Provide
    Limited Prefetch Coverage
    ▪ Runahead execution: Executes useless instructions
    ▪ Runahead buffer: High coverage for only one slice

  69-70. Only One Load does not Lead to
    Majority of Memory Accesses
    Most of the long-latency loads during runahead differ
    from the stalling load
    [Stacked bar chart]
    Y-axis: LLC misses (0% to 100%)
    X-axis: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik
    Series: identical to stalling load, distinct from stalling load

  71-72. Applications Access Memory
    through Multiple Slices
    There are more than eight unique load instructions
    accessing memory during each runahead interval
    [Stacked bar chart]
    Y-axis: 0% to 100%
    X-axis: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik
    Series (unique load instructions): 1, 2-4, 5-7, 8+

  73-86. Runahead Techniques Relative to OoO Core
    [Table, built up step by step across these slides]
                            Runahead execution*   Runahead buffer**   Precise runahead***
    Flush ROB               ✓                     ✓                   ✗
    Short intervals         ✗                     ✗                   ✓
    Instructions executed   All                   Only one slice      All slices
    Performance             High                  High                Very high
    Energy-Efficiency       Low                   Same                High
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15] ***[Naithani et al. HPCA’20]

  87-91. Precise Runahead Execution (PRE)
    Key insight: There are sufficient resources to (start) run
    ahead without flushing the ROB
    When running ahead:
    1. Executes only useful instructions in runahead mode
    2. Efficiently manages microarchitectural resources

  92-98. Processor Resources at Full-Window Stall
    [Bar chart, built up step by step across these slides]
    Y-axis: % availability (0 to 100)
    X-axis: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, avg.
    Series: GP registers, FP registers, IQ entries
    Callouts: GP registers: 52%; FP registers: 56%; IQ entries: 37%
    There are sufficient resources to start runahead
    without flushing the ROB
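
To see what these availability percentages mean in absolute terms, they can be applied to concrete structure sizes. The sketch below uses the three percentages from the slide. The structure sizes passed in the example are hypothetical placeholders, since the deck does not state them here, and the function name is an assumption.

```python
# Percent of each structure that is free at a full-window stall (from the slide).
AVAILABILITY_PCT = {"gp_registers": 52, "fp_registers": 56, "iq_entries": 37}


def free_entries(sizes: dict) -> dict:
    """Free entries per structure, rounded down, given total structure sizes."""
    return {name: sizes[name] * pct // 100
            for name, pct in AVAILABILITY_PCT.items()}


# Hypothetical sizes, chosen only for illustration.
example_sizes = {"gp_registers": 168, "fp_registers": 168, "iq_entries": 92}
print(free_entries(example_sizes))
# {'gp_registers': 87, 'fp_registers': 94, 'iq_entries': 34}
```

Under these assumed sizes, dozens of registers and issue-queue entries are idle at the stall, which is the headroom the key insight relies on.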

  99. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L1 L2 L3
    Producer
    A
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative
    no progress

    View Slide

  100. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L1 L2 L3
    Producer
    A
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  101. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L1 L2 L3
    future instructions
    Producer
    A
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  102. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  103. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    issue
    queue
    register
    file
    commit
    memory memory memory memory
    Normal
    Speculative

    View Slide

  104. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A1
    issue
    queue
    register
    file
    commit
    memory memory memory memory
    Normal
    Speculative

    View Slide

  105. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A1 A2
    issue
    queue
    register
    file
    commit
    memory memory memory memory
    Normal
    Speculative

    View Slide

  106. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A1 A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory memory
    Normal
    Speculative

    View Slide

  107. Processor Resources
    During Runahead
    21
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A1 A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory memory
    Normal
    Speculative

    View Slide

  108. Processor Resources
    During Runahead
    22
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A1 A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  109. Processor Resources
    During Runahead
    22
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A1 A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  110. Processor Resources
    During Runahead
    22
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A1 A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  111. Processor Resources
    During Runahead
    22
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A1 A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  112. Processor Resources
    During Runahead
    22
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  113. Processor Resources
    During Runahead
    22
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  114. Processor Resources
    During Runahead
    22
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A2 A3
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

    View Slide

  115. Processor Resources
    During Runahead
    22
    time
    current ROB
    full-window
    stall
    Loads
    L
    L4
    L1 L2 L3
    future instructions
    Producer
    A
    A3
    issue
    queue
    register
    file
    commit
    memory memory memory
    Normal
    Speculative

  118. Processor Resources During Runahead
    [Diagram labels: time; current ROB; full-window stall; Loads L: L4, L1 L2 L3; future instructions; Producer A; issue queue; register file; commit; memory (x3); legend: Normal, Speculative]

  121. Processor Resources During Runahead
    [Diagram labels: time; current ROB; full-window stall; Loads L: L1 L2 L3; future instructions; Producer A; issue queue; register file; commit; memory (x4); legend: Normal, Speculative]

  124. Processor Resources During Runahead
    [Diagram labels: time; current ROB; full-window stall; Loads L: L4, L1 L2 L3; future instructions; Producer A: A1 A2 A3; issue queue; register file; commit; memory (x4); B; 200 Cycles; legend: Normal, Speculative]

  129. Processor Resources During Runahead
    [Diagram labels: time; current ROB; full-window stall; Loads L: L4, L1 L2 L3; future instructions; Producer A: A1 A2 A3; issue queue; register file; commit; memory (x4); ?; ? ? ? ?; legend: Normal, Speculative]

  134. Two Key Questions
    1. How to identify only useful instructions?
       Iterative Backward Dependency Analysis (IBDA)
    2. How to recycle (physical) registers?
       Runahead Register Reclamation

  147. Iteratively Identifying the Stalling Slices
    [Diagram: instructions L4, A1, A2, A3; register labels r1 r2 r3 r4 and r2 r3 r4 r5]
    Register Allocation Table (RAT):
    Arch. register   Phy. register   Last-writer instruction
    r1               P1              A0
    r2               P2              A1
    r3               P3              A2
    r4               P4              A3
    r5               P5              L4
    Stalling Slice Table (SST): (empty)

  149. Iteratively Identifying the Stalling Slices
    [Diagram: instructions L4, A1, A2, A3; register labels r1 r2 r3 r4 and r2 r3 r4 r5]
    Register Allocation Table (RAT):
    Arch. register   Phy. register   Last-writer instruction
    r1               P1              A0
    r2               P2              A1
    r3               P3              A2
    r4               P4              A3
    r5               P5              L4
    Iteration-1: L4 stalls the window
    Stalling Slice Table (SST): L4

  153. Iteratively Identifying the Stalling Slices
    [Diagram: instructions L4, A1, A2, A3; register labels r1 r2 r3 r4 and r2 r3 r4 r5]
    Register Allocation Table (RAT):
    Arch. register   Phy. register   Last-writer instruction
    r1               P1              A0
    r2               P2              A1
    r3               P3              A2
    r4               P4              A3
    r5               P5              L4
    Iteration-2: L4 hits in the SST
    While renaming source r4, read A3
    Stalling Slice Table (SST): L4, A3

  156. Iteratively Identifying the Stalling Slices
    [Diagram: instructions L4, A1, A2, A3; register labels r1 r2 r3 r4 and r2 r3 r4 r5]
    Register Allocation Table (RAT):
    Arch. register   Phy. register   Last-writer instruction
    r1               P1              A0
    r2               P2              A1
    r3               P3              A2
    r4               P4              A3
    r5               P5              L4
    Iteration-3: A3 hits in the SST
    While renaming source r3, read A2
    Stalling Slice Table (SST): L4, A3, A2

  159. Iteratively Identifying the Stalling Slices
    [Diagram: instructions L4, A1, A2, A3; register labels r1 r2 r3 r4 and r2 r3 r4 r5]
    Register Allocation Table (RAT):
    Arch. register   Phy. register   Last-writer instruction
    r1               P1              A0
    r2               P2              A1
    r3               P3              A2
    r4               P4              A3
    r5               P5              L4
    Iteration-4: A2 hits in the SST
    While renaming source r2, read A1
    Stalling Slice Table (SST): L4, A3, A2, A1
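
The four iterations above can be replayed with a small Python model. This is a sketch under stated assumptions, not the hardware described in the talk: the operands (A1: r2 from r1, A2: r3 from r2, A3: r4 from r3, L4: r5 from r4) are inferred from the slide's register labels, the same four instructions are assumed to be renamed once per iteration, and the names `LOOP`, `ibda_iteration`, `sst`, and `last_writer` are hypothetical.

```python
# Loop body: (name, destination register, source registers).
# The operands are an assumption inferred from the slide's register labels.
LOOP = [
    ("A1", "r2", ["r1"]),
    ("A2", "r3", ["r2"]),
    ("A3", "r4", ["r3"]),
    ("L4", "r5", ["r4"]),
]


def ibda_iteration(sst, last_writer, stalling=None):
    """Rename one pass of the loop and grow the Stalling Slice Table (SST)."""
    for name, dest, sources in LOOP:
        if name in sst:
            # SST hit: the last writer of each source joins the slice.
            for src in sources:
                sst.add(last_writer[src])
        last_writer[dest] = name  # the RAT remembers who wrote each register
    if stalling is not None:
        sst.add(stalling)  # the load that stalls the window seeds the slice
    return sst


sst = set()
last_writer = {"r1": "A0", "r2": "A1", "r3": "A2", "r4": "A3", "r5": "L4"}

history = [sorted(ibda_iteration(sst, last_writer, stalling="L4"))]  # iteration 1
for _ in range(3):  # iterations 2 to 4
    history.append(sorted(ibda_iteration(sst, last_writer)))

for number, contents in enumerate(history, start=1):
    print(f"Iteration-{number}: SST = {contents}")
```

Each pass adds exactly one producer, so a dependence chain of depth n needs n further iterations after the stalling load is first recorded.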

  171. Runahead Register Reclamation
    normal mode:
    instruction           dest  src1  src2  OldPhy register
    I1 add r1 ← r2, r3    P1    P2    P3    P0
    I2 mul r2 ← r1, r4    P5    P1    P4    P2
    I3 ld r1 ← mem[x]     P6    -     -     P1
    I4 add r2 ← r1, r3    P7    P6    P3    P5
    I5 add r2 ← r4, r5    P9    P4    P8    P7
    I6 sub r1 ← r2, r6    P11   P9    P10   P6
    runahead mode:
    instruction  OldPhy register
    I1           P0
    I2           P2
    I3           P1
    I4           P5
    I5           P7
    I6           P6
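
The OldPhy column of the table can be reproduced with a small register-renaming model. This is a sketch, not the authors' implementation: the initial architectural-to-physical mapping and the free-list order are assumptions chosen so that the output matches the slide, and `rat`, `free_list`, `program`, and `rename` are hypothetical names.

```python
from collections import deque

# Assumed initial mapping and free list, picked to match the slide's table.
rat = {"r1": "P0", "r2": "P2", "r3": "P3", "r4": "P4", "r5": "P8", "r6": "P10"}
free_list = deque(["P1", "P5", "P6", "P7", "P9", "P11"])

program = [
    ("I1", "r1", ["r2", "r3"]),  # add r1 <- r2, r3
    ("I2", "r2", ["r1", "r4"]),  # mul r2 <- r1, r4
    ("I3", "r1", []),            # ld  r1 <- mem[x]
    ("I4", "r2", ["r1", "r3"]),  # add r2 <- r1, r3
    ("I5", "r2", ["r4", "r5"]),  # add r2 <- r4, r5
    ("I6", "r1", ["r2", "r6"]),  # sub r1 <- r2, r6
]


def rename(name, dest, sources):
    """Rename one instruction; return (name, dest phys, source phys, OldPhy)."""
    src_phys = [rat[s] for s in sources]  # read sources before remapping dest
    old_phys = rat[dest]                  # previous mapping of the destination
    rat[dest] = free_list.popleft()       # allocate a fresh physical register
    return name, rat[dest], src_phys, old_phys


renamed = [rename(*insn) for insn in program]
for row in renamed:
    print(row)
```

The OldPhy value is the mapping that the instruction overwrites, which is why it is the register that becomes reclaimable once that instruction is done.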

  177. Runahead Register Reclamation
    runahead mode
    Precise Register Deallocation Queue (PRDQ) [labels: dispatch, execute]
    OldPhy register: I1 P0, I2 P2, I3 P1, I4 P5, I5 P7, I6 P6
    Executed?: 0, 0, 0, 0, 0, 0

  178. Runahead Register Reclamation
    runahead mode
    Precise Register Deallocation Queue (PRDQ) [labels: dispatch, execute]
    OldPhy register: I1 P0, I2 P2, I3 P1, I4 P5, I5 P7, I6 P6
    Executed?: 0, 0, 0, 0, 0, 1

  179. Runahead Register Reclamation
    runahead mode
    Precise Register Deallocation Queue (PRDQ) [labels: dispatch, execute]
    OldPhy register: I1 P0, I2 P2, I3 P1, I4 P5, I5 P7, I6 P6
    Executed?: 0, 0, 0, 0, 1, 1

  180. Runahead Register Reclamation
    runahead mode
    Precise Register Deallocation Queue (PRDQ) [labels: dispatch, execute]
    OldPhy register: I1 P0, I2 P2, I3 P1, I4 P5, I5 P7, I6 P6
    Executed?: 0, 0, 0, 1, 1, 1
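
A queue like the PRDQ could recycle registers along the following lines. This is a minimal sketch under assumptions that go beyond the slide: an entry is allocated at dispatch with the instruction's OldPhy register, the Executed? bit is set at execute, and a register is released only when its entry is at the head of the queue, so releases happen in program order. The class name, method names, and the example execution order are hypothetical.

```python
from collections import deque


class PRDQ:
    """Toy Precise Register Deallocation Queue: FIFO of [insn, old_phys, executed]."""

    def __init__(self):
        self.entries = deque()
        self.freed = []  # physical registers handed back to the free list

    def dispatch(self, insn, old_phys):
        # Allocate an entry in program order with the Executed? bit cleared.
        self.entries.append([insn, old_phys, False])

    def execute(self, insn):
        # Set the Executed? bit, then try to release from the head.
        for entry in self.entries:
            if entry[0] == insn:
                entry[2] = True
        self._release()

    def _release(self):
        # Assumption: only the head entry may release its register.
        while self.entries and self.entries[0][2]:
            _, old_phys, _ = self.entries.popleft()
            self.freed.append(old_phys)


q = PRDQ()
for insn, old in [("I1", "P0"), ("I2", "P2"), ("I3", "P1"),
                  ("I4", "P5"), ("I5", "P7"), ("I6", "P6")]:
    q.dispatch(insn, old)

q.execute("I2")  # I2 is done, but I1 at the head is not: nothing is freed
freed_after_i2 = list(q.freed)
q.execute("I1")  # the head executes: I1 and I2 release P0 and P2
freed_after_i1 = list(q.freed)
print(freed_after_i2, freed_after_i1)
```

Releasing strictly from the head keeps the bookkeeping to one bit per entry, at the cost of holding registers whose instructions finished early.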

  199. Putting it All Together
    [Pipeline diagram labels: I-Cache; Fetch; Decode; Micro-op Queue; Rename (RAT); Dispatch; Register Read; Issue; Execute; Commit; legend: New, Modified, Existing; Normal Mode; Runahead Mode]
    Stalling Slice Table: 402ed2, 4287fd, 428809, ….
    PRDQ: I1 P5 0; I2 P3 1; I3 0; ….

  204. Evaluation
    Simulator: Sniper 6.0, McPAT
    Workloads: SPEC CPU2006/CPU2017, 1B SimPoints
    Baseline: ROB=192, issue queue=92, load/store queue=64, register file=168/168

  210. Evaluation
    OoO: Baseline out-of-order core
    RA: Runahead execution*
    -- No short runahead intervals
    -- No overlapping intervals
    RA-buffer: Runahead buffer**
    RA-hybrid: Better performing mechanism between RA-buffer and RA
    PRE: Precise runahead execution***
    *[Mutlu et al. ISCA’05] **[Hashemi et al. MICRO’15] ***[Naithani et al. HPCA’20]

  216. Evaluation – Performance
    RA: 15.9%, RA-buffer: 13.3%, RA-hybrid: 20%, PRE: 38.2%
    [Bar chart: normalized IPC (0.0 to 2.5) and % ROB stalled (0 to 100) for zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, and HMean; series: OoO, RA, RA-buffer, RA-hybrid, PRE]
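
The HMean group presumably summarizes the per-benchmark bars with a harmonic mean, a common choice for averaging rate metrics such as IPC. The sketch below shows only the arithmetic; the three values are hypothetical placeholders, not numbers read from the chart, and `normalized_ipc` is an illustrative name.

```python
from statistics import harmonic_mean

# Hypothetical normalized-IPC values (relative to the OoO baseline).
# These are NOT taken from the slide; they only illustrate the calculation.
normalized_ipc = {"bench_a": 1.0, "bench_b": 1.5, "bench_c": 3.0}

# Harmonic mean: n divided by the sum of reciprocals.
hmean = harmonic_mean(list(normalized_ipc.values()))
print(f"HMean normalized IPC: {hmean:.2f}")
```

The harmonic mean weights slow benchmarks more heavily than the arithmetic mean does, so a single large speedup cannot dominate the summary.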

  226. Evaluation – Memory-Level Parallelism
    37
    [Figure: normalized MLP per benchmark for OoO, RA, RA-buffer, RA-hybrid, and PRE]
    Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, avg
    RA: 1.5X, RA-buffer: 1.3X, RA-hybrid: 1.6X, PRE: 2.0X
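
The slide reports MLP normalized to the OoO baseline but does not define the metric. A minimal sketch follows, assuming one common definition: the average number of outstanding long-latency misses over the cycles in which at least one miss is outstanding. The per-cycle traces are hypothetical, and `average_mlp` and `normalized` are illustrative names rather than anything taken from the talk.

```python
def average_mlp(outstanding_per_cycle):
    """Mean number of outstanding misses over cycles with at least one outstanding.

    Cycles with zero outstanding misses are excluded from the average.
    """
    busy = [n for n in outstanding_per_cycle if n > 0]
    return sum(busy) / len(busy) if busy else 0.0


def normalized(value, baseline):
    """Express a metric relative to a baseline (1.0 = same as baseline)."""
    return value / baseline


# Hypothetical per-cycle counts of outstanding misses (not measured data).
baseline_trace = [0, 1, 1, 0, 1, 1, 0, 0]    # misses serviced one at a time
overlapped_trace = [0, 2, 2, 2, 2, 0, 0, 0]  # two misses kept in flight together

base_mlp = average_mlp(baseline_trace)
new_mlp = average_mlp(overlapped_trace)
print(f"baseline MLP: {base_mlp:.2f}")
print(f"overlapped MLP: {new_mlp:.2f}")
print(f"normalized MLP: {normalized(new_mlp, base_mlp):.2f}X")
```

With these made-up traces the normalized MLP comes out at 2.00X, the same form as the annotations on the slide.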

  231. Evaluation – LLC Miss Count Reduction
    38
    [Figure: normalized LLC misses per benchmark for OoO, RA, RA-buffer, RA-hybrid, and PRE]
    Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, avg
    RA: 26.4%, RA-buffer: 27.7%, RA-hybrid: 31%, PRE: 50.2%

  236. Evaluation – Energy
    39
    [Figure: normalized energy per benchmark for RA, RA-Buffer, and PRE; y-axis from 0.80 to 1.15; bar value labels: 0.71, 0.76, 0.8, 1.16]
    Benchmarks: zeusm, cactus, wrf, Gems, leslie, omnet, milc, soplex, sphinx, bwave, libqua, lbm, mcf, roms, parest, fotonik, avg
    RA: +2.4%, RA-buffer: Same, RA-hybrid: Same, PRE: -6.2%

  242. Conclusions
    40
    1. Never flushes the ROB
    2. Executes only useful instructions in runahead mode
    3. Efficiently manages microarchitectural resources
    Results: 18.2% better performance, 6.2% lower energy

  243. Precise Runahead Execution
    Ajeya Naithani
    Josué Feliu
    Almutaz Adileh
    Lieven Eeckhout
