
Fine-tuning Branch Predictors & Cache Hierarchy in TimingSimple CPU

This project slide was presented as part of the UTD CS 6304 project.


Arjun Sunil Kumar

May 11, 2022


Transcript

  1. Analysis on Branch Predictors Computer Architecture: Project 1 Sangeetha Pradeep

    (SXP210004) Arjun Sunil Kumar (AXS210011)
  2. Overview • Prediction • Simulator • Benchmark Programs • Environment

    • Gem5 Code changes ◦ V20 ◦ V21 • Build Commands • Benchmarking ◦ Hello World ◦ Sjeng ◦ LBM • Result
  3. Prediction

  4. Why Prediction? Simply put, prediction helps us prepare better for

    the situation ahead. The same goes for computers.
  5. Why Branch Prediction? A branch predictor guesses the next fetch address

    to be executed. Importance: • Branches are frequent: 15-25% of all instructions • Branch prediction helps reduce CPU stalls in a pipelined processor
  6. What are Branch Predictors? They are hardware components inside the CPU

    that perform branch prediction.
  7. Simulator

  8. What is Gem5? The gem5 simulator is an open-source system-level

    and processor simulator. It is used in both academic research and industry. Fun fact: gem5 was born out of the merger of m5 (a CPU simulation framework) and GEMS (a memory timing simulator).
  9. Branch Predictors Available In gem5 we have the following branch

    predictors • LocalBP • BiModeBP • TournamentBP
  10. Branch Predictor : LocalBP Uses the PC to index into

    a table of counters. It is a one-level (local) predictor. The counters can be 1-bit or 2-bit, depending on localCtrBits: • If it is set to 1, we get a last-time predictor. • If it is set to 2, we get a two-bit counter predictor. (A configuration sketch follows below.)
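A minimal configuration sketch of this predictor (our illustration, not the project's actual change; it assumes a gem5 v20-style Python config script, with parameter names taken from the LocalBP SimObject):

```python
# Hedged sketch: attaching LocalBP to a TimingSimpleCPU in a gem5 config script.
# The values shown are the "high config" used later in the deck.
from m5.objects import TimingSimpleCPU, LocalBP

cpu = TimingSimpleCPU()
cpu.branchPred = LocalBP(
    localPredictorSize=2048,
    localCtrBits=2,   # 2 -> two-bit counter predictor; 1 -> last-time predictor
    BTBEntries=4096,
)
```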
  11. Branch Predictor : BiModeBP BiModeBP is a two-level branch predictor.

    (i.e., a global predictor plus a choice predictor)
  12. Branch Predictor : TournamentBP Combination of one local and one

    global predictor along with a choice predictor (as in the Alpha 21264 tournament predictor).
  13. Branch Predictor Comparison Comparison of gem5 branch predictors we will

    be using for our analysis. • LocalBP(): one-level BP; captures local correlation. Tradeoff: no global correlation captured. • BiModeBP(): two-level BP; captures global correlation. Tradeoff: no local correlation captured. • TournamentBP(): uses one global, one local, and a choice predictor. Tradeoff: requires more resources (transistors).
  14. Benchmark Programs

  15. Programs used in benchmarking We mainly ran gem5 simulation on

    the below 2 programs: 1. Sjeng: a chess engine supporting game modes like Crazyhouse, Suicide, Losers, and Bughouse. 2. LBM: Lattice Boltzmann methods, a fluid-simulation algorithm.
  16. Programs used in benchmarking: Cont'd Additionally, we ran it on

    a Hello World program for quick validation of parameters. 3. Hello World: a simple C program that prints "Hello World".
  17. Env

  18. Challenges we ran into: While setting up gem5, we ran

    into a few challenges • For local execution, gem5's latest version (v21.1.0.2) was not the same as the gem5 version used in the reference slides, so the gem5 code changes were not straightforward. • Dependency issues w.r.t. running the older gem5 version (v20.0.0.3) on macOS. • On the Mac M1, we faced challenges setting up a Linux VM. • The scons build and benchmark simulations were taking a lot of time. • The latest macOS update (Monterey) caused dependency issues.
  19. How we resolved it: The show must go on! This

    is how we made it work. Local ◦ We modified the latest (v21.1.0.2) gem5 source code and managed to get our custom stats logged. ◦ The macOS Monterey update issue was solved by updating MacPorts (v2.7.1) and Xcode Cloud. ◦ To ensure correctness, we did the same exercise on the older gem5 version (v20.0.0.3) on a Linux AWS instance. This also gave us the freedom to run benchmarks in parallel by spinning up extra instances. ◦ We used a c4.4xlarge instance (16 cores) for faster builds using 9 threads. GitHub ◦ We used git and feature branches to maintain the different test scenarios. This let us easily push changes to the EC2 instances without editing code in Vim on the CLI.
  20. Gem5 Code Changes-V20

  21. Gem5 : V20 vs V21 We added custom stats in

    gem5 • latest version v21.1.0.2 (the source code is a bit different here) and • older version v20.0.0.3 (to follow along with the reference slides)
  22. What we added in Stats? We added 2 custom stats

    to the gem5 branch predictors. • BTB (Branch Target Buffer) Miss Percentage: BTB Miss = 1 - (BTBHits / BTBLookups); BTB Miss Percentage = BTB Miss x 100 • Branch Misprediction Percentage: Branch Misprediction Percentage = (numBranchMispred / numBranches) x 100
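A minimal sketch of the two formulas in plain Python (hypothetical variable names; the project's actual change was made inside gem5's C++ branch-predictor statistics):

```python
# Hedged sketch of the two custom stats, with hypothetical raw-counter names.
def btb_miss_pct(btb_hits: int, btb_lookups: int) -> float:
    """BTB Miss Percentage = (1 - BTBHits / BTBLookups) * 100."""
    return (1.0 - btb_hits / btb_lookups) * 100.0 if btb_lookups else 0.0

def branch_mispred_pct(num_branch_mispred: int, num_branches: int) -> float:
    """Branch Misprediction Percentage = (numBranchMispred / numBranches) * 100."""
    return (num_branch_mispred / num_branches) * 100.0 if num_branches else 0.0
```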
  23. What we changed in Params? We enabled different branch

    predictors. • LocalBP() • BiModeBP() • TournamentBP() We also modified the following configuration parameters for each branch predictor • BTB Entries • Local Predictor Size • Global Predictor Size • Choice Predictor Size
  24. Total Params Combination

    High Config • LocalBP: { "BTBEntries": 4096, "localPredictorSize": 2048 } • BiModeBP: { "BTBEntries": 4096, "globalPredictorSize": 8192, "choicePredictorSize": 8192 } • TournamentBP: { "BTBEntries": 4096, "localPredictorSize": 2048, "globalPredictorSize": 8192, "choicePredictorSize": 8192 } Low Config • LocalBP: { "BTBEntries": 2048, "localPredictorSize": 1024 } • BiModeBP: { "BTBEntries": 2048, "globalPredictorSize": 2048, "choicePredictorSize": 2048 } • TournamentBP: { "BTBEntries": 2048, "localPredictorSize": 1024, "globalPredictorSize": 4096, "choicePredictorSize": 4096 } In total we need 6 iterations (3 predictors x 2 configs) on each benchmarked program for our analysis (sketched in code below).
  25. Where we changed? Adding LocalBP()

  26. Where we changed? Cont'd Adding BTB Miss Percentage

  27. Where we changed? Cont'd Adding Branch Misprediction Percentage

  28. Where we changed? Cont'd Updating BTBEntries

  29. Where we changed? Cont'd Updating predictorSize for LocalBP, BiModeBP &

    TournamentBP Note: In the screenshot we haven't modified the default values; we will modify them at execution time.
  30. Gem5 Code Changes-V21

  31. Where we changed? Adding LocalBP()

  32. Where we changed? Cont'd Adding BTB Miss Percentage

  33. Where we changed? Cont'd Adding Branch Misprediction Percentage

  34. Where we changed? Cont'd Updating • BTBEntries • PredictorSize for

    LocalBP, BiModeBP & TournamentBP
  35. Build Commands

  36. Log in to the AWS EC2 instance We ran the gem5 simulation on a

    Linux EC2 instance. Below are the commands to connect to EC2 over SSH.
  37. Install Dependencies To install the gem5 dependencies, use the

    commands below.
  38. Download gem5 using Git We pushed the modified gem5 source

    code into our personal GitHub repository so we could easily push changes from our local editors to the EC2 instance. This avoided editing code with vim in the terminal.
  39. Build Gem5 To build gem5, we used scons

  40. Benchmarking Hello World

  41. Execution Command explained So far, we have built the gem5 simulator

    (gem5.opt). We will now use • gem5.opt (the simulation platform) and • se.py (meant for simulation using the system-call emulation mode) to run a benchmark on • tests/test-progs/hello/bin/x86/linux/hello (prints "Hello World")
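The exact command is shown as a screenshot in the slides; a hedged equivalent (assumed paths, wrapped in Python for consistency with the other sketches) looks roughly like this:

```python
# Hedged sketch: run gem5.opt with se.py in syscall-emulation mode on the bundled
# hello binary, using the TimingSimple CPU model.
import subprocess

subprocess.run([
    "build/X86/gem5.opt",                                # simulator built by scons
    "configs/example/se.py",                             # syscall-emulation script
    "--cpu-type=TimingSimpleCPU",
    "--cmd=tests/test-progs/hello/bin/x86/linux/hello",  # prints "Hello World"
], check=True)
```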
  42. Benchmarking HelloWorld To test if things are working, we can

    run a simulation on a hello-world program.
  43. m5out folder There are three files generated in a directory

    m5out after execution. We will mainly focus on the two below: • config.ini: contains a list of every SimObject created for the simulation and the values of its parameters. • stats.txt: contains all of the gem5 statistics registered for the simulation.
  44. Gem5 Config.ini file We can see from this file, whether

    the modified arguments were picked up during simulation.
  45. Bonus: Gem5 v21 Config.dot file In gem5 v21, the config.dot

    is generated when we run it with configs/tutorial/Simple.py. This file can be imported into online visualizers to create flow diagrams. A sample image is attached below. Note: this might not be generated with se.py.
  46. Gem5 Stats.txt file This file contains execution statistics.

  47. Data Analysis : Columns We will be using the below

    attributes for our data analysis. • BTBHitPct • BTBMissPCT • BranchMisPredictionPct
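A small parsing sketch for pulling these columns out of stats.txt (our illustration; it assumes the custom stat names appear as suffixes of the stat keys, which depends on how the stats were registered):

```python
# Hedged sketch: extract the analysis columns from m5out/stats.txt.
WANTED = ("BTBHitPct", "BTBMissPCT", "BranchMisPredictionPct")

def read_stats(path: str = "m5out/stats.txt") -> dict:
    values = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[0].endswith(WANTED):
                values[parts[0]] = float(parts[1].rstrip("%"))
    return values

print(read_stats())
```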
  48. Data Analysis : Visualization

  49. Data Analysis : Observation For BranchMispredPct, we can see •

    In High Config, LocalBP < TournamentBP < BiModeBP • In Low Config, LocalBP < TournamentBP < BiModeBP For BTBMissPct, we can see • In High Config, LocalBP < TournamentBP < BiModeBP • In Low Config, LocalBP < TournamentBP < BiModeBP
  50. Data Analysis : Conclusion For Hello World, • In High

    Config: LocalBP works better • In Low Config: LocalBP works better. Since there is very little branching involved, the simple LocalBP works best.
  51. Benchmarking Sjeng

  52. Including Sjeng in Gem5 To test Sjeng, we added the

    sjeng code inside the gem5 source tree and added a runGem5.sh script. NOTE: the script appends the current timestamp to the m5out folder name to avoid overwriting previous results (see the sketch below).
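runGem5.sh itself appears only as a screenshot; a hedged Python rendering of the timestamp idea (the sjeng binary path and its input argument are assumptions):

```python
# Hedged sketch of the runGem5.sh idea: give each run its own output directory by
# appending the current timestamp, so repeated runs don't overwrite m5out.
import subprocess, time

outdir = f"m5out_{int(time.time())}"
subprocess.run([
    "build/X86/gem5.opt", f"--outdir={outdir}",  # gem5's output-directory flag
    "configs/example/se.py",
    "--cpu-type=TimingSimpleCPU",
    "--cmd=./sjeng",                             # assumed benchmark binary path
    "--options=ref.txt",                         # assumed sjeng input argument
], check=True)
```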
  53. Benchmarking Sjeng We just need to run “runGem5.sh” inside the

    sjeng folder to start the benchmark.
  54. Data Analysis : Visualization

  55. Data Analysis : Observation For BranchMisPredPct, we can see •

    In High Config, BiModeBP < TournamentBP < LocalBP • In Low Config, TournamentBP < BiModeBP < LocalBP For BTBMissPct, we can see • In High Config, BiModeBP < TournamentBP < LocalBP • In Low Config, TournamentBP < BiModeBP < LocalBP
  56. Data Analysis : Conclusion For Sjeng, • In High Config:

    BiMode works better • In Low Config: Tournament works better We see that global correlation is more significant here. Thought: in chess, the next move is calculated by traversing the possible set of next moves. These are repeated global routines acting on different board positions, hence the global correlation.
  57. Benchmarking LBM

  58. Including lbm in Gem5 To test lbm, we added the

    lbm code inside the gem5 source tree and added a runGem5.sh script. NOTE: space-separated arguments to lbm are passed in double quotes.
  59. Benchmarking LBM We just need to run “runGem5.sh” inside the

    lbm folder to start the benchmark.
  60. Data Analysis : Visualization

  61. Data Analysis : Observation For BranchMisPredPct, we can see •

    In High Config, TournamentBP < LocalBP < BiModeBP • In Low Config, TournamentBP < LocalBP < BiModeBP For BTBMissPct, we can see the values are very low across the board.
  62. Data Analysis : Conclusion For lbm, • In High Config:

    Tournament works better • In Low Config: Tournament works better For lbm, we see that local correlation is more significant. Thought: the algorithm traverses neighbouring lattice points within a grid/array structure. These positions are traversed again in the same pattern, thereby increasing the effectiveness of the BTB cache. Hence we see a higher BTBHitPct.
  63. Result

  64. BTB Misprediction % - Sjeng BTB Misprediction % for LocalBP

    & BiMode increases slightly when we decreases BTBEntrySize & predictorSize. BTB Misprediction % for Tournament decrease slightly when we decreases BTBEntrySize & predictorSize.
  65. BTB Misprediction % - lbm BTB Misprediction % for LocalBP

    & BiMode increases slightly when we decreases BTBEntrySize & predictorSize. BTB Misprediction % for Tournament decreases when we decreases BTBEntrySize & predictorSize.
  66. BTB Misprediction % - verdict For LocalBP & BiMode, •

    The higher the BTB entry count and predictor size, the lower the BTB miss percentage. Thought: increasing the BTB cache size helps reduce the BTB Misprediction %. For Tournament, • Lowering the BTB entry count and predictor size lowered the BTB miss percentage. Thought: this was most prominent in lbm (i.e., 7% in the high config vs 0.36% in the low config). It could be that lbm's branch-target working set fit easily in TournamentBP's BTB even in the lower config.
  67. Branch Misprediction % in High Config Tournament is overall better

    in High Config
  68. Branch Misprediction % in Low Config Tournament is overall better

    in Low Config
  69. Branch Misprediction % Verdict LocalBP - Suited when you don’t

    have many branches. - E.g., Hello World BiMode - Suited when you have more global correlation. - E.g., Sjeng Tournament - Overall the best branch predictor for the generic use case. - Works better in scenarios where you have both global and local correlation. - E.g., lbm Comparing all three branch predictors, we find Tournament to be best suited for generic use cases.
  70. Closing

  71. The Team Arjun Sunil Kumar AXS210011 MS CS (Systems Track)

    Sangeetha Pradeep SXP210004 MS CS (Systems Track)
  72. References Icons • Flaticon Branch Prediction • UTD CS6304 Slides

    • UCDavis Slides Gem5 • Gem5
  73. Thank you!

  74. None
  75. Analysis on Cache Computer Architecture: Project 2 Sangeetha Pradeep (SXP210004)

    Arjun Sunil Kumar (AXS210011)
  76. Overview • Introduction • Terminologies ◦ Cache Levels ◦ Cache

    Size ◦ Block Size ◦ Cache Associativity ◦ Miss Types • Simulator • Benchmark Programs • Gem5 Stats • Benchmarking ◦ Sjeng ◦ LBM • Overall Conclusion
  77. Introduction

  78. I need the info quick! Illustration

  79. I need the info quick! Con’t

  80. What is a Cache? A cache is fast memory used to store

    program instructions and data that are used repeatedly in the operation of programs, or information that the CPU is likely to need next. Importance: fast access to instructions and data increases the overall speed of the program.
  81. Terminologies

  82. Levels? There is a hierarchy of caches, based on

    proximity to the CPU, access time, and size. https://www.hardwaretimes.com/difference-between-l1-l2-and-l3-cache-what-is-cpu-cache/ Cache access time: L1 < L2 < L3
  83. Cache Sizes? Cache cost (per byte): L1 > L2 > L3.

    Cache size: L1 < L2 < L3. Different processors have different cache sizes based on their price tags.
  84. Block Sizes? Increasing the block size exploits more

    spatial locality. (UTD CS 6304 slide)
  85. Associativity? http://csillustrated.berkeley.edu/PDFs/handouts/cache-3-associativity-handout.pdf Illustration

  86. Associativity? Con’t http://csillustrated.berkeley.edu/PDFs/handouts/cache-3-associativity-handout.pdf

  87. Associativity? Con’t https://www.geeksforgeeks.org/cache-organization-set-1-introduction/ Summarized

  88. Miss Types? http://csillustrated.berkeley.edu/PDFs/handouts/cache-2-misses-handout.pdf Illustration

  89. Miss Types? Con’t http://csillustrated.berkeley.edu/PDFs/handouts/cache-2-misses-handout.pdf

  90. Simulator

  91. Cache Configurations In gem5 we have the following cache configurations

    • Cache Levels : ◦ L1 : L1-Instruction & L1-Data with max size of 128KB total ◦ L2: Unified cache with max size of 1MB total • Associativity : Direct, 2-way, 4-way, 8-way .... n-way • Block Size: 32 bytes or 64 bytes • Block replacement policy : LRU, FIFO etc.
  92. Command Line Arguments We can use this command to pass

    in our cache parameters.
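The command itself is shown as a screenshot; a hedged sketch of that kind of invocation, using se.py's standard cache options with the high configuration as an example (the benchmark path and arguments are assumptions):

```python
# Hedged sketch: passing cache parameters to se.py via its standard cache flags.
import subprocess

subprocess.run([
    "build/X86/gem5.opt", "configs/example/se.py",
    "--cpu-type=TimingSimpleCPU",
    "--caches", "--l2cache",
    "--l1i_size=128kB", "--l1d_size=128kB", "--l2_size=1MB",
    "--l1i_assoc=8", "--l1d_assoc=8", "--l2_assoc=8",
    "--cacheline_size=64",
    "--cmd=./sjeng", "--options=ref.txt",        # assumed benchmark invocation
], check=True)
```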
  93. Benchmark Programs

  94. Programs used in benchmarking We mainly ran gem5 simulation on

    the below 2 programs: 1. Sjeng: a chess engine supporting game modes like Crazyhouse, Suicide, Losers, and Bughouse. 2. LBM: Lattice Boltzmann methods, a fluid-simulation algorithm.
  95. Gem5 Stats

  96. CPI We are required to calculate the Cycles Per Instruction

    with the below assumptions: • L1 miss penalty = 6 cycles • L2 miss penalty = 50 cycles • Cache hit / instruction execution = 1 cycle. The CPI then follows from these penalties (a worked sketch is given below).
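One consistent reading of these assumptions (our reconstruction; the exact formula on the slide is an image) is sketched below:

```python
# Hedged sketch of the CPI calculation under the stated assumptions:
# 1 cycle per instruction (hit), +6 cycles per L1 miss, +50 cycles per L2 miss.
# Miss counts would come from stats.txt (icache + dcache misses combined for L1).
def cpi(insts: int, l1_misses: int, l2_misses: int) -> float:
    return 1.0 + (6 * l1_misses + 50 * l2_misses) / insts

# Example: 1M instructions, 20k combined L1 misses, 5k L2 misses.
print(cpi(1_000_000, 20_000, 5_000))   # -> 1.37
```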
  97. Total Params Combination Total Combinations = 32 NOTE: In order

    to reduce iterations, we are considering only border cases. • Associativity combinations: 1-way for low and 8-way for high. • L2 Cache Sizes: 256KB for low and 1MB for high.
  98. Benchmarking Sjeng

  99. Data Analysis : Visualization 1

  100. Data Analysis : Visualization 2

  101. Data Analysis : Visualization 3

  102. Data Analysis : Observation 1. For the lowest CPI, below

    are the best configurations under each factor • Associativity: L1 8-way & L2 8-way • Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB • Block Size: 64B block size 2. For the L2 cache • Size changes don't significantly reduce the CPI • Associativity increases don't significantly reduce the CPI i.e., the L2 cache is not significant in Sjeng (in chess, the next move is calculated by traversing the possible set of next moves; these are repeated global routines acting on different board positions, so they don't add much spatial locality).
  103. Data Analysis : Conclusion • Lowest CPI Obtained is 1.938346516

    with the below configuration ◦ 64B block size ◦ L1-I 128KB & L1-D 128KB (8-way associative) ◦ L2 1MB (8-way associative)
  104. Benchmarking LBM

  105. Data Analysis : Visualization 1

  106. Data Analysis : Visualization 2

  107. Data Analysis : Visualization 3

  108. Data Analysis : Observation 1. For the lowest CPI, below

    are the best configurations under each factor • Associativity: L1 8-way & L2 8-way • Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB • Block Size: 64B block size 2. For the L2 cache, increasing cache size and associativity helped in reducing the CPI. (The algorithm traverses neighbouring lattice points within a grid/array structure, thereby increasing spatial locality. So increasing L2 associativity and cache size also played a good role in reducing CPI.)
  109. Data Analysis : Conclusion • Lowest CPI Obtained is 1.800737936

    with the below configuration ◦ 64B block size ◦ L1-I 128KB & L1-D 128KB (8-way associative) ◦ L2 1MB (8-way associative)
  110. Overall Conclusion

  111. Optimal CPI Config For the optimal (lowest) CPI, below are the

    best configurations for each factor • Associativity: L1 8-way & L2 8-way ◦ Working: reduces conflicts, thereby reducing cache misses. ◦ Trade-off: more transistors required for the additional tag comparisons. • Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB ◦ Working: more entries can be stored, i.e., a better hit rate. ◦ Trade-off: cost increases. • Block Size: 64B block size ◦ Working: a larger block takes advantage of spatial locality. It also helps in reducing compulsory misses. ◦ Trade-off: an increase in conflict misses due to fewer blocks; a cache miss evicts an entire block, thereby increasing eviction overhead.
  112. Cost Function for Cache Points taken into consideration • L1

    cache is more expensive than L2 • Associativity complexity increases circuit cost (no. of transistors used for tag comparison) So, our cost function is Cost = (L1 size in KB x L1_cost) + (L2 size in KB x L2_cost) + (L1_assoc x assoc_cost) + (L2_assoc x assoc_cost) Unit costs are • L1_cost = $0.70 per KB • L2_cost = $0.05 per KB • assoc_cost = $0.02 per way
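A minimal sketch of this cost function in Python (a direct transcription of the formula and unit costs above; we assume "L1 size" means L1-I plus L1-D combined):

```python
# Cost = L1_KB*L1_cost + L2_KB*L2_cost + L1_assoc*assoc_cost + L2_assoc*assoc_cost
L1_COST = 0.70     # $ per KB of L1
L2_COST = 0.05     # $ per KB of L2
ASSOC_COST = 0.02  # $ per way of associativity

def cache_cost(l1_kb: float, l2_kb: float, l1_assoc: int, l2_assoc: int) -> float:
    return (l1_kb * L1_COST + l2_kb * L2_COST
            + (l1_assoc + l2_assoc) * ASSOC_COST)

# Example: L1-I 128KB + L1-D 128KB (8-way) and L2 1MB (8-way).
print(cache_cost(256, 1024, 8, 8))   # -> 230.72
```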
  113. CPI vs Cost : Visualization

  114. CPI vs Cost : Analysis A general observation here is that

    increasing the cost reduces the CPI, i.e., better performance comes at a higher cost.
  115. Closing

  116. The Team Arjun Sunil Kumar AXS210011 MS CS (Systems Track)

    Sangeetha Pradeep SXP210004 MS CS (Systems Track)
  117. References Icons • Flaticon Cache • UTD CS6304 Slides •

    GeeksforGeeks • CS Illustrated (Berkeley) Gem5 • Gem5
  118. Thank you!