to be executed. Importance: • Branches are frequent: 15–25% of all instructions • Branch prediction reduces CPU stalls in a pipelined processor
and processor simulator. It is used in both academic research and industry. Fun fact: gem5 was born out of the merger of m5 (a CPU simulation framework) and GEMS (a memory timing simulator).
a table of counters. It is a one-level, i.e. local, predictor. The counters can be 1-bit or 2-bit, depending on localCtrBits: • If it is set to 1, we get a Last-Time Predictor. • If it is set to 2, we get a Two-Bit Counter Predictor.
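The two flavours selected by localCtrBits can be sketched with a single saturating-counter class (illustrative code, not gem5's actual implementation): with 1 bit the counter simply echoes the last outcome, while with 2 bits a single mispredicted branch is not enough to flip the prediction.

```python
class SaturatingCounter:
    """n-bit saturating counter; predicts taken when in the upper half."""

    def __init__(self, bits):
        self.max = (1 << bits) - 1
        self.value = 0  # start at strongly not-taken

    def predict(self):
        return self.value > self.max // 2

    def update(self, taken):
        # Saturate at 0 and at the maximum value instead of wrapping.
        if taken:
            self.value = min(self.value + 1, self.max)
        else:
            self.value = max(self.value - 1, 0)


last_time = SaturatingCounter(1)  # localCtrBits = 1: Last-Time Predictor
two_bit = SaturatingCounter(2)    # localCtrBits = 2: Two-Bit Counter Predictor

last_time.update(True)  # 1-bit counter now predicts taken
two_bit.update(True)    # 2-bit counter still predicts not-taken (needs 2 takens)
```

The 2-bit variant's hysteresis is why it handles loop-exit branches better: one atypical outcome does not flip the prediction.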
be using for our analysis. LocalBP() Summary - One-Level BP - Captures local correlation. Tradeoff: No global correlation captured. BiModeBP() Summary - Two-Level BP - Captures global correlation. Tradeoff: No local correlation captured. TournamentBP() Summary - Uses 1 global, 1 local & a choice predictor. Tradeoff: Requires more resources (transistors).
the below 2 programs: 1. Sjeng: A chess engine supporting game modes like Crazyhouse, Suicide, Losers and Bughouse. 2. LBM: Lattice Boltzmann Method, a fluid-simulation algorithm.
into a few challenges • For local execution, gem5's latest version (v21.1.0.2) was not the same as the gem5 version used in the reference slides, so gem5 code changes were not straightforward. • Dependency issues w.r.t. running the older gem5 version (v20.0.0.3) on macOS • On the Mac M1, we faced challenges setting up a Linux VM. • The SCons build and benchmark simulations took a lot of time • The latest macOS update to Monterey caused dependency issues.
is how we tackled them: Local ◦ We modified the latest (v21.1.0.2) gem5 source code and managed to get our custom stats logged. ◦ The macOS Monterey update issue was solved by updating MacPorts (v2.7.1) and Xcode Cloud. ◦ To ensure correctness, we repeated the same exercise on the older gem5 version (v20.0.0.3) in a Linux AWS instance. This also gave us the freedom to run benchmarks in parallel by spinning up extra instances. ◦ We used a c4.4xlarge (16 cores) for faster builds using 9 threads. GitHub ◦ We used git and feature branches to maintain the different test scenarios. This let us easily push changes to the EC2 instances without editing in the CLI with Vim.
predictors: • LocalBP() • BiModeBP() • TournamentBP() We also varied the following configuration parameters for each branch predictor: • BTB Entries • Local Predictor Size • Global Predictor Size • Choice Predictor Size
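In a gem5 Python config, these parameters are set on the branch-predictor SimObject. A minimal sketch for TournamentBP is below; the parameter names follow gem5's BranchPredictor SimObject, `system.cpu` is assumed to exist from the surrounding config script, and the sizes shown are illustrative rather than our exact high/low values.

```python
# Fragment of an se.py-style gem5 config (requires a gem5 build to run).
from m5.objects import TournamentBP

bp = TournamentBP()
bp.BTBEntries = 4096           # number of Branch Target Buffer entries
bp.localPredictorSize = 2048   # entries in the local history table
bp.globalPredictorSize = 8192  # entries in the global history table
bp.choicePredictorSize = 8192  # entries in the chooser table

system.cpu.branchPred = bp     # attach to the CPU created earlier in the script
```

Swapping in `LocalBP()` or `BiModeBP()` works the same way, each with its own subset of these size parameters.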
code into our personal GitHub to easily push changes from our local editors to the EC2 instance. This avoided editing code with Vim in the terminal.
(gem5.opt). We will now use • gem5.opt (the simulation binary) and • se.py (meant for simulation using the system-call emulation mode) to run a benchmark on • tests/test-progs/hello/bin/x86/linux/hello (prints "Hello World")
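Assuming an x86 build of gem5 under build/X86/ (paths as in the gem5 source tree), the invocation looks like this:

```shell
# Run the hello-world test binary under syscall-emulation mode.
build/X86/gem5.opt configs/example/se.py \
    --cmd=tests/test-progs/hello/bin/x86/linux/hello
```

Simulation output (config.ini, stats.txt, …) lands in the m5out/ directory.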
m5out after execution. We will mainly focus on the below 2: • config.ini: Contains a list of every SimObject created for the simulation and the values of its parameters. • stats.txt: Contains all of the gem5 statistics registered for the simulation.
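Each stats.txt line has the shape `<name> <value> # <description>`, so the statistics we compare (e.g. branch misprediction rates) can be pulled out with a small helper. The parser below is our own sketch, not part of gem5, and the sample values are made up for illustration; the stat names follow gem5's branch-predictor output.

```python
def parse_stats(text):
    """Parse gem5 stats.txt lines of the form '<name> <value> # <desc>'."""
    stats = {}
    for line in text.splitlines():
        line = line.split('#')[0].strip()  # drop the trailing description
        parts = line.split()
        if len(parts) >= 2:
            try:
                stats[parts[0]] = float(parts[1])
            except ValueError:
                pass  # skip non-numeric entries (e.g. distributions)
    return stats


sample = """
system.cpu.branchPred.condPredicted  100923  # conditional branches predicted
system.cpu.branchPred.condIncorrect    8214  # conditional branches incorrect
"""
stats = parse_stats(sample)
mispred_pct = 100 * stats['system.cpu.branchPred.condIncorrect'] \
                  / stats['system.cpu.branchPred.condPredicted']
```

The same helper can compute BTBHitPct/BTBMissPct used in the comparisons later.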
is generated when we run it with configs/tutorial/Simple.py. This file can be imported into online visualizers to create flow diagrams; a sample image is attached below. Note: this might not be generated with se.py.
In High Config, LocalBP < TournamentBP < BiModeBP • In Low Config, LocalBP < TournamentBP < BiModeBP For BTBMissPct, we can see • In High Config, LocalBP < TournamentBP < BiModeBP • In Low Config, LocalBP < TournamentBP < BiModeBP
In High Config, BiModeBP < TournamentBP < LocalBP • In Low Config, TournamentBP < BiModeBP < LocalBP For BTBMissPct, we can see • In High Config, BiModeBP < TournamentBP < LocalBP • In Low Config, TournamentBP < BiModeBP < LocalBP
BiMode works better • In Low Config: Tournament works better We see that global correlation is more significant here. Thought: In chess, the next move is calculated by traversing the set of possible next moves. These moves are repeated global functions acting on different grid positions, hence the global correlation.
Tournament works better • In Low Config: Tournament works better Since this is lbm, we see that local correlation is more significant here. Thought: The algorithm traverses neighbouring lattice points within a grid/array structure. These array positions are traversed again in the same pattern, thereby increasing the efficiency of the BTB cache. Hence we see a higher BTBHitPct.
& BiMode increases slightly when we decrease BTBEntrySize & predictorSize. BTB misprediction % for Tournament decreases slightly when we decrease BTBEntrySize & predictorSize.
& BiMode increases slightly when we decrease BTBEntrySize & predictorSize. BTB misprediction % for Tournament decreases when we decrease BTBEntrySize & predictorSize.
The higher the BTB entry size and predictor size, the lower the BTB miss %. Thought: Increasing the BTB cache size helps reduce the BTB misprediction %. For Tournament, • lowering the BTB entry size and predictor size lowered the BTB miss %. Thought: This was mainly prominent in lbm (i.e. 7% in high & 0.36% in low). It could be that lbm's address space was easily accommodated in TournamentBP's BTB cache (in the lower config).
have many branches. - Eg: Hello World BiMode - Suited when you have more global correlation. - Eg: Sjeng Tournament - Overall best branch predictor for the generic use case. - Works better in scenarios with both global and local correlation. - Eg: lbm Comparing all three branch predictors, we find Tournament to be best suited for generic use cases.
program instructions and data that are used repeatedly during program execution, or information that the CPU is likely to need next. Importance: Fast access to instructions increases the overall speed of the program.
proximity to CPU, access time and size. https://www.hardwaretimes.com/difference-between-l1-l2-and-l3-cache-what-is-cpu-cache/ Cache access time L1 < L2 < L3
the below 2 programs: 1. Sjeng: A chess engine supporting game modes like Crazyhouse, Suicide, Losers and Bughouse. 2. LBM: Lattice Boltzmann Method, a fluid-simulation algorithm.
to reduce iterations, we consider only the boundary cases. • Associativity combinations: 1-way for low and 8-way for high. • L2 cache sizes: 256KB for low and 1MB for high.
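Taking only the low/high boundary values keeps the sweep small; the cross product of the two factors above can be enumerated directly (a sketch of how we generated the run matrix, values from the list above):

```python
from itertools import product

assoc_options = [1, 8]        # 1-way (low) vs 8-way (high)
l2_sizes = ['256kB', '1MB']   # low vs high L2 cache size

# Every boundary-case combination: 2 x 2 = 4 simulations per benchmark.
configs = list(product(assoc_options, l2_sizes))
```

Each (associativity, L2 size) pair then becomes one gem5 run.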
are the best configurations for each factor: • Associativity: L1 8-way & L2 8-way • Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB • Block Size: 64B 2. For the L2 cache • Size changes don't significantly reduce CPI • Associativity increases don't significantly reduce CPI i.e., the L2 cache is not significant for Sjeng (in chess, the next move is calculated by traversing the set of possible next moves; these moves are repeated global functions acting on different grid positions, so they don't add value in spatial locality).
are the best configurations for each factor: • Associativity: L1 8-way & L2 8-way • Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB • Block Size: 64B 2. For the L2 cache, increasing the cache size and associativity helped reduce CPI. (The algorithm traverses neighbouring lattice points within a grid/array structure, thereby increasing spatial correlation. So increasing the associativity and cache size of L2 also played a good role in reducing CPI.)
best configurations for each factor: • Associativity: L1 8-way & L2 8-way ◦ Working: Reduces conflicts, thereby reducing cache misses. ◦ Tradeoff: More transistors required for the extra tag comparisons. • Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB ◦ Working: More entries can be stored, i.e. a better hit rate. ◦ Tradeoff: Cost increases. • Block Size: 64B ◦ Working: Larger blocks take advantage of spatial locality and help reduce compulsory misses. ◦ Tradeoff: Increase in conflict misses due to fewer blocks; a cache miss evicts an entire block, thereby increasing eviction overhead.
Cache is more expensive than L2 • Associativity complexity increases circuit cost (number of transistors used for comparison) So, our cost function is Cost = (L1 size in KB × L1_cost) + (L2 size in KB × L2_cost) + (L1_assoc × assoc_cost) + (L2_assoc × assoc_cost) Unit costs are • L1_cost = $0.70 per KB • L2_cost = $0.05 per KB • assoc_cost = $0.02 per way
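The cost function above transcribes directly into a few lines of Python; the example evaluates it for the best configuration found earlier (L1-I 128KB + L1-D 128KB = 256KB of L1, a 1MB = 1024KB L2, both 8-way).

```python
# Unit costs from the cost model above (dollars).
L1_COST_PER_KB = 0.70
L2_COST_PER_KB = 0.05
ASSOC_COST_PER_WAY = 0.02


def cache_cost(l1_kb, l2_kb, l1_assoc, l2_assoc):
    """Cost = L1 size + L2 size + associativity terms, per the model above."""
    return (l1_kb * L1_COST_PER_KB
            + l2_kb * L2_COST_PER_KB
            + l1_assoc * ASSOC_COST_PER_WAY
            + l2_assoc * ASSOC_COST_PER_WAY)


# Best configuration: 256KB total L1, 1024KB L2, 8-way + 8-way.
cost = cache_cost(256, 1024, 8, 8)  # -> 230.72
```

This makes the dominant term obvious: at $0.70/KB the L1 capacity (here $179.20 of the $230.72) dwarfs both the L2 and the associativity costs.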