a table of counters. It is one-level, i.e., a local predictor. The counters can be 1-bit or 2-bit, depending on localCtrBits.
• If localCtrBits is set to 1, we have a Last-Time Predictor.
• If localCtrBits is set to 2, we have a Two-Bit Counter Predictor.
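The two cases above can be sketched with a single saturating counter (a sketch, not gem5 source; gem5 keeps a table of these indexed by branch address):

```python
# Sketch of one saturating counter as used in a local predictor table.
# ctr_bits = 1 behaves as a last-time predictor; ctr_bits = 2 gives the
# classic two-bit counter with hysteresis.
class SaturatingCounter:
    def __init__(self, ctr_bits):
        self.max = (1 << ctr_bits) - 1   # 1 for 1-bit, 3 for 2-bit
        self.ctr = self.max // 2 + 1     # start weakly taken

    def predict(self):
        # Predict taken while the counter sits in its upper half.
        return self.ctr > self.max // 2

    def update(self, taken):
        # Saturate at 0 and max instead of wrapping around.
        if taken:
            self.ctr = min(self.ctr + 1, self.max)
        else:
            self.ctr = max(self.ctr - 1, 0)
```

With 2 bits, a single not-taken outcome after a long taken streak does not flip the prediction; with 1 bit it does, which is exactly the last-time behaviour.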
be using for our analysis.
• LocalBP() Summary: One-level BP. Stores local-level correlation. Trade-off: no global correlation captured.
• BiModeBP() Summary: Two-level BP. Stores global-level correlation. Trade-off: no local correlation captured.
• TournamentBP() Summary: Uses 1 global, 1 local & a choice predictor. Trade-off: requires more resources (transistors).
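The idea behind the choice predictor in TournamentBP can be sketched as follows (a simplified model, not gem5's implementation; real hardware indexes separate tables by branch address and global history):

```python
# Sketch: one tournament entry. A choice counter learns whether the
# global or the local component predicts this branch better.
class TournamentEntry:
    def __init__(self):
        self.local_ctr = 2   # 2-bit counters, start weakly taken
        self.global_ctr = 2
        self.choice = 2      # >= 2 means "trust the global component"

    def predict(self):
        ctr = self.global_ctr if self.choice >= 2 else self.local_ctr
        return ctr >= 2

    def update(self, taken):
        local_ok = (self.local_ctr >= 2) == taken
        global_ok = (self.global_ctr >= 2) == taken
        # Move the choice counter toward whichever component was right.
        if global_ok and not local_ok:
            self.choice = min(self.choice + 1, 3)
        elif local_ok and not global_ok:
            self.choice = max(self.choice - 1, 0)
        # Train both component counters on the actual outcome.
        for name in ("local_ctr", "global_ctr"):
            v = getattr(self, name)
            setattr(self, name, min(v + 1, 3) if taken else max(v - 1, 0))
```

This is where the extra transistor cost comes from: three counter tables are read and updated per branch instead of one.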
into a few challenges:
• For local execution, gem5's latest version (v18.104.22.168) was not the same as the gem5 version used in the reference slides, so gem5 code changes were not easy.
• Dependency issues w.r.t. running an older gem5 version (v22.214.171.124) on macOS.
• On the Mac M1, we faced challenges setting up a Linux VM.
• The SCons build and benchmark simulation were taking a lot of time.
• The latest macOS update, Monterey, caused dependency issues.
is how we made it work.
Local
◦ We modified the latest (v126.96.36.199) gem5 source code and managed to get our custom stats logged.
◦ The macOS Monterey update issue was solved by updating MacPorts (v2.7.1) and Xcode Cloud.
◦ To ensure correctness, we repeated the same exercise on an older gem5 version (v188.8.131.52) in a Linux AWS instance. This also gave us the freedom to run benchmarks in parallel by spinning up extra instances.
◦ We used c4.4xlarge (16 cores) for faster builds using 9 threads.
GitHub
◦ We used git and feature branches to maintain different test scenarios. This enabled us to easily push changes to the EC2 instances without using Vim for editing in the CLI.
predictors:
• LocalBP()
• BiModeBP()
• TournamentBP()
We also modified the following configuration parameters for each branch predictor:
• BTB Entries
• Local Predictor Size
• Global Predictor Size
• Choice Predictor Size
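In a gem5 configuration script, these parameters map onto the branch predictor SimObject roughly as follows (a sketch: parameter names follow gem5's BranchPredictor.py, but the exact names and CPU wiring vary between gem5 versions, and the sizes shown are illustrative):

```python
# Fragment of a gem5 configuration script; this runs inside gem5 via its
# Python config interface, not as a standalone program.
from m5.objects import TournamentBP

bp = TournamentBP()
bp.BTBEntries = 4096             # BTB Entries
bp.localPredictorSize = 2048     # Local Predictor Size
bp.globalPredictorSize = 8192    # Global Predictor Size
bp.choicePredictorSize = 8192    # Choice Predictor Size

# Attach to the simulated CPU, e.g.:
# system.cpu.branchPred = bp
```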
(gem5.opt). We will now use:
• gem5.opt (the simulation binary) and
• se.py (meant for simulation using the system-call emulation mode)
to run a benchmark:
• tests/test-progs/hello/bin/x86/linux/hello (prints "Hello World")
m5out after execution. We will mainly be focusing on the below two:
• config.ini: contains a list of every SimObject created for the simulation and the values of its parameters.
• stats.txt: contains all of the gem5 statistics registered for the simulation.
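Since stats.txt is plain text with one `name value # description` entry per line, the statistics we compare later can be pulled out with a small script (a sketch; the stat names in the sample are illustrative and vary across gem5 versions):

```python
# Minimal parser for gem5's stats.txt format.
def parse_stats(text):
    stats = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing description
        parts = line.split()
        if len(parts) >= 2:
            try:
                stats[parts[0]] = float(parts[1])
            except ValueError:
                pass  # skip non-numeric entries
    return stats

# Illustrative sample in the stats.txt layout:
sample = """
system.cpu.branchPred.condPredicted  1200  # Conditional branches predicted
system.cpu.branchPred.condIncorrect    60  # Conditional branches incorrect
"""
s = parse_stats(sample)
mispred_pct = 100.0 * s["system.cpu.branchPred.condIncorrect"] \
                    / s["system.cpu.branchPred.condPredicted"]
print(round(mispred_pct, 2))  # 5.0
```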
is generated when we run it with configs/tutorial/Simple.py. This file can be imported into online visualizers to create flow diagrams. A sample image is attached below. Note: this might not be generated with se.py.
In High Config, LocalBP < TournamentBP < BiModeBP
• In Low Config, LocalBP < TournamentBP < BiModeBP
For BTBMissPct, we can see:
• In High Config, LocalBP < TournamentBP < BiModeBP
• In Low Config, LocalBP < TournamentBP < BiModeBP
In High Config, BiModeBP < TournamentBP < LocalBP
• In Low Config, TournamentBP < BiModeBP < LocalBP
For BTBMissPct, we can see:
• In High Config, BiModeBP < TournamentBP < LocalBP
• In Low Config, TournamentBP < BiModeBP < LocalBP
BiMode works better.
• In Low Config: Tournament works better.
We see that global correlation is more significant here. Thought: in chess, the next player's move is calculated by traversing through the possible set of next moves. These moves are repeated global functions acting on different grid positions, hence dictating global correlation.
Tournament works better.
• In Low Config: Tournament works better.
Since this is lbm, we see that local correlation is more significant here. Thought: the algorithm traverses through neighbouring lattice points within a grid/array structure. These array positions are traversed again in the same pattern, thereby increasing the efficiency of the BTB cache. Hence we see a higher BTBHitPct.
The higher the BTB entry size and predictor size, the lower the BTB miss percentage. Thought: increasing the BTB cache size helps in reducing the BTB misprediction %.
For Tournament:
• Lowering the BTB entry size and predictor size also lowered the BTB miss percentage. Thought: this was mainly prominent in lbm (i.e., 7% in high & 0.36% in low). It could be that lbm's address space was easily accommodated in TournamentBP's BTB cache (in the lower config).
have many branches.
- Eg: Hello World
BiMode
- Suited when you have more global correlation.
- Eg: Sjeng
Tournament
- Overall the best branch predictor for generic use cases.
- Works better in scenarios where you have both global and local correlation.
- Eg: lbm
Comparing all three branch predictors, we find Tournament to be best suited for generic use cases.
program instructions and data that are used repeatedly in the operation of programs, or information that the CPU is likely to need next. Importance: fast access to these instructions increases the overall speed of the program.
are the best configurations under each factor:
• Associativity: L1 8-way & L2 8-way
• Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB
• Block Size: 64B block size
2. For the L2 cache:
• A size change doesn't significantly reduce the CPI.
• An associativity increase doesn't significantly reduce the CPI.
i.e., the L2 cache is not significant in Sjeng. (In chess, the next player's move is calculated by traversing through the possible set of next moves. These moves are repeated global functions acting on different grid positions, hence they don't add value in spatial locality.)
are the best configurations under each factor:
• Associativity: L1 8-way & L2 8-way
• Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB
• Block Size: 64B block size
2. For the L2 cache, increasing the cache size and associativity helped in reducing the CPI. (The algorithm traverses through neighbouring lattice points within a grid/array structure, thereby increasing spatial correlation. So increasing associativity and cache size for L2 also played a good role in reducing CPI.)
best configurations for each factor:
• Associativity: L1 8-way & L2 8-way
◦ Working: reduces conflicts, thereby reducing cache misses.
◦ Trade-off: more transistors required for the wider tag comparison.
• Cache Size: L1-I 128KB, L1-D 128KB & L2 1MB
◦ Working: more entries can be stored, i.e., a better hit rate.
◦ Trade-off: cost increases.
• Block Size: 64B block size
◦ Working: a multi-word block takes advantage of spatial locality. It also helps in reducing compulsory misses.
◦ Trade-off: increased conflict misses due to the reduced number of blocks. A cache miss leads to flushing out the entire block's data, thereby increasing eviction overhead.
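The block-size trade-off above follows directly from cache geometry arithmetic, sketched here with the L1 figures quoted above:

```python
# Sketch: set-associative cache geometry. For a fixed cache size, larger
# blocks mean fewer blocks overall, hence fewer sets, hence more addresses
# competing per set (more potential conflict misses).
def cache_geometry(size_bytes, assoc, block_bytes):
    blocks = size_bytes // block_bytes
    sets = blocks // assoc
    return blocks, sets

# 128 KB L1, 8-way, 64-byte blocks:
print(cache_geometry(128 * 1024, 8, 64))   # (2048, 256)

# Same cache with 4x larger blocks: 4x fewer blocks and sets.
print(cache_geometry(128 * 1024, 8, 256))  # (512, 64)
```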
cache is more expensive than L2.
• Associativity complexity increases circuit cost (no. of transistors used for comparison).
So, our cost function is:
Cost = (L1 size in KB x L1_cost) + (L2 size in KB x L2_cost) + (L1_assoc x assoc_cost) + (L2_assoc x assoc_cost)
Unit costs are:
• L1_cost = $0.7 per KB
• L2_cost = $0.05 per KB
• assoc_cost = $0.02 per way
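The cost function above can be written out directly (a sketch; whether L1-I and L1-D sizes are summed into one L1 term is our assumption here):

```python
# Cost model from the report, with the quoted unit costs.
L1_COST = 0.7      # $ per KB of L1
L2_COST = 0.05     # $ per KB of L2
ASSOC_COST = 0.02  # $ per way of associativity

def config_cost(l1_kb, l2_kb, l1_assoc, l2_assoc):
    return (l1_kb * L1_COST + l2_kb * L2_COST
            + l1_assoc * ASSOC_COST + l2_assoc * ASSOC_COST)

# Example: 128 KB L1-I + 128 KB L1-D (256 KB L1 total), 1 MB L2, both 8-way.
print(round(config_cost(256, 1024, 8, 8), 2))  # 230.72
```

This makes the L1-versus-L2 trade-off explicit: at $0.7/KB versus $0.05/KB, each KB of L1 costs as much as 14 KB of L2.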