The Design of a Two-way In-order Superscalar Processor

Masayuki Usui
The University of Tokyo

Introduction
• I designed a CPU for a class
• The ISA is based on RISC-V
• The microarchitecture was inspired by the ARM Cortex-A series
• The CPU includes the following features:
  • Two-way in-order superscalar architecture
  • GShare branch predictor
  • Early branch resolution

Background: ARM Cortex-A
The pipeline structure of my CPU is based on that of the ARM Cortex-A processors, so their pipelines are shown here first.

ARM Cortex-A7 (32-bit, 2-way in-order)
https://community.arm.com/support-forums/f/architectures-and-processors-forum/5277/cortex-a7-pipeline-is-non-symmetric-what-does-this-attribute-mean

ARM Cortex-A35 (64-bit, 2-way in-order)
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/introducing-cortex-a35-arm-s-most-efficient-application-processor

ARM Cortex-A53/55 (64-bit, 2-way in-order)
https://pc.watch.impress.co.jp/docs/column/kaigai/1062305.html

Microarchitecture I Designed
This part explains the microarchitecture of my CPU.

Pipeline Structure
[Figure: pipeline diagram showing Fetch, the Instruction Memory (SRAM), the Instruction Queue, Decode, the execution pipes AL, AL, BR, MA1–MA3, FPn, and FP1–FP3, and WB]
• BR = Branch
• AL = Arithmetic Logic
• MA = Memory Access
• FP = Floating Point
  • FP1, FP2, and FP3: Pipelined Floating-Point (PFP) unit
  • FPn: Multicycle Floating-Point (MFP) unit

Microarchitecture (Core)
[Figure: Instruction Fetch (IF) reads from the Instruction Memory (SRAM) and enqueues instructions into the Instruction Queue; Instruction Decode (ID) dequeues them into two decoders, producing Inst A and Inst B; Execute (EX) contains the units listed below]
• AL: Arithmetic Logic (two units)
• BR: Branch
• MA: Memory Access
• MFP: Multicycle Floating-Point
• PFP: Pipelined Floating-Point
1. Distribute each instruction to the appropriate unit (AL–PFP)
2. Consider the various hazards and determine the number of dispatchable instructions (0–2)
3. Flush the distributed instructions if they turn out not to be dispatchable
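Step 2 above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the actual dispatch logic of the CPU: the `Inst` record, the `busy_regs` and `busy_units` sets, and the exact hazard rules (operand readiness, a structural conflict on a single-instance unit, and a RAW or WAW dependence of Inst B on Inst A) are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

# Number of instances of each unit, following the figure (two AL pipes).
UNIT_COUNT = {"AL": 2, "BR": 1, "MA": 1, "MFP": 1, "PFP": 1}


@dataclass(frozen=True)
class Inst:
    """Hypothetical decoded instruction."""
    unit: str                    # one of the UNIT_COUNT keys
    dest: Optional[int] = None   # destination register, if any
    srcs: Tuple[int, ...] = ()   # source registers


def dispatch_count(a: Optional[Inst], b: Optional[Inst],
                   busy_regs: Set[int], busy_units: Set[str]) -> int:
    """Return how many of (a, b) may be dispatched this cycle: 0, 1, or 2."""

    def ready(inst: Inst) -> bool:
        # Assumed rule: the unit must be free and no register the
        # instruction touches may still be in flight.
        regs = set(inst.srcs) | ({inst.dest} if inst.dest is not None else set())
        return inst.unit not in busy_units and not (regs & busy_regs)

    # In-order dispatch: B can never go ahead of A.
    if a is None or not ready(a):
        return 0
    if b is None or not ready(b):
        return 1
    # Structural hazard: both need a unit that exists only once.
    if a.unit == b.unit and UNIT_COUNT[a.unit] < 2:
        return 1
    # RAW or WAW dependence of B on A inside the same dispatch group.
    if a.dest is not None and (a.dest in b.srcs or a.dest == b.dest):
        return 1
    return 2
```

For example, two independent AL instructions give 2, while a pair in which Inst B reads the destination of Inst A gives 1.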

Microarchitecture (Execution Units)
[Figure: three datapaths with the same shape, in which registered Operand A and Operand B feed a functional block whose Result is registered. The block is an ALU for AL, a Comparator for BR, and an FPU for the floating-point units. The MFP unit additionally has an FSM with the states IDLE, BUSY, and COMPLETED that waits until the computation is completed.]
• AL: Arithmetic Logic
• BR: Branch
• MFP: Multicycle Floating-Point
• PFP: Pipelined Floating-Point
• PFP covers the floating-point operations whose latency is 3 cycles
• MFP covers the other floating-point operations
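The waiting FSM of the MFP unit can be modelled in a few lines. The figure only names the three states, so the transitions used here (IDLE to BUSY on a start request, BUSY to COMPLETED when a cycle counter expires, COMPLETED back to IDLE one cycle later) are assumptions, as are the class name `MulticycleUnit` and the `latency` parameter.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    BUSY = 1
    COMPLETED = 2


class MulticycleUnit:
    """Hypothetical model of an FSM that waits for a multicycle operation."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.remaining = 0

    def step(self, start: bool = False, latency: int = 0) -> State:
        """Advance one clock cycle and return the state after the clock edge."""
        if self.state is State.IDLE:
            if start:
                self.remaining = latency
                self.state = State.BUSY
        elif self.state is State.BUSY:
            self.remaining -= 1
            if self.remaining <= 0:
                self.state = State.COMPLETED
        else:
            # COMPLETED: assume the result is handed over in this cycle.
            self.state = State.IDLE
        return self.state
```

Starting an operation with `latency=3` and stepping four more times visits BUSY, BUSY, BUSY, COMPLETED, and IDLE in this model.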

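Microarchitecture (Execution Units)
MA: Memory Access
[Figure: a three-stage memory pipeline (MA1, MA2, MA3). The registered address indexes the Dirty SRAM, Tag SRAM, and Data SRAM; the stored tag is compared (=?) with the address to decide hit or miss; the data and dirty bit are registered on the way out. An FSM with the states IDLE, SEND READ, RECV READ, and WRITE BACK handles a cache miss.]
• The latency of the SRAMs is 1 cycle
• Memory accesses are fully pipelined if no cache miss occurs
• The memory access unit stalls if a cache miss occurs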
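The hit check and the miss handling can be sketched as a behavioural model. Several things here are assumptions rather than facts from the slides: the cache is modelled as direct-mapped with one word per line, a `valid` array is added so that the model starts empty, the dictionary `memory` stands in for DRAM, and a dirty victim is written back before the refill (the figure names the FSM states but not their order).

```python
class DirectMappedCache:
    """Hypothetical write-back cache model with tag, dirty, and data arrays."""

    def __init__(self, index_bits: int, memory: dict) -> None:
        self.index_bits = index_bits
        n = 1 << index_bits
        self.valid = [False] * n   # assumption: not shown in the figure
        self.dirty = [False] * n
        self.tag = [0] * n
        self.data = [0] * n
        self.memory = memory       # word address -> value

    def access(self, addr: int, write_value=None):
        """Return (value, fsm_trace); fsm_trace is empty on a hit."""
        idx = addr & ((1 << self.index_bits) - 1)
        tag = addr >> self.index_bits
        trace = []
        hit = self.valid[idx] and self.tag[idx] == tag   # the "=?" comparison
        if not hit:
            if self.valid[idx] and self.dirty[idx]:
                # Evict the dirty line before it is overwritten.
                trace.append("WRITE BACK")
                victim_addr = (self.tag[idx] << self.index_bits) | idx
                self.memory[victim_addr] = self.data[idx]
            trace += ["SEND READ", "RECV READ"]
            self.data[idx] = self.memory.get(addr, 0)
            self.tag[idx], self.valid[idx], self.dirty[idx] = tag, True, False
        if write_value is not None:
            self.data[idx] = write_value
            self.dirty[idx] = True
        return self.data[idx], trace
```

In this model a hit returns an empty trace, which corresponds to the fully pipelined case, while a non-empty trace corresponds to the cycles during which the unit would stall.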

Branch Prediction
[Figure: pipeline diagram (IF, ID, AL, AL, BR, MA1–MA3, FPn, FP1–FP3, WB)]
• A conditional branch is taken or not taken according to the result of the branch predictor
• If the prediction turns out to be wrong, the taken branch is cancelled

2-bit Saturating Counter
[Figure: state diagram with four states; Taken and Not Taken edges move between adjacent states]
• 00: Strongly Not Taken (Predict Not Taken)
• 01: Weakly Not Taken (Predict Not Taken)
• 10: Weakly Taken (Predict Taken)
• 11: Strongly Taken (Predict Taken)
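The counter behaves as follows in software terms: a taken outcome moves the state one step toward 11, a not-taken outcome moves it one step toward 00, and the upper bit gives the prediction. The function names below are illustrative only.

```python
def update_counter(counter: int, taken: bool) -> int:
    """Move a 2-bit saturating counter one step toward 3 (taken) or 0 (not taken)."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


def predict_taken(counter: int) -> bool:
    """States 10 and 11 predict taken; states 00 and 01 predict not taken."""
    return counter >= 2
```

Because of the two "weak" states, a single misprediction in a strongly biased branch does not flip the prediction.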

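GShare Branch Predictor
[Figure: n bits of the Program Counter (PC) are XORed with the n-bit Global History Register (GHR) to form the index into the Pattern History Table (PHT), which returns a 2-bit prediction. Labels in the figure: MSB, LSB, 1000, 1100 (Address), 1011 (History), 10 (2-bit prediction).]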
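A minimal GShare model, as a sketch only. The slides do not say which PC bits are used, how the PHT is initialised, or how the GHR is updated, so the choices below are assumptions: the two low PC bits are dropped, every PHT entry starts at 01, and each outcome is shifted into the low end of the GHR.

```python
class GShare:
    """Hypothetical GShare model: PHT index = PC bits XOR global history."""

    def __init__(self, n_bits: int) -> None:
        self.mask = (1 << n_bits) - 1
        self.ghr = 0                       # Global History Register
        self.pht = [1] * (1 << n_bits)     # 2-bit counters, assumed to start at 01

    def index(self, pc: int) -> int:
        # Assumption: drop the two low PC bits (4-byte instructions).
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc: int) -> bool:
        return self.pht[self.index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        # 2-bit saturating counter update.
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)
        # Shift the outcome into the history.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

With `n_bits=4`, address bits 1100 and history 1011 select PHT entry 0111 in this model. Mixing the history into the index lets the same branch use different counters depending on the recent outcomes that led up to it.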

Early Branch Resolution
[Figure: pipeline diagram (IF, ID, AL, AL, BR, MA1–MA3, FPn, FP1–FP3, WB)]
• If the instruction is jal, the branch is taken at the ID stage
• Early branch resolution is applied to jal, but not to jalr
• The jal target is the PC plus an immediate encoded in the instruction, so it is known at ID; the jalr target requires reading a register
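The reason jal can be resolved at ID is visible in how its target is computed: everything needed is in the instruction word and the PC. The sketch below decodes the RISC-V J-type immediate; the function name and the 32-bit PC wrap-around are assumptions made for illustration.

```python
JAL_OPCODE = 0b1101111


def jal_target(pc: int, inst: int):
    """Return the jal target address, or None if inst is not a jal."""
    if inst & 0x7F != JAL_OPCODE:
        return None
    # J-type immediate layout: imm[20|10:1|11|19:12] in inst[31:12].
    imm = (((inst >> 31) & 0x1) << 20) \
        | (((inst >> 12) & 0xFF) << 12) \
        | (((inst >> 20) & 0x1) << 11) \
        | (((inst >> 21) & 0x3FF) << 1)
    if imm & (1 << 20):          # sign-extend the 21-bit immediate
        imm -= 1 << 21
    return (pc + imm) & 0xFFFFFFFF   # assumption: 32-bit PC
```

A jalr (opcode 1100111) makes this function return None, since its target depends on a register value that is not part of the instruction word.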

Implementation Details
This part explains the implementation details of my CPU, most of which are specific to FPGAs.

Using DRAM with Xilinx FPGAs
[Figure: inside the FPGA logic, the Core (designed by me) is connected to the MIG through an asynchronous FIFO for request signals and an asynchronous FIFO for response signals; the MIG is connected to the off-chip DDR2 SDRAM]
• MIG: Memory Interface Generator (provided by Xilinx)
• NOTE: Asynchronous FIFOs are used for the clock domain crossing (CDC)

Multi-Ported Memories for FPGAs
• A 2-way superscalar processor needs twice as many register-file ports as a scalar one:
  • 2 write ports
  • 4 read ports
• However, FPGAs only provide memory blocks with one write port and one read port
• Therefore, we combine multiple 1W/1R memories to create a 2W/4R memory

Multi-Ported Memories for FPGAs (cont’d)
• LaForest et al. introduced a structure called the Live Value Table (LVT) to implement multi-ported memories on FPGAs [1]
• LaForest et al. also used the XOR operation to construct multi-ported memories [2]
• I implemented both designs and compared their performance on an FPGA
• The results showed that the LVT-based design outperforms the XOR-based design
• Hence, I adopted an LVT-based multi-ported memory in my design
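The two approaches can be contrasted with small behavioural models. These are sketches of the general ideas in [1] and [2], not the RTL used in this CPU; the class names and parameters are hypothetical. In the LVT design each write port owns a bank of 1W/1R memories (one replica per read port), and a small table records which write port last wrote each address. In the XOR design each write stores the value XORed with the other banks' contents, so the XOR of all banks recovers the latest value.

```python
class LvtMemory:
    """Hypothetical model of an LVT-based multi-ported memory."""

    def __init__(self, depth: int, n_write: int = 2, n_read: int = 4) -> None:
        # banks[w][r] is a 1W/1R memory: written by port w, read by port r.
        self.banks = [[[0] * depth for _ in range(n_read)] for _ in range(n_write)]
        self.lvt = [0] * depth   # Live Value Table: last writer per address

    def write(self, port: int, addr: int, value: int) -> None:
        for replica in self.banks[port]:
            replica[addr] = value
        self.lvt[addr] = port

    def read(self, port: int, addr: int) -> int:
        # The LVT selects the bank that holds the live value.
        return self.banks[self.lvt[addr]][port][addr]


class XorMemory:
    """Hypothetical model of an XOR-based memory (read replicas omitted)."""

    def __init__(self, depth: int, n_write: int = 2) -> None:
        self.banks = [[0] * depth for _ in range(n_write)]

    def write(self, port: int, addr: int, value: int) -> None:
        others = 0
        for w, bank in enumerate(self.banks):
            if w != port:
                others ^= bank[addr]
        self.banks[port][addr] = value ^ others

    def read(self, addr: int) -> int:
        result = 0
        for bank in self.banks:
            result ^= bank[addr]
        return result
```

With the default parameters, `LvtMemory` uses 2 × 4 = 8 single-port-pair memories, which matches the 2W/4R register file described earlier.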

References
[1] Charles Eric LaForest and J. Gregory Steffan. 2010. Efficient multi-ported memories for FPGAs. In Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays (FPGA '10). Association for Computing Machinery, New York, NY, USA, 41–50.
[2] Charles Eric LaForest, Ming G. Liu, Emma Rae Rapati, and J. Gregory Steffan. 2012. Multi-ported memories for FPGAs via XOR. In Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays (FPGA '12). Association for Computing Machinery, New York, NY, USA, 209–218.
