Masayuki Usui
November 09, 2023

The Design of a Two-way In-order Superscalar Processor

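In the CPU experiment, a well-known course of the Department of Information Science at the University of Tokyo, students design a processor and run it on an actual FPGA. For that course, I designed a dual-issue static superscalar processor. These slides describe the design in detail.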

These slides explain the two-way in-order superscalar processor that I designed in the class.

Transcript

  1. Introduction

    • I designed a CPU in the class
    • The ISA is based on RISC-V
    • The microarchitecture was inspired by the ARM Cortex-A series
    • The CPU includes the following features:
      • Two-way in-order superscalar architecture
      • GShare branch predictor
      • Early branch resolution
  2. Background: ARM Cortex-A

    The pipeline structure of my CPU is based on that of ARM Cortex-A processors, so that structure is explained here first.
  3. Pipeline Structure

    [Diagram: Instruction Memory (SRAM), Fetch, Instruction Queue, Decode, execution units (AL, AL, BR, MA1–MA3, FP1–FP3, FPn), WB]
    • BR = BRanch
    • AL = Arithmetic Logic
    • MA = Memory Access
    • FP = Floating Point
    • FP1, FP2, and FP3: Pipelined Floating-Point (PFP) unit
    • FPn: Multicycle Floating-Point (MFP) unit
  4. Microarchitecture (Core)

    [Diagram: Instruction Memory (SRAM), Instruction Fetch (IF), Instruction Queue (enqueue and dequeue), Instruction Decode (ID) with two decoders handling Inst A and Inst B, Execute (EX) with the units AL (Arithmetic Logic) x2, BR (BRanch), MA (Memory Access), MFP (Multicycle Floating-Point), PFP (Pipelined Floating-Point)]
    1. Distribute each instruction to the appropriate unit (AL–PFP)
    2. Consider the various hazards and determine the number of dispatchable instructions (0–2)
    3. Flush the distributed instructions if they turn out not to be dispatchable
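Step 2 above (deciding whether 0, 1, or 2 instructions can be dispatched) can be sketched in software. This is only an illustrative model, not the slide author's logic: the `Inst` record, the `busy_regs` scoreboard, the `free_units` table, and the specific hazard rules (no forwarding, one BR and one MA unit) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Inst:
    """Hypothetical decoded instruction: target unit, destination, sources."""
    unit: str                      # "AL", "BR", "MA", "MFP", or "PFP"
    rd: Optional[int]              # destination register, or None
    srcs: Tuple[int, ...] = ()     # source registers


def can_issue(inst: Inst, busy_regs: Set[int], free_units: Dict[str, int]) -> bool:
    """True if the unit is free and no register the instruction touches is pending."""
    if free_units.get(inst.unit, 0) == 0:          # structural hazard
        return False
    if any(s in busy_regs for s in inst.srcs):     # read-after-write hazard
        return False
    if inst.rd is not None and inst.rd in busy_regs:  # write-after-write hazard
        return False
    return True


def count_dispatchable(a: Optional[Inst], b: Optional[Inst],
                       busy_regs: Set[int], free_units: Dict[str, int]) -> int:
    """Return how many of (A, B) may be dispatched this cycle, in program order."""
    if a is None or not can_issue(a, busy_regs, free_units):
        return 0                                   # in-order: B cannot pass A
    if b is None:
        return 1
    # Account for the resources A consumes before checking B.
    units_after_a = dict(free_units)
    units_after_a[a.unit] -= 1
    busy_after_a = set(busy_regs)
    if a.rd is not None:
        busy_after_a.add(a.rd)                     # assumes no same-cycle forwarding
    return 2 if can_issue(b, busy_after_a, units_after_a) else 1
```

For example, two independent AL instructions yield 2, a pair where B reads A's destination yields 1, and a pair whose first instruction waits on a pending register yields 0.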
  5. Microarchitecture (Execution Units)

    [Diagram: each unit takes Operand A and Operand B from registers and writes its Result to a register. AL (Arithmetic Logic) contains an ALU, BR (BRanch) contains a comparator, and MFP (Multicycle Floating-Point) and PFP (Pipelined Floating-Point) contain an FPU. An FSM with the states IDLE, BUSY, and COMPLETED waits until the computation is completed.]
    • PFP covers floating-point operations whose latency is 3
    • MFP covers the other floating-point operations
  6. Microarchitecture (Execution Units)

    MA: Memory Access
    [Diagram: stages MA1, MA2, MA3. The address is used to read the Tag SRAM, Data SRAM, and Dirty SRAM; a comparator (=?) decides hit or miss. The latency of the SRAMs is 1. An FSM with the states IDLE, SEND READ, RECV READ, and WRITE BACK handles a cache miss.]
    • Memory accesses are fully pipelined if no cache miss occurs
    • The memory access unit stalls if a cache miss occurs
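The tag comparison and dirty-bit handling can be illustrated with a small software model. This is a sketch under assumptions the slide does not state: a direct-mapped organization, one word per line, and a dictionary standing in for DRAM. The class and method names are hypothetical, and the pipelining and the miss FSM are not modeled.

```python
class DirectMappedCache:
    """Toy write-back cache with separate tag, data, and dirty arrays."""

    def __init__(self, num_lines, backing):
        self.num_lines = num_lines
        self.tag = [None] * num_lines     # stands in for the Tag SRAM
        self.data = [0] * num_lines       # stands in for the Data SRAM
        self.dirty = [False] * num_lines  # stands in for the Dirty SRAM
        self.backing = backing            # dict: address -> value (models DRAM)
        self.misses = 0

    def _ensure_present(self, addr):
        """Return the line index for addr, refilling the line on a miss."""
        index, tag = addr % self.num_lines, addr // self.num_lines
        if self.tag[index] == tag:        # the "=?" comparison: hit
            return index
        self.misses += 1
        if self.tag[index] is not None and self.dirty[index]:
            # Write the modified victim line back before replacing it.
            victim_addr = self.tag[index] * self.num_lines + index
            self.backing[victim_addr] = self.data[index]
        self.tag[index] = tag
        self.data[index] = self.backing.get(addr, 0)
        self.dirty[index] = False
        return index

    def load(self, addr):
        return self.data[self._ensure_present(addr)]

    def store(self, addr, value):
        index = self._ensure_present(addr)
        self.data[index] = value
        self.dirty[index] = True          # memory is updated only on eviction
```

In this model a store only marks the line dirty; the backing memory sees the new value when a conflicting address evicts the line.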
  7. Branch Prediction

    [Diagram: pipeline with IF, ID, execution units (AL, AL, BR, MA1–MA3, FP1–FP3, FPn), WB]
    • A (conditional) branch is taken according to the result of the branch predictor
    • The taken branch is cancelled if the prediction turns out to be wrong
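  8. 2-bit Saturating Counter

    [State diagram: a Taken outcome moves the counter one state toward 11, a Not Taken outcome moves it one state toward 00, and the counter saturates at both ends]
    • 00 Strongly Not Taken: Predict Not Taken
    • 01 Weakly Not Taken: Predict Not Taken
    • 10 Weakly Taken: Predict Taken
    • 11 Strongly Taken: Predict Taken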
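The state diagram above maps directly onto a few lines of code. The following is a minimal sketch; the function names are chosen for illustration and do not come from the slides.

```python
STRONGLY_NOT_TAKEN, WEAKLY_NOT_TAKEN, WEAKLY_TAKEN, STRONGLY_TAKEN = 0b00, 0b01, 0b10, 0b11


def predict_taken(state: int) -> bool:
    """States 10 and 11 predict taken; states 00 and 01 predict not taken."""
    return state >= WEAKLY_TAKEN


def update_counter(state: int, taken: bool) -> int:
    """Move one step toward 11 on taken, toward 00 on not taken, saturating."""
    if taken:
        return min(state + 1, STRONGLY_TAKEN)
    return max(state - 1, STRONGLY_NOT_TAKEN)
```

Because of the saturation and the two states per direction, a single mispredicted outcome from a strong state does not flip the prediction; two consecutive outcomes in the other direction are needed.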
  9. GShare Branch Predictor

    [Diagram: n bits of the Program Counter (PC; example address 1000 1100) are XORed with the n-bit Global History Register (GHR; example history 1011) to form the index into the Pattern History Table (PHT), which yields a 2-bit prediction (example: 10)]
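A GShare predictor of this shape can be modeled in a few lines. This is a sketch with assumptions that the slide does not spell out: the index uses the n least significant PC bits as-is (a real design may first drop the alignment bits), counters start at 01, and the history is updated with the actual outcome. The class name and its methods are hypothetical.

```python
class GSharePredictor:
    """Toy GShare: PHT of 2-bit counters indexed by (PC bits XOR global history)."""

    def __init__(self, n_bits: int):
        self.mask = (1 << n_bits) - 1
        self.ghr = 0                          # Global History Register, n bits
        self.pht = [0b01] * (1 << n_bits)     # assumed initial state: weakly not taken

    def index(self, pc: int) -> int:
        # Assumption: take the low n bits of the PC directly.
        return (pc & self.mask) ^ (self.ghr & self.mask)

    def predict(self, pc: int) -> bool:
        return self.pht[self.index(pc)] >= 0b10

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        if taken:
            self.pht[i] = min(self.pht[i] + 1, 0b11)
        else:
            self.pht[i] = max(self.pht[i] - 1, 0b00)
        # Shift the outcome into the history, keeping only n bits.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

With n = 4, the address 1000 1100 and the history 1011 from the slide give the index 1100 XOR 1011 = 0111 under the low-bits assumption.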
  10. Early Branch Resolution

    [Diagram: pipeline with IF, ID, execution units (AL, AL, BR, MA1–MA3, FP1–FP3, FPn), WB]
    • A branch is taken at the ID stage if the instruction is jal
    • Early branch resolution is applied to jal, but not to jalr
    • jalr requires reading a register to compute its target, but jal does not
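The reason jal can be resolved at ID is that its target depends only on the PC and the immediate encoded in the instruction word. The sketch below decodes the standard RISC-V J-type immediate; the 32-bit address width and the function names are assumptions for illustration.

```python
OPCODE_JAL = 0b1101111
OPCODE_JALR = 0b1100111


def jal_offset(inst: int) -> int:
    """Decode the sign-extended J-type immediate: imm[20|10:1|11|19:12]."""
    imm = (((inst >> 31) & 0x1) << 20) \
        | (((inst >> 12) & 0xFF) << 12) \
        | (((inst >> 20) & 0x1) << 11) \
        | (((inst >> 21) & 0x3FF) << 1)
    if imm & (1 << 20):          # sign bit set: convert to a negative offset
        imm -= 1 << 21
    return imm


def early_target(pc: int, inst: int):
    """Return the jump target if it is known at decode time, else None."""
    if (inst & 0x7F) == OPCODE_JAL:
        return (pc + jal_offset(inst)) & 0xFFFFFFFF   # assumes 32-bit addresses
    # jalr computes rs1 + imm, so its target needs a register read.
    return None
```

For instance, the word 0x0080006F (jal x0, 8) at PC 0x100 resolves to 0x108 without touching the register file, while 0x00008067 (jalr x0, 0(x1)) returns None.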
  11. Implementation Details

    This part explains the implementation details of my CPU, most of which are specific to FPGAs.
  12. Using DRAM with Xilinx FPGAs

    [Diagram: inside the FPGA logic, the Core (designed by me) connects to the MIG (Memory Interface Generator, provided by Xilinx) through an asynchronous FIFO for request signals and an asynchronous FIFO for response signals; the MIG connects to the off-chip DDR2 SDRAM]
    NOTE: Asynchronous FIFOs are used for the clock domain crossing (CDC).
  13. Multi-Ported Memories for FPGAs

    • A 2-way superscalar processor demands 2x ports for a register file:
      • 2 write ports
      • 4 read ports
    • However, FPGAs only provide memory blocks with one write port and one read port
    • Therefore, we combine multiple 1W/1R memories to create a 2W/4R memory
  14. Multi-Ported Memories for FPGAs (cont’d)

    • LaForest et al. introduced a structure called the Live Value Table (LVT) to implement multi-ported memories on FPGAs [1]
    • LaForest et al. also exploited the XOR operation to construct multi-ported memories [2]
    • I implemented both designs and compared their performance on an FPGA
    • The results showed that the LVT-based design outperforms the XOR-based design
    • Hence, I adopted an LVT-based multi-ported memory in my design
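The LVT idea can be sketched behaviorally: each write port owns a set of 1W/1R banks (one replica per read port), and a small table records which write port last wrote each address so that a read knows which bank holds the live value. The model below follows the general scheme described in [1]; the class name and interface are hypothetical, and timing and same-cycle write conflicts are not modeled.

```python
class LvtMemory:
    """Behavioral model of a multi-ported memory built from 1W/1R banks plus an LVT."""

    def __init__(self, depth: int, num_write: int = 2, num_read: int = 4):
        # banks[w][r] is a 1W/1R memory: written only by write port w,
        # read only by read port r.
        self.banks = [[[0] * depth for _ in range(num_read)]
                      for _ in range(num_write)]
        # lvt[addr] names the write port that most recently wrote addr.
        self.lvt = [0] * depth

    def write(self, port: int, addr: int, value: int) -> None:
        # A write port updates every replica it owns, then claims the address.
        for replica in self.banks[port]:
            replica[addr] = value
        self.lvt[addr] = port

    def read(self, port: int, addr: int) -> int:
        # The LVT selects which write port's bank holds the live value.
        return self.banks[self.lvt[addr]][port][addr]
```

A 2W/4R register file in this scheme uses 2 x 4 = 8 banks. The LVT itself still needs multiple ports, but each entry is only wide enough to name a write port (a single bit for two write ports).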
  15. References

    [1] Charles Eric LaForest and J. Gregory Steffan. 2010. Efficient multi-ported memories for FPGAs. In Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays (FPGA '10). Association for Computing Machinery, New York, NY, USA, 41–50.
    [2] Charles Eric LaForest, Ming G. Liu, Emma Rae Rapati, and J. Gregory Steffan. 2012. Multi-ported memories for FPGAs via XOR. In Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays (FPGA '12). Association for Computing Machinery, New York, NY, USA, 209–218.