Slide 1

Intro for Computer Architecture
Tarek ElDeeb, SilMinds, LLC
May 27, 2012

Slide 2

Outline: Definitions, Instruction Set Architectures, Performance, Pipelines, Parallelism, Summary

What is Computer Architecture?
According to the 1913 Webster, architecture is "the art or science of building; ... or construction, in a more general sense."
Recent dictionaries add: "4: (computer science) the structure and organization of a computer's hardware or system software" [syn: computer architecture].


Slide 5

Where does computer architecture fit?
A layered view, from top to bottom:
  Application Software
  OS, Compilers, Network ...
  COMPUTER ARCHITECTURE (our interest)
  Digital Design
  Circuits, Devices ...


Slide 11

Computer architecture: Structure
Within the processor: registers, operational units (integer, floating point, special purpose, ...).
Outside the processor: memory, I/O, ...
Examples: Sun SPARC, MIPS, Intel x86 (IA32), IBM S/390.
Defines: data (types, endianness, storage, and addressing modes), the instruction (operation code) set, and instruction formats.


Slide 15

Computer architecture: Organization
Within the processor: pipeline(s), control unit, instruction cache, data cache, branch prediction, ...
Outside the processor: secondary caches, memory interleaving, redundant disk arrays, multi-processors, ...
From a programmer's point of view, should I know about the organization?
Which implementation is better? How do you define 'better'?


Slide 19

Instruction Set Architecture
Common instructions: arithmetic and logic, data transfer, control.
Optional instructions: system, floating-point, graphics.
Some control instructions: un/conditional branches, function calls, and returns.


Slide 31

Comparing ISAs

Archi     Bits           Date  Op  Type             Design  Regs  Encoding                       Endianness
Alpha     64             1992  3   Reg-Reg          RISC    32    Fixed                          Bi
ARM       32             1983  3   Reg-Reg          RISC    16    Thumb-2: variable (16/32-bit)  Bi
MIPS      64 (32→64)     1981  3   Reg-Reg          RISC    32    Fixed (32-bit)                 Bi
PowerPC   32/64 (32→64)  1991  3   Reg-Reg          RISC    32    Fixed, variable                Big/Bi
SPARC     64 (32→64)     1985  3   Reg-Reg          RISC    32    Fixed                          Big → Bi
z/Archi   64 (32→64)     1964  ?   Reg-Mem/Mem-Mem  CISC    16    Fixed                          Big
VAX       32             1977  6   Mem-Mem          CISC    16    Variable                       Little
x86       32 (16→32)     1978  2   Reg-Mem          CISC    8     Variable                       Little
x86-64    64             2003  2   Reg-Mem          CISC    16    Variable                       Little
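The endianness column can be made concrete with a small sketch using Python's standard struct module, which exposes both byte orders explicitly:

```python
import struct

# The same 32-bit value serialized under the two byte orders from the table.
value = 0x12345678
print(struct.pack(">I", value).hex())  # big-endian:    12345678
print(struct.pack("<I", value).hex())  # little-endian: 78563412
```

A bi-endian architecture (Alpha, ARM, MIPS, ...) can be configured for either ordering; the bytes in memory then match one of the two layouts above.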

Slide 32

Different ISAs
CISC vs. RISC. CPI?
Memory access (addressing modes):
  Direct: mem[1204]
  Register indirect: mem[R4]
  Displacement: mem[R1 + constant]
  Relative to PC: mem[PC + constant]
Instruction format: fixed length, variable length, or hybrid (common in embedded systems).
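The four addressing modes above can be sketched as a toy model; the memory contents and register values here are made up purely for illustration:

```python
# Toy model of the four addressing modes, assuming a flat word-addressable
# memory (a Python list) and a small register file (a dict).
mem = [0] * 4096
regs = {"R1": 100, "R4": 1204, "PC": 2000}
mem[1204] = 42   # hypothetical contents, for illustration only
mem[108] = 7
mem[2016] = 9

def direct(addr):                 # Direct: mem[1204]
    return mem[addr]

def register_indirect(reg):       # Register indirect: mem[R4]
    return mem[regs[reg]]

def displacement(reg, const):     # Displacement: mem[R1 + constant]
    return mem[regs[reg] + const]

def pc_relative(const):           # Relative to PC: mem[PC + constant]
    return mem[regs["PC"] + const]

print(direct(1204), register_indirect("R4"), displacement("R1", 8), pc_relative(16))
# 42 42 7 9
```

The point of the sketch: every mode is just a different way of computing the effective address before the single memory access.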


Slide 42

Design goals
Functional: should be correct! What functions should it support?
Reliable: a spacecraft is different from a PC. Is it really?
Performance: it is not just the frequency but the speed of real tasks. You cannot please everyone all the time.
Low cost: design cost (how big are the teams? how long do they take?), manufacturing cost, testing cost, ...
Energy efficiency: this is the "running cost". Energy is drawn from various sources. Cooling is a big issue.


Slide 47

How do design goals change? (figure)

Slide 48

Performance?
'Latency' or 'throughput'? How do we measure time?
  Real application: portability?
  Kernel: real complexity?
  Selected set of application benchmarks: SPEC, TPC, ...
CPU time: T1 = dynamic instruction count × average CPI × clock cycle time.
Speed-up: Sp = T1 / Tp = T1 / (0.25 T1 + 0.75 T1 / P). Try P = 3 and P = ∞.
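The suggested exercise can be worked directly from the slide's speed-up formula (Amdahl's law with a 75% parallelizable fraction):

```python
# Sp = T1 / (0.25*T1 + 0.75*T1/P): T1 cancels, leaving only the serial and
# parallel fractions of the work.
def speedup(p, parallel_fraction=0.75):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / p)

print(speedup(3))             # 2.0
print(speedup(float("inf")))  # 4.0: the serial 25% caps the speed-up at 1/0.25
```

So even with infinitely many processors, the 25% serial part limits Sp to 4.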


Slide 56

The Upshot!
Make the common case fast, but remember that the uncommon case eventually sets the limit.
You must have a balanced system where the resources are distributed according to where time is spent.
Your system's performance must be above the required average! The peak will be reduced by dependencies and memory stalls.


Slide 59

How does the information move?
By the rule of law: each unit gets its inputs at a prescribed time and should deliver its output before a prescribed time. Synchronous, with clocks.
By consensus: tell me when you finish your part. Asynchronous, with handshaking.
By its natural flow: gates within a unit have a delay. Once the first level of gates finishes its function, those gates start on new data while the second level of gates is processing the first data, without extra signaling. Wave pipelines; must set a barrier somewhere!



Slide 67

What is pipelining?
Unpipelined: Reg → [ complex combinational logic ... ] → Reg.
Pipelined: Reg → [ logic ] → Reg → [ logic ] → Reg → [ logic ] → Reg.
Trade-offs: latency, throughput, operational frequency. Optimal number of stages?
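The latency/throughput trade-off can be sketched numerically; the 12 ns logic path and 0.5 ns per-stage register overhead below are hypothetical figures chosen only to show the shape of the trade-off:

```python
# Toy model: a combinational path split into equal stages, each stage paying
# a fixed register overhead (clock-to-out + setup) on top of its logic delay.
def pipeline(total_logic_ns=12.0, reg_overhead_ns=0.5, stages=1):
    cycle = total_logic_ns / stages + reg_overhead_ns  # clock period
    latency = stages * cycle                           # time for one item end-to-end
    throughput = 1.0 / cycle                           # results per ns once the pipe is full
    return cycle, latency, throughput

for s in (1, 3, 6):
    cycle, latency, tput = pipeline(stages=s)
    print(f"{s} stage(s): cycle={cycle:.2f} ns  latency={latency:.2f} ns  "
          f"throughput={tput:.3f}/ns")
```

More stages raise the clock frequency and throughput but also raise total latency, and the fixed register overhead means the returns diminish; that tension is what makes the number of stages an optimization question.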


Slide 73

About clock edges
Between consecutive posedges, data must traverse τout + τComb + τsetup; the MAX such path, together with the clock skew τSkew, sets the minimum clock period.
If τComb-min < τSkew? (The fastest path can race through before the skewed edge arrives: a hold-time violation.)
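The two constraints on the slide can be written as a minimal timing check. The delay numbers and the sign convention for skew here are assumptions for illustration, not a full static-timing model:

```python
# Setup check: clock-to-out + worst-case combinational delay + setup time
# must fit in the clock period minus the skew budget.
# Hold check: the shortest combinational path must exceed the skew,
# otherwise data races through (the "τComb-min < τSkew" case).
def timing_ok(t_out, t_comb_max, t_comb_min, t_setup, t_skew, t_clk):
    setup_ok = t_out + t_comb_max + t_setup <= t_clk - t_skew
    hold_ok = t_comb_min > t_skew
    return setup_ok, hold_ok

# Hypothetical delays in ns: setup passes, but the 0.2 ns fast path
# loses to 0.4 ns of skew, i.e. a hold violation.
print(timing_ok(t_out=0.3, t_comb_max=3.0, t_comb_min=0.2,
                t_setup=0.2, t_skew=0.4, t_clk=4.0))
# (True, False)
```

Note the asymmetry: a setup failure can be fixed by slowing the clock, but a hold failure cannot, since it does not depend on the clock period at all.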

Slide 74

Pipelining a CPU
Pipelines belong to the organization of the processor. As we have seen, we need to analyze the instruction frequencies of the anticipated workload.
The main stages of a processor are to fetch instructions, execute them, and then save the results. These may be divided into:
1. address generation for the instruction (IA),
2. instruction fetch (IF),
3. decode (D),
4. address generation (AG),
5. data fetch (DF),
6. execution (EX), and
7. put away (PA).
Static, dynamic and multiple-issue pipelines.
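The overlap of the seven stages above can be sketched as a timeline: instruction i occupies stage s during cycle i + s, so once the pipeline is full, one instruction completes per cycle:

```python
# Print a cycle-by-cycle occupancy chart for the seven-stage pipeline above
# (ideal case: no stalls, no hazards).
STAGES = ["IA", "IF", "D", "AG", "DF", "EX", "PA"]

def timeline(n_instructions):
    # Row i is instruction i; it enters the pipe i cycles after instruction 0.
    return [["  "] * i + list(STAGES) for i in range(n_instructions)]

for i, row in enumerate(timeline(3)):
    print(f"I{i}: " + " ".join(f"{s:>2}" for s in row))
```

Hazards (data dependencies, branches, cache misses) break this ideal picture, which is where the static, dynamic, and multiple-issue variants differ.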

Slide 85

Slide 85 text

Affecting the CPI

Ideal pipe operation:

Cycle #    1   2   3   4   5
Ins # 1    IF  D   EX  PA
Ins # 2        IF  D   EX  PA
Ins # 3            IF  D   EX

A branch instruction:

Cycle #    1   2   3   4   5   6
Ins # 1    IF  D   EX  PA
Ins # 2        IF  D   EX  PA
Ins # 2'                   IF  D

Assuming the branch frequency is 15%, the CPI = 1 + 0.15 × 2 = 1.3. Branch prediction? 2-bit.
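The CPI arithmetic, together with a minimal 2-bit saturating-counter predictor, can be sketched as follows (my own illustration; the class and function names are hypothetical):

```python
def cpi_with_branches(branch_freq, branch_penalty, base_cpi=1.0):
    # CPI = base + (branch frequency) x (branch penalty in cycles)
    return base_cpi + branch_freq * branch_penalty

print(cpi_with_branches(0.15, 2))  # 1.3

class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 taken."""
    def __init__(self):
        self.state = 0
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for taken in [True, True, False, True]:  # a loop-like, mostly-taken branch
    p.update(taken)
print(p.predict())  # True: one not-taken did not flip the taken bias
```

The hysteresis is the point of the second bit: a single mispredicted iteration does not immediately invert the prediction.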

Slide 89

Slide 89 text

Data Hazards

RAW, WAW and WAR. RAR?

Forwarding:

Cycle #            1   2   3   4   5
Add R5 ← R2, R1    IF  D   EX  PA
Add R4 ← R5, R3        IF  D   EX  PA

Stalls:

Cycle #            1   2   3   4   5
LW  R5 ← ()        IF  D   EX  PA
Add R4 ← R5, R3        IF  D   –   EX
Ins # 3                    IF  –   D
Ins # 4                        –   IF
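The load-use case above can be checked mechanically. This is my own toy sketch (the tuple encoding of instructions is an assumption): it counts the stalls that forwarding cannot hide, i.e. an instruction reading the destination of the load immediately before it:

```python
# Each instruction: (dest_reg, [source_regs], is_load)
def load_use_stalls(program):
    """Count 1-cycle stalls when an instruction reads the result of the
    immediately preceding load; ALU-to-ALU RAW is covered by forwarding."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        dest, _, is_load = prev
        _, srcs, _ = cur
        if is_load and dest in srcs:
            stalls += 1
    return stalls

prog = [
    ("R5", [], True),             # LW  R5 ← ()
    ("R4", ["R5", "R3"], False),  # Add R4 ← R5, R3  (stalls one cycle)
]
print(load_use_stalls(prog))  # 1
```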

Slide 92

Slide 92 text

Data Hazards .. cnt'd

Multi-cycle execution. In-order completion?

Cycle #    1   2   3   4   5   6   7
Ins # 1    IF  D   EX  EX  EX  PA
Ins # 2        IF  D   EX  PA
Ins # 3            IF  D   EX  EX  PA

Register Renaming:

Lw  R1
Div R5 ← R1, R2
Add R1 ← R3, R4
Mul R0 ← R1, R7

Rename R1 in instructions # 3, 4 to R6.

Dynamic scheduling
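The renaming step can be sketched as a small rewriting pass. This is my own illustration (the tuple encoding and the single free register R6 are assumptions): each new definition of an already-written register gets a fresh physical name, removing the WAR/WAW conflicts:

```python
def rename(program, free_regs):
    """Rewrite WAR/WAW conflicts by giving each re-definition a fresh
    register. program: list of (dest, srcs) tuples. Toy sketch only:
    assumes enough free registers are supplied."""
    mapping = {}           # architectural name -> current renamed name
    out = []
    free = list(free_regs)
    for dest, srcs in program:
        srcs = [mapping.get(r, r) for r in srcs]       # read latest name
        if dest in mapping or any(dest == d for d, _ in out):
            new = free.pop(0)                          # WAW/WAR: rename
            mapping[dest] = new
            dest = new
        out.append((dest, srcs))
    return out

prog = [("R1", []),             # Lw  R1
        ("R5", ["R1", "R2"]),   # Div R5 ← R1, R2
        ("R1", ["R3", "R4"]),   # Add R1 ← R3, R4
        ("R0", ["R1", "R7"])]   # Mul R0 ← R1, R7
print(rename(prog, ["R6"]))
```

Instruction 3's destination and instruction 4's source both become R6, matching the slide, so the Add/Mul pair no longer waits on the slow Div.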

Slide 96

Slide 96 text

ILP: Exploring around

Scoreboard (control flow) and the Tomasulo (data flow)
Super-scalar: N² dependencies and buses. Branch prediction?
Alternatives: compiler loop unrolling and renaming; VLIW (more than a super-scalar)

Schedule (order) and Issue (start):

                          Schedule   Issue
Static                    HW         HW
Dynamic (out-of-order)    HW         HW
In-Order Superscalar      SW         HW
Pure VLIW                 SW         SW
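The compiler-side alternative mentioned above, loop unrolling with accumulator renaming, can be illustrated in Python (my own sketch; a real compiler does this on machine code, and the 4x unroll factor is an arbitrary choice):

```python
def dot(a, b):
    # Baseline: one serial dependence chain through `s`.
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    # Four independent accumulators break the single dependence chain,
    # exposing four multiply-adds per iteration to the scheduler
    # (the accumulators are the compiler's "renamed" registers).
    s0 = s1 = s2 = s3 = 0.0
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    for i in range(n, len(a)):   # leftover elements
        s0 += a[i] * b[i]
    return s0 + s1 + s2 + s3

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dot(x, x), dot_unrolled4(x, x))  # 55.0 55.0
```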

Slide 101

Slide 101 text

More into VLIW

Pros: simple HW, and higher performance
Cons: complex organization offloaded to compilers; porting (Transmeta); variable cache effects; NOPs
GPUs, Itanium ...
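The NOP cost can be illustrated with a toy packing routine (my own sketch; the 3-slot word format is an assumption, not any real VLIW encoding):

```python
# Toy sketch: packing ops into fixed-width VLIW words. Slots the
# compiler cannot fill become explicit NOPs, bloating code size.
WORD_WIDTH = 3  # e.g. two ALU slots + one memory slot (assumed format)

def pack_vliw(ops_per_cycle):
    """Pad each cycle's op list with NOPs up to the fixed word width."""
    words = []
    for ops in ops_per_cycle:
        words.append((ops + ["NOP"] * WORD_WIDTH)[:WORD_WIDTH])
    return words

schedule = [["add", "mul", "lw"], ["add"], ["sub", "sw"]]
for w in pack_vliw(schedule):
    print(w)
# The second word comes out ['add', 'NOP', 'NOP']: wasted issue slots.
```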

Slide 110

Slide 110 text

Vector Processors

SIMD. MIMD = VLIW?

Performance depends on:
- the amount of the program expressed in a vectorizable form
- vector startup costs. Length?
- chaining support
- simultaneous access to/from memory
- # of vector registers

Typical speedup: Ps ≤ 4 (chaining boosts to Ps ≤ 7)
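How much the vectorizable fraction limits the overall gain is Amdahl-style arithmetic and easy to sketch (my own illustration; `vector_speedup` is a hypothetical helper, not from the slides):

```python
def vector_speedup(vector_fraction, ps):
    """Overall speedup when `vector_fraction` of the work runs on a
    vector unit that is `ps` times faster (Amdahl-style sketch)."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / ps)

# Even with the chained Ps = 7, a 70%-vectorizable program only gains:
print(vector_speedup(0.7, 7))  # 2.5
```

The scalar remainder dominates quickly, which is why the amount of vectorizable code tops the list above.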

Slide 118

Slide 118 text

Vector Processors ... Performance

Vector versus multiple issue (superscalar):

        Vector                          Multiple Issue
Pros    good Sp on large scientific     good Sp on small problems;
        loads                           general purpose
Cons    limited to regular data;        complex scheduling;
        vector registers overhead;      large D cache;
        requires a high memory BW       inefficient use of ALUs

Slide 119

Slide 119 text

Thread Level Parallelism

ILP has stalled since the late 1990s. TLP?
- Block multi-threading
- Interleaved multi-threading. GPUs? Multi-cycles?
- Simultaneous multi-threading (with superscalars)
Maximum typical threads
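Interleaved multi-threading can be mimicked with Python generators, issuing one "op" per cycle round-robin across threads (my own sketch, purely illustrative):

```python
# Toy sketch: interleaved (fine-grained) multi-threading switches to a
# different thread every cycle, so one thread's stall cycles can be
# filled with useful work from the others.
def thread(name, n_ops):
    for i in range(n_ops):
        yield f"{name}:op{i}"

def interleave(threads):
    """Round-robin one op per 'cycle' across the threads still running."""
    trace = []
    while threads:
        alive = []
        for t in threads:
            try:
                trace.append(next(t))
                alive.append(t)
            except StopIteration:
                pass                 # thread finished; drop it
        threads = alive
    return trace

print(interleave([thread("T0", 2), thread("T1", 2)]))
# ['T0:op0', 'T1:op0', 'T0:op1', 'T1:op1']
```

Simultaneous multi-threading goes one step further: a superscalar issues ops from several threads in the *same* cycle, not just alternating cycles.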

Slide 126

Slide 126 text

Why do we have multiple processors?

- Large problems exceed the capacity of the largest processors, and using a few (or many!) in parallel could help.
- The available chip area is better used to support multiple cores than to just increase the cache size and levels!
- Some environments are inherently “parallel”. Search engines?
- Partitioning, scheduling and synchronization (cache coherency)

Slide 130

Slide 130 text

How to connect multiple processors?

[Diagram: Core-0, Core-1, Core-2, Core-..., each with its own memory and switch]

Slide 131

Slide 131 text

How to connect multiple processors?

[Diagram: Core-0 through Core-..., each with its own memory and switch, connected through a centralized switching unit interconnect]

Slide 132

Slide 132 text

How to connect multiple processors?

[Diagram: Core-0 through Core-..., with per-core memories and switches linked switch-to-switch]

Slide 134

Slide 134 text

Scaling up with interconnects

Slide 135

Slide 135 text

Let’s Sum Up

- Set your design goal: functionality, performance, power and price
- Structure and organization
- Make the common case fast, and distribute the resources accordingly
- Instruction- and thread-level dependencies and parallelism

Slide 139

Slide 139 text

Questions .. ?

[email protected]