Slide 1

Intro for Computer Architecture
Tarek ElDeeb, SilMinds, LLC
May 27, 2012

Slide 2

Outline: Definitions, Instruction Set Architectures, Performance, Pipelines, Parallelism, Summary

What is Computer Architecture?
According to the 1913 Webster, architecture is "the art or science of building; ... or construction, in a more general sense."
Recent dictionaries add: "4: (computer science) the structure and organization of a computer's hardware or system software" [syn: computer architecture].


Slide 5

Where does computer architecture fit?
A layered view, from top to bottom:
  Application Software
  OS, Compilers, Network ...
  COMPUTER ARCHITECTURE (our interest)
  Digital Design
  Circuits, Devices ...


Slide 11

Computer architecture: Structure
Within the processor: registers, operational units (integer, floating point, special purpose, ...).
Outside the processor: memory, I/O, ...
Examples: Sun SPARC, MIPS, Intel x86 (IA32), IBM S/390.
Defines: data (types, endianness, storage, and addressing modes), the instruction (operation code) set, and instruction formats.


Slide 15

Computer architecture: Organization
Within the processor: pipeline(s), control unit, instruction cache, data cache, branch prediction, ...
Outside the processor: secondary caches, memory interleaving, redundant disk arrays, multi-processors, ...
From a programmer's point of view, should I know about the organization?
Which implementation is better? How do you define 'better'?


Slide 19

Instruction Set Architecture
Common instructions: arithmetic and logic, data transfer, control.
Optional instructions: system, floating-point, graphics.
Some control instructions: un/conditional branches, function calls, and returns.


Slide 31

Comparing ISAs

Archi     Bits           Date  Op  Type             Design  Regs  Encoding                       Endianness
Alpha     64             1992  3   Reg-Reg          RISC    32    Fixed                          Bi
ARM       32             1983  3   Reg-Reg          RISC    16    Thumb-2: variable (16/32-bit)  Bi
MIPS      64 (32→64)     1981  3   Reg-Reg          RISC    32    Fixed (32-bit)                 Bi
PowerPC   32/64 (32→64)  1991  3   Reg-Reg          RISC    32    Fixed, variable                Big/Bi
SPARC     64 (32→64)     1985  3   Reg-Reg          RISC    32    Fixed                          Big → Bi
z/Archi   64 (32→64)     1964  ?   Reg-Mem/Mem-Mem  CISC    16    Fixed                          Big
VAX       32             1977  6   Mem-Mem          CISC    16    Variable                       Little
x86       32 (16→32)     1978  2   Reg-Mem          CISC    8     Variable                       Little
x86-64    64             2003  2   Reg-Mem          CISC    16    Variable                       Little
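The endianness column can be made concrete with a small sketch using Python's standard struct module, which exposes both byte orders explicitly:

```python
import struct

# The same 32-bit value serialized under the two byte orders from the table.
value = 0x12345678
print(struct.pack(">I", value).hex())  # big-endian:    12345678
print(struct.pack("<I", value).hex())  # little-endian: 78563412
```

A bi-endian architecture (Alpha, ARM, MIPS, ...) can be configured for either ordering; the bytes in memory then match one of the two layouts above.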

Slide 32

Different ISAs
CISC vs. RISC. CPI?
Memory access (addressing modes):
  Direct: mem[1204]
  Register indirect: mem[R4]
  Displacement: mem[R1 + constant]
  Relative to PC: mem[PC + constant]
Instruction format: fixed length, variable length, or hybrid (common in embedded systems).
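The four addressing modes above can be sketched as a toy model; the memory contents and register values here are made up purely for illustration:

```python
# Toy model of the four addressing modes, assuming a flat word-addressable
# memory (a Python list) and a small register file (a dict).
mem = [0] * 4096
regs = {"R1": 100, "R4": 1204, "PC": 2000}
mem[1204] = 42   # hypothetical contents, for illustration only
mem[108] = 7
mem[2016] = 9

def direct(addr):                 # Direct: mem[1204]
    return mem[addr]

def register_indirect(reg):       # Register indirect: mem[R4]
    return mem[regs[reg]]

def displacement(reg, const):     # Displacement: mem[R1 + constant]
    return mem[regs[reg] + const]

def pc_relative(const):           # Relative to PC: mem[PC + constant]
    return mem[regs["PC"] + const]

print(direct(1204), register_indirect("R4"), displacement("R1", 8), pc_relative(16))
# 42 42 7 9
```

The point of the sketch: every mode is just a different way of computing the effective address before the single memory access.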


Slide 42

Design goals
Functional: should be correct! What functions should it support?
Reliable: a spacecraft is different from a PC. Is it really?
Performance: it is not just the frequency but the speed of real tasks. You cannot please everyone all the time.
Low cost: design cost (how big are the teams? how long do they take?), manufacturing cost, testing cost, ...
Energy efficiency: this is the "running cost". Energy is drawn from various sources. Cooling is a big issue.


Slide 47

How do design goals change? (figure)

Slide 48

Performance?
'Latency' or 'throughput'? How do we measure time?
  Real application: portability?
  Kernel: real complexity?
  Selected set of application benchmarks: SPEC, TPC, ...
CPU time: T1 = dynamic instruction count × average CPI × clock cycle time.
Speed-up: Sp = T1 / Tp = T1 / (0.25 T1 + 0.75 T1 / P). Try P = 3 and P = ∞.
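The suggested exercise can be worked directly from the slide's speed-up formula (Amdahl's law with a 75% parallelizable fraction):

```python
# Sp = T1 / (0.25*T1 + 0.75*T1/P): T1 cancels, leaving only the serial and
# parallel fractions of the work.
def speedup(p, parallel_fraction=0.75):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / p)

print(speedup(3))             # 2.0
print(speedup(float("inf")))  # 4.0: the serial 25% caps the speed-up at 1/0.25
```

So even with infinitely many processors, the 25% serial part limits Sp to 4.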


Slide 56

The Upshot!
Make the common case fast, but remember that the uncommon case eventually sets the limit.
You must have a balanced system where the resources are distributed according to where time is spent.
Your system's performance must be above the required average! The peak will be reduced by dependencies and memory stalls.


Slide 59

How does the information move?
By the rule of law: each unit gets its inputs at a prescribed time and should deliver its output before a prescribed time. Synchronous, with clocks.
By consensus: tell me when you finish your part. Asynchronous, with handshaking.
By its natural flow: gates within a unit have a delay. Once the first level of gates finishes its function, those gates start on new data while the second level of gates is processing the first data, without extra signaling. Wave pipelines; must set a barrier somewhere!



Slide 67

What is pipelining?
Unpipelined: Reg → [ complex combinational logic ... ] → Reg.
Pipelined: Reg → [ logic ] → Reg → [ logic ] → Reg → [ logic ] → Reg.
Trade-offs: latency, throughput, operational frequency. Optimal number of stages?
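The latency/throughput trade-off can be sketched numerically; the 12 ns logic path and 0.5 ns per-stage register overhead below are hypothetical figures chosen only to show the shape of the trade-off:

```python
# Toy model: a combinational path split into equal stages, each stage paying
# a fixed register overhead (clock-to-out + setup) on top of its logic delay.
def pipeline(total_logic_ns=12.0, reg_overhead_ns=0.5, stages=1):
    cycle = total_logic_ns / stages + reg_overhead_ns  # clock period
    latency = stages * cycle                           # time for one item end-to-end
    throughput = 1.0 / cycle                           # results per ns once the pipe is full
    return cycle, latency, throughput

for s in (1, 3, 6):
    cycle, latency, tput = pipeline(stages=s)
    print(f"{s} stage(s): cycle={cycle:.2f} ns  latency={latency:.2f} ns  "
          f"throughput={tput:.3f}/ns")
```

More stages raise the clock frequency and throughput but also raise total latency, and the fixed register overhead means the returns diminish; that tension is what makes the number of stages an optimization question.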


Slide 73

About clock edges
Between consecutive posedges, data must traverse τout + τComb + τsetup; the MAX such path, together with the clock skew τSkew, sets the minimum clock period.
If τComb-min < τSkew? (The fastest path can race through before the skewed edge arrives: a hold-time violation.)
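The two constraints on the slide can be written as a minimal timing check. The delay numbers and the sign convention for skew here are assumptions for illustration, not a full static-timing model:

```python
# Setup check: clock-to-out + worst-case combinational delay + setup time
# must fit in the clock period minus the skew budget.
# Hold check: the shortest combinational path must exceed the skew,
# otherwise data races through (the "τComb-min < τSkew" case).
def timing_ok(t_out, t_comb_max, t_comb_min, t_setup, t_skew, t_clk):
    setup_ok = t_out + t_comb_max + t_setup <= t_clk - t_skew
    hold_ok = t_comb_min > t_skew
    return setup_ok, hold_ok

# Hypothetical delays in ns: setup passes, but the 0.2 ns fast path
# loses to 0.4 ns of skew, i.e. a hold violation.
print(timing_ok(t_out=0.3, t_comb_max=3.0, t_comb_min=0.2,
                t_setup=0.2, t_skew=0.4, t_clk=4.0))
# (True, False)
```

Note the asymmetry: a setup failure can be fixed by slowing the clock, but a hold failure cannot, since it does not depend on the clock period at all.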

Slide 74

Pipelining a CPU
Pipelines belong to the organization of the processor. As we have seen, we need to analyze the instruction frequencies of the anticipated workload.
The main stages of a processor are to fetch instructions, execute them, and then save the results. These may be divided into:
1. address generation for the instruction (IA),
2. instruction fetch (IF),
3. decode (D),
4. address generation (AG),
5. data fetch (DF),
6. execution (EX), and
7. put away (PA).
Static, dynamic and multiple-issue pipelines.
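The overlap of the seven stages above can be sketched as a timeline: instruction i occupies stage s during cycle i + s, so once the pipeline is full, one instruction completes per cycle:

```python
# Print a cycle-by-cycle occupancy chart for the seven-stage pipeline above
# (ideal case: no stalls, no hazards).
STAGES = ["IA", "IF", "D", "AG", "DF", "EX", "PA"]

def timeline(n_instructions):
    # Row i is instruction i; it enters the pipe i cycles after instruction 0.
    return [["  "] * i + list(STAGES) for i in range(n_instructions)]

for i, row in enumerate(timeline(3)):
    print(f"I{i}: " + " ".join(f"{s:>2}" for s in row))
```

Hazards (data dependencies, branches, cache misses) break this ideal picture, which is where the static, dynamic, and multiple-issue variants differ.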

Slide 85

Slide 85 text

Affecting the CPI

Ideal pipe operation:

Cycle #    1   2   3   4   5
Ins # 1    IF  D   EX  PA
Ins # 2        IF  D   EX  PA
Ins # 3            IF  D   EX

A branch instruction:

Cycle #    1   2   3   4   5   6
Ins # 1    IF  D   EX  PA
Ins # 2        IF  D   EX  PA
Ins # 2'                   IF  D

Assuming the branch frequency is 15%, the CPI = 1 + 0.15 × 2 = 1.3. Branch prediction? 2-bit.
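The CPI arithmetic, together with a minimal 2-bit saturating-counter predictor, can be sketched as follows (my own illustration; the class and function names are hypothetical):

```python
def cpi_with_branches(branch_freq, branch_penalty, base_cpi=1.0):
    # CPI = base + (branch frequency) x (branch penalty in cycles)
    return base_cpi + branch_freq * branch_penalty

print(cpi_with_branches(0.15, 2))  # 1.3

class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 taken."""
    def __init__(self):
        self.state = 0
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for taken in [True, True, False, True]:  # a loop-like, mostly-taken branch
    p.update(taken)
print(p.predict())  # True: one not-taken did not flip the taken bias
```

The hysteresis is the point of the second bit: a single mispredicted iteration does not immediately invert the prediction.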

Slide 89

Slide 89 text

Data Hazards

RAW, WAW and WAR. RAR?

Forwarding:

Cycle #            1   2   3   4   5
Add R5 ← R2, R1    IF  D   EX  PA
Add R4 ← R5, R3        IF  D   EX  PA

Stalls:

Cycle #            1   2   3   4   5
LW  R5 ← ()        IF  D   EX  PA
Add R4 ← R5, R3        IF  D   –   EX
Ins # 3                    IF  –   D
Ins # 4                        –   IF
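The load-use case above can be checked mechanically. This is my own toy sketch (the tuple encoding of instructions is an assumption): it counts the stalls that forwarding cannot hide, i.e. an instruction reading the destination of the load immediately before it:

```python
# Each instruction: (dest_reg, [source_regs], is_load)
def load_use_stalls(program):
    """Count 1-cycle stalls when an instruction reads the result of the
    immediately preceding load; ALU-to-ALU RAW is covered by forwarding."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        dest, _, is_load = prev
        _, srcs, _ = cur
        if is_load and dest in srcs:
            stalls += 1
    return stalls

prog = [
    ("R5", [], True),             # LW  R5 ← ()
    ("R4", ["R5", "R3"], False),  # Add R4 ← R5, R3  (stalls one cycle)
]
print(load_use_stalls(prog))  # 1
```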

Slide 92

Slide 92 text

Data Hazards .. cnt'd

Multi-cycle execution. In-order completion?

Cycle #    1   2   3   4   5   6   7
Ins # 1    IF  D   EX  EX  EX  PA
Ins # 2        IF  D   EX  PA
Ins # 3            IF  D   EX  EX  PA

Register Renaming:

Lw  R1
Div R5 ← R1, R2
Add R1 ← R3, R4
Mul R0 ← R1, R7

Rename R1 in instructions # 3, 4 to R6.

Dynamic scheduling
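The renaming step can be sketched as a small rewriting pass. This is my own illustration (the tuple encoding and the single free register R6 are assumptions): each new definition of an already-written register gets a fresh physical name, removing the WAR/WAW conflicts:

```python
def rename(program, free_regs):
    """Rewrite WAR/WAW conflicts by giving each re-definition a fresh
    register. program: list of (dest, srcs) tuples. Toy sketch only:
    assumes enough free registers are supplied."""
    mapping = {}           # architectural name -> current renamed name
    out = []
    free = list(free_regs)
    for dest, srcs in program:
        srcs = [mapping.get(r, r) for r in srcs]       # read latest name
        if dest in mapping or any(dest == d for d, _ in out):
            new = free.pop(0)                          # WAW/WAR: rename
            mapping[dest] = new
            dest = new
        out.append((dest, srcs))
    return out

prog = [("R1", []),             # Lw  R1
        ("R5", ["R1", "R2"]),   # Div R5 ← R1, R2
        ("R1", ["R3", "R4"]),   # Add R1 ← R3, R4
        ("R0", ["R1", "R7"])]   # Mul R0 ← R1, R7
print(rename(prog, ["R6"]))
```

Instruction 3's destination and instruction 4's source both become R6, matching the slide, so the Add/Mul pair no longer waits on the slow Div.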

Slide 96

Slide 96 text

ILP: Exploring around

Scoreboard (control flow) and the Tomasulo (data flow)
Super-scalar: N² dependencies and buses. Branch prediction?
Alternatives: compiler loop unrolling and renaming; VLIW (more than a super-scalar)

Schedule (order) and Issue (start):

                          Schedule   Issue
Static                    HW         HW
Dynamic (out-of-order)    HW         HW
In-Order Superscalar      SW         HW
Pure VLIW                 SW         SW
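The compiler-side alternative mentioned above, loop unrolling with accumulator renaming, can be illustrated in Python (my own sketch; a real compiler does this on machine code, and the 4x unroll factor is an arbitrary choice):

```python
def dot(a, b):
    # Baseline: one serial dependence chain through `s`.
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    # Four independent accumulators break the single dependence chain,
    # exposing four multiply-adds per iteration to the scheduler
    # (the accumulators are the compiler's "renamed" registers).
    s0 = s1 = s2 = s3 = 0.0
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    for i in range(n, len(a)):   # leftover elements
        s0 += a[i] * b[i]
    return s0 + s1 + s2 + s3

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dot(x, x), dot_unrolled4(x, x))  # 55.0 55.0
```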

Slide 101

Slide 101 text

More into VLIW

Pros: simple HW, and higher performance
Cons: complex organization offloaded to compilers; porting (Transmeta); variable cache effects; NOPs
GPUs, Itanium ...
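The NOP cost can be illustrated with a toy packing routine (my own sketch; the 3-slot word format is an assumption, not any real VLIW encoding):

```python
# Toy sketch: packing ops into fixed-width VLIW words. Slots the
# compiler cannot fill become explicit NOPs, bloating code size.
WORD_WIDTH = 3  # e.g. two ALU slots + one memory slot (assumed format)

def pack_vliw(ops_per_cycle):
    """Pad each cycle's op list with NOPs up to the fixed word width."""
    words = []
    for ops in ops_per_cycle:
        words.append((ops + ["NOP"] * WORD_WIDTH)[:WORD_WIDTH])
    return words

schedule = [["add", "mul", "lw"], ["add"], ["sub", "sw"]]
for w in pack_vliw(schedule):
    print(w)
# The second word comes out ['add', 'NOP', 'NOP']: wasted issue slots.
```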

Slide 110

Slide 110 text

Vector Processors

SIMD. MIMD = VLIW?

Performance depends on:
- the amount of the program expressed in a vectorizable form
- vector startup costs. Length?
- chaining support
- simultaneous access to/from memory
- # of vector registers

Typical speedup: Ps ≤ 4 (chaining boosts to Ps ≤ 7)
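How much the vectorizable fraction limits the overall gain is Amdahl-style arithmetic and easy to sketch (my own illustration; `vector_speedup` is a hypothetical helper, not from the slides):

```python
def vector_speedup(vector_fraction, ps):
    """Overall speedup when `vector_fraction` of the work runs on a
    vector unit that is `ps` times faster (Amdahl-style sketch)."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / ps)

# Even with the chained Ps = 7, a 70%-vectorizable program only gains:
print(vector_speedup(0.7, 7))  # 2.5
```

The scalar remainder dominates quickly, which is why the amount of vectorizable code tops the list above.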

Slide 118

Slide 118 text

Vector Processors ... Performance

Vector versus multiple issue (superscalar):

        Vector                          Multiple Issue
Pros    good Sp on large scientific     good Sp on small problems;
        loads                           general purpose
Cons    limited to regular data;        complex scheduling;
        vector registers overhead;      large D cache;
        requires a high memory BW       inefficient use of ALUs

Slide 119

Slide 119 text

Thread Level Parallelism

ILP has stalled since the late 1990s. TLP?
- Block multi-threading
- Interleaved multi-threading. GPUs? Multi-cycles?
- Simultaneous multi-threading (with superscalars)
Maximum typical threads
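Interleaved multi-threading can be mimicked with Python generators, issuing one "op" per cycle round-robin across threads (my own sketch, purely illustrative):

```python
# Toy sketch: interleaved (fine-grained) multi-threading switches to a
# different thread every cycle, so one thread's stall cycles can be
# filled with useful work from the others.
def thread(name, n_ops):
    for i in range(n_ops):
        yield f"{name}:op{i}"

def interleave(threads):
    """Round-robin one op per 'cycle' across the threads still running."""
    trace = []
    while threads:
        alive = []
        for t in threads:
            try:
                trace.append(next(t))
                alive.append(t)
            except StopIteration:
                pass                 # thread finished; drop it
        threads = alive
    return trace

print(interleave([thread("T0", 2), thread("T1", 2)]))
# ['T0:op0', 'T1:op0', 'T0:op1', 'T1:op1']
```

Simultaneous multi-threading goes one step further: a superscalar issues ops from several threads in the *same* cycle, not just alternating cycles.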

Slide 126

Slide 126 text

Why do we have multiple processors?

- Large problems exceed the capacity of the largest processors, and using a few (or many!) in parallel could help.
- The available chip area is better used to support multiple cores than to just increase the cache size and levels!
- Some environments are inherently “parallel”. Search engines?
- Partitioning, scheduling and synchronization (cache coherency)

Slide 130

Slide 130 text

How to connect multiple processors?

[Diagram: Core-0, Core-1, Core-2, Core-..., each with its own memory and switch]

Slide 131

Slide 131 text

How to connect multiple processors?

[Diagram: Core-0 through Core-..., each with its own memory and switch, connected through a centralized switching unit interconnect]

Slide 132

Slide 132 text

How to connect multiple processors?

[Diagram: Core-0 through Core-..., with per-core memories and switches linked switch-to-switch]

Slide 134

Slide 134 text

Scaling up with interconnects

Slide 135

Slide 135 text

Let’s Sum Up

- Set your design goal: functionality, performance, power and price
- Structure and organization
- Make the common case fast, and distribute the resources accordingly
- Instruction- and thread-level dependencies and parallelism

Slide 139

Slide 139 text

Questions .. ?

[email protected]