
Intro to Computer Architecture

Tarek Eldeeb
February 01, 2022

A beamer presentation that gives a good insight into computer architecture. Sources can be found on my blog.


Transcript

  1. What is Computer Architecture?
     According to the 1913 Webster, architecture is "the art or science of building; ... or construction, in a more general sense." Recent dictionaries add: "(computer science) the structure and organization of a computer's hardware or system software" [syn: computer architecture].
  2. Where does computer architecture fit?
     In the layered stack: Application Software; OS, Compilers, Network ...; COMPUTER ARCHITECTURE (our interest); Digital Design; Circuits, Devices ...
  3. Computer architecture: Structure
     Within the processor: registers, operational units (integer, floating point, special purpose, ...). Outside the processor: memory, I/O, ... Examples: Sun SPARC, MIPS, Intel x86 (IA-32), IBM S/390. The structure defines the data (types, endianness, storage, and addressing modes), the instruction (operation-code) set, and the instruction formats. A short endianness sketch follows.
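The endianness item is easy to make concrete. A minimal sketch (standard library only) showing how the same 32-bit word is laid out under the two byte orders:

```python
import struct

value = 0x12345678

# Pack the same 32-bit integer with both byte orders.
little = struct.pack("<I", value)   # little-endian: least significant byte first
big    = struct.pack(">I", value)   # big-endian: most significant byte first

print(little.hex())  # 78563412
print(big.hex())     # 12345678
```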
  4. Computer architecture: Organization
     Within the processor: pipeline(s), control unit, instruction cache, data cache, branch prediction, ... Outside the processor: secondary caches, memory interleaving, redundant disk arrays, multi-processors, ... From a programmer's point of view, should I know about the organization? Which implementation is better? How do you define "better"?
  5. Instruction Set Architecture
     Common instructions: arithmetic and logic, data transfer, control. Optional instructions: system, floating-point, graphics. Some control instructions: un/conditional branches, function calls, and returns. A toy interpreter sketch follows.
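To make the three instruction classes concrete, here is a toy-ISA interpreter sketch; the opcode names, encoding, and example program are invented for illustration and are not from the deck:

```python
# Toy ISA: each instruction is a tuple (opcode, operands...).
# ADD rd, rs1, rs2   -- arithmetic
# LI  rd, imm        -- data transfer (load immediate)
# BNE rs1, rs2, off  -- control (branch if not equal, PC-relative offset)

def run(program, num_regs=8):
    regs = [0] * num_regs
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "ADD":
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "LI":
            rd, imm = args
            regs[rd] = imm
        elif op == "BNE":
            rs1, rs2, off = args
            if regs[rs1] != regs[rs2]:
                pc += off
                continue
        pc += 1
    return regs

# Sum 1..5 into R2 using a loop: R1 counts down, R2 accumulates.
prog = [
    ("LI", 1, 5),          # R1 = 5
    ("LI", 2, 0),          # R2 = 0
    ("LI", 3, -1),         # R3 = -1
    ("ADD", 2, 2, 1),      # R2 += R1
    ("ADD", 1, 1, 3),      # R1 -= 1
    ("BNE", 1, 0, -2),     # if R1 != R0 (always 0), go back 2 instructions
]
print(run(prog)[2])        # 15
```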
  6. Comparing ISAs
     Arch      Bits             Date  Ops  Type             Design  Regs  Encoding                       Endianness
     Alpha     64               1992  3    Reg-Reg          RISC    32    Fixed                          Bi
     ARM       32               1983  3    Reg-Reg          RISC    16    Thumb-2: variable (16/32-bit)  Bi
     MIPS      64 (32→64)       1981  3    Reg-Reg          RISC    32    Fixed (32-bit)                 Bi
     PowerPC   32/64 (32→64)    1991  3    Reg-Reg          RISC    32    Fixed, variable                Big/Bi
     SPARC     64 (32→64)       1985  3    Reg-Reg          RISC    32    Fixed                          Big → Bi
     z/Arch    64 (32→64)       1964  ?    Reg-Mem/Mem-Mem  CISC    16    Fixed                          Big
     VAX       32               1977  6    Mem-Mem          CISC    16    Variable                       Little
     x86       32 (16→32)       1978  2    Reg-Mem          CISC    8     Variable                       Little
     x86-64    64               2003  2    Reg-Mem          CISC    16    Variable                       Little
  7. Different ISAs
     CISC vs RISC. CPI? Memory access modes: direct: mem[1204]; register indirect: mem[R4]; displacement: mem[R1+constant]; relative to PC: mem[PC+constant]. Instruction format: fixed length, variable length, or hybrid (common in embedded systems). A small effective-address sketch follows.
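A small sketch of how each listed addressing mode forms its effective address; the register values, PC, constant, and memory contents are placeholder numbers:

```python
# Tiny machine state: a register file, a program counter, and a flat memory.
regs = {"R1": 1200, "R4": 2048}
PC = 0x400
mem = {addr: addr % 256 for addr in range(0, 4096, 4)}  # dummy contents

constant = 4

direct       = mem[1204]                  # Direct:            mem[1204]
reg_indirect = mem[regs["R4"]]            # Register indirect: mem[R4]
displacement = mem[regs["R1"] + constant] # Displacement:      mem[R1 + constant]
pc_relative  = mem[PC + constant]         # Relative to PC:    mem[PC + constant]

print(direct, reg_indirect, displacement, pc_relative)
```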
  8. Design goals
     Functional: it should be correct! What functions should it support? Reliable: a spacecraft is different from a PC. Is it really? Performance: it is not just the frequency but the speed of real tasks. You cannot please everyone all the time. Low cost: design cost (how big are the teams? how long do they take?), manufacturing cost, testing cost, ... Energy efficiency: this is the "running cost"; energy is drawn from various sources, and cooling is a big issue.
  9. How do design goals change?
  10. Performance? ‘Latency’ or ‘Throughput’?
      How do we measure time? A real application (portability?), a kernel (real complexity?), or a selected set of application benchmarks: SPEC, TPC, ... CPU time: T1 = dynamic instruction count × average CPI × clock cycle time. Speed-up: Sp = T1 / Tp = T1 / (0.25·T1 + 0.75·T1 / P). Try P = 3 and P = ∞ (worked numbers below).
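A worked version of the two formulas; the instruction count, CPI, and clock rate are example numbers, and the speed-up expression uses the slide's 25% serial fraction:

```python
# CPU time = dynamic instruction count x average CPI x clock cycle time
insn_count = 2_000_000_000      # example: 2 billion dynamic instructions
avg_cpi = 1.3
cycle_time = 1 / 2.0e9          # 2 GHz clock
t1 = insn_count * avg_cpi * cycle_time
print(f"T1 = {t1:.3f} s")

# Speed-up with a 25% serial fraction: Sp = T1 / (0.25*T1 + 0.75*T1/P)
def speedup(p):
    return 1 / (0.25 + 0.75 / p)

print(f"Sp(P=3)   = {speedup(3):.2f}")      # 2.0
print(f"Sp(P=inf) ~ {1 / 0.25:.1f}")        # the limit is 4
```

With a quarter of the work serial, no number of processors can push the speed-up past 4.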
  11. The Upshot!
      Make the common case fast, but remember that the uncommon case eventually sets the limit. You must have a balanced system where the resources are distributed according to where time is spent. Your system's performance must be above the required average! The peak will be reduced by dependencies and memory stalls.
  12. How does the information move?
      By the rule of law: each unit gets its inputs at a prescribed time and should deliver its output before a prescribed time (synchronous, with clocks). By consensus: tell me when you finish your part (asynchronous, with handshaking). By its natural flow: gates within a unit have a delay, so once the first level of gates finishes its function, it starts on new data while the second level is still processing the first data, without extra signaling (wave pipelines, which must set a barrier somewhere!).
  13. What is pipelining?
      The figure contrasts a single block of complex combinational logic between two registers with the same logic split into several shorter stages, each separated by a register. Points raised: latency, throughput, operational frequency, and the optimal number of stages. A small sketch of the trade-off follows.
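A minimal sketch of the latency/throughput trade-off, assuming a 10 ns combinational block and a fixed per-stage register overhead (both numbers invented):

```python
# Splitting a long combinational path into N stages raises the clock rate
# (throughput) but adds register overhead to the end-to-end latency.
logic_delay = 10.0      # ns, total combinational delay (assumed)
reg_overhead = 0.5      # ns per pipeline register (assumed)

for stages in (1, 2, 4, 8, 16):
    cycle = logic_delay / stages + reg_overhead   # clock period
    latency = stages * cycle                      # time for one item to get through
    throughput = 1.0 / cycle                      # results per ns
    print(f"{stages:2d} stages: period {cycle:5.2f} ns, "
          f"latency {latency:5.2f} ns, throughput {throughput:.3f}/ns")
```

Throughput keeps rising with more stages while latency grows, which is one way to read the slide's question about the optimal number of stages.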
  14. About clock edges
      The timing diagram between two positive clock edges stacks the register clock-to-output delay τout, the combinational delay τComb (its maximum path is what matters), the setup time τsetup, and the clock skew τSkew. And what if τComb,min < τSkew? (a hold/race problem; see the sketch below)
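A small budget calculation with the slide's τ terms; the delays are placeholders, and the final check mirrors the slide's closing question about the fastest combinational path versus the skew:

```python
# Setup constraint: the clock period must cover clock-to-output delay,
# the slowest combinational path, the setup time, and the skew margin.
t_out      = 0.20   # ns, register clock-to-output (assumed)
t_comb_max = 2.50   # ns, slowest combinational path (assumed)
t_comb_min = 0.10   # ns, fastest combinational path (assumed)
t_setup    = 0.15   # ns
t_skew     = 0.25   # ns

min_period = t_out + t_comb_max + t_setup + t_skew
print(f"minimum clock period ~ {min_period:.2f} ns "
      f"(~{1e3 / min_period:.0f} MHz)")

# Hold/race check: if the fastest path is shorter than the skew,
# new data can race through and corrupt the receiving register.
if t_comb_min < t_skew:
    print("hold hazard: t_comb_min < t_skew -- add delay or fix the skew")
```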
  15. Pipelining a CPU
      Pipelines belong to the organization of the processor. As we have seen, we need to analyze the instruction frequencies of the anticipated workload. The main stages of a processor are to fetch the instructions, execute them, and then save the results. These may be divided into (1) address generation for the instruction (IA), (2) instruction fetch (IF), (3) decode (D), (4) address generation (AG), (5) data fetch (DF), (6) execution (EX), and (7) put away (PA). Static, dynamic, and multiple-issue pipelines.
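The stage names invite the cycle-by-cycle diagrams used on the next slides. A small helper (four stages only, one instruction issued per cycle, no hazards; purely illustrative) that prints such an occupancy diagram:

```python
STAGES = ["IF", "D", "EX", "PA"]

def diagram(num_instructions):
    """Print which stage each instruction occupies in each cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    header = "Cycle #  " + " ".join(f"{c:>3d}" for c in range(1, total_cycles + 1))
    print(header)
    for i in range(num_instructions):
        cells = []
        for cycle in range(1, total_cycles + 1):
            stage_idx = cycle - 1 - i          # instruction i starts at cycle i+1
            cells.append(f"{STAGES[stage_idx]:>3s}" if 0 <= stage_idx < len(STAGES) else "   ")
        print(f"Ins #{i + 1}   " + " ".join(cells))

diagram(3)
```

diagram(3) reproduces the "ideal pipe operation" table shown under "Affecting the CPI" below.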
  16. Affecting the CPI
      Ideal pipe operation:
          Cycle #   1   2   3   4   5
          Ins #1    IF  D   EX  PA
          Ins #2        IF  D   EX  PA
          Ins #3            IF  D   EX
      A branch instruction (Ins #1 branches; the wrong-path Ins #2 is squashed and the target Ins #2' starts two cycles late):
          Cycle #   1   2   3   4   5
          Ins #1    IF  D   EX  PA
          Ins #2        IF  D
          Ins #2'               IF  D
      Assuming the branch frequency is 15%, then CPI = 1 + 0.15 × 2 = 1.3. Branch prediction? 2-bit (a predictor sketch follows).
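The "2-bit" hint is the classic 2-bit saturating-counter branch predictor. A minimal sketch of it, together with the slide's CPI arithmetic (the branch-outcome stream is invented):

```python
# CPI with a 2-cycle misprediction penalty, as on the slide.
branch_freq, penalty = 0.15, 2
print("CPI =", 1 + branch_freq * penalty)          # 1.3

# 2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken.
def predict(outcomes, state=0):
    hits = 0
    for taken in outcomes:
        prediction = state >= 2
        hits += (prediction == taken)
        # Move toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return hits / len(outcomes)

# A loop branch: taken nine times, not taken once at loop exit, repeated.
stream = ([True] * 9 + [False]) * 10
print(f"2-bit predictor accuracy: {predict(stream):.0%}")
```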
  17. Data Hazards
      RAW, WAW and WAR. RAR?
      Forward:
          Cycle #            1   2   3   4   5
          Add R5 ← R2, R1    IF  D   EX  PA
          Add R4 ← R5, R3        IF  D   EX  PA
      Stalls:
          Cycle #            1   2   3   4   5
          LW  R5 ← ()        IF  D   EX  PA
          Add R4 ← R5, R3        IF  D   –   EX
          Ins #3                     IF  –   D
          Ins #4                         –   IF
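A minimal sketch of how the hazard classes above can be detected from the register operands of two instructions (the (dest, sources) tuple encoding is made up for the example):

```python
def hazards(first, second):
    """Classify register dependences of `second` on `first`.

    Each instruction is (dest, [sources]); returns a set of hazard names.
    """
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 is not None and d1 in s2:
        found.add("RAW")               # second reads what first writes
    if d2 is not None and d2 in s1:
        found.add("WAR")               # second writes what first reads
    if d1 is not None and d1 == d2:
        found.add("WAW")               # both write the same register
    return found or {"none (RAR is not a hazard)"}

add1 = ("R5", ["R2", "R1"])            # Add R5 ← R2, R1
add2 = ("R4", ["R5", "R3"])            # Add R4 ← R5, R3
print(hazards(add1, add2))             # {'RAW'}
```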
  18. Data Hazards .. cnt'd
      Multi-cycle execution. In-order completion?
          Cycle #   1   2   3   4   5   6   7
          Ins #1    IF  D   EX  EX  EX  PA
          Ins #2        IF  D   EX  PA
          Ins #3            IF  D   EX  EX  PA
      Register renaming:
          Lw  R1
          Div R5 ← R1, R2
          Add R1 ← R3, R4
          Mul R0 ← R1, R7
      Rename R1 in instructions #3 and #4 to R6. Dynamic scheduling (a renaming sketch follows).
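A very small renaming sketch matching the slide's example: every redefinition of an architectural register gets a fresh physical name, and later readers pick up the new name (the free-register list is assumed):

```python
def rename(instrs, free_regs):
    """Very simplified renaming: give every new definition a fresh register."""
    mapping = {}                      # architectural -> current physical name
    renamed = []
    for op, dest, srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]      # read the current names
        if dest is not None:
            mapping[dest] = free_regs.pop(0) if dest in mapping else dest
            dest = mapping[dest]
        renamed.append((op, dest, srcs))
    return renamed

prog = [
    ("Lw",  "R1", []),
    ("Div", "R5", ["R1", "R2"]),
    ("Add", "R1", ["R3", "R4"]),      # WAW with Lw, WAR with Div
    ("Mul", "R0", ["R1", "R7"]),
]
for ins in rename(prog, free_regs=["R6", "R8"]):
    print(ins)
# The Add's destination and the Mul's source R1 become R6, as on the slide.
```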
  19. ILP: Exploring around
      Scoreboard (control flow) and Tomasulo (data flow). Super-scalar: N² dependencies and buses. Branch prediction? Alternatives: compiler loop unrolling and renaming; VLIW (more than a super-scalar). Schedule (order) and issue (start):
                                    Schedule  Issue
          Static                    HW        HW
          Dynamic (out-of-order)    HW        HW
          In-order superscalar      SW        HW
          Pure VLIW                 SW        SW
  20. More into VLIW
      Pros: simple HW and higher performance. Cons: a complex organization pushed onto the compilers, porting (Transmeta), variable cache effects, and NOPs. Examples: GPUs, Itanium, ...
  21. Vector Processors
      SIMD. MIMD = VLIW? Performance depends on: the amount of the program expressed in a vectorizable form; vector startup costs (length?); chaining support; simultaneous access to/from memory; the number of vector registers. Typical speedup: Ps ≤ 4 (chaining boosts it to Ps ≤ 7). A small vectorization sketch follows.
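Since SIMD/vector execution is exactly what array libraries expose, here is an illustrative comparison of a scalar Python loop against the vectorized NumPy equivalent (requires NumPy; the measured ratio is machine-dependent and only indicative):

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Scalar, element-at-a-time loop.
start = time.perf_counter()
c_scalar = [a[i] * b[i] + a[i] for i in range(n)]
scalar_t = time.perf_counter() - start

# Vectorized: one operation over the whole array at once.
start = time.perf_counter()
c_vec = a * b + a
vector_t = time.perf_counter() - start

print(f"scalar {scalar_t:.3f} s, vector {vector_t:.4f} s, "
      f"speed-up ~{scalar_t / vector_t:.0f}x")
```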
  22. Vector Processors ... Performance
      Vector versus multiple issue (superscalar):
                  Vector                              Multiple Issue
          Pros    good Sp on large scientific loads   good Sp on small problems; general purpose
          Cons    limited to regular data;            complex scheduling;
                  vector-register overhead;           large D-cache;
                  requires a high memory BW           inefficient use of ALUs
  23. Thread Level Parallelism
      ILP has stalled since the late 1990s; TLP? Block multi-threading. Interleaved multi-threading (GPUs? multi-cycles?). Simultaneous multi-threading (with superscalars). Maximum typical threads? A toy scheduling sketch follows.
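A toy scheduling sketch contrasting a core that runs threads one after the other with one that interleaves two threads to hide stall cycles; the thread contents and the 2-cycle miss penalty are invented for illustration:

```python
# Each thread is a list of ops: 'C' is a single-cycle compute op, 'S' stalls
# the issuing thread for 2 extra cycles (think of a cache miss).
T0 = ["C", "S", "C", "C"]
T1 = ["C", "C", "S", "C"]

def cycles_single(thread):
    """One thread alone: every 'S' wastes 2 cycles."""
    return sum(3 if op == "S" else 1 for op in thread)

def cycles_interleaved(a, b):
    """Round-robin issue between two threads; a stalled thread skips its
    turn, so the partner's work hides the miss latency."""
    pcs, ready = [0, 0], [0, 0]
    threads, cycle = [a, b], 0
    while pcs[0] < len(a) or pcs[1] < len(b):
        cycle += 1
        t = cycle % 2                       # whose turn is it?
        if pcs[t] >= len(threads[t]) or ready[t] > cycle:
            t = 1 - t                       # partner issues instead, if it can
            if pcs[t] >= len(threads[t]) or ready[t] > cycle:
                continue                    # nobody ready: a wasted cycle
        op = threads[t][pcs[t]]
        pcs[t] += 1
        if op == "S":
            ready[t] = cycle + 3            # this thread waits 2 extra cycles
    return cycle

print("one after the other:", cycles_single(T0) + cycles_single(T1), "cycles")
print("interleaved:        ", cycles_interleaved(T0, T1), "cycles")
```

Interleaving finishes the same work in fewer cycles because the partner thread keeps issuing during the miss latency.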
  24. Why do we have multiple processors?
      Large problems exceed the capacity of the largest processors, and using a few (or many!) in parallel could help. The available chip area is better used to support multiple cores than to just increase the cache size and levels! Some environments are inherently "parallel": search engines? Partitioning, scheduling, and synchronization (cache coherency).
  25. How to connect multiple processors?
      Two figure slides: (a) each core, with its memory and switch, attached to a centralized switching unit (the interconnect); (b) the cores, memories, and switches arranged in a distributed fashion, with the switches linked to one another.
  26. Scaling up with interconnects
  27. Let's Sum Up
      Set your design goals: functionality, performance, power, and price. Structure and organization. Make the common case fast, and distribute the resources accordingly. Instruction-level and thread-level dependencies and parallelism.