Slide 1

Hardware and its Concurrency Habits © 2023 Meta Platforms Paul E. McKenney, Meta Platforms Kernel Team Kernel Recipes, September 27, 2023 https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

Slide 2

2 Recipe for Concurrent Coding ● A pinch of knowledge of the laws of physics ● A modest understanding of computer hardware ● A thorough understanding of the requirements ● Careful design, including synchronization ● Brutal validation https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

Slide 3

3 Recipe for Concurrent Coding ● A pinch of knowledge of the laws of physics ● A modest understanding of computer hardware ● A thorough understanding of the requirements ● Careful design, including synchronization ● Brutal validation https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

Slide 4

5 “Let Them Run Free!!!”

Slide 5

6 “Let Them Run Free!!!” CPU Benchmark Trackmeet

Slide 6

7 “Let Them Run Free!!!” CPU Benchmark Trackmeet Sadly, it is now more of an obstacle course than a track...

Slide 7

8 Don’t Make ‘em Like They Used To!

Slide 8

9 Don’t Make ‘em Like They Used To! 4.0 GHz clock, 20 MB L3 cache, 20 stage pipeline... The only pipeline I need is to cool off that hot-headed brat.

Slide 9

10 4.0 GHz clock, 20 MB L3 cache, 20 stage pipeline... The only pipeline I need is to cool off that hot-headed brat. Don’t Make ‘em Like They Used To! Then: ● No cache ● Shallow pipeline ● In-order execution ● One instruction at a time ● Predictable (slow) execution Now: ● Large cache ● Deep pipeline ● Out-of-order ● Superscalar ● Unpredictable (fast) execution

Slide 10

11 4.0 GHz clock, 20 MB L3 cache, 20 stage pipeline... The only pipeline I need is to cool off that hot-headed brat. Don’t Make ‘em Like They Used To! Then (“Tiny Bulldozer”): ● No cache ● Shallow pipeline ● In-order execution ● One instruction at a time ● Predictable (slow) execution Now (“Semi Tractor-Trailer”): ● Large cache ● Deep pipeline ● Out-of-order ● Superscalar ● Unpredictable (fast) execution

Slide 11

12 4.0 GHz clock, 20 MB L3 cache, 20 stage pipeline... The only pipeline I need is to cool off that hot-headed brat. Don’t Make ‘em Like They Used To! Then: ● No cache ● Shallow pipeline ● In-order execution ● One instruction at a time ● Predictable (slow) execution Now: ● Large cache ● Deep pipeline ● Out-of-order ● Superscalar ● Unpredictable (fast) execution What would be the computing-systems equivalents of a freight train?

Slide 12

13 “Good Olde Days” CPU Architecture: 80386 block diagram showing the 32-bit bus control unit, prefetcher with 16-byte code queue, instruction decoder with 3-deep decoded-instruction queue, segmentation and paging units with protection test and limit checking, decode/sequencing control ROM, and a 32-bit ALU with barrel shifter, multiply/divide, and register file (Wikipedia user “Appaloosa”, GFDL, simplified and reformatted)

Slide 13

14 The 80386 Taught Me Concurrency That and a logic analyzer...

Slide 14

15 But Instructions Took Several Cycles!

Slide 15

16 Pipelined Execution For The Win!!! (Wikipedia user “Amit6” CC BY-SA 3.0, reformatted)

Slide 16

17 Superscalar Execution For The Win!!! Intel Core 2 Architecture block diagram: instruction fetch (128-entry ITLB, 32 KB 8-way instruction cache, pre-decode and fetch buffers, 18-entry instruction queue) feeding one complex and three simple decoders, a register alias table and allocator, a 96-entry reorder buffer, and a 32-entry reservation station dispatching up to 4 μops/cycle to ALUs, SSE units, FP units, and load/store units, backed by a memory ordering buffer, a 32 KB dual-ported data cache, DTLBs, and a shared L2 cache (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted)

Slide 17

18 Why All This Hardware Complexity?

Slide 18

19 Laws of Physics: Atoms Are Too Big!!! Each spot is an atom. Qingxiao Wang/UT Dallas ca. 2016.

Slide 19

20 Laws of Physics: Atoms Are Too Big!!! Each spot is an atom. Qingxiao Wang/UT Dallas ca. 2016. Speed controlled by base thickness: At least one atom thick!!!

Slide 20

21 Laws of Physics: Light Is Too Slow!!! “One nanosecond per foot” courtesy of Grace Hopper (https://www.youtube.com/watch?v=9eyFDBPk4Yw) ● Following in the footsteps of Admiral Hopper: – Light goes 11.803 inches/ns in a vacuum ● Or, if you prefer, 1.0097 lengths of A4 paper per nanosecond ● Light goes 1 width of A4 paper per nanosecond in a 50% sugar solution, which is “light syrup” (https://en.wikipedia.org/wiki/List_of_refractive_indices) – But over and back: 5.9015 in/ns – But not at 1 GHz! Instead, at ~2 GHz: ~3 in/ns – But in Cu: ~1 in/ns, or Si transistors: ~0.1 in/ns – Plus other slowdowns: protocols, electronics, ...
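Those numbers are easy to check. Below is a minimal C sketch of the arithmetic (added here, not from the talk): the speed of light and the A4 paper dimensions are standard values, and the 1.42 refractive index for a 50% sugar solution is an approximation taken from the Wikipedia list cited above.

#include <stdio.h>

int main(void)
{
    double c = 299792458.0;       /* Speed of light in vacuum, m/s. */
    double in_per_m = 39.3701;    /* Inches per meter. */
    double m_per_ns = c / 1e9;    /* Meters traveled per nanosecond. */

    printf("Vacuum, one way:    %.4f in/ns\n", m_per_ns * in_per_m);       /* ~11.803 */
    printf("Vacuum, round trip: %.4f in/ns\n", m_per_ns * in_per_m / 2.0); /* ~5.9015 */
    printf("A4 lengths (297 mm) per ns, vacuum:      %.4f\n", m_per_ns / 0.297);
    printf("A4 widths (210 mm) per ns, n=1.42 syrup: %.4f\n", m_per_ns / 1.42 / 0.210);
    return 0;
}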

Slide 21

22 Laws of Physics: Data Is Slower!!! CPUs Caches Interconnects Memories DRAM & NVM

Slide 22

23 Laws of Physics: Data Is Slower!!! CPUs Caches Interconnects Memories DRAM & NVM Light is way too slow in Cu and Si and atoms are way too big!!!

Slide 23

24 Laws of Physics: Data Is Slower!!! CPUs Caches Interconnects Memories DRAM & NVM Protocol overheads (Mathematics!) Multiplexing & Demultiplexing (Electronics) Clock-domain transitions (Electronics) Phase changes (Chemistry) Light is way too slow in Cu and Si and atoms are way too big!!!

Slide 24

25 Laws of Physics: Summary ● The speed of light is finite (especially in Cu and Si) and atoms are of non-zero size ● Mathematics, electronics, and chemistry also take their toll ● Systems are fast, so this matters

Slide 25

26 Laws of Physics: Summary ● The speed of light is finite (especially in Cu and Si) and atoms are of non-zero size ● Mathematics, electronics, and chemistry also take their toll ● Systems are fast, so this matters “Gentlemen, you have two fundamental problems: (1) the finite speed of light and (2) the atomic nature of matter.” * * Gordon Moore quoting Stephen Hawking

Slide 26

27 Why All This Hardware Complexity? CPUs Caches Interconnects Memories DRAM & NVM Protocol overheads (Mathematics!) Multiplexing & Demultiplexing (Electronics) Clock-domain transitions (Electronics) Phase changes (Chemistry) Slow light and big atoms create the modern computing obstacle course!!! Light is way too slow in Cu and Si and atoms are way too big!!!

Slide 27

28 Account For All CPU Complexity??? ● Sometimes, yes! (Assembly language!) ● But we also need portability: CPUs change – From family to family – With each revision of silicon – To work around hardware bugs – As a given physical CPU ages

Slide 28

29 One of the ALUs Might Be Disabled: the same Intel Core 2 Architecture block diagram, with one of the ALUs crossed out (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted)

Slide 29

30 Thus, Simple Portable CPU Model: CPU → store buffer → cache, abstracting away the full Intel Core 2 block diagram (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)

Slide 30

31 And Lots Of CPUs Per System!!! Diagram: a 448-CPU system with eight sockets (CPUs 0-27 & 224-251 through CPUs 196-223 & 420-447), each socket’s cores backed by last-level caches and joined by an interconnect

Slide 31

32 Obstacles for Modern Computers

Slide 32

33 Obstacle: Pipeline Flush PIPELINE ERROR: BRANCH MISPREDICTION Running at full speed requires perfect branch prediction
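A feel for the cost of mispredictions requires nothing exotic. The following hedged C sketch (added here; the array size and the >=128 threshold are arbitrary choices) runs the same loop over sorted data, where the branch predictor is nearly always right, and over random data, where it is not. Compile with low optimization (say, -O1) so the compiler does not replace the branch with a conditional move or vectorize it away.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)

static int cmp(const void *a, const void *b)
{
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

static void count_big(const unsigned char *v, const char *label)
{
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        if (v[i] >= 128)    /* Unpredictable on random data. */
            sum++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%s: sum=%ld, %.3f s\n", label, sum,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void)
{
    unsigned char *v = malloc(N);

    srand(42);
    for (int i = 0; i < N; i++)
        v[i] = rand() & 0xff;
    count_big(v, "random");    /* Mispredicts roughly half the time. */
    qsort(v, N, 1, cmp);
    count_big(v, "sorted");    /* Almost perfectly predicted. */
    free(v);
    return 0;
}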

Slide 33

34 Obstacle: Memory Reference A single fetch all the way from memory can cost hundreds of clock cycles
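Those hundreds of cycles are easy to observe with a dependent-load (pointer-chasing) loop. Here is a hedged C sketch (added here; the 64 MB working set and the step count are arbitrary): each load’s address depends on the previous load’s value, so the CPU cannot overlap the misses, exposing the full memory latency.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(size_t))  /* ~64 MB: bigger than most LLCs. */

int main(void)
{
    size_t *next = malloc(N * sizeof(*next));
    struct timespec t0, t1;
    long steps = 10 * 1000 * 1000;
    size_t p = 0;

    /* Build one big random cycle (Sattolo's algorithm) so that
     * hardware prefetchers cannot guess the next address. */
    for (size_t i = 0; i < N; i++)
        next[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long s = 0; s < steps; s++)
        p = next[p];    /* Serialized: roughly one cache miss per step. */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("~%.1f ns per dependent load (p=%zu)\n",
           ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)steps, p);
    free(next);
    return 0;
}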

Slide 34

35 Obstacle: Atomic Operation Atomic operations require locking cachelines and/or buses, incurring significant delays
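The cost shows up even in the simplest shared counter. Here is a hedged C11 sketch (added here; thread and iteration counts are arbitrary): every atomic_fetch_add() must gain exclusive access to the counter’s cacheline, so the threads serialize. Timing it against a single-threaded run, or against per-thread counters, makes the delay visible.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS 10000000L

static atomic_long counter;

static void *hammer(void *arg)
{
    (void)arg;
    for (long i = 0; i < NITERS; i++)
        atomic_fetch_add(&counter, 1);  /* Exclusive cacheline access each time. */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, hammer, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld (expected %ld)\n",
           atomic_load(&counter), (long)NTHREADS * NITERS);
    return 0;
}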

Slide 35

36 Obstacle: Memory Barrier Memory barriers result in stalls and/or ordering constraints, again incurring delays
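For a concrete example, consider the classic store-buffering pattern in C11 (a hedged sketch added here, not from the talk): each thread stores to one variable and loads the other. Without the fences, both loads may return zero; with the two seq_cst fences, the r1 == 0 && r2 == 0 outcome is forbidden, and the fences are exactly where the stalls occur.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r1, r2;

static void *t0_fn(void *a)
{
    (void)a;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* The memory barrier. */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *t1_fn(void *a)
{
    (void)a;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* The memory barrier. */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, t0_fn, NULL);
    pthread_create(&b, NULL, t1_fn, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d (0/0 is forbidden by the fences)\n", r1, r2);
    return 0;
}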

Slide 36

37 Obstacle: Thermal Throttling Efficient use of CPU hardware generates heat, throttling the CPU clock frequency

Slide 37

38 Obstacle: Cache Miss Cache misses result in waiting for data to arrive (from memory or other CPUs) CACHE-MISS TOLL BOOTH

Slide 38

39 Obstacle: Input/Output Operation And here you thought that cache misses were slow...

Slide 39

40 Which Obstacles To Focus On? 1) I/O operations (but often a higher-level issue) 2) Communications cache misses 3) Memory barriers and atomic operations 4) Capacity/geometry cache misses (memory) 5) Branch prediction

Slide 40

41 Which Obstacles To Focus On? 1) I/O operations (but often a higher-level issue) 2) Communications cache misses 3) Memory barriers and atomic operations 4) Capacity/geometry cache misses (memory) 5) Branch prediction These obstacles can (usually) be overcome in a portable manner, as the false-sharing sketch below illustrates for communications cache misses.
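To make item 2 concrete, here is a hedged C sketch of false sharing (added here; the 64-byte line size is an assumption, and thread placement is left to the scheduler): two threads increment logically independent counters. With the counters packed into one cacheline, the line ping-pongs between CPUs; with the padding, each thread runs at cache speed. Compare the two configurations under time(1).

#include <pthread.h>
#include <stdio.h>

#define NITERS 100000000L

struct counters {
    volatile long a;
    char pad[64];   /* Remove this padding to see false sharing. */
    volatile long b;
};

static struct counters c;

static void *inc_a(void *arg)
{
    (void)arg;
    for (long i = 0; i < NITERS; i++)
        c.a++;
    return NULL;
}

static void *inc_b(void *arg)
{
    (void)arg;
    for (long i = 0; i < NITERS; i++)
        c.b++;
    return NULL;
}

int main(void)
{
    pthread_t ta, tb;

    pthread_create(&ta, NULL, inc_a, NULL);
    pthread_create(&tb, NULL, inc_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("a=%ld b=%ld\n", c.a, c.b);
    return 0;
}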

Slide 41

42 Xeon Platinum 8176 2.1GHz: CPU 0

Slide 42

43 Location Really Matters!!! Diagram: the same 448-CPU, eight-socket system (CPUs 0-27 & 224-251 through CPUs 196-223 & 420-447) with last-level caches and interconnect; latency depends on where the data currently resides

Slide 43

44 Latency Demonstration
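The talk demonstrates these latencies live; as a stand-in, here is a hedged Linux-only C sketch (added here) that bounces a single cacheline between two pinned CPUs. The CPU numbers 0 and 1 are assumptions; picking CPUs on different sockets of a machine like the one diagrammed above yields much larger numbers.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static atomic_int flag;    /* 0: ping's turn; 1: pong's turn. */

static void pin(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg)
{
    (void)arg;
    pin(1);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&flag) != 1)
            ;    /* Wait for ping's handoff. */
        atomic_store(&flag, 0);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec t0, t1;

    pin(0);
    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&flag) != 0)
            ;    /* Wait for pong's handoff. */
        atomic_store(&flag, 1);
    }
    pthread_join(t, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("~%.0f ns per one-way cacheline handoff\n",
           ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (2.0 * ROUNDS));
    return 0;
}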

Slide 44

49 Can Hardware Help???

Slide 45

50 Can Hardware Help??? Vacuum-gap transistor (source, drain, sub-atomic base): At these scales, the atmosphere is a vacuum!!!

Slide 46

51 We Really Can Hand-Place Atoms... Actually a carbon monoxide molecule that I moved across a few planes of copper

Slide 47

52 We Really Can Hand-Place Atoms... But not trillions of them in a cost-effective manner!!! Does Not Scale

Slide 48

53 Incremental Help From Hardware

Slide 49

54 Hardware 3D Integration 3 cm → 1.5 cm: Half the distance, twice the speed!!! Both stacked chiplets and lithographically stacked transistors.

Slide 50

55 Stacked Chiplets/Dies Diagram by Shmuel Csaba Otto Traian, CC BY-SA 4.0

Slide 51

56 Lithographically Stacked Transistors https://ieeexplore.ieee.org/document/9976473 https://spectrum.ieee.org/intels-stacked-nanosheet-transistors-could-be-the-next-step-in-moores-law

Slide 52

57 Hardware 3D Integration 3 cm → 1.5 cm: Half the distance, twice the speed* (* Give or take issues with power, cooling, alignment, interconnect drivers, and so on.)

Slide 53

58 Hardware Integration is Helping Q3 2017: 56 CPUs with ~100ns latencies

Slide 54

59 Hardware Integration is Helping November 2008: 16 CPUs with ~100ns latencies: More than 3x in nine years!!!

Slide 55

60 Hardware Accelerators

Slide 56

61 Hardware Accelerators, Theory Data Accelerator Unidirectional data flow, no out and back, twice the speed!!!

Slide 57

62 Hardware Accelerators, Practice Data Accelerator Sadly, back to request-response, but better latency with local memory? Accelerator-local memory System main memory

Slide 58

63 So Why Hardware Accelerators??? ● Optimized data transfers (e.g., larger blocks) ● Optimized hard-wired computation ● Better performance per watt ● Better performance per unit capital cost

Slide 59

64 Hardware Has Been Helping All Along

Slide 60

65 What Hardware Is Up Against CPUs Caches Interconnects Memories DRAM & NVM Protocol overheads (Mathematics!) Multiplexing & Demultiplexing (Electronics) Clock-domain transitions (Electronics) Phase changes (Chemistry) Light is way too slow in Cu and Si and atoms are way too big!!!

Slide 61

66 Therefore, Memory Hierarchies!!!

Slide 62

67 Simple Portable CPU Model Redux: CPU → store buffer (not yet) → cache, again abstracting the Intel Core 2 block diagram (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)

Slide 63

68 Read-Side Hardware Help (1/8) CPU 0 Cache CPU 3 Cache x=42,y=63 CPU 1 CPU 2 r1 = READ_ONCE(x) r2 = READ_ONCE(y) r3 = READ_ONCE(x) Request cacheline x

Slide 64

69 Read-Side Hardware Help (2/8) CPU 0 Cache CPU 3 Cache x=42,y=63 CPU 1 CPU 2 Request cacheline x r1 = READ_ONCE(x) r2 = READ_ONCE(y) r3 = READ_ONCE(x)

Slide 65

70 Read-Side Hardware Help (3/8) CPU 0 Cache CPU 3 Cache x=42,y=63 CPU 1 CPU 2 Request cacheline x r1 = READ_ONCE(x) r2 = READ_ONCE(y) r3 = READ_ONCE(x)

Slide 66

71 Read-Side Hardware Help (4/8) CPU 0 Cache CPU 3 Cache CPU 1 CPU 2 Cacheline x = 42, y = 63 r1 = READ_ONCE(x) r2 = READ_ONCE(y) r3 = READ_ONCE(x)

Slide 67

72 Read-Side Hardware Help (5/8) CPU 0 Cache x=42,y=63 CPU 3 Cache CPU 1 CPU 2 Cacheline x = 42, y = 63 r1 = READ_ONCE(x) r2 = READ_ONCE(y) r3 = READ_ONCE(x)

Slide 68

73 Read-Side Hardware Help (6/8) CPU 0 Cache x=42,y=63 CPU 3 Cache CPU 1 CPU 2 r1 = READ_ONCE(x) r2 = READ_ONCE(y) r3 = READ_ONCE(x)

Slide 69

74 Read-Side Hardware Help (7/8) CPU 0 Cache x=42,y=63 CPU 3 Cache CPU 1 CPU 2 Caches help beat laws of physics given spatial locality!!! r1 = READ_ONCE(x) r2 = READ_ONCE(y) r3 = READ_ONCE(x)

Slide 70

75 Read-Side Hardware Help (8/8) CPU 0 Cache x=42,y=63 CPU 3 Cache CPU 1 CPU 2 Caches help beat laws of physics given temporal locality!!! r1 = READ_ONCE(x) r2 = READ_ONCE(y) r3 = READ_ONCE(x)
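Spatial locality is easy to demonstrate from ordinary code. A hedged C sketch (added here; the 8192x8192 matrix size is arbitrary): summing row by row touches consecutive addresses, so each cacheline miss serves many elements, while summing column by column misses on nearly every access.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 8192

static void timed_sum(const int *m, int bycols)
{
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += bycols ? m[j * (long)N + i] : m[i * (long)N + j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%s: sum=%ld, %.3f s\n", bycols ? "columns" : "rows   ", sum,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void)
{
    int *m = calloc((long)N * N, sizeof(*m));    /* 256 MB of zeros. */

    if (!m)
        return 1;
    timed_sum(m, 0);    /* Good spatial locality. */
    timed_sum(m, 1);    /* Poor spatial locality. */
    free(m);
    return 0;
}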

Slide 71

76 Levels of Cache on My Laptop
Index  Line Size  Associativity  Size
0      64         8              32K
1      64         8              32K
2      64         4              256K
3      64         16             16,384K
When taking on the laws of physics, don’t be afraid to use a few transistors

Slide 72

77 Levels of Cache on Large Old Server
Index  Line Size  Associativity  Size
0      64         8              32K
1      64         6              32K
2      64         16             1,024K
3      64         11             39,424K
When taking on the laws of physics, don’t be afraid to use lots of transistors
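On Linux, these tables come straight from sysfs, so a short hedged C sketch (added here, not from the talk) can print the same information for whatever machine it runs on:

#include <stdio.h>
#include <string.h>

static int read_field(int idx, const char *field, char *buf, int len)
{
    char fn[160];
    FILE *fp;

    snprintf(fn, sizeof(fn),
             "/sys/devices/system/cpu/cpu0/cache/index%d/%s", idx, field);
    fp = fopen(fn, "r");
    if (!fp)
        return 0;
    if (!fgets(buf, len, fp))
        buf[0] = '\0';
    buf[strcspn(buf, "\n")] = '\0';    /* Strip trailing newline. */
    fclose(fp);
    return 1;
}

int main(void)
{
    char lvl[16], type[32], line[16], ways[16], size[16];

    for (int idx = 0; read_field(idx, "level", lvl, sizeof(lvl)); idx++) {
        read_field(idx, "type", type, sizeof(type));
        read_field(idx, "coherency_line_size", line, sizeof(line));
        read_field(idx, "ways_of_associativity", ways, sizeof(ways));
        read_field(idx, "size", size, sizeof(size));
        printf("index %d: L%s %-12s line=%s ways=%s size=%s\n",
               idx, lvl, type, line, ways, size);
    }
    return 0;
}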

Slide 73

79 But What About Writes?

Slide 74

80 Write-Side Hardware Help (Summary) ● Store buffers for the win!!! Sort of… – Cache line for variable x is initially at CPU 3 – CPU 0 writes 1 to x, but doesn't have the cacheline ● So it holds the write in CPU 0's store buffer ● And requests exclusive access to the cacheline (which takes time) – CPU 3 reads x, obtaining “0” immediately from its cacheline – CPU 0 receives x's cacheline ● And CPU 0's write finally gets to the cacheline ● Overwriting the value that CPU 3 read, despite the write starting earlier ● Writes only appear to be instantaneous!!!
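The scenario above is the classic store-buffering litmus test. The hedged C11/pthreads sketch below (added here; the trial count is arbitrary) uses relaxed atomics so that both the compiler and the hardware store buffers may reorder; on common x86 and Arm systems it can usually observe the counterintuitive "both reads return 0" outcome within a few thousand trials.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y, go;
static int r1, r2;

static void *writer_x(void *arg)
{
    (void)arg;
    while (!atomic_load(&go))
        ;    /* Start both threads as close together as possible. */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *writer_y(void *arg)
{
    (void)arg;
    while (!atomic_load(&go))
        ;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    long both_zero = 0, trials = 100 * 1000;

    for (long t = 0; t < trials; t++) {
        pthread_t a, b;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        atomic_store(&go, 0);
        pthread_create(&a, NULL, writer_x, NULL);
        pthread_create(&b, NULL, writer_y, NULL);
        atomic_store(&go, 1);    /* Release both threads together. */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            both_zero++;    /* The store buffers at work. */
    }
    printf("r1 == 0 && r2 == 0: %ld of %ld trials\n", both_zero, trials);
    return 0;
}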

Slide 75

81 Simple Portable CPU Model Redux: CPU → store buffer (now!!!) → cache (Intel Core 2 block diagram by Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)

Slide 76

82 Write-Side Hardware Help (1/7) CPU 0 Store Buffer Cache CPU 3 Store Buffer Cache x=0 CPU 1 CPU 2 WRITE_ONCE(x, 1) READ_ONCE(x)

Slide 77

83 Write-Side Hardware Help (2/7) CPU 0 Store Buffer x=1 Cache CPU 3 Store Buffer Cache x=0 CPU 1 CPU 2 Request cacheline x WRITE_ONCE(x, 1) READ_ONCE(x) The store buffer allows writes to complete quickly!!! Take that, laws of physics!!!

Slide 78

84 Write-Side Hardware Help (3/7) CPU 0 Store Buffer x=1 Cache CPU 3 Store Buffer Cache x=0 CPU 1 CPU 2 Request cacheline x WRITE_ONCE(x, 1) READ_ONCE(x) Except that the later read gets the older value...

Slide 79

85 Write-Side Hardware Help (4/7) CPU 0 Store Buffer x=1 Cache CPU 3 Store Buffer Cache x=0 CPU 1 CPU 2 Request cacheline x WRITE_ONCE(x, 1) READ_ONCE(x)

Slide 80

86 Write-Side Hardware Help (5/7) CPU 0 Store Buffer x=1 Cache CPU 3 Store Buffer Cache CPU 1 CPU 2 WRITE_ONCE(x, 1) READ_ONCE(x) Respond with cacheline x = 0

Slide 81

87 Write-Side Hardware Help (6/7) CPU 0 Store Buffer x=1 Cache x=0 CPU 3 Store Buffer Cache CPU 1 CPU 2 WRITE_ONCE(x, 1) READ_ONCE(x) Respond with cacheline x = 0

Slide 82

88 Write-Side Hardware Help (7/7) CPU 0 Store Buffer Cache x=1 CPU 3 Store Buffer Cache CPU 1 CPU 2 WRITE_ONCE(x, 1) READ_ONCE(x) Quick write completion, sort of. Laws of physics: Slow or misordered!!!

Slide 83

89 Misordering? Or Propagation Delay? Timeline diagram: CPU 0 executes WRITE_ONCE(x, 1) while CPU 3 executes r1 = READ_ONCE(x) == 0; x == 0 holds early on the timeline and x == 1 later, with an “fr” (from-read) link from CPU 3's read back to CPU 0's write

Slide 84

90 And Careful What You Wish For!!! The same timeline: CPU 0 executes WRITE_ONCE(x, 1), CPU 3 executes r1 = READ_ONCE(x) == 0, with the “fr” (from-read) link spanning the propagation delay. Hardware tricks help reduce the red triangle. But too bad about Meltdown and Spectre...

Slide 85

91 Can Software Help?

Slide 86

92 Can Software Help? ● Increasingly, yes!!! ● Use concurrent libraries, applications, subsystems, and so on – Let a few do the careful coding and tuning – Let a great many benefit from the work of a few ● Use proper APIs to deal with memory ordering – Chapter 14: “Advanced Synchronization: Memory Ordering” https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
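As one example of such an API, C11 release/acquire (the Linux kernel's smp_store_release()/smp_load_acquire() are the analogous primitives) lets an ordinary payload store be safely published without a full barrier. This is a hedged sketch added here, not code from the talk or from perfbook:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;        /* Plain data, protected by ordering. */
static atomic_int ready;

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;          /* Ordinary store... */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;           /* ...made visible by the release store. */
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                  /* Acquire load pairs with the release store. */
    printf("payload = %d\n", payload);    /* Guaranteed to print 42. */
    return NULL;
}

int main(void)
{
    pthread_t p, c;

    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}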

Slide 87

93 Summary

Slide 88

94 Summary ● Modern hardware is highly optimized – Most of the time! – Incremental improvements due to integration – But the speed of light is too slow and atoms too big ● Use concurrent software where available ● Structure your code to avoid the big obstacles – Micro-optimizations only sometimes needed

Slide 89

95 For More Information ● “Computer Architecture: A Quantitative Approach”, Hennessy & Patterson (recent hardware) ● “Parallel Computer Architecture: A Hardware/Software Approach”, Culler et al. – Includes SGI Origin and Sequent NUMA-Q ● “Programming Massively Parallel Processors: A Hands-on Approach”, Kirk & Hwu – Primarily NVIDIA GPGPUs ● https://developer.nvidia.com/educators/existing-courses – List of NVIDIA university courseware ● https://gpuopen.com/professional-compute – List of AMD GPGPU-related content ● “Is Parallel Programming Hard, And, If So, What Can You Do About It?” – https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

Slide 90

96 Paul’s Wife’s Concurrency Antidote ● Wild Himalayan Blackberry Cordial – Put two liters of blackberries in a four-liter jar – Add five-eighths of a liter of sugar – Fill the jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through a sieve: Add the berries to ice cream, consume the filtered liquid as you wish

Slide 91

97 Paul’s Wife’s Concurrency Antidote ● Wild Himalayan Blackberry Cordial – Put 8 cups wild Himalayan blackberries in 1 gallon jar – Add 2½ cups sugar – Fill jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through sieve: Add berries to ice cream, consume filtered liquid as you wish

Slide 92

98 Paul’s Wife’s Concurrency Antidote ● Wild Himalayan Blackberry Cordial – 8 cups wild Himalayan blackberries in 1 gallon jar – Add 2½ cups sugar – Fill jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through sieve: Add berries to ice cream, consume filtered liquid as you wish If there is no right tool, invent it!!! Questions?