Slide 1

Slide 1 text

Case Study: Concurrent Counting As Easy as 1, 2, 3! © 2024 Meta Platforms Paul E. McKenney, Meta Platforms Kernel Team Kernel Recipes, September 24, 2024 https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 5

Slide 2

Slide 2 text

2 Recipe for Concurrent Counting ● One pinch knowledge of laws of physics ● Modest understanding of hardware ● Thorough understanding of requirements ● Careful design, including synchronization ● Brutal validation https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

Slide 3

Slide 3 text

3 Recipe for Concurrent Counting ● One pinch knowledge of laws of physics ● Modest understanding of hardware ● Thorough understanding of requirements ● Careful design, including synchronization ● Brutal validation https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

Slide 4

Slide 4 text

4 Recipe for Concurrent Counting ● One pinch knowledge of laws of physics ● Modest understanding of hardware ● Thorough understanding of requirements ● Careful design, including synchronization ● Brutal validation https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

Slide 5

Slide 5 text

5 Distribution

Slide 6

Slide 6 text

6 What Laws of Physics???

Slide 7

Slide 7 text

7 Laws of Physics: Atoms Are Too Big!!! Each spot is an atom. Qingxiao Wang/UT Dallas ca. 2016.

Slide 8

Slide 8 text

8 Laws of Physics: Atoms Are Too Big!!! Each spot is an atom. Qingxiao Wang/UT Dallas ca. 2016. Speed controlled by base thickness: At least one atom thick!!!

Slide 9

Slide 9 text

9 Laws of Physics: Light Is Too Slow!!! “One nanosecond per foot” courtesy of Grace Hopper (https://www.youtube.com/watch?v=9eyFDBPk4Yw) https://en.wikipedia.org/wiki/List_of_refractive_indices A 50% sugar solution is “light syrup”. ● Following in the footsteps of Admiral Hopper: – Light goes 11.803 inches/ns in a vacuum ● Or, if you prefer, 1.0097 lengths of A4 paper per nanosecond ● Light goes 1 width of A4 paper per nanosecond in 50% sugar solution – But over and back: 5.9015 in/ns – But not 1GHz! Instead, ~2GHz: ~3in/ns – But Cu: ~1 in/ns, or Si transistors: ~0.1 in/ns – Plus other slowdowns: protocols, electronics, ...

Slide 10

Slide 10 text

10 Laws of Physics: Data Is Slower!!! CPUs Caches Interconnects Memories DRAM & NVM Protocol overheads (Mathematics!) Multiplexing & Demultiplexing (Electronics) Clock-domain transitions (Electronics) Phase changes (Chemistry) Light is way too slow in Cu and Si and atoms are way too big!!!

Slide 11

Slide 11 text

11 Laws of Physics: Summary ● The speed of light is finite (especially in Cu and Si) and atoms are of non-zero size ● Mathematics, electronics, and chemistry also take their toll ● Systems are fast, so this matters “Gentlemen, you have two fundamental problems: (1) the finite speed of light and (2) the atomic nature of matter.” * * Gordon Moore quoting Stephen Hawking

Slide 12

Slide 12 text

12 Laws of Physics: Summary ● The speed of light is finite (especially in Cu and Si) and atoms are of non-zero size ● Mathematics, electronics, and chemistry also take their toll ● Systems are fast, so this matters “Gentlemen, you have two fundamental problems: (1) the finite speed of light and (2) the atomic nature of matter.” * * Gordon Moore quoting Stephen Hawking Hardware architects are well aware of all this

Slide 13

Slide 13 text

13 Superscalar Execution For The Win!!! Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted) 128 entry ITLB 32 KB Instruction Cache (8 way) 32 Byte Pre-Decode, Fetch Buffer 18 Entry Instruction Queue 128 Bit 6 Instructions Instruction Fetch Unit Micro- code Simple Decoder Simple Decoder Simple Decoder Complex Decoder 1μop 1μop 1μop 4μops 7 Entry μop Buffer 4μops Register Alias Table and Allocator 96 Entry Reorder Buffer (ROB) Retirement Register File 4μops 4μops Store Data Store Address SSE ALU ALU Branch SSE Shuffle MUL ALU SSE Shuffle ALU ALU Load Address 4μops 32 Entry Reservation Station 128 Bit FMUL FDIV 128 Bit FADD Memory Ordering Buffer (MOB) 128 Bit 32 KB Dual Ported Data Cache (8 way) 16 entry DTLB Store 128 Bit Load 256 entry L2 DTLB Shared L2 Cache (16 way) Shared Bus Interface Unit 256 Bit

Slide 14

Slide 14 text

14 Don’t Make ‘em Like They Used To 4.0 GHz clock, 20 MB L3 cache, 20 stage pipeline... The only pipeline I need is to cool off that hot-headed brat.

Slide 15

Slide 15 text

15 Don’t Make ‘em Like They Used To 4.0 GHz clock, 20 MB L3 cache, 20 stage pipeline... The only pipeline I need is to cool off that hot-headed brat. And this is 1980s MC68000: Try 1970s!

Slide 16

Slide 16 text

16 Paul’s and Bryan’s 1977 Schoolwork “And we were #$(*&@# happy to have 4Kx12 bits!!!” https://www.rcsri.org/collection/pdp-12/skylark-ps.txt Photo: The Retro-Computing Society of Rhode Island, Inc. DEC PDP-12 625 KHz CPU (1.6μs/ref) One register (why more?) 12-bit instructions 1.6μs 4Kx12 bits core memory No stack & no caches 512x512 graphics (green!) No graphics frame buffer Size of a refrigerator sin() & cos() < 21μs Photo in museum

Slide 17

Slide 17 text

17 Paul’s and Bryan’s 1977 Schoolwork “And we were #$(*&@# happy to have 4Kx12 bits!!!” https://www.rcsri.org/collection/pdp-12/skylark-ps.txt Photo: The Retro-Computing Society of Rhode Island, Inc. DEC PDP-12 625 KHz CPU (1.6μs/ref) One register (why more?) 12-bit instructions 1.6μs 4Kx12 bits core memory No stack & no caches 512x512 graphics (green!) No graphics frame buffer Size of a refrigerator sin() & cos() < 21μs Photo in museum Nice and simple, but slower than a herd of turtles in a truckload of tar!!!

Slide 18

Slide 18 text

18 Account For All CPU Complexity??? ● Sometimes, yes! (Assembly language!) ● But we also need portability: CPUs change – From family to family – With each revision of silicon – To work around hardware bugs – As a given physical CPU ages

Slide 19

Slide 19 text

19 One of the ALUs Might Be Disabled Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted) 128 entry ITLB 32 KB Instruction Cache (8 way) 32 Byte Pre-Decode, Fetch Buffer 18 Entry Instruction Queue 128 Bit 6 Instructions Instruction Fetch Unit Micro- code Simple Decoder Simple Decoder Simple Decoder Complex Decoder 1μop 1μop 1μop 4μops 7 Entry μop Buffer 4μops Register Alias Table and Allocator 96 Entry Reorder Buffer (ROB) Retirement Register File 4μops 4μops Store Data Store Address SSE ALU ALU Branch SSE Shuffle MUL ALU SSE Shuffle ALU ALU Load Address 4μops 32 Entry Reservation Station 128 Bit FMUL FDIV 128 Bit FADD Memory Ordering Buffer (MOB) 128 Bit 32 KB Dual Ported Data Cache (8 way) 16 entry DTLB Store 128 Bit Load 256 entry L2 DTLB Shared L2 Cache (16 way) Shared Bus Interface Unit 256 Bit X X

Slide 20

Slide 20 text

20 Thus, Simple Portable CPU Model 128 entry ITLB 32 KB Instruction Cache (8 way) 32 Byte Pre-Decode, Fetch Buffer 18 Entry Instruction Queue 128 Bit 6 Instructions Instruction Fetch Unit Micro- code Simple Decoder Simple Decoder Simple Decoder Complex Decoder 1μop 1μop 1μop 4μops 7 Entry μop Buffer 4μops Register Alias Table and Allocator 96 Entry Reorder Buffer (ROB) Retirement Register File 4μops 4μops Store Data Store Address SSE ALU ALU Branch SSE Shuffle MUL ALU SSE Shuffle ALU ALU Load Address 4μops 32 Entry Reservation Station 128 Bit FMUL FDIV 128 Bit FADD Memory Ordering Buffer (MOB) 128 Bit 32 KB Dual Ported Data Cache (8 way) 16 entry DTLB Store 128 Bit Load 256 entry L2 DTLB Shared L2 Cache (16 way) Shared Bus Interface Unit 256 Bit CPU CPU Store Store Buffer Buffer Cache Cache Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)

Slide 21

Slide 21 text

21 CPU, Store Buffer, Cache, and “Bus”

Slide 22

Slide 22 text

22 CPU, Store Buffer, Cache, and “Bus” ● CPU computes with words (e.g., 64 bits) ● Store buffer holds words waiting for cache lines ● Cache is a hardware hash table containing “cache lines” (e.g., 64 bytes which is 512 bits) ● “Bus” (AKA “interconnect”) communicates cache lines among CPUs’ caches and memory

Slide 23

Slide 23 text

23 CPU, Store Buffer, Cache, and “Bus” CPU 0 Store Buffer Cache CPU 3 Store Buffer Cache CPU 1 CPU 2 Communication is slow! So use locality of reference to avoid unnecessary communication!!! “Bus” carries cache lines Memory

Slide 24

Slide 24 text

24 CPU, Store Buffer, Cache, and “Bus” CPU 0 Store Buffer Cache CPU 3 Store Buffer Cache CPU 1 CPU 2 Communication is slow! So use locality of reference to avoid unnecessary communication!!! “Bus” carries cache lines In later examples, data will have no chance to reach memory

Slide 25

Slide 25 text

25 Let’s Do Some Concurrent Counting!!!

Slide 26

Slide 26 text

26 Just Count Concurrently!!! unsigned long ctr; void inc_count(void) { ctr++; } unsigned long read_count(void) { return ctr; }

Slide 27

Slide 27 text

27 Just Count Concurrently!!! unsigned long ctr; void inc_count(void) { ctr++; } unsigned long read_count(void) { return ctr; } Anyone care to critique this code?

Slide 28

Slide 28 text

28 Just Count Concurrently!!! unsigned long ctr; void inc_count(void) { ctr++; } unsigned long read_count(void) { return ctr; } Anyone care to critique this code? Running on 6 CPUs loses 87% of the counts, which is not good!

Slide 29

Slide 29 text

29 Why Are Counts Lost???

Slide 30

Slide 30 text

30 Why Are Counts Lost??? (1/7) CPU 0 Store Buffer Cache CPU 3 Store Buffer Cache ctr=0 CPU 1 CPU 2 ctr++; ctr++;

Slide 31

Slide 31 text

31 Why Are Counts Lost??? (2/7) CPU 0 Store Buffer ctr=1 Cache CPU 3 Store Buffer Cache ctr=1 CPU 1 CPU 2 Request cacheline ctr The store buffer allows increments to complete quickly!!! Take that, laws of physics!!! ctr++; ctr++;

Slide 32

Slide 32 text

32 Why Are Counts Lost??? (3/7) CPU 0 Store Buffer ctr=2 Cache CPU 3 Store Buffer Cache ctr=2 CPU 1 CPU 2 Request cacheline ctr ctr++; ctr++;

Slide 33

Slide 33 text

33 Why Are Counts Lost??? (4/7) CPU 0 Store Buffer ctr=3 Cache CPU 3 Store Buffer Cache ctr=3 CPU 1 CPU 2 Request cacheline ctr ctr++; ctr++;

Slide 34

Slide 34 text

34 Why Are Counts Lost??? (5/7) CPU 0 Store Buffer ctr=4 Cache CPU 3 Store Buffer ctr=4 Cache CPU 1 CPU 2 Respond with cacheline ctr=3 ctr++; ctr++;

Slide 35

Slide 35 text

35 Why Are Counts Lost??? (6/7) CPU 0 Store Buffer ctr=5 Cache ctr=3 CPU 3 Store Buffer ctr=5 Cache CPU 1 CPU 2 Respond with cacheline ctr=3 ctr++; ctr++;

Slide 36

Slide 36 text

36 Why Are Counts Lost??? (7/7) CPU 0 Store Buffer Cache ctr=6 CPU 3 Cache CPU 1 CPU 2 ctr++; ctr++; Store Buffer ctr=6

Slide 37

Slide 37 text

37 Why Are Counts Lost??? (7/7 Redux) CPU 0 Store Buffer Cache ctr=6 CPU 3 Cache CPU 1 CPU 2 Quick write completion, sort of. Laws of physics: Slow or lost counts!!! ctr++; ctr++; Store Buffer ctr=6 Here we add Here we just overwrite

Slide 38

Slide 38 text

38 Why Are Counts Lost??? (7/7 Redux) CPU 0 Store Buffer Cache ctr=6 CPU 3 Cache CPU 1 CPU 2 Quick write completion, sort of. Laws of physics: Slow or lost counts!!! ctr++; ctr++; Store Buffer ctr=6 Here we add Here we just overwrite This is why we have atomic operations

Slide 39

Slide 39 text

39 Just Count Atomically!!!

Slide 40

Slide 40 text

40 Just Count Atomically!!! atomic_t ctr; void inc_count(void) { atomic_inc(&ctr); } int read_count(void) { return atomic_read(&ctr); }

Slide 41

Slide 41 text

41 Just Count Atomically!!! atomic_t ctr; void inc_count(void) { atomic_inc(&ctr); } int read_count(void) { return atomic_read(&ctr); } Anyone care to critique this code?

Slide 42

Slide 42 text

42 Counting Atomically Does Not Scale 448-CPU Intel(R) Xeon(R) Platinum 8176 @ 2.10GHz

Slide 43

Slide 43 text

43 Counting Atomically Does Not Scale 448-CPU Intel(R) Xeon(R) Platinum 8176 @ 2.10GHz

Slide 44

Slide 44 text

44 And Still Does Not Scale!!! 166-CPU AMD EPYC-Milan Processor @ 2.0GHz

Slide 45

Slide 45 text

45 Spare a Thought for Those CPUs! One one thousand. Two one thousand. Three one thousand...

Slide 46

Slide 46 text

46 Cache Line Thrashes Among CPUs!!! Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Last-Level Cache Interconnect Interconnect CPUs 0-27 & CPUs 224-251 CPUs 28-55 & CPUs 252-279 CPUs 56-83 & CPUs 280-307 CPUs 84-111 & CPUs 308-335 CPUs 112-139 & CPUs 336-363 CPUs 140-167 & CPUs 364-391 CPUs 168-195 & CPUs 392-419 CPUs 196-223 & CPUs 420-447 HW optimizations? https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf Chapter 5 Quick Quiz 5.11

Slide 47

Slide 47 text

47 Counting Atomically Does Not Scale 448-CPU Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz So we should always avoid atomics?

Slide 48

Slide 48 text

48 Counting Atomically Can Be OK... 448-CPU Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz Here is just fine! Here, not so much!!!

Slide 49

Slide 49 text

49 Counting Atomically Can Be OK... 448-CPU Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz Here is just fine! Here, not so much!!! But sometimes we can do way better!

Slide 50

Slide 50 text

50 How Accurate is the Reader’s Value? n = read_count(); do_something(n); // Here? do_something_else(n); // How about here??? do_other_thing(n); // And here???

Slide 51

Slide 51 text

51 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) Value of ctr Time

Slide 52

Slide 52 text

52 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time

Slide 53

Slide 53 text

53 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time Stale value!

Slide 54

Slide 54 text

54 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time Stale value! The count “n” is stale before it even has a chance to be returned

Slide 55

Slide 55 text

55 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time Stale value! The count “n” is stale before it even has a chance to be returned So why pay such a high cost for a value that is stale anyway???

Slide 56

Slide 56 text

56 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time Stale value! The count “n” is stale before it even has a chance to be returned So why pay such a high cost for a value that is stale anyway??? How can we do better if reads are infrequent???

Slide 57

Slide 57 text

57 How to Count Better??? (1/7) CPU 0 Store Buffer Cache ctr0=0 CPU 3 Store Buffer Cache ctr3=0 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=0 Store Buffer Cache ctr2=0 ctr1++; ctr2++; ctr2++;

Slide 58

Slide 58 text

58 How to Count Better??? (2/7) CPU 0 Store Buffer Cache ctr0=1 CPU 3 Store Buffer Cache ctr3=1 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=1 Store Buffer Cache ctr2=1 ctr1++; ctr2++; ctr2++;

Slide 59

Slide 59 text

59 How to Count Better??? (3/7) CPU 0 Store Buffer Cache ctr0=2 CPU 3 Store Buffer Cache ctr3=2 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=2 Store Buffer Cache ctr2=2 ctr1++; ctr2++; ctr2++;

Slide 60

Slide 60 text

60 How to Count Better??? (4/7) CPU 0 Store Buffer Cache ctr0=3 CPU 3 Store Buffer Cache ctr3=3 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=3 Store Buffer Cache ctr2=3 ctr1++; ctr2++; ctr2++;

Slide 61

Slide 61 text

61 How to Count Better??? (5/7) CPU 0 Store Buffer Cache ctr0=4 CPU 3 Store Buffer Cache ctr3=3 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=4 Store Buffer Cache ctr2=4 ctr1++; ctr2++; read_count(); Request ctr0, ctr1, and ctr2

Slide 62

Slide 62 text

62 How to Count Better??? (6/7) CPU 0 Store Buffer Cache ctr0=4 (S) CPU 3 Store Buffer Cache ctr0=4, ctr1=4, ctr2=4, ctr3=3 CPU 1 CPU 2 Store Buffer Cache ctr1=4 (S) Store Buffer Cache ctr2=4 (S) read_count(); Respond with ctr0, ctr1, and ctr2

Slide 63

Slide 63 text

63 How to Count Better??? (7/7) CPU 0 Store Buffer Cache ctr0=4 (S) CPU 3 Store Buffer Cache ctr0=4, ctr1=4, ctr2=4, ctr3=3 CPU 1 CPU 2 Store Buffer Cache ctr1=4 (S) Store Buffer Cache ctr2=4 (S) read_count(); (returns 4+4+4+3 = 15)

Slide 64

Slide 64 text

64 How to Count Better Assessment ● Updates are very fast: Non-atomic increment of a local counter variable with no ordering ● Reads are very slow: Sum up all threads’ counters and return this sum – Maximum error limited by the potential change in platonic counter value during summation

Slide 65

Slide 65 text

65 Updates: Accuracy & Scalability! 166-CPU AMD EPYC-Milan Processor @ 2.0GHz

Slide 66

Slide 66 text

66 Reads: Not So Much... 166-CPU AMD EPYC-Milan Processor @ 2.0GHz

Slide 67

Slide 67 text

67 How to Count Better: Pseudocode unsigned long counter; /* one instance per thread; counterp[t] points to thread t’s counter */ static inline void inc_count(void) { WRITE_ONCE(counter, counter + 1); } static inline unsigned long read_count(void) { int t; unsigned long sum = 0; for_each_thread(t) sum += READ_ONCE(*counterp[t]); return sum; } For more details: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Section 5.2

Slide 68

Slide 68 text

68 How to Count Better: Pseudocode unsigned long counter; /* one instance per thread; counterp[t] points to thread t’s counter */ static inline void inc_count(void) { WRITE_ONCE(counter, counter + 1); } static inline unsigned long read_count(void) { int t; unsigned long sum = 0; for_each_thread(t) sum += READ_ONCE(*counterp[t]); return sum; } Updates a few ns, reads 100s of μs!

Slide 69

Slide 69 text

69 How to Count Better: Pseudocode unsigned long counter; /* one instance per thread; counterp[t] points to thread t’s counter */ static inline void inc_count(void) { WRITE_ONCE(counter, counter + 1); } static inline unsigned long read_count(void) { int t; unsigned long sum = 0; for_each_thread(t) sum += READ_ONCE(*counterp[t]); return sum; } Updates a few ns, reads 100s of μs! But can we read even faster???

Slide 70

Slide 70 text

70 Speed-Reading How-to: Pseudocode unsigned long counter_sum; static inline unsigned long read_count_fast(void) { return READ_ONCE(counter_sum); } void update_counter_sum(void) { for (;;) { WRITE_ONCE(counter_sum, read_count()); // Apply many updates: Batch! schedule_timeout_idle(10); // Sleep 10 jiffies: Delay! } } For more details: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Section 5.2.4

Slide 71

Slide 71 text

71 Accuracy & Scalability for Both!!!

Slide 72

Slide 72 text

72 Accuracy & Scalability for Both!!! All operations a few nanoseconds, so why not just use this everywhere?

Slide 73

Slide 73 text

73 Speed-Reading How-to: Tradeoffs For more details: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Section 5.2.4 ● Need an extra thread (per counter?) ● That thread’s wakeups not good for batteries ● Extra delay from update to read ● Counter reads are normally very infrequent, so why bother? – Few will care about 160µs every five seconds...

Slide 74

Slide 74 text

74 Other Complications ● Approximate limits (Section 5.3) ● Exact limits (Section 5.4) – Without atomic operations (Section 5.4.3) – Only sometimes (Section 5.4.6) ● Validation (Section 5.5.1)

Slide 75

Slide 75 text

75 Linux-Kernel Case Studies

Slide 76

Slide 76 text

76 Statistical Counters (Case 1/3)

Slide 77

Slide 77 text

77 Statistical Counters (Case 1/3) ● Heavily used: More than 300 uses of this_cpu_inc() and __this_cpu_inc() ● Additional uses as throughput counters – For example this_cpu_inc() to count network packets and this_cpu_add() to count bytes ● Typically open-coded ● Taken for granted since at least 1987!

Slide 78

Slide 78 text

78 Per-CPU Refcounts (Case 2/3)

Slide 79

Slide 79 text

79 Per-CPU Refcounts (Case 2/3) ● Switch from per-CPU to global mode: – Start with counter equal to 1 – Per-CPU refcounts used in common case – At tear-down time, switch to global refcount ● Combine sum of per-CPU refcounts to global ● If tearing down, wait for RCU grace period – Remove initial reference ● Can also switch back to per-CPU if not tearing down

Slide 80

Slide 80 text

80 Per-CPU Refcounts (Case 2/3) pcrc0 pcrc1 pcrc2 pcrc3 global pcrc0 pcrc1 pcrc2 pcrc3 percpu_ref_switch_to_atomic_rcu() percpu_ref_switch_to_percpu() Time

Slide 81

Slide 81 text

81 Per-CPU Refcounts (Case 2/3) pcrc0 pcrc1 pcrc2 pcrc3 global pcrc0 pcrc1 pcrc2 pcrc3 Often to tear down percpu_ref_switch_to_atomic_rcu() percpu_ref_switch_to_percpu() Time

Slide 82

Slide 82 text

82 Per-CPU Refcounts (Case 2/3) pcrc0 pcrc1 pcrc2 pcrc3 global pcrc0 pcrc1 pcrc2 pcrc3 Often to tear down percpu_ref_switch_to_atomic_rcu() percpu_ref_switch_to_percpu() Time Study the percpu-ref code before attempting to do this yourself!

Slide 83

Slide 83 text

83 Per-CPU Refcounts (Case 2/3) pcrc0 pcrc1 pcrc2 pcrc3 global pcrc0 pcrc1 pcrc2 pcrc3 Often to tear down percpu_ref_switch_to_atomic_rcu() percpu_ref_switch_to_percpu() Study the percpu-ref code before attempting to do this yourself! Synchronize in per-CPU mode? Time

Slide 84

Slide 84 text

84 SRCU Counts of Readers (Case 3/3)

Slide 85

Slide 85 text

85 SRCU Counts of Readers (Case 3/3) lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current CPU 0 CPU 1 CPU 2

Slide 86

Slide 86 text

86 SRCU Counts of Readers (Case 3/3) lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_lock() CPU 0 CPU 1 CPU 2 lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current

Slide 87

Slide 87 text

87 SRCU Counts of Readers (Case 3/3) lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_lock() CPU 0 CPU 1 CPU 2 lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_unlock() lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:1 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:1 unlock[1]:0 Current srcu_read_lock()

Slide 88

Slide 88 text

88 SRCU Counts of Readers (Case 3/3) lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_lock() CPU 0 CPU 1 CPU 2 lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_unlock() lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:1 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:1 unlock[1]:0 Current srcu_read_lock() Please study the SRCU code before attempting to code this yourself!

Slide 89

Slide 89 text

89 Counting’s General Lessons

Slide 90

Slide 90 text

90 Counting’s General Lessons ● Partitioning helps performance and scalability ● Partial partitioning helps, at least partially ● Batching updates can also help ● Careful use of delay can help greatly (Chapter 9) ● Engineering, not science: Thus tradeoffs – Hardware and workloads affect design

Slide 91

Slide 91 text

91 Counting in Practice

Slide 92

Slide 92 text

92 Summary

Slide 93

Slide 93 text

93 Summary ● Modern hardware is highly optimized – Most of the time! – Incremental improvements due to integration – But the speed of light is too slow and atoms too big ● Use concurrent software where available ● Structure your code & data to avoid big obstacles – Plenty of tricks for fast concurrent counting!

Slide 94

Slide 94 text

94 For More Information ● “Is Parallel Programming Hard, And, If So, What Can You Do About It?” – https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html – Chapter 3 (“Hardware and its Habits”) and Chapter 5 (“Counting”) ● “What every programmer should know about memory, Part 1” – https://lwn.net/Articles/250967/ (contains links to parts 2-9) ● “Who's afraid of a big bad optimizing compiler?” – https://lwn.net/Articles/793253/ ● “Calibrating your fear of big bad optimizing compilers” – https://lwn.net/Articles/799218/ ● “What Every C Programmer Should Know About Undefined Behavior #3/3” – https://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html (links to #1 & #2)

Slide 95

Slide 95 text

95 Paul’s Wife’s Concurrent-Coding Antidote ● Wild Himalayan blackberry cordial – Put two liters of wild Himalayan blackberries in a four-liter jar – Add five-eighths of a liter of sugar – Fill the jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through a sieve: Add the berries to ice cream, consume the filtered liquid as you wish

Slide 96

Slide 96 text

96 Paul’s Wife’s Concurrency Antidote ● Wild Himalayan Blackberry Cordial – Put 8 cups wild himalayan blackberries in 1 gallon jar – Add 2½ cups sugar – Fill jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through sieve: Add berries to ice cream, consume filtered liquid as you wish

Slide 97

Slide 97 text

97 Paul’s Wife’s Concurrency Antidote ● Wild Himalayan Blackberry Cordial – Put 8 cups wild himalayan blackberries in 1 gallon jar – Add 2½ cups sugar – Fill jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through sieve: Add berries to ice cream, consume filtered liquid as you wish But what constitutes an overdose???

Slide 98

Slide 98 text

98 Overdose-Test Recipe ● 1 liter of dry breadcrumbs ● 100 ml sage, 60 ml onions, 100 ml celery ● 100 ml popcorn (uncooked), a pinch of salt ● 1 liter of broth Mix well, stuff the turkey, and cook 5 hours at 150°C or until the popcorn blows the turkey apart.

Slide 99

Slide 99 text

99 Overdose-Test Recipe ● 4 c. crushed dry bread ● ½ c. sage, ¼ c. onions, ½ c. celery ● ½ c. uncooked popcorn, 1 tsp. salt ● 5 c. broth Mix well; stuff turkey. Cook 5 hours at 300°F or until popcorn blows the a** off the turkey.

Slide 100

Slide 100 text

100 Overdose-Test Recipe ● 4 c. crushed dry bread ● ½ c. sage, ¼ c. onions, ½ c. celery ● ½ c. uncooked popcorn, 1 tsp. salt ● 5 c. broth Mix well; stuff turkey. Cook 5 hours at 300°F or until popcorn blows the a** off the turkey. If this recipe seems like a good idea, you have well and truly overdosed!!!

Slide 101

Slide 101 text

101 And As Always...

Slide 102

Slide 102 text

102 And As Always... If there is no right tool, invent it!!!

Slide 103

Slide 103 text

103 Questions? https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 5

Slide 104

Slide 104 text

104 Backup

Slide 105

Slide 105 text

105 Who is Afraid of an Optimizer???

Slide 106

Slide 106 text

106 Who is Afraid of an Optimizer??? ● If you are wise, you will be! Compilers can: – Reorder references – Fuse and merge loads and stores – Tear loads and stores – Invent loads and stores – Omit “unused” loads and stores ● Which can fatally surprise your concurrent code https://lwn.net/Articles/793253/ “Who's afraid of a big bad optimizing compiler?” https://lwn.net/Articles/799218/ ”Calibrating your fear of big bad optimizing compilers”

Slide 107

Slide 107 text

107 Who is Afraid of an Optimizer??? ● And don’t forget undefined behavior! Compilers can: – Assume signed ints never wrap (except in the kernel!) – Treat non-atomic non-volatile variables as private ● See previous slide – Assume array indexes are always in-bounds – Assume pointers from two different kmalloc() calls are never equal to each other ● Which can also fatally surprise your concurrent code https://lwn.net/Articles/793253/ “Who's afraid of a big bad optimizing compiler?” https://lwn.net/Articles/799218/ ”Calibrating your fear of big bad optimizing compilers”

Slide 108

Slide 108 text

108 PDP-12 and Trigonometric Functions

Slide 109

Slide 109 text

109 Paul’s and Bryan’s 1977 Schoolwork “And we were #$(*&@# happy to have 4Kx12 bits!!!” https://www.rcsri.org/collection/pdp-12/skylark-ps.txt Photo: The Retro-Computing Society of Rhode Island, Inc. DEC PDP-12 625 KHz CPU (1.6μs/ref) One register (why more?) 12-bit instructions 1.6μs 4Kx12 bits core memory No stack & no caches 512x512 graphics (green!) No graphics frame buffer Size of a refrigerator sin() & cos() < 21μs Photo in museum

Slide 110

Slide 110 text

110 sin() & cos() < 21μs? How??? ● Forget degrees and radians ● Divide the circle into 128 parts ● Use a lookup table

Slide 111

Slide 111 text

111 sin() & cos() < 21μs? Here’s How!!! TRGMSK, 0177 / Variable named TRGMSK with value 0177 TRGSTO, 0 / Temporary variable SINSRC, 0 AND TRGMSK / 3.2μs TAD SNTABL / 3.2μs DCA TRGSTO / 3.2μs TAD I TRGSTO / 4.8μs JMP I SINSRC / 3.2μs /20.8μs Total (24.0μs including JMS)

Slide 112

Slide 112 text

112 sin() & cos() < 21μs? Here’s How!!! SINTBL, 0000 /0 Degrees, values octal s.bbbbbbbbbbb 0144 0310 0454 0617 0761 1122 1261 1417 ...