
Case Study: Concurrent Counting

Counting is perhaps the simplest and most natural possible form of mathematics. However, counting efficiently and scalably on a large multicore system is quite challenging. And this is all to the good, because the simplicity of the underlying concept of counting allows us to explore the fundamental issues of concurrency without the distractions of elaborate data structures or complex synchronization primitives. These issues include design, coding, and validation.

This talk will therefore use counting as an introduction to concurrency.

Paul McKenney

Kernel Recipes

October 02, 2024

Transcript

  1. Case Study: Concurrent Counting As Easy as 1, 2, 3!

    © 2024 Meta Platforms Paul E. McKenney, Meta Platforms Kernel Team Kernel Recipes, September 24, 2024 https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 5
  2. 2 Recipe for Concurrent Counting • A pinch of

     knowledge of the laws of physics • Modest understanding of computer hardware • Thorough understanding of requirements • Careful design, including synchronization • Brutal validation https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3
  4. 4 Recipe for Concurrent Counting • One pinch knowledge of

    laws of physics • Modest understanding of hardware • Thorough understanding of requirements • Careful design, including synchronization • Brutal validation https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3
  5. 7 Laws of Physics: Atoms Are Too Big!!! Each spot

    is an atom. Qingxiao Wang/UT Dallas ca. 2016.
  6. 8 Laws of Physics: Atoms Are Too Big!!! Each spot

    is an atom. Qingxiao Wang/UT Dallas ca. 2016. Speed controlled by base thickness: At least one atom thick!!!
  7. 9 Laws of Physics: Light Is Too Slow!!! “One nanosecond

    per foot” courtesy of Grace Hopper (https://www.youtube.com/watch?v=9eyFDBPk4Yw) https://en.wikipedia.org/wiki/List_of_refractive_indices A 50% sugar solution is “light syrup”. • Following in the footsteps of Admiral Hopper: – Light goes 11.803 inches/ns in a vacuum • Or, if you prefer, 1.0097 lengths of A4 paper per nanosecond • Light goes 1 width of A4 paper per nanosecond in 50% sugar solution – But over and back: 5.9015 in/ns – But not 1GHz! Instead, ~2GHz: ~3in/ns – But Cu: ~1 in/ns, or Si transistors: ~0.1 in/ns – Plus other slowdowns: protocols, electronics, ...
  8. 10 Laws of Physics: Data Is Slower!!! CPUs Caches Interconnects

    Memories DRAM & NVM Protocol overheads (Mathematics!) Multiplexing & Demultiplexing (Electronics) Clock-domain transitions (Electronics) Phase changes (Chemistry) Light is way too slow in Cu and Si and atoms are way too big!!!
  9. 11 Laws of Physics: Summary • The speed of light

    is finite (especially in Cu and Si) and atoms are of non-zero size • Mathematics, electronics, and chemistry also take their toll • Systems are fast, so this matters “Gentlemen, you have two fundamental problems: (1) the finite speed of light and (2) the atomic nature of matter.” * * Gordon Moore quoting Stephen Hawking
  10. 12 Laws of Physics: Summary • The speed of light

    is finite (especially in Cu and Si) and atoms are of non-zero size • Mathematics, electronics, and chemistry also take their toll • Systems are fast, so this matters “Gentlemen, you have two fundamental problems: (1) the finite speed of light and (2) the atomic nature of matter.” * * Gordon Moore quoting Stephen Hawking Hardware architects are well aware of all this
  11. 13 Superscalar Execution For The Win!!! Intel Core 2 Architecture

    (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted) [Block diagram of the Core 2 microarchitecture: instruction fetch and decode, μop buffers, register alias table and allocator, 96-entry reorder buffer, reservation station, integer/SSE/FP execution units, load and store units, L1 instruction and data caches, TLBs, shared L2 cache, and shared bus interface unit]
  12. 14 Don’t Make ‘em Like They Used To 4.0 GHz

    clock, 20 MB L3 cache, 20 stage pipeline... The only pipeline I need is to cool off that hot-headed brat.
  13. 15 Don’t Make ‘em Like They Used To 4.0 GHz

    clock, 20 MB L3 cache, 20 stage pipeline... The only pipeline I need is to cool off that hot-headed brat. And this is 1980s MC68000: Try 1970s!
  14. 16 Paul’s and Bryan’s 1977 Schoolwork “And we were #$(*&@#

    happy to have 4Kx12 bits!!!” https://www.rcsri.org/collection/pdp-12/skylark-ps.txt Photo: The Retro-Computing Society of Rhode Island, Inc. DEC PDP-12 625 KHz CPU (1.6μs/ref) One register (why more?) 12-bit instructions 1.6μs 4Kx12 bits core memory No stack & no caches 512x512 graphics (green!) No graphics frame buffer Size of a refrigerator sin() & cos() < 21μs Photo in museum
  15. 17 Paul’s and Bryan’s 1977 Schoolwork “And we were #$(*&@#

    happy to have 4Kx12 bits!!!” https://www.rcsri.org/collection/pdp-12/skylark-ps.txt Photo: The Retro-Computing Society of Rhode Island, Inc. DEC PDP-12 625 KHz CPU (1.6μs/ref) One register (why more?) 12-bit instructions 1.6μs 4Kx12 bits core memory No stack & no caches 512x512 graphics (green!) No graphics frame buffer Size of a refrigerator sin() & cos() < 21μs Photo in museum Nice and simple, but slower than a herd of turtles in a truckload of tar!!!
  16. 18 Account For All CPU Complexity??? • Sometimes, yes! (Assembly

    language!) • But we also need portability: CPUs change – From family to family – With each revision of silicon – To work around hardware bugs – As a given physical CPU ages
  17. 19 One of the ALUs Might Be Disabled Intel Core

    2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted) [The same Core 2 block diagram, with two of the execution units crossed out]
  18. 20 Thus, Simple Portable CPU Model [The Core 2

    block diagram overlaid with a simplified model: CPU, store buffer, and cache] Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)
  19. 22 CPU, Store Buffer, Cache, and “Bus” • CPU computes

    with words (e.g., 64 bits) • Store buffer holds words waiting for cache lines • Cache is a hardware hash table containing “cache lines” (e.g., 64 bytes which is 512 bits) • “Bus” (AKA “interconnect”) communicates cache lines among CPUs’ caches and memory
  20. 23 CPU, Store Buffer, Cache, and “Bus” CPU 0 Store

    Buffer Cache CPU 3 Store Buffer Cache CPU 1 CPU 2 Communication is slow! So use locality of reference to avoid unnecessary communication!!! “Bus” carries cache lines Memory
  21. 24 CPU, Store Buffer, Cache, and “Bus” CPU 0 Store

    Buffer Cache CPU 3 Store Buffer Cache CPU 1 CPU 2 Communication is slow! So use locality of reference to avoid unnecessary communication!!! “Bus” carries cache lines In later examples, data will have no chance to reach memory
  22. 26 Just Count Concurrently!!! unsigned long ctr; void inc_count(void) {

    ctr++; } unsigned long read_count(void) { return ctr; }
  23. 27 Just Count Concurrently!!! unsigned long ctr; void inc_count(void) {

    ctr++; } unsigned long read_count(void) { return ctr; } Anyone care to critique this code?
  24. 28 Just Count Concurrently!!! unsigned long ctr; void inc_count(void) {

    ctr++; } unsigned long read_count(void) { return ctr; } Anyone care to critique this code? Running on 6 CPUs loses 87% of the counts, which is not good!
  25. 30 Why Are Counts Lost??? (1/7) CPU 0 Store Buffer

    Cache CPU 3 Store Buffer Cache ctr=0 CPU 1 CPU 2 ctr++; ctr++;
  26. 31 Why Are Counts Lost??? (2/7) CPU 0 Store Buffer

    ctr=1 Cache CPU 3 Store Buffer Cache ctr=1 CPU 1 CPU 2 Request cacheline ctr The store buffer allows increments to complete quickly!!! Take that, laws of physics!!! ctr++; ctr++;
  27. 32 Why Are Counts Lost??? (3/7) CPU 0 Store Buffer

    ctr=2 Cache CPU 3 Store Buffer Cache ctr=2 CPU 1 CPU 2 Request cacheline ctr ctr++; ctr++;
  28. 33 Why Are Counts Lost??? (4/7) CPU 0 Store Buffer

    ctr=3 Cache CPU 3 Store Buffer Cache ctr=3 CPU 1 CPU 2 Request cacheline ctr ctr++; ctr++;
  29. 34 Why Are Counts Lost??? (5/7) CPU 0 Store Buffer

    ctr=4 Cache CPU 3 Store Buffer ctr=4 Cache CPU 1 CPU 2 Respond with cacheline ctr=3 ctr++; ctr++;
  30. 35 Why Are Counts Lost??? (6/7) CPU 0 Store Buffer

    ctr=5 Cache ctr=3 CPU 3 Store Buffer ctr=5 Cache CPU 1 CPU 2 Respond with cacheline ctr=3 ctr++; ctr++;
  31. 36 Why Are Counts Lost??? (7/7) CPU 0 Store Buffer

    Cache ctr=6 CPU 3 Cache CPU 1 CPU 2 ctr++; ctr++; Store Buffer ctr=6
  32. 37 Why Are Counts Lost??? (7/7 Redux) CPU 0 Store

    Buffer Cache ctr=6 CPU 3 Cache CPU 1 CPU 2 Quick write completion, sort of. Laws of physics: Slow or lost counts!!! ctr++; ctr++; Store Buffer ctr=6 Here we add Here we just overwrite
  33. 38 Why Are Counts Lost??? (7/7 Redux) CPU 0 Store

    Buffer Cache ctr=6 CPU 3 Cache CPU 1 CPU 2 Quick write completion, sort of. Laws of physics: Slow or lost counts!!! ctr++; ctr++; Store Buffer ctr=6 Here we add Here we just overwrite This is why we have atomic operations
  34. 40 Just Count Atomically!!! atomic_t ctr; void inc_count(void) { atomic_inc(&ctr);

    } int read_count(void) { return atomic_read(&ctr); }
  35. 41 Just Count Atomically!!! atomic_t ctr; void inc_count(void) { atomic_inc(&ctr);

    } int read_count(void) { return atomic_read(&ctr); } Anyone care to critique this code?
  36. 45 Spare a Thought for Those CPUs! One one thousand.

    Two one thousand. Three one thousand...
  37. 46 Cache Line Thrashes Among CPUs!!! [Diagram: eight

    last-level-cache groups joined by a two-level interconnect, serving CPUs 0-27 & 224-251, 28-55 & 252-279, 56-83 & 280-307, 84-111 & 308-335, 112-139 & 336-363, 140-167 & 364-391, 168-195 & 392-419, and 196-223 & 420-447] HW optimizations? https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e2.pdf Chapter 5 Quick Quiz 5.11
  38. 47 Counting Atomically Does Not Scale 448-CPU Intel(R) Xeon(R) Platinum

    8176 CPU @ 2.10GHz So we should always avoid atomics?
  39. 48 Counting Atomically Can Be OK... 448-CPU Intel(R) Xeon(R) Platinum

    8176 CPU @ 2.10GHz Here is just fine! Here, not so much!!!
  40. 49 Counting Atomically Can Be OK... 448-CPU Intel(R) Xeon(R) Platinum

    8176 CPU @ 2.10GHz Here is just fine! Here, not so much!!! But sometimes we can do way better!
  41. 50 How Accurate is the Reader’s Value? n = read_count();

    do_something(n); // Here? do_something_else(n); // How about here??? do_other_thing(n); // And here???
  42. 52 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n

    = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time
  43. 53 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n

    = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time Stale value!
  44. 54 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n

    = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time Stale value! The count “n” is stale before it even has a chance to be returned
  45. 55 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n

    = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time Stale value! The count “n” is stale before it even has a chance to be returned So why pay such a high cost for a value that is stale anyway???
  46. 56 How Accurate is the Reader’s Value? ctr++ atomic_inc(&ctr) n

    = read_count(); do_something(n); do_something_else(n); do_other_thing(n); Value of ctr Time Stale value! The count “n” is stale before it even has a chance to be returned So why pay such a high cost for a value that is stale anyway??? How can we do better if reads are infrequent???
  47. 57 How to Count Better??? (1/7) CPU 0 Store Buffer

    Cache ctr0=0 CPU 3 Store Buffer Cache ctr3=0 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=0 Store Buffer Cache ctr2=0 ctr1++; ctr2++; ctr2++;
  48. 58 How to Count Better??? (2/7) CPU 0 Store Buffer

    Cache ctr0=1 CPU 3 Store Buffer Cache ctr3=1 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=1 Store Buffer Cache ctr2=1 ctr1++; ctr2++; ctr2++;
  49. 59 How to Count Better??? (3/7) CPU 0 Store Buffer

    Cache ctr0=2 CPU 3 Store Buffer Cache ctr3=2 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=2 Store Buffer Cache ctr2=2 ctr1++; ctr2++; ctr2++;
  50. 60 How to Count Better??? (4/7) CPU 0 Store Buffer

    Cache ctr0=3 CPU 3 Store Buffer Cache ctr3=3 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=3 Store Buffer Cache ctr2=3 ctr1++; ctr2++; ctr2++;
  51. 61 How to Count Better??? (5/7) CPU 0 Store Buffer

    Cache ctr0=4 CPU 3 Store Buffer Cache ctr3=3 CPU 1 CPU 2 ctr0++; Store Buffer Cache ctr1=4 Store Buffer Cache ctr2=4 ctr1++; ctr2++; read_count(); Request ctr0, ctr1, and ctr2
  52. 62 How to Count Better??? (6/7) CPU 0 Store Buffer

    Cache ctr0=4 (S) CPU 3 Store Buffer Cache ctr0=4, ctr1=4, ctr2=4, ctr3=3 CPU 1 CPU 2 Store Buffer Cache ctr1=4 (S) Store Buffer Cache ctr2=4 (S) read_count(); Respond with ctr0, ctr1, and ctr2
  53. 63 How to Count Better??? (7/7) CPU 0 Store Buffer

    Cache ctr0=4 (S) CPU 3 Store Buffer Cache ctr0=4, ctr1=4, ctr2=4, ctr3=3 CPU 1 CPU 2 Store Buffer Cache ctr1=4 (S) Store Buffer Cache ctr2=4 (S) read_count(); (15)
  54. 64 How to Count Better Assessment • Updates are very

    fast: Non-atomic increment of a local counter variable with no ordering • Reads are very slow: Sum up all threads’ counters and return this sum – Maximum error limited by the potential change in platonic counter value during summation
  55. 67 How to Count Better: Pseudocode unsigned long counter; static

    inline void inc_count(void) { WRITE_ONCE(counter, counter + 1); } static inline unsigned long read_count(void) { int t; unsigned long sum = 0; for_each_thread(t) sum += READ_ONCE(*counterp[t]); return sum; } For more details: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Section 5.2
  56. 68 How to Count Better: Pseudocode unsigned long counter; static

    inline void inc_count(void) { WRITE_ONCE(counter, counter + 1); } static inline unsigned long read_count(void) { int t; unsigned long sum = 0; for_each_thread(t) sum += READ_ONCE(*counterp[t]); return sum; } Updates a few ns, reads 100s of μs!
  57. 69 How to Count Better: Pseudocode unsigned long counter; static

    inline void inc_count(void) { WRITE_ONCE(counter, counter + 1); } static inline unsigned long read_count(void) { int t; unsigned long sum = 0; for_each_thread(t) sum += READ_ONCE(*counterp[t]); return sum; } Updates a few ns, reads 100s of μs! But can we read even faster???
  58. 70 Speed-Reading How-to: Pseudocode unsigned long counter_sum; static inline unsigned

    long read_count_fast(void) { return READ_ONCE(counter_sum); } void update_counter_sum(void) { for (;;) { WRITE_ONCE(counter_sum, read_count()); // Apply many updates: Batch! schedule_timeout_idle(10); // Sleep 10 jiffies: Delay! } } For more details: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Section 5.2.4
  59. 72 Accuracy & Scalability for Both!!! All operations a few

    nanoseconds, so why not just use this everywhere?
  60. 73 Speed-Reading How-to: Tradeoffs For more details: https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Section 5.2.4

    • Need an extra thread (per counter?) • That thread’s wakeups not good for batteries • Extra delay from update to read • Counter reads are normally very infrequent, so why bother? – Few will care about 160µs every five seconds...
  61. 74 Other Complications • Approximate limits (Section 5.3) • Exact

    limits (Section 5.4) – Without atomic operations (Section 5.4.3) – Only sometimes (Section 5.4.6) • Validation (Section 5.5.1)
  62. 77 Statistical Counters (Case 1/3) • Heavily used: More than

    300 uses of this_cpu_inc() and __this_cpu_inc() • Additional uses as throughput counters – For example this_cpu_inc() to count network packets and this_cpu_add() to count bytes • Typically open-coded • Taken for granted since at least 1987!
  63. 79 Per-CPU Refcounts (Case 2/3) • Switch from per-CPU to

    global mode: – Start with counter equal to 1 – Per-CPU refcounts used in common case – At tear-down time, switch to global refcount • Combine sum of per-CPU refcounts to global • If tearing down, wait for RCU grace period – Remove initial reference • Can also switch back to per-CPU if not tearing down
  64. 80 Per-CPU Refcounts (Case 2/3) pcrc0 pcrc1 pcrc2 pcrc3 global

    pcrc0 pcrc1 pcrc2 pcrc3 percpu_ref_switch_to_atomic_rcu() percpu_ref_switch_to_percpu() Time
  65. 81 Per-CPU Refcounts (Case 2/3) pcrc0 pcrc1 pcrc2 pcrc3 global

    pcrc0 pcrc1 pcrc2 pcrc3 Often to tear down percpu_ref_switch_to_atomic_rcu() percpu_ref_switch_to_percpu() Time
  66. 82 Per-CPU Refcounts (Case 2/3) pcrc0 pcrc1 pcrc2 pcrc3 global

    pcrc0 pcrc1 pcrc2 pcrc3 Often to tear down percpu_ref_switch_to_atomic_rcu() percpu_ref_switch_to_percpu() Time Study the percpu-ref code before attempting to do this yourself!
  67. 83 Per-CPU Refcounts (Case 2/3) pcrc0 pcrc1 pcrc2 pcrc3 global

    pcrc0 pcrc1 pcrc2 pcrc3 Often to tear down percpu_ref_switch_to_atomic_rcu() percpu_ref_switch_to_percpu() Study the percpu-ref code before attempting to do this yourself! Synchronize in per-CPU mode? Time
  68. 85 SRCU Counts of Readers (Case 3/3) lock[0]:0 lock[1]:0 unlock[0]:0

    unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current CPU 0 CPU 1 CPU 2
  69. 86 SRCU Counts of Readers (Case 3/3) lock[0]:0 lock[1]:0 unlock[0]:0

    unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_lock() CPU 0 CPU 1 CPU 2 lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current
  70. 87 SRCU Counts of Readers (Case 3/3) lock[0]:0 lock[1]:0 unlock[0]:0

    unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_lock() CPU 0 CPU 1 CPU 2 lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_unlock() lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:1 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:1 unlock[1]:0 Current srcu_read_lock()
  71. 88 SRCU Counts of Readers (Case 3/3) lock[0]:0 lock[1]:0 unlock[0]:0

    unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_lock() CPU 0 CPU 1 CPU 2 lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:0 unlock[1]:0 Current srcu_read_unlock() lock[0]:1 lock[1]:0 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:1 unlock[0]:0 unlock[1]:0 lock[0]:0 lock[1]:0 unlock[0]:1 unlock[1]:0 Current srcu_read_lock() Please study the SRCU code before attempting to code this yourself!
  72. 90 Counting’s General Lessons • Partitioning helps performance and scalability

    • Partial partitioning helps, at least partially • Batching updates can also help • Careful use of delay can help greatly (Chapter 9) • Engineering, not science: Thus tradeoffs – Hardware and workloads affect design
  73. 93 Summary • Modern hardware is highly optimized – Most

    of the time! – Incremental improvements due to integration – But the speed of light is too slow and atoms too big • Use concurrent software where available • Structure your code & data to avoid big obstacles – Plenty of tricks for fast concurrent counting!
  74. 94 For More Information • “Is Parallel Programming Hard, And,

    If So, What Can You Do About It?” – https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html – Chapter 3 (“Hardware and its Habits”) and Chapter 5 (“Counting”) • “What every programmer should know about memory, Part 1” – https://lwn.net/Articles/250967/ (contains links to parts 2-9) • “Who's afraid of a big bad optimizing compiler?” – https://lwn.net/Articles/793253/ • “Calibrating your fear of big bad optimizing compilers” – https://lwn.net/Articles/799218/ • “What Every C Programmer Should Know About Undefined Behavior #3/3” – https://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html (links to #1 & #2)
  75. 95 Paul’s Wife’s Concurrency Antidote • Wild Himalayan

    blackberry cordial – Put two liters of blackberries in a four-liter jar – Add five-eighths of a liter of sugar – Fill the jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through a sieve: add the berries to ice cream, consume the filtered liquid as you wish
  76. 96 Paul’s Wife’s Concurrency Antidote • Wild Himalayan Blackberry Cordial

    – Put 8 cups wild himalayan blackberries in 1 gallon jar – Add 2½ cups sugar – Fill jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through sieve: Add berries to ice cream, consume filtered liquid as you wish
  77. 97 Paul’s Wife’s Concurrency Antidote • Wild Himalayan Blackberry Cordial

    – Put 8 cups wild himalayan blackberries in 1 gallon jar – Add 2½ cups sugar – Fill jar with vodka – Shake every day for five days – Shake every week for five weeks – Pour through sieve: Add berries to ice cream, consume filtered liquid as you wish But what constitutes an overdose???
  78. 98 Overdose-Test Recipe • 1 liter of

    dry bread crumbs • 100 ml sage, 60 ml onions, 100 ml celery • 100 ml popcorn (uncooked), a pinch of salt • 1 liter of broth Mix well, stuff the turkey, and cook 5 hours at 150°C or until the popcorn blows the turkey apart.
  79. 99 Overdose-Test Recipe • 4 c. crushed dry bread •

    ½ c. sage, ¼ c. onions, ½ c. celery • ½ c. uncooked popcorn, 1 tsp. salt • 5 c. broth Mix well; stuff turkey. Cook 5 hours at 300°F or until popcorn blows the a** off the turkey.
  80. 100 Overdose-Test Recipe • 4 c. crushed dry bread •

    ½ c. sage, ¼ c. onions, ½ c. celery • ½ c. uncooked popcorn, 1 tsp. salt • 5 c. broth Mix well; stuff turkey. Cook 5 hours at 300°F or until popcorn blows the a** off the turkey. If this recipe seems like a good idea, you have well and truly overdosed!!!
  81. 106 Who is Afraid of an Optimizer??? • If you

    are wise, you will be! Compilers can: – Reorder references – Fuse and merge loads and stores – Tear loads and stores – Invent loads and stores – Omit “unused” loads and stores • Which can fatally surprise your concurrent code https://lwn.net/Articles/793253/ “Who's afraid of a big bad optimizing compiler?” https://lwn.net/Articles/799218/ “Calibrating your fear of big bad optimizing compilers”
  82. 107 Who is Afraid of an Optimizer??? • And don’t

    forget undefined behavior! Compilers can: – Assume signed ints never wrap (except in the kernel!) – Treat non-atomic non-volatile variables as private • See previous slide – Assume array indexes are always in-bounds – Assume pointers from two different kmalloc() calls are never equal to each other • Which can also fatally surprise your concurrent code https://lwn.net/Articles/793253/ “Who's afraid of a big bad optimizing compiler?” https://lwn.net/Articles/799218/ “Calibrating your fear of big bad optimizing compilers”
  83. 109 Paul’s and Bryan’s 1977 Schoolwork “And we were #$(*&@#

    happy to have 4Kx12 bits!!!” https://www.rcsri.org/collection/pdp-12/skylark-ps.txt Photo: The Retro-Computing Society of Rhode Island, Inc. DEC PDP-12 625 KHz CPU (1.6μs/ref) One register (why more?) 12-bit instructions 1.6μs 4Kx12 bits core memory No stack & no caches 512x512 graphics (green!) No graphics frame buffer Size of a refrigerator sin() & cos() < 21μs Photo in museum
  84. 110 sin() & cos() < 21μs? How??? • Forget degrees

    and radians • Divide the circle into 128 parts • Use a lookup table
  85. 111 sin() & cos() < 21μs? Here’s How!!! TRGMSK, 0177

    / Variable named TRGMSK with value 0177 TRGSTO, 0 / Temporary variable SINSRC, 0 AND TRGMSK / 3.2μs TAD SNTABL / 3.2μs DCA TRGSTO / 3.2μs TAD I TRGSTO / 4.8μs JMP I SINSRC / 3.2μs /20.8μs Total (24.0μs including JMS)
  86. 112 sin() & cos() < 21μs? Here’s How!!! SINTBL, 0000

    /0 Degrees, values octal s.bbbbbbbbbbb 0144 0310 0454 0617 0761 1122 1261 1417 ...