
Hardware and its Concurrency Habits

This talk will present a high-level overview of the hardware structure of modern computer systems, and then will use this structure to show how the laws of physics put limitations on the design and implementation of concurrent software. Venue permitting, it will include a stark visual exposition of these limitations. Finally, this talk will compare how well various synchronization use cases are aligned with modern hardware structure and the laws of physics. Audience members will gain a firmer intuition on the costs and benefits of different synchronization mechanisms.

Paul E. McKenney

Kernel Recipes

September 30, 2023

Transcript

  1. Hardware and its Concurrency
    Habits
    © 2023 Meta Platforms
    Paul E. McKenney, Meta Platforms Kernel Team
    Kernel Recipes, September 27, 2023
    https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3


  2. 2
    Recipe For Concurrent Coding

    A pinch of knowledge of the laws of
    physics

    A modest understanding of computer hardware

    A thorough understanding of the requirements

    Careful design, including synchronization

    Brutal validation
    https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

  3. 3
    Recipe For Concurrent Coding

    A pinch of knowledge of the laws of
    physics

    A modest understanding of computer hardware

    A thorough understanding of the requirements

    Careful design, including synchronization

    Brutal validation
    https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

  4. 5
    “Let Them Run Free!!!”


  5. 6
    “Let Them Run Free!!!”
    CPU Benchmark Trackmeet

  6. 7
    “Let Them Run Free!!!”
    CPU Benchmark Trackmeet
    Sadly, it is now more of an obstacle course than a track...

  7. 8
    Don’t Make ‘em Like They Used To!


  8. 9
    Don’t Make ‘em Like They Used To
    4.0 GHz clock, 20 MB L3
    cache, 20 stage pipeline...
    The only pipeline I need
    is to cool off that hot-
    headed brat.


  9. 10
    4.0 GHz clock, 20 MB L3
    cache, 20 stage pipeline...
    The only pipeline I need
    is to cool off that hot-
    headed brat.
    Don’t Make ‘em Like They Used To!

    No cache
    Shallow pipeline
    In-order execution
    One instruction at a time
    Predictable (slow) execution

    Large cache
    Deep pipeline
    Out of order
    Super scalar
    Unpredictable (fast) execution

  10. 11
    4.0 GHz clock, 20 MB L3
    cache, 20 stage pipeline...
    The only pipeline I need
    is to cool off that hot-
    headed brat.
    Don’t Make ‘em Like They Used To!

    No cache
    Shallow pipeline
    In-order execution
    One instruction at a time
    Predictable (slow) execution

    Large cache
    Deep pipeline
    Out of order
    Super scalar
    Unpredictable (fast) execution

    “Tiny Bulldozer” “Semi Tractor-Trailer”

  11. 12
    4.0 GHz clock, 20 MB L3
    cache, 20 stage pipeline...
    The only pipeline I need
    is to cool off that hot-
    headed brat.
    Don’t Make ‘em Like They Used To!

    No cache
    Shallow pipeline
    In-order execution
    One instruction at a time
    Predictable (slow) execution

    Large cache
    Deep pipeline
    Out of order
    Super scalar
    Unpredictable (fast) execution

    What would be the computing-systems equivalents of a freight train?

  12. 13
    “Good Olde Days” CPU Architecture
    80386 Architecture (Wikipedia user “Appaloosa” GFDL, simplified and reformatted)
    [Block diagram: 32-bit register file, ALU, barrel shifter, multiply/divide unit, segmentation and 32-bit paging units, protection test unit, prefetcher/limit checker, 16-byte code queue, instruction decoder, 3-deep decoded-instruction queue, decode and sequencing control ROM, dedicated ALU bus, and bus control]

  13. 14
    The 80386 Taught Me Concurrency
    That and a logic analyzer...


  14. 15
    But Instructions Took Several Cycles!


  15. 16
    Pipelined Execution For The Win!!!
    (Wikipedia user “Amit6” CC BY-SA 3.0, reformatted)


  16. 17
    Superscalar Execution For The Win!!!
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted)
    [Block diagram: instruction fetch unit with 128-entry ITLB and 32 KB 8-way instruction cache, 32-byte pre-decode/fetch buffer, 18-entry instruction queue, three simple decoders plus one complex decoder with microcode, register alias table and allocator, 96-entry reorder buffer (ROB) and retirement register file, 32-entry reservation station feeding ALUs, SSE units, branch unit, FADD/FMUL/FDIV, and load/store ports, memory ordering buffer (MOB), 32 KB dual-ported 8-way data cache with DTLBs, and a shared 16-way L2 cache behind the bus interface unit]

  17. 18
    Why All This Hardware Complexity?


  18. 19
    Laws of Physics: Atoms Are Too Big!!!
    Each spot is an atom. Qingxiao Wang/UT Dallas ca. 2016.


  19. 20
    Laws of Physics: Atoms Are Too Big!!!
    Each spot is an atom. Qingxiao Wang/UT Dallas ca. 2016.
    Speed controlled by base thickness:
    At least one atom thick!!!


  20. 21
    Laws of Physics: Light Is Too Slow!!!
    “One nanosecond per foot” courtesy of Grace Hopper (https://www.youtube.com/watch?v=9eyFDBPk4Yw)
    https://en.wikipedia.org/wiki/List_of_refractive_indices A 50% sugar solution is “light syrup”.

    Following the footsteps of Admiral Hopper:
    – Light goes 11.803 inches/ns in a vacuum

    Or, if you prefer, 1.0097 lengths of A4 paper per nanosecond

    Light goes 1 width of A4 paper per nanosecond in 50% sugar solution
    – But over and back: 5.9015 in/ns
    – And clocks are ~2GHz, not 1GHz: ~3 in per clock cycle
    – But Cu: ~1 in/ns, or Si transistors: ~0.1 in/ns
    – Plus other slowdowns: protocols, electronics, ...
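    A minimal C sketch (not from the slides) that recomputes these figures from the speed of light; the refractive index of 1.42 for a 50% sugar solution is an assumption taken from the cited list of refractive indices:

    #include <stdio.h>

    int main(void)
    {
        const double c_mm_per_ns = 299.792458; /* light in vacuum: mm per nanosecond */
        const double mm_per_inch = 25.4;
        const double n_sugar = 1.42;           /* assumed index of a 50% sugar solution */
        double in_per_ns = c_mm_per_ns / mm_per_inch;

        printf("vacuum, one way:       %.3f in/ns\n", in_per_ns);       /* ~11.803 */
        printf("vacuum, out and back:  %.3f in/ns\n", in_per_ns / 2.0); /* ~5.902  */
        printf("out and back, 2 GHz:   %.3f in/cycle\n", in_per_ns / 2.0 / 2.0);
        printf("50%% sugar solution:    %.1f mm/ns (about one A4 width, 210 mm)\n",
               c_mm_per_ns / n_sugar);                                  /* ~211    */
        return 0;
    }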


  21. 22
    Laws of Physics: Data Is Slower!!!
    [Diagram: CPUs, caches, interconnects, and memories (DRAM & NVM)]

  22. 23
    Laws of Physics: Data Is Slower!!!
    [Diagram: CPUs, caches, interconnects, and memories (DRAM & NVM)]
    Light is way too slow in Cu and Si and atoms are way too big!!!

  23. 24
    Laws of Physics: Data Is Slower!!!
    [Diagram: CPUs, caches, interconnects, and memories (DRAM & NVM), annotated with protocol overheads (mathematics!), multiplexing & demultiplexing (electronics), clock-domain transitions (electronics), and phase changes (chemistry)]
    Light is way too slow in Cu and Si and atoms are way too big!!!

  24. 25
    Laws of Physics: Summary

    The speed of light is finite (especially in Cu and
    Si) and atoms are of non-zero size

    Mathematics, electronics, and chemistry also
    take their toll

    Systems are fast, so this matters


  25. 26
    Laws of Physics: Summary

    The speed of light is finite (especially in Cu and
    Si) and atoms are of non-zero size

    Mathematics, electronics, and chemistry also
    take their toll

    Systems are fast, so this matters
    “Gentlemen, you have two fundamental problems:
    (1) the finite speed of light and (2) the atomic nature of matter.” *
    * Gordon Moore quoting Stephen Hawking


  26. 27
    Why All This Hardware Complexity?
    [Diagram: CPUs, caches, interconnects, and memories (DRAM & NVM), annotated with protocol overheads (mathematics!), multiplexing & demultiplexing (electronics), clock-domain transitions (electronics), and phase changes (chemistry)]
    Slow light and big atoms create the modern computing obstacle course!!!
    Light is way too slow in Cu and Si and atoms are way too big!!!

  27. 28
    Account For All CPU Complexity???

    Sometimes, yes! (Assembly language!)

    But we also need portability: CPUs change
    – From family to family
    – With each revision of silicon
    – To work around hardware bugs
    – As a given physical CPU ages


  28. 29
    One of the ALUs Might Be Disabled
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted)
    [The same Intel Core 2 block diagram as before, but with some of the execution units crossed out with an X (disabled)]

  29. 30
    Thus, Simple Portable CPU Model
    [The Intel Core 2 block diagram abstracted into a simple model: CPU, store buffer, and cache]
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)

  30. 31
    And Lots Of CPUs Per System!!!
    [System diagram: eight nodes, each with its own banks of last-level cache, joined by interconnects; CPUs 0-27 & 224-251, 28-55 & 252-279, 56-83 & 280-307, 84-111 & 308-335, 112-139 & 336-363, 140-167 & 364-391, 168-195 & 392-419, and 196-223 & 420-447]

  31. 32
    Obstacles for Modern Computers


  32. 33
    Obstacle: Pipeline Flush
    [Cartoon: pipeline error caused by a branch misprediction]
    Running at full speed requires perfect branch prediction
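    A minimal C sketch (not from the slides) that makes the cost visible: the same branchy loop runs much faster over sorted data because the branch becomes predictable. Compile with modest optimization (for example -O1) so the compiler keeps the branch rather than converting it to a conditional move.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (1 << 20)
    #define REPS 100

    static int v[N];

    static int cmp(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    static long sum_above(int threshold)
    {
        long sum = 0;

        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                if (v[i] >= threshold)   /* hard to predict on random data */
                    sum += v[i];
        return sum;
    }

    static void timeit(const char *label)
    {
        clock_t t = clock();
        long sum = sum_above(128);

        printf("%-8s sum=%ld  %.2f s\n", label, sum,
               (double)(clock() - t) / CLOCKS_PER_SEC);
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            v[i] = rand() % 256;
        timeit("random:");
        qsort(v, N, sizeof(v[0]), cmp);   /* same data, now predictable branch */
        timeit("sorted:");
        return 0;
    }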


  33. 34
    Obstacle: Memory Reference
    A single fetch all the way from memory can cost hundreds of clock cycles


  34. 35
    Obstacle: Atomic Operation
    Atomic operations require locking cachelines and/or busses, incurring significant delays
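    A minimal C11 sketch (not from the slides, compile with -pthread): the loop is trivially correct, but every atomic_fetch_add() must gain exclusive ownership of the cacheline holding the counter, so adding threads mostly adds cacheline ping-ponging rather than throughput.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NINCS    1000000

    static atomic_long counter;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NINCS; i++)
            atomic_fetch_add(&counter, 1);  /* read-modify-write on a shared cacheline */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("counter = %ld\n", atomic_load(&counter));  /* always 4000000 */
        return 0;
    }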


  35. 36
    Obstacle: Memory Barrier
    Memory barriers result in stalls and/or ordering constraints, again incurring delays

  36. 37
    Obstacle: Thermal Throttling
    Efficient use of CPU hardware generates heat, throttling the CPU clock frequency


  37. 38
    Obstacle: Cache Miss
    Cache misses result in waiting for data to arrive (from memory or other CPUs)
    [Cartoon: cache-miss toll booths]

  38. 39
    Obstacle: Input/Output Operation
    And here you thought that cache misses were slow...


  39. 40
    Which Obstacles To Focus On?
    1) I/O operations (but often higher-level issue)
    2) Communications cache misses
    3) Memory barriers and atomic operations
    4) Capacity/geometry cache misses (memory)
    5) Branch prediction


  40. 41
    Which Obstacles To Focus On?
    1) I/O operations (but often higher-level issue)
    2) Communications cache misses
    3) Memory barriers and atomic operations
    4) Capacity/geometry cache misses (memory)
    5) Branch prediction
    These obstacles can (usually) be overcome in a portable manner.


  41. 42
    Xeon Platinum 8176 2.1GHz: CPU 0


  42. 43
    Location Really Matters!!!
    [System diagram: eight nodes, each with its own banks of last-level cache, joined by interconnects; CPUs 0-27 & 224-251, 28-55 & 252-279, 56-83 & 280-307, 84-111 & 308-335, 112-139 & 336-363, 140-167 & 364-391, 168-195 & 392-419, and 196-223 & 420-447]

  43. 44
    Latency Demonstration


  44. 49
    Can Hardware Help???


  45. 50
    Can Hardware Help???
    [Diagram: transistor with source, drain, and a sub-atomic base]
    Vacuum-gap transistor: At these scales, the atmosphere is a vacuum!!!

  46. 51
    We Really Can Hand-Place Atoms...
    Actually a carbon monoxide molecule that I moved across a few planes of copper


  47. 52
    We Really Can Hand-Place Atoms...
    But not trillions of them in a cost-effective manner!!!
    Does Not Scale

  48. 53
    Incremental Help From Hardware


  49. 54
    Hardware 3D Integration
    3 cm
    1.5 cm
    Half the distance,
    twice the speed!!!
    Both stacked chiplets and lithographically stacked transistors.


  50. 55
    Stacked Chiplets/Dies
    Diagram by Shmuel Csaba Otto Traian, CC BY-SA 4.0


  51. 56
    Lithographically Stacked Transistors
    https://ieeexplore.ieee.org/document/9976473
    https://spectrum.ieee.org/intels-stacked-nanosheet-transistors-could-be-the-next-step-in-moores-law


  52. 57
    Hardware 3D Integration
    3 cm
    1.5 cm
    Half the distance,
    twice the speed *
    * Give or take issues with power, cooling, alignment, interconnect drivers, and so on.

  53. 58
    Hardware Integration is Helping
    Q3 2017: 56 CPUs with ~100ns latencies


  54. 59
    Hardware Integration is Helping
    November 2008: 16 CPUs with ~100ns latencies: More than 3x in nine years!!!


  55. 60
    Hardware Accelerators


  56. 61
    Hardware Accelerators, Theory
    [Diagram: data flowing one way through an accelerator]
    Unidirectional data flow, no out and back, twice the speed!!!


  57. 62
    Hardware Accelerators, Practice
    [Diagram: request-response data flow between system main memory and accelerator-local memory]
    Sadly, back to request-response, but better latency with local memory?


  58. 63
    So Why Hardware Accelerators???

    Optimized data transfers (e.g., larger blocks)

    Optimized hard-wired computation

    Better performance per watt

    Better performance per unit capital cost


  59. 64
    Hardware Has Been Helping All Along


  60. 65
    What Hardware Is Up Against
    [Diagram: CPUs, caches, interconnects, and memories (DRAM & NVM), annotated with protocol overheads (mathematics!), multiplexing & demultiplexing (electronics), clock-domain transitions (electronics), and phase changes (chemistry)]
    Light is way too slow in Cu and Si and atoms are way too big!!!

  61. 66
    Therefore, Memory Hierarchies!!!


  62. 67
    Simple Portable CPU Model Redux
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)
    [The Intel Core 2 block diagram abstracted into the simple model: CPU, store buffer, and cache, with the store buffer marked “(Not yet)”]

  63. 68
    Read-Side Hardware Help (1/8)
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)
    [CPUs 0-3, each with a cache; the cacheline holding x=42,y=63 starts in CPU 3’s cache; CPU 0 runs the reads above and issues “Request cacheline x”]

  64. 69
    Read-Side Hardware Help (2/8)
    [The “Request cacheline x” message is in flight from CPU 0 toward CPU 3; the cacheline (x=42,y=63) is still in CPU 3’s cache]

  65. 70
    Read-Side Hardware Help (3/8)
    [The “Request cacheline x” message reaches CPU 3’s cache]

  66. 71
    Read-Side Hardware Help (4/8)
    [CPU 3’s cache gives up the line; the response “Cacheline x = 42, y = 63” heads back toward CPU 0]

  67. 72
    Read-Side Hardware Help (5/8)
    [The cacheline (x=42,y=63) arrives and is installed in CPU 0’s cache]

  68. 73
    Read-Side Hardware Help (6/8)
    [CPU 0’s cache now holds x=42,y=63, so r1 = READ_ONCE(x) completes locally]

  69. 74
    Read-Side Hardware Help (7/8)
    [r2 = READ_ONCE(y) hits in CPU 0’s cache because y shares the cacheline with x]
    Caches help beat laws of physics given spatial locality!!!

  70. 75
    Read-Side Hardware Help (8/8)
    [r3 = READ_ONCE(x) hits in CPU 0’s cache because x is still there]
    Caches help beat laws of physics given temporal locality!!!
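    A minimal user-space C sketch (not from the slides) of the same locality effects: row-major traversal of a large matrix reuses each fetched cacheline, while column-major traversal defeats that reuse and pays for far more misses.

    #include <stdio.h>
    #include <time.h>

    #define N 4096

    static int m[N][N];

    int main(void)
    {
        long sum = 0;
        clock_t t;

        t = clock();
        for (int i = 0; i < N; i++)         /* row-major: good spatial locality */
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        printf("row-major:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();
        for (int j = 0; j < N; j++)         /* column-major: 16 KB stride per access */
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        printf("column-major: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        printf("checksum = %ld\n", sum);    /* keep the loops from being elided */
        return 0;
    }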


  71. 76
    Levels of Cache on My Laptop
    Index  Line Size  Associativity  Size
    0      64         8              32K
    1      64         8              32K
    2      64         4              256K
    3      64         16             16,384K
    When taking on the laws of physics, don’t be afraid to use a few transistors


  72. 77
    Levels of Cache on Large Old Server
    Index  Line Size  Associativity  Size
    0      64         8              32K
    1      64         6              32K
    2      64         16             1,024K
    3      64         11             39,424K
    When taking on the laws of physics, don’t be afraid to use lots of transistors
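    Tables like the two above can be read out of Linux sysfs. A minimal C sketch, assuming a Linux system exposing the usual cacheinfo attributes under /sys/devices/system/cpu/cpu0/cache/:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        static const char *attrs[] = {
            "level", "type", "coherency_line_size", "ways_of_associativity", "size",
        };

        for (int idx = 0; ; idx++) {
            char path[128], buf[64];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
            f = fopen(path, "r");
            if (!f)
                break;                  /* no more cache levels */
            fclose(f);

            printf("index%d:", idx);
            for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu0/cache/index%d/%s",
                         idx, attrs[i]);
                f = fopen(path, "r");
                if (f) {
                    if (fgets(buf, sizeof(buf), f)) {
                        buf[strcspn(buf, "\n")] = '\0';
                        printf("  %s=%s", attrs[i], buf);
                    }
                    fclose(f);
                }
            }
            printf("\n");
        }
        return 0;
    }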


  73. 79
    But What About Writes?


  74. 80
    Write-Side Hardware Help (Summary)

    Store buffers for the win!!! Sort of…
    – Cache line for variable x is initially at CPU 3
    – CPU 0 writes 1 to x, but doesn't have the cacheline

    So it holds the write in CPU 0's store buffer

    And requests exclusive access to the cacheline (which takes time)
    – CPU 3 reads x, obtaining “0” immediately from its cacheline
    – CPU 0 receives x's cacheline

    And CPU 0's write finally reaches the cacheline

    Overwriting the value that CPU 3 read, despite the write having started earlier

    Writes only appear to be instantaneous!!!

  75. 81
    Simple Portable CPU Model Redux
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)
    [The Intel Core 2 block diagram abstracted into the simple model: CPU, store buffer, and cache, with the store buffer marked “Now!!!”]

  76. 82
    Write-Side Hardware Help (1/7)
    [CPUs 0-3, each with a store buffer and a cache; the cacheline holding x=0 is in CPU 3’s cache; CPU 0 executes WRITE_ONCE(x, 1) while CPU 3 executes READ_ONCE(x)]

  77. 83
    Write-Side Hardware Help (2/7)
    [CPU 0 puts x=1 into its store buffer and sends “Request cacheline x”; the cacheline (x=0) is still in CPU 3’s cache]
    The store buffer allows writes to complete quickly!!! Take that, laws of physics!!!

  78. 84
    Write-Side Hardware Help (3/7)
    [While CPU 0’s “Request cacheline x” is still in flight, CPU 3’s READ_ONCE(x) returns 0 straight from its own cache]
    Except that the later read gets the older value...

  79. 85
    Write-Side Hardware Help (4/7)
    [The “Request cacheline x” message reaches CPU 3’s cache]

  80. 86
    Write-Side Hardware Help (5/7)
    [CPU 3’s cache gives up the line and responds with “cacheline x = 0”, which heads back toward CPU 0]

  81. 87
    Write-Side Hardware Help (6/7)
    [The cacheline (x = 0) arrives in CPU 0’s cache; CPU 0’s store buffer still holds x=1]

  82. 88
    Write-Side Hardware Help (7/7)
    [CPU 0 drains its store buffer into the cacheline, which now holds x=1, overwriting the value that CPU 3 already read]
    Quick write completion, sort of. Laws of physics: Slow or misordered!!!
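    The store-buffering outcome can be reproduced from user space. A minimal C11/pthreads sketch (not from the slides): when both loads return 0 in the same run, each thread read the other variable before the other thread's store had left its store buffer; the outcome shows up only occasionally because the race window is narrow.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int x, y;
    static int r1, r2;

    static void *t0(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed); /* may sit in store buffer */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    static void *t1(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        long both_zero = 0;

        for (int i = 0; i < 100000; i++) {
            pthread_t a, b;

            atomic_store(&x, 0);
            atomic_store(&y, 0);
            pthread_create(&a, NULL, t0, NULL);
            pthread_create(&b, NULL, t1, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0)
                both_zero++;            /* store-buffer reordering observed */
        }
        printf("r1 == 0 && r2 == 0 happened %ld times out of 100000\n", both_zero);
        return 0;
    }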


  83. 89
    Misordering? Or Propagation Delay?
    [Timing diagram across CPUs 0-3: x == 0 until CPU 0’s WRITE_ONCE(x, 1) propagates, after which x == 1; a concurrent r1 = READ_ONCE(x) == 0 on another CPU sees the old value, forming a from-read (fr) edge]

  84. 90
    And Careful What You Wish For!!!
    [Same timing diagram; the red triangle marks the window during which the old value x == 0 is still visible to some CPUs]
    Hardware tricks help reduce the red triangle. But too bad about Meltdown and Spectre...

  85. 91
    Can Software Help?


  86. 92
    Can Software Help?

    Increasingly, yes!!!

    Use concurrent libraries, applications,
    subsystems, and so on
    – Let a few do the careful coding and tuning
    – Let a great many benefit from the work of a few

    Use proper APIs to deal with memory ordering
    – Chapter 14: “Advanced Synchronization: Memory Ordering”
    https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
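    As one concrete example of such an API, a minimal user-space C11 sketch (not from the slides); the Linux-kernel analogues are smp_store_release() and smp_load_acquire():

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload;
    static atomic_int ready;

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;                                            /* plain write */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* publish */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                                    /* spin until published */
        printf("payload = %d\n", payload);                       /* guaranteed to be 42 */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;

        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }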


  87. 94
    Summary

    Modern hardware is highly optimized
    – Most of the time!
    – Incremental improvements due to integration
    – But the speed of light is too slow and atoms too big

    Use concurrent software where available

    Structure your code to avoid the big obstacles
    – Micro-optimizations only sometimes needed


  88. 95
    For More Information

    “Computer Architecture: A Quantitative Approach”, Hennessy & Patterson (Recent HW)

    “Parallel Computer Architecture: A Hardware/Software Approach”, Culler et al.
    – Includes SGI Origin and Sequent NUMA-Q

    “Programming Massively Parallel Processors: A Hands-on Approach”, Kirk & Hwu
    – Primarily NVIDIA GPGPUs

    https://developer.nvidia.com/educators/existing-courses
    – List of NVIDIA university courseware

    https://gpuopen.com/professional-compute
    – List of AMD GPGPU-related content

    “Is Parallel Programming Hard, And, If So, What Can You Do About It?”
    – https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html


  89. 96
    Paul’s Wife’s Concurrency Antidote

    Wild Himalayan Blackberry Cordial
    – Put two liters of blackberries in a four-liter jar
    – Add five-eighths of a liter of sugar
    – Fill the jar with vodka
    – Shake every day for five days
    – Shake every week for five weeks
    – Pour through a sieve: Add the berries to ice cream, consume the
    filtered liquid as you wish

  90. 97
    Paul’s Wife’s Concurrency Antidote

    Wild Himalayan Blackberry Cordial
    – Put 8 cups wild Himalayan blackberries in 1 gallon jar
    – Add 2½ cups sugar
    – Fill jar with vodka
    – Shake every day for five days
    – Shake every week for five weeks
    – Pour through sieve: Add berries to ice cream, consume filtered
    liquid as you wish


  91. 98
    Paul’s Wife’s Concurrency Antidote

    Wild Himalayan Blackberry Cordial
    – 8 cups wild Himalayan blackberries in 1 gallon jar
    – Add 2½ cups sugar
    – Fill jar with vodka
    – Shake every day for five days
    – Shake every week for five weeks
    – Pour through sieve: Add berries to ice cream, consume filtered
    liquid as you wish
    If there is no right tool, invent it!!!
    Questions?
