How shit works: the CPU

Tomer Gabel
November 16, 2016

A talk given at BuildStuff 2016 in Vilnius, Lithuania.

The beautiful thing about software engineering is that it gives you the warm and fuzzy illusion of total understanding: I control this machine because I know how it operates. This is the result of layers upon layers of successful abstractions, which hide immense sophistication and complexity. As with any abstraction, though, these sometimes leak, and that's when a good grounding in what's under the hood pays off.

The second talk in this series peels back a few layers of abstraction and takes a look under the hood of our "car engine", the CPU. While hardly anyone codes in assembly language anymore, your C# or JavaScript (or Scala or...) application still ends up executing machine code instructions on a processor; that is why Java has a memory model, why memory layout still matters at scale, and why you're usually free to ignore these considerations and go about your merry way.

You'll come away knowing a little bit about a lot of different moving parts under the hood; after all, isn't understanding how the machine operates what this is all about?

  Full Disclosure
    • Bullshit ahead!
    • I'm not an expert
    • Explanations may be:
      – Simplified
      – Inaccurate
      – Wrong :-)
    • We'll barely scratch the surface

    // Generate a bunch of bytes
    byte[] data = new byte[32768];
    new Random().nextBytes(data);
    Arrays.sort(data);

    // Sum positive elements
    long sum = 0;
    for (int i = 0; i < data.length; i++)
        if (data[i] >= 0)
            sum += data[i];

    1. Which is faster: with or without the Arrays.sort(data) call?
    2. By how much?
    3. And crucially… why?!

  Surprise, Terror and Ruthless Efficiency
    # Run complete. Total time: 00:00:32

    Benchmark      Mode  Cnt    Score   Error  Units
    Baseline.sum   avgt    6  115.666 ± 3.137  us/op
    Presorted.sum  avgt    6   13.741 ± 0.524  us/op

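The table above has the shape of JMH output. For readers without a JMH setup, here is a rough sketch of the same comparison using plain `System.nanoTime`. The class name `SortedSumDemo`, the fixed seed and the round counts are assumptions of this sketch, not the talk's actual harness, and hand-rolled timing is far less trustworthy than JMH. Depending on the JVM and CPU, the JIT may also compile the loop without a branch, in which case the gap can shrink or vanish.

```java
import java.util.Arrays;
import java.util.Random;

class SortedSumDemo {
    // Sums the elements that pass the data[i] >= 0 check, like the loop on the slide.
    static long sumNonNegative(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                sum += data[i];
        return sum;
    }

    // Crude timing helper: average nanoseconds per call over `rounds` calls.
    static double averageNanos(byte[] data, int rounds) {
        long sink = 0;
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++)
            sink += sumNonNegative(data);
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(); // use the result so the work is not discarded
        return (double) elapsed / rounds;
    }

    public static void main(String[] args) {
        byte[] unsorted = new byte[32768];
        new Random(1).nextBytes(unsorted);
        byte[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        // Warm-up passes so the JIT has compiled the loop before we measure.
        averageNanos(unsorted, 2_000);
        averageNanos(sorted, 2_000);

        System.out.printf("unsorted: %.1f us/op%n", averageNanos(unsorted, 2_000) / 1_000);
        System.out.printf("sorted:   %.1f us/op%n", averageNanos(sorted, 2_000) / 1_000);
    }
}
```

Both arrays hold the same bytes, so the two runs do identical arithmetic; only the order of the elements differs.
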
  It Is Known
    • Your high-level code…

        long sum = 0;
        for (i = 0; i < length; i++)
            if (data[i] >= 0)
                sum += data[i];

    • Gets compiled down to…

        movsx  eax,BYTE PTR [rax+rdx*1+0x10]
        cmp    eax,0x0
        movabs rdx,0x11f3a9f60
        movabs rcx,0x128
        jl     0x000000010679e077
        movabs rcx,0x138
        mov    r8,QWORD PTR [rdx+rcx*1]
        lea    r8,[r8+0x1]
        mov    QWORD PTR [rdx+rcx*1],r8
        jl     0x000000010679e092
        movsxd rax,eax
        add    rax,rbx
        mov    rbx,rax
        inc    edi

  It Is Less Known
    • What happens then?
    • The instruction goes through phases…

        Instruction Stream → Fetch → Decode → Execute → Memory Access → Write-back

  CPU Architecture 101
    • What does a CPU do?
      – Reads the program
      – Figures it out
      – Executes it
      – Talks to memory
      – Performs I/O
    • Immense complexity!

  Execution Units
    • Arithmetic-Logic Unit (ALU)
      – Boolean algebra
      – Arithmetic
      – Memory accesses
      – Flow control
    • Floating Point Unit (FPU)
    • Memory Management Unit (MMU)
      – Memory mapping
      – Paging
      – Access control

  Pipelining
    • Sequential execution: each instruction (I0, I1, I2) passes through Fetch, Decode, Execute, Memory Access and Write-back before the next one starts
      – Latency = 5 cycles
      – Throughput = 0.2 ops / cycle
    • Pipelined execution: the stages of consecutive instructions overlap
      – Latency = 5 cycles
      – Throughput = 1 op / cycle

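The latency and throughput figures follow from simple cycle counting. The sketch below assumes an idealized five-stage pipeline that never stalls; `PipelineModel` and its method names are made up for illustration.

```java
class PipelineModel {
    static final int STAGES = 5; // Fetch, Decode, Execute, Memory Access, Write-back

    // Sequential execution: an instruction finishes all stages before the next starts.
    static long sequentialCycles(long instructions) {
        return instructions * STAGES;
    }

    // Ideal pipelined execution: the first instruction needs STAGES cycles,
    // each later one completes one cycle after its predecessor.
    static long pipelinedCycles(long instructions) {
        return instructions == 0 ? 0 : STAGES + (instructions - 1);
    }

    public static void main(String[] args) {
        long n = 1_000;
        System.out.printf("sequential: %.3f ops/cycle%n", (double) n / sequentialCycles(n));
        System.out.printf("pipelined:  %.3f ops/cycle%n", (double) n / pipelinedCycles(n));
    }
}
```

The latency of a single instruction stays at five cycles in both models; only the rate at which instructions complete changes, and the pipelined throughput approaches 1 op / cycle as the instruction count grows.
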
  Pipelining
    • A pipeline can stall
    • This happens with:
      – Branches

          if (i < 0) i++; else i--;

      – Dependent instructions

          i++;
          x = i + 1;

    • A.K.A. pipeline bubbling
    [Diagram: incrementing a memory address becomes load from memory, add +1, store in memory; each step waits on the previous one, stalling the F/D/E/M/W pipeline]

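One common way to give the pipeline independent work is to split a single dependency chain into several. The following is a minimal sketch of that idea, not something shown in the talk: `DependencyDemo` and its methods are hypothetical names, and whether the second version is measurably faster depends on the CPU and on what the JIT already does with the first (it may unroll or vectorize the loop on its own).

```java
class DependencyDemo {
    // One accumulator: every addition depends on the result of the previous one.
    static long sumSingleChain(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            sum += data[i];
        return sum;
    }

    // Four accumulators: the four additions in one iteration are independent
    // of each other, so none of them has to wait for another's result.
    static long sumFourChains(int[] data) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < data.length; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        long sum = s0 + s1 + s2 + s3;
        for (; i < data.length; i++) // leftover elements when length is not a multiple of 4
            sum += data[i];
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sumSingleChain(data) == sumFourChains(data));
    }
}
```
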
  1. Memory is Slow
    • RAM access is ~60ns
    • Random access on a 4GHz, 64-bit CPU:
      – 250 cycles / memory access
      – 130MB / second bandwidth
    • Surely we can do better!

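The two figures on this slide can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes that each random access yields one 8-byte word and that accesses never overlap; `MemoryMath` is a hypothetical helper. With a 60ns latency it gives 240 cycles and about 133 MB/s, which the slide rounds to 250 and 130.

```java
class MemoryMath {
    // Cycles spent waiting for one access: latency (ns) times clock rate (cycles per ns).
    static double cyclesPerAccess(double latencyNs, double clockGhz) {
        return latencyNs * clockGhz;
    }

    // Bandwidth in MB/s when every access delivers one machine word
    // and the next access only starts once the previous one has finished.
    static double randomAccessMBps(double latencyNs, int wordBytes) {
        double accessesPerSecond = 1e9 / latencyNs;
        return accessesPerSecond * wordBytes / 1e6;
    }

    public static void main(String[] args) {
        System.out.printf("%.0f cycles / access%n", cyclesPerAccess(60, 4.0));
        System.out.printf("%.1f MB/s%n", randomAccessMBps(60, 8));
    }
}
```
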
  Enter: CPU Cache
    Level        Size         Latency
    L1           32KB + 32KB  1ns
    L2           256KB        3ns
    L3           4MB          11ns
    Main Memory               62ns

    Intel i7-6700 "Skylake" at 4 GHz

  Enter: CPU Cache
    • The unit of work is called a cache line
      – 64 bytes on x86
      – LRU eviction policy
    • Why is sequential access fast?
      – Cache prefetching

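A small experiment makes the cache line visible, in the spirit of the ones in Igor Ostrovsky's gallery listed under Further Reading. The sketch assumes 4-byte `int` elements and 64-byte lines, so a stride of 16 touches one element per cache line; `StrideDemo` is a hypothetical name and the timings it prints vary by machine.

```java
class StrideDemo {
    // Multiplies every `stride`-th element and returns how many elements were touched.
    static int touch(int[] data, int stride) {
        int touched = 0;
        for (int i = 0; i < data.length; i += stride) {
            data[i] *= 3;
            touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        int[] data = new int[16 * 1024 * 1024]; // 64 MB, larger than the caches in the table
        for (int stride : new int[] {1, 16}) {  // 16 ints = 64 bytes = one cache line
            touch(data, stride); // warm-up pass
            long start = System.nanoTime();
            int touched = touch(data, stride);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("stride %2d: %d elements, %d us%n", stride, touched, micros);
        }
    }
}
```

The stride-16 loop does one sixteenth of the multiplications but still pulls in every cache line of the array, so on many machines it is far from sixteen times faster.
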
  In Real Life
    • Let's rotate an image!

        for (y = 0; y < height; y++)
            for (x = 0; x < width; x++) {
                int from = y * width + x;
                int to = x * height + y;
                target[to] = source[from];
            }

  In Real Life
    • This is not efficient
    • Reads are sequential
    • Writes aren't, though
    • Different strides
      – Worst case wins :-(
    [Diagram: reads walk source indices 0, 1, 2, 3, … 9; writes land on target indices 0, 10, 20, 30, … 90]

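To put a number on the difference in strides, the sketch below counts how many distinct cache lines the naive loop reads and writes while processing a single row. It assumes one byte per pixel, 64-byte cache lines and arrays that start on a line boundary; `TransposeLines` is a hypothetical helper.

```java
import java.util.HashSet;
import java.util.Set;

class TransposeLines {
    static final int LINE_BYTES = 64; // cache line size on x86

    // Distinct cache lines read from `source` while the naive loop handles row y.
    static int linesRead(int width, int y) {
        Set<Integer> lines = new HashSet<>();
        for (int x = 0; x < width; x++)
            lines.add((y * width + x) / LINE_BYTES);
        return lines.size();
    }

    // Distinct cache lines written in `target` while the naive loop handles row y.
    static int linesWritten(int width, int height, int y) {
        Set<Integer> lines = new HashSet<>();
        for (int x = 0; x < width; x++)
            lines.add((x * height + y) / LINE_BYTES);
        return lines.size();
    }

    public static void main(String[] args) {
        System.out.println(linesRead(1024, 0));          // 16
        System.out.println(linesWritten(1024, 1024, 0)); // 1024
    }
}
```

For a 1024 x 1024 image, one row of reads stays within 16 cache lines while the matching writes each land on a different line.
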
  Cache-Friendly Algorithms
    • Use blocking or tiling

        for (y = 0; y < height; y += blockHeight)
            for (x = 0; x < width; x += blockWidth)
                for (by = 0; by < blockHeight; by++)
                    for (bx = 0; bx < blockWidth; bx++) {
                        int from = (y + by) * width + (x + bx);
                        int to = (x + bx) * height + (y + by);
                        target[to] = source[from];
                    }

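The loop on the slide implicitly assumes that the image dimensions are multiples of the block size. Below is a self-contained sketch that clamps each tile to the image bounds and checks the tiled result against the naive one. `TiledTranspose`, the square `block` parameter and the 5 x 7 test image are assumptions of this sketch, not the code behind the talk's benchmark.

```java
import java.util.Arrays;

class TiledTranspose {
    // Naive version: sequential reads, strided writes.
    static void transposeNaive(byte[] source, byte[] target, int width, int height) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                target[x * height + y] = source[y * width + x];
    }

    // Tiled version: processes block x block tiles, clamped at the image edges.
    static void transposeTiled(byte[] source, byte[] target, int width, int height, int block) {
        for (int y = 0; y < height; y += block) {
            int yEnd = Math.min(y + block, height);
            for (int x = 0; x < width; x += block) {
                int xEnd = Math.min(x + block, width);
                for (int by = y; by < yEnd; by++)
                    for (int bx = x; bx < xEnd; bx++)
                        target[bx * height + by] = source[by * width + bx];
            }
        }
    }

    public static void main(String[] args) {
        int width = 5, height = 7; // deliberately not multiples of the block size
        byte[] source = new byte[width * height];
        for (int i = 0; i < source.length; i++)
            source[i] = (byte) i;

        byte[] naive = new byte[source.length];
        byte[] tiled = new byte[source.length];
        transposeNaive(source, naive, width, height);
        transposeTiled(source, tiled, width, height, 4);
        System.out.println(Arrays.equals(naive, tiled)); // true
    }
}
```
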
  Cache-Friendly Algorithms
    • The results?

    Benchmark                            Mode  Cnt   Score   Error  Units
    CachingShowcase.transposeNaive       avgt   10  43.851 ± 6.000  ms/op
    CachingShowcase.transposeTiled8x8    avgt   10  20.641 ± 1.646  ms/op
    CachingShowcase.transposeTiled16x16  avgt   10  18.515 ± 1.833  ms/op
    CachingShowcase.transposeTiled48x48  avgt   10  21.941 ± 1.954  ms/op

    x2.37 speedup!

  2. Those Pesky Branches
    • Do I go left or right?
    • Need input!
    • … but can't wait for it
    • Maybe...
      – Take a guess?
      – Based on historic trends?
    • Sounds speculative

  Those Pesky Branches
    • Enter: Branch Prediction
    • Concurrently:
      – Speculate branch
      – Evaluate condition
    • It's now a tradeoff
      – Commit is fast
      – Rollback is slow

  Back to Our Conundrum
    // Generate a bunch of bytes
    byte[] data = new byte[32768];
    new Random().nextBytes(data);
    Arrays.sort(data);

    // Sum positive elements
    long sum = 0;
    for (int i = 0; i < data.length; i++)
        if (data[i] >= 0)
            sum += data[i];

    • Can you guess?

  Catharsis
    Original data array:

        54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40

    After sorting:

        -61 -37 -9 -4 -2 0 10 13 14 15 25 40 41 54

    • Before the 0: data[i] >= 0 is always false!
    • From the 0 onwards: data[i] >= 0 is always true!

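The effect can be simulated with a toy predictor. The sketch below runs a single 2-bit saturating counter, a textbook scheme, over the two arrays from this slide and counts wrong guesses for the `data[i] >= 0` branch. Real branch predictors are far more sophisticated, and `TwoBitPredictor` and its starting state are assumptions of this sketch.

```java
import java.util.Arrays;

class TwoBitPredictor {
    // Counts mispredictions of the branch `value >= 0` using one 2-bit
    // saturating counter: states 0-1 predict "false", states 2-3 predict "true".
    static int mispredictions(int[] data) {
        int state = 0; // start at "strongly false"
        int misses = 0;
        for (int value : data) {
            boolean actual = value >= 0;
            boolean predicted = state >= 2;
            if (predicted != actual)
                misses++;
            // Move one step towards the observed outcome, saturating at 0 and 3.
            state = actual ? Math.min(3, state + 1) : Math.max(0, state - 1);
        }
        return misses;
    }

    public static void main(String[] args) {
        int[] original = {54, 10, -4, -2, 15, 41, -37, 13, 0, -9, 14, 25, -61, 40};
        int[] sorted = original.clone();
        Arrays.sort(sorted);
        System.out.println(mispredictions(original)); // 9
        System.out.println(mispredictions(sorted));   // 2
    }
}
```

On the sorted array the toy predictor is only wrong around the single point where the outcome flips, while on the original order it is wrong for 9 of the 14 elements.
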
  Further Reading
    • Jason Robert Carey Patterson – Modern Microprocessors, a 90-Minute Guide
    • Igor Ostrovsky – Gallery of Processor Cache Effects
    • Piyush Kumar – Cache Oblivious Algorithms