
How shit works: the CPU

A talk given at BuildStuff 2016 in Vilnius, Lithuania.

The beautiful thing about software engineering is that it gives you the warm and fuzzy illusion of total understanding: I control this machine because I know how it operates. This is the result of layers upon layers of successful abstractions, which hide immense sophistication and complexity. As with any abstraction, though, these sometimes leak, and that's when a good grounding in what's under the hood pays off.

The second talk in this series peels back a few layers of abstraction and takes a look under the hood of our "car engine", the CPU. While hardly anyone codes in assembly language anymore, your C# or JavaScript (or Scala or...) application still ends up executing machine code instructions on a processor; that is why Java has a memory model, why memory layout still matters at scale, and why you're usually free to ignore these considerations and go about your merry way.

You'll come away knowing a little bit about a lot of different moving parts under the hood; after all, isn't understanding how the machine operates what this is all about?

Tomer Gabel

November 16, 2016

Transcript

  1. Full Disclosure
     Bullshit ahead!
     • I'm not an expert
     • Explanations may be:
       – Simplified
       – Inaccurate
       – Wrong :-)
     • We'll barely scratch the surface
     Image: Public Domain
  2. Setting the Stage
     // Generate a bunch of bytes
     byte[] data = new byte[32768];
     new Random().nextBytes(data);
     Arrays.sort(data);

     // Sum positive elements
     long sum = 0;
     for (int i = 0; i < data.length; i++)
       if (data[i] >= 0)
         sum += data[i];

     1. Which is faster?
     2. By how much?
     3. And crucially… why?!
  3. Surprise, Terror and Ruthless Efficiency
     # Run complete. Total time: 00:00:32
     Benchmark      Mode  Cnt    Score   Error  Units
     Baseline.sum   avgt    6  115.666 ± 3.137  us/op
     Presorted.sum  avgt    6   13.741 ± 0.524  us/op
     * Ignoring setup cost
  4. It Is Known
     • Your high-level code…
     long sum = 0;
     for (i = 0; i < length; i++)
       if (data[i] >= 0)
         sum += data[i];
     • Gets compiled down to…
     movsx  eax,BYTE PTR [rax+rdx*1+0x10]
     cmp    eax,0x0
     movabs rdx,0x11f3a9f60
     movabs rcx,0x128
     jl     0x000000010679e077
     movabs rcx,0x138
     mov    r8,QWORD PTR [rdx+rcx*1]
     lea    r8,[r8+0x1]
     mov    QWORD PTR [rdx+rcx*1],r8
     jl     0x000000010679e092
     movsxd rax,eax
     add    rax,rbx
     mov    rbx,rax
     inc    edi
  5. It Is Less Known
     • What happens then?
     • The instruction goes through phases…
     Instruction Stream: Fetch → Decode → Execute → Memory Access → Write-back
  6–10. CPU Architecture 101 (progressive build across five slides)
     • What does a CPU do?
       – Reads the program
       – Figures it out
       – Executes it
       – Talks to memory
       – Performs I/O
     • Immense complexity!
  11. Execution Units
     • Arithmetic-Logic Unit (ALU)
       – Boolean algebra
       – Arithmetic
       – Memory accesses
       – Flow control
     • Floating Point Unit (FPU)
     • Memory Management Unit (MMU)
       – Memory mapping
       – Paging
       – Access control
     Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source
  12. Pipelining: Sequential Execution
     [Diagram: instructions I0–I2 each pass through Fetch → Decode → Execute → Memory Access → Write-back, one after another]
     Latency = 5 cycles
     Throughput = 0.2 ops / cycle
  13. Pipelining: Sequential vs. Pipelined Execution
     [Diagram: in the pipelined case the stages of I0–I2 overlap, each instruction starting one cycle after the previous]
     Sequential: Latency = 5 cycles, Throughput = 0.2 ops / cycle
     Pipelined:  Latency = 5 cycles, Throughput = 1 op / cycle
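The throughput figures on the pipelining slides follow from simple arithmetic: with S stages, N instructions take S × N cycles sequentially, but only S + (N − 1) cycles pipelined, since one instruction retires per cycle once the pipeline is full. A minimal sketch of that calculation (the instruction count of 1000 is illustrative):

```java
public class PipelineMath {
    public static void main(String[] args) {
        int stages = 5;          // Fetch, Decode, Execute, Memory Access, Write-back
        int instructions = 1000;

        // Sequential: each instruction waits for the previous one to finish.
        long sequentialCycles = (long) stages * instructions;   // 5000

        // Pipelined: first instruction takes 5 cycles, then one retires per cycle.
        long pipelinedCycles = stages + (instructions - 1);     // 1004

        System.out.printf("sequential: %d cycles (%.2f ops/cycle)%n",
                sequentialCycles, (double) instructions / sequentialCycles);
        System.out.printf("pipelined:  %d cycles (%.2f ops/cycle)%n",
                pipelinedCycles, (double) instructions / pipelinedCycles);
    }
}
```

As N grows, N / (S + N − 1) approaches 1 op/cycle, which is the ideal throughput the slide quotes.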
  14. Pipelining
     • A pipeline can stall
     • This happens with:
       – Branches
     if (i < 0) i++; else i--;
     [Diagram: Memory Load (F D E M W), Test (F D E M), Conditional Jump (F D E ? ?) – the jump cannot proceed until the test resolves]
  15. Pipelining
     • A pipeline can stall
     • This happens with:
       – Branches
       – Dependent instructions
     • A.K.A. pipeline bubbling
     i++;        // Load from memory, Add +1, Store in memory
     x = i + 1;  // Stall
     [Diagram: the second instruction's stages (F D …) stall behind the first instruction's memory write]
  16. 1. Memory is Slow
     • RAM access is ~60ns
     • Random access on a 4GHz, 64-bit CPU:
       – 250 cycles / memory access
       – 130MB / second bandwidth
     • Surely we can do better!
     Image: Noah Wieder (Public Domain). Source: 7-cpu.com
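The 250-cycle and 130MB/s figures fall straight out of the latency number. A back-of-the-envelope check, using the ~62ns main-memory latency from the next slide and assuming one 64-bit word per access:

```java
public class MemoryMath {
    public static void main(String[] args) {
        double latencyNs = 62.0;  // main-memory access latency (see the cache table)
        double clockGhz  = 4.0;   // CPU clock

        // Each dependent random access stalls the core for latency * clock cycles.
        double cyclesPerAccess = latencyNs * clockGhz;        // ~248 cycles

        // One 8-byte word per access caps effective random-access bandwidth.
        double mbPerSecond = 8.0 / (latencyNs * 1e-9) / 1e6;  // ~129 MB/s

        System.out.printf("%.0f cycles/access, %.0f MB/s%n",
                cyclesPerAccess, mbPerSecond);
    }
}
```

Both results round to the ~250 cycles and ~130MB/s the slide quotes, which is why pointer-chasing workloads are so much slower than their instruction counts suggest.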
  17. Enter: CPU Cache
     Level        Size         Latency
     L1           32KB + 32KB  1ns
     L2           256KB        3ns
     L3           4MB          11ns
     Main Memory               62ns
     Intel i7-6700 "Skylake" at 4 GHz
     Image: Ferry24.Milan (CC BY-SA 3.0). Source: 7-cpu.com
  18. Enter: CPU Cache
     • A unit of work is called a cache line
       – 64 bytes on x86
       – LRU eviction policy
     • Why is sequential access fast?
       – Cache prefetching
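Prefetching rewards predictable, sequential access. A quick way to see the effect is to sum the same 2D array twice, once in memory order and once with a large stride (the array size is arbitrary; actual timings vary by machine, so the only invariant is that both traversals produce the same sum):

```java
public class StrideDemo {
    static final int N = 4096;                 // 4096 x 4096 ints = 64 MB
    static final int[] grid = new int[N * N];

    // Sequential: walks memory in cache-line order; the prefetcher keeps up.
    static long sumRowMajor() {
        long sum = 0;
        for (int y = 0; y < N; y++)
            for (int x = 0; x < N; x++)
                sum += grid[y * N + x];
        return sum;
    }

    // Strided: jumps 16KB between reads, touching a new cache line every time.
    static long sumColMajor() {
        long sum = 0;
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
                sum += grid[y * N + x];
        return sum;
    }

    public static void main(String[] args) {
        java.util.Arrays.setAll(grid, i -> i % 251);
        long t0 = System.nanoTime(); long a = sumRowMajor();
        long t1 = System.nanoTime(); long b = sumColMajor();
        long t2 = System.nanoTime();
        System.out.printf("row-major %d ms, col-major %d ms, sums equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

On typical hardware the strided traversal is several times slower even though it executes exactly the same number of additions.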
  19. In Real Life
     • Let's rotate an image!
     for (y = 0; y < height; y++)
       for (x = 0; x < width; x++) {
         int from = y * width + x;
         int to   = x * height + y;
         target[to] = source[from];
       }
     Image: EgoAltere (CC0 Public Domain)
  20–22. In Real Life (progressive build across three slides)
     • This is not efficient
     • Reads are sequential
     • Writes aren't, though
     • Different strides
       – Worst case wins :-(
     [Diagram: source row 0 1 2 … 9 is read in order, but its elements land at target offsets 0, 10, 20, … 90]
  23. Cache-Friendly Algorithms
     • Use blocking or tiling
     for (y = 0; y < height; y += blockHeight)
       for (x = 0; x < width; x += blockWidth)
         for (by = 0; by < blockHeight; by++)
           for (bx = 0; bx < blockWidth; bx++) {
             int from = (y + by) * width + (x + bx);
             int to   = (x + bx) * height + (y + by);
             target[to] = source[from];
           }
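The tiled loop keeps both reads and writes inside a block small enough to stay cached. A self-contained sketch that checks the tiled transpose against the naive one (dimensions and the 16-element block are illustrative, and chosen here as multiples of the block size to keep the sketch short):

```java
public class TiledTranspose {
    // Naive transpose: sequential reads, strided writes.
    static void naive(byte[] src, byte[] dst, int width, int height) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                dst[x * height + y] = src[y * width + x];
    }

    // Tiled transpose: works block-by-block so both sides stay cache-resident.
    static void tiled(byte[] src, byte[] dst, int width, int height, int block) {
        for (int y = 0; y < height; y += block)
            for (int x = 0; x < width; x += block)
                for (int by = 0; by < block; by++)
                    for (int bx = 0; bx < block; bx++)
                        dst[(x + bx) * height + (y + by)] =
                            src[(y + by) * width + (x + bx)];
    }

    public static void main(String[] args) {
        int width = 1024, height = 768;
        byte[] src = new byte[width * height];
        new java.util.Random(42).nextBytes(src);
        byte[] a = new byte[src.length], b = new byte[src.length];
        naive(src, a, width, height);
        tiled(src, b, width, height, 16);
        System.out.println(java.util.Arrays.equals(a, b)); // prints "true"
    }
}
```

A production version would also handle dimensions that are not multiples of the block size, typically with edge loops for the remainder.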
  24. Cache-Friendly Algorithms
     • The results?
     Benchmark                            Mode  Cnt   Score   Error  Units
     CachingShowcase.transposeNaive       avgt   10  43.851 ± 6.000  ms/op
     CachingShowcase.transposeTiled8x8    avgt   10  20.641 ± 1.646  ms/op
     CachingShowcase.transposeTiled16x16  avgt   10  18.515 ± 1.833  ms/op
     CachingShowcase.transposeTiled48x48  avgt   10  21.941 ± 1.954  ms/op
     x2.37 speedup!
  25. 2. Those Pesky Branches
     • Do I go left or right?
     • Need input!
     • … but can't wait for it
     • Maybe...
       – Take a guess?
       – Based on historic trends?
     • Sounds speculative
     Image: Michael Dolan (CC BY 2.0)
  26. Those Pesky Branches
     • Enter: Branch Prediction
     • Concurrently:
       – Speculate branch
       – Evaluate condition
     • It's now a tradeoff
       – Commit is fast
       – Rollback is slow
     Image: Alejandro C. (CC BY-NC 2.0)
  27. Back to Our Conundrum
     • Can you guess?
       – 3… – 2... – 1...
     • Here it is!
     // Generate a bunch of bytes
     byte[] data = new byte[32768];
     new Random().nextBytes(data);
     Arrays.sort(data);

     // Sum positive elements
     long sum = 0;
     for (int i = 0; i < data.length; i++)
       if (data[i] >= 0)
         sum += data[i];
  28. Catharsis
     Original data array:
     54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40
  29. Catharsis
     After sorting:
     -61 -37 -9 -4 -2 0 10 13 14 15 25 40 41 54
     • Before the 0: data[i] >= 0 is always false!
     • From the 0 onward: data[i] >= 0 is always true!
     The predictor learns the pattern, and the branch becomes nearly free.
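Sorting makes the branch predictable, but another option (not covered in the talk) is to remove the branch entirely with a bit trick: for a 32-bit int `v`, `v >> 31` is all ones when `v` is negative and zero otherwise, so masking with its complement zeroes out negative elements without any conditional jump. A hedged sketch:

```java
public class BranchlessSum {
    public static void main(String[] args) {
        byte[] data = new byte[32768];
        new java.util.Random().nextBytes(data);  // deliberately unsorted

        // Branchy version, as in the talk.
        long branchy = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                branchy += data[i];

        // Branch-free: mask out negative values instead of testing them.
        long branchless = 0;
        for (int i = 0; i < data.length; i++) {
            int v = data[i];                // byte sign-extends to int
            branchless += v & ~(v >> 31);   // v if v >= 0, else 0
        }

        System.out.println(branchy == branchless); // prints "true"
    }
}
```

On random data this sidesteps misprediction altogether, which is why JIT compilers sometimes emit conditional-move or masking code for exactly this pattern.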
  30. QUESTIONS?
     Thank you for listening
     [email protected] • @tomerg • http://engineering.wix.com
     Sources and Examples: https://goo.gl/f7NfGT
     This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
  31. Further Reading
     • Jason Robert Carey Patterson – Modern Microprocessors, a 90-Minute Guide
     • Igor Ostrovsky – Gallery of Processor Cache Effects
     • Piyush Kumar – Cache Oblivious Algorithms