Slide 1

How shit works: the CPU
Tomer Gabel
BuildStuff 2016, Lithuania
Image: Telecarlos (CC BY-SA 3.0)

Slide 2

Full Disclosure
Bullshit ahead!
• I’m not an expert
• Explanations may be:
  – Simplified
  – Inaccurate
  – Wrong :-)
• We’ll barely scratch the surface
Image: Public Domain

Slide 3

Are you ready for…
A CONUNDRUM?
Image: Louis Reed (CC BY-SA 4.0)

Slide 4

Setting the Stage

  // Generate a bunch of bytes
  byte[] data = new byte[32768];
  new Random().nextBytes(data);
  Arrays.sort(data);

  // Sum positive elements
  long sum = 0;
  for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
      sum += data[i];

1. Which is faster: the loop as written, or the same loop without the Arrays.sort call?
2. By how much?
3. And crucially… why?!

Slide 5

Surprise, Terror and Ruthless Efficiency

  # Run complete. Total time: 00:00:32

  Benchmark      Mode  Cnt    Score   Error  Units
  Baseline.sum   avgt    6  115.666 ± 3.137  us/op
  Presorted.sum  avgt    6   13.741 ± 0.524  us/op

* Ignoring setup cost
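The figures above come from JMH. As a rough sketch of how such a comparison could be set up (the class name, the fixed seed and the use of a boolean @Param instead of two separate benchmark classes are my choices, not the slides'):

  import java.util.Arrays;
  import java.util.Random;
  import java.util.concurrent.TimeUnit;
  import org.openjdk.jmh.annotations.*;

  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.MICROSECONDS)
  @State(Scope.Thread)
  public class SumBenchmark {
      @Param({"false", "true"})
      boolean presorted;     // false ~ Baseline, true ~ Presorted

      byte[] data;

      @Setup
      public void setup() {
          data = new byte[32768];
          new Random(42).nextBytes(data);    // fixed seed: both variants sum the same bytes
          if (presorted) Arrays.sort(data);  // sorting happens in setup, outside the measured code
      }

      @Benchmark
      public long sum() {
          long sum = 0;
          for (int i = 0; i < data.length; i++)
              if (data[i] >= 0) sum += data[i];
          return sum;
      }
  }

Run through the JMH runner (or the generated benchmarks jar), this should produce two rows like the ones above, one per value of presorted.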

Slide 6

CPUS ARE COMPLEX BEASTS. Image: Pauli Rautakorpi (CC BY 3.0)

Slide 7

It Is Known
• Your high-level code…

  long sum = 0;
  for (i = 0; i < length; i++)
    if (data[i] >= 0)
      sum += data[i];

• Gets compiled down to…

  movsx  eax,BYTE PTR [rax+rdx*1+0x10]
  cmp    eax,0x0
  movabs rdx,0x11f3a9f60
  movabs rcx,0x128
  jl     0x000000010679e077
  movabs rcx,0x138
  mov    r8,QWORD PTR [rdx+rcx*1]
  lea    r8,[r8+0x1]
  mov    QWORD PTR [rdx+rcx*1],r8
  jl     0x000000010679e092
  movsxd rax,eax
  add    rax,rbx
  mov    rbx,rax
  inc    edi
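The slides don't say how that disassembly was captured, but output in this shape is what the HotSpot JIT emits when run with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (which additionally needs the hsdis disassembler plugin on the library path), so that is one way to reproduce this kind of listing for your own code.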

Slide 8

It Is Less Known
• What happens then?
• Each instruction in the instruction stream goes through phases…
  Fetch → Decode → Execute → Memory Access → Write-back

Slide 9

CPU Architecture 101 Image: Appaloosa (CC BY-SA 3.0)

Slide 10

CPU Architecture 101
• What does a CPU do?
  – Reads the program

Slide 11

CPU Architecture 101
• What does a CPU do?
  – Reads the program
  – Figures it out

Slide 12

CPU Architecture 101
• What does a CPU do?
  – Reads the program
  – Figures it out
  – Executes it

Slide 13

CPU Architecture 101
• What does a CPU do?
  – Reads the program
  – Figures it out
  – Executes it
  – Talks to memory

Slide 14

CPU Architecture 101
• What does a CPU do?
  – Reads the program
  – Figures it out
  – Executes it
  – Talks to memory
  – Performs I/O

Slide 15

CPU Architecture 101
• What does a CPU do?
  – Reads the program
  – Figures it out
  – Executes it
  – Talks to memory
  – Performs I/O
• Immense complexity!

Slide 16

Execution Units
• Arithmetic-Logic Unit (ALU)
  – Boolean algebra
  – Arithmetic
  – Memory accesses
  – Flow control
• Floating Point Unit (FPU)
• Memory Management Unit (MMU)
  – Memory mapping
  – Paging
  – Access control
Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source

Slide 17

DESIGN CONSIDERATIONS Image: William M. Plate Jr. (Public Domain)

Slide 18

Pipelining
• Sequential Execution: each instruction (I0, I1, I2) passes through all five stages
  (Fetch, Decode, Execute, Memory Access, Write-back) before the next one starts
  – Latency = 5 cycles
  – Throughput = 0.2 ops / cycle

Slide 19

Pipelining
• Sequential Execution
  – Latency = 5 cycles
  – Throughput = 0.2 ops / cycle
• Pipelined Execution: the stages of I0, I1, I2 overlap, so a new instruction enters the pipeline each cycle
  – Latency = 5 cycles
  – Throughput = 1 ops / cycle
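A quick sanity check of those figures: with five stages and no overlap, three instructions occupy 3 × 5 = 15 cycles, i.e. 3 / 15 = 0.2 instructions per cycle; with the stages overlapped they finish after 5 + 2 = 7 cycles, and a long enough instruction stream retires one instruction per cycle once the pipeline is full.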

Slide 20

Pipelining
• A pipeline can stall
• This happens with:
  – Branches

  if (i < 0) i++; else i--;

  [Pipeline diagram: the memory load (F D E M W), the test (F D E M) and the conditional
  jump (F D E) are in flight, but the instructions after the jump remain unknown (? ? ???)
  until the branch resolves]

Slide 21

Pipelining
• A pipeline can stall
• This happens with:
  – Branches
  – Dependent instructions (a sketch of the cost follows below)
• A.K.A. pipeline bubbling

  i++;
  x = i + 1;

  [Pipeline diagram: i++ becomes "load from memory", "add +1", "store in memory"
  (F D E M W…); the instructions computing x = i + 1 stall until the incremented
  value of i is available]
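Not from the slides, but one way to feel the cost of such dependency chains: compare a sum where every addition depends on the previous one with a version that keeps two independent partial sums the pipeline can overlap. A minimal sketch (the even-length assumption is mine, purely to keep it short):

  // Every iteration depends on the previous value of sum: one long dependency chain.
  static long sumDependent(long[] values) {
      long sum = 0;
      for (int i = 0; i < values.length; i++)
          sum += values[i];
      return sum;
  }

  // Two independent partial sums give the pipeline (and out-of-order hardware) work to overlap.
  static long sumInterleaved(long[] values) {   // assumes values.length is even
      long a = 0, b = 0;
      for (int i = 0; i < values.length; i += 2) {
          a += values[i];
          b += values[i + 1];
      }
      return a + b;
  }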

Slide 22

PRACTICAL RAMIFICATIONS Image: Hangsna (CC BY-SA 3.0)

Slide 23

1. Memory is Slow
• RAM access is ~60ns
• Random access on a 4GHz, 64-bit CPU:
  – 250 cycles / memory access
  – 130MB / second bandwidth
• Surely we can do better!
Image: Noah Wieder (Public Domain)
Source: 7-cpu.com
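The arithmetic behind those two bullets, assuming one 8-byte (64-bit) word per ~60 ns random access: a 4 GHz clock means 0.25 ns per cycle, so 60 ns is roughly 240–250 cycles per access, and 8 bytes every 60 ns works out to about 133 MB per second, hence the ~130 MB/s figure.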

Slide 24

Enter: CPU Cache

  Level         Size          Latency
  L1            32KB + 32KB   1ns
  L2            256KB         3ns
  L3            4MB           11ns
  Main Memory                 62ns

Intel i7-6700 “Skylake” at 4 GHz
Image: Ferry24.Milan (CC BY-SA 3.0)
Source: 7-cpu.com
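To put that table in cycle terms at the quoted 4 GHz clock (0.25 ns per cycle): L1 ≈ 4 cycles, L2 ≈ 12 cycles, L3 ≈ 44 cycles, and main memory ≈ 248 cycles, which is where the roughly 250 cycles per access on the previous slide comes from.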

Slide 25

Enter: CPU Cache
• The unit of work is called a cache line
  – 64 bytes on x86
  – LRU eviction policy
• Why is sequential access fast?
  – Cache prefetching (see the sketch below)
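Not from the slides, but a small sketch of what that buys you (the stride is an arbitrary choice of mine): both methods below add up the same bytes, yet the sequential walk uses every byte of each 64-byte line and lets the prefetcher run ahead, while a large stride touches a fresh cache line on almost every access.

  // Walks the array in order: each 64-byte line is fully used, and prefetching hides the latency.
  static long sumSequential(byte[] data) {
      long sum = 0;
      for (int i = 0; i < data.length; i++)
          sum += data[i];
      return sum;
  }

  // Visits the same elements, but jumps 'stride' bytes at a time (e.g. 64),
  // so nearly every access lands on a different cache line.
  static long sumStrided(byte[] data, int stride) {
      long sum = 0;
      for (int start = 0; start < stride; start++)
          for (int i = start; i < data.length; i += stride)
              sum += data[i];
      return sum;
  }

On an array much larger than L3, the strided version is typically several times slower even though it performs exactly the same additions.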

Slide 26

In Real Life
• Let’s rotate an image!

  for (y = 0; y < height; y++)
    for (x = 0; x < width; x++) {
      int from = y * width + x;
      int to = x * height + y;
      target[to] = source[from];
    }

Image: EgoAltere (CC0 Public Domain)

Slide 27

In Real Life
• This is not efficient
• Reads are sequential
  [Diagram: a grid of indices; the source cells 0 1 2 3 … 9 are read in order]

Slide 28

In Real Life
• This is not efficient
• Reads are sequential
  [Diagram: the same grid; the read position advances along the row 0 1 2 3 … 9]

Slide 29

In Real Life
• This is not efficient
• Reads are sequential
• Writes aren’t, though
• Different strides
  – Worst case wins :-(
  [Diagram: reads advance through indices 0 1 2 3 … 9 in order, while the matching
  writes land at 0, 10, 20, 30, … 90]

Slide 30

Cache-Friendly Algorithms
• Use blocking or tiling

  for (y = 0; y < height; y += blockHeight)
    for (x = 0; x < width; x += blockWidth)
      for (by = 0; by < blockHeight; by++)
        for (bx = 0; bx < blockWidth; bx++) {
          int from = (y + by) * width + (x + bx);
          int to = (x + bx) * height + (y + by);
          target[to] = source[from];
        }

Slide 31

Cache-Friendly Algorithms
• The results?

  Benchmark                            Mode  Cnt   Score   Error  Units
  CachingShowcase.transposeNaive       avgt   10  43.851 ± 6.000  ms/op
  CachingShowcase.transposeTiled8x8    avgt   10  20.641 ± 1.646  ms/op
  CachingShowcase.transposeTiled16x16  avgt   10  18.515 ± 1.833  ms/op
  CachingShowcase.transposeTiled48x48  avgt   10  21.941 ± 1.954  ms/op

• x2.37 speedup!

Slide 32

2. Those Pesky Branches
• Do I go left or right?
• Need input!
• … but can’t wait for it
• Maybe...
  – Take a guess?
  – Based on historic trends?
• Sounds speculative
Image: Michael Dolan (CC BY 2.0)

Slide 33

Those Pesky Branches
• Enter: Branch Prediction
• Concurrently:
  – Speculate branch
  – Evaluate condition
• It’s now a tradeoff
  – Commit is fast
  – Rollback is slow
Image: Alejandro C. (CC BY-NC 2.0)

Slide 34

Back to Our Conundrum
• Can you guess?
  – 3…
  – 2...
  – 1...
• Here it is!

  // Generate a bunch of bytes
  byte[] data = new byte[32768];
  new Random().nextBytes(data);
  Arrays.sort(data);

  // Sum positive elements
  long sum = 0;
  for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
      sum += data[i];

Slide 35

Catharsis
• Original data array:

  54  10  -4  -2  15  41  -37  13  0  -9  14  25  -61  40

Slide 36

Catharsis
• After sorting:

  -61  -37  -9  -4  -2  0  10  13  14  15  25  40  41  54

• Before the 0 boundary, data[i] >= 0 is always false!
• From 0 onward, data[i] >= 0 is always true!
• The branch becomes perfectly predictable, so sorted input runs far faster
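Not on the slides, but a common way to confirm that the branch is what matters: rewrite the sum without a conditional, so each iteration no longer depends on the data's order. A sketch using a sign mask (it relies on the byte being promoted to a 32-bit int):

  long sum = 0;
  for (int i = 0; i < data.length; i++) {
      int v = data[i];          // byte promoted to int; negative values have the sign bit set
      sum += v & ~(v >> 31);    // v >> 31 is -1 for negatives and 0 otherwise, so the mask
  }                             // keeps non-negative values and zeroes out the rest

With no branch to predict, sorted and unsorted inputs should run at essentially the same speed.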

Slide 37

QUESTIONS?
Thank you for listening
[email protected]
@tomerg
http://engineering.wix.com
Sources and Examples: https://goo.gl/f7NfGT
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Slide 38

Further Reading
• Jason Robert Carey Patterson – Modern Microprocessors, a 90-Minute Guide
• Igor Ostrovsky – Gallery of Processor Cache Effects
• Piyush Kumar – Cache Oblivious Algorithms