
How shit works: the CPU

A talk given at BuildStuff 2016 in Vilnius, Lithuania.

The beautiful thing about software engineering is that it gives you the warm and fuzzy illusion of total understanding: I control this machine because I know how it operates. This is the result of layers upon layers of successful abstractions, which hide immense sophistication and complexity. As with any abstraction, though, these sometimes leak, and that's when a good grounding in what's under the hood pays off.

The second talk in this series peels back a few layers of abstraction and takes a look under the hood of our "car engine", the CPU. While hardly anyone codes in assembly language anymore, your C# or JavaScript (or Scala or...) application still ends up executing machine code instructions on a processor; that is why Java has a memory model, why memory layout still matters at scale, and why you're usually free to ignore these considerations and go about your merry way.

You'll come away knowing a little bit about a lot of different moving parts under the hood; after all, isn't understanding how the machine operates what this is all about?

Tomer Gabel

November 16, 2016

Transcript

  1. Full Disclosure
     Bullshit ahead!
     • I'm not an expert
     • Explanations may be:
       – Simplified
       – Inaccurate
       – Wrong :-)
     • We'll barely scratch the surface
     Image: Public Domain
  2. Setting the Stage
     // Generate a bunch of bytes
     byte[] data = new byte[32768];
     new Random().nextBytes(data);
     Arrays.sort(data);

     // Sum positive elements
     long sum = 0;
     for (int i = 0; i < data.length; i++)
       if (data[i] >= 0)
         sum += data[i];

     1. Which is faster?
     2. By how much?
     3. And crucially… why?!
  3. Surprise, Terror and Ruthless Efficiency
     # Run complete. Total time: 00:00:32
     Benchmark      Mode  Cnt    Score   Error  Units
     Baseline.sum   avgt    6  115.666 ± 3.137  us/op
     Presorted.sum  avgt    6   13.741 ± 0.524  us/op
     * Ignoring setup cost
  4. It Is Known
     • Your high-level code…
     long sum = 0;
     for (i = 0; i < length; i++)
       if (data[i] >= 0)
         sum += data[i];
     • Gets compiled down to…
     movsx  eax,BYTE PTR [rax+rdx*1+0x10]
     cmp    eax,0x0
     movabs rdx,0x11f3a9f60
     movabs rcx,0x128
     jl     0x000000010679e077
     movabs rcx,0x138
     mov    r8,QWORD PTR [rdx+rcx*1]
     lea    r8,[r8+0x1]
     mov    QWORD PTR [rdx+rcx*1],r8
     jl     0x000000010679e092
     movsxd rax,eax
     add    rax,rbx
     mov    rbx,rax
     inc    edi
  5. It Is Less Known
     • What happens then?
     • The instruction goes through phases…
     Instruction Stream: Fetch → Decode → Execute → Memory Access → Write-back
  6–10. CPU Architecture 101 (progressive build across five slides)
     • What does a CPU do?
       – Reads the program
       – Figures it out
       – Executes it
       – Talks to memory
       – Performs I/O
     • Immense complexity!
  11. Execution Units
     • Arithmetic-Logic Unit (ALU)
       – Boolean algebra
       – Arithmetic
       – Memory accesses
       – Flow control
     • Floating Point Unit (FPU)
     • Memory Management Unit (MMU)
       – Memory mapping
       – Paging
       – Access control
     Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source
  12. Pipelining: Sequential Execution
     [Diagram: instructions I0–I2 each pass through Fetch → Decode → Execute → Memory Access → Write-back, one after another]
     Latency = 5 cycles
     Throughput = 0.2 ops / cycle
  13. Pipelining: Sequential vs. Pipelined Execution
     [Diagram: in the pipelined case the stages of I0–I2 overlap, each instruction starting one cycle after the previous]
     Sequential: Latency = 5 cycles, Throughput = 0.2 ops / cycle
     Pipelined:  Latency = 5 cycles, Throughput = 1 op / cycle
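The throughput figures on the pipelining slides follow from simple arithmetic: with S stages, N instructions take S × N cycles sequentially, but only S + (N − 1) cycles pipelined, since one instruction retires per cycle once the pipeline is full. A minimal sketch of that calculation (the instruction count of 1000 is illustrative):

```java
public class PipelineMath {
    public static void main(String[] args) {
        int stages = 5;          // Fetch, Decode, Execute, Memory Access, Write-back
        int instructions = 1000;

        // Sequential: each instruction waits for the previous one to finish.
        long sequentialCycles = (long) stages * instructions;   // 5000

        // Pipelined: first instruction takes 5 cycles, then one retires per cycle.
        long pipelinedCycles = stages + (instructions - 1);     // 1004

        System.out.printf("sequential: %d cycles (%.2f ops/cycle)%n",
                sequentialCycles, (double) instructions / sequentialCycles);
        System.out.printf("pipelined:  %d cycles (%.2f ops/cycle)%n",
                pipelinedCycles, (double) instructions / pipelinedCycles);
    }
}
```

As N grows, N / (S + N − 1) approaches 1 op/cycle, which is the ideal throughput the slide quotes.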
  14. Pipelining
     • A pipeline can stall
     • This happens with:
       – Branches
     if (i < 0) i++; else i--;
     [Diagram: Memory Load (F D E M W), Test (F D E M), Conditional Jump (F D E ? ?) – the jump cannot proceed until the test resolves]
  15. Pipelining
     • A pipeline can stall
     • This happens with:
       – Branches
       – Dependent instructions
     • A.K.A. pipeline bubbling
     i++;        // Load from memory, Add +1, Store in memory
     x = i + 1;  // Stall
     [Diagram: the second instruction's stages (F D …) stall behind the first instruction's memory write]
  16. 1. Memory is Slow
     • RAM access is ~60ns
     • Random access on a 4GHz, 64-bit CPU:
       – 250 cycles / memory access
       – 130MB / second bandwidth
     • Surely we can do better!
     Image: Noah Wieder (Public Domain). Source: 7-cpu.com
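The 250-cycle and 130MB/s figures fall straight out of the latency number. A back-of-the-envelope check, using the ~62ns main-memory latency from the next slide and assuming one 64-bit word per access:

```java
public class MemoryMath {
    public static void main(String[] args) {
        double latencyNs = 62.0;  // main-memory access latency (see the cache table)
        double clockGhz  = 4.0;   // CPU clock

        // Each dependent random access stalls the core for latency * clock cycles.
        double cyclesPerAccess = latencyNs * clockGhz;        // ~248 cycles

        // One 8-byte word per access caps effective random-access bandwidth.
        double mbPerSecond = 8.0 / (latencyNs * 1e-9) / 1e6;  // ~129 MB/s

        System.out.printf("%.0f cycles/access, %.0f MB/s%n",
                cyclesPerAccess, mbPerSecond);
    }
}
```

Both results round to the ~250 cycles and ~130MB/s the slide quotes, which is why pointer-chasing workloads are so much slower than their instruction counts suggest.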
  17. Enter: CPU Cache
     Level        Size         Latency
     L1           32KB + 32KB  1ns
     L2           256KB        3ns
     L3           4MB          11ns
     Main Memory               62ns
     Intel i7-6700 "Skylake" at 4 GHz
     Image: Ferry24.Milan (CC BY-SA 3.0). Source: 7-cpu.com
  18. Enter: CPU Cache
     • A unit of work is called a cache line
       – 64 bytes on x86
       – LRU eviction policy
     • Why is sequential access fast?
       – Cache prefetching
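Prefetching rewards predictable, sequential access. A quick way to see the effect is to sum the same 2D array twice, once in memory order and once with a large stride (the array size is arbitrary; actual timings vary by machine, so the only invariant is that both traversals produce the same sum):

```java
public class StrideDemo {
    static final int N = 4096;                 // 4096 x 4096 ints = 64 MB
    static final int[] grid = new int[N * N];

    // Sequential: walks memory in cache-line order; the prefetcher keeps up.
    static long sumRowMajor() {
        long sum = 0;
        for (int y = 0; y < N; y++)
            for (int x = 0; x < N; x++)
                sum += grid[y * N + x];
        return sum;
    }

    // Strided: jumps 16KB between reads, touching a new cache line every time.
    static long sumColMajor() {
        long sum = 0;
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
                sum += grid[y * N + x];
        return sum;
    }

    public static void main(String[] args) {
        java.util.Arrays.setAll(grid, i -> i % 251);
        long t0 = System.nanoTime(); long a = sumRowMajor();
        long t1 = System.nanoTime(); long b = sumColMajor();
        long t2 = System.nanoTime();
        System.out.printf("row-major %d ms, col-major %d ms, sums equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

On typical hardware the strided traversal is several times slower even though it executes exactly the same number of additions.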
  19. In Real Life
     • Let's rotate an image!
     for (y = 0; y < height; y++)
       for (x = 0; x < width; x++) {
         int from = y * width + x;
         int to   = x * height + y;
         target[to] = source[from];
       }
     Image: EgoAltere (CC0 Public Domain)
  20–22. In Real Life (progressive build across three slides)
     • This is not efficient
     • Reads are sequential
     • Writes aren't, though
     • Different strides
       – Worst case wins :-(
     [Diagram: source row 0 1 2 … 9 is read in order, but its elements land at target offsets 0, 10, 20, … 90]
  23. Cache-Friendly Algorithms
     • Use blocking or tiling
     for (y = 0; y < height; y += blockHeight)
       for (x = 0; x < width; x += blockWidth)
         for (by = 0; by < blockHeight; by++)
           for (bx = 0; bx < blockWidth; bx++) {
             int from = (y + by) * width + (x + bx);
             int to   = (x + bx) * height + (y + by);
             target[to] = source[from];
           }
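The tiled loop keeps both reads and writes inside a block small enough to stay cached. A self-contained sketch that checks the tiled transpose against the naive one (dimensions and the 16-element block are illustrative, and chosen here as multiples of the block size to keep the sketch short):

```java
public class TiledTranspose {
    // Naive transpose: sequential reads, strided writes.
    static void naive(byte[] src, byte[] dst, int width, int height) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                dst[x * height + y] = src[y * width + x];
    }

    // Tiled transpose: works block-by-block so both sides stay cache-resident.
    static void tiled(byte[] src, byte[] dst, int width, int height, int block) {
        for (int y = 0; y < height; y += block)
            for (int x = 0; x < width; x += block)
                for (int by = 0; by < block; by++)
                    for (int bx = 0; bx < block; bx++)
                        dst[(x + bx) * height + (y + by)] =
                            src[(y + by) * width + (x + bx)];
    }

    public static void main(String[] args) {
        int width = 1024, height = 768;
        byte[] src = new byte[width * height];
        new java.util.Random(42).nextBytes(src);
        byte[] a = new byte[src.length], b = new byte[src.length];
        naive(src, a, width, height);
        tiled(src, b, width, height, 16);
        System.out.println(java.util.Arrays.equals(a, b)); // prints "true"
    }
}
```

A production version would also handle dimensions that are not multiples of the block size, typically with edge loops for the remainder.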
  24. Cache-Friendly Algorithms
     • The results?
     Benchmark                            Mode  Cnt   Score   Error  Units
     CachingShowcase.transposeNaive       avgt   10  43.851 ± 6.000  ms/op
     CachingShowcase.transposeTiled8x8    avgt   10  20.641 ± 1.646  ms/op
     CachingShowcase.transposeTiled16x16  avgt   10  18.515 ± 1.833  ms/op
     CachingShowcase.transposeTiled48x48  avgt   10  21.941 ± 1.954  ms/op
     x2.37 speedup!
  25. 2. Those Pesky Branches
     • Do I go left or right?
     • Need input!
     • … but can't wait for it
     • Maybe...
       – Take a guess?
       – Based on historic trends?
     • Sounds speculative
     Image: Michael Dolan (CC BY 2.0)
  26. Those Pesky Branches
     • Enter: Branch Prediction
     • Concurrently:
       – Speculate branch
       – Evaluate condition
     • It's now a tradeoff
       – Commit is fast
       – Rollback is slow
     Image: Alejandro C. (CC BY-NC 2.0)
  27. Back to Our Conundrum
     • Can you guess?
       – 3… – 2... – 1...
     • Here it is!
     // Generate a bunch of bytes
     byte[] data = new byte[32768];
     new Random().nextBytes(data);
     Arrays.sort(data);

     // Sum positive elements
     long sum = 0;
     for (int i = 0; i < data.length; i++)
       if (data[i] >= 0)
         sum += data[i];
  28. Catharsis
     Original data array:
     54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40
  29. Catharsis
     After sorting:
     -61 -37 -9 -4 -2 0 10 13 14 15 25 40 41 54
     • Before the 0: data[i] >= 0 is always false!
     • From the 0 onward: data[i] >= 0 is always true!
     The predictor learns the pattern, and the branch becomes nearly free.
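Sorting makes the branch predictable, but another option (not covered in the talk) is to remove the branch entirely with a bit trick: for a 32-bit int `v`, `v >> 31` is all ones when `v` is negative and zero otherwise, so masking with its complement zeroes out negative elements without any conditional jump. A hedged sketch:

```java
public class BranchlessSum {
    public static void main(String[] args) {
        byte[] data = new byte[32768];
        new java.util.Random().nextBytes(data);  // deliberately unsorted

        // Branchy version, as in the talk.
        long branchy = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                branchy += data[i];

        // Branch-free: mask out negative values instead of testing them.
        long branchless = 0;
        for (int i = 0; i < data.length; i++) {
            int v = data[i];                // byte sign-extends to int
            branchless += v & ~(v >> 31);   // v if v >= 0, else 0
        }

        System.out.println(branchy == branchless); // prints "true"
    }
}
```

On random data this sidesteps misprediction altogether, which is why JIT compilers sometimes emit conditional-move or masking code for exactly this pattern.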
  30. QUESTIONS?
     Thank you for listening
     [email protected] • @tomerg • http://engineering.wix.com
     Sources and Examples: https://goo.gl/f7NfGT
     This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
  31. Further Reading
     • Jason Robert Carey Patterson – Modern Microprocessors, a 90-Minute Guide
     • Igor Ostrovsky – Gallery of Processor Cache Effects
     • Piyush Kumar – Cache Oblivious Algorithms