Slide 1

Slide 1 text

Writing Software for Modern Computers – My Varnish experience Poul-Henning Kamp [email protected] [email protected] @bsdphk

Slide 2

Slide 2 text

$ who am i 31 of 47 years in computing Major FreeBSD kernel hacker Author of Varnish + Tons of other stuff Writes column for ACM Queue The man behind ”bikeshed.org”

Slide 3

Slide 3 text

Outline: * What is Varnish anyway ? * The road to Multicore * Algorithm vs. Architecture * Conclusion, such as it may be. * (That other dimension)

Slide 4

Slide 4 text

So what is Varnish anyway ? An HTTP/web frontend/afterburner * Fast (>1 Mreq/s) * Flexible * Modular * Deployable See: varnish-cache.org

Slide 5

Slide 5 text

GeoNet.org.nz

Slide 6

Slide 6 text

GeoNet.org.nz x 20,000 (chart: offered load vs. handled load)

Slide 7

Slide 7 text

GeoNet.org.nz x 20,000 (chart: offered load vs. handled load, with a Varnish cache in front)

Slide 8

Slide 8 text

Content creation (the CMS): * Complex composition process * Many sources of data * Many kinds of data * Layout rules * Policy rules * Taste * Design Content delivery (Varnish): * Just make copies * High speed * High volume

Slide 9

Slide 9 text

Varnish-Cache coordinates: http://varnish.org Runs on any contemporary UNIX, i.e. FreeBSD, Linux, OS X, Solaris We don't do Windows Free Open Source Software (FOSS) ← What you get What you pay →

Slide 10

Slide 10 text

385 Mbit/sec 64 Mbit/sec +600%

Slide 11

Slide 11 text

3% CPU 6% CPU +100%

Slide 12

Slide 12 text

CPU STORAGE I/O

Slide 13

Slide 13 text

The Manchester ”Baby” (Working replica)

Slide 14

Slide 14 text

21 June 1948: The world's first computer program runs on the world's first von Neumann architecture computer. 32 bits, 32 words, 7 instructions

Slide 15

Slide 15 text

First computers Abstract Environment Subroutines ↓ Monitors ↓ Languages ↓ Kernels ↓ Multitasking ↓ Middleware Slower

Slide 16

Slide 16 text

First computers Faster Physics Transistors ↓ Chips ↓ SSI ↓ MSI ↓ LSI ↓ VLSI Abstract Environment Subroutines ↓ Monitors ↓ Languages ↓ Kernels ↓ Multitasking ↓ Middleware Slower Faster

Slide 17

Slide 17 text

First computers Faster Physics Transistors ↓ Chips ↓ SSI ↓ MSI ↓ LSI ↓ VLSI Smarter Hardware Index reg. ↓ Microcode ↓ DMA ↓ Stacks ↓ Caches ↓ Msg.Pass Abstract Environment Subroutines ↓ Monitors ↓ Languages ↓ Kernels ↓ Multitasking ↓ Middleware Faster Complex Slower Faster

Slide 18

Slide 18 text

First computers Faster Physics Transistors ↓ Chips ↓ SSI ↓ MSI ↓ LSI ↓ VLSI Smarter Hardware Index reg. ↓ Microcode ↓ DMA ↓ Stacks ↓ Caches ↓ Msg.Pass Abstract Environment Subroutines ↓ Monitors ↓ Languages ↓ Kernels ↓ Multitasking ↓ Middleware Faster Complex Slower Faster

Slide 19

Slide 19 text

First computers Faster Physics Transistors ↓ Chips ↓ SSI ↓ MSI ↓ LSI ↓ VLSI Smarter Hardware Index reg. ↓ Microcode ↓ DMA ↓ Stacks ↓ Caches ↓ Msg.Pass ↓ Parallel Abstract Environment Subroutines ↓ Monitors ↓ Languages ↓ Kernels ↓ Multitasking ↓ Middleware Faster Complex Slower Faster

Slide 20

Slide 20 text

CPU STORAGE I/O 4 → 8 → 16 → 32 → 64 bits CRT → Hg → Core → SRAM → DRAM Paper → Magtape → Disk → Flash 100 kHz → 4 MHz → 33 MHz → 4 GHz

Slide 21

Slide 21 text

CPU STORAGE I/O 4 → 8 → 16 → 32 → 64 bits CRT → Hg → Core → SRAM → DRAM Paper → Magtape → Disk → Flash 100 kHz → 4 MHz → 33 MHz → 4 GHz CPU CPU CPU

Slide 22

Slide 22 text

(diagram: four CPU cores, each with Int/FP/? execution units, L1-I and L1-D caches, VM translation and a private L2; a shared L3; common RAM)

Slide 23

Slide 23 text

Execution Units CPUs, HTT, MMX etc. – Private Caches High speed, isotropic – Shared Caches Layers, gradually slower, anisotropic – Persistent Objects ”Disk” Very high latency – Computer Science to the rescue: ”Cache Oblivious Algorithms”

Slide 24

Slide 24 text

From: The case for reconfigurable I/O channels Unpublished workshop paper for RESoLVE12 - March 3, 2012 - London, UK Steven Smith, Anil Madhavapeddy, Christopher Smowton, Malte Schwarzkopf, Richard Mortier, Robert M. Watson, Steven Hand

Slide 25

Slide 25 text

Unlimited* Virtually FREE memory Prototype in Atlas: 1962 General release: 1980 * Max 4 GB per process, today 16 EB (16M TB)

Slide 26

Slide 26 text

Execution Units CPUs, HTT, MMX etc. – Private Caches High speed, isotropic – Shared Caches Layers, gradually slower, anisotropic – Persistent Objects Very high latency – VM mapping Screw things up, slow things down –

Slide 27

Slide 27 text

I know what you're thinking, proc: 'Did he send SIGTERM or SIGKILL ?' Well, to tell you the truth, in all this excitement I kind of lost track myself... But being as this is SIGKILL, the most powerful signal in UNIX, and would blow your address space clean off, you've got to ask yourself one question: 'Do I feel lucky?' Well, do ya, proc? Go Ahead, Fight my Kernel...

Slide 28

Slide 28 text

Performance Programming ● Make sure you know what the job really is ● Exploit the envelope of your requirements ● Avoid unnecessary work ● Use few cheap operations ● Use even fewer expensive operations ● Don't fight the kernel ● Use the smart kernel-features we gave you

Slide 29

Slide 29 text

Performance Price List (from ~10⁻⁹ s to ~10⁻¹ s) ● CPU: char *p += 5; ● Memory: strlen(p); memcpy(p, q, l); ● Protection: Locking, System call, Context switch ● Mechanical: Disk access, Filesystem

Slide 30

Slide 30 text

Performance Price List (same scale, humanized: if the cheapest operation takes 1 sec, the most expensive takes 1 week) ● CPU: char *p += 5; ● Memory: strlen(p); memcpy(p, q, l); ● Protection: Locking, System call, Context switch ● Mechanical: Disk access, Filesystem

Slide 31

Slide 31 text

Logging to a ”classic” file: FILE *flog; flog = fopen("/var/log/mylog", "a"); [...] fprintf(flog, "%s Something went wrong with %s\n", timestamp(), repr(object)); fflush(flog); fsync(fileno(flog)); Filesystem operation: 1 * 10 msec; Disk I/O: 1,000,000 * 1 msec = 16 minutes

Slide 32

Slide 32 text

Logging to a VM mapped file: fd = open(...); log_p = mmap(..., size); log_e = log_p + size; [...] log_p[1] = LOG_ERROR; log_p[2] = sprintf(log_p + 3, "%s Something went wrong with %s", timestamp(), repr(obj)); log_p[3 + log_p[2]] = LOG_END; log_p[0] = LOG_ENTRY; log_p += 3 + log_p[2]; Filesystem operations: 2 * 10 msec; Memory operations: 1,000,000 * 1 µsec = 1.02 second

Slide 33

Slide 33 text

Memory Strategy How long is an HTTP header ? – 80 chars ? 256 chars ? 2048 chars ? Classic strategy: – Allocate small, extend with realloc(3) Growing with realloc(3) is expensive – Needs a memory copy 99% of the time.

Slide 34

Slide 34 text

Memory Strategy Allocate enough memory for 99.9% of the cases Unused & untouched memory is free – ”address space” ≠ ”memory” Long lived data ? - Consider trim back with realloc(3) - Consider access patterns

Slide 35

Slide 35 text

Memory Strategy It's just a linked list … Body Head Typ: 8k Typ: > 128k

Slide 36

Slide 36 text

Memory Strategy Beware the sideways reference… Body Head Typ: 8k Typ: > 128k

Slide 37

Slide 37 text

Memory Strategy Segregate non-contextual references Body Head Typ: 8k Typ: > 128k 64b

Slide 38

Slide 38 text

Memory Strategy And fit a lot of them, in a single VM page Body Head Typ: 8k Typ: > 128k 64b

Slide 39

Slide 39 text

Thread Strategy Keep threads away from each other Classical blunder: Using malloc(3) – malloc(3) manipulates global state → Locking Give each thread a local ”workspace” – Reset when the request is done, ready for the next request

Slide 40

Slide 40 text

Thread Strategy Pooling to avoid thundering herd 1 socket + 10000 threads = Lock contention 1 socket + 1 listener + 9999 threads = bottleneck 1 socket + 4 * (1 listener + 2499 threads) = zoom! → For '4' read ”Number of NUMA domains” ?

Slide 41

Slide 41 text

Thread scheduling You have N threads waiting for work – Which one do you pick ?

Slide 42

Slide 42 text

Thread scheduling You have N threads waiting for work – Which one do you pick ? FIFO = Fairness, all threads get the same load

Slide 43

Slide 43 text

How stupid is FIFO ? You get the thread which … – Has been doing nothing for the longest time – Has nothing in L1 cache – Has nothing in L2 cache – Has nothing in L3 cache – May not even be in RAM at all = The guaranteed slowest thread you can pick

Slide 44

Slide 44 text

LIFO for the win! Schedule everything LIFO order – Threads, memory, buffers, sockets … – Maximizes chance something is in some cache Even better: LIFO on this NUMA domain

Slide 45

Slide 45 text

MPP lessons learned Architecture is more important than algorithms → Algorithm: Do something faster → Architecture: Do fewer slow things Virtual Memory is expensive to ignore → Most O(foo) estimates are invalid with VM. → ”Just add more RAM” does not help on MPP

Slide 46

Slide 46 text

Squid 12 servers Varnish 3 servers

Slide 47

Slide 47 text

var·nish (värʹnĭsh) n. 1. a. A paint containing [...] tr.v. var·nished, var·nish·ing, var·nish·es 1. To cover with varnish. 2. To give a smooth and glossy finish to. 3. To give a deceptively attractive appearance to; gloss over.

Slide 48

Slide 48 text

Varnish Architecture (diagram): one binary program. Cluster Controller (CmdLine, Web-interface, CMS-interface, SMS-interface). Manager process: CmdLine, CLI-interface, ChildProcMgt, Initialization, Params/Args, Watchdog, VCL compiler → C-compiler → Shared object. Cacher process: CmdLine, Storage, Log/Stats, Accept/herder, Worker threads, Grim Reaper, Backend, Hashing. Shared Memory: logwriter, stats, ad-hoc.

Slide 49

Slide 49 text

ESO/ELT/WFRTC

Slide 50

Slide 50 text

ESO/ELT/WFRTC

Slide 51

Slide 51 text

10,008 floats → ”Registration” 10,008 x 10,008, sparse: <120,096 entries, dynamic ~1 Hz → 10,008 floats → ”Projection” 10,008 x 6,350, fully populated, static → 6,350 floats → 10th-order filter → 6,350 floats Registration: 240,000 FLOP Projection: 127,000,000 FLOP Filtering: 273,000 FLOP Total: 127,513,000 FLOP

Slide 52

Slide 52 text

10,008 floats → ”Registration” 10,008 x 10,008, sparse: <120,096 entries, dynamic ~1 Hz → 10,008 floats → ”Projection” 10,008 x 6,350, fully populated, static → 6,350 floats → 10th-order filter → 6,350 floats Registration: 240,000 FLOP Projection: 127,000,000 FLOP Filtering: 273,000 FLOP Total: 127,513,000 FLOP Deadline: 1 msec ± 20 µs 125 MFLOP @ 500 µs ≈ 250 GFLOP/s

Slide 53

Slide 53 text

What can a movie-theater do over a TV ?

Slide 54

Slide 54 text

What can a movie-theater do over a TV ? Today movie theaters are big TV's.

Slide 55

Slide 55 text

What can a movie-theater do over a TV ? Today movie theaters are big TV's. But imagine you gave them a compute cluster...

Slide 56

Slide 56 text

What can a movie-theater do over a TV ? Today movie theaters are big TV's. But imagine you gave them a compute cluster... Randomized battle-scenes ? Audience cameos ? CGI approaching theaters: ”unique each evening” ?