Slide 1
Slide 1 text
Development was the easy part
Slide 2
Slide 2 text
André Arko @indirect
Slide 3
Slide 3 text
No content
Slide 4
Slide 4 text
No content
Slide 5
Slide 5 text
Development is very different
Slide 6
Slide 6 text
from Production
Slide 7
Slide 7 text
you rn →
Slide 8
Slide 8 text
you later →
Slide 9
Slide 9 text
Metrics
Slide 10
Slide 10 text
Metrics are important
Slide 11
Slide 11 text
Metrics tell you what is happening
Slide 12
Slide 12 text
Metrics convince you you understand
Slide 13
Slide 13 text
Averages convince you you understand
Slide 14
Slide 14 text
Averages are lie-candy for your brain
Slide 15
Slide 15 text
Averages [chart: x-axis from -5 to 5, y-axis from 0 to 0.4]
Slide 16
Slide 16 text
Averages [chart: x-axis from -5 to 5, y-axis from 0 to 0.4]
Slide 17
Slide 17 text
No content
Slide 18
Slide 18 text
No content
Slide 19
Slide 19 text
just heard “we have a great average” →
Slide 20
Slide 20 text
Averages mask problems
Slide 21
Slide 21 text
[chart: x-axis from 0 to 10, y-axis from 0 to 250]
Slide 22
Slide 22 text
Graph the median
Slide 23
Slide 23 text
[chart: x-axis from 0 to 10, y-axis from 0 to 250]
Slide 24
Slide 24 text
Graph 95th percentile
Slide 25
Slide 25 text
[chart: x-axis from 0 to 10, y-axis from 0 to 250]
Slide 26
Slide 26 text
Graph 99th percentile
Slide 27
Slide 27 text
[chart: x-axis from 0 to 10, y-axis from 0 to 1000]
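A minimal sketch of why the median and the high percentiles tell different stories than the average. The response times are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, not any particular monitoring tool's implementation.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds:
# 94 fast requests, 4 slow ones, 2 terrible ones.
response_times_ms = [50] * 94 + [800] * 4 + [5000] * 2

mean_ms = statistics.mean(response_times_ms)
median_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)
p99_ms = percentile(response_times_ms, 99)

print(mean_ms, median_ms, p95_ms, p99_ms)
```

With this data the mean lands at 179 ms, a number that no request actually experienced, while the median (50 ms), the 95th percentile (800 ms) and the 99th percentile (5000 ms) each describe a real group of users.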
Slide 28
Slide 28 text
Aggregate graphs another kind of average
Slide 29
Slide 29 text
No content
Slide 30
Slide 30 text
Breakout graphs see individual variations
Slide 31
Slide 31 text
No content
Slide 32
Slide 32 text
No content
Slide 33
Slide 33 text
Aggregate alerts more dead servers than alive servers
Slide 34
Slide 34 text
site’s up if any servers are up!
Slide 35
Slide 35 text
Breakout alerts first dead server not all the servers
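The difference between the two alert styles, sketched under assumptions: the server names and the `health` dictionary are hypothetical, and a real setup would read health checks from a monitoring system rather than a literal.

```python
def aggregate_alert(health):
    """Fires only once more servers are dead than alive."""
    dead = sum(1 for up in health.values() if not up)
    alive = len(health) - dead
    return dead > alive


def breakout_alerts(health):
    """Names every dead server, so the first failure is already visible."""
    return sorted(name for name, up in health.items() if not up)


# Hypothetical fleet: one of four servers has died.
health = {"web1": True, "web2": False, "web3": True, "web4": True}

print(aggregate_alert(health))   # the aggregate alert stays quiet
print(breakout_alerts(health))   # the breakout alert names web2
```

The aggregate check says nothing until most of the fleet is gone. The breakout check reports the first dead server while the site is still up.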
Slide 36
Slide 36 text
Servers
Slide 37
Slide 37 text
Servers you have no idea what is going on
Slide 38
Slide 38 text
really.
Slide 39
Slide 39 text
Routing
Slide 40
Slide 40 text
Routing your app has this
Slide 41
Slide 41 text
Routing how does it work?
Slide 42
Slide 42 text
Development [diagram: You, App]
Slide 43
Slide 43 text
Production [diagram: People, a Router, and two Servers, each with its own Router and two Apps]
Slide 44
Slide 44 text
Routing how slow is it?
Slide 45
Slide 45 text
Routing does it back up?
Slide 46
Slide 46 text
Request time
Slide 47
Slide 47 text
Request time not the time you measure
Slide 48
Slide 48 text
Request time wall-clock time from real clients
Slide 49
Slide 49 text
Request time make requests from around the world
Slide 50
Slide 50 text
Request time graph them
Slide 51
Slide 51 text
Request time graph them alert on them
Slide 52
Slide 52 text
Request time graph them alert on them thank me later
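A minimal sketch of measuring request time the way a client sees it: wall-clock time around the whole request, taken from outside the app. `fetch_body` and `wall_clock_seconds` are hypothetical names, and the `fetch` parameter exists only so the timing can be tried without a network. A real probe would run from several locations and ship each number to whatever does the graphing and alerting.

```python
import time
import urllib.request


def fetch_body(url, timeout=10):
    """Download the full response body, as a real client would."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def wall_clock_seconds(url, fetch=fetch_body):
    """Time one request from the outside, including DNS, connect,
    routing, queueing, the app itself, and the download."""
    started = time.perf_counter()
    fetch(url)
    return time.perf_counter() - started


# Usage against a real site (the URL is a placeholder):
# print(wall_clock_seconds("https://example.com/"))
```

The number this produces is usually larger than the request time the app logs, because the app's own timer starts only after the request has already crossed the network and the routers.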
Slide 53
Slide 53 text
VM lag
Slide 54
Slide 54 text
VM lag do you have it?
Slide 55
Slide 55 text
VM lag do you check for it?
Slide 56
Slide 56 text
VM lag do you know how to check for it?
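One way to check, sketched under the assumption of a Linux guest: the aggregate `cpu` line in `/proc/stat` includes a steal counter, the time the hypervisor spent running something other than your VM. The helper names are hypothetical and the two sample lines are made up.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into tick counts."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_fraction(before_line, after_line):
    """Share of CPU time between two readings that the hypervisor took away."""
    before, after = cpu_times(before_line), cpu_times(after_line)
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already included in user and nice, so only sum the first eight.
    total = sum(deltas[:8])
    return deltas[7] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the current aggregate line (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


# Made-up readings taken some seconds apart.
before = "cpu 100 0 50 800 10 0 5 35 0 0"
after = "cpu 150 0 70 1515 20 0 10 235 0 0"
print(steal_fraction(before, after))
```

In the sample readings a fifth of the elapsed CPU time was stolen, which the app would experience as unexplained slowness.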
Slide 57
Slide 57 text
Runtime lag
Slide 58
Slide 58 text
Runtime lag how do you tell that you lost consciousness?
Slide 59
Slide 59 text
Runtime lag do you have it?
Slide 60
Slide 60 text
Runtime lag do you have it? you have it.
Slide 61
Slide 61 text
Runtime lag do you have it? you have it. how bad is it?
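A rough way to find out how bad it is, assuming nothing about the runtime: ask to sleep for a short, known interval and measure how late the wake-up arrives. `measure_lag` is a hypothetical helper. Any overshoot is time the process was not running when it expected to be, whether the cause was the garbage collector, the scheduler, or the VM underneath.

```python
import time


def measure_lag(interval=0.01, rounds=20):
    """Return the worst observed overshoot, in seconds, across several short sleeps."""
    worst = 0.0
    for _ in range(rounds):
        started = time.monotonic()
        time.sleep(interval)
        overshoot = time.monotonic() - started - interval
        worst = max(worst, overshoot)
    return worst


print(f"worst lag: {measure_lag() * 1000:.2f} ms")
```

Run in a background thread of the real app and graphed over time, the same loop shows pauses that never appear in per-request timings.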
Slide 62
Slide 62 text
Data stores
Slide 63
Slide 63 text
Data stores in production
Slide 64
Slide 64 text
Data stores in production are distributed
Slide 65
Slide 65 text
what does that mean?
Slide 66
Slide 66 text
your experience (so far) is wrong
Slide 67
Slide 67 text
Saving data
Slide 68
Slide 68 text
Saving data tries to save your data
Slide 69
Slide 69 text
Saving data might save your data
Slide 70
Slide 70 text
Replication
Slide 71
Slide 71 text
Replication is not data-saving magic
Slide 72
Slide 72 text
Replication tries to save your data…
Slide 73
Slide 73 text
Replication tries to save your data… repeatedly
Slide 74
Slide 74 text
Postgres
Slide 75
Slide 75 text
Postgres totally safe, right?
Slide 76
Slide 76 text
Postgres async replication
Slide 77
Slide 77 text
Postgres network failures lose “saved” data
Slide 78
Slide 78 text
Redis
Slide 79
Slide 79 text
Redis is single-threaded
Slide 80
Slide 80 text
Redis has no failover
Slide 81
Slide 81 text
Redis-sentinel elects a new leader
Slide 82
Slide 82 text
Redis-sentinel throws away non-winners’ writes
Slide 83
Slide 83 text
Mongo (gem < 1.8) returns before the first write
Slide 84
Slide 84 text
Mongo (gem < 1.8) your data is on zero disks so far
Slide 85
Slide 85 text
Mongo replica sets default to one write
Slide 86
Slide 86 text
Mongo demand N copies survive N-1 failures
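A toy model of the idea, not Mongo's API: a write counts as saved only when the required number of copies acknowledge it. `Replica` and `write` are hypothetical names invented for this sketch.

```python
class Replica:
    """A pretend storage node that may be unreachable."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.data = {}

    def store(self, key, value):
        if not self.reachable:
            return False
        self.data[key] = value
        return True


def write(replicas, key, value, required_acks):
    """Report success only if at least required_acks replicas stored the value."""
    acks = sum(1 for replica in replicas if replica.store(key, value))
    return acks >= required_acks


# Hypothetical three-node set with one node cut off by the network.
replicas = [Replica("a"), Replica("b", reachable=False), Replica("c")]

print(write(replicas, "user:1", "saved?", required_acks=1))
print(write(replicas, "user:1", "saved?", required_acks=2))
print(write(replicas, "user:1", "saved?", required_acks=3))
```

Requiring one acknowledgement succeeds even when a single node holds the only copy. Requiring two means the data survives losing any one node, and requiring three fails loudly here instead of pretending.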
Slide 87
Slide 87 text
trust no one
Slide 88
Slide 88 text
if you didn’t try it you are guessing
Slide 89
Slide 89 text
test it yourself
Slide 90
Slide 90 text
So, in the end what did we learn?
Slide 91
Slide 91 text
Production is fundamentally
Slide 92
Slide 92 text
Production is fundamentally systemically
Slide 93
Slide 93 text
Production is fundamentally systemically different
Slide 94
Slide 94 text
Failures will happen
Slide 95
Slide 95 text
Failures can be resisted
Slide 96
Slide 96 text
Failures should not result in one-off patches
Slide 97
Slide 97 text
Survival requires systematic trials & testing
Slide 98
Slide 98 text
Development is not like production