@leinweber
“but I didn’t change anything”
no deploys!
no database migrations!
no scaling!
Slide 15
Slide 15 text
@leinweber
“but I didn’t change anything”
https://upload.wikimedia.org/wikipedia/commons/0/09/Redherring.gif
Slide 16
Slide 16 text
@leinweber
“but I didn’t change anything”
more traffic?
change in access patterns?
one big user logged in?
Slide 17
Slide 17 text
@leinweber
run out of a resource
Slide 18
Slide 18 text
@leinweber
snowball
Slide 19
Slide 19 text
@leinweber
example
manageable user 1s query => 2x expensive
frequent, small queries 3ms => 12ms
Slide 20
Slide 20 text
@leinweber
assumptions
hardware
maintenance
app
Slide 21
Slide 21 text
@leinweber
assumptions
postgres should not crash
…with overcommit off and no containers
large extensions increase chance
Slide 22
Slide 22 text
@leinweber
if not postgres, then what
Slide 23
Slide 23 text
@leinweber
system resources
cpu
memory
disk
parallelism / backends
locks
Slide 24
Slide 24 text
@leinweber
cpu mem disk parallelism
cpu mem disk parallelism
Slide 25
Slide 25 text
@leinweber
cpu mem disk parallelism
credentials wrong
networking broken
locking issue, check pg_locks
idle in transaction
Slide 26
Slide 26 text
@leinweber
cpu mem disk parallelism
application submitting backlogged workload
connection leak
pool sizes set too large
pg_lock issue + application backlog
Slide 27
Slide 27 text
@leinweber
cpu mem disk parallelism
workload skew causing thrashing
unusual sequential scan workload
failover or restart => no cache
pg_prewarm
Slide 28
Slide 28 text
@leinweber
cpu mem disk parallelism
same as just disk,
but also the application is piling on
Slide 29
Slide 29 text
@leinweber
cpu mem disk parallelism
large GROUP BYs
high disk latency due to unusual page
dispersion pattern in the workload
Slide 30
Slide 30 text
@leinweber
cpu mem disk parallelism
workload has high mem (GROUP BY)
+ app adding backlog
lock contention slowing mem release
Slide 31
Slide 31 text
@leinweber
cpu mem disk parallelism
large GROUP BYs + paging in unusual data
Slide 32
Slide 32 text
@leinweber
cpu mem disk parallelism
Look for what is causing disk access
Slide 33
Slide 33 text
@leinweber
cpu mem disk parallelism
small, in-memory workload
lots of seq scans on small table
index scan w/ filter dropping lots
Slide 34
Slide 34 text
@leinweber
cpu mem disk parallelism
app backlog
+ too much processing on small data
simply a lot of work
Slide 35
Slide 35 text
@leinweber
cpu mem disk parallelism
large seq scans
Slide 36
Slide 36 text
@leinweber
cpu mem disk parallelism
loading cold data + application backlog
Slide 37
Slide 37 text
@leinweber
cpu mem disk parallelism
small # of backends doing a lot more work
Slide 38
Slide 38 text
@leinweber
cpu mem disk parallelism
entity, workload, entity*workload
soft deletes and non-conditional indexes
Slide 39
Slide 39 text
@leinweber
cpu mem disk parallelism
reporting query
Slide 40
Slide 40 text
@leinweber
cpu mem disk parallelism
app backlog, but with CPU/mem problems
Slide 41
Slide 41 text
@leinweber
tools of the trade
Slide 42
Slide 42 text
@leinweber
tools of the trade
C symbols
Slide 43
Slide 43 text
@leinweber
tools of the trade: perf
perf record -p && perf report
Slide 44
Slide 44 text
@leinweber
tools of the trade: perf
perf top
Slide 45
Slide 45 text
@leinweber
tools of the trade: perf
www.brendangregg.com/perf.html
Slide 46
Slide 46 text
@leinweber
tools of the trade: gdb
gdb -batch -ex 'bt' -p
Slide 47
Slide 47 text
@leinweber
Slide 48
Slide 48 text
@leinweber
Slide 49
Slide 49 text
@leinweber
tools of the trade: iostat
iostat -xm 10
Slide 50
Slide 50 text
@leinweber
tools of the trade: iotop
Slide 51
Slide 51 text
@leinweber
tools of the trade: htop
Slide 52
Slide 52 text
@leinweber
Tools of the trade: bwm-ng
Slide 53
Slide 53 text
@leinweber
tools of the trade: backends
pgrep -lf postgres + grep + wc
select * from pg_stat_activity
Slide 54
Slide 54 text
@leinweber
tools of the trade: pg_s_s
select * from pg_stat_statements
Slide 55
Slide 55 text
@leinweber
tools of the trade: summary
cpu mem disk parallelism network
perf x
gdb x
iostat x
iotop x
htop x x
bwm x
pgrep x
Slide 56
Slide 56 text
@leinweber
what to do
Slide 57
Slide 57 text
@leinweber
what to do
configuration change
Slide 58
Slide 58 text
@leinweber
what to do
db change
Slide 59
Slide 59 text
@leinweber
what to do
code change
Slide 60
Slide 60 text
@leinweber
flirting with disaster
Velocity NY 2013: Richard Cook
"Resilience In Complex Adaptive Systems”
Jens Rasmussen:
Risk management in a dynamic society: a
modeling problem
Slide 61
Slide 61 text
@leinweber
flirting with disaster
economic
boundary
Slide 62
Slide 62 text
@leinweber
flirting with disaster
economic
boundary
workload
boundary
Slide 63
Slide 63 text
@leinweber
flirting with disaster
economic
boundary
workload
boundary
performance
boundary
@leinweber
flirting with disaster
Velocity NY 2013: Richard Cook
"Resilience In Complex Adaptive Systems”
Jens Rasmussen:
Risk management in a dynamic society: a
modeling problem
Slide 69
Slide 69 text
@leinweber
thank you
Will Leinweber
@leinweber
citusdata.com