Slide 1

Slide 1 text

When it all Goes Wrong

Slide 2

Slide 2 text

@leinweber Will Leinweber @leinweber Citus Data (Microsoft) bitfission.com
 (warning autoplays midi)

Slide 3

Slide 3 text

@leinweber coming from citus cloud heroku postgres

Slide 4

Slide 4 text

@leinweber special thanks citus cloud
 — dan farina (@danfarina) heroku postgres
 — maciek sakrejda (@uhoh_itsmaciek)

Slide 5

Slide 5 text

@leinweber same sorts of problems from pages & alerts from support tickets

Slide 6

Slide 6 text

@leinweber this talk more app dev who uses postgres
 rather than dba

Slide 7

Slide 7 text

@leinweber the problem with Postgres it’s pretty good you don’t get experience with how it breaks

Slide 8

Slide 8 text

@leinweber what to do for a problem

Slide 9

Slide 9 text

@leinweber what to do for a problem

Slide 10

Slide 10 text

@leinweber complicated system network hardware o/s postgres

Slide 11

Slide 11 text

@leinweber using the database (too much) 95% application 4% auto vacuum 1% everything else

Slide 12

Slide 12 text

@leinweber hard to convince all the graphs saying DB is slow and nothing has changed …must be the database!

Slide 13

Slide 13 text

@leinweber https://upload.wikimedia.org/wikipedia/commons/9/98/Survivorship-bias.png

Slide 14

Slide 14 text

@leinweber “but I didn’t change anything” no deploys! no database migrations! no scaling!

Slide 15

Slide 15 text

@leinweber “but I didn’t change anything” https://upload.wikimedia.org/wikipedia/commons/0/09/Redherring.gif

Slide 16

Slide 16 text

@leinweber “but I didn’t change anything” more traffic? change in access patterns? one big user logged in?

Slide 17

Slide 17 text

@leinweber run out of a resource

Slide 18

Slide 18 text

@leinweber snowball

Slide 19

Slide 19 text

@leinweber example manageable user 1s query => 2x expensive frequent, small queries 3ms => 12ms

Slide 20

Slide 20 text

@leinweber assumptions hardware maintenance app

Slide 21

Slide 21 text

@leinweber assumptions postgres should not crash …with overcommit off and no containers large extensions increase chance

Slide 22

Slide 22 text

@leinweber if not postgres, then what

Slide 23

Slide 23 text

@leinweber system resources cpu memory disk parallelism / backends locks

Slide 24

Slide 24 text

@leinweber cpu mem disk parallelism cpu mem disk parallelism

Slide 25

Slide 25 text

@leinweber cpu mem disk parallelism credentials wrong networking broken locking issue, check pg_locks idle in transaction

Slide 26

Slide 26 text

@leinweber cpu mem disk parallelism application submitting backlogged workload connection leak pool sizes set too large pg_lock issue + application backlog

Slide 27

Slide 27 text

@leinweber cpu mem disk parallelism workload skew causing thrashing unusual sequential scan workload failover or restart => no cache pg_prewarm

Slide 28

Slide 28 text

@leinweber cpu mem disk parallelism same as just disk, but also the application is piling on

Slide 29

Slide 29 text

@leinweber cpu mem disk parallelism large GROUP BYs high disk latency due to unusual page dispersion pattern in the workload

Slide 30

Slide 30 text

@leinweber cpu mem disk parallelism workload has high mem (GROUP BY)
 + app adding backlog lock contention slowing mem release

Slide 31

Slide 31 text

@leinweber cpu mem disk parallelism large GROUP BYs + paging in unusual data

Slide 32

Slide 32 text

@leinweber cpu mem disk parallelism Look for what is causing disk access

Slide 33

Slide 33 text

@leinweber cpu mem disk parallelism small, in-memory workload lots of seq scans on small table index scan w/ filter dropping lots

Slide 34

Slide 34 text

@leinweber cpu mem disk parallelism app backlog 
 + too much processing on small data simply a lot of work

Slide 35

Slide 35 text

@leinweber cpu mem disk parallelism large seq scans

Slide 36

Slide 36 text

@leinweber cpu mem disk parallelism loading cold data + application backlog

Slide 37

Slide 37 text

@leinweber cpu mem disk parallelism small # of backends doing a lot more work

Slide 38

Slide 38 text

@leinweber cpu mem disk parallelism entity, workload, entity*workload soft deletes and non-conditional indexes

Slide 39

Slide 39 text

@leinweber cpu mem disk parallelism reporting query

Slide 40

Slide 40 text

@leinweber cpu mem disk parallelism app backlog, but with CPU/mem problems

Slide 41

Slide 41 text

@leinweber tools of the trade

Slide 42

Slide 42 text

@leinweber tools of the trade C symbols

Slide 43

Slide 43 text

@leinweber tools of the trade: perf perf record -p && perf report

Slide 44

Slide 44 text

@leinweber tools of the trade: perf perf top

Slide 45

Slide 45 text

@leinweber tools of the trade: perf www.brendangregg.com/perf.html

Slide 46

Slide 46 text

@leinweber tools of the trade: gdb gdb -batch -ex 'bt' -p

Slide 47

Slide 47 text

@leinweber

Slide 48

Slide 48 text

@leinweber

Slide 49

Slide 49 text

@leinweber tools of the trade: iostat iostat -xm 10

Slide 50

Slide 50 text

@leinweber tools of the trade: iotop

Slide 51

Slide 51 text

@leinweber tools of the trade: htop

Slide 52

Slide 52 text

@leinweber Tools of the trade: bwm-ng

Slide 53

Slide 53 text

@leinweber tools of the trade: backends pgrep -lf postgres + grep + wc select * from pg_stat_activity

Slide 54

Slide 54 text

@leinweber tools of the trade: pg_s_s select * from pg_stat_statements

Slide 55

Slide 55 text

@leinweber tools of the trade: summary cpu mem disk parallelism network perf x gdb x iostat x iotop x htop x x bwm x pgrep x

Slide 56

Slide 56 text

@leinweber what to do

Slide 57

Slide 57 text

@leinweber what to do configuration change

Slide 58

Slide 58 text

@leinweber what to do db change

Slide 59

Slide 59 text

@leinweber what to do code change

Slide 60

Slide 60 text

@leinweber flirting with disaster Velocity NY 2013: Richard Cook
 "Resilience In Complex Adaptive Systems” Jens Rasmussen:
 Risk management in a dynamic society: a modeling problem

Slide 61

Slide 61 text

@leinweber flirting with disaster economic boundary

Slide 62

Slide 62 text

@leinweber flirting with disaster economic boundary workload boundary

Slide 63

Slide 63 text

@leinweber flirting with disaster economic boundary workload boundary performance boundary

Slide 64

Slide 64 text

@leinweber flirting with disaster economic boundary workload boundary performance boundary error margin

Slide 65

Slide 65 text

@leinweber flirting with disaster economic boundary workload boundary performance boundary

Slide 66

Slide 66 text

@leinweber flirting with disaster economic boundary workload boundary performance boundary error margin

Slide 67

Slide 67 text

@leinweber flirting with disaster economic boundary workload boundary performance boundary error margin

Slide 68

Slide 68 text

@leinweber flirting with disaster Velocity NY 2013: Richard Cook
 "Resilience In Complex Adaptive Systems” Jens Rasmussen: 
 Risk management in a dynamic society: a modeling problem

Slide 69

Slide 69 text

@leinweber thank you Will Leinweber @leinweber citusdata.com