Slide 1

Slide 1 text

How Postmortems can close the loop on IT metrics @jasonhand

Slide 2

Slide 2 text

Jason Hand, VictorOps DevOps evangelist, @jasonhand

Slide 3

Slide 3 text

automate, measure, share, learn

Slide 4

Slide 4 text

automate

Slide 5

Slide 5 text

measure

Slide 6

Slide 6 text

share

Slide 7

Slide 7 text

learn

Slide 8

Slide 8 text

increase feedback: shortening feedback loops leads to...

Slide 9

Slide 9 text

continuous... integration (of code), deployment (of software/product), and ...

Slide 10

Slide 10 text

improvement

Slide 11

Slide 11 text

OODA loop: Observe, Orient, Decide, Act (by John Boyd)

Slide 12

Slide 12 text

why we measure

Slide 13

Slide 13 text

observe & orient

Slide 14

Slide 14 text

which leads to deciding & acting

Slide 15

Slide 15 text

What metrics should we be looking at? CPU, memory, network, and disk metrics (duh)

Slide 16

Slide 16 text

but wait! there's more

Slide 17

Slide 17 text

look between the spaces

Slide 18

Slide 18 text

understand what we're building

Slide 19

Slide 19 text

between the spaces, look for metrics to help:
- weed out edge-case scenarios
- confirm assumptions

Slide 20

Slide 20 text

should we?

Slide 21

Slide 21 text

actually... more like...

Slide 22

Slide 22 text

collect all the things

Slide 23

Slide 23 text

"We don't know what the context is and we don't know what may be interesting to us in the future. But if something goes down and we don't have metrics for it, we have no perspective... and that's probably the worst-case scenario."
- Jason Dixon (librato.com & monitorama), The Ship Show (podcast), Episode 56

Slide 24

Slide 24 text

Fact: measuring and looking at data all the time isn't that helpful. But: when you need it to understand a problem, you'll want it then.

Slide 25

Slide 25 text

during an incident, leverage metrics to observe & orient, to help you decide & take action

Slide 26

Slide 26 text

all to help:
- triage
- investigate
- identify what's happening.

Slide 27

Slide 27 text

why do we do postmortems?

Slide 28

Slide 28 text

Learn... which leads to decisions and action

Slide 29

Slide 29 text

get your story straight: understand the story of what took place following an incident.

Slide 30

Slide 30 text

accountability & empathy: By sharing an accurate account and attempting to understand and empathize around exactly what took place, teams can learn from that incident and improve their processes.

Slide 31

Slide 31 text

how do we do postmortems?

Slide 32

Slide 32 text

capture everything, and in one place

Slide 33

Slide 33 text

VictorOps timeline

Slide 34

Slide 34 text

chat: Slack, HipChat

Slide 35

Slide 35 text

blameless: remove blame and go after the facts

Slide 36

Slide 36 text

The idea is to learn

Slide 37

Slide 37 text

punish

Slide 38

Slide 38 text

remediation items
"Learning from a postmortem is only as useful as what you put into practice afterwards and we realized that without any action items after the meeting, it was more or less just a Greek Senate debate"
- Ben VanEvery (box.com)

Slide 39

Slide 39 text

assign ownership: it's not about sharing feelings and theories without accomplishing anything

Slide 40

Slide 40 text

tie it with a bow: Now you have a very accurate story

Slide 41

Slide 41 text

a story telling us ...

Slide 42

Slide 42 text

monitoring data

Slide 43

Slide 43 text

who was alerted

Slide 44

Slide 44 text

how quickly they responded
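How quickly responders reacted can be read off the incident timeline as a time-to-acknowledge figure. The following is a minimal sketch, assuming a hypothetical timeline of (timestamp, event, person) entries; the event names `alert` and `ack`, the function name, and the sample data are illustrative assumptions, not a VictorOps export format.

```python
from datetime import datetime, timedelta

# Hypothetical timeline entries: (ISO timestamp, event type, person).
# Event names and sample values are assumptions made for illustration.
TIMELINE = [
    ("2015-03-01T02:14:00", "alert", "oncall-primary"),
    ("2015-03-01T02:19:30", "ack", "oncall-primary"),
    ("2015-03-01T03:02:00", "resolved", "oncall-primary"),
]


def time_to_acknowledge(timeline):
    """Return the delay between the first alert and the first ack after it.

    Returns None when the timeline has no alert or no later acknowledgment.
    """
    alert_at = None
    ack_at = None
    # ISO timestamps in the same format sort chronologically as strings.
    for stamp, event, _who in sorted(timeline):
        when = datetime.fromisoformat(stamp)
        if event == "alert" and alert_at is None:
            alert_at = when
        elif event == "ack" and alert_at is not None and ack_at is None:
            ack_at = when
    if alert_at is None or ack_at is None:
        return None
    return ack_at - alert_at


if __name__ == "__main__":
    print("time to acknowledge:", time_to_acknowledge(TIMELINE))
```

Keeping the raw timestamps in the timeline, rather than only the computed delay, leaves room to derive other figures (such as time to resolve) during the postmortem.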

Slide 45

Slide 45 text

who was involved throughout the incident management lifecycle

Slide 46

Slide 46 text

conversations that were had

Slide 47

Slide 47 text

commands that were run (i.e. ChatOps)

Slide 48

Slide 48 text

context

Slide 49

Slide 49 text

Memorialized
- monitoring data
- alerts
- acknowledgments
- context
  - graphs
  - logs
  - runbooks
  - notes
- actions
- conversations
- remediation
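The memorialized items above map naturally onto a single record kept in one place. The following is a minimal sketch of such a record, assuming hypothetical class and field names (`PostmortemRecord`, `RemediationItem`, `unowned_remediation`) chosen for illustration; it is not a schema from any particular tool. The helper flags remediation items that still lack an owner.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationItem:
    """One follow-up action from the postmortem (hypothetical shape)."""
    description: str
    owner: str = ""  # empty string means nobody owns this item yet


def _empty_context():
    # Context sub-items as listed on the slide.
    return {"graphs": [], "logs": [], "runbooks": [], "notes": []}


@dataclass
class PostmortemRecord:
    """Everything memorialized about one incident, held in one place."""
    monitoring_data: list = field(default_factory=list)
    alerts: list = field(default_factory=list)
    acknowledgments: list = field(default_factory=list)
    context: dict = field(default_factory=_empty_context)
    actions: list = field(default_factory=list)
    conversations: list = field(default_factory=list)
    remediation: list = field(default_factory=list)

    def unowned_remediation(self):
        """Return remediation items that have no assigned owner."""
        return [item for item in self.remediation if not item.owner.strip()]


if __name__ == "__main__":
    record = PostmortemRecord()
    record.context["runbooks"].append("restart-queue-worker")  # sample value
    record.remediation.append(RemediationItem("add disk alert", owner="alice"))
    record.remediation.append(RemediationItem("update runbook"))
    for item in record.unowned_remediation():
        print("needs an owner:", item.description)
```

Checking for unowned items before closing the postmortem is one way to make sure the meeting ends with assigned action items rather than open-ended discussion.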

Slide 50

Slide 50 text

loop closed: We've taken the important monitoring data and metrics ... start back over

Slide 51

Slide 51 text

Jason Hand @jasonhand @victorops

Slide 52

Slide 52 text

ChatOps For Dummies: jhand.co/ChatOps4Dummies