Slide 1

Slide 1 text

@phinze DevOpsDays Chicago 2017 Getting Good at System Failure Analysis

Slide 2

Slide 2 text

@phinze DevOpsDays Chicago 2017 @phinze Paul Hinze new phone who dis? he/him/his

Slide 3

Slide 3 text

@phinze DevOpsDays Chicago 2017 Feeling, Thinking, Acting, Collaborating

Slide 4

Slide 4 text

@phinze DevOpsDays Chicago 2017 So... Systems Fail

Slide 5

Slide 5 text

@phinze DevOpsDays Chicago 2017 Systems Fail “a set of connected things or parts forming a complex whole”

Slide 6

Slide 6 text

@phinze DevOpsDays Chicago 2017 Systems Fail “to be unsuccessful in achieving one’s goal”

Slide 7

Slide 7 text

@phinze DevOpsDays Chicago 2017 Now What?

Slide 8

Slide 8 text

@phinze DevOpsDays Chicago 2017 Attitude Matters A Whole Lot

Slide 9

Slide 9 text

@phinze DevOpsDays Chicago 2017 Feeling: Reacting to Failure

Slide 10

Slide 10 text

@phinze DevOpsDays Chicago 2017 Towards a Failure Feelings Framework

Slide 11

Slide 11 text

@phinze DevOpsDays Chicago 2017

Slide 12

Slide 12 text

@phinze DevOpsDays Chicago 2017

Slide 13

Slide 13 text

@phinze DevOpsDays Chicago 2017

Slide 14

Slide 14 text

@phinze DevOpsDays Chicago 2017

Slide 15

Slide 15 text

@phinze DevOpsDays Chicago 2017

Slide 16

Slide 16 text

@phinze DevOpsDays Chicago 2017

Slide 17

Slide 17 text

@phinze DevOpsDays Chicago 2017

Slide 18

Slide 18 text

@phinze DevOpsDays Chicago 2017

Slide 19

Slide 19 text

@phinze DevOpsDays Chicago 2017 So... Systems Fail

Slide 20

Slide 20 text

@phinze DevOpsDays Chicago 2017 Why do you care?

Slide 21

Slide 21 text

@phinze DevOpsDays Chicago 2017

Slide 22

Slide 22 text

@phinze DevOpsDays Chicago 2017

Slide 23

Slide 23 text

@phinze DevOpsDays Chicago 2017 giving a apathy motivation

Slide 24

Slide 24 text

@phinze DevOpsDays Chicago 2017 Failure Hurts

Slide 25

Slide 25 text

@phinze DevOpsDays Chicago 2017 Failure Hurts

Slide 26

Slide 26 text

@phinze DevOpsDays Chicago 2017 Change Your Perspective

Slide 27

Slide 27 text

@phinze DevOpsDays Chicago 2017 (Fair Warning)

Slide 28

Slide 28 text

@phinze DevOpsDays Chicago 2017 Learn to Love Chaos

Slide 29

Slide 29 text

@phinze DevOpsDays Chicago 2017 Nassim Nicholas Taleb

Slide 30

Slide 30 text

@phinze DevOpsDays Chicago 2017 “Health” as a Resource

Slide 31

Slide 31 text

@phinze DevOpsDays Chicago 2017 Failure as an Opportunity

Slide 32

Slide 32 text

@phinze DevOpsDays Chicago 2017 depression acceptance

Slide 33

Slide 33 text

@phinze DevOpsDays Chicago 2017 Why do you care?

Slide 34

Slide 34 text

@phinze DevOpsDays Chicago 2017 Am I the only one around here who gives a about prod?

Slide 35

Slide 35 text

@phinze DevOpsDays Chicago 2017 Collective Ownership

Slide 36

Slide 36 text

@phinze DevOpsDays Chicago 2017

Slide 37

Slide 37 text

@phinze DevOpsDays Chicago 2017 We're all on a team with our past and future selves

Slide 38

Slide 38 text

@phinze DevOpsDays Chicago 2017 frustration empowerment

Slide 39

Slide 39 text

@phinze DevOpsDays Chicago 2017 When systems fail, we can learn. Then make the system smarter.

Slide 40

Slide 40 text

@phinze DevOpsDays Chicago 2017 Thinking: Understanding Failure

Slide 41

Slide 41 text

@phinze DevOpsDays Chicago 2017
1. Understand The System
2. Understand The Failure

Slide 42

Slide 42 text

@phinze DevOpsDays Chicago 2017 So... Systems Fail

Slide 43

Slide 43 text

@phinze DevOpsDays Chicago 2017 Do you understand the system?

Slide 44

Slide 44 text

@phinze DevOpsDays Chicago 2017 Can you understand the system?

Slide 45

Slide 45 text

@phinze DevOpsDays Chicago 2017 How much do you need to understand the system?

Slide 46

Slide 46 text

@phinze DevOpsDays Chicago 2017 How much do you need to understand the system in order to understand the failure?

Slide 47

Slide 47 text

@phinze DevOpsDays Chicago 2017 What does “understand the system” even mean?

Slide 48

Slide 48 text

@phinze DevOpsDays Chicago 2017 Brains are Limited. Systems are Complex. We need Tools to help us.

Slide 49

Slide 49 text

@phinze DevOpsDays Chicago 2017 Working Set

Slide 50

Slide 50 text

@phinze DevOpsDays Chicago 2017 model (n.) “a simplified description [...] of a system or process, to assist calculations and predictions”

Slide 51

Slide 51 text

@phinze DevOpsDays Chicago 2017 Data Flow Diagram

Slide 52

Slide 52 text

@phinze DevOpsDays Chicago 2017 Sequence Diagram

Slide 53

Slide 53 text

@phinze DevOpsDays Chicago 2017 Domain Model

Slide 54

Slide 54 text

@phinze DevOpsDays Chicago 2017 No models? Build Some!

Slide 55

Slide 55 text

@phinze DevOpsDays Chicago 2017 mermaid

Slide 56

Slide 56 text

@phinze DevOpsDays Chicago 2017
graph TD
  internet(("internet"))
  atlas-frontend("atlas-frontend (rails, passenger)")
  atlas-worker("atlas-worker (rails, sidekiq)")
  archivist("archivist (go)<br/>[[binstore (go)<br/>storagelocker (go)<br/>logstream (go)]]")
  packer-build-manager("packer-build-manager (go)")
  packer-build-worker("packer-build-worker (go)")
  terraform-build-manager("terraform-build-manager (go)")
  terraform-build-worker("terraform-build-worker (go)")
  slugs("slugs (go)<br/>[[slug-merge (go)<br/>slug-ingress (go)<br/>slug-extract (go)]]")
  terraform-state-parser("terraform-state-parser (go)")
  internet-->|http|atlas-frontend
  internet-->|http|archivist
  archivist-->|s3-get-put|artifacts-bucket
  archivist-->|redis-get-put|archivist-cache
  atlas-frontend-->|redis-rpush|rails-jobs-queue
  subgraph s3
    artifacts-bucket
  end
  subgraph redis
    rails-jobs-queue>"Q: rails-jobs"]
    archivist-cache("KV: archivist/*")
  end
  rails-jobs-queue-->|redis-blpop|atlas-worker
  atlas-worker-->|activerecord|atlas-db
  atlas-frontend-->|activerecord|atlas-db
  subgraph postgres
    atlas-db
  end
  atlas-worker-->|amqp-publish|packer-jobs-queue
  atlas-worker-->|amqp-publish|terraform-runs-queue
  atlas-worker-->|http|slugs
  atlas-worker-->|http|terraform-state-parser
  atlas-worker-->|http|archivist
  slugs-->|http|archivist
  subgraph rabbitmq
    packer-jobs-queue>"Q: packer-jobs"]
    packer-builds-queue>"Q: packer-builds"]
    terraform-runs-queue>"Q: terraform-runs"]
    terraform-workers-queue>"Q: terraform-workers"]
  end
  packer-jobs-queue-->|amqp-consume|packer-build-manager
  packer-build-manager-->|amqp-publish|packer-builds-queue
  packer-builds-queue-->|amqp-consume|packer-build-worker
  terraform-runs-queue-->|amqp-consume|terraform-build-manager
  terraform-build-manager-->|amqp-publish|terraform-workers-queue
  terraform-workers-queue-->|amqp-consume|terraform-build-worker

Slide 57

Slide 57 text

@phinze DevOpsDays Chicago 2017 Typora

Slide 58

Slide 58 text

@phinze DevOpsDays Chicago 2017 How much do you need to understand the system in order to understand the failure?

Slide 59

Slide 59 text

@phinze DevOpsDays Chicago 2017 Terraform Runs are slow.

Slide 60

Slide 60 text

@phinze DevOpsDays Chicago 2017 Packer stuff Web stuff Blob stuff R S T

Slide 61

Slide 61 text

@phinze DevOpsDays Chicago 2017 ⚓ Use models to anchor your working set of system understanding to the failure at hand.
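One way to sketch this anchoring in shell: given the edges of a mermaid-style data flow graph, keep only the ones that lead into the component that is failing. The `upstream_edges` function is a hypothetical helper, not something from the talk, and it assumes every edge is written as `src-->|label|dst` on its own line.

```shell
#!/usr/bin/env bash
# upstream_edges TARGET (hypothetical helper): read "src-->|label|dst" edges
# on stdin and print only the edges that lie on a path leading into TARGET.
upstream_edges() {
  awk -v target="$1" '
    /-->/ {
      line = $0
      gsub(/^[ \t]+|[ \t]+$/, "", line)            # trim indentation
      n = split(line, parts, /-->\|[^|]*\|/)       # drop the |label|
      if (n == 2) { src[NR] = parts[1]; dst[NR] = parts[2]; raw[NR] = line; max = NR }
    }
    END {
      wanted[target] = 1
      # Walk backwards from the target until no new nodes are discovered.
      do {
        changed = 0
        for (i = 1; i <= max; i++)
          if ((i in dst) && (dst[i] in wanted) && !(src[i] in wanted)) {
            wanted[src[i]] = 1
            changed = 1
          }
      } while (changed)
      # Print the surviving edges in their original order.
      for (i = 1; i <= max; i++)
        if ((i in dst) && (dst[i] in wanted)) print raw[i]
    }
  '
}

# Demo on a few edges: the packer edge is not upstream of the terraform worker.
upstream_edges terraform-build-worker <<'EOF'
atlas-worker-->|amqp-publish|packer-jobs-queue
atlas-worker-->|amqp-publish|terraform-runs-queue
terraform-runs-queue-->|amqp-consume|terraform-build-manager
terraform-build-manager-->|amqp-publish|terraform-workers-queue
terraform-workers-queue-->|amqp-consume|terraform-build-worker
EOF
```

Lines that are not edges (node definitions, `subgraph` and `end`) are ignored, so a whole graph file can be piped in unchanged.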

Slide 62

Slide 62 text

@phinze DevOpsDays Chicago 2017 No models? Build Some! Start with your last failure!

Slide 63

Slide 63 text

@phinze DevOpsDays Chicago 2017 Acting: Responding to Failure

Slide 64

Slide 64 text

@phinze DevOpsDays Chicago 2017 Given $input, I expected $x. Instead, I observed $y.

Slide 65

Slide 65 text

@phinze DevOpsDays Chicago 2017 A mystery!

Slide 66

Slide 66 text

@phinze DevOpsDays Chicago 2017 A process of narrowing in

Slide 67

Slide 67 text

@phinze DevOpsDays Chicago 2017 Environment Logs Code

Slide 68

Slide 68 text

@phinze DevOpsDays Chicago 2017 No Fancy Tools Necessary
✅ SSH
✅ Shell loops
✅ Scratch files
✅ Shell history
✅ Unix text processing tools
✅ Google
(Fancy tools are great if you have 'em!)

Slide 69

Slide 69 text

@phinze DevOpsDays Chicago 2017 Scope the Failure Continuously

Slide 70

Slide 70 text

@phinze DevOpsDays Chicago 2017 Where's the app?
# No service discovery? No prob!
# Copy all IPs from web console into /tmp/allips
cat /tmp/allips | while read ip; do
  echo $ip
  # -n stops ssh from reading stdin and eating the rest of the IP list
  if ssh -n $ip "ps aux | grep appnam[e]"; then
    echo $ip >> /tmp/appips
  fi
done

Slide 71

Slide 71 text

@phinze DevOpsDays Chicago 2017 Happening on all nodes or just one?
cat /tmp/appips | while read ip; do
  echo $ip
  # -n stops ssh from reading stdin and eating the rest of the IP list
  ssh -n $ip "grep 'errmsg' /var/log/app.log"
done

Slide 72

Slide 72 text

@phinze DevOpsDays Chicago 2017 Happening to all users or just one?
cat /tmp/appips | while read ip; do
  # -n stops ssh from reading stdin and eating the rest of the IP list
  ssh -n $ip "grep 'errmsg' /var/log/app.log" >> /tmp/errs.log
done
# remote_ip is column 11
cat /tmp/errs.log | awk '{print $11}' | sort | uniq -c
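The same count-by-column trick answers a lot of scoping questions, so it can be worth wrapping up. A minimal sketch, assuming whitespace-separated log lines; `count_by_column` is a made-up name, and the sample lines in the demo are invented rather than real log output.

```shell
#!/usr/bin/env bash
# count_by_column N (hypothetical helper): read log lines on stdin and print
# "count value" for column N, most frequent value first.
count_by_column() {
  awk -v col="$1" '{ print $col }' |
    sort |
    uniq -c |
    sort -rn |
    awk '{ print $1, $2 }'   # strip the padding that uniq -c adds
}

# Demo with invented lines where the client address is column 3.
printf '%s\n' \
  'errmsg GET 10.0.0.7' \
  'errmsg GET 10.0.0.7' \
  'errmsg POST 10.0.0.9' |
  count_by_column 3
```

With the layout from the slide, where remote_ip is column 11, that would be `count_by_column 11 < /tmp/errs.log`. One address dominating the counts suggests a single user; an even spread suggests everyone is affected.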

Slide 73

Slide 73 text

@phinze DevOpsDays Chicago 2017 Read the Logs

Slide 74

Slide 74 text

@phinze DevOpsDays Chicago 2017 Snag the last ~2h of logs locally
cat /tmp/appips | while read ip; do
  # -n stops ssh from reading stdin and eating the rest of the IP list
  ssh -n $ip "grep '2017-09-12 0[34]' /var/log/app.log" \
    >> /tmp/app2h.log
done

Slide 75

Slide 75 text

@phinze DevOpsDays Chicago 2017 HTTP responses last 2h?
cat /tmp/app2h.log |
  awk '{ print $8 }' |  # say 8th column is http status
  cut -c1 |             # first char gives us "class" of code
  sort |
  uniq -c

Slide 76

Slide 76 text

@phinze DevOpsDays Chicago 2017 HTTP responses last 2h by min?
cat /tmp/app2h.log |
  awk '{ print substr($2, 1, 5), substr($8, 1, 1) }' |  # 2nd column is time: "HH:MM C"
  sort |
  uniq -c
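To see what the per-minute breakdown looks like without touching a real server, here is a small self-contained sketch. It assumes the same layout as the slide (time in column 2 as HH:MM:SS, HTTP status in column 8); the function name `status_by_minute` and the sample lines are invented for illustration.

```shell
#!/usr/bin/env bash
# status_by_minute (hypothetical helper): read log lines on stdin and print
# "HH:MM Cxx count", one line per minute and status class.
# Assumes column 2 is HH:MM:SS and column 8 is the HTTP status code.
status_by_minute() {
  awk '{ print substr($2, 1, 5), substr($8, 1, 1) "xx" }' |
    sort |
    uniq -c |
    awk '{ print $2, $3, $1 }'   # move the count to the end of the line
}

# Demo with invented log lines.
printf '%s\n' \
  '2017-09-12 03:15:01 a b c d e 200' \
  '2017-09-12 03:15:30 a b c d e 500' \
  '2017-09-12 03:15:59 a b c d e 502' \
  '2017-09-12 03:16:02 a b c d e 200' |
  status_by_minute
```

Reading down the minutes makes it easier to spot when a class of response, say 5xx, started to climb.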

Slide 77

Slide 77 text

@phinze DevOpsDays Chicago 2017 Check the Vitals

Slide 78

Slide 78 text

@phinze DevOpsDays Chicago 2017 Exhaustible Resources
- Memory (free -m)
- Disk Space (df -h)
- Disk I/O (iostat)
- Network (iftop)
- CPU (htop)
- Entropy (cat /proc/sys/kernel/random/entropy_avail)

Slide 79

Slide 79 text

@phinze DevOpsDays Chicago 2017 Any machines out of disk?
cat /tmp/appips | for ip in $(cat -); do
  echo $ip; ssh $ip "df -h"
done
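Eyeballing `df` output across many machines gets tedious, so a filter can help. A minimal sketch, assuming the POSIX `df -P` layout (capacity in column 5, mount point in column 6, mount points without spaces); `full_filesystems` and its default threshold of 90 are hypothetical choices, not from the talk.

```shell
#!/usr/bin/env bash
# full_filesystems [PERCENT] (hypothetical helper): read `df -P` output on
# stdin and print "mountpoint capacity" for filesystems at or above PERCENT.
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {                      # skip the df header line
      use = $5
      sub(/%$/, "", use)          # "95%" -> "95"
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Check the local machine against the default threshold.
df -P | full_filesystems
```

Inside the loop above this could be used as `ssh $ip "df -P" | full_filesystems 90`, so that only the machines that are actually short on disk print anything after their IP.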

Slide 80

Slide 80 text

@phinze DevOpsDays Chicago 2017 Read the Code

Slide 81

Slide 81 text

@phinze DevOpsDays Chicago 2017 Don't know the language? Learn It!

Slide 82

Slide 82 text

@phinze DevOpsDays Chicago 2017 Minimum Viable Code Reading
- Syntax highlighting
- Search for string (:Ag)
- Go to definition (gd)
- Find callers (:GoCallers)
- Jump back/forth in history (C-o/C-i)
- Generate shareable link to context (:GitBrowse)
- Walk through VCS history (:GitBlame, or :GitBrowse + GH)

Slide 83

Slide 83 text

@phinze DevOpsDays Chicago 2017 ⚓ Use Models to Stay Anchored ⚓

Slide 84

Slide 84 text

@phinze DevOpsDays Chicago 2017 Solve It! Leave no stone unturned

Slide 85

Slide 85 text

@phinze DevOpsDays Chicago 2017 Collaborating: Learning from Failure

Slide 86

Slide 86 text

@phinze DevOpsDays Chicago 2017 Failure Analysis requires Help From Others

Slide 87

Slide 87 text

@phinze DevOpsDays Chicago 2017 Learning is Contagious

Slide 88

Slide 88 text

@phinze DevOpsDays Chicago 2017 Ownership is Contagious

Slide 89

Slide 89 text

@phinze DevOpsDays Chicago 2017 Failure can make everyone Stronger

Slide 90

Slide 90 text

@phinze DevOpsDays Chicago 2017 Thank You!