Getting Good at System Failure Analysis

@phinze DevOpsDays Chicago 2017 Getting Good at System Failure Analysis

Paul Hinze (@phinze), he/him/his
new phone who dis?

Feeling, Thinking, Acting, Collaborating

So... Systems Fail

Systems Fail. System: “a set of connected things or parts forming a complex whole”

Systems Fail. Fail: “to be unsuccessful in achieving one’s goal”

Now What?

Attitude Matters. A Whole Lot.

Feeling: Reacting to Failure

Towards a Failure Feelings Framework

Why do you care?

giving a [...]: apathy → motivation

Failure Hurts

Change Your Perspective

(Fair Warning)

Learn to Love Chaos

Nassim Nicholas Taleb

“Health” as a Resource

Failure as an Opportunity

depression → acceptance

Why do you care?

Am I the only one around here who gives a [...] about prod?

Collective Ownership

We're all on a team with our past and future selves

frustration → empowerment

When systems fail, we can learn. Then make the system smarter.

Thinking: Understanding Failure

1. Understand The System
2. Understand The Failure

Do you understand the system?

Can you understand the system?

How much do you need to understand the system?

How much do you need to understand the system in order to understand the failure?

What does “understand the system” even mean?

Brains are Limited. Systems are Complex. We need Tools to help us.

Working Set

model (n.) “a simplified description [...] of a system or process, to assist calculations and predictions”

Data Flow Diagram

Sequence Diagram

Domain Model

No models? Build Some!

mermaid

    graph TD
    internet(("internet"))
    atlas-frontend("atlas-frontend (rails, passenger)")
    atlas-worker("atlas-worker (rails, sidekiq)")
    archivist("archivist (go) [[binstore (go) storagelocker (go) logstream (go)]]")
    packer-build-manager("packer-build-manager (go)")
    packer-build-worker("packer-build-worker (go)")
    terraform-build-manager("terraform-build-manager (go)")
    terraform-build-worker("terraform-build-worker (go)")
    slugs("slugs (go) [[slug-merge (go) slug-ingress (go) slug-extract (go)]]")
    terraform-state-parser("terraform-state-parser (go)")
    internet-->|http|atlas-frontend
    internet-->|http|archivist
    archivist-->|s3-get-put|artifacts-bucket
    archivist-->|redis-get-put|archivist-cache
    atlas-frontend-->|redis-rpush|rails-jobs-queue
    subgraph s3
    artifacts-bucket
    end
    subgraph redis
    rails-jobs-queue>"Q: rails-jobs"]
    archivist-cache("KV: archivist/*")
    end
    rails-jobs-queue-->|redis-blpop|atlas-worker
    atlas-worker-->|activerecord|atlas-db
    atlas-frontend-->|activerecord|atlas-db
    subgraph postgres
    atlas-db
    end
    atlas-worker-->|amqp-publish|packer-jobs-queue
    atlas-worker-->|amqp-publish|terraform-runs-queue
    atlas-worker-->|http|slugs
    atlas-worker-->|http|terraform-state-parser
    atlas-worker-->|http|archivist
    slugs-->|http|archivist
    subgraph rabbitmq
    packer-jobs-queue>"Q: packer-jobs"]
    packer-builds-queue>"Q: packer-builds"]
    terraform-runs-queue>"Q: terraform-runs"]
    terraform-workers-queue>"Q: terraform-workers"]
    end
    packer-jobs-queue-->|amqp-consume|packer-build-manager
    packer-build-manager-->|amqp-publish|packer-builds-queue
    packer-builds-queue-->|amqp-consume|packer-build-worker
    terraform-runs-queue-->|amqp-consume|terraform-build-manager
    terraform-build-manager-->|amqp-publish|terraform-workers-queue
    terraform-workers-queue-->|amqp-consume|terraform-build-worker

Typora

How much do you need to understand the system in order to understand the failure?

Terraform Runs are slow.

Packer stuff, Web stuff, Blob stuff

⚓ Use models to anchor your working set of system understanding to the failure at hand.
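
One low-tech way to do that anchoring, sketched under assumptions: if the mermaid graph is saved to a text file with one edge per line, plain grep can cut it down to the edges that mention the failing area. The function name `edges_touching` and the file name `atlas.mmd` are hypothetical, not from the talk.

```shell
# Hypothetical helper: print only the mermaid edges that mention a keyword
# anywhere on the line (source, label, or target).
# Assumes one edge per line, in the form: source-->|label|target
# Reads the diagram text on standard input.
edges_touching() {
  keyword=$1
  # First keep only edge lines (those containing "-->"),
  # then keep the ones that mention the keyword.
  grep -e '-->' | grep -e "$keyword"
}
```

For "Terraform Runs are slow", something like `edges_touching terraform < atlas.mmd` would leave just the Terraform path to reason about.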

No models? Build Some! Start with your last failure!

Acting: Responding to Failure

Given $input, I expected $x. Instead, I observed $y.

A mystery!

A process of narrowing in

Environment, Logs, Code

No Fancy Tools Necessary
✅ SSH
✅ Shell loops
✅ Scratch files
✅ Shell history
✅ Unix text processing tools
✅ Google
(Fancy tools are great if you have 'em!)

Continuously Scope the Failure

Where's the app?

    # No service discovery? No prob!
    # Copy all IPs from web console into /tmp/allips
    cat /tmp/allips | while read ip; do
      echo $ip
      # -n keeps ssh from swallowing the rest of the IP list on stdin
      if ssh -n $ip "ps aux | grep appnam[e]"; then
        echo $ip >> /tmp/appips
      fi
    done

Happening on all nodes or just one?

    cat /tmp/appips | while read ip; do
      echo $ip
      # -n keeps ssh from swallowing the rest of the IP list on stdin
      ssh -n $ip "grep 'errmsg' /var/log/app.log"
    done

Happening to all users or just one?

    cat /tmp/appips | while read ip; do
      ssh -n $ip "grep 'errmsg' /var/log/app.log" >> /tmp/errs.log
    done
    # remote_ip is column 11
    cat /tmp/errs.log | awk '{print $11}' | sort | uniq -c

Read the Logs

Snag the last ~2h of logs locally

    cat /tmp/appips | while read ip; do
      ssh -n $ip "grep '2017-09-12 0[34]' /var/log/app.log" \
        >> /tmp/app2h.log
    done

HTTP responses last 2h?

    cat /tmp/app2h.log |
      awk '{ print $8 }' |  # say 8th column is http status
      cut -c1 |             # first char gives us "class" of code
      sort |
      uniq -c

HTTP responses last 2h by min?

    cat /tmp/app2h.log |
      awk '{ print $2, $8 }' |  # 2nd column is time: "HH:MM:SS CCC"
      cut -c1-5,9-10 |          # keep "HH:MM C": minute plus status class
      sort |
      uniq -c
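
To narrow further, the same per-minute count can be limited to server errors. This is a sketch only; it keeps the column assumptions from the slides (time in column 2 as HH:MM:SS, HTTP status in column 8), and the function name `server_errors_by_minute` is made up for illustration.

```shell
# Hypothetical helper: count 5xx responses per minute.
# Assumes column 2 is the time as HH:MM:SS and column 8 is the HTTP status.
# Reads log lines on standard input.
server_errors_by_minute() {
  # Keep only lines whose status is 5xx and print the "HH:MM" part of the time,
  # then count how many times each minute appears.
  awk '$8 ~ /^5[0-9][0-9]$/ { print substr($2, 1, 5) }' |
    sort |
    uniq -c
}
```

Usage would look like `server_errors_by_minute < /tmp/app2h.log`; a minute where the count jumps is a good candidate for when the failure started.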

Check the Vitals

Exhaustible Resources
- Memory (free -m)
- Disk Space (df -h)
- Disk I/O (iostat)
- Network (iftop)
- CPU (htop)
- Entropy (cat /proc/sys/kernel/random/entropy_avail)
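
A rough sketch of a one-shot snapshot built from the non-interactive commands in that list. The function name `vitals` is hypothetical; iftop and htop are left out because they are interactive, and each tool is skipped with a note if it is not installed on the machine.

```shell
# Hypothetical helper: print a quick, non-interactive snapshot of
# the exhaustible resources on the current machine.
vitals() {
  echo "== memory (MB) =="
  if command -v free >/dev/null 2>&1; then
    free -m
  else
    echo "free not available"
  fi

  echo "== disk space =="
  df -h || echo "df failed"

  echo "== disk I/O =="
  if command -v iostat >/dev/null 2>&1; then
    iostat || echo "iostat failed"
  else
    echo "iostat not available"
  fi

  echo "== entropy =="
  if [ -r /proc/sys/kernel/random/entropy_avail ]; then
    cat /proc/sys/kernel/random/entropy_avail
  else
    echo "entropy_avail not readable"
  fi
}
```

Pasting the function into a shell on a suspect machine and running `vitals` gives one screen to scan before digging into any single resource.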

Any machines out of disk?

    cat /tmp/appips | for ip in $(cat -); do
      echo $ip
      ssh $ip "df -h"
    done
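
With many machines, reading every `df` table by eye gets slow. A sketch of a filter that prints only the nearly full filesystems, assuming the POSIX `df -P` column layout (capacity in column 5, mount point in column 6) and mount points without spaces. The name `full_disks` and the default threshold of 90 are illustrative choices.

```shell
# Hypothetical helper: read `df -P` output on standard input and print the
# mount point and use% of every filesystem at or above a threshold.
# Usage: df -P | full_disks [threshold-percent]   (default threshold: 90)
full_disks() {
  threshold=${1:-90}
  awk -v limit="$threshold" '
    NR > 1 {                     # skip the header line
      pct = $5
      sub(/%/, "", pct)          # "95%" -> "95"
      if (pct + 0 >= limit + 0)  # force numeric comparison
        print $6, $5
    }'
}
```

Inside the loop above this could be used as `ssh $ip "df -P" | full_disks 90`, so only the machines with a problem print anything after their IP.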

Read the Code

Don't know the language? Learn It!

Minimum Viable Code Reading
- Syntax highlighting
- Search for string (:Ag)
- Go to definition (gd)
- Find callers (:GoCallers)
- Jump back/forth in history (C-o/C-i)
- Generate shareable link to context (:GitBrowse)
- Walk through VCS history (:GitBlame, or :GitBrowse + GH)
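
Without an editor setup, the "search for string" step still works from a plain shell. A minimal sketch, assuming a local checkout of the code; the function name `where_is` is hypothetical.

```shell
# Hypothetical helper: find where a literal string (for example an error
# message seen in the logs) appears in a source tree.
# Usage: where_is STRING [DIRECTORY]   (default directory: current one)
where_is() {
  needle=$1
  dir=${2:-.}
  # -r: recurse, -n: show line numbers, -F: treat the string literally
  grep -rnF -e "$needle" "$dir"
}
```

Running `where_is errmsg` in the app's checkout prints `file:line:text` for each match, which is usually enough to jump from a log line to the code that wrote it.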

⚓ Use Models to Stay Anchored ⚓

Solve It! Leave no stone unturned.

Collaborating: Learning from Failure

Failure Analysis requires Help From Others

Learning is Contagious

Ownership is Contagious

Failure can make everyone Stronger

Thank You!
