Getting Good at System Failure Analysis

Paul Hinze
September 12, 2017

Given at DevOpsDays Chicago 2017

Every failure is a mystery to be solved. Solving those mysteries is a skill that can be honed. Let’s talk about how to get better at figuring out what’s up when things go wrong! This is a talk full of both high level advice and concrete tips from somebody who loves fixing weird production issues.

What does it mean to be good at debugging production issues? That’s the question we’ll explore in this talk! I’ll be sharing a grab bag of the postures, practices, tips, and tricks I’ve learned from years hanging out near production.

Production systems are not always designed for operability, and yet we still need to fix them. So my goal is to share techniques that apply across a range of operational maturity levels. This breaks down into a few sections:

- Adopting a productive attitude towards failures
- Learning to love logs, wherever you may find them
- Guerrilla systems thinking and domain modeling
- Code reading for failure analysis
- Collaborating to remediate and solve production issues

Production failure analysis has been one of the most rewarding skills that I’ve built up in my career. I hope that after this talk you’ll have a few tools to walk away with but, more importantly, that you’ll be inspired to get better at responding to failures.

Transcript

  1. @phinze DevOpsDays Chicago 2017

    Systems Fail

    “a set of connected things or parts forming a complex whole”
  2. How much do you need to understand the system in order to understand the failure?
  3. model (n.) “a simplified description [...] of a system or process, to assist calculations and predictions”
  4. graph TD

    internet(("internet"))
    atlas-frontend("atlas-frontend (rails, passenger)")
    atlas-worker("atlas-worker (rails, sidekiq)")
    archivist("archivist (go)<br>[[binstore (go)<br>storagelocker (go)<br>logstream (go)]]")
    packer-build-manager("packer-build-manager (go)")
    packer-build-worker("packer-build-worker (go)")
    terraform-build-manager("terraform-build-manager (go)")
    terraform-build-worker("terraform-build-worker (go)")
    slugs("slugs (go)<br>[[slug-merge (go)<br>slug-ingress (go)<br>slug-extract (go)]]")
    terraform-state-parser("terraform-state-parser (go)")
    internet-->|http|atlas-frontend
    internet-->|http|archivist
    archivist-->|s3-get-put|artifacts-bucket
    archivist-->|redis-get-put|archivist-cache
    atlas-frontend-->|redis-rpush|rails-jobs-queue
    subgraph s3
      artifacts-bucket
    end
    subgraph redis
      rails-jobs-queue>"Q: rails-jobs"]
      archivist-cache("KV: archivist/*")
    end
    rails-jobs-queue-->|redis-blpop|atlas-worker
    atlas-worker-->|activerecord|atlas-db
    atlas-frontend-->|activerecord|atlas-db
    subgraph postgres
      atlas-db
    end
    atlas-worker-->|amqp-publish|packer-jobs-queue
    atlas-worker-->|amqp-publish|terraform-runs-queue
    atlas-worker-->|http|slugs
    atlas-worker-->|http|terraform-state-parser
    atlas-worker-->|http|archivist
    slugs-->|http|archivist
    subgraph rabbitmq
      packer-jobs-queue>"Q: packer-jobs"]
      packer-builds-queue>"Q: packer-builds"]
      terraform-runs-queue>"Q: terraform-runs"]
      terraform-workers-queue>"Q: terraform-workers"]
    end
    packer-jobs-queue-->|amqp-consume|packer-build-manager
    packer-build-manager-->|amqp-publish|packer-builds-queue
    packer-builds-queue-->|amqp-consume|packer-build-worker
    terraform-runs-queue-->|amqp-consume|terraform-build-manager
    terraform-build-manager-->|amqp-publish|terraform-workers-queue
    terraform-workers-queue-->|amqp-consume|terraform-build-worker
  5. How much do you need to understand the system in order to understand the failure?
  6. ⚓ Use models to anchor your working set of system understanding to the failure at hand.
  7. No Fancy Tools Necessary (Fancy tools are great if you have 'em!)

    ✅ SSH
    ✅ Shell loops
    ✅ Scratch files
    ✅ Shell history
    ✅ Unix text processing tools
    ✅ Google
  8. Where's the app?

    # No service discovery? No prob!
    # Copy all IPs from web console into /tmp/allips
    cat /tmp/allips | while read ip; do
      echo $ip
      # -n stops ssh from swallowing the rest of the IP list on stdin
      if ssh -n $ip "ps aux | grep appnam[e]"; then
        echo $ip >> /tmp/appips
      fi
    done
  9. Happening on all nodes or just one?

    cat /tmp/appips | while read ip; do
      echo $ip
      # -n stops ssh from swallowing the rest of the IP list on stdin
      ssh -n $ip "grep 'errmsg' /var/log/app.log"
    done
  10. Happening to all users or just one?

    cat /tmp/appips | while read ip; do
      # -n stops ssh from swallowing the rest of the IP list on stdin
      ssh -n $ip "grep 'errmsg' /var/log/app.log" >> /tmp/errs.log
    done
    # remote_ip is column 11
    cat /tmp/errs.log | awk '{print $11}' | sort | uniq -c
  11. Snag the last ~2h of logs locally

    cat /tmp/appips | while read ip; do
      # -n stops ssh from swallowing the rest of the IP list on stdin
      ssh -n $ip "grep '2017-09-12 0[34]' /var/log/app.log" \
        >> /tmp/app2h.log
    done
  12. HTTP responses last 2h?

    # Pipes end each line so the trailing comments stay valid shell
    cat /tmp/app2h.log |
      awk '{ print $8 }' |  # say 8th column is http status
      cut -c1 |             # first char gives us "class" of code
      sort |
      uniq -c
  13. HTTP responses last 2h by min?

    # Assumes the 2nd column is a time formatted as HH:MM:SS
    cat /tmp/app2h.log |
      awk '{ print $2 ":" $8 }' |  # 2nd column is time, 8th is status
      cut -c1-10 |                 # "HH:MM:SS:C" - 10 chars
      cut -d: -f1,2,4 |            # "HH:MM:C"
      sort |
      uniq -c
  14. Exhaustible Resources

    - Memory (free -m)
    - Disk Space (df -h)
    - Disk I/O (iostat)
    - Network (iftop)
    - CPU (htop)
    - Entropy (cat /proc/sys/kernel/random/entropy_avail)
  15. Any machines out of disk?

    cat /tmp/appips | for ip in $(cat -); do
      echo $ip; ssh $ip "df -h"
    done
  16. Minimum Viable Code Reading

    - Syntax highlighting
    - Search for string (:Ag)
    - Go to definition (gd)
    - Find callers (:GoCallers)
    - Jump back/forth in history (C-o/C-i)
    - Generate shareable link to context (:GitBrowse)
    - Walk through VCS history (:GitBlame, or :GitBrowse + GH)
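
The per-minute status tally from slides 12 and 13 can also be done in a single awk pass, which avoids counting characters for `cut`. This is a minimal sketch, not the speaker's own code. It assumes the same log layout the slides assume (whitespace-separated fields, field 2 a time formatted as HH:MM:SS, field 8 the HTTP status code), and the function name `status_by_minute` is made up for illustration.

```shell
# Tally HTTP status classes per minute from log lines on stdin.
# Assumed layout (an assumption carried over from the slides, not a real
# log format): field 2 is a time as HH:MM:SS, field 8 is the HTTP status.
status_by_minute() {
  awk '{
    minute = substr($2, 1, 5)        # "HH:MM"
    class  = substr($8, 1, 1) "xx"   # "2xx", "4xx", "5xx", ...
    count[minute " " class]++
  }
  END {
    # Emit "HH:MM Cxx N", one line per minute and status class
    for (key in count) print key, count[key]
  }' | sort
}
```

With the scratch file from slide 11 this would be run as `status_by_minute < /tmp/app2h.log`. Each output line holds a minute, a status class, and a count, so a sudden run of `5xx` lines stands out when scanning down the list.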