@phinze
DevOpsDays
Chicago 2017
Typora
Slide 58
How much do you need to understand the system
in order to understand the failure?
Slide 59
Terraform Runs are slow.
Slide 60
[Diagram: "Packer stuff", "Web stuff", "Blob stuff"]
Slide 61
⚓
Use models to anchor
your working set of system understanding
to the failure at hand.
Slide 62
No models?
Build Some!
Start with your last failure!
Slide 63
Acting
Responding to Failure
Slide 64
Given $input,
I expected $x.
Instead, I observed $y.
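One way to keep that statement in front of you is to write it down before digging in. A minimal sketch, assuming a hypothetical helper named `failure_statement` and made-up example values (neither is from the talk):

```shell
# Hypothetical helper (not from the talk): fill in the failure template.
# $1 = input, $2 = what you expected, $3 = what you observed
failure_statement() {
  printf 'Given %s,\nI expected %s.\nInstead, I observed %s.\n' "$1" "$2" "$3"
}

# Made-up values, for illustration only
failure_statement "a terraform plan" "it to finish quickly" "a slow run"
```

Appending the output to a scratch file gives you something concrete to come back to as the investigation narrows.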
Slide 65
A mystery!
Slide 66
A process of narrowing in
Slide 67
Environment
Logs
Code
Slide 68
No Fancy Tools Necessary
✅ SSH
✅ Shell loops
✅ Scratch files
✅ Shell history
✅ Unix text processing tools
✅ Google
(Fancy tools are great if you have 'em!)
Slide 69
Continuously Scope the Failure
Slide 70
Where's the app?
# No service discovery? No prob!
# Copy all IPs from web console into /tmp/allips
cat /tmp/allips | while read -r ip; do
  echo "$ip"
  # -n keeps ssh from reading stdin and swallowing the rest of the IP list
  if ssh -n "$ip" "ps aux | grep appnam[e]"; then
    echo "$ip" >> /tmp/appips
  fi
done
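One thing to watch in any `while read` loop that calls `ssh`: by default ssh reads from standard input, so it can swallow the rest of the IP list and the loop stops after the first host. `ssh -n` (or `< /dev/null`) avoids that. The sketch below reproduces the effect locally; `greedy` is a made-up stand-in for ssh, and the IPs are sample values.

```shell
# Stand-in for ssh (assumption for this demo): like ssh, it reads stdin until EOF.
greedy() { cat > /dev/null; }

ips=$(mktemp)
printf '10.0.0.1\n10.0.0.2\n10.0.0.3\n' > "$ips"

# The command drains the list, so the loop body only runs once.
without_guard=$(while read -r ip; do echo "$ip"; greedy; done < "$ips" | wc -l | tr -d ' ')

# Redirecting stdin from /dev/null (what ssh -n does) lets the loop see every line.
with_guard=$(while read -r ip; do echo "$ip"; greedy < /dev/null; done < "$ips" | wc -l | tr -d ' ')

echo "hosts visited without guard: $without_guard, with guard: $with_guard"
rm -f "$ips"
```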
Slide 71
Happening on all nodes or just one?
cat /tmp/appips | while read -r ip; do
  echo "$ip"
  ssh -n "$ip" "grep 'errmsg' /var/log/app.log"  # -n: leave the IP list on stdin alone
done
Slide 72
Happening to all users or just one?
cat /tmp/appips | while read -r ip; do
  ssh -n "$ip" "grep 'errmsg' /var/log/app.log" >> /tmp/errs.log
done
# remote_ip is column 11
cat /tmp/errs.log | awk '{print $11}' | sort | uniq -c
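To see what the `sort | uniq -c` step produces, here is a small worked example on made-up log lines. The log format (remote IP in column 11) and the helper name `count_by_column` are assumptions for illustration, not the real app's format.

```shell
# Hypothetical helper: count occurrences of one whitespace-separated column.
# $1 = column number; log lines arrive on stdin; output is "count value",
# most frequent first.
count_by_column() {
  awk -v col="$1" '{ print $col }' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Made-up log lines in an assumed format where column 11 is the remote IP
sample='2017-09-12 03:15:02 web1 app ERROR GET /login 500 12ms client 10.0.0.5
2017-09-12 03:15:40 web1 app ERROR GET /login 500 11ms client 10.0.0.5
2017-09-12 03:16:01 web2 app ERROR GET /cart 502 30ms client 10.0.0.9
2017-09-12 03:16:07 web2 app ERROR GET /login 500 14ms client 10.0.0.5'

printf '%s\n' "$sample" | count_by_column 11
# prints:
# 3 10.0.0.5
# 1 10.0.0.9
```

One address dominating the counts points at a single user; an even spread points at something wider.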
Slide 73
Read the Logs
Slide 74
Snag the last ~2h of logs locally
cat /tmp/appips | while read -r ip; do
  ssh -n "$ip" "grep '2017-09-12 0[34]' /var/log/app.log" \
    >> /tmp/app2h.log
done
Slide 75
HTTP responses last 2h?
cat /tmp/app2h.log |
  awk '{ print $8 }' |  # say 8th column is http status
  cut -c1 |             # first char gives us "class" of code
  sort |
  uniq -c
Slide 76
HTTP responses last 2h by min?
cat /tmp/app2h.log |
  awk '{ print substr($2, 1, 5), substr($8, 1, 1) }' |  # 2nd column is time: "HH:MM C"
  sort |
  uniq -c
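A worked example of the per-minute breakdown on made-up log lines. This is a sketch under the same assumptions as the slide (column 2 is the time as HH:MM:SS, column 8 is the HTTP status); the helper name `status_by_minute` and the sample data are hypothetical.

```shell
# Hypothetical helper: count responses per minute and status class.
# Assumes column 2 is HH:MM:SS and column 8 is the HTTP status.
status_by_minute() {
  awk '{ print substr($2, 1, 5), substr($8, 1, 1) "xx" }' |
    sort |
    uniq -c |
    awk '{ print $1, $2, $3 }'   # strip the padding uniq -c adds
}

# Made-up log lines for illustration
sample='2017-09-12 03:15:02 web1 app INFO GET /login 200 12ms client 10.0.0.5
2017-09-12 03:15:40 web1 app ERROR GET /login 500 11ms client 10.0.0.5
2017-09-12 03:15:59 web2 app ERROR GET /cart 500 30ms client 10.0.0.9
2017-09-12 03:16:01 web2 app INFO GET /login 200 14ms client 10.0.0.5'

printf '%s\n' "$sample" | status_by_minute
# prints:
# 1 03:15 2xx
# 2 03:15 5xx
# 1 03:16 2xx
```

Reading down the output shows the minute in which a given status class starts to climb.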
Slide 77
Check the Vitals
Slide 78
Exhaustible Resources
- Memory (free -m)
- Disk Space (df -h)
- Disk I/O (iostat)
- Network (iftop)
- CPU (htop)
- Entropy (cat /proc/sys/kernel/random/entropy_avail)
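The non-interactive checks from this list can be bundled into a single snapshot to paste into a scratch file. A minimal sketch, assuming a hypothetical function name `vitals`; it covers only memory, disk space, and entropy, and skips anything the machine does not provide (iostat, iftop, and htop are left out because they are interactive or often not installed).

```shell
# Hypothetical helper: one-shot snapshot of a few exhaustible resources.
vitals() {
  echo "== Memory (free -m) =="
  if command -v free > /dev/null 2>&1; then
    free -m
  else
    echo "free not installed"
  fi

  echo "== Disk Space (df -h) =="
  df -h

  echo "== Entropy =="
  if [ -r /proc/sys/kernel/random/entropy_avail ]; then
    cat /proc/sys/kernel/random/entropy_avail
  else
    echo "entropy_avail not readable on this system"
  fi
}

vitals
```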
Slide 79
Any machines out of disk?
for ip in $(cat /tmp/appips); do
  echo "$ip"; ssh "$ip" "df -h"
done
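With more than a handful of machines, eyeballing `df` output gets tedious. The sketch below filters it down to filesystems at or above a usage threshold. It assumes the POSIX `df -P` column layout (capacity in column 5, mount point in column 6) and mount points without spaces; the helper name `full_filesystems` and the sample output are made up.

```shell
# Hypothetical helper: list filesystems at or above a usage threshold.
# stdin = `df -P` output; $1 = percent threshold (default 90)
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                 # "95%" -> "95"
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Made-up `df -P` output for illustration
sample='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 41152736 39095100 2057636 95% /
tmpfs 2013372 0 2013372 0% /dev/shm
/dev/sdb1 103081248 20616250 82464998 20% /data'

printf '%s\n' "$sample" | full_filesystems 90
# prints:
# / 95%

# Per host, inside the loop over /tmp/appips, this would be used as:
#   ssh -n "$ip" "df -P" | full_filesystems 90
```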
Slide 80
Read the Code
Slide 81
Don't know the language?
Learn It!
Slide 82
Minimum Viable Code Reading
- Syntax highlighting
- Search for string (:Ag)
- Go to definition (gd)
- Find callers (:GoCallers)
- Jump back/forth in history (C-o/C-i)
- Generate shareable link to context (:GitBrowse)
- Walk through VCS history (:GitBlame, or :GitBrowse + GH)
Slide 83
⚓ Use Models to Stay Anchored ⚓
Slide 84
Solve It!
Leave no stone unturned
Slide 85
Collaborating
Learning from Failure
Slide 86
Failure Analysis requires Help From Others
Slide 87
Learning is Contagious
Slide 88
Ownership is Contagious
Slide 89
Failure can make everyone Stronger