Slide 1

Slide 1 text

How to Debug Anything @samkottler 1

Slide 2

Slide 2 text

About me • Engineering manager @ DigitalOcean • Formerly of Red Hat’s virtualization team • Ruby core security • CentOS/Icinga/Bundler/Rubygems/Ansible 2

Slide 3

Slide 3 text

tl;dr I professionally debug software and systems. 3

Slide 4

Slide 4 text

Most software is pretty bad. 4

Slide 5

Slide 5 text

Operating systems are no exception. 5

Slide 6

Slide 6 text

Verifying correctness on changing systems is impossible. 6

Slide 7

Slide 7 text

Turtles all the way down: • Unreliable hardware • Unreliable devices • Unreliable kernels • Unreliable drivers • Unreliable libraries • Unreliable runtimes 7

Slide 8

Slide 8 text

8

Slide 9

Slide 9 text

Bugs happen. Tools help quickly find and fix them. 9

Slide 10

Slide 10 text

But first, let's talk process. 10

Slide 11

Slide 11 text

1. Suspend your disbelief "This isn't even possible!" 11

Slide 12

Slide 12 text

2. Reproduce the issue Without being able to create the bad condition, it's impossible to fix it. 12

Slide 13

Slide 13 text

3. Compare with healthy systems Sometimes simple or seemingly insignificant changes can lead to major issues. 13

Slide 14

Slide 14 text

4. Figure out what's wrong. This is probably gonna take a while. 14

Slide 15

Slide 15 text

5. Fix the problem Almost always easier than figuring out the problem because of the understanding built while debugging. 15

Slide 16

Slide 16 text

Let's jump back to step #4 16

Slide 17

Slide 17 text

Bugs happen. Tools help quickly find and fix them. 17

Slide 18

Slide 18 text

require "otherthing" module Derp class Yerp def self.string "derping in the USA" end end end puts Derp::Yerp.string 18

Slide 19

Slide 19 text

`require': cannot load such file -- otherthing (LoadError) from /usr/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require' from test.rb:1:in `' 19

Slide 20

Slide 20 text

open("/usr/lib/ruby/1.9.1/rubygems/path_support.rb", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=1538, ...}) = 0 fstat(5, {st_mode=S_IFREG|0644, st_size=1538, ...}) = 0 close(5) = 0 getuid() = 10100 geteuid() = 10100 getgid() = 10000 getegid() = 10000 open("/usr/lib/ruby/1.9.1/rubygems/path_support.rb", O_RDONLY) = 5 fstat(5, {st_mode=S_IFREG|0644, st_size=1538, ...}) = 0 ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffc21fe8df0) = -1 ENOTTY (Inappropriate ioctl for device) fstat(5, {st_mode=S_IFREG|0644, st_size=1538, ...}) = 0 read(5, "##\n# Gem::PathSupport facilitate"..., 8192) = 1538 read(5, "", 8192) = 0 close(5) = 0 lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/usr/lib", {st_mode=S_IFDIR|0755, st_size=12288, ...}) = 0 lstat("/usr/lib/ruby", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/usr/lib/ruby/1.9.1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/usr/lib/ruby/1.9.1/rubygems", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/usr/lib/ruby/1.9.1/rubygems/path_support.rb", {st_mode=S_IFREG|0644, st_size=1538, ...}) = 0 brk(0x1719000) = 0x1719000 stat("/home/skottler", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 openat(AT_FDCWD, "/var/lib/gems/1.9.1/specifications", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/home/skottler/.gem/ruby/1.9.1/specifications", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 open("/proc/self/maps", O_RDONLY|O_CLOEXEC) = 5 20

Slide 21

Slide 21 text

Reading strace output is easy! fstat(5, {st_mode=S_IFREG|0644, st_size=1538, ...}) = 0 21

Slide 22

Slide 22 text

Let's break it down: function(file descriptor, {....args.....}) = return value 22

Slide 23

Slide 23 text

System calls are a unified interface to debug userspace 23

Slide 24

Slide 24 text

Debugging a crash 24

Slide 25

Slide 25 text

#include int main () { int foo[5], n; memset((char *)0x0, 1, 100); return 0; } 25

Slide 26

Slide 26 text

gcc -g -o foo foo.c 26

Slide 27

Slide 27 text

$ gdb foo 27

Slide 28

Slide 28 text

(gdb) list 1 #include 2 3 int main () { 4 int foo[5], n; 5 6 memset((char *)0x0, 1, 100); 7 8 return 0; 9 } 28

Slide 29

Slide 29 text

(gdb) run Starting program: /home/skottler/foo Program received signal SIGSEGV, Segmentation fault. 0x00007ffff7aa1614 in memset () from /lib/x86_64-linux-gnu/libc.so.6 29

Slide 30

Slide 30 text

(gdb) where #0 0x00007ffff7aa1614 in memset () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x0000000000400599 in main () at foo.c:6 30

Slide 31

Slide 31 text

(gdb) list 1 #include 2 3 int main () { 4 int foo[5], n; 5 6 memset(&n, 0, sizeof(foo)); 7 8 return 0; 9 } 31

Slide 32

Slide 32 text

(gdb) run Starting program: /home/skottler/foo [Inferior 1 (process 44204) exited normally] 32

Slide 33

Slide 33 text

33

Slide 34

Slide 34 text

Profiling 34

Slide 35

Slide 35 text

The anatomy of a memory leak 35

Slide 36

Slide 36 text

A brief detour to redis-server 36

Slide 37

Slide 37 text

#include void f(void) { int* x = malloc(10 * sizeof(int)); x[10] = 0; } int main(void) { f(); return 0; } 37

Slide 38

Slide 38 text

gcc -g -o heap heap.c 38

Slide 39

Slide 39 text

valgrind --leak-check=yes heap 39

Slide 40

Slide 40 text

==18809== Invalid write of size 4 ==18809== at 0x40054B: f (heap.c:5) ==18809== by 0x40055B: main (heap.c:9) ==18809== Address 0x51fc068 is 0 bytes after a block of size 40 alloc'd ==18809== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==18809== by 0x40053E: f (heap.c:4) ==18809== by 0x40055B: main (heap.c:9) ==18809== ==18809== ==18809== HEAP SUMMARY: ==18809== in use at exit: 40 bytes in 1 blocks ==18809== total heap usage: 1 allocs, 0 frees, 40 bytes allocated ==18809== ==18809== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1 ==18809== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==18809== by 0x40053E: f (heap.c:4) ==18809== by 0x40055B: main (heap.c:9) 40

Slide 41

Slide 41 text

void f(void) { int* x = malloc(10 * sizeof(int)); x[10] = 0; } ==18809== Invalid write of size 4 ==18809== at 0x40054B: f (heap.c:5) ==18809== by 0x40055B: main (heap.c:9) 41

Slide 42

Slide 42 text

void f(void) { int* x = malloc(10 * sizeof(int)); x[9] = 0; } 42

Slide 43

Slide 43 text

void f(void) { int* x = malloc(10 * sizeof(int)); x[10] = 0; } ==18809== Address 0x51fc068 is 0 bytes after a block of size 40 alloc'd ==18809== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==18809== by 0x40053E: f (heap.c:4) ==18809== by 0x40055B: main (heap.c:9) 43

Slide 44

Slide 44 text

void f(void) { int* x = malloc(10 * sizeof(int)); x[9] = 0; free(x); } 44

Slide 45

Slide 45 text

==18855== HEAP SUMMARY: ==18855== in use at exit: 0 bytes in 0 blocks ==18855== total heap usage: 1 allocs, 1 frees, 40 bytes allocated ==18855== ==18855== All heap blocks were freed -- no leaks are possible ==18855== ==18855== For counts of detected and suppressed errors, rerun with: -v ==18855== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) 45

Slide 46

Slide 46 text

1. Suspend your disbelief 2. Reproduce the issue 3. Compare with healthy systems 4. Figure out what's wrong. 5. Fix the problem 46

Slide 47

Slide 47 text

Thanks! @samkottler [email protected] github.com/skottler 47