Slide 1

Slide 1 text

1 Don't reboot, debug! A medic first aid course in debugging your server Joshua Thijssen @JayTaph

Slide 2

Slide 2 text

2 Joshua Thijssen Consultant and trainer Author of the Symfony Rainbow Series http://www.symfony-rainbow.com Blog: http://adayinthelifeof.nl Email: [email protected] Twitter: @jaytaph TechAnalyze www.techanalyze.io

Slide 3

Slide 3 text

3

Slide 4

Slide 4 text

3 It's not

Slide 5

Slide 5 text

3 It is It's not

Slide 6

Slide 6 text

4

Slide 7

Slide 7 text

4 Our website is not working anymore!!!

Slide 8

Slide 8 text

Have you tried turning it off and on again? 5

Slide 9

Slide 9 text

6 Manager Not a suit

Slide 10

Slide 10 text

6 Fix it! Every minute we’re losing money! Manager Not a suit

Slide 11

Slide 11 text

7

Slide 12

Slide 12 text

8 Deal now or deal later?

Slide 13

Slide 13 text

9

Slide 14

Slide 14 text

9 ➡ Is it reproducible later? Probably not.

Slide 15

Slide 15 text

9 ➡ Is it reproducible later? Probably not. ➡ Are you solving the problem, or desperately trying to remove a symptom?

Slide 16

Slide 16 text

9 ➡ Is it reproducible later? Probably not. ➡ Are you solving the problem, or desperately trying to remove a symptom? ➡ Short-term relief vs. long-term solution

Slide 17

Slide 17 text

10

Slide 18

Slide 18 text

➡ Actually analyze, maybe fix the problem. 10

Slide 19

Slide 19 text

➡ Actually analyze, maybe fix the problem. ➡ It will cost less to analyze/fix it now, than to fix it later. 10

Slide 20

Slide 20 text

➡ Actually analyze, maybe fix the problem. ➡ It will cost less to analyze/fix it now, than to fix it later. 10

Slide 21

Slide 21 text

11

Slide 22

Slide 22 text

11

Slide 23

Slide 23 text

12

Slide 24

Slide 24 text

12 ➡ We reboot our system every night!

Slide 25

Slide 25 text

12 ➡ We reboot our system every night! ➡ Why? Memory leaks? Just crappy code?

Slide 26

Slide 26 text

12 ➡ We reboot our system every night! ➡ Why? Memory leaks? Just crappy code? ➡ There is some state not handled correctly!

Slide 27

Slide 27 text

12 ➡ We reboot our system every night! ➡ Why? Memory leaks? Just crappy code? ➡ There is some state not handled correctly! ➡ What happens when the number of users increases by 200%? Restart every 12 hours?

Slide 28

Slide 28 text

12 ➡ We reboot our system every night! ➡ Why? Memory leaks? Just crappy code? ➡ There is some state not handled correctly! ➡ What happens when the number of users increases by 200%? Restart every 12 hours? ➡ Let’s hope you won’t get many visitors!

Slide 29

Slide 29 text

Find the culprit 13

Slide 30

Slide 30 text

Bottleneck Troubleshooting Flowchart (BTF) 14

Slide 31

Slide 31 text

Site is slow or not responding. It’s your DB Bottleneck Troubleshooting Flowchart (BTF) 14

Slide 32

Slide 32 text

15

Slide 33

Slide 33 text

➡ Apache / PHP 15

Slide 34

Slide 34 text

➡ Apache / PHP ➡ Monitoring / backup 15

Slide 35

Slide 35 text

➡ Apache / PHP ➡ Monitoring / backup ➡ Hanging cron jobs & runaway tools 15

Slide 36

Slide 36 text

➡ Apache / PHP ➡ Monitoring / backup ➡ Hanging cron jobs & runaway tools ➡ Connectivity / DNS problems 15

Slide 37

Slide 37 text

16 Linux 101

Slide 38

Slide 38 text

17 Processes

Slide 39

Slide 39 text

18

Slide 40

Slide 40 text

18 ➡ Isolated userspace.

Slide 41

Slide 41 text

18 ➡ Isolated userspace. ➡ PID (process id) and state.

Slide 42

Slide 42 text

18 ➡ Isolated userspace. ➡ PID (process id) and state. ➡ Kernel “preempts”, or process yields.

Slide 43

Slide 43 text

18 ➡ Isolated userspace. ➡ PID (process id) and state. ➡ Kernel “preempts”, or process yields. ➡ Multitasking.

Slide 44

Slide 44 text

19

Slide 45

Slide 45 text

20

Slide 46

Slide 46 text

20 ➡ R Running or runnable

Slide 47

Slide 47 text

20 ➡ R Running or runnable ➡ S Interruptible sleep

Slide 48

Slide 48 text

20 ➡ R Running or runnable ➡ S Interruptible sleep ➡ D Uninterruptible sleep

Slide 49

Slide 49 text

20 ➡ R Running or runnable ➡ S Interruptible sleep ➡ D Uninterruptible sleep ➡ T Stopped

Slide 50

Slide 50 text

20 ➡ R Running or runnable ➡ S Interruptible sleep ➡ D Uninterruptible sleep ➡ T Stopped ➡ Z Defunct process (zombies)
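
The state letters above are what tools such as ps and top show in their state column. A minimal Python sketch, assuming a hypothetical list of state strings (the sample values are invented for illustration, shaped like the STAT column of ps), that tallies processes per state:

```python
from collections import Counter

# State letters as listed on the slide.
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "defunct (zombie)",
}


def count_states(stat_column):
    """Tally process states from an iterable of STAT strings.

    Only the first character is the state; ps may append modifier
    characters (for example 's' or '+'), which are ignored here.
    """
    counts = Counter()
    for stat in stat_column:
        stat = stat.strip()
        if not stat:
            continue
        counts[STATE_NAMES.get(stat[0], "other")] += 1
    return counts


# Hypothetical sample, invented for illustration.
sample = ["Ss", "S", "R+", "D", "Z", "S<", "T"]
states = count_states(sample)
for name, n in sorted(states.items()):
    print(f"{name}: {n}")
```

A growing count of D or Z entries over time is usually the thing worth investigating; a large S count is normal.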

Slide 51

Slide 51 text

21

Slide 52

Slide 52 text

21 ➡ Most processes are sleeping.

Slide 53

Slide 53 text

21 ➡ Most processes are sleeping. ➡ External processes (and the kernel) can “wake up” a process at any time by sending “signals”.

Slide 54

Slide 54 text

21 ➡ Most processes are sleeping. ➡ External processes (and the kernel) can “wake up” a process at any time by sending “signals”. ➡ Fire signals with “kill”.
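
To make the signal mechanism concrete, here is a small Python sketch, assuming a Unix-like system where SIGUSR1 exists. The process installs a handler and then sends the signal to itself, which is what `kill -USR1` followed by the PID would do from a shell. The handler name `on_usr1` is illustrative only.

```python
import os
import signal

received = []


def on_usr1(signum, frame):
    # Runs when the process receives SIGUSR1.
    received.append(signal.Signals(signum).name)


# Install the handler, then send the signal to ourselves.
signal.signal(signal.SIGUSR1, on_usr1)
os.kill(os.getpid(), signal.SIGUSR1)

print(received)
```

Signals without a handler fall back to their default action, which for many signals is to terminate the process.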

Slide 55

Slide 55 text

22

Slide 56

Slide 56 text

22 ➡ Uninterruptible means it won’t handle signals (directly), but waits on its task to finish (it must wake up by itself).

Slide 57

Slide 57 text

22 ➡ Uninterruptible means it won’t handle signals (directly), but waits on its task to finish (it must wake up by itself). ➡ Used for high-performance loops that need to focus (like I/O).

Slide 58

Slide 58 text

22 ➡ Uninterruptible means it won’t handle signals (directly), but waits on its task to finish (it must wake up by itself). ➡ Used for high-performance loops that need to focus (like I/O). ➡ Still can be preempted by the scheduler!

Slide 59

Slide 59 text

23

Slide 60

Slide 60 text

23 ➡ Zombies aren’t bad.

Slide 61

Slide 61 text

23 ➡ Zombies aren’t bad. ➡ It’s just bad programming or administration that creates zombies.

Slide 62

Slide 62 text

23 ➡ Zombies aren’t bad. ➡ It’s just bad programming or administration that creates zombies. ➡ They will not eat brains (at least not much).

Slide 63

Slide 63 text

23 ➡ Zombies aren’t bad. ➡ It’s just bad programming or administration that creates zombies. ➡ They will not eat brains (at least not much). ➡ But there shouldn’t be many.

Slide 64

Slide 64 text

24 Load average

Slide 65

Slide 65 text

25

Slide 66

Slide 66 text

25 ➡ 1 minute, 5 minutes, 15 minutes averages

Slide 67

Slide 67 text

25 ➡ 1 minute, 5 minutes, 15 minutes averages ➡ Calculated as the number of runnable processes.

Slide 68

Slide 68 text

25 ➡ 1 minute, 5 minutes, 15 minutes averages ➡ Calculated as the number of runnable processes. ➡ Depends on number of CPUs!

Slide 69

Slide 69 text

26 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27

Slide 70

Slide 70 text

26 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27 ➡ 1.52 average runnable (or blocking) processes in the last minute.

Slide 71

Slide 71 text

26 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27 ➡ 1.52 average runnable (or blocking) processes in the last minute. ➡ 0.66 average in 5 minutes

Slide 72

Slide 72 text

26 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27 ➡ 1.52 average runnable (or blocking) processes in the last minute. ➡ 0.66 average in 5 minutes ➡ 0.27 average in 15 minutes.

Slide 73

Slide 73 text

26 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27 ➡ 1.52 average runnable (or blocking) processes in the last minute. ➡ 0.66 average in 5 minutes ➡ 0.27 average in 15 minutes. ➡ Single CPU: 52% more than it can handle.

Slide 74

Slide 74 text

26 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27 ➡ 1.52 average runnable (or blocking) processes in the last minute. ➡ 0.66 average in 5 minutes ➡ 0.27 average in 15 minutes. ➡ Single CPU: 52% more than it can handle. ➡ quad core system: not doing very much
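
The single CPU versus quad core reading can be checked with a little arithmetic: divide each load average by the number of CPUs. A minimal Python sketch, using the uptime line from the slide; the function names are illustrative:

```python
import re


def parse_load_averages(uptime_line):
    """Extract the 1, 5 and 15 minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_per_cpu(uptime_line, cpu_count):
    """Divide each load average by the number of CPUs."""
    return tuple(load / cpu_count for load in parse_load_averages(uptime_line))


line = "14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27"
single = load_per_cpu(line, 1)
quad = load_per_cpu(line, 4)
print("single CPU:", single)
print("quad core: ", quad)
```

A per-CPU value above 1.0 means more work is queued than the machine can run at once. On a live system, `os.getloadavg()` and `os.cpu_count()` from the Python standard library provide the same inputs without parsing.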

Slide 75

Slide 75 text

27 Memory

Slide 76

Slide 76 text

28 Q: How much memory does this process use? This is a REALLY hard question to answer! It depends on many factors!

Slide 77

Slide 77 text

29 ➡ Virtual memory ➡ Shared memory ➡ Resident memory ➡ Swapped memory

Slide 78

Slide 78 text

30

Slide 79

Slide 79 text

30 ➡ 4GB memory space, even if you have less memory installed.

Slide 80

Slide 80 text

30 ➡ 4GB memory space, even if you have less memory installed. ➡ 1GB is reserved for kernel.

Slide 81

Slide 81 text

30 ➡ 4GB memory space, even if you have less memory installed. ➡ 1GB is reserved for kernel. ➡ Kernel can swap out memory.

Slide 82

Slide 82 text

30 ➡ 4GB memory space, even if you have less memory installed. ➡ 1GB is reserved for kernel. ➡ Kernel can swap out memory. ➡ CPU pagefaults and loads back pages.

Slide 83

Slide 83 text

31

Slide 84

Slide 84 text

31 ➡ Process can allocate memory, but does not necessarily use it (for instance: preallocation)

Slide 85

Slide 85 text

31 ➡ Process can allocate memory, but does not necessarily use it (for instance: preallocation) ➡ VIRT will increase!
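
On Linux the gap between allocated and actually used memory is visible in /proc/<pid>/status, where VmSize corresponds to what top shows as VIRT and VmRSS to RES. A minimal Python sketch that reads those two fields; the sample text and its numbers are hypothetical and only mimic the file's layout:

```python
def parse_status_kb(status_text, keys=("VmSize", "VmRSS")):
    """Return selected memory fields (in kB) from /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            # Lines look like "VmRSS:      5120 kB".
            values[name] = int(rest.split()[0])
    return values


# Hypothetical excerpt, numbers invented for illustration.
sample_status = """\
Name:   php
VmSize:   262144 kB
VmRSS:      5120 kB
"""

mem = parse_status_kb(sample_status)
print("virtual (VIRT): %d kB" % mem["VmSize"])
print("resident (RES): %d kB" % mem["VmRSS"])
print("not resident:   %d kB" % (mem["VmSize"] - mem["VmRSS"]))
```

On a Linux machine, passing the contents of /proc/self/status instead of the sample shows the values for the running interpreter.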

Slide 86

Slide 86 text

32

Slide 87

Slide 87 text

33 Q: How much free memory does this system have? This is an easier, but still hard question to answer!

Slide 88

Slide 88 text

$ free -m
             total       used       free     shared    buffers     cached
Mem:           375        349         25          0        111         94
-/+ buffers/cache:        143        231
Swap:          400          7        392
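
Buffers and cache can be reclaimed by the kernel when applications need memory, so the "free" column alone understates what is available. A minimal Python sketch of that arithmetic, assuming the older `free` layout shown on the slide (newer versions of free use different columns):

```python
def parse_free_mem_line(free_output):
    """Parse the 'Mem:' row of the older `free` layout shown on the slide."""
    columns = ("total", "used", "free", "shared", "buffers", "cached")
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            numbers = [int(field) for field in line.split()[1:]]
            return dict(zip(columns, numbers))
    raise ValueError("no 'Mem:' row found")


free_output = """\
             total       used       free     shared    buffers     cached
Mem:           375        349         25          0        111         94
-/+ buffers/cache:        143        231
Swap:          400          7        392
"""

mem = parse_free_mem_line(free_output)
reclaimable = mem["buffers"] + mem["cached"]
effectively_free = mem["free"] + reclaimable
print("free as reported:        %d MB" % mem["free"])
print("free + buffers + cached: %d MB" % effectively_free)
```

The sum differs by one from the 231 in the "-/+ buffers/cache" row, presumably because free rounds each column to whole megabytes separately.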

Slide 89

Slide 89 text

35 Monitoring

Slide 90

Slide 90 text

36 ➡ Monitor everything ➡ System / infra monitoring ➡ Application monitoring

Slide 91

Slide 91 text

37

Slide 92

Slide 92 text

38

Slide 93

Slide 93 text

39 Logging

Slide 94

Slide 94 text

40 ➡ Log EVERYTHING from all sources. ➡ Filter later.

Slide 95

Slide 95 text

41 $ php composer.phar require monolog/monolog ➡ syslog ➡ files ➡ mail ➡ slack / hipchat / irc ➡ logstash

Slide 96

Slide 96 text

42 System tools

Slide 97

Slide 97 text

TAIL 43

Slide 98

Slide 98 text

44

Slide 99

Slide 99 text

44 ➡ Most daemons will log into /var/log/*

Slide 100

Slide 100 text

44 ➡ Most daemons will log into /var/log/* ➡ tail -f /var/log/messages

Slide 101

Slide 101 text

44 ➡ Most daemons will log into /var/log/* ➡ tail -f /var/log/messages ➡ Many times, this is ALL you need!
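
What `tail -f` does can be approximated in a few lines: remember how far you have read and return only what was appended since. A minimal Python sketch under that assumption; it ignores log rotation and partially written lines, and the temporary file merely stands in for a real log file:

```python
import os
import tempfile


def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for data appended to path since offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines, offset + len(data)


# Demonstration with a temporary file standing in for a log file.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as log:
    log.write("daemon started\n")
    log_path = log.name

first, position = read_new_lines(log_path)
with open(log_path, "a") as log:
    log.write("connection refused\n")
second, position = read_new_lines(log_path, position)
os.remove(log_path)

print(first)
print(second)
```

Calling `read_new_lines` in a loop with a short sleep gives a crude follow mode that can also filter or count lines as they arrive.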

Slide 102

Slide 102 text

45

Slide 103

Slide 103 text

45 ➡ Know your tools (top, htop, vmstat, iostat, ps)

Slide 104

Slide 104 text

45 ➡ Know your tools (top, htop, vmstat, iostat, ps) ➡ Know the /proc filesystem

Slide 105

Slide 105 text

45 ➡ Know your tools (top, htop, vmstat, iostat, ps) ➡ Know the /proc filesystem ➡ sniff around with tcpdump, netstat, nc etc...

Slide 106

Slide 106 text

45 ➡ Know your tools (top, htop, vmstat, iostat, ps) ➡ Know the /proc filesystem ➡ sniff around with tcpdump, netstat, nc etc... ➡ man

Slide 107

Slide 107 text

46 strace

Slide 108

Slide 108 text

47 ➡ strace displays system calls and signals ➡ Communication between applications and the kernel.

Slide 109

Slide 109 text

$ strace -ff -p ....
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
write(20, "get ez_client1/acls/"..., 44) = 44
read(20, "END\r\n", 8196) = 5
write(20, "get ez_client1/acl/g"..., 40) = 40
read(20, "END\r\n", 8196) = 5
write(20, "quit\r\n", 6) = 6
shutdown(20, 2 /* send and receive */) = 0
close(20) = 0
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/client1/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/client1/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/client1/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/client1/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/client1/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/client1/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/client1/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/client1/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
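
A trace like this gets long quickly, so it can help to summarize which system calls fail and why. A minimal Python sketch, assuming strace's usual one-call-per-line text format; the sample lines are copied from the trace on this slide and the regular expression is a simplification, not a full strace parser:

```python
import re
from collections import Counter

# Matches lines such as: access("/path", F_OK) = -1 ENOENT (No such file ...)
FAILED_CALL = re.compile(r"^(\w+)\(.*\)\s+=\s+-1\s+([A-Z]+)")


def count_failed_calls(strace_lines):
    """Count (syscall, errno) pairs for calls that returned -1."""
    failures = Counter()
    for line in strace_lines:
        match = FAILED_CALL.match(line.strip())
        if match:
            failures[match.groups()] += 1
    return failures


# A few lines taken from the trace on this slide.
sample = [
    'mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)',
    'chmod("/tmp/smarty", 0777) = 0',
    'access("/userdata/client1/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)',
    'access("/userdata/client1/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)',
    'read(20, "END\\r\\n", 8196) = 5',
]

failures = count_failed_calls(sample)
for (syscall, errno_name), n in failures.most_common():
    print(syscall, errno_name, n)
```

In the full trace, the repeated ENOENT results from access() show the application probing many template paths that do not exist before giving up.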

Slide 110

Slide 110 text

$ strace ping www.google.com
....
mprotect(0xb757f000, 4096, PROT_READ) = 0
munmap(0xb76d8000, 44104) = 0
stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
gettimeofday({1347446161, 382120}, NULL) = 0
poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
poll([{fd=3, events=POLLIN}], 1, 5000

Slide 111

Slide 111 text

➡ strace -e trace=open ➡ strace -ff -p <pid>

Slide 112

Slide 112 text

51 vmstat

Slide 113

Slide 113 text

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy  id wa
 0  0    896  17208 124784 121964    0    0     0     0    73   159  0  1  99  0
 2  0    896  14968 125628 121976    0    0   844     0  6857 29935  5 58  35  2
 2  0    896  14472 125628 121964    0    0     0     0 10691 48526 10 90   0  0
 2  0    896  13976 126120 121952    0    0   476   252 10430 49144  7 91   0  2
 0  0    896  12744 127016 121968    0    0   896     0  7799 38732  3 71  23  3
 0  0    896  12744 127016 121964    0    0     0     0    30    93  0  0 100  0
 0  0    896  12760 127016 121964    0    0     0     0    30    92  0  0 100  0
 0  0    896  12760 127016 121964    0    0     0     0    29    99  0  1  99  0
 0  0    896  12760 127016 121964    0    0     0     0    32   110  0  0 100  0
 0  0    896  12752 127024 121964    0    0     0   324    33   108  0  0  99  1
 0  0    896  12752 127024 121964    0    0     0     0    30   103  0  0 100  0
 0  0    896  12752 127032 121964    0    0     0    12    34   108  0  0 100  0
 0  0    896  12752 127032 121964    0    0     0     0    30   105  0  0 100  0
 0  0    896  12752 127032 121964    0    0     0   236    80   101  0  1  99  0
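
The interesting samples in this output are the ones where system (kernel) CPU time and context switches spike together. A minimal Python sketch that picks them out, using the first six samples from the slide; the 50% threshold is an arbitrary assumption chosen for illustration:

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat_rows(rows):
    """Turn whitespace-separated vmstat rows into dicts keyed by column."""
    parsed = []
    for row in rows:
        fields = row.split()
        if len(fields) != len(VMSTAT_COLUMNS) or not fields[0].isdigit():
            continue  # skip header or malformed lines
        parsed.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return parsed


def busy_samples(samples, sy_threshold=50):
    """Return samples where system CPU time exceeds the threshold."""
    return [s for s in samples if s["sy"] > sy_threshold]


# First six samples from the vmstat output on this slide.
rows = [
    "r b swpd free buff cache si so bi bo in cs us sy id wa",
    "0 0 896 17208 124784 121964 0 0 0 0 73 159 0 1 99 0",
    "2 0 896 14968 125628 121976 0 0 844 0 6857 29935 5 58 35 2",
    "2 0 896 14472 125628 121964 0 0 0 0 10691 48526 10 90 0 0",
    "2 0 896 13976 126120 121952 0 0 476 252 10430 49144 7 91 0 2",
    "0 0 896 12744 127016 121968 0 0 896 0 7799 38732 3 71 23 3",
    "0 0 896 12744 127016 121964 0 0 0 0 30 93 0 0 100 0",
]

samples = parse_vmstat_rows(rows)
busy = busy_samples(samples)
for s in busy:
    print("sy=%d%% cs=%d r=%d" % (s["sy"], s["cs"], s["r"]))
```

High sy combined with tens of thousands of context switches per second points at work done in the kernel on behalf of processes rather than at user-space computation.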

Slide 114

Slide 114 text

53 SystemTap / DTrace

Slide 115

Slide 115 text

54 ➡ Unobtrusive probes inside the kernel ➡ Scripts written in D language. ➡ SUN / Solaris only (licensing)

Slide 116

Slide 116 text

55 ➡ “GPL” version of dtrace ➡ Awesome, but complex ➡ But you need / want debug info packages

Slide 117

Slide 117 text

probe syscall.open {
    printf("%s(%d) open (%s)\n", execname(), pid(), argstr);
}

$ stap syscall.stp

Slide 118

Slide 118 text

57 ➡ There are some “providers” in the PHP core (zend_dtrace.{c,h,d})

Slide 119

Slide 119 text

58 ➡ Valgrind ➡ GDB ➡ XDebug / profiler ➡ MySQL proxy

Slide 120

Slide 120 text

59 Think about your app / infra BEFORE going live...

Slide 121

Slide 121 text

Make a plan 60

Slide 122

Slide 122 text

61

Slide 123

Slide 123 text

61 ➡ Design for (vertical) scalability.

Slide 124

Slide 124 text

61 ➡ Design for (vertical) scalability. ➡ Remove SPOFs.

Slide 125

Slide 125 text

61 ➡ Design for (vertical) scalability. ➡ Remove SPOFs. ➡ Horizontal scalability is easier, but more restrictive.

Slide 126

Slide 126 text

61 ➡ Design for (vertical) scalability. ➡ Remove SPOFs. ➡ Horizontal scalability is easier, but more restrictive. ➡ Configuration is key.

Slide 127

Slide 127 text

61 ➡ Design for (vertical) scalability. ➡ Remove SPOFs. ➡ Horizontal scalability is easier, but more restrictive. ➡ Configuration is key. ➡ Don’t run at full capacity. Have a contingency buffer for peaks and know how to scale out.

Slide 128

Slide 128 text

62

Slide 129

Slide 129 text

62 ➡ One machine for one purpose (app / mail / cron / db / etc).

Slide 130

Slide 130 text

62 ➡ One machine for one purpose (app / mail / cron / db / etc). ➡ Virtual machines are easy to set up and maintain (puppet / ansible) and are cheap.

Slide 131

Slide 131 text

62 ➡ One machine for one purpose (app / mail / cron / db / etc). ➡ Virtual machines are easy to set up and maintain (puppet / ansible) and are cheap. ➡ Try to async as much as possible.

Slide 132

Slide 132 text

62 ➡ One machine for one purpose (app / mail / cron / db / etc). ➡ Virtual machines are easy to set up and maintain (puppet / ansible) and are cheap. ➡ Try to async as much as possible. ➡ Message queues are easy to implement (gearman / *MQ etc).
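
The idea behind "async as much as possible" is that the request path only enqueues work and a separate worker does the slow part. A minimal Python sketch of that pattern, using an in-process queue and thread purely as a stand-in for a real broker such as Gearman; the job contents and names are hypothetical:

```python
import queue
import threading

jobs = queue.Queue()
sent = []


def worker():
    """Consume jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        sent.append("mail to %s" % job)  # stand-in for slow work
        jobs.task_done()


thread = threading.Thread(target=worker)
thread.start()

# The "web request" only enqueues and returns immediately.
for address in ["alice@example.com", "bob@example.com"]:
    jobs.put(address)

jobs.put(None)  # tell the worker to stop
jobs.join()
thread.join()
print(sent)
```

With a real message queue the worker runs on another machine, which also fits the one-machine-one-purpose advice on this slide.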

Slide 133

Slide 133 text

63 Recap Conclusion

Slide 134

Slide 134 text

64

Slide 135

Slide 135 text

65

Slide 136

Slide 136 text

65 ➡ Don’t reboot, debug!

Slide 137

Slide 137 text

65 ➡ Don’t reboot, debug! ➡ Analyze what’s going on,

Slide 138

Slide 138 text

65 ➡ Don’t reboot, debug! ➡ Analyze what’s going on, ➡ find and isolate the culprit.

Slide 139

Slide 139 text

65 ➡ Don’t reboot, debug! ➡ Analyze what’s going on, ➡ find and isolate the culprit. ➡ Treat the problem, not the symptoms.

Slide 140

Slide 140 text

66

Slide 141

Slide 141 text

66 ➡ There are many tools out there to analyze your system in real time.

Slide 142

Slide 142 text

66 ➡ There are many tools out there to analyze your system in real time. ➡ Know your running environment (even if it’s “not your business”).

Slide 143

Slide 143 text

66 ➡ There are many tools out there to analyze your system in real time. ➡ Know your running environment (even if it’s “not your business”). ➡ Ask for third-party help if needed.

Slide 144

Slide 144 text

67

Slide 145

Slide 145 text

Find me on twitter: @jaytaph Find me for development and training: www.noxlogic.nl Find me on email: [email protected] Find me for blogs: www.adayinthelifeof.nl Thank You! https://joind.in/14205