Slide 1

Slide 1 text

Don’t reboot! debug!
Joshua Thijssen
Techademy Workshop - dd-mmm-YYYY
http://www.techademy.nl
http://joind.in/xxxx

Slide 2

Slide 2 text

.whoami
Joshua Thijssen / The Netherlands
Freelance consultant, developer and trainer @ NoxLogic / TechAdemy
Development in PHP, Python, Perl, C, Java.. also sysadmin.
Lead developer of Saffire
Blog: http://adayinthelifeof.nl
Email: [email protected]
Twitter: @jaytaph

Slide 3

Slide 3 text

The most wrong / incorrect / irritating / annoying / stupendous / evil / improper / unethical / immoral / unjust / wicked / inaccurate question you can ask:

Slide 4

Slide 4 text

Have you tried turning it off and on again?

Slide 5

Slide 5 text


Slide 6

Slide 6 text


Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text


Slide 10

Slide 10 text

Have you tried turning it off and on again?

Slide 11

Slide 11 text


Slide 12

Slide 12 text


Slide 13

Slide 13 text

Fix it! Every minute we’re losing money!

Slide 14

Slide 14 text


Slide 15

Slide 15 text

Deal now or deal later?

Slide 16

Slide 16 text

We will deal with the problem later!
➡ Is it reproducible later? Probably not.
➡ Are you solving the problem, or desperately trying to remove a symptom?
➡ Short-term relief vs. long-term solution

Slide 17

Slide 17 text

Deal with the problem now
➡ Actually analyze, maybe fix the problem.
➡ It will cost less to analyze/fix it now than to fix it later.
➡ You just saved a few gazillion dollars!

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

➡ We reboot our system every night!
➡ Why? Memory leaks? Just crappy code?
➡ There is some state not handled correctly! Fix it!
➡ What happens when the number of users increases by 200%? Restart every 12 hours?
➡ Here’s hoping you never get many visitors!

Slide 20

Slide 20 text

Find the culprit

Slide 21

Slide 21 text

Bottleneck Troubleshooting Flowchart (BTF)
Site is slow or not responding? It’s your DB.

Slide 22

Slide 22 text

MySQL

Slide 23

Slide 23 text

MySQL is easy to set up and configure
➡ We use MySQL because it’s so easy to set up and use.
➡ No, it’s not...

Slide 24

Slide 24 text

my.cnf
max_heap_table_size = 16M
tmp_table_size = 32M
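Why this pair of settings is a trap: for MEMORY-engine internal temporary tables, the MySQL documentation states that the in-memory limit is the smaller of `tmp_table_size` and `max_heap_table_size`, so the 32M above is effectively capped at 16M. A minimal Python sketch of that rule follows; the helper names `parse_size` and `effective_tmp_limit` are made up for illustration and are not part of MySQL.

```python
# Hypothetical helpers illustrating the "smaller of the two" rule.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a my.cnf style size such as '16M' into bytes."""
    value = value.strip().upper()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def effective_tmp_limit(settings: dict) -> int:
    """Return the limit that actually applies to in-memory temporary tables."""
    return min(
        parse_size(settings["tmp_table_size"]),
        parse_size(settings["max_heap_table_size"]),
    )


settings = {"max_heap_table_size": "16M", "tmp_table_size": "32M"}
limit = effective_tmp_limit(settings)
print(f"effective in-memory temp table limit: {limit // 1024 ** 2}M")
```

Raising only one of the two values therefore changes nothing; temporary tables larger than the effective limit are written to disk.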

Slide 25

Slide 25 text

Other usual suspects
➡ Apache / PHP
➡ Monitoring / backup
➡ Hanging cron jobs
➡ Runaway tools
➡ Connectivity / DNS problems

Slide 26

Slide 26 text

Linux 101

Slide 27

Slide 27 text

Processes

Slide 28

Slide 28 text

Processes
➡ Isolated userspace
➡ PID and state.
➡ Kernel “preempts”, or process yields.
➡ Multitasking

Slide 29

Slide 29 text


Slide 30

Slide 30 text

Process states
➡ R Running or runnable
➡ S Interruptible sleep
➡ D Uninterruptible sleep
➡ T Stopped
➡ Z Defunct process (zombies)
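On Linux the state letter listed above is the third field of `/proc/<pid>/stat`. The sketch below shows one way to pull it out of such a line; the function name `parse_stat_line` and the sample line are invented for illustration. The command name is wrapped in parentheses and may itself contain spaces, which is why the split is done on the last `") "`.

```python
# Meanings taken from the slide above.
STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "defunct (zombie)",
}


def parse_stat_line(line: str):
    """Return (pid, command, state letter) from a /proc/<pid>/stat line."""
    pid_part, rest = line.split(" (", 1)
    command, after = rest.rsplit(") ", 1)  # command may contain spaces
    state = after.split()[0]
    return int(pid_part), command, state


# Hypothetical, shortened sample line (real lines have many more fields).
sample = "1234 (php-fpm: pool www) S 1 1234 1234 0 -1 4194560"
pid, command, state = parse_stat_line(sample)
print(pid, command, state, "=", STATES.get(state, "unknown"))
```

On a live system the same function could be fed `open(f"/proc/{pid}/stat").read()`; `ps` and `top` show the same letter in their STAT / S columns.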

Slide 31

Slide 31 text

Process states
➡ Most processes are sleeping.
➡ External processes (and the kernel) can “wake up” a process at any time.
➡ Send signals with “kill”.

Slide 32

Slide 32 text

Process states
➡ Uninterruptible means it won’t handle signals (directly).
➡ Used for operations that must not be interrupted midway (typically waiting on I/O).
➡ Still can be preempted by the scheduler!

Slide 33

Slide 33 text

Zombies
➡ Zombies aren’t bad.
➡ It’s just bad programming or administration that creates zombies.
➡ They will not eat brains (at least not much).
➡ But there shouldn’t be many.

Slide 34

Slide 34 text

Load average

Slide 35

Slide 35 text

Load average
➡ 1 minute, 5 minutes, 15 minutes averages
➡ Calculated as the number of runnable processes.
➡ How to read it depends on the number of CPUs!
➡ Linux also counts processes in uninterruptible sleep.

Slide 36

Slide 36 text

Load average (./uptime)
14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27
➡ 1.52 average runnable (or blocking) processes in the last minute.
➡ 0.66 average in 5 minutes.
➡ 0.27 average in 15 minutes.
➡ Single CPU: 52% more than it can handle.
➡ Quad-core system: not doing very much.
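The single-CPU versus quad-core reading above is just the load divided by the number of CPUs. A small sketch, assuming the `uptime` line has the English "load average:" label and dot decimals as on the slide; the function names are made up for illustration.

```python
def parse_loadavg(uptime_line: str):
    """Extract the 1, 5 and 15 minute load averages from an uptime line."""
    tail = uptime_line.rsplit("load average:", 1)[1]
    return tuple(float(part) for part in tail.split(","))


def load_per_cpu(load: float, cpus: int) -> float:
    """Normalise a load average by the number of CPUs."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return load / cpus


line = "14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27"
one_minute, five_minutes, fifteen_minutes = parse_loadavg(line)

for cpus in (1, 4):
    ratio = load_per_cpu(one_minute, cpus)
    verdict = "overloaded" if ratio > 1 else "has headroom"
    print(f"{cpus} CPU(s): {ratio:.2f} per CPU, {verdict}")
```

A value above 1.0 per CPU means more work is queued than the machine can run at once; on the quad-core box the same 1.52 is only 0.38 per CPU.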

Slide 37

Slide 37 text

Memory

Slide 38

Slide 38 text

Memory
Q: How much memory does this process use?
This is a REALLY hard question to answer! It depends on many factors!

Slide 39

Slide 39 text

Memory
➡ 4GB virtual address space per process (on 32-bit), even if you have less memory installed
➡ Kernel can swap out memory
➡ CPU page-faults and the kernel loads the pages back in

Slide 40

Slide 40 text

Memory
➡ A process can allocate memory, but does not necessarily use it (for instance: preallocation)
➡ VIRT will increase!

Slide 41

Slide 41 text

Memory (as seen in “top”)
➡ Virtual memory
➡ Resident memory
➡ Shared memory
➡ Swapped memory
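The virtual, resident and swapped figures that `top` shows can also be read per process from `/proc/<pid>/status` on Linux (fields `VmSize`, `VmRSS`, `VmSwap`). The sketch below parses such text; the function name `parse_status_kb` and the sample values are invented for illustration.

```python
def parse_status_kb(status_text: str) -> dict:
    """Collect all 'Name:  <number> kB' fields from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB":
            fields[key.strip()] = int(parts[0])
    return fields


# Hypothetical excerpt; the numbers are made up.
sample_status = """\
Name:   php
State:  S (sleeping)
VmSize:   412500 kB
VmRSS:     38200 kB
VmSwap:     1200 kB
"""

memory = parse_status_kb(sample_status)
resident_share = memory["VmRSS"] / memory["VmSize"]
print(f"virtual {memory['VmSize']} kB, resident {memory['VmRSS']} kB "
      f"({resident_share:.0%} of virtual), swapped {memory['VmSwap']} kB")
```

In a case like this only a small part of the virtual size is actually resident, which is why VIRT alone says little about how much RAM a process really occupies.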

Slide 42

Slide 42 text

Memory
Q: How much free memory does this system have?
This is an easier, but still hard question to answer!

Slide 43

Slide 43 text

$ free -m
             total       used       free     shared    buffers     cached
Mem:           375        349         25          0        111         94
-/+ buffers/cache:        143        231
Swap:          400          7        392
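The "-/+ buffers/cache" row is the one that answers the question: buffers and cache are given back when applications need memory, so they are subtracted from "used" and added to "free". A sketch of that arithmetic with the numbers from the output above; the function names are made up for illustration.

```python
def used_by_applications(used: int, buffers: int, cached: int) -> int:
    """Memory really held by applications: used minus reclaimable buffers/cache."""
    return used - buffers - cached


def available_to_applications(free: int, buffers: int, cached: int) -> int:
    """Memory applications could still get: free plus reclaimable buffers/cache."""
    return free + buffers + cached


# Values in MB from the "Mem:" row of the free -m output above.
used, free, buffers, cached = 349, 25, 111, 94

print("used by applications:     ", used_by_applications(used, buffers, cached), "MB")
print("available to applications:", available_to_applications(free, buffers, cached), "MB")
```

The results (144 and 230) differ by one from the 143 and 231 that `free` printed, because each column in the `-m` output was rounded to whole megabytes separately. Either way, the system has roughly 230 MB available, not 25 MB.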

Slide 44

Slide 44 text

Monitoring

Slide 45

Slide 45 text

➡ Monitor everything
➡ System / infra monitoring
➡ Application monitoring

Slide 46

Slide 46 text


Slide 47

Slide 47 text


Slide 48

Slide 48 text

Logging

Slide 49

Slide 49 text

➡ Log EVERYTHING from all sources.
➡ Filter later.

Slide 50

Slide 50 text

Logstash Graylog2 wtf

Slide 51

Slide 51 text

System tools

Slide 52

Slide 52 text

TAIL

Slide 53

Slide 53 text

➡ Most daemons will log into /var/log/*
➡ tail -f /var/log/messages
➡ Many times, this is ALL you need!
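What `tail -f` does can be approximated by remembering how far you have read and returning only what was appended since. The sketch below is a simplified illustration, not how `tail` is implemented; the function name `read_new_lines` is made up, and a temporary file stands in for a real log such as /var/log/messages.

```python
import os
import tempfile


def read_new_lines(path: str, offset: int = 0):
    """Return (lines appended after byte `offset`, new offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    lines = chunk.decode("utf-8", errors="replace").splitlines()
    return lines, offset + len(chunk)


# Demo with a throwaway file instead of a real log.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as log:
    log.write("boot ok\n")
    demo_path = log.name

first, position = read_new_lines(demo_path)

with open(demo_path, "a") as log:
    log.write("disk full\n")

second, position = read_new_lines(demo_path, position)
os.remove(demo_path)

print(first, second)
```

Calling the function in a loop with a short sleep gives a crude follow mode; log rotation and truncation are deliberately ignored here.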

Slide 54

Slide 54 text

➡ Know your tools (top, htop, vmstat, iostat, ps)
➡ Know the /proc filesystem
➡ Sniff around with tcpdump, netstat, nc etc...
➡ man

Slide 55

Slide 55 text

strace

Slide 56

Slide 56 text

➡ strace displays system calls and signals
➡ Communication between applications and the kernel.

Slide 57

Slide 57 text

strace -ff -p
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
write(20, "get ez_borentappenschuren-/acls/"..., 44) = 44
read(20, "END\r\n", 8196) = 5
write(20, "get ez_borentappenschuren-/acl/g"..., 40) = 40
read(20, "END\r\n", 8196) = 5
write(20, "quit\r\n", 6) = 6
shutdown(20, 2 /* send and receive */) = 0
close(20) = 0
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/borentappenschuren/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/borentappenschuren/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/borentappenschuren/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/borentappenschuren/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/borentappenschuren/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/borentappenschuren/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
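A trace like this one gets long quickly, and the interesting part is usually which calls keep failing. The sketch below counts failed calls per (syscall, errno) pair in saved strace text. It is a rough illustration: the regular expression and the name `count_failures` are assumptions made here, only the simple `name(args) = -1 ERRNO` line shape is handled, and the sample is a handful of lines copied from the output above.

```python
import re
from collections import Counter

# Matches lines such as: mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
FAILED_CALL = re.compile(r"^(\w+)\(.*\)\s*=\s*-1\s+([A-Z0-9]+)")


def count_failures(strace_output: str) -> Counter:
    """Count failed system calls, keyed by (syscall name, errno name)."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = FAILED_CALL.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = """\
close(20) = 0
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
"""

failures = count_failures(sample)
for (syscall, errno_name), count in failures.most_common():
    print(f"{count:3d}  {syscall}  {errno_name}")
```

Run over the full trace, a tally like this makes the repeated template lookups and the redundant mkdir("/tmp/smarty") calls stand out.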

Slide 58

Slide 58 text

strace ping www.google.com
mprotect(0xb757f000, 4096, PROT_READ) = 0
munmap(0xb76d8000, 44104) = 0
stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
gettimeofday({1347446161, 382120}, NULL) = 0
poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
poll([{fd=3, events=POLLIN}], 1, 5000

Slide 59

Slide 59 text

➡ strace -e trace=open
➡ strace -ff -p

Slide 60

Slide 60 text

ltrace

Slide 61

Slide 61 text

➡ Traces library calls
➡ Grep and filter.
➡ Careful: ltrace php -r 'echo "hello world";' outputs 92476 lines!

Slide 62

Slide 62 text

SystemTap / DTrace

Slide 63

Slide 63 text

DTrace
➡ Unobtrusive probes inside the kernel
➡ Scripts written in the D language.
➡ Originally Sun / Solaris; not in the Linux kernel (licensing)

Slide 64

Slide 64 text

systemtap
➡ “GPL” version of dtrace
➡ Awesome, but complex
➡ But you need / want debug info packages

Slide 65

Slide 65 text

systemtap example
# syscall.stp: report every open() call with process name, PID and arguments
probe syscall.open {
  printf("%s(%d) open (%s)\n", execname(), pid(), argstr);
}
stap syscall.stp

Slide 66

Slide 66 text

systemtap / dtrace
➡ There are some “providers” in the PHP core (zend_dtrace.{c,h,d})

Slide 67

Slide 67 text

Other really cool tools to look at
➡ Valgrind
➡ GDB
➡ XDebug / profiler
➡ MySQL proxy

Slide 68

Slide 68 text

Think about your app / infra BEFORE going live...

Slide 69

Slide 69 text

➡ Design for (horizontal) scalability.
➡ Remove SPOFs.
➡ Vertical scalability is easier, but more restrictive.
➡ Configuration is key.
➡ Don’t run at full capacity. Have a contingency buffer for peaks.

Slide 70

Slide 70 text

Make a plan

Slide 71

Slide 71 text

Recap / Conclusion

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

➡ Don’t reboot, debug!
➡ Analyze what’s going on,
➡ and find and isolate the culprit.
➡ Treat the problem, not the symptoms.

Slide 74

Slide 74 text

➡ There are many tools out there to analyze your system in real time.
➡ Know your running environment (even if it’s “not your business”).
➡ Ask for third-party help if needed.

Slide 75

Slide 75 text

➡ One machine for one purpose (app / mail / cron / db / etc).
➡ Virtual machines are easy to set up and maintain (puppet) and are cheap.
➡ Try to make as much as possible asynchronous.
➡ Message queues are easy to implement (gearman / *MQ etc).

Slide 76

Slide 76 text

Questions?

Slide 77

Slide 77 text

Thank You!
Find me on twitter: @jaytaph
Find me for development and training: www.noxlogic.nl
Find me on email: [email protected]
Find me for blogs: www.adayinthelifeof.nl
https://joind.in/6939