Don't reboot, debug!

Joshua Thijssen
September 18, 2013

Transcript

  1. Joshua Thijssen: freelance consultant, developer and trainer @ NoxLogic.
    Founder of the Dutch Web Alliance. Development in PHP, Python, Perl, C, Java. Lead developer of Saffire.
    Blog: http://adayinthelifeof.nl Email: [email protected] Twitter: @jaytaph
  10. ➡ Is it reproducible later? Probably not.
    ➡ Are you solving the problem, or desperately trying to remove a symptom?
    ➡ Short-term relief vs. long-term solution.
  11. ➡ Actually analyze, and maybe fix, the problem.
    ➡ It will cost less to analyze and fix it now than to fix it later.
    ➡ You just saved a few gazillion dollars!
  13. ➡ “We reboot our system every night!”
    ➡ Why? Memory leaks? Just crappy code?
    ➡ There is some state that is not handled correctly! Fix it!
    ➡ What happens when the number of users increases by 200%? Restart every 12 hours?
    ➡ Let’s hope you won’t get many visitors!
  14. ➡ Apache / PHP
    ➡ Monitoring / backup
    ➡ Hanging cron jobs
    ➡ Runaway tools
    ➡ Connectivity / DNS problems
  15. ➡ Isolated userspace.
    ➡ PID and state.
    ➡ Kernel “preempts”, or process yields.
    ➡ Multitasking.
  17. ➡ R: Running or runnable
    ➡ S: Interruptible sleep
    ➡ D: Uninterruptible sleep
    ➡ T: Stopped
    ➡ Z: Defunct process (zombie)
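
On Linux these state letters can be read from `/proc/<pid>/stat`. A minimal sketch, assuming the usual `pid (command) state ...` layout of that file; the sample line and the helper names are made up for illustration:

```python
import os

# Descriptions taken from the slide above.
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "defunct (zombie)",
}


def parse_stat_state(stat_line):
    """Return (pid, command, state letter) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST closing parenthesis.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    command = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, command, state


def describe(stat_line):
    pid, command, state = parse_stat_state(stat_line)
    return f"{pid} {command}: {STATE_NAMES.get(state, 'other state ' + state)}"


# Hypothetical sample line, cut off after the parent PID.
print(describe("4242 (apache2) S 1"))

# On a Linux machine, also show the state of this very process.
if os.path.exists("/proc/self/stat"):
    with open("/proc/self/stat") as stat_file:
        print(describe(stat_file.read()))
```

Tools like `ps` and `top` show the same letter in their state column.
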
  18. ➡ Most processes are sleeping.
    ➡ External processes (and the kernel) can “wake up” a process at any time by sending “signals”.
    ➡ Fire signals with “kill”.
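
A small sketch of that mechanism on a Unix-like system: the process installs a handler, then sends itself `SIGUSR1`, which is what `kill -USR1 <pid>` would do from a shell. The choice of `SIGUSR1` and the handler name are assumptions made for the example:

```python
import os
import signal
import time

received = []


def on_usr1(signum, frame):
    # Runs when the signal is delivered to this process.
    received.append(signum)


signal.signal(signal.SIGUSR1, on_usr1)

# Same effect as running `kill -USR1 <pid>` from another terminal.
os.kill(os.getpid(), signal.SIGUSR1)

# Give the interpreter a moment to run the handler (2 seconds at most).
deadline = time.monotonic() + 2.0
while not received and time.monotonic() < deadline:
    time.sleep(0.01)

print("signals received:", [signal.Signals(s).name for s in received])
```
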
  19. ➡ Uninterruptible means it won’t handle signals (directly).
    ➡ Used for high-performance loops that need to focus (like I/O).
    ➡ Can still be preempted by the scheduler!
  20. ➡ Zombies aren’t bad.
    ➡ It’s just bad programming or administration that creates zombies.
    ➡ They will not eat brains (at least not much).
    ➡ But there shouldn’t be many.
  21. ➡ 1-minute, 5-minute and 15-minute averages.
    ➡ Calculated from the number of runnable processes.
    ➡ What it means depends on the number of CPUs!
  22. 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27
    ➡ 1.52 average runnable (or blocking) processes in the last minute.
    ➡ 0.66 average in the last 5 minutes.
    ➡ 0.27 average in the last 15 minutes.
    ➡ Single CPU: 52% more than it can handle.
    ➡ Quad-core system: not doing very much.
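
The single-CPU versus quad-core reading is just the load divided by the number of CPUs. A minimal sketch of that arithmetic using the 1-minute value from the slide; the function names are made up for the example:

```python
import os


def load_per_cpu(load, cpus):
    """Average number of runnable processes per CPU."""
    return load / cpus


def describe_load(load, cpus):
    ratio = load_per_cpu(load, cpus)
    if ratio > 1.0:
        return f"{(ratio - 1.0) * 100:.0f}% more work than {cpus} CPU(s) can handle"
    return f"using about {ratio * 100:.0f}% of {cpus} CPU(s)"


# The 1-minute average from the slide, read for one CPU and for four.
print(describe_load(1.52, 1))
print(describe_load(1.52, 4))

# The same reading for the machine this runs on (Unix-like systems only).
try:
    one_minute = os.getloadavg()[0]
    print(describe_load(one_minute, os.cpu_count() or 1))
except (AttributeError, OSError):
    pass
```
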
  23. Q: How much memory does this process use?
    This is a REALLY hard question to answer! It depends on many factors!
  24. ➡ 4GB memory space*, even if you have less memory installed.
    ➡ Kernel can swap out memory.
    ➡ The CPU page-faults and the pages are loaded back in.
  25. ➡ A process can allocate memory but does not necessarily use it (for instance: preallocation).
    ➡ VIRT will increase!
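
A sketch of that effect on Linux, assuming `VmSize` and `VmRSS` in `/proc/self/status` correspond to what `top` shows as VIRT and RES. It reserves 64 MiB without touching it and compares both values; the size and the helper names are arbitrary choices for the example:

```python
import mmap
import os


def parse_status_kb(status_text):
    """Extract VmSize and VmRSS (in kB) from /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            values[key] = int(rest.split()[0])
    return values


def current_usage():
    with open("/proc/self/status") as status_file:
        return parse_status_kb(status_file.read())


if os.path.exists("/proc/self/status"):
    before = current_usage()
    # Reserve 64 MiB of anonymous memory, but never read or write it.
    block = mmap.mmap(-1, 64 * 1024 * 1024)
    after = current_usage()
    print("VmSize grew by", after.get("VmSize", 0) - before.get("VmSize", 0), "kB")
    print("VmRSS grew by", after.get("VmRSS", 0) - before.get("VmRSS", 0), "kB")
    block.close()
```

On a typical Linux system the first number should be roughly the reserved size, while the second stays small until the memory is actually written to.
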
  27. Q: How much free memory does this system have?
    This is an easier, but still hard question to answer!
  28. $ free -m
                 total       used       free     shared    buffers     cached
    Mem:           375        349         25          0        111         94
    -/+ buffers/cache:        143        231
    Swap:          400          7        392
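
The `-/+ buffers/cache` line is derived from the `Mem:` line: buffers and cache are subtracted from "used" and added to "free", since the kernel can hand that memory back when applications need it. A small sketch of the arithmetic with the numbers from the slide; the function name is made up:

```python
def buffers_cache_line(used, free, buffers, cached):
    """Recompute the '-/+ buffers/cache' line from the 'Mem:' line (values in MB)."""
    used_by_applications = used - buffers - cached
    available_to_applications = free + buffers + cached
    return used_by_applications, available_to_applications


# Values from the 'Mem:' line on the slide.
print(buffers_cache_line(used=349, free=25, buffers=111, cached=94))
```

This gives 144 and 230, one off from the 143 and 231 shown on the slide, presumably because `free -m` rounds each value to whole megabytes separately.
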
  31. ➡ Most daemons log to /var/log/*.
    ➡ tail -f /var/log/messages
    ➡ Many times, this is ALL you need!
  32. ➡ Know your tools (top, htop, vmstat, iostat, ps).
    ➡ Know the /proc filesystem.
    ➡ Sniff around with tcpdump, netstat, nc, etc.
    ➡ man <keyword>
  33. $ strace -ff -p <apache_pid>
    socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
    fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
    write(20, "get ez_borentappenschuren-/acls/"..., 44) = 44
    read(20, "END\r\n", 8196) = 5
    write(20, "get ez_borentappenschuren-/acl/g"..., 40) = 40
    read(20, "END\r\n", 8196) = 5
    write(20, "quit\r\n", 6) = 6
    shutdown(20, 2 /* send and receive */) = 0
    close(20) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/borentappenschuren/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
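
A trace like this gets long quickly, so it can help to count the failing calls first. A rough sketch that tallies failed system calls by name and error code; the regular expression only covers the simple `name(args) = result ERRNO (text)` line shape seen above and is an assumption, not a full strace parser:

```python
import re
from collections import Counter

# Matches: name(args...) = result [ERRNO (description)]
LINE_RE = re.compile(
    r"^(?P<name>\w+)\(.*\)\s+=\s+(?P<result>0x[0-9a-f]+|-?\d+)"
    r"(?:\s+(?P<errno>E[A-Z]+))?"
)


def summarize_failures(lines):
    """Count failed syscalls as (name, errno) pairs."""
    failures = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("errno"):
            failures[(match.group("name"), match.group("errno"))] += 1
    return failures


# A few lines copied from the trace above.
SAMPLE = r'''
connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
read(20, "END\r\n", 8196) = 5
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
'''.strip().splitlines()

for (name, errno), count in summarize_failures(SAMPLE).most_common():
    print(f"{count:3d}  {name}  {errno}")
```

In the full trace the repeated `access(...) = -1 ENOENT` lines stand out this way: the template is looked up in one directory after another and never found.
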
  34. $ strace ping www.google.com
    mprotect(0xb757f000, 4096, PROT_READ) = 0
    munmap(0xb76d8000, 44104) = 0
    stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
    socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
    connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
    gettimeofday({1347446161, 382120}, NULL) = 0
    poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
    send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
    poll([{fd=3, events=POLLIN}], 1, 5000
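
The 32 bytes in the `send()` call are a plain DNS query going to port 53. As an illustration, here is a small decoder for exactly that payload; it reads the 12-byte header and the first question only, and does not handle name compression. The function name is made up for the example:

```python
import struct

# Payload of the send() call, copied from the strace output above.
# strace prints non-printable bytes as octal escapes, which Python
# bytes literals understand as well.
QUERY = b"u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1"


def decode_dns_query(packet):
    """Decode the header and the first question of a DNS query packet."""
    ident, flags, qdcount, _an, _ns, _ar = struct.unpack("!6H", packet[:12])
    labels = []
    pos = 12
    # The name is a series of length-prefixed labels, ended by a zero byte.
    while packet[pos] != 0:
        length = packet[pos]
        labels.append(packet[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    pos += 1  # skip the zero byte that ends the name
    qtype, qclass = struct.unpack("!2H", packet[pos:pos + 4])
    return {
        "id": ident,
        "is_query": (flags & 0x8000) == 0,
        "recursion_desired": bool(flags & 0x0100),
        "questions": qdcount,
        "name": ".".join(labels),
        "qtype": qtype,    # 1 is an A record
        "qclass": qclass,  # 1 is the Internet class
    }


print(len(QUERY), decode_dns_query(QUERY))
```

So the trace shows `ping` asking the resolver at 192.168.178.4 for the A record of www.google.com and then waiting up to 5000 ms in `poll()` for the answer.
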
  35. $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy  id wa
     0  0    896  17208 124784 121964    0    0     0     0    73   159  0  1  99  0
     2  0    896  14968 125628 121976    0    0   844     0  6857 29935  5 58  35  2
     2  0    896  14472 125628 121964    0    0     0     0 10691 48526 10 90   0  0
     2  0    896  13976 126120 121952    0    0   476   252 10430 49144  7 91   0  2
     0  0    896  12744 127016 121968    0    0   896     0  7799 38732  3 71  23  3
     0  0    896  12744 127016 121964    0    0     0     0    30    93  0  0 100  0
     0  0    896  12760 127016 121964    0    0     0     0    30    92  0  0 100  0
     0  0    896  12760 127016 121964    0    0     0     0    29    99  0  1  99  0
     0  0    896  12760 127016 121964    0    0     0     0    32   110  0  0 100  0
     0  0    896  12752 127024 121964    0    0     0   324    33   108  0  0  99  1
     0  0    896  12752 127024 121964    0    0     0     0    30   103  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0    12    34   108  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0     0    30   105  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0   236    80   101  0  1  99  0
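
The interesting part of this output is the short burst where system CPU time (`sy`) and context switches (`cs`) shoot up. A sketch that picks those samples out of the first rows of the slide; the 50% threshold is an arbitrary choice for the example:

```python
COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()

# The first six samples from the vmstat output above.
ROWS = """
0 0 896 17208 124784 121964 0 0 0 0 73 159 0 1 99 0
2 0 896 14968 125628 121976 0 0 844 0 6857 29935 5 58 35 2
2 0 896 14472 125628 121964 0 0 0 0 10691 48526 10 90 0 0
2 0 896 13976 126120 121952 0 0 476 252 10430 49144 7 91 0 2
0 0 896 12744 127016 121968 0 0 896 0 7799 38732 3 71 23 3
0 0 896 12744 127016 121964 0 0 0 0 30 93 0 0 100 0
""".strip().splitlines()


def parse_row(line):
    """Turn one vmstat data line into a {column: value} dict."""
    return dict(zip(COLUMNS, map(int, line.split())))


def busy_samples(rows, sy_threshold=50):
    """Return the samples where system CPU time is at or above the threshold."""
    return [row for row in map(parse_row, rows) if row["sy"] >= sy_threshold]


for sample in busy_samples(ROWS):
    print(f"sy={sample['sy']}%  cs={sample['cs']}  in={sample['in']}  r={sample['r']}")
```

Four consecutive samples qualify, each with tens of thousands of context switches per second, while the idle rows around them stay near 100 per second.
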
  36. ➡ Unobtrusive probes inside the kernel.
    ➡ Scripts written in the D language.
    ➡ Sun / Solaris only (licensing).
  37. ➡ “GPL” version of dtrace.
    ➡ Awesome, but complex.
    ➡ But you need / want debug info packages.
  38. ➡ Design for (vertical) scalability.
    ➡ Remove SPOFs.
    ➡ Horizontal scalability is easier, but more restrictive.
    ➡ Configuration is key.
    ➡ Don’t run at full capacity. Have a contingency buffer for peaks.
  40. ➡ Don’t reboot, debug!
    ➡ Analyze what’s going on,
    ➡ find and isolate the culprit.
    ➡ Treat the problem, not the symptoms.
  41. ➡ There are many tools out there to analyze your system in real time.
    ➡ Know your running environment (even if it’s “not your business”).
    ➡ Ask for third-party help if needed.
  42. ➡ One machine for one purpose (app / mail / cron / db / etc).
    ➡ Virtual machines are easy to set up and maintain (puppet) and are cheap.
    ➡ Try to async as much as possible.
    ➡ Message queues are easy to implement (gearman / *MQ etc).
  44. Find me on twitter: @jaytaph
    Find me for development and training: www.noxlogic.nl
    Find me on email: [email protected]
    Find me for blogs: www.adayinthelifeof.nl
    Thank You! https://joind.in/8890