Don't reboot, debug! IPC

Joshua Thijssen
October 30, 2013

Transcript

  1. Joshua Thijssen
    Freelance consultant, developer and trainer @ NoxLogic
    Founder of the Dutch Web Alliance
    Development in PHP, Python, Perl, C, Java
    Lead developer of Saffire
    Blog: http://adayinthelifeof.nl
    Email: [email protected]
    Twitter: @jaytaph
  10. ➡ Is it reproducible later? Probably not.
    ➡ Are you solving the problem, or desperately trying to remove a symptom?
    ➡ Short-term relief vs. long-term solution.
  11. ➡ Actually analyze, and maybe fix, the problem.
    ➡ It will cost less to analyze and fix it now than to fix it later.
  13. ➡ We reboot our system every night!
    ➡ Why? Memory leaks? Just crappy code?
    ➡ There is some state not handled correctly! Fix it!
    ➡ What happens when the number of users increases by 200%? Restart every 12 hours?
    ➡ Let’s hope you won’t get many visitors!
  14. ➡ Apache / PHP
    ➡ Monitoring / backup
    ➡ Hanging cron jobs & runaway tools
    ➡ Connectivity / DNS problems
  15. ➡ Isolated userspace
    ➡ PID and state
    ➡ Kernel “preempts”, or process yields
    ➡ Multitasking
  17. ➡ R  Running or runnable
    ➡ S  Interruptible sleep
    ➡ D  Uninterruptible sleep
    ➡ T  Stopped
    ➡ Z  Defunct process (zombie)
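On Linux, the state letter listed above is the third field of `/proc/<pid>/stat`. The following is a minimal sketch of how it could be read from Python; the function names and the sample line are illustrative assumptions, not something shown in the talk.

```python
from pathlib import Path

# Descriptions taken from the slide; other letters exist on some kernels.
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "defunct (zombie)",
}


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def describe_pid(pid: int) -> str:
    """Linux only: look up the state of a live process."""
    line = Path(f"/proc/{pid}/stat").read_text()
    state = parse_state(line)
    return STATE_NAMES.get(state, f"other ({state})")


# Hypothetical stat line, truncated after the first few fields.
sample = "1234 (my (odd) daemon) S 1 1234 1234 0"
print(parse_state(sample), "=", STATE_NAMES[parse_state(sample)])
```

On a Linux machine, `describe_pid(os.getpid())` (after `import os`) would report the state of the calling process itself.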
  18. ➡ Most processes are sleeping.
    ➡ External processes (and the kernel) can “wake up” a process at any time by sending “signals”.
    ➡ Fire signals with “kill”.
  19. ➡ Uninterruptible means the process won’t handle signals (directly), but waits for its task to finish.
    ➡ Used for high-performance loops that need to focus (like I/O).
    ➡ Can still be preempted by the scheduler!
  20. ➡ Zombies aren’t bad.
    ➡ It’s just bad programming or administration that creates zombies.
    ➡ They will not eat brains (at least not much).
    ➡ But there shouldn’t be many.
  21. ➡ 1-minute, 5-minute and 15-minute averages
    ➡ Calculated as the number of runnable processes.
    ➡ How to read it depends on the number of CPUs!
  22. 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27
    ➡ 1.52 runnable (or blocking) processes on average over the last minute.
    ➡ 0.66 on average over the last 5 minutes.
    ➡ 0.27 on average over the last 15 minutes.
    ➡ Single CPU: 52% more than it can handle.
    ➡ Quad-core system: not doing very much.
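The single-CPU versus quad-core reading above comes down to dividing the load average by the number of CPUs. A small sketch of that arithmetic, using the `uptime` line from the slide; the function names are made up for illustration.

```python
import re
from typing import Tuple

LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def parse_load_averages(uptime_line: str) -> Tuple[float, float, float]:
    """Pull the 1-, 5- and 15-minute load averages out of an `uptime` line."""
    match = LOAD_RE.search(uptime_line)
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


def load_per_cpu(load: float, cpu_count: int) -> float:
    """Normalise a load average; above 1.0 means more work than CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / cpu_count


line = "14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27"
one_minute, _, _ = parse_load_averages(line)
for cpus in (1, 4):
    print(f"{cpus} CPU(s): {load_per_cpu(one_minute, cpus):.2f} per CPU")
```

On a real machine the CPU count could come from `os.cpu_count()` instead of being hard-coded.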
  23. Q: How much memory does this process use?
    This is a REALLY hard question to answer! It depends on many factors!
  24. ➡ 4GB of virtual address space per process (on 32-bit systems), even if you have less memory installed.
    ➡ 1GB is reserved for the kernel.
    ➡ Kernel can swap out memory.
    ➡ On access, the CPU page-faults and the pages are loaded back in.
  25. ➡ A process can allocate memory without necessarily using it (for instance: preallocation).
    ➡ VIRT will increase!
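The gap between allocated and used memory can be seen by comparing `VmSize` (what `top` shows as VIRT) with `VmRSS` (RES) in `/proc/<pid>/status` on Linux. The sketch below only parses that text; the sample values are invented for illustration and the function name is an assumption.

```python
from typing import Dict


def parse_vm_fields(status_text: str) -> Dict[str, int]:
    """Extract VmSize and VmRSS (both in kB) from /proc/<pid>/status text."""
    wanted = {"VmSize", "VmRSS"}
    result: Dict[str, int] = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The value looks like "  204800 kB"; keep only the number.
            result[key] = int(rest.split()[0])
    return result


# Made-up sample: a process that reserved far more than it touches.
sample_status = "Name:\tdaemon\nVmSize:\t  204800 kB\nVmRSS:\t   10240 kB\n"
fields = parse_vm_fields(sample_status)
print("virtual kB:", fields["VmSize"])
print("resident kB:", fields["VmRSS"])
print("allocated but not resident kB:", fields["VmSize"] - fields["VmRSS"])
```

On Linux, passing the contents of `/proc/self/status` instead of the sample string would give the figures for the running interpreter.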
  27. Q: How much free memory does this system have?
    This is an easier, but still hard, question to answer!
  28. $ free -m
                 total       used       free     shared    buffers     cached
    Mem:           375        349         25          0        111         94
    -/+ buffers/cache:        143        231
    Swap:          400          7        392
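The `-/+ buffers/cache` row in the `free -m` output above is derived from the `Mem:` row: buffers and cache can be reclaimed, so they count as available to applications. A sketch of that arithmetic with the numbers from the slide (function names are illustrative). The results differ by one from what `free` printed, presumably because `free` does the subtraction in kilobytes before converting to megabytes.

```python
def available_for_apps(free_mb: int, buffers_mb: int, cached_mb: int) -> int:
    """Memory applications could still get: free plus reclaimable caches."""
    return free_mb + buffers_mb + cached_mb


def used_by_apps(used_mb: int, buffers_mb: int, cached_mb: int) -> int:
    """Memory actually held by applications, excluding buffers and cache."""
    return used_mb - buffers_mb - cached_mb


# Values from the Mem: row on the slide, in megabytes.
print(available_for_apps(free_mb=25, buffers_mb=111, cached_mb=94))
print(used_by_apps(used_mb=349, buffers_mb=111, cached_mb=94))
```

So although only 25 MB is reported as free, roughly 230 MB of the 375 MB is available to applications.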
  31. ➡ Most daemons log to /var/log/*
    ➡ tail -f /var/log/messages
    ➡ Many times, this is ALL you need!
  32. ➡ Know your tools (top, htop, vmstat, iostat, ps)
    ➡ Know the /proc filesystem
    ➡ Sniff around with tcpdump, netstat, nc, etc.
    ➡ man <keyword>
  33. $ strace -ff -p <apache_pid>
    socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
    fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
    write(20, "get ez_borentappenschuren-/acls/"..., 44) = 44
    read(20, "END\r\n", 8196) = 5
    write(20, "get ez_borentappenschuren-/acl/g"..., 40) = 40
    read(20, "END\r\n", 8196) = 5
    write(20, "quit\r\n", 6) = 6
    shutdown(20, 2 /* send and receive */) = 0
    close(20) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/borentappenschuren/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
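Much of the trace above consists of calls that fail over and over (`access()` returning ENOENT, `mkdir()` returning EEXIST). One way to spot such patterns in a long capture is to tally failed calls per syscall and errno. The sketch below assumes one call per line with no `[pid N]` prefix, as on the slide; the regular expression and function name are illustrative, not part of strace.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g.:  access("...", F_OK) = -1 ENOENT (No such file or directory)
FAILED_CALL = re.compile(r"^(?P<name>\w+)\(.*\)\s*=\s*-1\s+(?P<errno>E[A-Z0-9]+)")


def count_failures(strace_lines: Iterable[str]) -> Counter:
    """Count failed syscalls, keyed by (syscall name, errno name)."""
    counts: Counter = Counter()
    for line in strace_lines:
        match = FAILED_CALL.match(line.strip())
        if match:
            counts[(match.group("name"), match.group("errno"))] += 1
    return counts


# A few lines copied from the trace on the slide.
sample = [
    'mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)',
    'chmod("/tmp/smarty", 0777) = 0',
    'access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)',
    'access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)',
    'connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)',
]
for (name, errno_name), total in count_failures(sample).most_common():
    print(f"{total:3d}  {name}  {errno_name}")
```

Not every failure is a bug: EINPROGRESS on a non-blocking `connect()` is expected, so the tally is a starting point for questions rather than a verdict.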
  34. $ strace ping www.google.com
    mprotect(0xb757f000, 4096, PROT_READ) = 0
    munmap(0xb76d8000, 44104) = 0
    stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
    socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
    connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
    gettimeofday({1347446161, 382120}, NULL) = 0
    poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
    send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
    poll([{fd=3, events=POLLIN}], 1, 5000
  35. $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy  id wa
     0  0    896  17208 124784 121964    0    0     0     0    73   159  0  1  99  0
     2  0    896  14968 125628 121976    0    0   844     0  6857 29935  5 58  35  2
     2  0    896  14472 125628 121964    0    0     0     0 10691 48526 10 90   0  0
     2  0    896  13976 126120 121952    0    0   476   252 10430 49144  7 91   0  2
     0  0    896  12744 127016 121968    0    0   896     0  7799 38732  3 71  23  3
     0  0    896  12744 127016 121964    0    0     0     0    30    93  0  0 100  0
     0  0    896  12760 127016 121964    0    0     0     0    30    92  0  0 100  0
     0  0    896  12760 127016 121964    0    0     0     0    29    99  0  1  99  0
     0  0    896  12760 127016 121964    0    0     0     0    32   110  0  0 100  0
     0  0    896  12752 127024 121964    0    0     0   324    33   108  0  0  99  1
     0  0    896  12752 127024 121964    0    0     0     0    30   103  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0    12    34   108  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0     0    30   105  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0   236    80   101  0  1  99  0
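In the `vmstat 1` samples above, a few seconds stand out: system CPU time (`sy`) jumps from about 1% to as much as 91% while context switches (`cs`) climb into the tens of thousands. A rough sketch of picking out such samples programmatically follows; the 50% threshold and the function names are arbitrary assumptions for illustration.

```python
from typing import Dict, List


def parse_vmstat(output: str) -> List[Dict[str, int]]:
    """Turn `vmstat` text into one dict per sample, keyed by column name."""
    rows: List[Dict[str, int]] = []
    columns = None
    for line in output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "procs":
            continue  # blank line or the group banner
        if fields[0] == "r":
            columns = fields  # the column header row
            continue
        if columns and len(fields) == len(columns):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def kernel_heavy(rows: List[Dict[str, int]], threshold: int = 50) -> List[Dict[str, int]]:
    """Samples where system CPU time is at or above the threshold percent."""
    return [row for row in rows if row["sy"] >= threshold]


# Three samples copied from the slide.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy  id wa
 0  0    896  17208 124784 121964    0    0     0     0    73   159  0  1  99  0
 2  0    896  14968 125628 121976    0    0   844     0  6857 29935  5 58  35  2
 2  0    896  14472 125628 121964    0    0     0     0 10691 48526 10 90   0  0
"""
for row in kernel_heavy(parse_vmstat(sample)):
    print(f"sy={row['sy']}%  cs={row['cs']}  runnable={row['r']}")
```

The same parser could be fed live output, for example by piping `vmstat 1 10` into a script that reads standard input.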
  36. ➡ Unobtrusive probes inside the kernel
    ➡ Scripts written in the D language
    ➡ SUN / Solaris only (licensing)
  37. ➡ “GPL” version of dtrace
    ➡ Awesome, but complex
    ➡ But you need / want debug info packages
  38. ➡ Design for (vertical) scalability.
    ➡ Remove SPOFs.
    ➡ Horizontal scalability is easier, but more restrictive.
    ➡ Configuration is key.
    ➡ Don’t run at full capacity. Have a contingency buffer for peaks and know how to scale out.
  39. ➡ One machine for one purpose (app / mail / cron / db / etc.).
    ➡ Virtual machines are cheap, and easy to set up and maintain (puppet).
    ➡ Try to make as much as possible asynchronous.
    ➡ Message queues are easy to implement (gearman / *MQ etc.).
  41. ➡ Don’t reboot, debug!
    ➡ Analyze what’s going on, find and isolate the culprit.
    ➡ Treat the problem, not the symptoms.
  42. ➡ There are many tools out there to analyze your system in real time.
    ➡ Know your running environment (even if it’s “not your business”).
    ➡ Ask for 3rd-party help if needed.
  44. Find me on twitter: @jaytaph
    Find me for development and training: www.noxlogic.nl
    Find me on email: [email protected]
    Find me for blogs: www.adayinthelifeof.nl
    Thank You!
    https://joind.in/9564