Don't reboot, debug! - DPC15

Joshua Thijssen
June 27, 2015

Transcript

  1. Don't reboot, debug! A medic first aid course in debugging your server. Joshua Thijssen, @JayTaph
  2. Joshua Thijssen
    ➡ Consultant and trainer
    ➡ Author of the Symfony Rainbow Series: http://www.symfony-rainbow.com
    ➡ Blog: http://adayinthelifeof.nl
    ➡ Twitter: @jaytaph
    ➡ TechAnalyze: www.techanalyze.io
  8. ➡ Is it reproducible later? Probably not.
    ➡ Are you solving the problem, or desperately trying to remove a symptom?
    ➡ Short-term relief vs. long-term solution.
  10. ➡ Actually analyze, and maybe fix, the problem.
    ➡ It will cost less to analyze and fix it now than to fix it later.
  18. ➡ "We reboot our system every night!"
    ➡ Why? Memory leaks? Just crappy code?
    ➡ There is some state not handled correctly!
    ➡ What happens when the number of users increases by 200%? Restart every 12 hours?
    ➡ Let’s hope you won’t get many visitors!
  20. ➡ Apache / PHP
    ➡ Monitoring / backup
    ➡ Hanging cron jobs & runaway tools
    ➡ Connectivity / DNS problems
  23. ➡ Isolated userspace.
    ➡ PID (process ID) and state.
    ➡ Kernel “preempts”, or process yields.
    ➡ Multitasking.
  27. ➡ R Running or runnable
    ➡ S Interruptible sleep
    ➡ D Uninterruptible sleep
    ➡ T Stopped
    ➡ Z Defunct process (zombie)
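
On Linux, the state letter of a process is also the third field of `/proc/<pid>/stat`. The following is a minimal sketch of pulling it out of such a line; the sample line and the helper name `parse_state` are made up for illustration and are not part of the talk.

```python
# Map the one-letter codes from the slide to their descriptions.
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "defunct (zombie)",
}


def parse_state(stat_line: str) -> str:
    """Return the state letter from one /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the LAST ')' rather than on
    whitespace alone.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


# Hypothetical sample line, truncated after a few fields.
sample = "1234 (php-fpm: pool www) S 1 1234 1234 0 -1 4194624"
state = parse_state(sample)
print(state, "=", STATE_NAMES.get(state, "other"))
```

The same letters appear in the STAT column of `ps` and the S column of `top`.
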
  30. ➡ Most processes are sleeping.
    ➡ External processes (and the kernel) can “wake up” a process at any time by sending “signals”.
    ➡ Fire signals with “kill”.
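
To make signal delivery concrete, here is a small sketch in which a process installs a handler and then signals itself. It assumes a POSIX system and that the code runs in the main thread; `SIGUSR1` and the handler name `on_usr1` are arbitrary choices for the example.

```python
import os
import signal
import time

received = []


def on_usr1(signum, frame):
    # Runs when the process is interrupted by SIGUSR1.
    received.append(signum)


# Install the handler, then send the signal to ourselves.
# From a shell, the same signal would be sent with `kill -USR1 <pid>`.
signal.signal(signal.SIGUSR1, on_usr1)
os.kill(os.getpid(), signal.SIGUSR1)

# Give the interpreter a moment to run the handler.
deadline = time.monotonic() + 1.0
while not received and time.monotonic() < deadline:
    time.sleep(0.01)

print("received:", [signal.Signals(s).name for s in received])
```
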
  34. ➡ Uninterruptible means it won’t handle signals (directly), but waits for its task to finish (it must wake up by itself).
    ➡ Used for high-performance loops that need to focus (like I/O).
    ➡ Can still be preempted by the scheduler!
  38. ➡ Zombies aren’t bad.
    ➡ It’s just bad programming or administration that creates zombies.
    ➡ They will not eat brains (at least not much).
    ➡ But there shouldn’t be many.
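
A zombie is a child that has exited but whose parent has not yet collected its exit status. The sketch below, assuming a POSIX system, shows the parent doing that collection with `waitpid`; the function name `run_and_reap` is made up for the example.

```python
import os


def run_and_reap(exit_code: int) -> int:
    """Fork a child that exits immediately, then reap it in the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: exit right away without running cleanup handlers.
        os._exit(exit_code)
    # Parent: once the child has exited it stays in the process table
    # as a zombie (state Z) until waitpid() collects its status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


print("child exit code:", run_and_reap(3))
```

A parent that never calls `waitpid` (or an equivalent) is the kind of programming mistake that lets zombies pile up.
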
  41. ➡ 1-minute, 5-minute and 15-minute averages.
    ➡ Calculated from the number of runnable processes.
    ➡ Depends on the number of CPUs!
  46. 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27
    ➡ 1.52 runnable (or blocking) processes on average over the last minute.
    ➡ 0.66 on average over the last 5 minutes.
    ➡ 0.27 on average over the last 15 minutes.
    ➡ Single CPU: 52% more than it can handle.
    ➡ Quad-core system: not doing very much.
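
The single-CPU versus quad-core comparison comes down to dividing the load average by the number of CPUs. A rough sketch, using the line from the slide; the helper names are made up for the example.

```python
def load_per_cpu(load: float, cpus: int) -> float:
    """Load average divided by CPU count; above 1.0 means work is queuing."""
    return load / cpus


def parse_load_averages(uptime_line: str) -> tuple:
    """Extract the 1, 5 and 15 minute values from an uptime-style line."""
    values = uptime_line.split("load average:")[1]
    return tuple(float(v) for v in values.split(","))


line = "14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27"
one_min, five_min, fifteen_min = parse_load_averages(line)

for cpus in (1, 4):
    print(f"{cpus} CPU(s): {load_per_cpu(one_min, cpus):.2f} per CPU")
```

On one CPU the result is 1.52 per CPU, which is the "52% more than it can handle" case; on four CPUs it is 0.38 per CPU.
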
  47. Q: How much memory does this process use? This is a REALLY hard question to answer! It depends on many factors!
  51. ➡ 4GB of (virtual) memory space per process on a 32-bit system, even if you have less memory installed.
    ➡ 1GB is reserved for the kernel.
    ➡ The kernel can swap out memory.
    ➡ On access, the CPU raises a page fault and the pages are loaded back.
  54. ➡ A process can allocate memory without necessarily using it (for instance: preallocation).
    ➡ VIRT will increase!
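
The gap between allocated and actually used memory can be read from `/proc/<pid>/status` on Linux: `VmSize` corresponds roughly to what `top` shows as VIRT, and `VmRSS` to RES. A minimal sketch follows; the excerpt and its numbers are invented for illustration.

```python
def parse_vm_fields(status_text: str) -> dict:
    """Pick VmSize (virtual size) and VmRSS (resident size), both in kB."""
    wanted = {"VmSize", "VmRSS"}
    result = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            result[key] = int(rest.split()[0])  # the value is followed by "kB"
    return result


# Hypothetical excerpt of /proc/<pid>/status; the numbers are made up.
sample = """Name:\tphp
VmSize:\t  204800 kB
VmRSS:\t   15360 kB
Threads:\t1
"""
fields = parse_vm_fields(sample)
print(f"allocated {fields['VmSize']} kB, resident {fields['VmRSS']} kB")
```

A process that preallocates a large buffer without touching it would show a large `VmSize` next to a much smaller `VmRSS`.
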
  56. Q: How much free memory does this system have? This is an easier, but still hard question to answer!
  57. $ free -m
                 total       used       free     shared    buffers     cached
    Mem:           375        349         25          0        111         94
    -/+ buffers/cache:        143        231
    Swap:          400          7        392
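
The `-/+ buffers/cache` row is the interesting one: memory used for buffers and cache can be handed back to applications, so it counts as available. The sketch below recomputes that figure from the `Mem:` row of the output above, assuming the older `free` layout shown on the slide; `available_mb` is a made-up helper name.

```python
def available_mb(free_output: str) -> int:
    """Free memory plus buffers and cache, from old-style `free -m` output."""
    lines = free_output.strip().splitlines()
    columns = lines[0].split()  # total used free shared buffers cached
    mem_values = [int(v) for v in lines[1].split()[1:]]  # skip the "Mem:" label
    mem = dict(zip(columns, mem_values))
    return mem["free"] + mem["buffers"] + mem["cached"]


sample = """
             total       used       free     shared    buffers     cached
Mem:           375        349         25          0        111         94
-/+ buffers/cache:        143        231
Swap:          400          7        392
"""
print(available_mb(sample), "MB can be reclaimed for applications")
```

The sum is 230 MB, one less than the 231 that `free` prints, presumably because `free` adds the values at a finer granularity before rounding to megabytes.
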
  60. $ php composer.phar require monolog/monolog
    ➡ syslog
    ➡ files
    ➡ mail
    ➡ slack / hipchat / irc
    ➡ logstash
  62. ➡ Most daemons will log into /var/log/*
    ➡ tail -f /var/log/messages
    ➡ Many times, this is ALL you need!
  65. ➡ Know your tools (top, htop, vmstat, iostat, ps).
    ➡ Know the /proc filesystem.
    ➡ Sniff around with tcpdump, netstat, nc, etc.
    ➡ man <keyword>
  66. $ strace -ff -p <pid>
    ....
    socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
    fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
    write(20, "get ez_client1/acls/"..., 44) = 44
    read(20, "END\r\n", 8196) = 5
    write(20, "get ez_client1/acl/g"..., 40) = 40
    read(20, "END\r\n", 8196) = 5
    write(20, "quit\r\n", 6) = 6
    shutdown(20, 2 /* send and receive */) = 0
    close(20) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/client1/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/client1/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
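
A trace like this gets long quickly, so it can help to tally the failing calls instead of reading every line. The sketch below counts (syscall, errno) pairs for calls that returned -1; the regular expression is an assumption tuned to the line format shown above and does not handle `[pid N]` prefixes or unfinished calls.

```python
import re
from collections import Counter

# Matches lines such as:
#   access("/x", F_OK) = -1 ENOENT (No such file or directory)
FAILED_CALL = re.compile(r"^(\w+)\(.*\)\s+= -1 (E[A-Z]+)")


def count_failures(strace_lines):
    """Count (syscall, errno) pairs for calls that returned -1."""
    counts = Counter()
    for line in strace_lines:
        match = FAILED_CALL.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


# A few lines copied from the trace above.
sample = [
    'close(20) = 0',
    'mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)',
    'chmod("/tmp/smarty", 0777) = 0',
    'access("/userdata/client1/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)',
    'access("/userdata/client1/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)',
]
for (syscall, errno_name), n in count_failures(sample).most_common():
    print(f"{n:3d} {syscall} {errno_name}")
```

Run over the full trace, a tally like this would make the repeated `access` misses on template files stand out.
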
  67. $ strace ping www.google.com
    ....
    mprotect(0xb757f000, 4096, PROT_READ) = 0
    munmap(0xb76d8000, 44104) = 0
    stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
    socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
    connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
    gettimeofday({1347446161, 382120}, NULL) = 0
    poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
    send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
    poll([{fd=3, events=POLLIN}], 1, 5000
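
The 32 bytes passed to `send()` on port 53 are a DNS query, and the trace stops while `poll()` waits for the answer. As an illustration, the sketch below decodes the question section of that payload; `decode_dns_question` is a made-up helper that assumes an uncompressed query and is not a general DNS parser.

```python
import struct


def decode_dns_question(packet: bytes):
    """Return (query id, name, qtype, qclass) of the first question."""
    query_id, _flags, qdcount = struct.unpack("!HHH", packet[:6])
    if qdcount < 1:
        raise ValueError("packet carries no question")
    offset = 12  # the fixed DNS header is 12 bytes long
    labels = []
    while packet[offset] != 0:
        length = packet[offset]
        labels.append(packet[offset + 1:offset + 1 + length].decode("ascii"))
        offset += 1 + length
    # Skip the zero byte that ends the name, then read type and class.
    qtype, qclass = struct.unpack("!HH", packet[offset + 1:offset + 5])
    return query_id, ".".join(labels), qtype, qclass


# The bytes from the send() call. strace prints octal escapes,
# which Python bytes literals accept unchanged.
payload = b"u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1"
print(decode_dns_question(payload))
```

Type 1 and class 1 mean an A record in the IN class, so `ping` is asking the resolver at 192.168.178.4 for the address of www.google.com.
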
  68. $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system--- -----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy  id wa
     0  0    896  17208 124784 121964    0    0     0     0    73   159  0  1  99  0
     2  0    896  14968 125628 121976    0    0   844     0  6857 29935  5 58  35  2
     2  0    896  14472 125628 121964    0    0     0     0 10691 48526 10 90   0  0
     2  0    896  13976 126120 121952    0    0   476   252 10430 49144  7 91   0  2
     0  0    896  12744 127016 121968    0    0   896     0  7799 38732  3 71  23  3
     0  0    896  12744 127016 121964    0    0     0     0    30    93  0  0 100  0
     0  0    896  12760 127016 121964    0    0     0     0    30    92  0  0 100  0
     0  0    896  12760 127016 121964    0    0     0     0    29    99  0  1  99  0
     0  0    896  12760 127016 121964    0    0     0     0    32   110  0  0 100  0
     0  0    896  12752 127024 121964    0    0     0   324    33   108  0  0  99  1
     0  0    896  12752 127024 121964    0    0     0     0    30   103  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0    12    34   108  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0     0    30   105  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0   236    80   101  0  1  99  0
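
The burst in the second to fifth samples is easier to spot when the rows are parsed and filtered. A minimal sketch, using three rows from the output above; the 50% threshold on system CPU time is an arbitrary assumption, and `parse_vmstat` is a made-up helper name.

```python
def parse_vmstat(text: str):
    """Turn `vmstat` rows into dicts keyed by the column header."""
    lines = [l for l in text.strip().splitlines() if not l.startswith("procs")]
    columns = lines[0].split()
    return [dict(zip(columns, map(int, row.split()))) for row in lines[1:]]


sample = """
procs -----------memory---------- ---swap-- -----io---- --system--- -----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy  id wa
 0  0    896  17208 124784 121964    0    0     0     0    73   159  0  1  99  0
 2  0    896  14472 125628 121964    0    0     0     0 10691 48526 10 90   0  0
 2  0    896  13976 126120 121952    0    0   476   252 10430 49144  7 91   0  2
"""
samples = parse_vmstat(sample)

# Flag samples where the kernel itself uses at least half of the CPU time.
busy = [s for s in samples if s["sy"] >= 50]
for s in busy:
    print(f"sy={s['sy']}% cs={s['cs']} in={s['in']} r={s['r']}")
```

High `sy` together with tens of thousands of context switches (`cs`) per second points at time spent in the kernel rather than in application code.
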
  69. ➡ Unobtrusive probes inside the kernel.
    ➡ Scripts written in the D language.
    ➡ Sun / Solaris only (licensing).
  70. ➡ “GPL” version of dtrace.
    ➡ Awesome, but complex.
    ➡ But you need / want debug info packages.
  74. ➡ Design for (vertical) scalability.
    ➡ Remove SPOFs.
    ➡ Horizontal scalability is easier, but more restrictive.
    ➡ Configuration is key.
    ➡ Don’t run at full capacity. Have a contingency buffer for peaks and know how to scale out.
  78. ➡ One machine for one purpose (app / mail / cron / db / etc).
    ➡ Virtual machines are easy to set up and maintain (puppet / ansible) and are cheap.
    ➡ Try to make as much as possible asynchronous.
    ➡ Message queues are easy to implement (gearman / *MQ etc).
  81. ➡ Don’t reboot, debug!
    ➡ Analyze what’s going on, find and isolate the culprit.
    ➡ Treat the problem, not the symptoms.
  84. ➡ There are many tools out there to analyze your system in real time.
    ➡ Know your running environment (even if it’s “not your business”).
    ➡ Ask for third-party help if needed.
  86. Thank You!
    ➡ Find me on twitter: @jaytaph
    ➡ Find me for development and training: www.noxlogic.nl
    ➡ Find me for blogs: www.adayinthelifeof.nl
    ➡ https://joind.in/14205