Don't reboot, debug! - DPC15

Joshua Thijssen
June 27, 2015
Transcript

  1. 1
    Don't reboot, debug!
    A medic first aid course in debugging your server
    Joshua Thijssen
    @JayTaph

  2. 2
    Joshua Thijssen
    Consultant and trainer
    Author of the Symfony Rainbow Series
    http://www.symfony-rainbow.com
    Blog: http://adayinthelifeof.nl
    Email: [email protected]
    Twitter: @jaytaph
    TechAnalyze
    WWW.TECHANALYZE.IO

  5. 3
    It is
    It's not

  7. 4
    Our website is not working anymore!!!

  8. Have you tried turning it off and on again?
    5

  10. 6
    Fix it! Every minute we’re
    losing money!
    Manager
    Not a suit

  12. 8
    Deal now or deal later?

  16. 9
    ➡ Is it reproducible later? Probably not.
    ➡ Are you solving the problem, or
    desperately trying to remove a symptom?
    ➡ Short-term relief vs. long-term solution

  20. 10
    ➡ Actually analyze, maybe fix the problem.
    ➡ It will cost less to analyze/fix it now than
    to fix it later.

  28. 12
    ➡ We reboot our system every night!
    ➡ Why? Memory leaks? Just crappy code?
    ➡ There is some state not handled correctly!
    ➡ What happens when the number of users increases
    by 200%? Restart every 12 hours?
    ➡ Let’s hope you won’t get many visitors!

  29. Find the culprit
    13

  31. 14
    Bottleneck Troubleshooting Flowchart (BTF)
    Site is slow or not responding. ➡ It’s your DB.

  36. ➡ Apache / PHP
    ➡ Monitoring / backup
    ➡ Hanging cron jobs & runaway tools
    ➡ Connectivity / DNS problems
    15

  37. 16
    Linux 101

  38. 17
    Processes

  43. 18
    ➡ Isolated userspace.
    ➡ PID (process id) and state.
    ➡ Kernel “preempts”, or process yields.
    ➡ Multitasking.

  50. 20
    ➡ R Running or runnable
    ➡ S Interruptible sleep
    ➡ D Uninterruptible sleep
    ➡ T Stopped
    ➡ Z Defunct process (zombies)
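
The state letters above are the same ones the kernel exposes in `/proc/<pid>/stat`. A minimal sketch of reading them, assuming a Linux system; the helper names (`parse_state`, `describe_state`, `process_state`) are made up for this illustration:

```python
from pathlib import Path

# State letters as listed on the slide.
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "defunct (zombie)",
}


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split after the LAST ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def describe_state(letter: str) -> str:
    """Translate a state letter into the description used on the slide."""
    return STATE_NAMES.get(letter, "other")


def process_state(pid: int) -> str:
    """Read the current state letter of a live process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())


if __name__ == "__main__":
    import os

    letter = process_state(os.getpid())
    print(letter, describe_state(letter))
```

`ps` and `top` show the same letter in their STAT / S columns.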

  54. 21
    ➡ Most processes are sleeping.
    ➡ External processes (and the kernel) can
    “wake up” a process at any time by
    sending “signals”.
    ➡ Fire signals with “kill”.
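
A small sketch of "firing" a signal at a sleeping process, assuming a Unix-like system. The function name `terminate_sleeper` is invented for this example; `os.kill` does the same thing as the `kill` command:

```python
import os
import signal
import subprocess
import sys


def terminate_sleeper() -> int:
    """Start a sleeping child, send it SIGTERM, and return its exit status."""
    # The child does nothing but sleep (state S, interruptible sleep).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    # Same effect as running `kill -TERM <pid>` from a shell.
    os.kill(child.pid, signal.SIGTERM)
    # wait() also reaps the child, so it does not linger as a zombie.
    return child.wait(timeout=5)


if __name__ == "__main__":
    # A negative return code means "terminated by that signal number".
    print(terminate_sleeper())
```

The child never gets to finish its 30 seconds of sleep: the signal wakes it up and the default SIGTERM action ends it.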

  58. 22
    ➡ Uninterruptible means it won’t handle
    signals (directly), but waits on its task to
    finish (it must wake up by itself).
    ➡ Used for high-performance loops that
    need to focus (like I/O).
    ➡ Still can be preempted by the scheduler!

  63. 23
    ➡ Zombies aren’t bad.
    ➡ It’s just bad programming or
    administration that creates
    zombies.
    ➡ They will not eat brains
    (at least not much).
    ➡ But there shouldn’t be many.
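
To make the "bad programming" point concrete, here is a sketch that creates a zombie on purpose and then cleans it up. It assumes Linux (it reads `/proc`) and that the parent has not set SIGCHLD to be ignored; `state_of` and `zombie_demo` are hypothetical helper names:

```python
import subprocess
import sys
import time
from pathlib import Path


def state_of(pid: int):
    """Return the state letter of a process, or None if it no longer exists."""
    try:
        stat_line = Path(f"/proc/{pid}/stat").read_text()
    except (FileNotFoundError, ProcessLookupError):
        return None
    return stat_line.rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Let a child exit without reaping it, then reap it.

    Returns the state seen before and after the parent calls wait().
    """
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    deadline = time.monotonic() + 5
    state = state_of(child.pid)
    # The child has exited, but nobody has collected its exit status yet.
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.05)
        state = state_of(child.pid)
    child.wait()  # reaping: the kernel can now drop the process entry
    return state, state_of(child.pid)


if __name__ == "__main__":
    print(zombie_demo())
```

A zombie only holds a slot in the process table, which is why a few are harmless and many point at a parent that never calls `wait()`.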

  64. 24
    Load average

  68. 25
    ➡ 1-, 5-, and 15-minute averages.
    ➡ Calculated from the number of runnable
    processes (on Linux, processes in
    uninterruptible sleep count too).
    ➡ How to read it depends on the number of CPUs!

  74. 26
    14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27
    ➡ 1.52 average runnable (or blocking)
    processes in the last minute.
    ➡ 0.66 average over the last 5 minutes.
    ➡ 0.27 average over the last 15 minutes.
    ➡ Single CPU: 52% more work than it can handle.
    ➡ Quad-core system: not doing very much.
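
The single-CPU versus quad-core reading is just a division by the number of CPUs. A rough sketch using the `uptime` line from the slide; `parse_loadavg` and `load_per_cpu` are names chosen for this example:

```python
import os


def parse_loadavg(uptime_line: str) -> tuple:
    """Extract the 1-, 5- and 15-minute averages from `uptime` output."""
    tail = uptime_line.rsplit("load average:", 1)[1]
    one, five, fifteen = (float(value) for value in tail.split(","))
    return one, five, fifteen


def load_per_cpu(load: float, cpus: int) -> float:
    """Normalize a load average; 1.0 means every CPU is fully occupied."""
    return load / cpus


if __name__ == "__main__":
    # Live values for the current machine instead of a pasted line.
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1-minute load per CPU: {load_per_cpu(one, cpus):.2f}")
```

With the slide's numbers, 1.52 on one CPU stays 1.52 (overloaded), while on four CPUs it is 0.38 per CPU (mostly idle).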

  75. 27
    Memory

  76. 28
    Q: How much memory does this process use?
    This is a REALLY hard question to answer!
    It depends on many factors!

  77. 29
    ➡ Virtual memory
    ➡ Shared memory
    ➡ Resident memory
    ➡ Swapped memory

  82. 30
    ➡ 4GB virtual address space per process (on 32-bit
    systems), even if you have less memory installed.
    ➡ 1GB of it is reserved for the kernel.
    ➡ The kernel can swap out memory.
    ➡ On access, the CPU page-faults and the kernel
    loads the pages back.

  85. 31
    ➡ A process can allocate memory without
    necessarily using it (for instance:
    preallocation).
    ➡ VIRT will increase!
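
A sketch of the preallocation effect, assuming Linux: it maps 64 MiB without touching it and compares `VmSize` (what `top` shows as VIRT) and `VmRSS` (RES) from `/proc/self/status`. The helper names `memory_kb` and `preallocate` are invented for this example:

```python
import mmap


def memory_kb() -> dict:
    """Return VmSize (VIRT) and VmRSS (RES) of this process in kB."""
    values = {}
    with open("/proc/self/status") as status:
        for line in status:
            key, _, rest = line.partition(":")
            if key in ("VmSize", "VmRSS"):
                values[key] = int(rest.split()[0])
    return values


def preallocate(size_bytes: int):
    """Map memory without using it and report usage before and after."""
    before = memory_kb()
    region = mmap.mmap(-1, size_bytes)  # anonymous mapping, pages untouched
    after = memory_kb()
    region.close()
    return before, after


if __name__ == "__main__":
    before, after = preallocate(64 * 1024 * 1024)
    print("VIRT grew by", after["VmSize"] - before["VmSize"], "kB")
    print("RES grew by", after["VmRSS"] - before["VmRSS"], "kB")
```

VIRT jumps by the full 64 MiB while RES barely moves, because no page has been written to yet.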

  87. 33
    Q: How much free memory does this system have?
    This is an easier, but still hard question to answer!

  88. 34
    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:           375        349         25          0        111         94
    -/+ buffers/cache:        143        231
    Swap:          400          7        392
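
The `-/+ buffers/cache` row is derived from the `Mem:` row: buffers and cache can be given back to applications, so they are subtracted from "used" and added to "free". A small sketch of that arithmetic (the function name is made up):

```python
def effective_memory(used: int, free: int, buffers: int, cached: int):
    """Recompute the '-/+ buffers/cache' row from the 'Mem:' row (in MB)."""
    used_by_apps = used - buffers - cached
    free_for_apps = free + buffers + cached
    return used_by_apps, free_for_apps


if __name__ == "__main__":
    # Values from the Mem: row on the slide.
    print(effective_memory(used=349, free=25, buffers=111, cached=94))
```

This gives 144 and 230 where the slide shows 143 and 231; the off-by-one is presumably because `free -m` rounds each column to whole megabytes separately.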

  89. 35
    Monitoring

  90. 36
    ➡ Monitor everything
    ➡ System / infra monitoring
    ➡ Application monitoring

  93. 39
    Logging

  94. 40
    ➡ Log EVERYTHING from all sources.
    ➡ Filter later.

  95. 41
    $ php composer.phar require monolog/monolog
    ➡ syslog
    ➡ files
    ➡ mail
    ➡ slack / hipchat / irc
    ➡ logstash

  96. 42
    System tools

  97. TAIL
    43

  101. 44
    ➡ Most daemons will log into /var/log/*
    ➡ tail -f /var/log/messages
    ➡ Many times, this is ALL you need!
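
For cases where you want the `tail -f` behaviour inside a script (for example to filter or alert on lines), a rough sketch follows. It is a simplified stand-in, not how `tail` itself is implemented, and `new_lines` / `follow` are hypothetical names:

```python
import time


def new_lines(handle) -> list:
    """Return the complete lines appended since the previous call."""
    lines = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line.endswith("\n"):
            # Nothing new, or a half-written line: rewind and retry later.
            handle.seek(position)
            return lines
        lines.append(line.rstrip("\n"))


def follow(path: str, interval: float = 0.5) -> None:
    """Print lines as they are appended to a file (runs until interrupted)."""
    with open(path) as handle:
        handle.seek(0, 2)  # start at the current end of the file
        while True:
            for line in new_lines(handle):
                print(line)
            time.sleep(interval)


if __name__ == "__main__":
    follow("/var/log/messages")
```

Unlike `tail -F`, this sketch does not notice log rotation; it keeps reading the file it opened.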

  106. 45
    ➡ Know your tools (top, htop, vmstat, iostat, ps)
    ➡ Know the /proc filesystem
    ➡ Sniff around with tcpdump, netstat, nc, etc.
    ➡ man

  107. 46
    strace

  108. 47
    ➡ strace displays system calls and signals
    ➡ Communication between applications
    and the kernel.

  109. 48
    $ strace -ff -p <pid>
    ....
    socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
    fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
    write(20, "get ez_client1/acls/"..., 44) = 44
    read(20, "END\r\n", 8196) = 5
    write(20, "get ez_client1/acl/g"..., 40) = 40
    read(20, "END\r\n", 8196) = 5
    write(20, "quit\r\n", 6) = 6
    shutdown(20, 2 /* send and receive */) = 0
    close(20) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/client1/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/client1/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
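
Output like the trace above gets long quickly, and the interesting part is often the calls that fail over and over (here: template lookups ending in ENOENT). A sketch that tallies failed calls from saved strace output; the regular expression is a simplification that only handles single-line entries, and `failed_calls` is a name made up for this example:

```python
import re
from collections import Counter

# Syscall name, its arguments, then "= -1 ERRNO" as printed by strace.
FAILED_CALL = re.compile(r"^(\w+)\(.*\)\s+=\s+-1\s+([A-Z]+)")

# A few lines copied from the trace on the slide.
SAMPLE = """\
close(20) = 0
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/client1/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/client1/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
"""


def failed_calls(lines) -> Counter:
    """Count failed system calls per (syscall, errno) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    for (syscall, errno), count in failed_calls(SAMPLE.splitlines()).most_common():
        print(f"{count:4d}  {syscall}  {errno}")
```

A high count for one pair, such as `access` with `ENOENT`, is a hint about where the application is wasting work.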

  110. 49
    $ strace ping www.google.com
    ....
    mprotect(0xb757f000, 4096, PROT_READ) = 0
    munmap(0xb76d8000, 44104) = 0
    stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
    socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
    connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
    gettimeofday({1347446161, 382120}, NULL) = 0
    poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
    send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
    poll([{fd=3, events=POLLIN}], 1, 5000

  111. ➡ strace -e trace=open
    ➡ strace -ff -p <pid>
    50

  112. 51
    vmstat

  113. 52
    $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system--- ----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy  id wa
     0  0    896  17208 124784 121964    0    0     0     0    73   159  0  1  99  0
     2  0    896  14968 125628 121976    0    0   844     0  6857 29935  5 58  35  2
     2  0    896  14472 125628 121964    0    0     0     0 10691 48526 10 90   0  0
     2  0    896  13976 126120 121952    0    0   476   252 10430 49144  7 91   0  2
     0  0    896  12744 127016 121968    0    0   896     0  7799 38732  3 71  23  3
     0  0    896  12744 127016 121964    0    0     0     0    30    93  0  0 100  0
     0  0    896  12760 127016 121964    0    0     0     0    30    92  0  0 100  0
     0  0    896  12760 127016 121964    0    0     0     0    29    99  0  1  99  0
     0  0    896  12760 127016 121964    0    0     0     0    32   110  0  0 100  0
     0  0    896  12752 127024 121964    0    0     0   324    33   108  0  0  99  1
     0  0    896  12752 127024 121964    0    0     0     0    30   103  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0    12    34   108  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0     0    30   105  0  0 100  0
     0  0    896  12752 127032 121964    0    0     0   236    80   101  0  1  99  0
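
In the sample above, the busy seconds stand out through the `sy` (kernel CPU time) and `cs` (context switches) columns. A sketch that parses rows of this 16-column layout and flags kernel-heavy intervals; the 50% threshold and the function names are arbitrary choices for illustration:

```python
# Column names in the order vmstat prints them on the slide.
FIELDS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_row(line: str) -> dict:
    """Turn one vmstat data row into a dict keyed by column name."""
    return dict(zip(FIELDS, map(int, line.split())))


def busy_in_kernel(row: dict, threshold: int = 50) -> bool:
    """True when at least `threshold` percent of CPU time is system time."""
    return row["sy"] >= threshold


if __name__ == "__main__":
    row = parse_row("2 0 896 14472 125628 121964 0 0 0 0 10691 48526 10 90 0 0")
    print(row["sy"], row["cs"], busy_in_kernel(row))
```

Newer vmstat versions print an extra `st` column, so the field list would need adjusting there.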

  114. 53
    SystemTap / DTrace

  115. 54
    ➡ Unobtrusive probes inside the kernel
    ➡ Scripts written in D language.
    ➡ SUN / Solaris only (licensing)

  116. 55
    ➡ “GPL” version of dtrace
    ➡ Awesome, but complex
    ➡ But you need / want debug info packages

  117. 56
    probe syscall.open {
        printf("%s(%d) open (%s)\n", execname(), pid(), argstr);
    }
    stap syscall.stp

  118. 57
    ➡ There are some “providers” in the PHP
    core (zend_dtrace.{c,h,d})

  119. 58
    ➡ Valgrind
    ➡ GDB
    ➡ XDebug / profiler
    ➡ MySQL proxy

  120. 59
    Think about your app / infra BEFORE going live...

  121. Make a plan
    60

  127. 61
    ➡ Design for (horizontal) scalability.
    ➡ Remove SPOFs.
    ➡ Vertical scalability is easier, but more
    restrictive.
    ➡ Configuration is key.
    ➡ Don’t run at full capacity. Have a contingency
    buffer for peaks and know how to scale out.

  132. 62
    ➡ One machine for one purpose (app / mail / cron /
    db / etc).
    ➡ Virtual machines are easy to set up and maintain
    (puppet / ansible) and are cheap.
    ➡ Do as much as possible asynchronously.
    ➡ Message queues are easy to implement
    (gearman / *MQ etc).
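
The idea behind the queue advice, sketched in a single process: the request handler only enqueues the slow work and returns, while a worker picks it up later. This uses Python's in-process `queue.Queue` as a stand-in; a real setup would put Gearman or a message broker between separate machines, and all names here are hypothetical:

```python
import queue
import threading


def run_worker(jobs: queue.Queue, results: list) -> None:
    """Process jobs until the None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work
            break
        results.append(f"sent mail to {job}")  # stand-in for slow work


def handle_requests(addresses) -> list:
    """Enqueue one job per address and let a background worker handle them."""
    jobs = queue.Queue()
    results = []
    worker = threading.Thread(target=run_worker, args=(jobs, results))
    worker.start()
    for address in addresses:
        jobs.put(address)  # returns immediately; the request is not blocked
    jobs.put(None)
    worker.join()
    return results


if __name__ == "__main__":
    print(handle_requests(["a@example.com", "b@example.com"]))
```

The web request stays fast even when sending mail is slow, and the worker can be moved to its own machine later.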

  133. 63
    Recap
    Conclusion

  139. 65
    ➡ Don’t reboot, debug!
    ➡ Analyze what’s going on,
    ➡ find and isolate the culprit.
    ➡ Treat the problem, not the symptoms.

  143. 66
    ➡ There are many tools out there to analyze your
    system in real time.
    ➡ Know your running environment
    (even if it’s “not your business”).
    ➡ Ask for 3rd-party help if needed.

  145. Find me on twitter: @jaytaph
    Find me for development and training: www.noxlogic.nl
    Find me on email: [email protected]
    Find me for blogs: www.adayinthelifeof.nl
    Thank You!
    https://joind.in/14205
