Don't reboot, debug!

Joshua Thijssen
September 18, 2015

Transcript

  1. Don't reboot, debug!
    A medic first aid course in debugging your server
    Joshua Thijssen
    @JayTaph

  5. It is
    It's not

  6. Have you tried turning it off and on again?

  7. Find the culprit

  9. Bottleneck Troubleshooting Flowchart (BTF)
    Site is slow or not responding.
    It’s your DB

  14. Other causes:
    ➡ Apache / PHP / nginx/php-fpm
    ➡ Monitoring / backup
    ➡ Hanging cron jobs & runaway tools
    ➡ Connectivity / DNS problems

  16. Linux 101
    201

  17. Processes

  22. ➡ Isolated user space.
    ➡ PID (process id) and state.
    ➡ Kernel “preempts”, or process yields.
    ➡ Multitasking.

  27. ➡ R Running or runnable
    ➡ S Interruptible sleep
    ➡ D Uninterruptible sleep
    ➡ Z Defunct process (zombies)

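To see how many processes sit in each of these states, one could tally the first letter of the STAT column that `ps` prints. A minimal Python sketch: the helper name `count_states` and the sample input are made up for illustration, and on a real system the input would be the output of `ps -eo stat=`.

```python
from collections import Counter

# State letters as listed on the slide.
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "Z": "defunct (zombie)",
}


def count_states(ps_stat_output: str) -> Counter:
    """Tally processes per main state letter from `ps -eo stat=` output."""
    counts = Counter()
    for line in ps_stat_output.splitlines():
        stat = line.strip()
        if stat:
            counts[stat[0]] += 1  # extra characters such as 's' or '+' are modifiers
    return counts


# Hypothetical sample, not taken from a real server.
sample = "Ss\nS\nR+\nD\nZ\nSl\n"
for letter, number in sorted(count_states(sample).items()):
    print(f"{letter} {number:3d}  {STATE_NAMES.get(letter, 'other')}")
```

A large or growing count for D or Z is usually the thing worth investigating first.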

  32. ➡ Most processes are sleeping.
    ➡ External processes (and the kernel) can
    “wake up” a process at any time by
    sending “signals”.
    ➡ Fire signals with “kill”.

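What `kill` does can be sketched from Python as well: start a child that sleeps, send it SIGTERM, and look at how it ended. This is an illustration only; the function name `terminate_sleeper` is an assumption, and the child is a throwaway Python process rather than a real daemon.

```python
import signal
import subprocess
import sys


def terminate_sleeper() -> int:
    """Start a sleeping child, send it SIGTERM, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGTERM)  # same effect as `kill -TERM <pid>`
    return child.wait()                # negative value: killed by that signal


status = terminate_sleeper()
print("child ended with status", status)
```

On POSIX systems `wait()` reports a negative number when the child was ended by a signal, so the sleeping process was woken up only to be terminated.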

  36. ➡ Uninterruptible means it won’t handle
    signals (directly), but waits on its task to
    finish (it must wake up by itself).
    ➡ Used for high-performance loops that
    need to focus (like I/O).
    ➡ Still can be preempted by the kernel.

  40. ➡ Zombies aren’t bad.
    ➡ It’s just bad programming or
    administration that creates
    zombies.
    ➡ But there shouldn’t be many.

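A zombie is simply a child that has exited while its parent has not yet collected the exit status. A small Python sketch of that life cycle, assuming a POSIX system; the function name `reap_child` and the exit code 7 are arbitrary choices for the example.

```python
import os
import time


def reap_child(exit_code: int = 7) -> int:
    """Fork a child that exits at once, then collect its exit status."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)          # child: exit immediately
    # Parent: until waitpid() is called the exited child stays in the
    # process table as a zombie (state Z in ps).
    time.sleep(0.1)
    _, status = os.waitpid(pid, 0)   # reaping removes the zombie
    return os.WEXITSTATUS(status)


exit_code = reap_child(7)
print("child exit code:", exit_code)
```

A program that forks but never calls `waitpid()` (or an equivalent) is the kind of "bad programming" that leaves zombies around.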

  41. Load average

  45. ➡ 1-minute, 5-minute and 15-minute averages
    ➡ Calculated as the number of runnable
    processes (but has more sources
    nowadays).
    ➡ Depends on the number of CPUs!

  51. 14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27
    ➡ 1.52 average runnable processes
    in the last minute.
    ➡ 0.66 average in 5 minutes
    ➡ 0.27 average in 15 minutes.
    ➡ Single CPU: 52% more than it can handle.
    ➡ Quad core system: not doing very much

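The single-CPU versus quad-core reading comes down to dividing the load by the number of CPUs. A short Python sketch using the `uptime` line from the slide; the helper names `parse_load_averages` and `load_per_cpu` are made up for this example.

```python
def parse_load_averages(uptime_line: str) -> tuple:
    """Extract the 1, 5 and 15 minute load averages from an `uptime` line."""
    numbers = uptime_line.rsplit("load average:", 1)[1]
    return tuple(float(part) for part in numbers.split(","))


def load_per_cpu(load: float, cpus: int) -> float:
    """Normalise a load average by the number of CPUs."""
    return load / cpus


line = "14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27"
one_minute, five_minutes, fifteen_minutes = parse_load_averages(line)

for cpus in (1, 4):
    per_cpu = load_per_cpu(one_minute, cpus)
    print(f"{cpus} CPU(s): {per_cpu:.2f} runnable processes per CPU")
```

Anything above 1.0 per CPU means work is queueing. For live values, Python's `os.getloadavg()` and `os.cpu_count()` can replace the parsed line.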

  52. Memory

  53. Q: How much memory does this process use?
    This is a REALLY hard question to answer!
    It depends on many factors!

  56. ➡ Virtual memory (VIRT)
    ➡ Shared memory (SHR, SHRD)
    ➡ Resident memory (RES or RSS)
    ➡ Swapped memory (SWP, SWAP)

  60. (on a 32-bit system)
    ➡ Each process has a 4 GB virtual address space.
    ➡ Even if you have less memory installed.
    ➡ 1 GB of it is reserved for the kernel.

  63. [Diagram: the 4 GB virtual address space, 0x00000000 to 0xFFFFFFFF, split at 0xC0000000 into 3 GB of user space and 1 GB of kernel space; a translation table maps virtual memory onto physical memory]

  68. [Diagram: the virtual memory of Process A, Process B and Process C mapped onto physical memory]

  69. Allocating memory
    ➡ New phone book entries are created.
    ➡ VIRT will increase.
    ➡ Allocating memory != using memory.

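The difference between allocating and using can be watched on Linux through `/proc/self/status`, where `VmSize` corresponds to VIRT and `VmRSS` to RES. The sketch below is an assumption-laden illustration: the helper names are invented, the 64 MiB size is arbitrary, and `demo()` only works where `/proc` exists, so it is defined but not called.

```python
import mmap


def vm_fields(status_text: str) -> dict:
    """Return VmSize and VmRSS in kB from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(value.split()[0])  # value looks like "  123456 kB"
    return fields


def read_own_memory() -> dict:
    """Linux only: read the counters of the current process."""
    with open("/proc/self/status") as status:
        return vm_fields(status.read())


def demo(size: int = 64 * 1024 * 1024) -> None:
    """Allocate `size` bytes, then touch them, printing the counters each time."""
    print("before :", read_own_memory())
    block = mmap.mmap(-1, size)           # anonymous mapping: allocated, not used
    print("mapped :", read_own_memory())  # VmSize is expected to grow here
    for offset in range(0, size, mmap.PAGESIZE):
        block[offset] = 1                 # writing to a page makes it resident
    print("touched:", read_own_memory())  # VmRSS is expected to grow here
    block.close()
```

Calling `demo()` on a Linux machine should show VIRT jumping as soon as the mapping exists, while RES follows only once the pages are written to.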

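  71. <?php
    // pcntl_fork() returns the child's PID in the parent, 0 in the child and -1 on failure.
    $pid = pcntl_fork();
    if ($pid === -1) {
        exit("Could not fork\n");
    } elseif ($pid) {
        echo "Hello, this is the parent process\n";
    } else {
        echo "Hello, this is the child process\n";
    }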

  74. [Diagram: fork() creates Process B as a copy of Process A]

  75. [Diagram: right after fork(), the virtual pages A1, B1, C1 of one process and A1`, B1`, C1` of the other all map to the same physical pages A1, B1, C1]

  76. [Diagram: after a write to page B in one process, that process maps a new physical page B2; A1 and C1 are still shared]

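Copy-on-write itself is invisible to a program; what can be observed is its effect, namely that a write in one process after `fork()` never shows up in the other. A minimal Python sketch of that effect on a POSIX system. The function name `fork_and_modify` and the values "B1" and "B2" (borrowed from the diagram labels) are illustrative only.

```python
import os


def fork_and_modify() -> tuple:
    """Fork, let the child change a value, and return (parent_view, child_view)."""
    data = ["B1"]
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands in the child's own copy of the memory.
        os.close(read_end)
        data[0] = "B2"
        os.write(write_end, data[0].encode())
        os._exit(0)
    # Parent: read what the child saw, then reap the child.
    os.close(write_end)
    child_view = os.read(read_end, 16).decode()
    os.close(read_end)
    os.waitpid(pid, 0)
    return data[0], child_view


parent_view, child_view = fork_and_modify()
print("parent sees", parent_view, "and child saw", child_view)
```

The parent keeps its original value, which is why RES numbers of forked workers cannot simply be added up: pages that were never written to are still shared.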

  78. How much memory is our server using?

  80. $ free -m
                 total       used       free     shared    buffers     cached
    Mem:          3963       3500        462          0        722       1263
    -/+ buffers/cache:       1515       2448
    Swap:          400         20        379

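The `-/+ buffers/cache` row is derived from the `Mem:` row: buffers and cache are memory the kernel can hand back, so they are subtracted from "used" and added to "free". A Python sketch of that arithmetic on the output shown above; `parse_mem_line` is a made-up helper, and it assumes this older `free` layout (newer versions print different columns).

```python
def parse_mem_line(free_output: str) -> dict:
    """Parse the `Mem:` row of old-style `free -m` output into named columns."""
    columns = ("total", "used", "free", "shared", "buffers", "cached")
    for line in free_output.splitlines():
        if line.strip().startswith("Mem:"):
            values = [int(value) for value in line.split()[1:]]
            return dict(zip(columns, values))
    raise ValueError("no Mem: line found")


free_output = """\
             total       used       free     shared    buffers     cached
Mem:          3963       3500        462          0        722       1263
-/+ buffers/cache:       1515       2448
Swap:          400         20        379
"""

mem = parse_mem_line(free_output)
used_by_programs = mem["used"] - mem["buffers"] - mem["cached"]
available = mem["free"] + mem["buffers"] + mem["cached"]
print("used by programs:", used_by_programs, "MB")
print("available:", available, "MB")
```

The second number comes out one megabyte below the 2448 that `free` prints, presumably because `free` rounds after doing the sum in kilobytes.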

  81. Monitoring

  82. ➡ Monitor everything!
    ➡ System / infrastructure
    ➡ Application level

  85. ➡ With monitoring you have an excellent
    idea of:
    ➡ what is happening
    ➡ what happened
    ➡ what will likely be happening

  86. Logging

  87. ➡ Log everything from everywhere.
    ➡ Filter later.

  88. ➡ syslog
    ➡ files
    ➡ mail
    ➡ Slack / HipChat / IRC
    ➡ logstash
    $ php composer.phar require monolog/monolog

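Monolog is the PHP route to this. The "log everything, filter later" idea itself is sketched here with Python's standard `logging` module: the logger accepts every level, and each destination decides what it keeps. The channel name "shop", the two in-memory targets and the log messages are all invented for the example.

```python
import io
import logging

everything = io.StringIO()  # stands in for a file or syslog target
alerts = io.StringIO()      # stands in for mail or a chat channel

logger = logging.getLogger("shop")  # "shop" is a made-up channel name
logger.setLevel(logging.DEBUG)      # log everything ...
logger.propagate = False

full_handler = logging.StreamHandler(everything)
full_handler.setLevel(logging.DEBUG)

alert_handler = logging.StreamHandler(alerts)
alert_handler.setLevel(logging.WARNING)  # ... and filter per destination

for handler in (full_handler, alert_handler):
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)

logger.debug("cache miss for key acls")
logger.warning("memcached did not answer")

print(everything.getvalue(), end="")
print(alerts.getvalue(), end="")
```

The full target receives both lines, the alert target only the warning, so nothing is lost at the source and noisy channels stay quiet.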

  89. System tools

  91. tail

  92. ➡ Most daemons log to /var/log
    ➡ tail -f /var/log/messages

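What `tail -f` does can be approximated in a few lines: jump to the end of the file, then keep polling for new lines. A Python sketch under simple assumptions (no log rotation, complete lines only); the function name `follow` and the throwaway demo file are not from the talk.

```python
import os
import tempfile
import time


def follow(path: str, poll_interval: float = 0.5):
    """Return a generator yielding lines appended to `path` from now on."""
    handle = open(path, "r")
    handle.seek(0, os.SEEK_END)  # skip what is already in the file

    def new_lines():
        with handle:
            while True:
                line = handle.readline()
                if line:
                    yield line
                else:
                    time.sleep(poll_interval)  # nothing new yet, try again

    return new_lines()


# Demo on a throwaway file instead of /var/log/messages.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as log:
    log.write("old line\n")

lines = follow(log.name, poll_interval=0.01)
with open(log.name, "a") as writer:
    writer.write("new line\n")

first_new = next(lines)
print(first_new, end="")
lines.close()
os.unlink(log.name)
```

Only the line written after `follow()` was called is reported, which matches how `tail -f` behaves once it has printed the existing tail of the file.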

  93. strace

  94. ➡ strace displays system calls and signals
    ➡ Communication between applications
    and the kernel.

  95. $ strace -ff -p <pid>
    ....
    socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
    fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
    write(20, "get ez_client1/acls/"..., 44) = 44
    read(20, "END\r\n", 8196) = 5
    write(20, "get ez_client1/acl/g"..., 40) = 40
    read(20, "END\r\n", 8196) = 5
    write(20, "quit\r\n", 6) = 6
    shutdown(20, 2 /* send and receive */) = 0
    close(20) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/client1/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/client1/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/client1/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)

  96. $ strace ping www.google.com
    ....
    mprotect(0xb757f000, 4096, PROT_READ) = 0
    munmap(0xb76d8000, 44104) = 0
    stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
    socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
    connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
    gettimeofday({1347446161, 382120}, NULL) = 0
    poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
    send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
    poll([{fd=3, events=POLLIN}], 1, 5000

  97. ➡ strace -e trace=open
    ➡ strace -ff -p <pid>

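Long traces get easier to read once the failing calls are counted. A Python sketch that tallies system calls that returned -1, grouped by call and error name; the regular expression and the helper name `failed_calls` are assumptions about the common strace line format, and the sample lines are copied from the trace in the deck.

```python
import re
from collections import Counter

# Matches lines such as:
#   access("/path", F_OK) = -1 ENOENT (No such file or directory)
FAILED_CALL = re.compile(r"^(\w+)\(.*\)\s+=\s+-1\s+([A-Z0-9]+)")


def failed_calls(strace_output: str) -> Counter:
    """Count (syscall, errno name) pairs for calls that returned -1."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = FAILED_CALL.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


sample = """\
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/client1/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
access("/userdata/client1/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
"""

for (syscall, error), number in failed_calls(sample).most_common():
    print(number, syscall, error)
```

In the full trace this kind of summary would make the repeated `access()` misses on template paths stand out immediately.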

  98. SystemTap / DTrace

  99. ➡ Unobtrusive probes inside the kernel
    ➡ Scripts written in the D language.
    ➡ Sun / Solaris only (licensing)

  100. ➡ SystemTap
    ➡ “GPL” version of DTrace
    ➡ Awesome, but complex
    ➡ But you need / want debug info packages

  101. probe syscall.open {
      printf("%s(%d) open (%s)\n", execname(), pid(), argstr);
    }
    $ stap syscall.stp

  102. ➡ There are some “providers” in the PHP
    core (zend_dtrace.{c,h,d})
    ➡ file / line
    ➡ function entry / exit
    ➡ exception caught / thrown

  103. Other tools
    ➡ Valgrind
    ➡ GDB
    ➡ Xdebug / Profiler
    ➡ MySQL Proxy / Charles

  105. Find me on twitter: @jaytaph
    Find me for development and training: www.noxlogic.nl
    Find me on email: [email protected]
    Find me for blogs: www.adayinthelifeof.nl
    Thank You!
    https://joind.in/talk/view/15191
