Don't reboot, debug! - PHPNW12

Joshua Thijssen

October 07, 2012

Transcript

  1. Don’t reboot, debug!
    Joshua Thijssen
    http://www.techademy.nl

  2. .whoami
    Joshua Thijssen / The Netherlands
    Freelance consultant, developer and
    trainer @ NoxLogic / TechAdemy
    Development in PHP, Python, Perl, C,
    Java.. also sysadmin.
    Lead developer of Saffire
    Blog: http://adayinthelifeof.nl
    Email: [email protected]
    Twitter: @jaytaph

  3. The most ___ question you can ask:
    wrong
    incorrect
    irritating
    annoying
    stupendous
    evil
    improper
    unethical
    immoral
    unjust
    wicked
    inaccurate

  4. Have you tried turning it off and on again?

  10. Have you tried turning it off and on again?

  13. Fix it! Every minute we’re
    losing money!

  15. Deal now or deal later?

  16. We will deal with the problem later!
    ➡ Is it reproducible later? Probably not.
    ➡ Are you solving the problem, or desperately
    trying to remove a symptom?
    ➡ Short-term relief vs long-term solution

  17. Deal with the problem now
    ➡ Actually analyze, and maybe fix, the problem.
    ➡ It will cost less to analyze/fix it now than to
    fix it later.
    ➡ You just saved a few gazillion dollars!

  19. ➡ We reboot our system every night!
    ➡ Why? Memory leaks? Just crappy code?
    ➡ There is some state not handled correctly! Fix it!
    ➡ What happens when the number of users increases by 200%?
    Restart every 12 hours?
    ➡ Here’s hoping we never get many visitors!

  20. Find the culprit

  21. Bottleneck Troubleshooting Flowchart (BTF)
    Site is slow or
    not responding.
    It’s your DB

  22. MySQL

  23. MySQL is easy to set up and configure
    ➡ We use MySQL because it’s so easy to
    set up and use.
    ➡ No, it’s not...

  24. my.cnf
    max_heap_table_size = 16M
    tmp_table_size = 32M
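
The two settings above interact: for in-memory temporary tables using the MEMORY engine, MySQL applies whichever of `tmp_table_size` and `max_heap_table_size` is smaller, so the 32M on the slide is effectively capped at 16M. A minimal Python sketch of that rule follows; the helper names `parse_size` and `effective_in_memory_limit` are made up for illustration and are not part of MySQL.

```python
import re

# Multipliers for the size suffixes accepted in my.cnf (K, M, G).
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a my.cnf size such as '16M' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def effective_in_memory_limit(tmp_table_size: str, max_heap_table_size: str) -> int:
    """Hypothetical helper: the smaller of the two settings wins."""
    return min(parse_size(tmp_table_size), parse_size(max_heap_table_size))


limit = effective_in_memory_limit("32M", "16M")
print(limit // 1024 ** 2, "MiB")  # prints: 16 MiB
```

Temporary tables that grow past this limit are converted to on-disk tables, which is one way a seemingly generous setting still ends in slow queries.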

  25. Other usual suspects
    ➡ Apache / PHP
    ➡ Monitoring / backup
    ➡ Hanging cron jobs
    ➡ Runaway tools
    ➡ Connectivity / DNS problems

  26. Linux 101

  27. Processes

  28. Processes
    ➡ Isolated userspace
    ➡ PID and state.
    ➡ Kernel “preempts”, or process yields.
    ➡ Multitasking

  30. Process states
    ➡ R Running or runnable
    ➡ S Interruptible sleep
    ➡ D Uninterruptible sleep
    ➡ T Stopped
    ➡ Z Defunct process (zombies)
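
The state letters listed above are the same ones the kernel exposes in `/proc/<pid>/stat`. A small Python sketch, assuming a Linux `/proc` layout, of how to pull the letter out of such a line; the sample line and the helper names are invented for illustration.

```python
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "defunct (zombie)",
}


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split at the LAST ')' before reading the state.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def describe(stat_line: str) -> str:
    """Map the state letter to the wording used on the slide."""
    state = parse_state(stat_line)
    return STATE_NAMES.get(state, f"other ({state})")


# Made-up sample line; on a real system read /proc/<pid>/stat instead.
sample = "1234 (php-fpm: pool www) S 1 1234 1234 0 -1 4194624 100 0 0 0"
print(describe(sample))  # prints: interruptible sleep
```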

  31. Process states
    ➡ Most processes are sleeping.
    ➡ External processes (and the kernel) can
    “wake up” a process at any time.
    ➡ Fire signals with “kill”

  32. Process states
    ➡ Uninterruptible means it won’t handle
    signals (directly).
    ➡ Used for high-performance loops that
    need to focus (like I/O).
    ➡ Can still be preempted by the scheduler!

  33. Zombies
    ➡ Zombies aren’t bad.
    ➡ It’s just bad programming or
    administration that creates
    zombies.
    ➡ They will not eat brains
    (at least not much).
    ➡ But there shouldn’t be many.
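
To make the "bad programming" point concrete: a zombie is a child that has exited but whose parent has not yet collected its exit status. The Python sketch below is Linux-only (it reads `/proc`) and is an illustration under that assumption, not something from the talk; the function names are hypothetical. It creates a zombie on purpose, observes the `Z` state, and then reaps it with `waitpid`.

```python
import os
import time


def proc_state(pid: int) -> str:
    """Read the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return handle.read().rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie(timeout: float = 2.0):
    """Fork a child that exits at once, watch it become a zombie, then reap it.

    Returns (state seen before reaping, whether /proc/<pid> still exists after).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping cleanup handlers

    # Parent: until we call waitpid() the dead child stays around as a zombie.
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    os.waitpid(pid, 0)  # collecting the exit status removes the zombie
    return state, os.path.exists(f"/proc/{pid}")


if __name__ == "__main__":
    print(make_and_reap_zombie())
```

A parent that never reaches the `waitpid` call is what leaves zombies piling up in `ps` output.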

  34. Load average

  35. Load average
    ➡ 1 minute, 5 minute and 15 minute averages
    ➡ Calculated as the number of runnable
    processes.
    ➡ Depends on the number of CPUs!
    ➡ Linux also adds uninterruptible sleeps

  36. Load average (uptime)
    14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27
    ➡ 1.52 average runnable (or blocked)
    processes in the last minute.
    ➡ 0.66 average in 5 minutes
    ➡ 0.27 average in 15 minutes.
    ➡ Single CPU: 52% more than it can handle.
    ➡ quad core system: not doing very much
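
The single-CPU versus quad-core reading above is just the load divided by the number of CPUs. A short Python sketch of that arithmetic, using the `uptime` line from the slide; the function names are made up for illustration.

```python
import re


def parse_load_averages(uptime_line: str):
    """Extract the 1, 5 and 15 minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found")
    return tuple(float(value) for value in match.groups())


def load_per_cpu(load: float, cpus: int) -> float:
    """Runnable (or blocked) processes per CPU; above 1.0 means work is queuing."""
    return load / cpus


line = "14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27"
one_minute, five_minutes, fifteen_minutes = parse_load_averages(line)
for cpus in (1, 4):
    print(f"{cpus} CPU(s): {load_per_cpu(one_minute, cpus):.2f} per CPU")
# prints:
# 1 CPU(s): 1.52 per CPU
# 4 CPU(s): 0.38 per CPU
```

On a live system, `os.getloadavg()` and `os.cpu_count()` from the standard library give the same inputs without parsing text.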

  37. Memory

  38. Memory
    Q: How much memory does this process use?
    This is a REALLY hard question to answer!
    It depends on many factors!

  39. Memory
    ➡ 4GB virtual address space per process (on 32-bit),
    even if you have less memory installed
    ➡ Kernel can swap out memory
    ➡ CPU page-faults and the kernel loads pages back in

  40. Memory
    ➡ Process can allocate memory, but does not
    necessarily use it (for instance: preallocation)
    ➡ VIRT will increase!

  41. Memory (as seen in “top”)
    ➡ Virtual memory
    ➡ Resident memory
    ➡ Shared memory
    ➡ Swapped memory
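
Most of these numbers can also be read per process from `/proc/<pid>/status` on Linux, where `VmSize` roughly corresponds to VIRT in top, `VmRSS` to RES and `VmSwap` to the swapped-out part. A minimal Python sketch of a parser follows; the sample text is invented and the function name is hypothetical.

```python
def parse_status_kb(status_text: str, fields=("VmSize", "VmRSS", "VmSwap")):
    """Pick selected memory fields (in kB) out of /proc/<pid>/status text."""
    result = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields:
            result[name] = int(rest.split()[0])  # the kernel reports kB
    return result


# Invented sample; on Linux use open("/proc/self/status").read() instead.
sample = (
    "Name:\tphp\n"
    "VmSize:\t  250000 kB\n"
    "VmRSS:\t   18000 kB\n"
    "VmSwap:\t       0 kB\n"
)
usage = parse_status_kb(sample)
print(usage)  # prints: {'VmSize': 250000, 'VmRSS': 18000, 'VmSwap': 0}
```

A large gap between `VmSize` and `VmRSS`, as in the sample, is the preallocation effect: memory that was reserved but never touched.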

  42. Memory
    Q: How much free memory does this system have?
    This is an easier, but still hard question to answer!

  43. $ free -m
                 total       used       free     shared    buffers     cached
    Mem:           375        349         25          0        111         94
    -/+ buffers/cache:        143        231
    Swap:          400          7        392
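
The `-/+ buffers/cache` row is derived from the `Mem:` row: buffers and cache count as reclaimable, so they are subtracted from "used" and added to "free". A small Python sketch of that arithmetic with the slide's numbers; the function name is made up for illustration.

```python
def buffers_cache_row(used: int, free: int, buffers: int, cached: int):
    """Recompute the '-/+ buffers/cache' row from the 'Mem:' row (all in MB)."""
    used_by_applications = used - buffers - cached
    available_to_applications = free + buffers + cached
    return used_by_applications, available_to_applications


print(buffers_cache_row(used=349, free=25, buffers=111, cached=94))
# prints: (144, 230)
```

The slide shows 143 and 231 rather than 144 and 230; the one-megabyte differences are presumably rounding, since `free` works from kilobyte values and converts each column separately. Either way, the system has roughly 230 MB available to applications, not 25 MB.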

  44. Monitoring

  45. ➡ Monitor everything
    ➡ System / infra monitoring
    ➡ Application monitoring

  48. Logging

  49. ➡ Log EVERYTHING from all sources.
    ➡ Filter later.

  50. Logstash Graylog2
    wtf

  51. System tools

  52. TAIL

  53. ➡ Most daemons will log to /var/log/*
    ➡ tail -f /var/log/messages
    ➡ Many times, this is ALL you need!

  54. ➡ Know your tools (top, htop, vmstat, iostat, ps)
    ➡ Know the /proc filesystem
    ➡ Sniff around with tcpdump, netstat, nc etc.
    ➡ man

  55. strace

  56. ➡ strace displays system calls and signals
    ➡ Communication between applications and
    the kernel.

  57. socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
    fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
    write(20, "get ez_borentappenschuren-/acls/"..., 44) = 44
    read(20, "END\r\n", 8196) = 5
    write(20, "get ez_borentappenschuren-/acl/g"..., 40) = 40
    read(20, "END\r\n", 8196) = 5
    write(20, "quit\r\n", 6) = 6
    shutdown(20, 2 /* send and receive */) = 0
    close(20) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/borentappenschuren/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    strace -ff -p
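
A trace like the one above gets long quickly, and the interesting part is often the calls that fail: here the repeated `access()` lookups for templates that do not exist. The Python sketch below tallies failed calls by syscall and errno. It assumes plain strace lines without PID or timestamp prefixes, reuses a few lines from the slide as sample input, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches e.g.: access("/some/path", F_OK) = -1 ENOENT (No such file or directory)
FAILED_CALL = re.compile(r"^(?P<call>\w+)\(.*\)\s+=\s+-1\s+(?P<errno>E[A-Z]+)\b")


def count_failures(trace_lines):
    """Count failed system calls, keyed by (syscall name, errno name)."""
    counts = Counter()
    for line in trace_lines:
        match = FAILED_CALL.match(line.strip())
        if match:
            counts[(match["call"], match["errno"])] += 1
    return counts


sample = [
    'read(20, "END\\r\\n", 8196) = 5',
    'mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)',
    'chmod("/tmp/smarty", 0777) = 0',
    'access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK)'
    " = -1 ENOENT (No such file or directory)",
    'access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK)'
    " = -1 ENOENT (No such file or directory)",
]

for (call, errno), hits in count_failures(sample).most_common():
    print(f"{hits:3d}  {call}  {errno}")
```

Not every `-1` is a problem: `EINPROGRESS` on a non-blocking `connect()`, as in the trace above, is expected behaviour.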

  58. mprotect(0xb757f000, 4096, PROT_READ) = 0
    munmap(0xb76d8000, 44104) = 0
    stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
    socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
    connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
    gettimeofday({1347446161, 382120}, NULL) = 0
    poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
    send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
    poll([{fd=3, events=POLLIN}], 1, 5000
    strace ping www.google.com

  59. ➡ strace -e trace=open
    ➡ strace -ff -p

  60. ltrace

  61. ➡ Traces library calls
    ➡ Grep and filter.
    ➡ Careful: ltrace php -r 'echo "hello world";'
    outputs 92476 lines!

  62. SystemTap / DTrace

  63. Dtrace
    ➡ Unobtrusive probes inside the kernel
    ➡ Scripts written in the D language.
    ➡ SUN / Solaris only (licensing)

  64. systemtap
    ➡ “GPL” version of dtrace
    ➡ Awesome, but complex
    ➡ But you need / want debug info packages

  65. systemtap example
    probe syscall.open {
      printf("%s(%d) open (%s)\n", execname(), pid(), argstr);
    }
    stap syscall.stp

  66. systemtap / dtrace
    ➡ There are some “providers” in the PHP
    core (zend_dtrace.{c,h,d})

  67. Other really cool tools to look at
    ➡ Valgrind
    ➡ GDB
    ➡ XDebug / profiler
    ➡ MySQL proxy

  68. Think about your app / infra BEFORE going live...

  69. ➡ Design for (horizontal) scalability.
    ➡ Remove SPOFs.
    ➡ Vertical scalability is easier, but more
    restrictive.
    ➡ Configuration is key.
    ➡ Don’t run at full capacity. Have a contingency
    buffer for peaks.

  70. Make a plan

  71. Recap
    Conclusion

  73. ➡ Don’t reboot, debug!
    ➡ Analyze what’s going on,
    ➡ and find and isolate the culprit.
    ➡ Treat the problem, not the symptoms.

  74. ➡ There are many tools out there to analyze your
    system in real time.
    ➡ Know your running environment (even if it’s “not
    your business”).
    ➡ Ask for third-party help if needed.

  75. ➡ One machine for one purpose (app / mail / cron /
    db / etc).
    ➡ Virtual machines are easy to set up and maintain
    (puppet) and are cheap.
    ➡ Try to async as much as possible.
    ➡ Message queues are easy to implement
    (gearman / *MQ etc).
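
The idea behind "async as much as possible" is that the request handler only enqueues a job and returns, while a worker does the slow part later. The Python sketch below shows that shape with the standard library's in-process `queue.Queue` as a stand-in; it is not the Gearman or any *MQ API, and the job (sending mail) and all names are invented for illustration.

```python
import queue
import threading


def start_worker(jobs: queue.Queue, results: list) -> threading.Thread:
    """Start a background worker that processes jobs until it sees None."""

    def run():
        while True:
            job = jobs.get()
            if job is None:  # sentinel: no more work
                jobs.task_done()
                break
            # Stand-in for slow work such as sending a confirmation mail.
            results.append(f"sent mail to {job}")
            jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def handle_requests(addresses):
    """Enqueue one job per request, then wait for the worker to drain them."""
    jobs = queue.Queue()
    results = []
    worker = start_worker(jobs, results)
    for address in addresses:
        jobs.put(address)  # returns immediately; the request is not blocked
    jobs.put(None)
    jobs.join()  # wait until every queued item has been processed
    worker.join()
    return results


print(handle_requests(["alice@example.org", "bob@example.org"]))
```

A real message queue adds what this sketch lacks: persistence, workers on other machines, and retries when a worker dies.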

  76. Questions?

  77. Thank You!
    Find me on twitter: @jaytaph
    Find me for development and training: www.noxlogic.nl
    Find me on email: [email protected]
    Find me for blogs: www.adayinthelifeof.nl
    https://joind.in/6939