Don't reboot, debug! IPC

Joshua Thijssen
October 30, 2013

Transcript

  1. Don’t reboot, debug!
    Joshua Thijssen

  2. 2
    Joshua Thijssen
    Freelance consultant, developer and
    trainer @ NoxLogic
    Founder of the Dutch Web Alliance
    Development in PHP, Python, Perl, C,
    Java. Lead developer of Saffire.
    Blog: http://adayinthelifeof.nl
    Email: [email protected]
    Twitter: @jaytaph

  3. 3
    Our website is not working anymore!!!

  4. Have you tried turning it off and on again?

  5. 5

  6. 6

  7. 7

  8. 8

  9. 9

  10. Have you tried turning it off and on again?

  11. 11

  12. 12

  13. 13
    Fix it! Every minute we’re
    losing money!
    Manager

  14. 14

  15. 15
    Deal now or deal later?

  16. 16
    ➡ Is it reproducible later? Probably not.
    ➡ Are you solving the problem, or
    desperately trying to remove a symptom?
    ➡ Short-term relief vs. long-term solution

  17. ➡ Actually analyze, maybe fix the problem.
    ➡ It will cost less to analyze/fix it now, than to
    fix it later.

  18. 18

  19. 19
    ➡ We reboot our system every night!
    ➡ Why? Memory leaks? Just crappy code?
    ➡ There is some state not handled correctly! Fix it!
    ➡ What happens when the number of users grows to
    200%? Restart every 12 hours?
    ➡ Let’s hope you won’t get many visitors!

  20. Find the culprit

  21. Bottleneck Troubleshooting Flowchart (BTF)
    Site is slow or not responding. ➡ It’s your DB.

  22. 22
    MySQL

  23. 23
    MySQL is easy to set up and configure

  24. 24
    max_heap_table_size = 16M
    tmp_table_size = 32M
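
These two settings interact: MySQL caps in-memory internal temporary tables at the smaller of `tmp_table_size` and `max_heap_table_size`, so with the values above the 32M setting never takes effect. A minimal Python sketch of that rule follows; the helper names are hypothetical and not part of MySQL.

```python
def parse_size(value: str) -> int:
    """Convert a MySQL-style size such as '16M' into bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    value = value.strip().upper()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def effective_tmp_table_limit(tmp_table_size: str, max_heap_table_size: str) -> int:
    """In-memory temporary tables are capped by the smaller of the two settings."""
    return min(parse_size(tmp_table_size), parse_size(max_heap_table_size))


# Values from the slide: the effective limit is the 16M heap limit.
limit = effective_tmp_table_limit("32M", "16M")
print(limit // 1024 ** 2, "MiB")
```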

  25. ➡ Apache / PHP
    ➡ Monitoring / backup
    ➡ Hanging cron jobs & runaway tools
    ➡ Connectivity / DNS problems

  26. 26
    Linux 101

  27. 27
    Processes

  28. 28
    ➡ Isolated userspace
    ➡ PID and state.
    ➡ Kernel “preempts”, or process yields.
    ➡ Multitasking

  29. 29

  30. 30
    ➡ R Running or runnable
    ➡ S Interruptible sleep
    ➡ D Uninterruptible sleep
    ➡ T Stopped
    ➡ Z Defunct process (zombies)
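
The state letters above are what `ps` prints in its STAT column, usually followed by modifier characters. As a rough sketch, the Python below tallies the main state from such a column; it assumes the text comes from something like `ps -eo stat=`, and the sample values are made up for illustration.

```python
from collections import Counter

STATES = {
    "R": "Running or runnable",
    "S": "Interruptible sleep",
    "D": "Uninterruptible sleep",
    "T": "Stopped",
    "Z": "Defunct process (zombie)",
}


def describe_state(stat: str) -> str:
    """Map a STAT value such as 'Ss' or 'Z+' to the name of its main state."""
    return STATES.get(stat[:1], "Unknown")


def count_states(stat_column: str) -> dict:
    """Count processes per main state letter, ignoring modifier characters."""
    letters = [line.strip()[0] for line in stat_column.splitlines() if line.strip()]
    return dict(Counter(letters))


# Hypothetical sample of a STAT column, one process per line.
sample = "Ss\nS\nR+\nZ\nS<\n"
for letter, total in sorted(count_states(sample).items()):
    print(letter, total, describe_state(letter))
```

A sudden pile-up of D or Z entries in such a tally is usually worth a closer look.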

  31. 31
    ➡ Most processes are sleeping.
    ➡ External processes (and the kernel) can
    “wake up” a process at any time by sending
    “signals”.
    ➡ Fire signals with “kill”.
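
To make the signal mechanism concrete, here is a small Python sketch that does what `kill -TERM <pid>` does from the shell. It assumes a POSIX system; the child process is just a sleeping Python interpreter started for the demonstration.

```python
import signal
import subprocess
import sys


def terminate_sleeping_child() -> int:
    """Start a sleeping child, send it SIGTERM, and return its exit status."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(signal.SIGTERM)  # same effect as: kill -TERM <pid>
    # On POSIX, a negative return code means "terminated by that signal".
    return child.wait(timeout=10)


if __name__ == "__main__":
    status = terminate_sleeping_child()
    print("child was ended by signal", -status)
```

The child never runs a handler of its own here, so the default action for SIGTERM (terminate) applies.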

  32. 32
    ➡ Uninterruptible means it won’t handle
    signals (directly), but waits for its task to
    finish.
    ➡ Used for high-performance loops that
    need to focus (like I/O).
    ➡ Can still be preempted by the scheduler!

  33. 33
    ➡ Zombies aren’t bad.
    ➡ It’s just bad programming or
    administration that creates
    zombies.
    ➡ They will not eat brains
    (at least not much).
    ➡ But there shouldn’t be many.

  34. 34
    Load average

  35. 35
    ➡ 1-minute, 5-minute, and 15-minute averages.
    ➡ Calculated from the number of runnable
    processes (on Linux, plus those in uninterruptible sleep).
    ➡ What counts as high depends on the number of CPUs!

  36. 36
    14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27
    ➡ 1.52 average runnable (or blocking)
    processes in the last minute.
    ➡ 0.66 average in 5 minutes
    ➡ 0.27 average in 15 minutes.
    ➡ Single CPU: 52% more than it can handle.
    ➡ quad core system: not doing very much
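
The single-CPU versus quad-core reading can be reproduced by dividing the load by the CPU count. The Python sketch below parses the `uptime` line from this slide; the function names are illustrative only.

```python
import re


def parse_load_averages(uptime_line: str) -> tuple:
    """Extract the 1, 5 and 15 minute load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found")
    return tuple(float(value) for value in match.groups())


def load_per_cpu(load: float, cpu_count: int) -> float:
    """Load relative to capacity: 1.0 means every CPU is fully occupied."""
    return load / cpu_count


line = "14:57:22 up 35 days, 18:57, 1 user, load average: 1.52, 0.66, 0.27"
one_minute, five_minutes, fifteen_minutes = parse_load_averages(line)
for cpus in (1, 4):
    print(f"{cpus} CPU(s): {load_per_cpu(one_minute, cpus):.2f} per CPU")
```

On a live machine, `os.getloadavg()` and `os.cpu_count()` provide the same two inputs without parsing text.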

  37. 37
    Memory

  38. 38
    Q: How much memory does this process use?
    This is a REALLY hard question to answer!
    It depends on many factors!

  39. 39
    ➡ Virtual memory
    ➡ Shared memory
    ➡ Resident memory
    ➡ Swapped memory

  40. 40
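    ➡ 4GB virtual address space per process (on 32-bit
    systems), even if you have less memory installed.
    ➡ 1GB of that is reserved for the kernel.
    ➡ The kernel can swap out memory.
    ➡ Touching a swapped-out page triggers a page fault,
    and the kernel loads the page back in.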

  41. 41
    ➡ A process can allocate memory without
    necessarily using it (for instance: preallocation)
    ➡ VIRT will increase!
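
On Linux, the gap between allocated and actually used memory shows up in `/proc/<pid>/status`, where `VmSize` corresponds to what `top` calls VIRT and `VmRSS` to RES. The following Python sketch pulls out both fields; the sample text and its numbers are invented for illustration.

```python
def memory_fields(status_text: str) -> dict:
    """Return VmSize and VmRSS (in kB) from /proc/<pid>/status style text."""
    fields = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in ("VmSize", "VmRSS"):
            fields[name] = int(rest.split()[0])  # the kernel reports these in kB
    return fields


# Hypothetical excerpt: a process that preallocated far more than it touches.
sample = """Name:\texample
VmSize:\t  204800 kB
VmRSS:\t    5120 kB
"""
usage = memory_fields(sample)
print("virtual:", usage["VmSize"], "kB, resident:", usage["VmRSS"], "kB")
```

For a running process the input would be the contents of `/proc/self/status` or `/proc/<pid>/status`.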

  42. 42

  43. 43
    Q: How much free memory does this system have?
    This is an easier, but still hard question to answer!

  44. 44
    $ free -m
                 total       used       free     shared    buffers     cached
    Mem:           375        349         25          0        111         94
    -/+ buffers/cache:        143        231
    Swap:          400          7        392
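
The `-/+ buffers/cache` row is the one that answers the question: buffers and cache can be reclaimed, so they count as available to applications. A minimal Python sketch of that arithmetic, using the numbers from the `Mem:` row above (function names are illustrative):

```python
def memory_used_by_apps(used_mb: int, buffers_mb: int, cached_mb: int) -> int:
    """Memory held by applications: used minus reclaimable buffers and cache."""
    return used_mb - buffers_mb - cached_mb


def memory_available_to_apps(free_mb: int, buffers_mb: int, cached_mb: int) -> int:
    """Memory applications could still get: free plus reclaimable buffers and cache."""
    return free_mb + buffers_mb + cached_mb


# Values from the "Mem:" row of the free -m output.
used, free, buffers, cached = 349, 25, 111, 94
print("used by apps:", memory_used_by_apps(used, buffers, cached), "MB")
print("available to apps:", memory_available_to_apps(free, buffers, cached), "MB")
```

The results differ by 1 MB from the 143 and 231 that `free` prints, presumably because `free` rounds each column to whole megabytes separately.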

  45. 45
    Monitoring

  46. 46
    ➡ Monitor everything
    ➡ System / infra monitoring
    ➡ Application monitoring

  47. 47

  48. 48

  49. 49
    Logging

  50. 50
    ➡ Log EVERYTHING from all sources.
    ➡ Filter later.

  51. 51
    Logstash Graylog2
    wtf

  52. 52
    System tools

  53. TAIL

  54. 54
    ➡ Most daemons log to /var/log/*
    ➡ tail -f /var/log/messages
    ➡ Many times, this is ALL you need!
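
When the same "follow the log" idea is needed inside a script, it comes down to remembering a file offset and reading whatever was appended since. The Python sketch below shows one way to do that; it is an illustration, not how `tail` itself is implemented, and the temporary file stands in for a real log.

```python
import os
import tempfile


def read_new_lines(path: str, offset: int = 0):
    """Return (lines, new_offset) for everything appended after offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    text = data.decode("utf-8", errors="replace")
    return text.splitlines(), offset + len(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "messages")  # stand-in for /var/log/messages
        with open(log_path, "w") as log:
            log.write("first entry\n")
        lines, position = read_new_lines(log_path)
        print(lines)
        with open(log_path, "a") as log:
            log.write("second entry\n")
        lines, position = read_new_lines(log_path, position)
        print(lines)  # only the entry added since the previous read
```

Calling `read_new_lines` in a loop with a short sleep gives a basic follow mode; log rotation is not handled here.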

  55. 55
    ➡ Know your tools (top, htop, vmstat, iostat, ps)
    ➡ Know the /proc filesystem
    ➡ Sniff around with tcpdump, netstat, nc, etc.
    ➡ man

  56. 56
    strace

  57. 57
    ➡ strace displays system calls and signals
    ➡ Communication between applications and
    the kernel.

  58. 58
    socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 20
    fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    connect(20, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    poll([{fd=20, events=POLLOUT}], 1, -1) = 1 ([{fd=20, revents=POLLOUT}])
    write(20, "get ez_borentappenschuren-/acls/"..., 44) = 44
    read(20, "END\r\n", 8196) = 5
    write(20, "get ez_borentappenschuren-/acl/g"..., 40) = 40
    read(20, "END\r\n", 8196) = 5
    write(20, "quit\r\n", 6) = 6
    shutdown(20, 2 /* send and receive */) = 0
    close(20) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/borentappenschuren/user/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/user/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right-last.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    chmod("/tmp/smarty", 0777) = 0
    access("/userdata/borentappenschuren/user/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/userdata/borentappenschuren/theme/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/nl/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/noxlogic/root/themes/ezshopping/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
    mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
    strace -ff -p
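
In the trace above, most of the work is failed `access()` lookups for templates that do not exist, plus repeated `mkdir()` calls on a directory that is already there. One way to spot such patterns in a long trace is to count failed calls per syscall and errno. The Python sketch below does that for plain strace lines; it assumes no PID or timestamp prefixes, and the three sample lines are copied from the slide.

```python
import re
from collections import Counter

# Matches lines such as: access("...", F_OK) = -1 ENOENT (No such file or directory)
FAILED_CALL = re.compile(r"^\s*(\w+)\(.*\)\s*=\s*-1\s+(E[A-Z]+)")


def count_failures(strace_output: str) -> Counter:
    """Count failed system calls, keyed by (syscall name, errno name)."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = FAILED_CALL.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = r'''
mkdir("/tmp/smarty", 0777) = -1 EEXIST (File exists)
chmod("/tmp/smarty", 0777) = 0
access("/userdata/borentappenschuren/user/templates/block-right.tpl", F_OK) = -1 ENOENT (No such file or directory)
'''
for (syscall, errno_name), total in count_failures(sample).most_common():
    print(syscall, errno_name, total)
```

For a quick overview without any scripting, `strace -c` prints a per-syscall summary that includes an error count.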

  59. 59
    mprotect(0xb757f000, 4096, PROT_READ) = 0
    munmap(0xb76d8000, 44104) = 0
    stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=59, ...}) = 0
    socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
    connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.178.4")}, 16) = 0
    gettimeofday({1347446161, 382120}, NULL) = 0
    poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
    send(3, "u\205\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL) = 32
    poll([{fd=3, events=POLLIN}], 1, 5000
    $ strace ping www.google.com

  60. ➡ strace -e trace=open
    ➡ strace -ff -p

  61. 61
    vmstat

  62. 62
    $ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs  us  sy  id  wa
     0  0    896  17208 124784 121964    0    0     0     0    73   159   0   1  99   0
     2  0    896  14968 125628 121976    0    0   844     0  6857 29935   5  58  35   2
     2  0    896  14472 125628 121964    0    0     0     0 10691 48526  10  90   0   0
     2  0    896  13976 126120 121952    0    0   476   252 10430 49144   7  91   0   2
     0  0    896  12744 127016 121968    0    0   896     0  7799 38732   3  71  23   3
     0  0    896  12744 127016 121964    0    0     0     0    30    93   0   0 100   0
     0  0    896  12760 127016 121964    0    0     0     0    30    92   0   0 100   0
     0  0    896  12760 127016 121964    0    0     0     0    29    99   0   1  99   0
     0  0    896  12760 127016 121964    0    0     0     0    32   110   0   0 100   0
     0  0    896  12752 127024 121964    0    0     0   324    33   108   0   0  99   1
     0  0    896  12752 127024 121964    0    0     0     0    30   103   0   0 100   0
     0  0    896  12752 127032 121964    0    0     0    12    34   108   0   0 100   0
     0  0    896  12752 127032 121964    0    0     0     0    30   105   0   0 100   0
     0  0    896  12752 127032 121964    0    0     0   236    80   101   0   1  99   0
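
In this sample, a short burst stands out where `sy` (CPU time spent in the kernel) and `cs` (context switches per second) jump while the rest of the run is idle. A rough Python sketch for picking such samples out of captured `vmstat` output is shown below; the threshold of 50 is an arbitrary assumption, and the three data lines come from the slide.

```python
COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat(output: str) -> list:
    """Turn the numeric lines of vmstat output into dicts keyed by column name."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        numeric = parts[: len(COLUMNS)]
        if len(numeric) == len(COLUMNS) and all(p.isdigit() for p in numeric):
            rows.append(dict(zip(COLUMNS, map(int, numeric))))
    return rows


def busy_samples(rows: list, sy_threshold: int = 50) -> list:
    """Samples where the kernel used at least sy_threshold percent of CPU time."""
    return [row for row in rows if row["sy"] >= sy_threshold]


sample = """\
 r b swpd free buff cache si so bi bo in cs us sy id wa
 0 0 896 17208 124784 121964 0 0 0 0 73 159 0 1 99 0
 2 0 896 14968 125628 121976 0 0 844 0 6857 29935 5 58 35 2
 2 0 896 14472 125628 121964 0 0 0 0 10691 48526 10 90 0 0
"""
for row in busy_samples(parse_vmstat(sample)):
    print("sy:", row["sy"], "cs:", row["cs"], "runnable:", row["r"])
```

Header lines are skipped because they are not numeric, so the function can be fed the raw output as captured.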

  63. 63
    SystemTap / DTrace

  64. 64
    ➡ DTrace: unobtrusive probes inside the kernel.
    ➡ Scripts are written in the D language.
    ➡ Originated at Sun on Solaris; not in the mainline Linux kernel (licensing).

  65. 65
    ➡ SystemTap: the “GPL” counterpart of DTrace.
    ➡ Awesome, but complex.
    ➡ You need / want the debug info packages.

  66. probe syscall.open {
      printf("%s(%d) open (%s)\n", execname(), pid(), argstr);
    }
    stap syscall.stp

  67. 67
    ➡ There are some “providers” in the PHP
    core (zend_dtrace.{c,h,d})

  68. 68
    ➡ Valgrind
    ➡ GDB
    ➡ XDebug / profiler
    ➡ MySQL proxy

  69. 69
    Think about your app / infra BEFORE going live...

  70. Make a plan

  71. 71
    ➡ Design for (horizontal) scalability.
    ➡ Remove SPOFs.
    ➡ Vertical scalability is easier, but more
    restrictive.
    ➡ Configuration is key.
    ➡ Don’t run at full capacity. Have a contingency
    buffer for peaks and know how to scale out.

  72. 72
    ➡ One machine for one purpose (app / mail / cron /
    db / etc.).
    ➡ Virtual machines are cheap, and easy to set up
    and maintain (puppet).
    ➡ Make as much as possible asynchronous.
    ➡ Message queues are easy to implement
    (gearman / *MQ etc.).
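
The queueing idea can be tried out in a single process before bringing in gearman or a message broker. The Python sketch below uses the standard library `queue` and `threading` modules as a stand-in for a real message queue; the job payloads and the upper-casing "work" are placeholders.

```python
import queue
import threading


def run_jobs(jobs: list) -> list:
    """Hand jobs to a background worker via a queue and collect the results."""
    pending = queue.Queue()
    results = []

    def worker() -> None:
        while True:
            job = pending.get()
            if job is None:  # sentinel: no more work
                pending.task_done()
                break
            results.append(job.upper())  # placeholder for slow work (mail, resizing, ...)
            pending.task_done()

    thread = threading.Thread(target=worker)
    thread.start()
    for job in jobs:
        pending.put(job)  # the producer returns immediately after queueing
    pending.put(None)
    pending.join()  # wait until every queued item has been processed
    thread.join()
    return results


if __name__ == "__main__":
    print(run_jobs(["send welcome mail", "resize avatar"]))
```

With a real queue server the worker would run in a separate process or on another machine, which is what lets the web request return before the slow work is done.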

  73. 73
    Recap
    Conclusion

  74. 74

  75. 75
    ➡ Don’t reboot, debug!
    ➡ Analyze what’s going on,
    ➡ find and isolate the culprit.
    ➡ Treat the problem, not the symptoms.

  76. 76
    ➡ There are many tools out there to analyze your
    system in real time.
    ➡ Know your running environment (even if it’s “not
    your business”).
    ➡ Ask for third-party help if needed.

  77. 77

  78. Find me on twitter: @jaytaph
    Find me for development and training: www.noxlogic.nl
    Find me on email: [email protected]
    Find me for blogs: www.adayinthelifeof.nl
    Thank You!
    https://joind.in/9564
