Beyond top: Command-Line Monitoring on the JVM (JavaOne 2015)

A session from JavaOne 2015


Colin Jones

October 29, 2015


  1. Beyond top(1): Command-Line Monitoring on the JVM (Colin Jones, @trptcolin, 8th Light)

  2. None
  3. What to expect

  4. command-line tooling

  5. on the JVM

  6. introspection & serviceability

  7. --all-flags=false

  8. war stories

  9. real-life usage (well, re-enacted anyway)

  10. A long time ago in a software shop far, far

  11. Things are going pretty well

  12. What does this thing look like? (architecture diagram) Components: Postgres; Web / API Application Server; Load Balancer; Periodic Job Application Server; 3rd-party Service A; 3rd-party Service B; Monitored email account. End users: native mobile app. Admin users: desktop browsers.

  13. But strange things are afoot

  14. the server sometimes gets really slow

  15. the team has to manually restart the application server

  16. incident response time is ~5 minutes

  17. Yes, strange things are afoot

  18. Pain, frustration, anger

  19. Just the facts

  20. sometimes, things get slow

  21. all requests seem to be affected

  22. the JVM stays up

  23. restart the JVM and everything is fine

  24. What could it be?

  25. Demo

  26. More facts, please!

  27. constant full GCs

  28. what’s in the heap

  29. what application code was running
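
The first of these facts, whether the JVM is collecting garbage constantly, can also be read from inside the process through the standard `java.lang.management` API. The following is a minimal sketch and not code from the talk: the class name `GcSnapshot` and its methods are hypothetical, and it reports totals across all collectors rather than full GCs specifically (the per-collector bean names depend on which collector the JVM runs).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not from the talk): reads cumulative GC counters
// and heap occupancy for the current JVM.
class GcSnapshot {

    // Total number of collections since JVM start, summed over all collectors.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds, summed over all collectors.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if undefined for this collector
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Used heap as a fraction of the maximum heap (or of committed heap if no maximum is defined).
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return limit > 0 ? (double) heap.getUsed() / limit : 0.0;
    }

    public static void main(String[] args) throws InterruptedException {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();
        Thread.sleep(1000); // sample interval
        System.out.println("collections in interval: " + (totalCollections() - countBefore));
        System.out.println("GC millis in interval:   " + (totalCollectionMillis() - millisBefore));
        System.out.println("heap used fraction:      " + heapUsedFraction());
    }
}
```

A heap that stays close to full while the collection count and time keep climbing between samples is consistent with the "constant full GCs" symptom, though it does not by itself say what is filling the heap.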

  30. The right tools for the job

  31. vmstat: system-level CPU, memory, disk, context switching

  32. top: per-process CPU & memory

  33. jps: what’s our PID?

  34. jstack: status of all threads (right now-ish!)

  35. jcmd: what can’t it do?! jcmd [PID] help (sorry, JVM 6 users: see jinfo/jmap/jstack)

  36. jstat: GC, classloader, compiler
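
To make concrete what a thread dump contains, here is a rough in-process sketch of the kind of data jstack reports: each live thread's name, state, and stack frames. It is not from the talk and not a replacement for jstack, which attaches to another process from the outside. The class name `ThreadSnapshot` and its methods are hypothetical; `Thread.getAllStackTraces()` is the standard JDK call doing the work.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper (not from the talk): summarizes the threads of the current JVM.
class ThreadSnapshot {

    // Number of live threads in each Thread.State at the moment of the call.
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }

    // Plain-text dump: one header line per thread, followed by its stack frames.
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append("\" ")
               .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(countByState());
        System.out.print(dump());
    }
}
```

As with jstack, the result is a snapshot: threads keep running while it is taken, so taking several dumps a few seconds apart says more about where time is going than a single one.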

  37. Mystery solved!

  38. Now “just” fix it

  39. idea 1: eliminate the leak

  40. idea 2: eliminate the cache altogether?

  41. idea 3: delete the feature

  42. idea 4: full-text search engine
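
The slides do not show the offending code, so the following only illustrates one common way to approach idea 1, under the assumption that the leak is a cache that grows without bound. It is a sketch of a size-capped, least-recently-used cache built on the JDK's `LinkedHashMap`; the class name `BoundedCache` and the capacity are hypothetical, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example (not from the talk): an LRU cache that cannot grow
// past maxEntries, so it cannot fill the heap the way an unbounded map can.
class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order (least recently used first)
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so that "b" becomes the eldest entry
        cache.put("c", "3"); // exceeds the cap, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

For concurrent access the map would need external synchronization (for example `Collections.synchronizedMap`) or a purpose-built caching library.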

  43. So we’re good now… until the next incident

  44. Lessons

  45. “it’s slow” could mean lots of things

  46. “high CPU” could mean lots of things

  47. collecting data is crucial in a crisis

  48. reproducing the issue helps me sleep at night

  49. Other “right tools for the job”

  50. Heap analyzers

  51. Profilers

  52. Constant monitoring & alerting

  53. Dynamic tracing

  54. Learning more

  55. Books

  56. operators are standing by! man jstat, man jstack, jcmd [PID] help [COMMAND], etc.

  57. Thank you! Colin Jones, @trptcolin, 8th Light