Monitoring Riak

A brief talk covering the basics of monitoring a Riak cluster. Delivered at the Boston Riak Meetup, inspired by Monitorama. I'm sorry for all the Wire references (not really).

Tom Santero

March 27, 2013

Transcript

  1. Monitoring Riak Boston Riak Meetup @tsantero Friday, March 29, 13

  2. Ops-Friendly, #hugops, No SPOF, Easy to Scale, Fault Tolerant, Erlang!

  3. “just let it crash”

  4. “just let it crash” No need to monitor then, right?

  5. (no text on slide)

  6. Resource Starvation

  7. Resource Starvation, Performance

  8. What to Watch: The Basics

  9. Is it plugged in?

  10. (no text on slide)

  11. $ bin/riak ping
      pong

  12. $ bin/riak ping
      pong
      OK! √

  13. $ bin/riak ping
      pong
      OK! √
      $ bin/riak ping
      Node not responding to pings

  14. $ bin/riak ping
      pong
      OK! √
      $ bin/riak ping
      Node not responding to pings
      OHNOES! X

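The two outcomes on this slide can be turned into an automated check. The following is a minimal sketch, not the speaker's tooling: the function names, the default `bin/riak` path and the 10-second timeout are assumptions. Only the `ping` subcommand and its two outputs come from the slides.

```python
import subprocess


def ping_is_healthy(output):
    """Return True only when the node answered 'pong', as on the slide."""
    return output.strip() == "pong"


def riak_ping(riak_bin="bin/riak", timeout=10):
    """Run `<riak_bin> ping`; any failure to run it counts as unhealthy.

    riak_bin and timeout are illustrative defaults, not values from the talk.
    """
    try:
        result = subprocess.run(
            [riak_bin, "ping"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hung command.
        return False
    return ping_is_healthy(result.stdout)
```

Treating a missing binary or a timeout as "down" is a design choice: for an "is it plugged in?" check, a false alarm is usually cheaper than a silent miss.
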
  15. $ riak-admin test
      Attempting to restart script through sudo -H -u riak
      Successfully completed 1 read/write cycle to 'riak@devnull'

  16. System

  17. I/O Bound > Network Bound > CPU Bound

  18. Metric            Threshold
      CPU               75% * num_cores
      Memory            70% - buffers
      Disk Space        75%
      Disk IO           80% sustained
      Network           70% sustained
      File Descriptors  75% of ulimit
      Swap              > 0KB

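The thresholds on this slide could be encoded as a simple check. This is a sketch under stated assumptions: the sample field names are hypothetical, CPU is assumed to be reported top-style (100% per core), and "70% - buffers" is read as "memory used, excluding buffers, above 70%". A single sample cannot express "sustained"; a caller would need to require the breach across several consecutive samples.

```python
def breached_thresholds(sample, num_cores, fd_ulimit):
    """Return the names of slide metrics whose threshold is exceeded.

    `sample` is a dict with hypothetical keys; percentages are 0-100,
    except cpu_pct, which is assumed to scale to 100 per core.
    """
    checks = [
        ("CPU", sample["cpu_pct"] > 75 * num_cores),
        # Assumed reading of "70% - buffers": ignore buffer memory.
        ("Memory", sample["mem_used_pct"] - sample["mem_buffers_pct"] > 70),
        ("Disk Space", sample["disk_space_pct"] > 75),
        ("Disk IO", sample["disk_io_pct"] > 80),
        ("Network", sample["network_pct"] > 70),
        ("File Descriptors", sample["open_fds"] > 0.75 * fd_ulimit),
        ("Swap", sample["swap_used_kb"] > 0),
    ]
    return [name for name, over in checks if over]


example = {
    "cpu_pct": 350, "mem_used_pct": 80, "mem_buffers_pct": 15,
    "disk_space_pct": 50, "disk_io_pct": 10, "network_pct": 10,
    "open_fds": 3000, "swap_used_kb": 4,
}
alerts = breached_thresholds(example, num_cores=4, fd_ulimit=4096)
# alerts == ["CPU", "Swap"]
```
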
  19. juking the stats

  20. sample windows > 1 sec LIE
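A small illustration of why wide sample windows mislead, using made-up numbers rather than data from the talk: one second of saturation inside an otherwise quiet minute nearly vanishes once the minute is averaged.

```python
def window_means(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    means = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Hypothetical per-second utilization: 59 quiet seconds, then one saturated second.
per_second = [10] * 59 + [100]

one_minute_view = window_means(per_second, 60)    # [11.5]
one_second_peak = max(window_means(per_second, 1))  # 100.0
```

The one-minute figure sits well under any threshold from the table, while the one-second view shows the spike that a request hitting the node at that moment would have experienced.
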

  21. Riak Stats

  22. riak-admin status || REST /stats
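For the REST route, a minimal sketch of polling `/stats` over HTTP and keeping a few fields. The host and port below are assumptions (8098 is a common default for Riak's HTTP listener, but a given cluster may differ), and the helper names are illustrative.

```python
import json
import urllib.request

# Assumed endpoint; adjust host and port to the node being monitored.
STATS_URL = "http://127.0.0.1:8098/stats"


def fetch_stats(url=STATS_URL, timeout=5.0):
    """GET the stats endpoint and decode its JSON body into a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pick(stats, names):
    """Keep only the named stats; absent ones map to None."""
    return {name: stats.get(name) for name in names}
```

A poller might call `pick(fetch_stats(), ["node_gets", "node_get_fsm_time_99"])` on a short interval and ship the result to whatever collector is in use.
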

  23. $ riak-admin status
      1-minute stats for 'riak@devnull'
      -------------------------------------------
      vnode_gets : 600
      vnode_gets_total : 714
      vnode_puts : 600
      vnode_puts_total : 714
      vnode_index_reads : 0
      vnode_index_reads_total : 0
      vnode_index_writes : 0
      vnode_index_writes_total : 0
      vnode_index_writes_postings : 0
      vnode_index_writes_postings_total : 0
      vnode_index_deletes : 0
      vnode_index_deletes_total : 0
      vnode_index_deletes_postings : 0
      vnode_index_deletes_postings_total : 0
      node_gets : 585
      node_gets_total : 694
      node_get_fsm_siblings_mean : 0
      node_get_fsm_siblings_median : 0
      node_get_fsm_siblings_95 : 0
      node_get_fsm_time_99 : 743
      node_get_fsm_time_100
      <-- snip -->

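Output in this `name : value` shape is easy to turn into a dictionary for graphing or alerting. The parser below is a sketch with a hypothetical function name; it keeps only lines whose value is an integer, so headers, separators and non-numeric stats are skipped.

```python
def parse_status(text):
    """Parse 'name : value' lines into {name: int}, skipping everything else."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # header, separator, or truncated line
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # non-integer value; ignored in this sketch
    return stats


sample_output = """1-minute stats for 'riak@devnull'
-------------------------------------------
vnode_gets : 600
vnode_gets_total : 714
node_gets : 585
node_get_fsm_time_99 : 743
"""
parsed = parse_status(sample_output)
# parsed["node_get_fsm_time_99"] == 743
```
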
  24. Logs

  25. $ tail -f logs/*

  26. ** Reason for termination ==
      ** {error,system_limit,[{erlang,open_port,[{spawn,"zlib_drv"},[binary]],[]},
          {zlib,open,0,[]},
          {zlib,zip,1,[]},
          {riak_kv_pb_object,process,2,[{file,"src/riak_kv_pb_object.erl"},{line,218}]},
          {riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,203}]},
          {riak_api_pb_server,handle_info,2,[{file,"src/riak_api_pb_server.erl"},{line,123}]},
          {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},
          {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
      2013-03-26 17:24:17 =CRASH REPORT====
        crasher:
          initial call: riak_api_pb_server:init/1
          pid: <0.15785.5260>

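Rather than watching `tail -f` by hand, a script can count the markers visible in this excerpt. The sketch below uses a hypothetical function name and matches only the two strings shown on the slide: the `=CRASH REPORT====` banner and the `system_limit` error atom.

```python
def summarize_crashes(lines):
    """Count crash-report banners and system_limit errors in log lines."""
    summary = {"crash_reports": 0, "system_limit_errors": 0}
    for line in lines:
        if "=CRASH REPORT====" in line:
            summary["crash_reports"] += 1
        if "system_limit" in line:
            summary["system_limit_errors"] += 1
    return summary


excerpt = [
    "** Reason for termination ==",
    '** {error,system_limit,[{erlang,open_port,[{spawn,"zlib_drv"},[binary]],[]},',
    "2013-03-26 17:24:17 =CRASH REPORT====",
    "  crasher:",
]
summary = summarize_crashes(excerpt)
# summary == {"crash_reports": 1, "system_limit_errors": 1}
```

A `system_limit` raised from `erlang:open_port` means the Erlang VM ran out of ports, which ties this log entry back to the resource-starvation theme earlier in the talk.
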
  27. Simple, Right?

  28. Simple, Right?

  29. Tools

  30. Riaknostic
      $ riak-admin diag --level debug
      18:34:19.708 [debug] Lager installed handler lager_console_backend into lager_event
      18:34:19.720 [debug] Lager installed handler error_logger_lager_h into error_logger
      18:34:19.720 [info] Application lager started on node nonode@nohost
      18:34:20.736 [debug] Not connected to the local Riak node, trying to connect. alive:false connect_failed:undefined
      18:34:20.737 [debug] Starting distributed Erlang.
      18:34:20.740 [debug] Supervisor net_sup started erl_epmd:start_link() at pid <0.42.0>
      18:34:20.742 [debug] Supervisor net_sup started auth:start_link() at pid <0.43.0>

  31. Nagios
      0 OK
      1 Warning
      2 Critical
      3 Unknown
      check_riak_up

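The four exit codes on this slide are the whole contract between a check and Nagios. A sketch of a threshold classifier built on them follows; it is not the `check_riak_up` plugin itself, and the function names and message format are illustrative.

```python
# Exit codes as listed on the slide.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measurement to an exit code; None means it could not be read."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_line(name, value, warn, crit):
    """Return (exit_code, one-line status message) for a single metric."""
    code = classify(value, warn, crit)
    return code, "{} {} - value={}".format(name, LABELS[code], value)
```

A plugin script would print the message and call `sys.exit(code)`; returning `UNKNOWN` when the metric cannot be read keeps a broken check from masquerading as a healthy node.
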
  32. Riemann http://riemann.io/howto.html#monitor-riak

  33. Splunk, Sensu, collectd, Ganglia, etc.

  34. (no text on slide)

  35. (no text on slide)

  36. Profiling & Provisioning

  37. Thanks!