Slide 1

Slide 1 text

Monitoring Riak
Boston Riak Meetup
@tsantero
Friday, March 29, 13

Slide 2

Slide 2 text

Ops-Friendly #hugops
No SPOF
Easy to Scale
Fault Tolerant
Erlang!

Slide 3

Slide 3 text

“just let it crash”

Slide 4

Slide 4 text

“just let it crash”
No need to monitor then, right?

Slide 5

Slide 5 text


Slide 6

Slide 6 text

Resource Starvation

Slide 7

Slide 7 text

Resource Starvation
Performance

Slide 8

Slide 8 text

What to Watch
The Basics

Slide 9

Slide 9 text

Is it plugged in?

Slide 10

Slide 10 text


Slide 11

Slide 11 text

$ bin/riak ping
pong

Slide 12

Slide 12 text

$ bin/riak ping
pong
OK! √

Slide 13

Slide 13 text

$ bin/riak ping
pong
OK! √
$ bin/riak ping
Node not responding to pings

Slide 14

Slide 14 text

$ bin/riak ping
pong
OK! √
$ bin/riak ping
Node not responding to pings
OHNOES! X
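
A monitoring script can make the same distinction automatically. The following is a minimal sketch, not an official client: `classify_ping` and `ping_node` are hypothetical names, the `bin/riak` path mirrors the slides, and the only strings relied on are the two replies shown above.

```python
import subprocess


def classify_ping(output: str) -> bool:
    """Return True when the output of `riak ping` is the healthy reply, `pong`."""
    return output.strip() == "pong"


def ping_node(riak_cmd: str = "bin/riak") -> bool:
    """Run `<riak_cmd> ping` and report whether the node answered.

    The default path mirrors the slides; adjust it for your install.
    """
    try:
        result = subprocess.run(
            [riak_cmd, "ping"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hung command both count as "not up".
        return False
    return classify_ping(result.stdout)


# The two replies shown on the slides:
print(classify_ping("pong"))                          # True
print(classify_ping("Node not responding to pings"))  # False
```

Keeping the string check separate from the subprocess call means the check can be exercised without a running node.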

Slide 15

Slide 15 text

$ riak-admin test
Attempting to restart script through sudo -H -u riak
Successfully completed 1 read/write cycle to 'riak@devnull'

Slide 16

Slide 16 text

System

Slide 17

Slide 17 text

I/O Bound > Network Bound > CPU Bound

Slide 18

Slide 18 text

Metric             Threshold
CPU                75% * num_cores
Memory             70% - buffers
Disk Space         75%
Disk IO            80% sustained
Network            70% sustained
File Descriptors   75% of ulimit
Swap               > 0KB
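
The thresholds above can be written down as a simple rule set. This is a sketch under stated assumptions: the `Sample` fields and their units are hypothetical, "75% * num_cores" is read as CPU utilisation summed across cores, and "70% - buffers" is read as memory in use not counting buffers. Collecting the readings is left to whatever agent you already run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One reading per metric; field names and units are assumptions."""
    cpu_pct: float         # summed over cores, so 0-400 on a 4-core host
    num_cores: int
    mem_used_pct: float    # memory in use, not counting buffers
    disk_space_pct: float
    disk_io_pct: float     # sustained utilisation
    network_pct: float     # sustained utilisation
    open_fds: int
    fd_ulimit: int
    swap_kb: int


def breaches(s: Sample) -> List[str]:
    """Return the names of the metrics that exceed the slide's thresholds."""
    checks = [
        ("CPU", s.cpu_pct > 75 * s.num_cores),
        ("Memory", s.mem_used_pct > 70),
        ("Disk Space", s.disk_space_pct > 75),
        ("Disk IO", s.disk_io_pct > 80),
        ("Network", s.network_pct > 70),
        ("File Descriptors", s.open_fds > 0.75 * s.fd_ulimit),
        ("Swap", s.swap_kb > 0),
    ]
    return [name for name, tripped in checks if tripped]


reading = Sample(cpu_pct=120, num_cores=4, mem_used_pct=55, disk_space_pct=40,
                 disk_io_pct=30, network_pct=20, open_fds=800, fd_ulimit=1000,
                 swap_kb=4)
print(breaches(reading))  # ['File Descriptors', 'Swap']
```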

Slide 19

Slide 19 text

juking the stats

Slide 20

Slide 20 text

sample windows > 1 sec LIE
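
A small worked example of why wide sample windows mislead. The data is invented for illustration: a hypothetical minute of per-second utilisation that is idle except for one saturated second. `window_means` is a hypothetical helper, not part of any monitoring tool.

```python
from typing import List


def window_means(samples: List[float], window: int) -> List[float]:
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


# Hypothetical minute of per-second utilisation: idle except one saturated second.
per_second = [10.0] * 59 + [100.0]

print(max(per_second))               # 100.0, the spike is visible at 1 s
print(window_means(per_second, 60))  # [11.5], the spike is averaged away
```

With a 60-second window the saturated second shows up as an unremarkable 11.5%.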

Slide 21

Slide 21 text

Riak Stats

Slide 22

Slide 22 text

riak-admin status || REST /stats

Slide 23

Slide 23 text

$ riak-admin status
1-minute stats for 'riak@devnull'
-------------------------------------------
vnode_gets : 600
vnode_gets_total : 714
vnode_puts : 600
vnode_puts_total : 714
vnode_index_reads : 0
vnode_index_reads_total : 0
vnode_index_writes : 0
vnode_index_writes_total : 0
vnode_index_writes_postings : 0
vnode_index_writes_postings_total : 0
vnode_index_deletes : 0
vnode_index_deletes_total : 0
vnode_index_deletes_postings : 0
vnode_index_deletes_postings_total : 0
node_gets : 585
node_gets_total : 694
node_get_fsm_siblings_mean : 0
node_get_fsm_siblings_median : 0
node_get_fsm_siblings_95 : 0
node_get_fsm_time_99 : 743
node_get_fsm_time_100 <-- snip -->
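
To feed these numbers into a monitoring system, the `name : value` lines can be parsed into a dictionary. This is a minimal sketch that assumes the output format shown on the slide; `parse_status` and `fetch_stats` are hypothetical helpers, and the URL passed to `fetch_stats` (host, port, and the `/stats` path) depends on your deployment.

```python
import json
import urllib.request
from typing import Dict, Union

Stat = Union[int, str]


def parse_status(text: str) -> Dict[str, Stat]:
    """Parse `riak-admin status` output into a dict, skipping non-stat lines."""
    stats: Dict[str, Stat] = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if not sep:
            continue  # header, separator rule, or blank line
        name, value = name.strip(), value.strip()
        try:
            stats[name] = int(value)
        except ValueError:
            stats[name] = value  # keep non-integer values as text
    return stats


def fetch_stats(url: str) -> dict:
    """Fetch the JSON stats document over HTTP (assumes the endpoint returns JSON)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


SAMPLE = """1-minute stats for 'riak@devnull'
-------------------------------------------
vnode_gets : 600
vnode_gets_total : 714
node_gets : 585
node_get_fsm_time_99 : 743
"""

stats = parse_status(SAMPLE)
print(stats["node_get_fsm_time_99"])  # 743
```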

Slide 24

Slide 24 text

Logs

Slide 25

Slide 25 text

$ tail -f logs/*

Slide 26

Slide 26 text

** Reason for termination ==
** {error,system_limit,[{erlang,open_port,[{spawn,"zlib_drv"},[binary]],[]},
    {zlib,open,0,[]},
    {zlib,zip,1,[]},
    {riak_kv_pb_object,process,2,[{file,"src/riak_kv_pb_object.erl"},{line,218}]},
    {riak_api_pb_server,process_message,4,[{file,"src/riak_api_pb_server.erl"},{line,203}]},
    {riak_api_pb_server,handle_info,2,[{file,"src/riak_api_pb_server.erl"},{line,123}]},
    {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,607}]},
    {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
2013-03-26 17:24:17 =CRASH REPORT====
  crasher:
    initial call: riak_api_pb_server:init/1
    pid: <0.15785.5260>
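
Watching the logs by eye does not scale, so a script can count the strings that matter. The sketch below looks only for the two markers visible in the crash report above, `=CRASH REPORT====` and `system_limit`; `count_markers` is a hypothetical helper, and which markers deserve an alert is a judgment call for your cluster.

```python
from typing import Dict, Iterable, Sequence

# Markers taken from the crash report on the slide.
MARKERS = ("=CRASH REPORT====", "system_limit")


def count_markers(lines: Iterable[str],
                  markers: Sequence[str] = MARKERS) -> Dict[str, int]:
    """Count how many lines contain each marker string."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


SAMPLE = [
    '** {error,system_limit,[{erlang,open_port,[{spawn,"zlib_drv"},[binary]],[]},',
    "2013-03-26 17:24:17 =CRASH REPORT====",
    "  crasher:",
    "    initial call: riak_api_pb_server:init/1",
]
print(count_markers(SAMPLE))
# {'=CRASH REPORT====': 1, 'system_limit': 1}
```

Because the function accepts any iterable of lines, an open log file can be passed in directly.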

Slide 27

Slide 27 text

Simple, Right?

Slide 28

Slide 28 text

Simple, Right?

Slide 29

Slide 29 text

Tools

Slide 30

Slide 30 text

Riaknostic
$ riak-admin diag --level debug
18:34:19.708 [debug] Lager installed handler lager_console_backend into lager_event
18:34:19.720 [debug] Lager installed handler error_logger_lager_h into error_logger
18:34:19.720 [info] Application lager started on node nonode@nohost
18:34:20.736 [debug] Not connected to the local Riak node, trying to connect. alive:false connect_failed:undefined
18:34:20.737 [debug] Starting distributed Erlang.
18:34:20.740 [debug] Supervisor net_sup started erl_epmd:start_link() at pid <0.42.0>
18:34:20.742 [debug] Supervisor net_sup started auth:start_link() at pid <0.43.0>

Slide 31

Slide 31 text

Nagios
0 OK
1 Warning
2 Critical
3 Unknown
check_riak_up
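
The four exit codes are the whole contract between a check and Nagios. The sketch below maps a latency reading onto them. It is not the implementation of `check_riak_up`: `check_latency` is a hypothetical function, the thresholds are placeholders, and the stat name is borrowed from the `riak-admin status` output earlier in the deck.

```python
from typing import Optional, Tuple

# Nagios plugin exit codes, as listed on the slide.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_latency(value: Optional[int], warn: int, crit: int) -> Tuple[int, str]:
    """Map a latency reading to a Nagios exit code and a one-line status message."""
    if value is None:
        code = UNKNOWN      # the stat could not be read at all
    elif value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"RIAK {LABELS[code]} - node_get_fsm_time_99={value}"


# Placeholder thresholds; 743 is the value shown in the status output.
code, message = check_latency(743, warn=5000, crit=20000)
print(code, message)  # 0 RIAK OK - node_get_fsm_time_99=743
```

A real plugin would print the message and then call `sys.exit(code)` so that Nagios sees the exit status.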

Slide 32

Slide 32 text

Riemann
http://riemann.io/howto.html#monitor-riak

Slide 33

Slide 33 text

Splunk
Sensu
collectd
Ganglia
etc.

Slide 34

Slide 34 text


Slide 35

Slide 35 text


Slide 36

Slide 36 text

Profiling & Provisioning

Slide 37

Slide 37 text

Thanks!