Monitoring with Ganglia

Slide 1

Slide 1 text

Monitoring with Ganglia Vladimir Vuksan @vvuksan http://blog.vuksan.com/

Slide 2

Slide 2 text

Who am I ● Have done systems administration for over 20 years ● Ganglia contributor ● Co-authored O'Reilly book about Ganglia ● Work at Fastly ● @vvuksan on Twitter

Slide 3

Slide 3 text

Ganglia book Book signing Wednesday 6/25 at 10:45 in the O'Reilly Author booth

Slide 4

Slide 4 text

What is Ganglia ● Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization ● Started in 2002 http://ganglia.info/

Slide 5

Slide 5 text

How I got involved ● Got introduced to Ganglia in 2005 ● Loved it ● In 2010 started working on rewrite of Ganglia web UI ● In 2011 became one of the Ganglia core developers

Slide 6

Slide 6 text

Tutorial outline ● Why Ganglia ● Ganglia basics ● Ganglia setup demo ● Ganglia web UI demo ● Choose your own adventure topics

Slide 7

Slide 7 text

Why do we monitor ● Problem/issue detection (MTTR/MTTD) ● Trending – where are we going ● Learn how our infrastructure/system really behaves (Graph caption: Timezone difference between NZ and SYD is 2 hours => People are predictable)

Slide 8

Slide 8 text

Why Ganglia? ● Relatively easy to set up and track lots of metrics ● Doesn't impose a heavy operational burden, i.e. most installs don't require multiple machines, proxies, HBase etc. ● Doesn't require lots of work to provide me with tons of usable graphs ● Lots of features geared toward power users, e.g. aggregate graphs, compare hosts, views

Slide 9

Slide 9 text

Ganglia Architecture ● 2 daemons: gmond & gmetad ● gmond sends and/or receives metrics and keeps them in memory ● 1 gmetad per grid; it polls 1 gmond per cluster for data ● A node belongs to a cluster; a cluster belongs to a grid ● The web UI is a separate item: use it or lose it

Slide 10

Slide 10 text

Transport ● Gmonds talk to each other over UDP ● Gmonds expose metrics over TCP as XML ● Gmetad exposes metrics over TCP as XML
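Because gmond exposes its state as XML over TCP, any client that can open a socket can read it. The following is a minimal sketch, assuming a gmond listening on the default port 8649 and the usual CLUSTER/HOST/METRIC element layout; the sample document and host names are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump gmond writes on connect (it closes the socket when done)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    # Return bytes so the XML parser can honor the document's own encoding.
    return b"".join(chunks)


def parse_metrics(xml_data):
    """Return {(cluster, host, metric): value_string} from a Ganglia XML dump."""
    root = ET.fromstring(xml_data)
    metrics = {}
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            for metric in host.iter("METRIC"):
                key = (cluster.get("NAME"), host.get("NAME"), metric.get("NAME"))
                metrics[key] = metric.get("VAL")
    return metrics


# Hand-written sample in the shape gmond emits (values are illustrative).
SAMPLE = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmond">
<CLUSTER NAME="MYCLUSTER">
<HOST NAME="pico.domain.com" IP="10.0.0.1">
<METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""

print(parse_metrics(SAMPLE))
```

Against a live daemon, `parse_metrics(fetch_xml("localhost", 8649))` would return the same kind of dictionary for every host the gmond knows about.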

Slide 11

Slide 11 text

Multicast vs. unicast transport ● Multicast is the default ● Works great in environments that are on a single network segment, e.g. compute grids, corporate networks ● Zero config ● Doesn't work in the cloud, as multicast is filtered ● Allows for some interesting implementations since all nodes know about metrics from all other nodes ● Use unicast where multicast is not available

Slide 12

Slide 12 text

Write scaling using RRDcached
● If you have lots of metrics, your I/O subsystem will likely become the bottleneck. Use SSDs and RRDcached (consolidates writes)
● RRDcached daemon on Ubuntu/Debian: /etc/default/rrdcached
    OPTS=" -t 60 -w 180 -z 180 -F -s ganglia -m 664 \
      -l 127.0.0.1:9998 -s ganglia -m 777 -P FLUSH,STATS,HELP \
      -l unix:/tmp/rrdcached.limited.sock -b /var/lib/ganglia/rrds -B \
      -p /var/lib/ganglia/rrdcached.pid"
● Tell gmetad where to look
  – Prior to 3.7.0: environment variable
      export RRDCACHED_ADDRESS=/tmp/rrdcached.sock
  – In 3.7.0+: gmetad.conf setting
      rrdcached_address 127.0.0.1:9998
● Tell the Web UI where to look
    $conf['rrdcached_socket'] = "unix:/tmp/rrdcached.limited.sock";
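To confirm that RRDcached is actually absorbing writes, you can ask it for its counters. This is a rough sketch, assuming the TCP listener on 127.0.0.1:9998 from the OPTS above and the line-based rrdcached protocol, in which the status line starts with the number of lines that follow. The sample reply is invented for illustration.

```python
import socket


def parse_stats(response_text):
    """Parse an rrdcached reply: a status line, then 'Name: value' lines."""
    lines = response_text.strip().splitlines()
    status, _, message = lines[0].partition(" ")
    count = int(status)
    if count < 0:
        # A negative status code signals an error.
        raise RuntimeError("rrdcached error: " + message)
    stats = {}
    for line in lines[1:1 + count]:
        name, _, value = line.partition(":")
        stats[name.strip()] = int(value)
    return stats


def query_stats(host="127.0.0.1", port=9998, timeout=5.0):
    """Send STATS to a running rrdcached and return its counters as a dict."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"STATS\n")
        reader = sock.makefile("r")
        first = reader.readline()
        count = int(first.split(" ", 1)[0])
        body = [reader.readline() for _ in range(max(count, 0))]
    return parse_stats(first + "".join(body))


# Illustrative reply only; real daemons report more counters.
SAMPLE_REPLY = "3 Statistics follow\nQueueLength: 0\nUpdatesReceived: 30\nFlushesReceived: 2\n"
print(parse_stats(SAMPLE_REPLY))
```

Watching UpdatesReceived grow while the number of actual disk writes stays low is one way to see the write consolidation at work.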

Slide 13

Slide 13 text

Network buffers scaling
● You will need to increase your UDP buffer size. Default is 128k
● Bump it up in sysctl
    sysctl -w net.core.rmem_max=15000000
● Bump up conntrack for good measure
    sysctl -w net.nf_conntrack_max=512000
● In gmond.conf under udp_recv_channel add
    buffer = 10000000
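The reason the sysctl matters: on Linux, a receive buffer request larger than net.core.rmem_max is silently capped. Here is a small sketch for checking what the kernel actually grants. The helper name is made up, and the behaviour described assumes Linux, where the kernel also reports double the applied value to account for bookkeeping overhead.

```python
import socket


def effective_rcvbuf(requested_bytes):
    """Ask for a UDP receive buffer and report what the kernel granted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        sock.close()


# Same size as the gmond.conf "buffer" setting on this slide.
granted = effective_rcvbuf(10000000)
print("requested 10000000 bytes, kernel reports", granted)
```

If the reported value is far below the request, rmem_max has not been raised yet and gmond's buffer setting will not take full effect.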

Slide 14

Slide 14 text

Getting data in ● Via gmond modules, written in C or Python. ● Varnish metrics, Apache metrics ● Via gmetric or libraries that implement the gmetric protocol. ● Via other daemons designed to feed metrics to ganglia (e.g. statsd)
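For the gmond module route, a Python module is a file exposing `metric_init` and `metric_cleanup`, with one callback per metric. The sketch below is an example skeleton, not a shipped module. The metric name `example_load_one` and the group `example` are invented for illustration.

```python
import os


def get_load_one(name):
    """Callback gmond invokes for the metric; returns the current value."""
    return float(os.getloadavg()[0])


def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    descriptor = {
        "name": "example_load_one",      # hypothetical metric name
        "call_back": get_load_one,
        "time_max": 90,
        "value_type": "float",
        "units": "load",
        "slope": "both",
        "format": "%f",
        "description": "One minute load average (example)",
        "groups": "example",
    }
    return [descriptor]


def metric_cleanup():
    """Called once when gmond shuts down; nothing to release here."""
    pass


if __name__ == "__main__":
    # Run the module by hand to check the callbacks before loading it into gmond.
    for d in metric_init({}):
        print(d["name"], d["call_back"](d["name"]))
```

Running the file directly is a convenient way to debug callbacks before gmond loads the module.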

Slide 15

Slide 15 text

Zero metric configuration ● Just start sending new metrics. ● gmetad will create a new RRD file for any new metric it sees. ● The web UI will draw a basic graph for every metric. ● You can create nice colored graphs later if you want them.
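"Just start sending" can be as small as a wrapper around the gmetric command line tool. A minimal sketch, assuming gmetric is installed and on the PATH. The metric name and units in the example call are placeholders.

```python
import shlex
import subprocess


def gmetric_command(name, value, value_type="float", units=""):
    """Build the gmetric argument list for one metric sample."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", value_type]
    if units:
        cmd += ["--units", units]
    return cmd


def send_metric(name, value, value_type="float", units=""):
    """Invoke gmetric; raises CalledProcessError if the tool fails."""
    subprocess.check_call(gmetric_command(name, value, value_type, units))


# Print the command instead of running it, so the sketch works without Ganglia installed.
example = gmetric_command("nginx_500", 3, "uint32", "errors")
print(" ".join(shlex.quote(arg) for arg in example))
```

No RRD or graph has to be defined beforehand; the first sample is enough for the metric to appear.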

Slide 16

Slide 16 text

Gmond shenanigans ● One aggregating gmond required for each cluster ● Deficiency in the protocol :-(

Slide 17

Slide 17 text

Demo setup (diagram) ● gmond senders: SFO, AMS, NYC ● SFO gmond aggregators on Port=50001, Port=50002 and Port=50003 ● Gmetad poller ● Web UI

Slide 18

Slide 18 text

Install
● On aggregator
    apt-get -y install ganglia-monitor ganglia-monitor-python gmetad rrdtool ganglia-webfrontend
● On nodes
    apt-get -y install ganglia-monitor ganglia-monitor-python

Slide 19

Slide 19 text

Gmond configuration ● Separate aggregator and sender nodes ● We'll be using unicast

Slide 20

Slide 20 text

Sender config
● Send metrics (global section)
    mute = no
    deaf = yes
● Remove any udp_recv_channels and tcp_accept_channels
● Ganglia sends metadata packets separately from metric packets. If the receiver doesn't have the metadata, metrics will not show up. This becomes a problem if the aggregator gets restarted. It is not a problem in multicast setups, where gmonds can send each other messages requesting metadata, but it needs to be set explicitly in unicast. Set the following in the global section
    send_metadata_interval = 60

Slide 21

Slide 21 text

Aggregator config
● Receive metrics only (global section)
    deaf = no
    mute = yes
● Remove any udp_send_channels defined

Slide 22

Slide 22 text

Node name determination ● Out of the box, the receiving/aggregator gmond will use reverse DNS resolution to determine the hostname/node name for received metric packets ● To set the desired host name, use override_hostname = "my_hostname" in the global section

Slide 23

Slide 23 text

Zero configuration ● Just start sending new metrics. ● gmetad will create a new RRD file for any new metric it sees. ● The web UI will draw a basic graph for every metric. ● You can create nice colored graphs later if you want them.

Slide 24

Slide 24 text

High availability setup (diagram)
● gmond.conf (unicast)
    udp_send_channel { host = 1.2.3.4 port = 8649 }
● gmond.conf (unicast)
    udp_send_channel { host = 9.8.7.6 port = 8649 }
● US aggregating gmond.conf
    udp_recv_channel { port = 8649 }
    tcp_accept_channel { port = 8649 }
● EU aggregating gmond.conf
    udp_recv_channel { port = 8649 }
    tcp_accept_channel { port = 8649 }
● US gmetad.conf
    data_source "cluster" 1.2.3.4
● EU gmetad.conf
    data_source "cluster" 9.8.7.6
● Ganglia Web UI (one per gmetad)
● DNS Active Failover

Slide 25

Slide 25 text

Ganglia Demo

Slide 26

Slide 26 text

Web UI tutorial

Slide 27

Slide 27 text

Search ● Search as you type – shows matching hosts then metrics

Slide 28

Slide 28 text

Views ● Arbitrary collection of graphs ● Individual metrics ● Composite graphs ● Aggregate graphs ● How to add ● Add through the web UI ● Configure using JSON configuration files

Slide 29

Slide 29 text

Views JSON config example
    $ cat /var/lib/gangliaweb/conf/view_cpu_util.json
    {
      "view_name": "CPU utilization",
      "default_size": "medium",
      "items": [
        {
          "hostname": "aggregator",
          "metric": "cpu_idle",
          "vertical_label": "%",
          "title": "CPU Idle"
        }
      ],
      "view_type": "standard",
      "parent": null
    }
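Since views are plain JSON files, they are easy to generate from a script, for instance one view per service. A sketch under the assumption that the keys shown in the example above are all a standard view needs. The helper name and the second metric (`cpu_user`) are illustrative additions.

```python
import json


def make_view(view_name, hostname, metrics):
    """Build a view dict with one graph item per (metric, title) pair."""
    return {
        "view_name": view_name,
        "default_size": "medium",
        "items": [
            {
                "hostname": hostname,
                "metric": metric,
                "vertical_label": "%",
                "title": title,
            }
            for metric, title in metrics
        ],
        "view_type": "standard",
        "parent": None,  # serialized as JSON null
    }


view = make_view(
    "CPU utilization",
    "aggregator",
    [("cpu_idle", "CPU Idle"), ("cpu_user", "CPU User")],
)
print(json.dumps(view, indent=2))
```

Writing the printed JSON to a `view_*.json` file in the web UI's conf directory would add the view alongside those created through the UI.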

Slide 30

Slide 30 text

Aggregate graphs ● Easy composite graph creation ● Requires ● Host regular expression ● Metric regular expression

Slide 31

Slide 31 text

Common regular expressions
● Show both bytes_in and bytes_out
    bytes_(in|out)
● Show any metric that starts with bytes
    ^bytes_
● Show only bytes_out and not varnish_bytes_out or bytes_out_compressed
    ^bytes_out$
● Only hosts cache-5, cache-7 and cache-9
    ^cache-(5|7|9)
● All hosts from cache-5 to cache-9
    ^cache-[5-9]
● All hosts except ones starting with cache-t
    ^cache-[^t]
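It can help to try these patterns against a list of names before typing them into the aggregate graph form. The sketch below uses Python's `re.search` as a stand-in for the web UI's matching, which is an assumption about how closely the two agree. The host and metric lists are invented.

```python
import re

# Invented names for illustration.
HOSTS = ["cache-5", "cache-7", "cache-9", "cache-50", "cache-t1", "web-1"]
METRICS = ["bytes_in", "bytes_out", "varnish_bytes_out", "bytes_out_compressed"]


def matching(pattern, names):
    """Return the names in which the pattern matches anywhere (unanchored search)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


for pattern in ["bytes_(in|out)", "^bytes_", "^bytes_out$"]:
    print(pattern, "->", matching(pattern, METRICS))

for pattern in ["^cache-(5|7|9)", "^cache-[5-9]", "^cache-[^t]"]:
    print(pattern, "->", matching(pattern, HOSTS))
```

Two things show up with these sample names: the unanchored `bytes_(in|out)` also picks up `varnish_bytes_out`, and `^cache-(5|7|9)` also matches `cache-50` because nothing anchors the end of the name. Adding `^` or `$` tightens the match where that matters.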

Slide 32

Slide 32 text

Compare hosts ● Compare a set of hosts defined by a regular expression across all common metrics ● Aggregate graphs on steroids ● Will generate hundreds/thousands of aggregate graphs you can use for analysis

Slide 33

Slide 33 text

Events ● View events/Add Events

Slide 34

Slide 34 text

Add events API driven
● Use curl from init script or deploy script
    curl -v "http://ganglia.server/api/events.php?action=add&start_time=now&summary=Restart+of+daemon&host_regex=$HOSTNAME"
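The same call is easy to make from a deploy tool written in Python. A minimal sketch that only builds the URL, using the parameters from the curl example. The base URL and host name are placeholders, and actually fetching the URL is left to the caller.

```python
from urllib.parse import urlencode


def event_url(base_url, summary, host_regex, start_time="now"):
    """Build the events API URL for adding an event."""
    query = urlencode({
        "action": "add",
        "start_time": start_time,
        "summary": summary,        # spaces become "+", as in the curl example
        "host_regex": host_regex,
    })
    return base_url + "/api/events.php?" + query


url = event_url("http://ganglia.server", "Restart of daemon", "web1")
print(url)
```

Letting `urlencode` do the escaping avoids hand-encoding summaries that contain spaces or ampersands.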

Slide 35

Slide 35 text

Automatic rotation ● Aimed at ops teams that need to continuously rotate metrics to help spot early signs of trouble. ● Metrics will be rotated until the browser window is closed. ● If you have multiple monitors you can invoke different views to be rotated on different monitors.

Slide 36

Slide 36 text

Live Dashboard ● Adaptation of Tasseo for Ganglia https://github.com/obfuscurity/tasseo

Slide 37

Slide 37 text

Mobile view ● Mobile optimized view for Ganglia. ● Intended for any mobile browser supported by the jQuery Mobile toolkit. This covers most WebKit implementations, i.e. Android, iPhone iOS, HP webOS and Blackberry OS 6+. ● Provides a better experience viewing Ganglia on your mobile phone by eliminating panning and zooming.

Slide 38

Slide 38 text

UI components you can interact with in host view

Slide 39

Slide 39 text

Add to view

Slide 40

Slide 40 text

Inspect ● Interactive graph you can hover over, zoom

Slide 41

Slide 41 text

Trend

Slide 42

Slide 42 text

Timeshift

Slide 43

Slide 43 text

CSV and JSON export ● Export data from the graph you are currently viewing for further processing, e.g. in a spreadsheet ● Can be done for any graph image URL by appending either &csv=1 or &json=1
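Once a graph is available as CSV, post-processing takes a few lines. The sketch below assumes the layout that the awk one-liner on the "Misc hacks" slide relies on, with the timestamp in the first column and the value in the second. The sample rows are invented.

```python
import csv
import io
import math


def average_second_column(csv_text):
    """Average the numeric values in column 2, skipping headers and NaN gaps."""
    values = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        try:
            value = float(row[1])
        except ValueError:
            continue  # header row or empty cell
        if not math.isnan(value):
            values.append(value)
    return sum(values) / len(values) if values else None


# Invented sample data in the assumed layout.
SAMPLE = (
    "Timestamp,nginx_500\n"
    "2014-06-25T10:00:00,2\n"
    "2014-06-25T10:00:15,4\n"
    "2014-06-25T10:00:30,NaN\n"
)
print(average_second_column(SAMPLE))  # 3.0
```

Skipping the header and NaN rows keeps gaps in the RRD data from skewing the average.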

Slide 44

Slide 44 text

XML export from Gmetad ● curl http://localhost:8652/MYCLUSTER/pico.domain.com/load_one

Slide 45

Slide 45 text

Choose your own adventure ● Nagios integrations/Alerting ● Ad-Hoc Views ● Statsd ● Config options to tune ● Export to other systems ● Ask anything

Slide 46

Slide 46 text

Nagios integration / Alerting

Slide 47

Slide 47 text

Nagios integration/Alerting ● Implements Nagios checks using Ganglia ● You already have nearly all the data you need for alerting, i.e. current load, disk utilization etc. ● If it's something you are going to alert on, you might want to trend it ● Provides for much richer alerts – Use custom criteria other than over/under threshold, e.g. percentage of combined values – Check multiple values – Make sure no one is currently working on a machine (indicated by presence of /etc/disabled file)

Slide 48

Slide 48 text

Nagios integration cont'd
● Check a single metric
  – Alert if one minute load average is > 5
      check_command check_ganglia_metric!load_one!more!5
  – Alert if number of local IPs is not exactly 5
      check_command check_ganglia_metric!local_ips!notequal!5
● Check multiple metrics on a single host
  – Check all disks
      check_command check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20

Slide 49

Slide 49 text

Nagios integration cont'd
● Check multiple metrics on multiple hosts specified by a regex
● Useful in situations where failures occur rarely
● For example, send the number of failed disks in a disk array to Ganglia. Alert on failure
    check_command check_host_regex_ignore_unknowns!'.*'!failed_disks,more,0
● Result
    # Services OK = 236, CRIT/UNK = 2 : CRITICAL compute4566.domain.com failed_disks = 1 disks, CRITICAL git0341.domain.com failed_disks = 1 disks

Slide 50

Slide 50 text

Check value same everywhere
● Sometimes you need to ensure that
  – App revision is consistent across all servers – polling may be tricky due to firewalls, network partitions etc.
  – You have deployed all config files
    check_command check_value_same_everywhere!^cache|^varnish!varnish_vcl_loaded
● Result: VCLs loaded are not the same on all hosts
    CRITICAL CRIT varnish_vcl_loaded differs values 53 ( cache1, cache3, cache4 ) 52 ( cache2 )
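The idea behind this check is simple enough to sketch: group hosts by the value they report and complain if there is more than one group. The code below illustrates that logic only and is not the actual check script. The function name and message wording are assumptions, while the exit codes follow the Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
from collections import defaultdict

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def check_value_same(values_by_host, metric):
    """Return (exit_code, message) depending on whether all hosts agree."""
    hosts_by_value = defaultdict(list)
    for host, value in sorted(values_by_host.items()):
        hosts_by_value[value].append(host)
    if len(hosts_by_value) <= 1:
        return OK, "OK %s is the same on all hosts" % metric
    # List the most common value first so the odd hosts out stand out at the end.
    groups = sorted(hosts_by_value.items(), key=lambda kv: -len(kv[1]))
    parts = ["%s ( %s )" % (value, ", ".join(hosts)) for value, hosts in groups]
    return CRITICAL, "CRITICAL %s differs values %s" % (metric, " ".join(parts))


status, message = check_value_same(
    {"cache1": "53", "cache2": "52", "cache3": "53", "cache4": "53"},
    "varnish_vcl_loaded",
)
print(status, message)
```

Because the values come from Ganglia rather than from polling each host, the check keeps working across firewalls and network partitions as long as the metrics still arrive.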

Slide 51

Slide 51 text

Files present ● Alerting systems will not alert on any machine that has the following file present ● /etc/ganglia_silence ● You will need to expose this as a metric

Slide 52

Slide 52 text

Ad-Hoc Views

Slide 53

Slide 53 text

Ad-Hoc views ● Define arbitrary views on the fly ● Enable them in conf.php ● $conf['ad-hoc-views'] = true; ● Supply complete view JSON config as a GET or POST variable e.g. &ad-hoc-view={"view_name": "CPU utilization","default_size": …
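When the view JSON travels in a GET variable it has to be URL-encoded. A small sketch of building that parameter, where the view content is a cut-down example and the helper name is made up:

```python
import json
from urllib.parse import quote, unquote


def ad_hoc_view_param(view):
    """Serialize a view dict and URL-encode it as the ad-hoc-view parameter."""
    compact = json.dumps(view, separators=(",", ":"))
    return "ad-hoc-view=" + quote(compact, safe="")


# Cut-down example view for illustration.
view = {
    "view_name": "CPU utilization",
    "default_size": "medium",
    "items": [{"hostname": "aggregator", "metric": "cpu_idle"}],
    "view_type": "standard",
}
param = ad_hoc_view_param(view)
print(param)

# Decoding the parameter gives back the original structure.
decoded = json.loads(unquote(param.split("=", 1)[1]))
print(decoded == view)
```

For large views a POST variable avoids URL length limits, which is presumably why both methods are accepted.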

Slide 54

Slide 54 text

Use ad-hoc views with Tasseo ● You can use them with Tasseo as well e.g. ● URL suffix /ganglia2/tasseo.php?ad-hoc-view=

Slide 55

Slide 55 text

Misc hacks

Slide 56

Slide 56 text

Misc hacks
● Notify a chat channel of an average number of HTTP errors
    MIN15AGO=`date --date="15 minutes ago" "+%s"`
    ERROR_RATE=`curl --silent "http://ganglia.domain.com/ganglia/graph.php?c=Web&h=webserver&v=&m=nginx_500&cs=$MIN15AGO&csv=1" | \
      awk -F, '{sum+=$2} END { print "Average = ",sum/NR}'`
    # Send to HipChat
    curl -d "room_id=ourRoom&from=Ganglia&message=Error Rate = $ERROR_RATE&color=red&notify=1" \
      "https://api.hipchat.com/v1/rooms/message?auth_token=AUTH_TOKEN_HERE&format=json"
● http://blog.vuksan.com/2012/04/06/

Slide 57

Slide 57 text

Reporting ● You crazy? ● Use Ganglia as a common one-way bus ● Ganglia supports string metrics. Use them :-) ● Send out version numbers of key applications, config hashes etc.

Slide 58

Slide 58 text

Exports

Slide 59

Slide 59 text

Graphite Export
● Make sure you use UDP transport to send out metrics to Graphite. TCP doesn't perform as well. Enable the following settings in gmetad.conf
    carbon_server "my.graphite.box"
    carbon_port 2003
    carbon_protocol udp
● If you don't care for the Ganglia Web UI, you can disable writing of RRDs
    write_rrds off
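To test the Graphite side independently of gmetad, you can send a data point by hand. The sketch below uses Carbon's plaintext format (`path value timestamp`, newline-terminated) over UDP. The metric path is a made-up example and does not necessarily match the naming gmetad uses.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one sample in Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_udp(line, host="my.graphite.box", port=2003):
    """Fire-and-forget a single plaintext line at a Carbon UDP listener."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric path; fixed timestamp so the output is reproducible.
line = carbon_line("example.pico.load_one", 0.42, 1403692800)
print(line, end="")
```

If a hand-sent point shows up in Graphite but gmetad's metrics do not, the problem is likely in the gmetad settings rather than on the Carbon side. Note that Carbon's UDP listener has to be enabled in its own configuration.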

Slide 60

Slide 60 text

Memcache Export
● Add the following to gmetad.conf
    memcached_parameters "--SERVER=127.0.0.1 --POOL-MIN=10 --POOL-MAX=70"

Slide 61

Slide 61 text

Riemann Export
● Riemann is a powerful event stream processor
● To enable, add to gmetad.conf
    riemann_server "my.riemann.box"
    riemann_port 5555

Slide 62

Slide 62 text

Statsd implementations ● Pystatsd ● Built-in support for Ganglia ● https://github.com/sivy/pystatsd/ ● Etsy statsd ● You need a pluggable statsd backend ● https://github.com/jbuchbinder/statsd-ganglia-backend

Slide 63

Slide 63 text

Tuning

Slide 64

Slide 64 text

Config options to tune
● Add override config options in conf.php (overrides anything in conf_default.php)
● Remove stats from graph legend
    $conf['graphreport_stats'] = false;
● Change the default metric that shows up. Default is load_one
    $conf['default_metric'] = "cpu_report";
● Disable authentication – enables view and event creation (if you are behind firewall/basic auth)
    $conf['auth_system'] = 'disabled';
● Don't show all host metrics by default
    $conf['metric_groups_initially_collapsed'] = true;

Slide 65

Slide 65 text

Config options to tune
● Change default time ranges
    $conf['time_ranges'] = array(
      'hour'=>3600,
      '2hr'=>7200,
      '4hr'=>14400,
      'day'=>86400,
      'week'=>604800,
      'month'=>2419200
    );

Slide 66

Slide 66 text

Links ● Ganglia Github repos ● http://github.com/ganglia/