Practical Resource Monitoring with Munin

Hello. I'm @zembutsu. I work at a server hosting company in Japan. I am a solutions engineer; I manage many servers and mainly handle network operations.
My presentation is about Munin, a resource monitoring tool.

The original version (in Japanese) is here:
http://www.slideshare.net/zembutsu/practical-resource-monitoring-with-munin

Munin User Group Japan http://munin.jp/
Masahito Zembutsu @zembutsu
September 8, 2012 OpenSource Conference 2012 Tokyo/Fall, Japan (#osc12tk)

Transcript

  1. Munin User Group Japan http://munin.jp/
    Masahito Zembutsu @zembutsu
    September 8, 2012, OpenSource Conference 2012 Tokyo/Fall (#osc12tk)
    “Practical Resource Monitoring with Munin - English Edition”
  2. Nice to meet you. I’m @zembutsu. Thank you for giving me the opportunity to present!
    These are characters from Touhou Project, and "Please take it easy!!" (yukkuri site itte ne!) is a famous slang phrase in Japan. http://en.wikipedia.org/wiki/Touhou_Project
  3. Why Am I Here?
    • Masahito ZEMBUTSU @zembutsu
      – Solutions Engineer (an engineer with a fiery, zealous otaku mind)
        • Working as a server infrastructure engineer.
        • I want to provide relaxation and rest for engineers. (Operation/Monitoring/Automation)
      – Open source and cloud computing communities
        • My website http://pocketstudio.jp/
      – Experience
        • April 2000 - Support engineer for server hosting and an ISP
        • May 2008 - Company internal network management and support
        • November 2010 - Service development and upper escalation operation
        • July 2012 - Operation, development, and research at a datacenter somewhere.
    Don’t sweat the details!
    http://jaws-ug.jp/ http://opencloud.jp/
    (photo) this is me
  4. ―Don't forget. Always, somewhere, someone is fighting for you.
    ―As long as you remember her, you are not alone.
    (Reference: “Puella Magi Madoka Magica” Episode 12 “My Very Best Friend”)
    Operation Monitoring
  5. A Dedicated Hosting Services Shutdown Attack, An Unfamiliar Specification, Cloud Computing’s Arrival in Japan, Shape of Server, A Business That’s Changing, My Purest Heart for Our Customers.
    Troubleshooting: The Phone That Never Stops Ringing, The Day a Datacenter Stood Still, The Choice of Priority, In Sickness unto Shutdown, and…, Sales Representative’s Invasion, Customer’s Office the Throne of Souls, Tears.
    You’re a loser only when you fail to try: The Birth of Special Task Force, The Value of Miracles, At Least, Be Human.
    DECISIVE BATTLE. A HUMAN WORK. We can (not) advance.
  6. My Little Servers Can't Be This Heavy...
    But Munin may help, and may lead to a solution to the problem.
  7. This is what I want to share today.
    • I think it is necessary to build resource monitoring into the operations workflow.
    • As a result, it may reduce the burden on administrators. I'm extremely happy. XD
    • What we need is a culture of leaving the office on time!! (Or is that only me?)
  8. Agenda
    1. What is Munin?
    2. Munin’s Architecture
    3. How to use Munin
    4. Practical troubleshooting!
    5. MY VERY BEST MONITORING TOOL
  9. I hope…
    1. Let's obtain a weapon called “resource monitoring” for ourselves.
    2. We improve the efficiency of our work (server and network operations).
    Wille zur Macht
    “Let's find happiness together.” (Reference: Kiichi Goto, Patlabor: The Movie, 1989)
    I guess everybody's happy, that's fine.
  10. Munin.jp
    • Munin User Group Japan – http://munin.jp/
    • Wiki – http://munin.jp/wiki/
    • Demo – http://demo.munin.jp/
    • How to join us – http://munin.jp/mailing-list/
  11. Munin is a networked resource monitoring tool.
    I dare to say, Munin is a resource monitoring tool. I dare to say!!
  12. When you click any graph, its day / week / month / year graphs are displayed.
  13. The vertical axis lists servers and the horizontal axis lists metrics. The grouping function is a distinctive feature.
  14. By the Time we Realized It, It Had Already Begun.
    • Troubles that alert systems can’t detect (increasing)
      – Mainly with customers running social networking services
      – By the time an alert threshold is exceeded, it is already too late.
    • Customer demands
      – Rapid response
      – Because the loss per second is orders of magnitude larger than before.
      – A loss of several hundred dollars per minute :(
  15. “There is something weird, will you check servers? :)” (a request from my customer)
    • A very difficult request...
      – Clearly identifying the cause often takes time.
    • I want to do my best!
      – Yes!! I rouse myself and go to work. The administrators get exhausted…
      – I want to aim at improving the service, but this way of thinking is bad. Why? Let’s see the next slide.
  16. BIND
    An old network configuration. A typical web service had, at most, a configuration like this: one web server and one database server. It’s very simple!
  17. BIND
    The number of servers and managed objects has increased compared with the past. Therefore support takes more time, and the difficulty rises, too.
    This Just Can't Be Right!!
  18. Why did this happen?
    • The changing environment
      – Network
      – Server
      – Software
      – Middleware
      – Application
      – etc.
  19. The most important thing in troubleshooting
    • Investigating the cause has top priority.
    “When we act, noticing is the first prerequisite. Having technique does not mean that anything can be solved. Before technique, it is necessary to notice. There are any number of technical experts in Japan, but problems are not readily solved. The reason is that they do not notice.”
    Soichiro Honda (2008) "Zakkubaran" (candidness), PHP Inc., 10pp. http://en.wikipedia.org/wiki/Soichiro_Honda
  20. You sure that’s enough armor (tools)?
    • “No problem. Everything’s fine.”
      – ps
      – top
      – vmstat
      – iostat
      – free
      – sar (sysstat)
      – …etc.
    Really?
  21. Situation has changed
    Past
    • One or several servers
    • Apache, Sendmail, Perl
    • PostgreSQL, MySQL
    • Network appliance (sometimes)
    • No scaling
    • Upgrading is effective
    Now (present day, present time, hahaha!!)
    • Multiple servers in the same network (we assume)
    • Conventional software + nginx, Tomcat, Ruby, PHP, Python, memcached, Key-Value Store, Hadoop, Cassandra, MongoDB… etc.
    • The need for scalability
    • Upgrading is not effective
    I think that one of the answers to this problem is resource monitoring using Munin.
  22. I Know What Your Server Did Last Summer
    The essence of Munin is the visualization of many resources.
  23. Is This MRTG? No, This Is Munin.
    MRTG has declined.
    We have lost a hero to our glorious and noble cause, but does this foreshadow our defeat? No. It is a new beginning. Compared to the Cloud Computing Federation, the national resources of Dedicated Server are less than one thirtieth of theirs. Despite this major difference, how is it that we have been able to fight the fight for so long? It is because our goal in this war is a righteous one. It’s been over fifty years since the elite of Cloud Computing, consumed by greed, took control of the Cloud Computing Federation. We want our freedom. Never forget the times when the Federation has trampled us! We, the Principality of Dedicated Server, have had a long and arduous struggle to achieve freedom for all engineers of our great network. Our fight is sacred, our cause divine. My beloved brother, MRTG, was sacrificed. Why? The war is at a stalemate.
  24. Comparative table
    Columns: Tool name / Type / Datastore / Config / Web interface / Alerting
    • Munin / Resource monitoring / RRDTool / CUI
    • Cacti / Resource monitoring / RRDTool & MySQL / CUI/GUI
    • MRTG / Resource monitoring / original / CUI / ×
    • Zabbix / IT infrastructure monitoring / MySQL, PostgreSQL, etc. / GUI
    • Nagios / IT infrastructure monitoring / MySQL or PostgreSQL / CUI/GUI
    (Reference only)
    Each has both good points and bad points. My team uses Munin and a Nagios-based tool, each for the purpose it suits. We are friends all the time...
  25. About Munin
    • http://munin-monitoring.org/
    • Resource monitoring tool
      – Munin can analyze resource trends
      – “what just happened to kill our performance?”
    • Plug and Play architecture
      – It can monitor many items by default
    Munin is a networked resource monitoring tool that can help analyze resource trends and "what just happened to kill our performance?" problems. It is designed to be very plug and play. A default installation provides a lot of graphs with almost no work.
    Be alert!
  26. Progress in development
    • Community based
      – GitHub
        • https://github.com/munin-monitoring
      – Mailing list
        • https://lists.sourceforge.net/lists/listinfo/munin-users
      – IRC
        • irc://irc.oftc.net/#munin
    • Licence
      – GNU General Public License version 2
      – There is no commercial support
  27. History
    • 2002 - project began
      – The original name was “LRRD”
    • 2004 - Munin 1.0 released
      – The name “munin-eye” was changed to “munin-node”
      – It took a long time, and improvements continued day by day
    • 2009 - Munin 1.4 released
      – I think it is probably the most widely deployed version in the 1.x series.
    • May 30, 2012 - Munin 2.0 (stable) released
  28. Where is the Japanese information?
    • NOT YET!
    • Let’s make it together now!
      – How about writing something on the wiki first?
        • http://munin.jp/wiki/
    “Is the number of invitations to the Munin user group ZERO again this week? Hmm? Do you have the will to do it?”
    “I’m sorry, my apologies…”
  29. Summarize the points
    • Munin is a resource monitoring tool. (GPL v2)
    • Simple and powerful architecture.
    • Munin frees us from the console. (efficiency)
    • “Munin” means “memory”.
    You are never alone! Munin is always here for you 24x7x365
  30. This is the main work of the Munin master; its programs are executed by cron. One after another, they collect the data, check thresholds, and generate the HTML files and graphs.
  31. Plugins are executed by munin-node; each plugin is a script that acquires some data. munin-update stores the acquired data in RRDtool. Then munin-limits checks the thresholds.
  32. And munin-graph and munin-html generate the graphs and HTML from the data (.rrd) stored by RRDtool.
  33. Components of Munin
    master (SERVER)
    • Perl Libs
      – Munin::Common
    • munin-cron
      – munin-update
      – munin-limits
      – munin-html
      – munin-graph
    • config: munin.conf
    munin-node (CLIENT)
    • Perl Libs
      – Munin::Common
    • munin-node
      – config: munin-node.conf
      – Plugins
    • Tools
      – munin-node-configure
      – munin-cron
  34. About data collection
    • munin-node collects various data.
    • Port 4949 (TCP)
      – Munin protocol
        • LIST
        • CONFIG
        • FETCH
        • VERSION
        • QUIT
    (T_T)4949
    “4949” is a Japanese onomatopoeia for a "tearful face".
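The protocol is plain text, so a session can be driven by hand. A minimal sketch, assuming a munin-node listening on localhost and the `nc` (netcat) command being installed; the helper name `munin_session` is hypothetical, and the commands are sent in lowercase here:

```shell
#!/bin/sh
# Build the command sequence for one plugin.
# (munin_session is a hypothetical helper, not part of Munin.)
munin_session() {
    plugin=$1
    printf 'list\nconfig %s\nfetch %s\nquit\n' "$plugin" "$plugin"
}

# Usage (assumes munin-node on localhost:4949 and netcat installed):
#   munin_session load | nc localhost 4949
```

The node answers `list` with the plugin names, `config` with the graph definition, and `fetch` with the current values, which is the same exchange the master performs on every polling run.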
  35. Data storage and graph generation are the work of RRDtool
    • Data format is RRD (round robin database)
      – /var/lib/munin/<hostname>/<plugin’s name>.rrd
    • 50 KB per RRD file
      – More than 200 KB per plugin (MUST)
      – 150 to 250 files per munin-node (total about 8 to 15 MB per node)

        -rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-idle-d.rrd
        -rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-iowait-d.rrd
        -rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-irq-d.rrd
        -rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-nice-d.rrd
        -rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-softirq-d.rrd
        -rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-steal-d.rrd
        -rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-system-d.rrd
        -rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-user-d.rrd
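The per-node disk estimate can be checked with simple arithmetic. A rough sketch that uses the 50612-byte file size from the listing; the function name is hypothetical, and real file sizes depend on the RRD settings in use:

```shell
#!/bin/sh
# Estimate RRD disk usage in KB for one node (hypothetical helper).
# $1 = number of .rrd files
# $2 = size of one file in bytes (default 50612, taken from the listing)
rrd_usage_kb() {
    files=$1
    bytes=${2:-50612}
    echo $(( files * bytes / 1024 ))
}

rrd_usage_kb 150   # lower end of the file count on the slide
rrd_usage_kb 250   # upper end of the file count on the slide
```

With 150 to 250 files this comes to roughly 7,400 to 12,400 KB for the RRD files alone, the same order of magnitude as the slide's estimate.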
  36. Munin provides many plugins
    • System resources
      – CPU, memory, load average, disk, S.M.A.R.T…
    • Network
      – Traffic, SNMP, HTTP load time, TCP, UDP, ICMP…
    • Applications, middleware
      – Apache, Nginx, Sendmail, Postfix, MySQL, PostgreSQL, MongoDB, memcached, PHP… etc.
  37. Ex) Load Average plugin
    • /etc/munin/plugins/load
      – “Load average” here is the five-minute average
      – It’s a symbolic link
        • The original is /usr/share/munin/plugins/load
      – Simple shell script

        echo -n "load.value "
        cut -f2 -d' ' < /proc/loadavg

      – Output

        load.value 3.22
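A complete plugin also has to answer the `config` argument with its graph definition. A minimal sketch of what such a script could look like; it is not the script shipped with Munin, and `LOADAVG_FILE` is a hypothetical override added here so the function can be tried on systems without /proc:

```shell
#!/bin/sh
# Sketch of a load-average plugin (not the script shipped with Munin).
# LOADAVG_FILE is a hypothetical override for testing; default is /proc/loadavg.
load_plugin() {
    file=${LOADAVG_FILE:-/proc/loadavg}
    if [ "${1:-}" = "config" ]; then
        # Graph definition, printed when munin-node asks for "config".
        echo "graph_title Load average"
        echo "graph_vlabel load"
        echo "load.label load"
        return 0
    fi
    # The second field of /proc/loadavg is the 5-minute average.
    printf 'load.value %s\n' "$(cut -f2 -d' ' < "$file")"
}

# When saved as a plugin file, the last line would call: load_plugin "$1"
```

The split between a `config` answer and a plain value answer is the same shape the httping plugin later in the deck follows.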
  38. Environment
    • Perl5
    • OS
      – Linux
    • Source code (version 2.0.6)
    • Binary package
      – Red Hat Enterprise Linux family (EPEL)
      – Debian
      – openSUSE
      – Mac OS X
      – Windows
  39. Setup flow
    • Install Munin and the Perl libraries
    • Edit the config file (munin.conf)
    • Set up munin-node (munin-node.conf)
    • Check its graphs
  40. Case) Red Hat Enterprise Linux
    • Use the EPEL*1 (testing repository) package or source
    • Procedure
      – 1. enable EPEL
      – 2. “yum install munin”
      – 3. configure munin.conf
      – 4. turn on munin-node and set it up
      – 5. check
    *1 Extra Packages for Enterprise Linux (EPEL) https://fedoraproject.org/wiki/EPEL
  41. Case) Debian / Ubuntu
    • Use apt (Debian PTS is testing) or source
    • Procedure
      – 1. set up the Perl libraries (via apt-get)
      – 2. install munin
      – 3. configure munin.conf
      – 4. turn on munin-node and set it up
      – 5. check
  42. Config files
    Munin master
    • /etc/munin/munin.conf
      – Host tree (targeting nodes)
      – Graph strategy
        • Cron or realtime generation
      – Paths
        • RRD files
        • Log files
    munin-node
    • /etc/munin/munin-node.conf
      – Access control
        • Host (IP address)
        • Network CIDR
      – Node’s hostname
      – Port number
        • Default: TCP 4949 (T_T)
      – Plugin options
  43. [munin-node.conf] Access control
    • allow ^127\.0\.0\.1$
      – Regular expression
    • cidr_allow 192.0.2.0/24
      – Not a regular expression
    • If you change the file, you must restart munin-node!
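Because `allow` takes a regular expression, the dots need escaping and the pattern needs anchoring, otherwise unintended strings match. A small sketch that tries patterns with `grep -E`; it only illustrates the regex behavior and is not how munin-node itself evaluates the rule (the helper name is hypothetical):

```shell
#!/bin/sh
# Return success if address $2 matches the allow-style regular expression $1.
# (allow_matches is a hypothetical helper for experimenting with patterns.)
allow_matches() {
    printf '%s\n' "$2" | grep -Eq "$1"
}

allow_matches '^127\.0\.0\.1$' 127.0.0.1 && echo "escaped pattern: 127.0.0.1 matches"
allow_matches '^127\.0\.0\.1$' 127.0.0.10 || echo "escaped pattern: 127.0.0.10 does not match"
# An unescaped dot matches any character, so this unrelated string also matches:
allow_matches '^127.0.0.1$' 127x0x0x1 && echo "unescaped pattern: 127x0x0x1 matches"
```

For whole networks, `cidr_allow` avoids the problem, since it takes a CIDR block rather than a pattern.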
  44. Basic knowledge of Munin plugins
    • The original files (shell or Perl scripts) are here
      – /usr/share/munin/plugins/
    • How to use
      – Make a symbolic link in /etc/munin/plugins
      – Configure munin-node.conf
      – Restart munin-node (MUST)
      – Check the graph and HTML
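The enabling steps above can be sketched as a small helper. The directory defaults are the ones from the slide; the function name and the overridable directories are assumptions added so it can be tried in a scratch directory, and the restart command differs between distributions:

```shell
#!/bin/sh
# Enable a plugin by linking it into the active plugin directory.
# (enable_plugin is a hypothetical helper, not a Munin tool.)
# $1 = plugin name, $2 = source dir, $3 = destination dir
enable_plugin() {
    name=$1
    src=${2:-/usr/share/munin/plugins}
    dst=${3:-/etc/munin/plugins}
    if [ ! -e "$src/$name" ]; then
        echo "no such plugin: $src/$name" >&2
        return 1
    fi
    ln -s "$src/$name" "$dst/$name"
}

# Usage (as root), followed by a restart of munin-node, for example:
#   enable_plugin load && service munin-node restart
```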
  45. How to debug a plugin
    • /usr/sbin/munin-run <plugin-name>
      – “--debug” shows more detail
      – Behaves the same way as munin-node
      – Useful
    • Command line tools (that I made)
      – muninwalk & muninget; Perl scripts https://github.com/zembutsu/muninwalk
  46. Apache
    • Symbolic link

        # ln -s /usr/share/munin/plugins/apache_* /etc/munin/plugins/

    • munin-node.conf

        [apache_*]
        env.url http://127.0.0.1:%d/server-status?auto
        env.ports 80

    • httpd.conf

        ExtendedStatus On
        <Location /server-status>
            SetHandler server-status
            Order deny,allow
            Deny from all
            Allow from 127.0.0.1
        </Location>
  47. MySQL
    • Symbolic link

        # ln -s /usr/share/munin/plugins/mysql_* /etc/munin/plugins/

    • munin-node.conf

        [mysql*]
        env.mysqlopts -u root -pPASSWORD
        env.mysqladmin /usr/bin/mysqladmin
  48. BIND
    • Symbolic link

        # ln -s /usr/share/munin/plugins/bind9_rndc /etc/munin/plugins/

    • munin-node.conf

        [bind9_rndc]
        env.rndc /usr/sbin/rndc
        env.querystats /var/named/chroot/var/named/data/named_stats.txt
        user root

    • named.conf

        statistics-file "/var/named/data/named_stats.txt";
  49. Sample case: httping plugin
    • http://www.vanheusden.com/httping/
    • "httping" is a command-line tool that checks the response time of a web server, like the “ping” command.
    • If you set the -S option, you can see the connect time and the processing time separately.

        $ httping -S http://210.239.46.254/
        PING 210.239.46.254:80 (http://210.239.46.254/):
        connected to 210.239.46.254:80 (380 bytes), seq=0 time=0.10+0.69=0.79 ms
        connected to 210.239.46.254:80 (380 bytes), seq=1 time=0.08+0.47=0.55 ms
        connected to 210.239.46.254:80 (380 bytes), seq=2 time=0.07+0.68=0.75 ms
        connected to 210.239.46.254:80 (380 bytes), seq=3 time=0.12+0.66=0.77 ms
        Got signal 2
        --- http://210.239.46.254/ ping statistics ---
        4 connects, 4 ok, 0.00% failed
  50. Plugin: httping_
    This is the substance of the httping plugin; the file itself is a simple shell script. It contains the graph definition and the commands that actually acquire the values. The point is only to acquire data, so a plugin can be written in any language, including Perl and PHP.

        #!/bin/sh
        #
        # Plugin to monitor HTTP response (httping)
        #%# family=auto
        #%# capabilities=autoconf

        URL=${URL:-"http://localhost/"}
        COUNT=${COUNT:-"5"}
        httping_bin=$(which httping)

        if [ "$1" = "autoconf" ]; then
            echo yes
            exit 0
        fi

        if [ "$1" = "config" ] ; then
            echo "graph_args -r --lower-limit 0 ";
            echo "graph_title http response $URL";
            echo "graph_category httping";
            echo "graph_info httping response time: $URL";
            echo 'graph_vlabel msec'
            echo "connect.label connect time"
            echo "connect.draw AREA"
            echo "connect.type GAUGE"
            echo "processing.label processing time"
            echo "processing.draw STACK"
            echo "processing.type GAUGE"
            exit
        fi

        # format for httping 1.5.3 http://www.vanheusden.com/httping/
        $httping_bin -c $COUNT -G -S $URL | tr '+|=' ' ' | awk '{connect+=$9; processing+=$10} END{print "connect.value",connect/'$COUNT'"\n""processing.value",processing/'$COUNT'}'

    Callouts: the “config” block defines the graph; the output format is “xxx.value ***”.
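The parsing step at the end of the plugin can be tried on its own. A sketch that feeds two reply lines from the sample httping output through the same `tr` and `awk` idea; unlike the plugin, it divides by the number of reply lines actually seen rather than by `$COUNT`, and the function name is hypothetical:

```shell
#!/bin/sh
# Average the connect and processing times from `httping -S` output on stdin.
# (httping_average is a hypothetical helper, not part of the plugin.)
httping_average() {
    # Turn "time=0.10+0.69=0.79" into "time 0.10 0.69 0.79" so awk sees separate fields.
    tr '+=' '  ' | awk '
        $1 == "connected" { connect += $9; processing += $10; n++ }
        END {
            if (n > 0) {
                print "connect.value", connect / n
                print "processing.value", processing / n
            }
        }'
}

# Two reply lines taken from the sample output on the httping slide:
httping_average <<'EOF'
connected to 210.239.46.254:80 (380 bytes), seq=0 time=0.10+0.69=0.79 ms
connected to 210.239.46.254:80 (380 bytes), seq=1 time=0.08+0.47=0.55 ms
EOF
```

Counting the reply lines keeps the average meaningful when some requests fail and fewer than `$COUNT` replies come back; header and statistics lines are skipped because they do not start with "connected".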
  51. Config: httping_
    • Symbolic link

        # ln -s /usr/share/munin/plugins/httping_ /etc/munin/plugins/httping_localhost

    • /etc/munin/plugin-conf.d/httping

        [httping_localhost]
        env.URL http://pocketstudio.jp/
        env.COUNT 5

        [httping_blog]
        env.URL http://pocketstudio.jp/log3/
        env.COUNT 5

        [httping_node1]
        env.URL http://node1.pocketstudio.net/
        env.COUNT 5
  52. httping live demo
    • http://demo.munin.jp/munin2/httping-day.html
    In this case, this server has no problem with either response time or processing time.
    For this server group, the blue part (processing time) is large: a certain CMS takes up the processing time. On the other hand, I can see that the network is fine.
  53. Never say never.
    • Agility is the pivot of the service (in my case)
      – Look for the cause and the solution of the trouble
        • Hardware, software, or network?
      – We need to investigate promptly
        • Where is the problem happening?
  54. Live Munin demo
    • http://demo.munin.jp/
      – Let's observe the resource situation through this Munin demonstration site.
    • Where is the bottleneck? Or where will it be?
    • Even without logging in to a server, you can check many resources.
  55. Case) Identifying unauthorized access
    • By the Time we Realized It, It Had Already Begun.
    • Situation
      – 1. Error emails began to arrive at postmaster
      – 2. The monitoring tool raised no alert
      – 3. So I first checked the resources in Munin
      – 4. From the situation, I identified that the CMS had a vulnerability and acted promptly.
    Thanks to Munin, I was able to do all of the above in a short time.
  56. I confirmed the time when the traffic was strange. MySQL queries rose suddenly, too.
    From this situation, I suspected unauthorized access to the CMS. Indeed, when I investigated the logs for that time, I found an attack on a specific URL. Identifying the cause and acting on it would have taken more time if I had not used Munin.
  57. Munin 2.0 has new features!
    • Better UI and CGI integration
      – New look, graph zooming, FastCGI
    • Asynchronous I/O support
      – Better performance
    • Native SSH transport
      – Secure (port 22) & easy setup
    • Asynchronous proxy support
      – async-server substitutes for munin-node
    • And more…
      – https://github.com/munin-monitoring/munin/blob/devel/Announce-2.0
  58. Munin changed my support flow (my case)
    • If I don’t use tools
      – Troubleshooting means running various commands (sysstat) and investigating the log files.
      – But this method needs a long time and many people, and is bad for the service.
    • If I use Munin (now)
      – Even if I do not log in, I can understand the situation.
      – I can judge abnormalities visually
        • “I see the ending of this troubleshooting!”
      – Agile support
        • Troubleshooting that has Plan-Do-Check-Action (PDCA) cycles.
  59. In my dedicated server hosting work
    • I really depend on Munin
      – I always set up Munin.
      – Munin is on almost all of the several hundred servers that I manage directly.
      – I think that Munin is indispensable to improving our service quality.
    BAM BAM! Neat
    I cannot part with Munin in my work. You believe it!
  60. Troubleshoot PDCA: Law of Cycles
    Presage!! Plan Do
    Detecting the problem and situation: “For real? What are these alerts?”
    Suppose a cause: “OK, Munin. Please tell me where the trouble lies hidden.”
    “Fire!” “Please stop!!”
  61. Troubleshoot PDCA: Law of Cycles
    Presage!! Plan Do Check
    Detecting the problem and situation: “For real? What are these alerts?”
    Suppose a cause: “OK, Munin. Please tell me where the trouble lies hidden.”
    “Fire!” “Please stop!!”
    To check resources remotely: “I'll just tell you what I just saw in Munin!!”
  62. Troubleshoot PDCA: Law of Cycles
    Presage!! Plan Do Check Action
    Detecting the problem and situation: “For real? What are these alerts?”
    Suppose a cause: “OK, Munin. Please tell me where the trouble lies hidden.”
    “Fire!” “Please stop!!”
    To check resources remotely: “I'll just tell you what I just saw in Munin!!”
    Log in and execute commands: “Wow!” click-clack click-clack
  63. The Only Thing I Have Left To Guide Me
    You are never alone! Munin is always here for you 24 x 7 x 365
  64. Munin’s overview
    ・Munin is a resource monitoring tool that specializes in helping you notice problems through visualization.
    ・Simple architecture and many plugins.
    ・It is most suitable for systems that need quick support in a short time.
  65. While there’s Munin, there’s hope.
    Conclusion
    Thank you for MUNIN. Good-bye to MRTG. No Munin, No Operation.
    MY VERY BEST MONITORING TOOL.
    * This is my personal impression.
  66. I wish…
    • If my presentation made you interested in Munin, I would appreciate it if you tried it.
    • Tomorrow is another day. Up to you.
    Squidn’t you use Munin? (Shouldn’t)
  67. Questions?
    • Do you have any questions about Munin?
    I'm glad you asked. As a reward, I'll give you the right to buy Opoona. (But it's in the bargain bin...)
  68. References
    • Munin
      – http://munin-monitoring.org/
    • Munin User Group Japan
      – http://munin.jp/
      – http://munin.jp/wiki/
    • Website
      – Waiting for Munin 2.0 – Introduction – Personal Workflow Blog
        • http://blog.pwkf.org/post/2010/06/Waiting-for-Munin-2.0-Introduction
      – /tags/2.0.0/ChangeLog – Munin – Trac
        • http://munin-monitoring.org/browser/tags/2.0.0/ChangeLog
    Please send feedback to [email protected] or @zembutsu (Twitter).
    Thank you for reading!