Upgrade to Pro — share decks privately, control downloads, hide ads and more …

メルカリのシステム・サービス監視について/Monitoring Mercari service...

kazeburo
November 29, 2017

メルカリのシステム・サービス監視について/Monitoring Mercari service and servers

メルカリのシステム・サービス監視について
Monitoring seminar in Mercari

kazeburo

November 29, 2017
Tweet

More Decks by kazeburo

Other Decks in Technology

Transcript

  1. Me • Masahiro Nagano / ௕໺խ޿ • id:kazeburo • Mercari,

    Inc
 Principal Engineer
 Site Reliability Engineering (SRE) Team
  2. Zabbixͷ՝୊ • Zabbix ࣗମͷӡ༻ • όʔδϣϯΞοϓͷෛ୲ • MySQL ͷෛՙ͕େ͖͘ɺ؂ࢹ஗ԆͳͲ΋ൃੜ •

    ෳࡶͳτϦΨʔͷ؅ཧ • Ϛ΢εΫϦοΫओମͷઃఆ • όʔδϣϯ؅ཧͳͲΛߦ͍͍ͨ • ؂ࢹͷ௨஌Λվળ͍ͨ͠
  3. mackerel ಋೖ • Service Metrics͔Βಋೖ • ؆୯ʹάϥϑ͕ඳ͚ɺ؂ࢹᮢ஋ͷઃఆ͕Ͱ͖Δ • fluentdɺNorikraͱͷ૊Έ߹Θͤ •

    ZabbixͷτϦΨʔͷҠ২ • τϦΨʔΛPluginͱ࣮ͯ͠૷ • Plugin͸ GitͰ؅ཧ͠ɺAnsibleͰ഑෍
  4. ؂ࢹʹ·ͭΘΔ਺ࣈ • ؂ࢹϧʔϧ਺: 278 • Hostຖͷ؂ࢹϧʔϧ਺ • MySQL: 34 •

    Application: 39 • Search: 37 • Custom Plugin: 50+ (check + metrics + utils)
  5. MySQLͷ؂ࢹ߲໨(1/4) • Connectivity • FileSystem % >85% >88% • Swap

    % >50% >70% • ssh-alive • sshdͷϓϩηε؂ࢹ • global-ip-and-iptables • global ipͷ༗ແͱiptablesͷঢ়ଶ • unbound-resolv • localͷunboudͰ໊લղܾ͕Ͱ͖Δ͔ • unbound-process • unboundͷϓϩηε؂ࢹ • crond-process • crondͷϓϩηε؂ࢹ • uptime • ࠶ىಈ؂ࢹ
  6. MySQLͷ؂ࢹ߲໨(2/4) • inode-usage • inode࢖༻཰ >80% >90% • uname-change •

    unameίϚϯυͷ݁Ռͷdiff؂ࢹ • passwd-change • passwdϑΝΠϧͷdiff؂ࢹ • hostname-changed • hostnameίϚϯυͷ݁Ռͷdiff؂ࢹ • custom.ntpq.synced.remote <0.1 <0.1 • custom.ntpq.offset.seconds >300 >300 (msec) • ntpͷಉظαʔόͱ࣌ࠁͷζϨ • custom.linux-lite.memory.avail <50MB <20MB • ۭ͖ϝϞϦ • custom.linux-lite.cpu-usage.cpu-steal >20% >20% • custom.linux-lite.cpu-usage.cpu-iowait >30% >50% • custom.linux-lite-cpu-usage.cpu-system >8% >8% • ͦΕͧΕͷCPU࢖༻཰(100%্͕ݶ)
  7. MySQLͷ؂ࢹ߲໨(3/4) • cutom.linux-lite.loadavg.per-cpu >3 >3 • ίΞ਺ͰׂͬͨϩʔυΞϕϨʔδ • postfix-smtp-alive •

    SMTPϙʔτͷ֬ೝ • postfix-master-process • postfix masterϓϩηε؂ࢹ • custom.postfix.mailq.queue >100 >5k • postfix mail Ωϡʔ଺ཹ਺ • custom.linux-lite.process.all >2k >2k • custom.linux-lite.process.running >60 >100 • શϓϩηε਺ͱ࣮ߦதͷϓϩηε਺ • mysql-uptime • mysqlͷuptime • custom.mysql-lite.replication-threads.io <0.2 <0.2 • custom.mysql-lite.replication-threads.sql <0.2 <0.2 • ϨϓϦέʔγϣϯͷ֤threadͷঢ়ଶ
  8. MySQLͷ؂ࢹ߲໨(4/4) • custom.mysql-lite.replication-behind- master.second >5 >5 • mysqlͷϨϓϦέʔγϣϯ஗Ԇ • custom.mysql-lite.connections.utilization

    >90 >90 • max_connectionsʹର͢ΔίωΫγϣϯ਺ • custom.mysql-lite.threads.running >1k >2k • mysql্Ͱ࣮ߦதͷεϨου਺ • mysql-slave-sql-error • replicationΤϥʔͷ؂ࢹ • machine-exceptions • αʔόͷϝϞϦΤϥʔ؂ࢹ • raid-disks • αʔόͷRAID/Diskঢ়ଶͷ؂ࢹ
  9. end