Monitoring at scale: a needle in the haystack

Monitoring platform at CCIN2P3 presented at HEPiX Spring 2016

Fabien Wernli

April 21, 2016

Transcript

  1. EVENT MONITORING @ CCIN2P3
     ~ 100 million logs per day
     ~ 1 billion metrics per day
     ~ 400'000 services
     ~ 2'400 devices
     40k/s peaks
     15s period
  2. EVENT TYPES
     metrics (collectd) (pull):
       ccnode42 cpu-0/percent-idle ok 1.2
       ccddeeff load/load-relative warning 2.3
     logs (syslog-ng) (push):
       e1000e: eth0 NIC Link is Down
       puppet: Finished catalog run in 58.48 seconds
  3. [Architecture diagram]
     Inputs: metrics (collectd), logs (syslog)
     Components: syslog-ng, Riemann, Elasticsearch, Kibana, Grafana, riemann-dash, Clients
     Functions: parsing, correlation, aggregation, consolidation
     Interfaces: REST API, websocket API
  4. RIEMANN
     realtime stream processing
     logs and metrics
     aggregation
     routing
     subscription through ws (websocket)
     multi-tier archiving of metrics through samplerr
     reportedly achieving MEPS (millions of events per second)
  5. EXAMPLE: METRIC
     {
       "timestamp": 1461075576,
       "service": "cpu-0/percent-idle",
       "state": "ok",
       "host": "ccddeeff.in2p3.fr",
       "ttl": 30,
       …
       "metric": 42,
       "plugin": "cpu",
       "pinstance": "0",
       "type": "percent",
       "tinstance": "idle",
       …
     }
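The metric event above carries both the combined "service" string and its four parts (plugin, pinstance, type, tinstance). A minimal Python sketch of how the two could relate, assuming the service always has the form plugin[-pinstance]/type[-tinstance] as in the example. The function name `split_service` is hypothetical and is not part of collectd or Riemann.

```python
def split_service(service: str) -> dict:
    """Split 'plugin[-pinstance]/type[-tinstance]' into its four parts.

    Assumption: plugin and type names contain no hyphen, so the first
    hyphen on each side of the slash separates name from instance.
    """
    plugin_part, _, type_part = service.partition("/")
    plugin, _, pinstance = plugin_part.partition("-")
    type_, _, tinstance = type_part.partition("-")
    return {
        "plugin": plugin,
        "pinstance": pinstance,  # empty string when there is no instance
        "type": type_,
        "tinstance": tinstance,
    }


fields = split_service("cpu-0/percent-idle")
print(fields)
```

For a service without a plugin instance, such as "load/load-relative" from the event types slide, "pinstance" comes back as an empty string.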
  6. EXAMPLE: LOG
     {
       "timestamp": 1461075777,
       "service": "sshd",
       "state": "ok",
       "host": "ccfawe.in2p3.fr",
       "ttl": 300,
       …
       "message": "sshd[334]: Accepted gssapi-with-mic for fwernli from 134.158.70.118 port 46930 ssh2",
       "syslog": { "pid": 334, "severity": "info" },
       "secevt": { "verdict": "ACCEPT" },
       "usracct": { "username": "fwernli", "authmethod": "gssapi-with-mic" },
       …
     }
  7. PATTERNDB
     Flat message:
       2014-05-30T14:34:53.087Z dc01.in2p3.fr \
         dCache-pool: park=LCG,\
         service=pool-atlas-dq2-li001a,\
         ndc=,message=remove entry for: 0000BD759...
     Matching rule:
       pdb::simple::ruleset { 'dCache':
         id       => 'dCache',
         patterns => ['dCache-pool'],
         rules    => [{
           id       => 'dCache-std-event',
           patterns => ['park=@ESTRING:park:,@ service=@ESTRING:svc:,@ ndc=@ESTRING:ndc:,@ message=@ANYSTRING:message@']
         }]
       }
     Structured message:
       {
         timestamp => 1401461273,
         host      => 'dc01.in2p3.fr',
         program   => "dCache-pool",
         service   => "pool-atlas-dq2-li001a",
         ndc       => nil,
         message   => "remove entry for: 0000BD759..",
       }
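The rule above combines two kinds of patterndb parsers: ESTRING captures everything up to a given ending string, and ANYSTRING captures the rest of the message. The Python sketch below imitates that behaviour for the dCache message body. It is an illustration only, not syslog-ng's implementation; `estring` and `parse_dcache` are hypothetical names, and it assumes the spaces between the parsers in the transcribed pattern are line-wrap artifacts.

```python
def estring(text: str, pos: int, end: str):
    """Capture text from `pos` up to (not including) the delimiter `end`.

    Returns (captured, position just after the delimiter), or None if the
    delimiter does not occur.
    """
    stop = text.find(end, pos)
    if stop < 0:
        return None
    return text[pos:stop], stop + len(end)


def parse_dcache(message: str):
    """Match 'park=<..>,service=<..>,ndc=<..>,message=<rest>' or return None."""
    fields = {}
    pos = 0
    # Each literal prefix is followed by an ESTRING-style capture ending at ','.
    for literal, name in (("park=", "park"), ("service=", "svc"), ("ndc=", "ndc")):
        if not message.startswith(literal, pos):
            return None
        captured = estring(message, pos + len(literal), ",")
        if captured is None:
            return None
        fields[name], pos = captured
    # ANYSTRING-style capture: everything after 'message='.
    if not message.startswith("message=", pos):
        return None
    fields["message"] = message[pos + len("message="):]
    return fields


flat = "park=LCG,service=pool-atlas-dq2-li001a,ndc=,message=remove entry for: 0000BD759..."
print(parse_dcache(flat))
```

In this sketch the empty "ndc" field comes back as an empty string, whereas the structured message on the slide shows it as nil.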
  8. SELF-RECOVERY
     Matching rule:
       id: 34a012fc-f964-4b85-a5cb-066ca2efa54b
       context_id: 'openafs-${appacct.cell}-${appacct.server_ip}'
       context_scope: program
       context_timeout: '60'
       patterns:
         - 'afs: Lost contact with file server @IPvANY:appacct.server_ip@ in cell @HOSTNAME:appacct.cell@)'
     Warning event:
       afs: Lost contact with file server 10.0.104.125 in cell my.cell
     Matching rule:
       id: b82b8d55-6060-4857-bc2e-d3ce5f4fd082
       context_id: 'openafs-${appacct.cell}-${appacct.server_ip}'
       context_scope: program
       patterns:
         - 'afs: file server @IPvANY:appacct.server_ip@ in cell @HOSTNAME:appacct.cell@ is back up'
     Self-heal event:
       afs: file server 10.0.104.125 in cell my.cell is back up
  9. TIMEOUT-BASED CORRELATION
     Matching rule:
       id: 34a012fc-f964-4b85-a5cb-066ca2efa54b
       context_id: 'openafs-${appacct.cell}-${appacct.server_ip}'
       context_scope: program
       context_timeout: '60'
       patterns:
         - 'afs: Lost contact with file server @IPvANY:appacct.server_ip@ in cell @HOSTNAME:appacct.cell@)'
     Correlation:
       OPENAFS_LOSTCONTACT_TIMEOUT:
         message:
           tags:
             - alert
             - afs
           values:
             HOST_FROM: '${afs.server_ip}'
             PROGRAM: 'openafs-lostcontact/${afs.cell}-${afs.server_ip}'
             state: warning
         rule: 34a012fc-f964-4b85-a5cb-066ca2efa54b
         trigger: timeout
     Warning event:
       afs: Lost contact with file server 10.0.104.125 in cell my.cell
  10. SCENARIO 1: SELF-HEALING
      Correlation:
        OPENAFS_LOSTCONTACT_TIMEOUT:
          message:
            tags:
              - alert
              - afs
            values:
              HOST_FROM: '${afs.server_ip}'
              PROGRAM: 'openafs-lostcontact/${afs.cell}-${afs.server_ip}'
              state: warning
          rule: 34a012fc-f964-4b85-a5cb-066ca2efa54b
          trigger: timeout
      Warning event:
        afs: Lost contact with file server 10.0.104.125 in cell my.cell
      + Self-heal event:
        afs: file server 10.0.104.125 in cell my.cell is back up
      ↓ Generated event:
        none (no timeout)
  11. SCENARIO 2: KABOOM
      Correlation:
        OPENAFS_LOSTCONTACT_TIMEOUT:
          message:
            tags:
              - alert
              - afs
            values:
              HOST_FROM: '${afs.server_ip}'
              PROGRAM: 'openafs-lostcontact/${afs.cell}-${afs.server_ip}'
              state: warning
          rule: 34a012fc-f964-4b85-a5cb-066ca2efa54b
          trigger: timeout
      Warning event:
        afs: Lost contact with file server 10.0.104.125 in cell my.cell
      + timeout
      ↓ Generated event:
        message: afs: Lost contact with file server 10.0.104.125 in cell my.cell
        HOST_FROM: 10.0.104.125
        PROGRAM: openafs-lostcontact/my.cell-10.0.104.125
        state: warning
        tags: [alert afs]
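The two scenarios can be replayed with a small Python sketch of timeout-based correlation: a "lost contact" message opens a context keyed by cell and server, a "back up" message closes it, and a context that survives past its timeout generates the alert. This is a simplified model with a simulated clock, not syslog-ng's correlation engine. The class `Correlator` and its method names are hypothetical, and it assumes that a repeated warning restarts the timer and that the self-heal message closes the context.

```python
CONTEXT_TIMEOUT = 60  # seconds, mirroring context_timeout: '60'


class Correlator:
    """Track open 'lost contact' contexts and emit an alert when one times out."""

    def __init__(self, timeout=CONTEXT_TIMEOUT):
        self.timeout = timeout
        # context_id -> (deadline, message, cell, server_ip)
        self.open_contexts = {}

    def lost_contact(self, now, message, cell, server_ip):
        """Warning event: open (or refresh) the context for this cell/server."""
        context_id = f"openafs-{cell}-{server_ip}"
        self.open_contexts[context_id] = (now + self.timeout, message, cell, server_ip)

    def back_up(self, cell, server_ip):
        """Self-heal event: close the context so that no timeout fires."""
        self.open_contexts.pop(f"openafs-{cell}-{server_ip}", None)

    def expire(self, now):
        """Return the events generated by contexts whose deadline has passed."""
        generated = []
        for context_id, (deadline, message, cell, ip) in list(self.open_contexts.items()):
            if now >= deadline:
                del self.open_contexts[context_id]
                generated.append({
                    "message": message,
                    "HOST_FROM": ip,
                    "PROGRAM": f"openafs-lostcontact/{cell}-{ip}",
                    "state": "warning",
                    "tags": ["alert", "afs"],
                })
        return generated


LOST = "afs: Lost contact with file server 10.0.104.125 in cell my.cell"

# Scenario 1: the server comes back before the timeout, so nothing is generated.
healing = Correlator()
healing.lost_contact(0, LOST, "my.cell", "10.0.104.125")
healing.back_up("my.cell", "10.0.104.125")
scenario1 = healing.expire(now=120)

# Scenario 2: no self-heal message arrives, so the timeout generates an alert.
kaboom = Correlator()
kaboom.lost_contact(0, LOST, "my.cell", "10.0.104.125")
scenario2 = kaboom.expire(now=60)

print(scenario1)
print(scenario2)
```

Keying the context on both cell and server address keeps unrelated outages from cancelling each other's timers.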
  12. METRIC AGGREGATION
      riemann group-by
      puppet facts, e.g. openstack "domains", productname, operating system
      custom facts, e.g. server role, profiles, etc.
      each event has custom facts and site-specific key-values attached
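Because each event carries its facts as extra key-values, aggregation reduces to grouping events on one of those keys. The Python sketch below averages a metric per group; it illustrates the idea only and is not Riemann's stream API. The function `aggregate_by`, the `role` key and all sample values are made up for the example.

```python
from collections import defaultdict


def aggregate_by(events, key):
    """Average the 'metric' field over events sharing the same value of `key`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for event in events:
        if key in event and "metric" in event:
            totals[event[key]][0] += event["metric"]
            totals[event[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Illustrative events: the 'role' fact and the metric values are invented.
events = [
    {"host": "ccnode42", "service": "cpu-0/percent-idle", "role": "worker", "metric": 10.0},
    {"host": "ccddeeff", "service": "cpu-0/percent-idle", "role": "worker", "metric": 30.0},
    {"host": "ccfawe", "service": "cpu-0/percent-idle", "role": "login", "metric": 50.0},
]

per_role = aggregate_by(events, "role")
print(per_role)
```

Events that lack the grouping key are skipped here, which is one possible policy among several.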
  13. METRIC LIVE ARCHIVING
      samplerr (riemann plugin)
      Consolidate metrics using rrdtool-like overlapping round-robin archives
      online time-based downsampling
      write data to Elasticsearch
      rotate time-based indices using ES aliases
      transparent access to the data
      save disk space and keep queries fast
      ~ ILM for spectrum scale
      ~ continuous queries for influxdb
  14. [Diagram: samplerr consolidates incoming events (timestamps t … t + 300) into four archives]
      step   consolidation         size    retention
      20s    avg                   100GB   delete after 1 day
      60s    min, avg, max         100GB   delete after 2 days
      5m     min, avg, max         100GB   delete after 1 week
      1h     min, avg, max, cust   100GB   delete after 10 years
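The consolidation step behind each archive can be sketched in Python: points are grouped into fixed-width time buckets and each bucket is reduced to min, avg and max. This is a batch illustration of the downsampling idea only. samplerr itself works online inside Riemann, and the function `consolidate` and the sample points are invented for the example.

```python
from collections import defaultdict


def consolidate(points, step):
    """Reduce (timestamp, value) points to min/avg/max per bucket of `step` seconds."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of its bucket.
        buckets[timestamp - timestamp % step].append(value)
    return {
        start: {
            "min": min(values),
            "avg": sum(values) / len(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Illustrative points at a 15 s period (values are made up).
points = [(0, 42.0), (15, 64.0), (30, 50.0), (45, 20.0), (60, 5.0)]

one_minute = consolidate(points, 60)
print(one_minute)
```

Running the same points through several step sizes gives the overlapping archives of the diagram: fine-grained data for the recent past and coarser summaries for longer retention.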
  15. samplerr
      (def cfunc [{:func samplerr/average :name avg}
                  {:func samplerr/minimum :name min}
                  {:func samplerr/maximum :name max}])
      (def archives [{:tf "YYYY.MM.dd" :step 20   :ttl (days 1)   :cfunc avg}
                     {:tf "GGGG.'w'WW" :step 60   :ttl (days 2)   :cfunc cfunc}
                     {:tf "YYYY.MM"    :step 300  :ttl (weeks 1)  :cfunc cfunc}
                     {:tf "YYYY"       :step 3600 :ttl (years 10) :cfunc custcf}])
      (samplerr/persist {:index-prefix index-prefix
                         :index-type "samplerr"
                         :conn elastic})
      (samplerr/periodically-rotate {:interval (t/seconds 60)
                                     :conn elastic
                                     :index-prefix index-prefix
                                     :alias-prefix alias-prefix
                                     :archives a…
  16. samplerr
      replace legacy system based on rrdtool
      used in production by fall at CCIN2P3
      used for qserv monitoring (LSST)