
Slawek Skowron - Monitoring @ Scale

Base Lab
March 18, 2015


Graphite monitoring at scale and how we did it at BaseCRM.
Presentation from DevOps KRK.

Transcript

  1. OUTLINE
     • What is Graphite?
     • Graphite architecture
     • Additional components
     • Current production setup • Writes • Reads
     • BaseCRM Graphite evolution
     • Data migrations and recovery
     • Multi region
     • Dashboards management
     • Future work
  2. WHAT IS GRAPHITE? A monitoring system focused on:
     • Simple storage of time-series metrics
     • Rendering graphs from time series on demand
     • An API with functions
     • Dashboards
     • A huge number of tools and 3rd-party products built on Graphite
  3. GRAPHITE ARCHITECTURE
     • Graphite-Web - Django web application with a JS frontend
       • Dashboards (DB to save dashboards)
       • API with functions, server-side graph rendering
     • Carbon - Twisted daemon
       • carbon-relay - hashes / routes metrics
       • carbon-aggregator - aggregates metrics
       • carbon-cache - “memory cache”, persists metrics to disk
     • Whisper - simple time-series DB
       • seconds-per-point resolution
     Data points are sent as: metric name + value + Unix epoch timestamp
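     A minimal sketch of that data-point format in practice, assuming a Carbon plaintext listener on its default port 2003 (the hostname is illustrative):

```python
import socket
import time

# Illustrative endpoint; Carbon's plaintext listener defaults to port 2003.
CARBON_HOST = "graphite-relay.example.internal"
CARBON_PORT = 2003

def send_metric(name, value, timestamp=None):
    """Send one data point using Carbon's plaintext protocol:
    '<metric name> <value> <unix timestamp>\n'."""
    timestamp = int(timestamp or time.time())
    line = f"{name} {value} {timestamp}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

send_metric("servers.web01.cpu.user", 12.5)
```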
  4. ADDITIONAL COMPONENTS
     • Diamond - https://github.com/BrightcoveOS/Diamond
       • Python daemon
       • over 120 collectors
       • simple collector development
       • used for OS and generic service monitoring
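     As a hedged illustration of how simple collector development can be, a minimal custom Diamond collector might look roughly like this (the class name and metric source are made up; publish() follows Diamond's collector API):

```python
# Minimal custom Diamond collector sketch (illustrative; names are made up).
# Diamond discovers Collector subclasses and calls collect() on each interval.
import diamond.collector


class QueueDepthCollector(diamond.collector.Collector):
    """Publishes a single gauge read from a local file (hypothetical source)."""

    def collect(self):
        try:
            with open('/var/run/myapp/queue_depth', 'r') as f:
                depth = int(f.read().strip())
        except (IOError, ValueError):
            return  # skip this cycle if the value is unavailable
        # Diamond prefixes the metric path (hostname, collector name) for us.
        self.publish('queue.depth', depth)
```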
  5. ADDITIONAL COMPONENTS
     • Statsd - https://github.com/etsy/statsd/
       • Node.js daemon
       • aggregates counters, sets, gauges and timers and sends them to Graphite
       • many client libraries
       • used for in-app metrics
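     For reference, the statsd wire format is one small UDP datagram per measurement; a minimal client sketch (the hostname is illustrative) looks like this:

```python
import socket

# Illustrative endpoint; statsd listens on UDP 8125 by default.
STATSD_ADDR = ("statsd.example.internal", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name, count=1):
    """Counter: statsd aggregates it and flushes it to Graphite each interval."""
    sock.sendto(f"{name}:{count}|c".encode("ascii"), STATSD_ADDR)

def timing(name, ms):
    """Timer: statsd computes mean/upper/percentiles before forwarding."""
    sock.sendto(f"{name}:{ms}|ms".encode("ascii"), STATSD_ADDR)

incr("myapp.signup.success")
timing("myapp.signup.duration", 42)
```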
  6. Which components failed to work at scale?
     • carbon-relay - switched to carbon-c-relay
     • carbon-cache - switched to PyPy
  7. REPLACEMENT
     • carbon-c-relay - https://github.com/grobian/carbon-c-relay
       • written in C
       • replacement for the Python carbon-relay
       • high performance
       • multi-cluster support (traffic replication)
       • traffic load-balancing
       • traffic hashing
       • aggregation and rewrites
  8. CURRENT PRODUCTION SETUP - WRITES
     • VMs report to an ELB
     • Round-robin to a top relay
     • Consistent hash with replication 2 (sketch below)
     • Any of the carbon-caches on each store instance
     • Write to the local whisper store volume
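     A minimal sketch of what "consistent hash with replication 2" means for routing: each metric name lands on a hash ring and is sent to the next two distinct store nodes. This illustrates the idea only; carbon-c-relay's real hash functions (carbon_ch, fnv1a_ch) differ in detail.

```python
import bisect
import hashlib

# Illustrative consistent-hash ring; store names are placeholders.
STORES = ["store1", "store2", "store3", "store4", "store5"]
VNODES = 100  # virtual nodes per store to even out the distribution

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

ring = sorted((_hash(f"{s}:{i}"), s) for s in STORES for i in range(VNODES))
points = [p for p, _ in ring]

def route(metric, replication=2):
    """Return the `replication` distinct stores responsible for a metric."""
    idx = bisect.bisect(points, _hash(metric)) % len(ring)
    chosen = []
    while len(chosen) < replication:
        node = ring[idx % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        idx += 1
    return chosen

print(route("servers.web01.cpu.user"))   # e.g. ['store3', 'store1']
```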
  9. CURRENT PRODUCTION SETUP - WRITES
     • 450+ instances as clients (diamond, statsd, other), reporting at 30-second intervals
     • carbon-c-relay as top relay
       • 3.5 - 4 mln metrics / min
       • 20% CPU usage on each
       • batch send (20k metrics)
       • queue of 10 mln metrics
     • carbon-c-relay as local relay
       • 7 - 8 mln metrics / min
       • batch send (20k metrics)
       • queue of 5 mln metrics
     • carbon-cache with PyPy (50% less CPU)
       • 7 - 8 mln metrics / min
       • each point update takes 0.13 - 0.15 ms
     • 250K - 350K write IOPS
     • 5 - 6 mln whisper DB files (2 copies)
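     A rough back-of-envelope check of these numbers (an assumption-laden sketch, not from the deck): replication 2 roughly doubles the top-relay rate at the store layer, and the per-point update time bounds how many points a single carbon-cache process can persist, which is why each store runs many caches.

```python
# Back-of-envelope sketch connecting the slide's numbers (assumptions noted).
top_relay_rate = 4_000_000 / 60          # ~66.7k metrics/s into the top relay
store_layer_rate = top_relay_rate * 2    # replication 2 -> ~133k metrics/s stored
print(f"store-layer rate ~ {store_layer_rate * 60 / 1e6:.1f} mln/min")  # ~8 mln/min

# At 0.13-0.15 ms per point update, one carbon-cache process can persist at
# most roughly 1 / 0.00014 s ~ 7k points/s (ignoring batching of points per
# whisper update), so many caches per store are needed to keep up.
per_point_s = 0.00014
max_per_cache = 1 / per_point_s
caches_needed = store_layer_rate / max_per_cache
print(f"~{max_per_cache:.0f} points/s per cache -> ~{caches_needed:.0f} caches cluster-wide, minimum")
```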
  10. CURRENT PRODUCTION SETUP - WRITES
     • Minimise the number of other processes and their CPU usage
     • CPU offload
       • carbon-c-relay uses little CPU
       • batch writes
       • separate web hosts for clients, apart from the store hosts
     • Focus on carbon-cache (write) + graphite-web (read)
     • Leverage OS memory for carbon-cache
     • RAID0 for more write performance - we have a replica
     • Focus on IOPS - low service time
     • Time must always be in sync
  11. CURRENT PRODUCTION SETUP - READS
     • Web front-end dashboards based on Graphite-Web
     • Graphite-web Django backend as the API for Grafana
     • Couchbase as a cache for graphite-web metrics
     • Each store exposes an API via graphite-web
     • Average response < 300 ms
     • Nginx on top, behind an ELB
     • Webs calculate functions, stores serve RAW metrics (CPU offload)
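     For context, both Grafana and the per-store "API via graphite-web" speak graphite-web's render endpoint; a hedged example query (hostname and metric path are illustrative) looks like this:

```python
import json
import urllib.parse
import urllib.request

# Illustrative host; any graphite-web instance exposes /render.
BASE = "http://graphite-web.example.internal"

params = urllib.parse.urlencode({
    "target": "summarize(servers.web*.cpu.user, '5min', 'avg')",
    "from": "-1h",
    "format": "json",
})
with urllib.request.urlopen(f"{BASE}/render?{params}", timeout=10) as resp:
    series = json.load(resp)

for s in series:
    print(s["target"], len(s["datapoints"]), "datapoints")
```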
  12. BASECRM GRAPHITE EVOLUTION
     • PoC with external EBS and C3 instances
       • graphite-web, carbon-relay, carbon-cache, whisper files on EBS
     • Production started on i2.xlarge
       • 5 store instances - 800GB SSD, 30GB RAM, 4 CPUs
       • 4 carbon-caches on each store
       • same software as in the PoC
       • problems with machine replacement and migration to a bigger cluster
       • dash-maker to manage complicated dashboards
     • Next, i2.4xlarge
       • 5 store instances - 2x800GB in RAID0, 60GB RAM, 8 CPUs
       • 8 carbon-caches on each store
       • carbon-c-relay as top and local relay
       • recovery tool to recover data from the old cluster
       • Grafana as a second dashboard interface
     • Current, i2.8xlarge - latest bump
       • 5 store instances - 4x800GB in RAID0, 120GB RAM, 16 CPUs
       • 16 carbon-caches on each store
  13. DASH-MAKER
     • Manage dashboards like never before
     • Template everything with Jinja2 (all Jinja2 features)
     • Dashboard config - one YAML file with Jinja2 support
     • Reusable graphs - JSONs like in graphite-web, with Jinja2 support
     • Global key=value pairs for Jinja2
     • Dynamic Jinja2 vars expanded from Graphite (last * in the metric name)
     • Many dashboard variants from one config, based on loop vars (see the sketch below)
     • Supports Graphite 0.9.12, 0.9.12 (evernote), 0.9.13, 0.10.0
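     dash-maker's own config schema isn't shown in the deck, so the following is only a hedged illustration of the underlying idea: Jinja2-templated graph targets expanded over loop variables (keys and metric paths are hypothetical):

```python
# Hypothetical illustration of Jinja2-templated graph targets expanded over
# loop variables; this is NOT dash-maker's real config schema.
from jinja2 import Template

target_template = Template(
    "aliasByNode(servers.{{ env }}.rabbitmq.*.queue.{{ queue }}.messages, 3, 5)"
)

# One "config" producing many dashboard variants via loop vars.
for env in ["production", "sandbox"]:
    for queue in ["events", "emails"]:
        print(target_template.render(env=env, queue=queue))
```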
  14. DASH-MAKER
     $ dash-maker -f rabbitmq-server.yml
     23:54:55 - dash-maker - main():Line:292 - INFO - Time [templating: 0.023229 build: 0.000177 save: 2.546407] Dashboard dm.us-east-1.production.rabbitmq.server saved with success
     23:54:58 - dash-maker - main():Line:292 - INFO - Time [templating: 0.017746 build: 0.000057 save: 2.549711] Dashboard dm.us-east-1.sandbox.rabbitmq.server saved with success
  15. FUTURE WORK AND PROBLEMS
     • Dash-maker with Grafana support
     • Out-of-band fast aggregation with anomaly detection
     • Graphite with hashing is not elastic - InfluxDB? Production-ready in March?
     • In the future, one dashboard - Grafana + InfluxDB?