Slide 1

Slide 1 text

MONITORING @ SCALE CLOUD SCALE Sławomir Skowron Devops @ BaseCRM Devops Kraków 2015

Slide 2

Slide 2 text

OUTLINE
• What is Graphite?
• Graphite architecture
• Additional components
• Current production setup
  • Writes
  • Reads
• BaseCRM Graphite evolution
• Data migrations and recovery
• Multi region
• Dashboards management
• Future work

Slide 3

Slide 3 text

WHAT IS GRAPHITE?
Monitoring system focused on:
• Simple storage of metric time series
• Rendering graphs from time series on demand
• API with functions
• Dashboards
• Huge number of tools and third-party products based on Graphite

Slide 4

Slide 4 text

WHY ALL THIS? LET’S LOOK AT AN EXAMPLE IN GRAFANA

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

GRAPHITE ARCHITECTURE

Slide 7

Slide 7 text

GRAPHITE ARCHITECTURE
• Graphite-Web - Django web application with a JS frontend
  • Dashboards (DB to save dashboards)
  • API with functions, server-side graph rendering
• Carbon - Twisted daemon
  • carbon-relay - hashes / routes metrics
  • carbon-aggregator - aggregates metrics
  • carbon-cache - “memory cache”, persists metrics to disk
• Whisper - simple time series DB
  • resolution defined as seconds per point
Data points are sent as: metric name + value + Unix epoch timestamp
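The data point format above (metric name, value, Unix epoch timestamp) matches Carbon's plaintext line protocol. A minimal Python sketch of formatting and sending one point follows. The host `graphite.example.com` and the metric name are placeholders, and port 2003 is assumed as the usual default for Carbon's plaintext listener; an actual setup may use a different one.

```python
import socket
import time


def format_point(name, value, timestamp=None):
    """Build one plaintext-protocol line: '<name> <value> <timestamp>' plus newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_point(line, host="graphite.example.com", port=2003):
    """Send a single line over TCP (host is a placeholder, not a real endpoint)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Only formatting is exercised here; send_point() needs a reachable Carbon listener.
line = format_point("servers.web01.load.shortterm", 0.42, 1420070400)
print(line, end="")
```

In practice clients batch many such lines per connection rather than opening a socket per point.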

Slide 8

Slide 8 text

GRAPHITE ARCHITECTURE

Slide 9

Slide 9 text

ADDITIONAL COMPONENTS

Slide 10

Slide 10 text

ADDITIONAL COMPONENTS
• Diamond - https://github.com/BrightcoveOS/Diamond
  • Python daemon
  • Over 120 collectors
  • Simple collector development
  • Used for OS and generic service monitoring

Slide 11

Slide 11 text

ADDITIONAL COMPONENTS
• Statsd - https://github.com/etsy/statsd/
  • Node.js daemon
  • Aggregates counts, sets, gauges and timers, then sends them to Graphite
  • Many client libraries
  • Used for in-app metrics
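To illustrate the four metric kinds Statsd aggregates, here is a small Python sketch of the StatsD wire format (`<name>:<value>|<type>`) sent over UDP. The metric names are made up for the example, and 127.0.0.1:8125 is assumed as the daemon address (8125 being the StatsD default port).

```python
import socket

# StatsD type suffixes for the four kinds listed on the slide.
TYPES = {"count": "c", "gauge": "g", "timer": "ms", "set": "s"}


def statsd_line(name, value, kind):
    """Format one StatsD datagram payload: '<name>:<value>|<type>'."""
    return f"{name}:{value}|{TYPES[kind]}"


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; the address is an assumption for this sketch."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


print(statsd_line("app.logins", 1, "count"))           # app.logins:1|c
print(statsd_line("app.request_time", 320, "timer"))   # app.request_time:320|ms
```

Because the transport is UDP, an application keeps working even when the Statsd daemon is down, at the cost of silently dropped metrics.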

Slide 12

Slide 12 text

CURRENT PRODUCTION SETUP WRITES

Slide 13

Slide 13 text

ALL PROVISIONED BY ANSIBLE

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Which components failed to work at scale?
carbon-relay - switched to carbon-c-relay
carbon-cache - switched to running on PyPy

Slide 16

Slide 16 text

REPLACEMENT
• Carbon-C-Relay - https://github.com/grobian/carbon-c-relay
  • Written in C
  • Replacement for the Python carbon-relay
  • High performance
  • Multi-cluster support (traffic replication)
  • Traffic load balancing
  • Traffic hashing
  • Aggregation and rewrites

Slide 17

Slide 17 text

IMPROVE Carbon-cache
Switch to PyPy (2.4, currently 2.5)
40-50% less CPU usage on carbon-cache

Slide 18

Slide 18 text

CURRENT PRODUCTION SETUP - WRITES

Slide 19

Slide 19 text

CURRENT PRODUCTION SETUP - WRITES
• VMs report to the ELB
• Round-robin to the Top Relay
• Consistent hash with replication factor 2
• Any of the carbon-cache instances on each store instance
• Write to the local Whisper store volume
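The "consistent hash with replication 2" step can be sketched as a hash ring that returns two distinct store hosts per metric name. This is a simplified illustration, not carbon-c-relay's actual hashing algorithm: the MD5-based ring, the virtual node count, and the `store-N` host names are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Simplified consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random positions.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]
        self._node_count = len(set(nodes))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_nodes(self, metric, replication=2):
        """Walk clockwise from the metric's position, collecting distinct nodes."""
        wanted = min(replication, self._node_count)
        start = bisect.bisect(self._keys, self._hash(metric))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == wanted:
                    break
        return chosen


stores = [f"store-{n}" for n in range(1, 6)]  # hypothetical host names
ring = HashRing(stores)
print(ring.get_nodes("servers.web01.cpu.user"))
```

The same metric name always maps to the same pair of hosts, which is what lets the read side find both copies without a lookup table.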

Slide 20

Slide 20 text

CURRENT PRODUCTION SETUP - WRITES
450+ instances as clients (Diamond, Statsd, other) report at 30-second intervals
carbon-c-relay as Top Relay
• 3.5-4 million metrics / min
• 20% CPU usage on each
• batch send (20k metrics)
• queue of 10 million metrics
carbon-c-relay as Local Relay
• 7-8 million metrics / min
• batch send (20k metrics)
• queue of 5 million metrics
carbon-cache with PyPy (50% less CPU)
• 7-8 million metrics / min
• each point update takes 0.13-0.15 ms
250K-350K write IOPS
5-6 million Whisper DB files (2 copies)

Slide 21

Slide 21 text

CURRENT PRODUCTION SETUP - WRITES
Graphite hashing - the cluster's maximum space and performance are limited by its weakest host
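A rough way to see the "weakest host" limit: if hashing spreads metrics evenly, every host receives the same share, so the smallest host fills up first. The sketch below assumes a perfectly even spread and uses invented capacities; it is an approximation, not a measurement from this cluster.

```python
def usable_capacity(host_capacities_gb, replication=2):
    """Unique data (GB) the cluster can hold when every host gets an equal share."""
    per_host_limit = min(host_capacities_gb)  # the weakest host caps each share
    raw = per_host_limit * len(host_capacities_gb)
    return raw / replication


# Hypothetical cluster: four 3200 GB hosts and one 1600 GB host.
print(usable_capacity([3200, 3200, 3200, 3200, 1600]))  # 4000.0
```

Under this assumption the four larger hosts can each use only half of their disks, which is why uniform store instances are the practical choice.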

Slide 22

Slide 22 text

CURRENT PRODUCTION SETUP - WRITES

Slide 23

Slide 23 text

CURRENT PRODUCTION SETUP - WRITES
• Minimise the number of other processes and their CPU usage
• CPU offload
  • carbon-c-relay uses little CPU
  • Batch writes
  • Separate web hosts for clients from store hosts
• Focus on carbon-cache (write) + graphite-web (read)
• Leverage OS memory for carbon-cache
• RAID0 for more write performance - we have a replica
• Focus on IOPS - low service time
• Time must always be in sync

Slide 24

Slide 24 text

CURRENT PRODUCTION SETUP READS

Slide 25

Slide 25 text

CURRENT PRODUCTION SETUP - READS

Slide 26

Slide 26 text

CURRENT PRODUCTION SETUP - READS
• Web front dashboard based on Graphite-Web
• Graphite-web Django backend as API for Grafana
• Couchbase as cache for graphite-web metrics
• Each store exposed as an API via graphite-web
• Average response <300 ms
• Nginx on top, behind the ELB
• Web hosts calculate functions, stores serve raw metrics (CPU offload)
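For the read path, a client such as Grafana queries graphite-web's render API, and functions in the `target` parameter are evaluated on the web hosts. A minimal Python sketch of building such a request URL follows; the base URL and the metric pattern are placeholders, not values from this setup.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, from_="-1h", until="now", fmt="json"):
    """Build a graphite-web /render URL for one or more targets."""
    params = [("target", t) for t in targets]
    params += [("from", from_), ("until", until), ("format", fmt)]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


url = render_url(
    "https://graphite.example.com",       # placeholder host
    ["sumSeries(servers.*.cpu.user)"],    # function applied on the web tier
)
print(url)
```

Requesting `format=json` returns raw datapoints instead of a rendered image, which is the mode dashboard frontends rely on.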

Slide 27

Slide 27 text

BASECRM GRAPHITE EVOLUTION

Slide 28

Slide 28 text

BASECRM GRAPHITE EVOLUTION
• PoC with external EBS and C3 instances
  • graphite-web, carbon-relay, carbon-cache, whisper files on EBS
• Production started on i2.xlarge
  • 5 store instances - 800GB SSD, 30GB RAM, 4 CPUs
  • 4 carbon-caches on each store
  • Same software as in PoC
  • Problems with replacing machines and migrating to a bigger cluster
  • dash-maker to manage complicated dashboards
• Next with i2.4xlarge
  • 5 store instances - 2x800GB in RAID0, 60GB RAM, 8 CPUs
  • 8 carbon-caches on each store
  • carbon-c-relay as Top and Local relay
  • Recovery tool to recover data from the old cluster
  • Grafana as second dashboard interface
• Current with i2.8xlarge - latest bump
  • 5 store instances - 4x800GB in RAID0, 120GB RAM, 16 CPUs
  • 16 carbon-caches on each store

Slide 29

Slide 29 text

DATA MIGRATION & RECOVERY

Slide 30

Slide 30 text

DATA MIGRATION & RECOVERY
Replicate traffic
Copy old Whisper files, based on the files the new cluster creates

Slide 31

Slide 31 text

DATA MIGRATION & RECOVERY
5 instances with 1 Gbit/s each - recovery tops out at 4.5 Gbit/s using HTTP
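The recovery figures can be checked with simple arithmetic: 5 links of 1 Gbit/s give 5 Gbit/s in aggregate, so 4.5 Gbit/s is 90% of the available bandwidth. The sketch below also estimates a copy time; the 4 TB data size is purely hypothetical and not taken from the talk.

```python
def utilization(observed_gbit, instances, link_gbit=1.0):
    """Fraction of aggregate link bandwidth actually used."""
    return observed_gbit / (instances * link_gbit)


def transfer_hours(data_tb, rate_gbit):
    """Hours needed to move data_tb (decimal terabytes) at rate_gbit Gbit/s."""
    bits = data_tb * 8e12
    return bits / (rate_gbit * 1e9) / 3600


print(f"{utilization(4.5, 5):.0%}")          # 90%
print(f"{transfer_hours(4.0, 4.5):.2f} h")   # 1.98 h for a hypothetical 4 TB
```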

Slide 32

Slide 32 text

DATA MIGRATION & RECOVERY Switch on ELB

Slide 33

Slide 33 text

DATA MIGRATION & RECOVERY Remove old cluster

Slide 34

Slide 34 text

MULTI REGION METRICS COLLECTING

Slide 35

Slide 35 text

MULTI REGION

Slide 36

Slide 36 text

DASH-MAKER INTERNAL DASHBOARDS MANAGEMENT

Slide 37

Slide 37 text

DASH-MAKER
• Manage dashboards like never before
• Template everything with Jinja2 (all Jinja2 features)
• Dashboard config - one YAML file with Jinja2 support
• Reusable graphs - JSON like in graphite-web, with Jinja2 support
• Global key=value pairs for Jinja2
• Dynamic Jinja2 vars expanded from Graphite (last * in metric name)
• Many dashboard variants from one config, based on loop vars
• Supports Graphite 0.9.12, 0.9.12 (evernote), 0.9.13, 0.10.0

Slide 38

Slide 38 text

DASH-MAKER
$ dash-maker -f rabbitmq-server.yml
23:54:55 - dash-maker - main():Line:292 - INFO - Time [templating: 0.023229 build: 0.000177 save: 2.546407] Dashboard dm.us-east-1.production.rabbitmq.server saved with success
23:54:58 - dash-maker - main():Line:292 - INFO - Time [templating: 0.017746 build: 0.000057 save: 2.549711] Dashboard dm.us-east-1.sandbox.rabbitmq.server saved with success

Slide 39

Slide 39 text

FUTURE WORK AND PROBLEMS

Slide 40

Slide 40 text

FUTURE WORK AND PROBLEMS
• Dash-maker with Grafana support
• Out-of-band fast aggregation with anomaly detection
• Graphite with hashing is not elastic - InfluxDB? Production-ready in March?
• In the future one dashboard - Grafana + InfluxDB?