DevOps Days 2015 Tel Aviv - Real Time Metrics and Distributed Monitoring

Jeff Pierce
October 06, 2015

Talk given at DevOps Days 2015 Tel Aviv on how we at Change.org put in a real-time metrics and monitoring system, got engineering-wide adoption to make it useful, and what we gained from it!

Transcript

  1. • Consulted for Citigroup on their High Frequency Trading Servers
     • Stints at:
       ◦ Apple
       ◦ Rackspace
     • Project Lead on Cassabon (https://github.com/jeffpierce/cassabon)
  2. About Change.org
     • Global platform where people start and win campaigns for change
     • 120 million users worldwide
     • Rapidly expanding user base and engineering team
     • Spiky, unpredictable traffic based on current events and viral petitions
  3. Why not outsource it?
     • We tried!
     • We weren’t happy with the price
     • We weren’t happy with the resolution of the stats we were capturing
  4-7. Why do we need our monitoring distributed and high-res metrics?
     • In a cloud world, centralized services are asking for failure
     • High resolution metrics are awesome!
     • Faster response time to outages
     • Able to autoscale on our own terms
  8-9. What else influenced our decision?
     • We were pretty understaffed!
     • Low implementation time was key
     • We needed to rely on the knowledge the team already had
     • We needed something with low maintenance and relatively easy scalability
  10-12. First Attempt: Try other providers!
     • Unable to find a provider that met both our price and resolution requirements
     • None that we investigated had reasonable pricing for temporary, autoscaling pool hosts
     • Decided to see what we could come up with in-house!
  13-16. Requirements For A DIY Stack
     • Leverage tools team members were familiar with
     • Relatively low maintenance
     • Flexible, resilient, distributed
     • Cost-competitive with outsourced services and with higher resolution
     • Uses many parts that we were already using in our infrastructure
  17-18. We settled on...
     • collectd with statsd plugin (http://collectd.org)
     • Cyanite (https://github.com/pyr/cyanite)
     • graphite-api (https://github.com/brutasse/graphite-api)
     • Grafana (http://grafana.org)
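
To make the first piece of that stack concrete, here is a minimal sketch of an application emitting a metric in the plain-text statsd line format (`name:value|type`) over UDP. The metric name, host, and port are assumptions for illustration (8125 is the conventional statsd port); they are not taken from the talk, and this is not the code Change.org used.

```python
import socket

# Metric types in the basic statsd line protocol: counter, timer, gauge.
SUPPORTED_TYPES = {"c", "ms", "g"}


def format_metric(name, value, metric_type):
    """Encode one metric as a statsd line: name:value|type."""
    if metric_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumed values."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric name, counted once per event.
    send_metric(format_metric("petitions.signed", 1, "c"))
```

Because the send is a single UDP datagram with no reply, instrumenting a code path this way adds very little overhead, which fits the low-maintenance requirement listed earlier.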
  19-23. JSON Dashboards Are A Big Deal!
     • Developers often know better which stats and graphs are important
     • Takes work off of the plate of DevOps
     • Can be checked in with app code
     • Can also be generated via change control with custom libraries
     • JSON is a familiar format to devs, increasing adoption rate
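
As an illustration of generating dashboards from code, the sketch below builds a dashboard definition as a plain dictionary and serializes it with stable key ordering so it diffs cleanly when checked in. The field names are a simplified, hypothetical layout, not the exact Grafana schema (which varies by version), and the metric names are made up.

```python
import json


def build_dashboard(title, metric_targets):
    """Build a simplified dashboard definition with one graph panel per metric.

    The structure is illustrative only; consult the schema of your Grafana
    version before importing generated JSON.
    """
    panels = [
        {
            "id": panel_id,
            "title": target,
            "type": "graph",
            "targets": [{"target": target}],
        }
        for panel_id, target in enumerate(metric_targets, start=1)
    ]
    return {"title": title, "panels": panels}


def to_json(dashboard):
    """Serialize with sorted keys and indentation for readable diffs."""
    return json.dumps(dashboard, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical metric names.
    print(to_json(build_dashboard("Signups", ["app.signups.count"])))
```

Keeping the generator this small means a developer can add a graph by adding one metric name to a list in the same repository as the application code.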
  24. [Architecture diagram. Labels: App Servers; “Central” Monitor; Ext. Stat Gatherer; TCP 2003; Cyanite (x4); Cassandra (x6); TCP 8080; Elastic Search; Grafana + Graphite-API; TCP 80; Dashboard Requests]
  25-27. Monitoring Implementation Goals
     • Write/run simple scripts to query Cyanite
     • Use PagerDuty for alerting/paging
     • Only use external monitoring to check application-wide or aggregate stats
     • Try to use external monitoring services as little as possible
     • Template as many checks as possible for easy management by change control
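
A simple check script of the kind described might look like the sketch below. It assumes the check reads through a Graphite-compatible `/render` endpoint (such as the one graphite-api exposes) that returns JSON series with `[value, timestamp]` datapoints; the base URL, metric name, and threshold are hypothetical. Handing a "critical" result to PagerDuty is left out, since the talk does not show how that was wired up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url, target, window="-5min"):
    """Build a Graphite-style render URL that asks for JSON output."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def fetch_series(url, timeout=5.0):
    """Fetch and decode the JSON list of series from the render endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def latest_value(datapoints):
    """Return the most recent non-null value, or None if there is none."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def check(series, threshold):
    """Classify the first series as 'ok', 'critical', or 'unknown'."""
    if not series:
        return "unknown"
    value = latest_value(series[0]["datapoints"])
    if value is None:
        return "unknown"
    return "critical" if value > threshold else "ok"


if __name__ == "__main__":
    # Hypothetical endpoint, metric, and threshold.
    url = render_url("http://graphite.example", "app.errors.rate")
    print(check(fetch_series(url), threshold=5.0))
```

Separating the threshold logic from the HTTP call keeps the check easy to template: the same `check` function can be reused with a different target and threshold per service.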
  28-31. Getting Developer Buy-In
     • Make it simple to add stats and monitors so that we get a high adoption rate
     • Make importable code in commonly used languages
     • Demo ease of use
     • Consult individual, influential developers on importance of getting stats everywhere
  32-33. Wins Thus Far
     • Faster code!
     • Faster and fewer rollbacks!
     • Finding problem instances is easier than ever!
     • Faster, easier troubleshooting!
  34-35. Increased Communication Between Feature Developers and DevOps!
     • App developers have an increased sense of ownership -- they choose what stats to capture and which dashboards matter
     • When something is wrong, it’s easier to accept it from stats than from the Ops person