DevOps Days 2015 Tel Aviv - Real Time Metrics and Distributed Monitoring

Jeff Pierce
October 06, 2015

Talk given at DevOps Days 2015 Tel Aviv on how we at Change.org built a real-time metrics and monitoring system, got the engineering-wide adoption that makes it useful, and what we gained from it.

Transcript

  1. DevOps Days 2015 Real Time Metrics and Distributed Monitoring

  2. Jeff Pierce Senior DevOps Engineer @ Change.org jpierce@change.org https://github.com/jeffpierce @Th3Technomancer

  3. • Consulted for Citigroup on their High Frequency Trading Servers • Stints at: ◦ Apple ◦ Rackspace • Project Lead on Cassabon (https://github.com/jeffpierce/cassabon)
  4. Background

  5. About Change.org • Global platform where people start and win campaigns for change • 120 million users worldwide • Rapidly expanding user base and engineering team • Spiky, unpredictable traffic based on current events and viral petitions
  6. Why not outsource it?

  7. Why not outsource it? • We tried!

  8. Why not outsource it? • We tried! • We weren’t happy with the price

  9. Why not outsource it? • We tried! • We weren’t happy with the price • We weren’t happy with the resolution of the stats we were capturing
  10. Why do we need distributed monitoring and high-res metrics?

  11. Why do we need distributed monitoring and high-res metrics? • In a cloud world, centralized services are asking for failure

  12. Why do we need distributed monitoring and high-res metrics? • In a cloud world, centralized services are asking for failure • High resolution metrics are awesome!

  13. Why do we need distributed monitoring and high-res metrics? • In a cloud world, centralized services are asking for failure • High resolution metrics are awesome! • Faster response time to outages

  14. Why do we need distributed monitoring and high-res metrics? • In a cloud world, centralized services are asking for failure • High resolution metrics are awesome! • Faster response time to outages • Able to autoscale on our own terms
  15. What else influenced our decision?

  16. What else influenced our decision? • We were pretty understaffed!

  17. What else influenced our decision? • We were pretty understaffed! • Low implementation time was key

  18. What else influenced our decision? • We were pretty understaffed! • Low implementation time was key • We needed to rely on the knowledge the team already had

  19. What else influenced our decision? • We were pretty understaffed! • Low implementation time was key • We needed to rely on the knowledge the team already had • We needed something with low maintenance and relatively easy scalability
  20. Searching For A Solution

  21. First Attempt: Try other providers!

  22. First Attempt: Try other providers! • Unable to find a provider that met both our price and resolution requirements

  23. First Attempt: Try other providers! • Unable to find a provider that met both our price and resolution requirements • None that we investigated had reasonable pricing for temporary, autoscaling pool hosts

  24. First Attempt: Try other providers! • Unable to find a provider that met both our price and resolution requirements • None that we investigated had reasonable pricing for temporary, autoscaling pool hosts • Decided to see what we could come up with in-house!
  25. Requirements For A DIY Stack

  26. Requirements For A DIY Stack • Leverage tools team members were familiar with

  27. Requirements For A DIY Stack • Leverage tools team members were familiar with • Relatively low maintenance

  28. Requirements For A DIY Stack • Leverage tools team members were familiar with • Relatively low maintenance • Flexible, resilient, distributed

  29. Requirements For A DIY Stack • Leverage tools team members were familiar with • Relatively low maintenance • Flexible, resilient, distributed • Cost-competitive with outsourced services and with higher resolution

  30. Requirements For A DIY Stack • Leverage tools team members were familiar with • Relatively low maintenance • Flexible, resilient, distributed • Cost-competitive with outsourced services and with higher resolution • Uses many parts that we were already using in our infrastructure
  31. We settled on...

  32. We settled on... • collectd with statsd plugin (http://collectd.org)

  34. We settled on... • collectd with statsd plugin (http://collectd.org) • Cyanite (https://github.com/pyr/cyanite)

  35. We settled on... • collectd with statsd plugin (http://collectd.org) • Cyanite (https://github.com/pyr/cyanite) • graphite-api (https://github.com/brutasse/graphite-api)
  36. We settled on...

  37. We settled on... • collectd with statsd plugin (http://collectd.org) • Cyanite (https://github.com/pyr/cyanite) • graphite-api (https://github.com/brutasse/graphite-api) • Grafana (http://grafana.org)
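
    To make the first stage of this stack concrete, here is a minimal sketch of how an application might hand a metric to a statsd-compatible listener such as collectd's statsd plugin. The metric names, the host, and the use of port 8125 (the conventional statsd port) are assumptions for illustration; the talk does not show the actual client code.

```python
import socket

# Metric type suffixes from the statsd line format: counter, timer, gauge.
VALID_TYPES = ("c", "ms", "g")


def statsd_packet(name, value, metric_type="c"):
    """Format one statsd datagram: <name>:<value>|<type>."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, not taken from the talk.
    send_metric(statsd_packet("petitions.signed", 1))
    send_metric(statsd_packet("web.render_time", 12.5, "ms"))
```

    Because the send is UDP, a missing or slow listener does not block the application, which is one reason this style of instrumentation is cheap to add.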
  38. JSON Dashboards Are A Big Deal!

  39. JSON Dashboards Are A Big Deal! • Developers often know better which stats and graphs are important

  40. JSON Dashboards Are A Big Deal! • Developers often know better which stats and graphs are important • Takes work off the DevOps team's plate

  41. JSON Dashboards Are A Big Deal! • Developers often know better which stats and graphs are important • Takes work off the DevOps team's plate • Can be checked in with app code

  42. JSON Dashboards Are A Big Deal! • Developers often know better which stats and graphs are important • Takes work off the DevOps team's plate • Can be checked in with app code • Can also be generated via change control with custom libraries

  43. JSON Dashboards Are A Big Deal! • Developers often know better which stats and graphs are important • Takes work off the DevOps team's plate • Can be checked in with app code • Can also be generated via change control with custom libraries • JSON is a familiar format to devs, increasing adoption rate
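
    As an illustration of generating dashboards from code, the sketch below builds a dashboard definition as a plain dictionary and serializes it to JSON that could be checked in next to the app. The field names are loosely modeled on Grafana's dashboard JSON and are assumptions; the real schema depends on the Grafana version, and the custom libraries mentioned in the talk are not shown there.

```python
import json


def build_dashboard(title, graphs):
    """Build a simplified dashboard dict.

    graphs is a list of (panel_title, graphite_target) tuples.
    The layout (rows -> panels -> targets) is an assumed, simplified schema.
    """
    panels = [
        {
            "id": panel_id,
            "type": "graph",
            "title": panel_title,
            "targets": [{"target": target}],
        }
        for panel_id, (panel_title, target) in enumerate(graphs, start=1)
    ]
    return {"title": title, "rows": [{"title": "Overview", "panels": panels}]}


if __name__ == "__main__":
    # Hypothetical metric paths, for illustration only.
    dashboard = build_dashboard(
        "Petition Service",
        [
            ("Signatures per second", "app.petitions.signed"),
            ("Render time (ms)", "app.web.render_time"),
        ],
    )
    print(json.dumps(dashboard, indent=2, sort_keys=True))
```

    Keeping the definition in a function like this means a new graph is a one-line change that goes through the same review as the code that emits the stat.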
  44. [Architecture diagram. Labels: App Servers; “Central” Monitor; Ext. Stat Gatherer; TCP 2003; Cyanite (x4); Cassandra (x6); TCP 8080; Elastic Search; Grafana + Graphite-API; TCP 80; Dashboard Requests]
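
    The diagram labels TCP 2003, which is the conventional port for Graphite's plaintext protocol. Assuming that is what carries metrics into Cyanite here, a minimal sketch of the wire format looks like this. The host name and metric path are hypothetical.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one Graphite plaintext line: '<path> <value> <timestamp>\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n".encode("ascii")


def send_lines(lines, host, port=2003):
    """Send pre-formatted lines over one TCP connection (port is assumed)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"".join(lines))


if __name__ == "__main__":
    batch = [
        carbon_line("app.web01.requests", 42),
        carbon_line("app.web01.errors", 0),
    ]
    # "metrics.example.internal" is a placeholder, not a real endpoint.
    # send_lines(batch, "metrics.example.internal")
    print(b"".join(batch).decode("ascii"), end="")
```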
  45. The Monitoring Side

  46. Monitoring Implementation Goals • Write/run simple scripts to query Cyanite

  47. Monitoring Implementation Goals • Write/run simple scripts to query Cyanite • Use PagerDuty for alerting/paging

  48. Monitoring Implementation Goals • Write/run simple scripts to query Cyanite • Use PagerDuty for alerting/paging • Only use external monitoring to check application-wide or aggregate stats

  49. Monitoring Implementation Goals • Write/run simple scripts to query Cyanite • Use PagerDuty for alerting/paging • Only use external monitoring to check application-wide or aggregate stats • Try to use external monitoring services as little as possible

  50. Monitoring Implementation Goals • Write/run simple scripts to query Cyanite • Use PagerDuty for alerting/paging • Only use external monitoring to check application-wide or aggregate stats • Try to use external monitoring services as little as possible • Template as many checks as possible for easy management by change control
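
    A templated check of the kind these goals describe could be sketched as follows. This assumes the metrics are read through a Graphite-style render endpoint returning JSON (a list of series, each with "target" and "datapoints" as [value, timestamp] pairs). The Check fields, the fetch function, and the page callback are all hypothetical; the actual scripts and the PagerDuty integration are not shown in the talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional
from urllib.parse import urlencode


@dataclass(frozen=True)
class Check:
    """One templated check; every field here is an illustrative assumption."""
    name: str              # label used in alert messages
    target: str            # Graphite target expression to query
    threshold: float       # alert when the latest value is above this
    window: str = "-5min"  # how far back to look

    def render_query(self) -> str:
        """Query string for a Graphite-style /render request."""
        return urlencode(
            {"target": self.target, "from": self.window, "format": "json"}
        )


def latest_value(datapoints) -> Optional[float]:
    """Return the most recent non-null value from [[value, ts], ...]."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def evaluate(check: Check, render_response: list) -> List[str]:
    """Return one alert message per series whose latest value is too high."""
    alerts = []
    for series in render_response:
        value = latest_value(series["datapoints"])
        if value is not None and value > check.threshold:
            alerts.append(
                f"{check.name}: {series['target']} is {value} "
                f"(threshold {check.threshold})"
            )
    return alerts


def run_check(check: Check,
              fetch: Callable[[str], list],
              page: Callable[[str], None]) -> None:
    """Fetch data for a check and hand any alerts to a paging callback."""
    for message in evaluate(check, fetch(check.render_query())):
        page(message)
```

    Because a check is just data, a list of Check values can live in version control and be reviewed like any other change, while fetch and page stay swappable for testing.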
  51. Getting Developer Buy-In

  52. Getting Developer Buy-In • Make it simple to add stats and monitors so that we get a high adoption rate

  53. Getting Developer Buy-In • Make it simple to add stats and monitors so that we get a high adoption rate • Make importable code in commonly used languages

  54. Getting Developer Buy-In • Make it simple to add stats and monitors so that we get a high adoption rate • Make importable code in commonly used languages • Demo ease of use

  55. Getting Developer Buy-In • Make it simple to add stats and monitors so that we get a high adoption rate • Make importable code in commonly used languages • Demo ease of use • Consult individual, influential developers on the importance of getting stats everywhere
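
    "Importable code" that makes adding a stat a one-line change might look something like the decorator below. This is only a sketch of the idea: the timed helper, the emit callback, and the metric name are hypothetical and do not come from the talk.

```python
import functools
import time


def timed(metric, emit):
    """Decorator that reports a function's wall-clock time in milliseconds.

    emit is any callable taking (metric_name, elapsed_ms); in practice it
    would forward the value to the metrics pipeline.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Report even when the wrapped function raises.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                emit(metric, elapsed_ms)
        return wrapper
    return decorator


if __name__ == "__main__":
    def print_emit(name, elapsed_ms):
        print(f"{name} took {elapsed_ms:.3f} ms")

    @timed("petitions.lookup", print_emit)
    def lookup(petition_id):
        return {"id": petition_id}

    lookup(123)
```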
  56. What We Got From All This Work

  57. Wins Thus Far • Faster code!

  59. Wins Thus Far • Faster code! • Faster and fewer rollbacks!
  61. Wins Thus Far • Faster code! • Faster and fewer rollbacks! • Finding problem instances is easier than ever!
  64. Wins Thus Far • Faster code! • Faster and fewer rollbacks! • Finding problem instances is easier than ever! • Faster, easier troubleshooting!
  66. And The Biggest Win...

  67. Increased Communication Between Feature Developers and DevOps!

  68. Increased Communication Between Feature Developers and DevOps! • App developers have an increased sense of ownership -- they choose what stats to capture and which dashboards matter

  69. Increased Communication Between Feature Developers and DevOps! • App developers have an increased sense of ownership -- they choose what stats to capture and which dashboards matter • When something is wrong, it’s easier to accept it from stats than from the Ops person
  70. Winners Ask Questions!