DevOps Days 2015 Tel Aviv - Real Time Metrics and Distributed Monitoring


Jeff Pierce
Senior DevOps Engineer @ Change.org
https://github.com/jeffpierce
@Th3Technomancer

• Consulted for Citigroup on their High Frequency Trading Servers
• Stints at:
  ◦ Apple
  ◦ Rackspace
• Project Lead on Cassabon (https://github.com/jeffpierce/cassabon)

Background

About Change.org
• Global platform where people start and win campaigns for change
• 120 million users worldwide
• Rapidly expanding user base and engineering team
• Spiky, unpredictable traffic based on current events and viral petitions

Why not outsource it?
• We tried!
• We weren’t happy with the price
• We weren’t happy with the resolution of the stats we were capturing

Why do we need distributed monitoring and high-resolution metrics?
• In a cloud world, centralized services are asking for failure
• High-resolution metrics are awesome!
• Faster response time to outages
• Able to autoscale on our own terms

What else influenced our decision?
• We were pretty understaffed!
• Low implementation time was key
• We needed to rely on the knowledge the team already had
• We needed something with low maintenance and relatively easy scalability

Searching For A Solution

First Attempt: Try other providers!
• Unable to find a provider that met both our price and resolution requirements
• None that we investigated had reasonable pricing for temporary, autoscaling pool hosts
• Decided to see what we could come up with in-house!

Requirements For A DIY Stack
• Leverage tools team members were familiar with
• Relatively low maintenance
• Flexible, resilient, distributed
• Cost-competitive with outsourced services and with higher resolution
• Uses many parts that we were already using in our infrastructure

We settled on...
• collectd with statsd plugin (http://collectd.org)
• Cyanite (https://github.com/pyr/cyanite)
• graphite-api (https://github.com/brutasse/graphite-api)
• Grafana (http://grafana.org)

JSON Dashboards Are A Big Deal!
• Developers often know better which stats and graphs are important
• Takes work off of the plate of DevOps
• Can be checked in with app code
• Can also be generated via change control with custom libraries
• JSON is a familiar format to devs, increasing adoption rate
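
As an illustration of generating dashboards from code, here is a minimal Python sketch. The slides do not show the custom libraries themselves, so the function names are hypothetical, and the dictionary layout is a simplified stand-in rather than the exact Grafana dashboard schema.

```python
import json


def build_dashboard(service, metrics):
    """Return a dashboard definition as a plain dict.

    Assumption: the field names below are a simplified, hypothetical
    layout, not the exact Grafana dashboard schema.
    """
    panels = [
        {
            "title": metric,
            "type": "graph",
            # One Graphite-style target per panel: "<service>.<metric>"
            "targets": [{"target": f"{service}.{metric}"}],
        }
        for metric in metrics
    ]
    return {"title": f"{service} overview", "rows": [{"panels": panels}]}


def render_dashboard(service, metrics):
    """Serialize the dashboard to JSON text suitable for checking in."""
    # Sorted keys and fixed indentation keep diffs small in code review.
    return json.dumps(build_dashboard(service, metrics), indent=2, sort_keys=True)
```

Calling render_dashboard("petitions", ["signatures.count", "response_time"]) produces JSON text that could live next to the app code and go through the same review as any other change.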

[Architecture diagram. Labels: App Servers, “Central” Monitor, Ext. Stat Gatherer; TCP 2003; Cyanite (x4); Cassandra (x6); TCP 8080; Elastic Search; Grafana + Graphite-API; TCP 80; Dashboard Requests]
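
The diagram labels TCP 2003 in front of Cyanite. Assuming that port is a Graphite-compatible plaintext listener (each line is "path value timestamp"), a sender could be sketched in Python as below. The host name and metric path are placeholders, not values from the talk.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one Graphite plaintext line: '<path> <value> <timestamp>\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metrics(lines, host, port=2003, timeout=5.0):
    """Send already-formatted lines over a single TCP connection.

    Assumption: `host` is a Cyanite (or any carbon-compatible) endpoint
    listening for the plaintext protocol on `port`.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

For example, send_metrics([format_metric("app.web01.requests", 42)], "cyanite.example.internal") would write one data point. In the stack described here collectd does this work, so a script like this is mainly useful for testing the pipeline by hand.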

The Monitoring Side

Monitoring Implementation Goals
• Write/run simple scripts to query Cyanite
• Use PagerDuty for alerting/paging
• Only use external monitoring to check application-wide or aggregate stats
• Try to use external monitoring services as little as possible
• Template as many checks as possible for easy management by change control
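
A "simple script" of this kind might look like the Python sketch below. It assumes the stats are read through graphite-api's Graphite-compatible /render endpoint with format=json, which returns a list of series with "target" and "datapoints" ([value, timestamp] pairs). The function names and threshold logic are illustrative, and the PagerDuty call is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url, target, window="-5min"):
    """Build a Graphite-style render URL that returns JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def fetch_series(url, timeout=10):
    """Fetch and decode the JSON series list from the render endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def latest_value(series):
    """Return the most recent non-null value in a series, or None."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def check_threshold(series_list, threshold):
    """Return (target, value) pairs whose latest value exceeds threshold."""
    breaches = []
    for series in series_list:
        value = latest_value(series)
        if value is not None and value > threshold:
            breaches.append((series["target"], value))
    return breaches
```

Because the target and threshold are plain parameters, many checks can be stamped out from one template and kept under change control; any non-empty result from check_threshold would be what gets handed to the paging step.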

Getting Developer Buy-In
• Make it simple to add stats and monitors so that we get a high adoption rate
• Make importable code in commonly used languages
• Demo ease of use
• Consult individual, influential developers on importance of getting stats everywhere
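
To give a sense of what "importable code" could mean, here is a small Python sketch of a stats helper that speaks the statsd line format ("name:value|c" for counters, "name:value|ms" for timers) over UDP. The class and its defaults are hypothetical; port 8125 is assumed to be where the collectd statsd plugin is listening on the local host.

```python
import socket


def format_counter(name, value=1):
    """statsd counter line, e.g. 'signups:1|c'."""
    return f"{name}:{value}|c"


def format_timer(name, millis):
    """statsd timer line in milliseconds, e.g. 'render:12|ms'."""
    return f"{name}:{millis}|ms"


class StatsClient:
    """Fire-and-forget statsd sender (hypothetical helper, not from the talk)."""

    def __init__(self, host="127.0.0.1", port=8125, prefix=""):
        self._addr = (host, port)
        self._prefix = f"{prefix}." if prefix else ""
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line):
        try:
            self._sock.sendto((self._prefix + line).encode("ascii"), self._addr)
        except OSError:
            # Emitting a metric should never break the application.
            pass

    def incr(self, name, value=1):
        self._send(format_counter(name, value))

    def timing(self, name, millis):
        self._send(format_timer(name, millis))
```

With a helper like this, adding a stat is one line in app code, for example StatsClient(prefix="petitions").incr("signatures"), which is the kind of low friction the bullets above are asking for.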

What We Got From All This Work

Wins Thus Far
• Faster code!
• Faster and fewer rollbacks!
• Finding problem instances is easier than ever!
• Faster, easier troubleshooting!

And The Biggest Win...

Increased Communication Between Feature Developers and DevOps!
• App developers have an increased sense of ownership -- they choose what stats to capture and which dashboards matter
• When something is wrong, it’s easier to accept it from stats than from the Ops person

Winners Ask Questions!
