DevOps Days 2015 Tel Aviv - Real Time Metrics and Distributed Monitoring

Slide 1

Slide 1 text

DevOps Days 2015 Real Time Metrics and Distributed Monitoring

Slide 2

Slide 2 text

Jeff Pierce
Senior DevOps Engineer @ Change.org
jpierce@change.org
https://github.com/jeffpierce
@Th3Technomancer

Slide 3

Slide 3 text

● Consulted for Citigroup on their High Frequency Trading Servers
● Stints at:
  ○ Apple
  ○ Rackspace
● Project Lead on Cassabon (https://github.com/jeffpierce/cassabon)

Slide 4

Slide 4 text

Background

Slide 5

Slide 5 text

About Change.org
● Global platform where people start and win campaigns for change
● 120 million users worldwide
● Rapidly expanding user base and engineering team
● Spiky, unpredictable traffic based on current events and viral petitions

Slide 9

Slide 9 text

Why not outsource it?
● We tried!
● We weren’t happy with the price
● We weren’t happy with the resolution of the stats we were capturing

Slide 13

Slide 13 text

Why do we need our monitoring distributed and high res metrics?
● In a cloud world, centralized services are asking for failure
● High resolution metrics are awesome!
● Faster response time to outages

Slide 19

Slide 19 text

What else influenced our decision?
● We were pretty understaffed!
● Low implementation time was key
● We needed to rely on the knowledge the team already had
● We needed something with low maintenance and relatively easy scalability

Slide 20

Slide 20 text

Searching For A Solution

Slide 23

Slide 23 text

First Attempt: Try other providers!
● Unable to find a provider that met both our price and resolution requirements
● None that we investigated had reasonable pricing for temporary, autoscaling pool hosts

Slide 29

Slide 29 text

Requirements For A DIY Stack
● Leverage tools team members were familiar with
● Relatively low maintenance
● Flexible, resilient, distributed
● Cost-competitive with outsourced services and with higher resolution

Slide 37

Slide 37 text

We settled on...
● collectd with statsd plugin (http://collectd.org)
● Cyanite (https://github.com/pyr/cyanite)
● graphite-api (https://github.com/brutasse/graphite-api)
● Grafana (http://grafana.org)
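As a rough illustration of how an application might feed the collectd statsd plugin listed above, here is a minimal Python sketch. It relies on the standard StatsD wire format (`name:value|type`) sent over UDP. The host, the port (8125 is the usual StatsD default), and the metric names are assumptions for illustration, not details taken from the talk.

```python
import socket

# Assumed defaults: a collectd statsd listener on localhost, UDP port 8125.
STATSD_HOST = "127.0.0.1"
STATSD_PORT = 8125

# StatsD metric types: counter, timer (milliseconds), gauge, set.
VALID_TYPES = ("c", "ms", "g", "s")


def format_statsd(name, value, kind):
    """Build one StatsD datagram of the form <name>:<value>|<type>."""
    if kind not in VALID_TYPES:
        raise ValueError("unknown StatsD metric type: %r" % (kind,))
    return "%s:%s|%s" % (name, value, kind)


def send_statsd(line, host=STATSD_HOST, port=STATSD_PORT):
    """Send one datagram over UDP without waiting for any reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical metric names.
    send_statsd(format_statsd("petitions.signed", 1, "c"))
    send_statsd(format_statsd("petitions.render_time", 42, "ms"))
```

Because the transport is UDP, the application does not wait on the collector, which keeps instrumentation cheap on the request path.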

Slide 42

Slide 42 text

JSON Dashboards Are A Big Deal!
● Developers often know better which stats and graphs are important
● Takes work off DevOps’ plate
● Can be checked in with app code
● Can also be generated via change control with custom libraries
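To make the "generated via custom libraries" idea concrete, the sketch below builds a dashboard definition in Python and serializes it to JSON that could be committed next to application code. The field names (`title`, `rows`, `panels`, `targets`) are loosely modeled on Grafana dashboards of that era and should be treated as assumptions; the real schema depends on the Grafana version in use. The metric paths are hypothetical.

```python
import json


def graph_panel(title, targets):
    """Describe one graph panel; each target is a Graphite metric expression."""
    return {
        "title": title,
        "type": "graph",
        "targets": [{"target": t} for t in targets],
    }


def dashboard(title, panels):
    """Wrap panels in a single-row dashboard (assumed, simplified layout)."""
    return {"title": title, "rows": [{"title": "Overview", "panels": panels}]}


def render(dash):
    """Serialize with sorted keys so version-control diffs stay stable."""
    return json.dumps(dash, indent=2, sort_keys=True)


if __name__ == "__main__":
    dash = dashboard(
        "Petition Service",
        [
            graph_panel("Signatures per second", ["petitions.signed"]),
            graph_panel("Render time (ms)", ["petitions.render_time"]),
        ],
    )
    print(render(dash))
```

Keeping the generator in code means a developer adds a graph by adding one line in a reviewed change rather than by clicking through a UI.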

Slide 44

Slide 44 text

[Architecture diagram. Labels: App Servers; “Central” Monitor; Ext. Stat Gatherer; TCP 2003; Cyanite (×4); Cassandra (×6); TCP 8080; Elastic Search; Grafana + Graphite-API; TCP 80; Dashboard Requests]
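TCP 2003 is the conventional port for the Graphite (carbon) plaintext protocol, so the sketch below assumes that is what the Cyanite nodes in the diagram accept. Each data point is one line: metric path, value, and Unix timestamp, separated by spaces. The host name and metric path are hypothetical.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one Graphite plaintext data point: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_points(lines, host, port=2003):
    """Open one TCP connection and write all lines to it."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall("".join(lines).encode("ascii"))


if __name__ == "__main__":
    # Only formats and prints; sending needs a reachable listener.
    print(carbon_line("app.web01.requests", 12, 1430000000), end="")
```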

Slide 45

Slide 45 text

The Monitoring Side

Slide 49

Slide 49 text

Monitoring Implementation Goals
● Write/run simple scripts to query Cyanite
● Use PagerDuty for alerting/paging
● Only use external monitoring to check application-wide or aggregate stats
● Try to use external monitoring services as little as possible
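A "simple script" of the kind described above might look like the following sketch. It assumes the data is queried through a Graphite-compatible `/render` endpoint returning JSON (as graphite-api provides); the talk does not say which endpoint the real scripts used. The PagerDuty payload fields follow the generic events API as it existed around that time and should be checked against current PagerDuty documentation. All names and thresholds are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_series(base_url, target, window="-5min"):
    """Query a Graphite-compatible render endpoint and return parsed JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    with urlopen("%s/render?%s" % (base_url, query), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def latest_value(series):
    """Return the most recent non-null value; datapoints are [value, timestamp]."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def check(series_list, threshold):
    """Return (target, value) for every series whose latest value exceeds threshold."""
    breaches = []
    for series in series_list:
        value = latest_value(series)
        if value is not None and value > threshold:
            breaches.append((series["target"], value))
    return breaches


def pagerduty_event(service_key, target, value, threshold):
    """Build a trigger payload (assumed field names; verify before use)."""
    return {
        "service_key": service_key,
        "event_type": "trigger",
        "description": "%s is %s (threshold %s)" % (target, value, threshold),
    }
```

Skipping trailing nulls matters in practice: the newest datapoint in a render response is often empty because the current interval has not been written yet.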

Slide 55

Slide 55 text

Getting Developer Buy-In
● Make it simple to add stats and monitors so that we get a high adoption rate
● Make importable code in commonly used languages
● Demo ease of use
● Consult individual, influential developers on importance of getting stats everywhere
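One way to make adding stats nearly free for developers is a small importable helper. The sketch below is a hypothetical example, not code from the talk: a context manager that times a block and hands a StatsD-style timer line to whatever `emit` callable the application supplies (for instance, a UDP sender).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name, emit, clock=time.time):
    """Time the enclosed block and emit '<name>:<elapsed_ms>|ms'.

    `emit` is any callable taking one string; `clock` is injectable for tests.
    """
    start = clock()
    try:
        yield
    finally:
        # Emit even if the block raises, so slow failures are still measured.
        elapsed_ms = int(round((clock() - start) * 1000))
        emit("%s:%d|ms" % (name, elapsed_ms))


if __name__ == "__main__":
    # Hypothetical usage: print the metric line instead of sending it.
    with timed("signup.create", print):
        time.sleep(0.05)
```

With a helper like this, instrumenting a code path is a one-line change, which is the kind of low friction the bullets above are asking for.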

Slide 56

Slide 56 text

What We Got From All This Work

Slide 64

Slide 64 text

Wins Thus Far
● Faster code!
● Faster and fewer rollbacks!
● Finding problem instances is easier than ever!
● Faster, easier troubleshooting!

Slide 66

Slide 66 text

And The Biggest Win...

Slide 69

Slide 69 text

Increased Communication Between Feature Developers and DevOps!
● App developers have an increased sense of ownership: they choose what stats to capture and which dashboards matter
● When something is wrong, it’s easier to accept it from stats than from the Ops person

Slide 70

Slide 70 text

Winners Ask Questions!