Slide 1

Slide 1 text

The State of Open Source Monitoring: The Good, The Bad, The Fucking Terrible, and a Glimpse Into Our Future. Jason Dixon, October 5, 2012. John Willis says he's never seen the asshole element at DevOpsDays. Coincidentally, this is my first visit to DevOpsDays. Who here thinks monitoring sucks? Ok, this talk is not for you. I want to talk to people who love monitoring and want to make it better. By the end of this talk today I hope that we've started a new discussion around monitoring, and that the folks who come after me will need to update their slides.

Slide 2

Slide 2 text

Hi, I'm Jason github.com/obfuscurity @obfuscurity Hi, my name is Jason Dixon. A lot of people know me by my Twitter handle "obfuscurity". That's an octocat that sort of looks like my twitter avatar.

Slide 3

Slide 3 text

Hi, I'm Jason github.com/obfuscurity @obfuscurity

Slide 4

Slide 4 text

First, a little bit about me. I live in Westminster, MD in the USA. It's out in the country; you can drive off our street, take a wrong turn, and end up in a cornfield. But that's ok because we have a Chipotle in town. Does everyone know what Chipotle is?

Slide 5

Slide 5 text

Previously I was a Site Reliability Engineer at OmniTI.

Slide 6

Slide 6 text

And I also helped launch Circonus as their Product Manager.

Slide 7

Slide 7 text

At Heroku I was an Ops Engineer. I did a lot of work with their metrics collection and Graphite setup.

Slide 8

Slide 8 text

And now I work at GitHub. This is an actual screenshot of my offer letter.

Slide 9

Slide 9 text

Ops-erational Visibility I used to be a hard-core UNIX SysAdmin like a lot of you, but I found that I'm really passionate about monitoring and visualization. So that's what I specialize in these days. I've coined the term "VizOps", which is basically what I do to try and improve the state of visualization in Web Operations.

Slide 10

Slide 10 text

So let's start by asking ourselves, "What is Monitoring?"

Slide 11

Slide 11 text

Is this monitoring? If you're a vendor the answer is a resounding YES.

Slide 12

Slide 12 text

[diagram: a Nagios server running NRPE checks against httpd, database, smtp and firewall hosts; alerts route to the primary on-call, then the backup on-call, then the "shit hits the fan" on-call response] But for most of us, this is what a traditional monitoring system looks like: a Nagios instance that runs host and service checks, sends pager or email notifications when something is down, and serves as the primary dashboard for interacting with alerts and recoveries.

Slide 13

Slide 13 text

[same diagram as the previous slide] But what we don't see here is that the responder doesn't just interact with Nagios. The first place you'll probably go (after acking the alert) is to view trends for the affected resource. Maybe you have a Cacti installation that graphs SNMP data, or if you're really lucky, collectd on all your servers and a Graphite instance to store and graph all of your metrics.

Slide 14

Slide 14 text

I hear the term "monitoring" used a lot to generalize about the different features and components that make up our monitoring and trending systems. I think when most people use it though they're generally referring to fault detection and notifications, but I've also heard it used to describe metrics collection, trending, capacity planning, and even business analytics.

Slide 15

Slide 15 text

Nomenclature matters.

Slide 16

Slide 16 text

Nomenclature matters. fault detection

Slide 17

Slide 17 text

Nomenclature matters. fault detection notifications

Slide 18

Slide 18 text

Nomenclature matters. fault detection notifications metrics collection

Slide 19

Slide 19 text

Nomenclature matters. fault detection notifications metrics collection trending

Slide 20

Slide 20 text

Nomenclature matters. fault detection notifications metrics collection trending capacity planning

Slide 21

Slide 21 text

Nomenclature matters. fault detection notifications metrics collection trending capacity planning analytics

Slide 22

Slide 22 text

Before we dig in too far I think it's important that we identify what we mean by monitoring. What are the different functions it provides? Are these already available as distinct services, or can they be in the future? Once we identify all of these we can begin to see what a modern monitoring architecture might look like.

Slide 23

Slide 23 text

Metrics Collection This one is straightforward, but we rarely think about it when planning our architecture, other than as a by-product of fault detection, or separately as what powers our trending graphs. Especially with tools like Nagios that throw out the data after informing us about what's going on RIGHT NOW.

Slide 24

Slide 24 text

minimal context Let's start with a basic Nagios check like check-host-alive that tells us whether a host is up or down. But what is it really telling us? That we received a ping response to our ping request. But what does THAT really mean? That there's a server out there somewhere, with at least the capacity to listen for ICMP requests and respond in kind.
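To make that concrete, here's roughly what such a check boils down to. This is a minimal sketch, not the actual check-host-alive command (which wraps check_ping with packet-loss and round-trip thresholds); the hostname is a placeholder, and the exit codes follow the standard Nagios plugin convention (0 OK, 2 CRITICAL).

```python
#!/usr/bin/env python
# Minimal "is the host alive" check in the Nagios plugin style:
# ping once, print a status line, exit 0 (OK) or 2 (CRITICAL).
import subprocess
import sys

HOST = "db1.example.com"  # placeholder target

result = subprocess.run(
    ["ping", "-c", "1", "-W", "2", HOST],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
if result.returncode == 0:
    print(f"PING OK - {HOST} is responding")
    sys.exit(0)
print(f"PING CRITICAL - no response from {HOST}")
sys.exit(2)
```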

Slide 25

Slide 25 text

host metrics But what does THAT tell us? Still not much. We have no idea whether that network host is operating as intended. We've all seen servers that respond on the network even after a kernel panic. So what do we do? We start checking operating system-level metrics that help us ascertain the true state of the system. Metrics like CPU jiffies and load level that tell us how the system is behaving, but they don't really explain how it got there.

Slide 26

Slide 26 text

So we also check things like disk usage, memory and swap usage, the number of users logged in, network activity, etc. We start to get a better image of what was going on leading up to a system event.

Slide 27

Slide 27 text

service metrics But we don't run servers for the sake of running the operating system. So we need to gather more information on the services on this host. If it's a database server maybe we'll check the number of connections, the number of long-running queries, replication state, etc. And yet, even after all of this, we still might not have all of the data we'll need to troubleshoot an event or plan for capacity upgrades.

Slide 28

Slide 28 text

connection metrics What if the system is in perfect working order, but we suffer a transient event on the network, and our database clients start timing out? So we start checking the database connection from a remote host. We monitor the time to connect, time to first byte, total latency, etc.
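As a sketch of what those connection metrics look like in practice (the endpoint is hypothetical, and a real database check would speak the database's wire protocol rather than HTTP):

```python
# Measure time to connect, time to first byte, and total latency
# against a remote service.
import socket
import time

HOST, PORT = "www.example.com", 80  # placeholder endpoint

start = time.monotonic()
sock = socket.create_connection((HOST, PORT), timeout=5)
connect_time = time.monotonic() - start

sock.sendall(b"HEAD / HTTP/1.0\r\nHost: " + HOST.encode() + b"\r\n\r\n")
sock.settimeout(5)
sock.recv(1)                              # first byte of the response
ttfb = time.monotonic() - start
while sock.recv(4096):                    # drain the rest
    pass
total = time.monotonic() - start
sock.close()

print(f"connect={connect_time:.3f}s ttfb={ttfb:.3f}s total={total:.3f}s")
```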

Slide 29

Slide 29 text

metrics are king shit The point I'm trying to make is that we need to place more emphasis on what metrics collection is, and why it's arguably THE MOST IMPORTANT part of the monitoring workflow. Historically this has been something we've only considered as part of Fault Detection, i.e. host and service checks. We have to start treating our metrics as first-class data, with plans for long-term data storage and recovery.

Slide 30

Slide 30 text

metrics drive change This is the data that we'll use going forward to plan for future growth and strategic architectural changes. We should plan to collect as much data as possible, as granular as possible, and store it for as long as possible.

Slide 31

Slide 31 text

If there's one thing I'd ask you to take away from this talk, it's to LOVE YOUR METRICS.

Slide 32

Slide 32 text

Fault Detection This is what we're talking about when we describe traditional "monitoring" responsibilities. It's also probably the hardest to get right.

Slide 33

Slide 33 text

We're looking for when a host, a service, or an application "goes bad". When it stops doing the job it was tasked to do. Or worse, when it begins corrupting our data.

Slide 34

Slide 34 text

state change This isn't an easy job. We have to be able to, ideally in real-time, track changes within the metrics we've collected about that entity, and determine when that change goes beyond an acceptable threshold.

Slide 35

Slide 35 text

lacking dynamism Historically our tests have relied on simple boolean checks (i.e. does it respond?), or at best on whether the data we've collected falls within a predetermined range (e.g. latency). Because these are static configurations, they're highly inflexible, and frequently incorrect as your systems evolve and scale. As I'm sure many of you can attest to at 3 in the morning.

Slide 36

Slide 36 text

finite visibility Unfortunately, because our tools are designed to work based on what they know about the target right now, we're (for the most part) unable to take advantage of long-term historical trending or forecasting algorithms.

Slide 37

Slide 37 text

Regardless, we've largely made do with what we have at our disposal. Traditional open-source Fault Detection systems will identify when a target's state has changed and fire off some sort of notification event to let us know that things are broken.

Slide 38

Slide 38 text

Notification Notifications are pretty straightforward and generally hard to fuck up. In principle. But there can be a lot of complexity in what is otherwise a very simple premise: delivering an alert from the monitoring system to the responsible party. Let's step back for a moment and review what a notification might look like.

Slide 39

Slide 39 text

gather metadata First, your Fault Detection system determines that something has failed. It knows which host and service (or application) triggered the alert. It knows what the previous state was. It knows the current state. It knows when this happened and hopefully has some additional metadata, maybe a link to the online documentation, that it wants to include for your convenience. Ok, now what to do with this information?

Slide 40

Slide 40 text

message routing If we're a small shop, we whip up an email or pager message and fire it off to someone with the skills to fix the problem. But as we grow, how do we account for pager rotations or escalation policies? Fortunately these days we have services like PagerDuty to deal with this pain.

Slide 41

Slide 41 text

the manual way sucks Before they came along there really was no easy way of handling this. If we were lazy we'd just change the email alias to the new on-call person at every shift change. If we were REALLY lazy we'd have a script that hits a database or Google Calendar to see who's on-call and then update the email alias programmatically.

Slide 42

Slide 42 text

PagerDuty Nowadays we're pretty lucky that PagerDuty solved this for us. We can fire off a POST to PagerDuty's API and trust that the message is going to get routed or escalated appropriately.
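For example, a bare-bones trigger against PagerDuty's generic events endpoint might look something like the sketch below; the service key and incident details are placeholders.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"

payload = {
    "service_key": "YOUR-SERVICE-KEY",       # from your PagerDuty service settings
    "event_type": "trigger",
    "incident_key": "db1/replication-lag",   # dedupes repeated alerts
    "description": "Replication lag on db1 exceeded 300s",
    "details": {"host": "db1", "lag_seconds": 412},
}

req = urllib.request.Request(
    EVENTS_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```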

Slide 43

Slide 43 text

good enough Now I'd be lying if I said I wasn't a little nervous about placing so much of our collective trust in a single company. I love the team at PagerDuty but I think it would be nice to see some small, sharp, open-source tools developed around notifications and scheduling. Unfortunately they've made this mostly a solved problem, and I don't see anyone wanting to deal with this pain again until PagerDuty does something stupid that makes us want to leave.

Slide 44

Slide 44 text

Trending Clearly we need Fault Detection and Notifications to enable first response to any outages or unknown states. But long-term trending of our data is what empowers us to make intelligent decisions about how to resolve ongoing problems or plan for future change. It provides empirical evidence so that we don't have to operate purely on short-term spikes, conjecture or "educated guesses".

Slide 45

Slide 45 text

network & server visualization Trending has historically been used by SysAdmins for visualizing network traffic and server load. The Multi Router Traffic Grapher (MRTG) was the first popular graphing toolkit. As its name suggests, it was commonly used to graph SNMP metrics from routers and switches.

Slide 46

Slide 46 text

time-series archives The time-series database within MRTG was eventually rewritten externally as RRDtool, a faster, more portable version of the TSDB that stores data in fixed-size, round-robin files that don't grow over time.

Slide 47

Slide 47 text

Although its archive format was relatively awkward to work with, it made it easy enough that the average Systems Administrator could start trending all of their hosts and services with a minimum of pain.

Slide 48

Slide 48 text

trending niche as a service These days we've got a huge variety of trending toolkits to work with. Most commercial vendors tend to focus on specific use cases, e.g. Front-End Performance, Transactional Profiling and Business Intelligence Analytics.

Slide 49

Slide 49 text

graphing toolkits Conversely, open-source trending projects tend to focus on providing a scalable storage backend and graphing toolkits or an API. More and more we're seeing an entire ecosystem of dashboard projects built up around these as well.

Slide 50

Slide 50 text

The most popular ones support a variety of transforms and filters, allowing us to do things like aggregate or average our metrics, calculate the 99th percentile or standard deviation, adjust scales, or simply help us forecast for growth. As the algorithms become more sophisticated and our datasets mature, it makes even more sense to start looking to our trending systems as the "source of truth" for Fault Detection.

Slide 51

Slide 51 text

Legacy of Tools Let's take a quick stroll through some of the more important open-source monitoring tools through modern history. I'm going to make a case as to why I think each of these tools is good, bad or really awful. You might disagree and that's ok; this is just my list with my own personal criteria for judging them.

Slide 52

Slide 52 text

The Good Ok, the good. This is a really short list. :)

Slide 53

Slide 53 text

The eternal RRDtool. Although it began life as part of MRTG, the rebirth of RRDtool as a separate project meant that anyone could use this format to store and retrieve metrics in their own personal or open-source projects. It has a narrow focus: metrics storage, retrieval and visualization. It's still highly relevant, 13 years after its initial release, and used in a wide variety of popular monitoring and visualization projects.

Slide 54

Slide 54 text

Collectd is a fantastically flexible metrics collection daemon. If there's a service out there, chances are someone has written a plugin to monitor it. It has a wide variety of output plugins and, unlike collectors like Munin, it performs quite well.

Slide 55

Slide 55 text

Graphite is another metrics storage, retrieval and visualization project. What makes Graphite so great to work with is that it's so easy to send metrics to and create graphs with. What makes it better than all other graph rendering projects, in my opinion, is its rendering API. It comes with a huge variety of aggregation and filtering functions that can be chained together for complex transformations.
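A few examples of what that chaining looks like in render targets (the metric paths are hypothetical; in a real request the target would be URL-encoded):

```python
targets = [
    # average response time across all web nodes
    "averageSeries(collectd.web*.response_time)",
    # 99th percentile of the same series
    "nPercentile(collectd.web*.response_time, 99)",
    # total requests across the fleet, smoothed over the last 10 datapoints
    "movingAverage(sumSeries(stats.web*.requests), 10)",
    # convert bytes to bits for a network graph
    "scale(collectd.lb1.if_octets.rx, 8)",
]

print("http://graphite.example.com/render?from=-7d&target=" + targets[1])
```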

Slide 56

Slide 56 text

The Adequate I added a category here for projects that I really want to like, but for one reason or another, drive me nuts.

Slide 57

Slide 57 text

First off, we have Reconnoiter. This is a really nice metrics collection and trending project that we used at OmniTI as the basis for Circonus. Unfortunately, while they continue to backport enhancements to the open-source project, the web interface has never seen improvements and in fact is going to be deprecated. To make matters worse there is no API; all interaction has to be done via a Cisco IOS-like console. Oh, and it's a hassle to deploy.

Slide 58

Slide 58 text

Next up we have Munin. This is another metrics collection daemon that's really easy to deploy. It even comes with a web UI and automatic graphing. Unfortunately it has a reputation for really poor performance.

Slide 59

Slide 59 text

If you were to read the OpenTSDB docs you'd be convinced pretty quickly that its ability to scale horizontally would make it the end-all be-all solution for metrics collection and graphing. And while it does scale, its API is horrid and its rendering functions pale in comparison to Graphite's.

Slide 60

Slide 60 text

Ganglia is actually a pretty complete monitoring suite. Unfortunately, it's a monolithic application that tries to do everything adequately and ends up failing to do any one thing particularly well. Although to be fair, I think if someone spent a couple months just cleaning up the UI that would go a long way with me.

Slide 61

Slide 61 text

The Bad The applications in my bad list have a lot in common. They're all developed by commercial companies that offer an open-source or free version. They all emphasize automatic discovery over integration with configuration management. They're all targeted at Enterprise customers. And in my opinion, they always end up looking like a prettier version of Nagios.

Slide 62

Slide 62 text

Zenoss

Slide 63

Slide 63 text

Zabbix

Slide 64

Slide 64 text

Groundwork OpenSource

Slide 65

Slide 65 text

The Fucking Terrible Now I call these "fucking terrible" with a bit of tongue-in-cheek. These projects are clearly successful and in some cases, ubiquitous for the market they were designed for. Unfortunately, they are mostly good enough that nobody has been motivated to really improve upon them, so we've been stuck with these for a really long time.

Slide 66

Slide 66 text

It seems like everyone loves to hate on Nagios, but nobody can really explain why. Please, allow me. The user interface is horrible. Acknowledgements are indefinite, meaning that I can ack an alert and completely forget about it, and the system will never remind me. It takes WAY TOO MANY clicks to get anything done. It has no pagination, so a long page will completely choke your browser. I could go on but I only have 30 minutes.

Slide 67

Slide 67 text

If you're an SNMP shop, you could do much worse than Cacti. It's really good for adding network metrics, but it favors configuration over convention and doesn't attempt to hide many of the rarely used options. Graphs are reasonably easy to create, but their data, host and graph templates are really difficult to master.

Slide 68

Slide 68 text

GOD is a process monitor for Ruby apps. It's hugely popular among Ruby shops for making sure a process is running. If it dies or gets wedged, GOD can just launch another. Why is this a bad thing? Because it de-motivates developers to fix their stuff. It's so much easier to just let your processes respawn than debug the actual source of the problem. A funny side note... it wasn't until I went to capture this screenshot that I realized GOD was created by Tom Preston-Werner, one of the co-founders of GitHub. Yeah, he signs my paycheck.

Slide 69

Slide 69 text

Characteristics of Future Tools There's a ton of open-source and commercial monitoring tools available, so why does monitoring suck so bad? What makes us love a particular tool that only does one thing, but despise another that would seem to have everything we want?

Slide 70

Slide 70 text

the UNIX way It's actually not that hard to understand. We're a finicky bunch. We prefer our systems built from small, sharp tools. We don't want the hassle of commercial software. Put simply, we prefer the UNIX way.

Slide 71

Slide 71 text

interoperable In terms of commercial software, there's a reason why Pingdom and PagerDuty are so popular among technically competent businesses. Cost is only a small part of the picture. We understand implicitly that combining small, sharp tools into a cohesive system is a helluva lot easier than breaking apart an Enterprise monitoring suite and forcing it to meet our specific needs.

Slide 72

Slide 72 text

But why do so many companies choose the "monolithic" Enterprise offering? A lot of times it's for convenience: the illusion that one product will meet all of our needs. Other times it's because we don't make the choice at all. The decision-makers are completely out of touch with the realities of day-to-day operations and end up shopping from a checklist rather than from experience.

Slide 73

Slide 73 text

choice Whatever the reason, those products aren't going away anytime soon. And for Enterprise customers who can afford to make these mistakes and start over, that's fine. But their model does NOT fit how we need to think about open-source tools. Most of us don't use open-source software because it's "free". We use it because it fits our needs, or because we can modify it to do so. It offers us choice. Because we understand how it fits together. How it makes our job easier. And how it makes our business run smoother.

Slide 74

Slide 74 text

specifics Ok, those are some good general themes to draw on when we're talking about good-vs-bad software. But what are the specific characteristics of the next generation of open-source monitoring tools?

Slide 75

Slide 75 text

composable (bad-ass robot) First off, it's composable. It has well-defined responsibilities, interfaces and protocols.

Slide 76

Slide 76 text

composable self-service (bad-ass robot) It's self-service. It doesn't require root access or an Ops member to deploy. Developers should be able to submit metrics and craft alerts without help or impedance from anyone else.

Slide 77

Slide 77 text

composable self-service resilient (bad-ass robot) It's distributed and resilient to outages within the monitoring architecture. It can route metrics collection around failed agents or pathways.

Slide 78

Slide 78 text

composable self-service resilient automated (bad-ass robot) Obviously, it's capable of being automated. It fucking LOVES to be automated. Especially by CFEngine, right Mark Burgess?

Slide 79

Slide 79 text

composable self-service resilient automated correlative (bad-ass robot) It's correlative. It's able to implicitly model relationships between services. We can use it to look at seemingly unrelated metrics within the same interface.

Slide 80

Slide 80 text

composable self-service resilient automated correlative craftsmanship (bad-ass robot) Last but not least, it's beautiful. It's a pleasure to use. It removes impedance from the user experience and rewards us simply by using it. What do we end up with?

Slide 81

Slide 81 text

A BADASS ROBOT CAPABLE OF WORLD DOMINATION. But no, seriously, hopefully we'll have a flexible and reliable monitoring system suitable for businesses of any size.

Slide 82

Slide 82 text

The Components Now that we've defined the characteristics of a composable monitoring system we can start to look at what we already have and start classifying each of these units, and by extension, their interfaces. There's a good chance that both current and future projects will overlap functionality, but that's ok; the most important thing is that we start to define the formats and interfaces that make these components INTERCHANGEABLE.

Slide 83

Slide 83 text

[diagram: sensors (cpu, load, snmp, ...) feeding the event stream] Sensors gather and emit our metrics. They should be portable across systems and capable of accumulating as much knowledge about that system as possible. For all practical purposes these are dumb agents with no concept of state; they capture the metric key, its value and the timestamp associated with that value. These results are then emitted to a log stream, over HTTP as JSON, or directly to the metrics store.
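A dumb sensor really can be this small. The sketch below (metric name and interval are arbitrary) just reads the 1-minute load average and emits key/value/timestamp tuples as JSON lines for whatever is consuming the event stream:

```python
import json
import os
import time

while True:
    event = {
        "metric": "host.db1.load.shortterm",   # hypothetical naming scheme
        "value": os.getloadavg()[0],           # 1-minute load average (Unix)
        "timestamp": int(time.time()),
    }
    print(json.dumps(event), flush=True)       # emit to the log/event stream
    time.sleep(10)
```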

Slide 84

Slide 84 text

[diagram: sensors (cpu, load, snmp, ...) feeding the event stream, now with an aggregator (sum, avg, 98pct, ...)] Aggregators are responsible for transformation, aggregation, or possibly simply relaying of metrics. They can be used to track counters, gauges or timers. Or they might be used just to proxy data from one format type to another.
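As a sketch of the aggregation step, here's the kind of roll-up an aggregator performs on a window of timer values before forwarding the summary downstream (the statistics match the ones on the slide: sum, average, 98th percentile):

```python
def aggregate(values):
    values = sorted(values)
    n = len(values)
    total = sum(values)
    p98_index = max(0, int(round(0.98 * n)) - 1)   # nearest-rank percentile
    return {"count": n, "sum": total, "avg": total / n, "p98": values[p98_index]}

# e.g. request timings (ms) collected during one flush interval
print(aggregate([12, 15, 11, 230, 14, 13, 18, 16, 12, 11]))
```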

Slide 85

Slide 85 text

[diagram: the event stream now also flows through a state engine] The state engine tracks changes within the event stream. It contains rules which define its behavior. Ideally it can ascertain faults according to seasonality and forecasting. Generally speaking it operates on a finite set of recent data, although the ability to refer to long-term trends would be ideal. In its most basic sense, it performs Fault Detection.
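In its simplest form, that's just a rules table and a memory of the last state per metric, as in this toy sketch (the rule and metric name are hypothetical; a smarter engine would use seasonality or forecasts instead of a static threshold):

```python
RULES = {"host.db1.load.shortterm": 8.0}   # metric -> critical threshold
last_state = {}

def process(event):
    """Return a state-change record, or None if nothing changed."""
    metric, value = event["metric"], event["value"]
    if metric not in RULES:
        return None
    state = "critical" if value > RULES[metric] else "ok"
    if last_state.get(metric) != state:
        last_state[metric] = state
        return {"metric": metric, "state": state, "value": value}
    return None

for value in (2.1, 9.4, 9.8, 3.0):
    change = process({"metric": "host.db1.load.shortterm", "value": value})
    if change:
        print(change)   # initial ok, critical at 9.4, recovery at 3.0
```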

Slide 86

Slide 86 text

[diagram: a storage engine is added alongside the state engine] Storage engines are responsible for long-term storage and retrieval of metrics. They should support transformative functions and aggregations so clients don't have to. And ideally they should be capable of near-realtime retrieval and output in standard formats such as JSON, XML or SVG.
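Graphite already behaves this way: the same render endpoint that draws PNGs will hand machine-readable datapoints back as JSON, which is what makes it usable as a storage engine for other components. A sketch (host and metric path are placeholders):

```python
import json
import urllib.request

url = ("http://graphite.example.com/render"
       "?target=collectd.db1.load.shortterm&from=-1h&format=json")

for series in json.loads(urllib.request.urlopen(url).read()):
    for value, timestamp in series["datapoints"]:
        if value is not None:
            print(series["target"], timestamp, value)
```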

Slide 87

Slide 87 text

[diagram: a scheduler is added to the stack] The scheduler provides an interface for managing on-call and escalation calendars. By extension, it provides notifiers with the routing information they need to fulfill their duties.

Slide 88

Slide 88 text

[diagram: a notifier is added, fed by the state engine and the scheduler] Notifiers are responsible for composing the alert message using data provided by the state engine. They refer to the scheduler for routing instructions before attempting message delivery. And they track the state of each message for escalation purposes.
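Stitched together, a notifier doesn't need to be much more than this sketch: compose a message from the state engine's event, ask the scheduler who's on call, and hand the result to a delivery backend. Everything here is hypothetical glue, not any particular project's API.

```python
def notify(event, scheduler, deliver):
    recipient = scheduler.on_call(event["metric"])
    message = "[{state}] {metric} = {value}".format(**event)
    deliver(recipient, message)

class StaticScheduler:
    def on_call(self, metric):
        return "ops-primary@example.com"   # stand-in for a real rotation

notify(
    {"metric": "host.db1.load.shortterm", "state": "critical", "value": 9.4},
    StaticScheduler(),
    lambda to, msg: print("would deliver to", to, ":", msg),
)
```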

Slide 89

Slide 89 text

[diagram: a visualizer sits on top of the stack] Visualizers consist of dashboards and other user interfaces that consume metrics and alerts from the system. In OSI terms, they are the application layer on top of the stack. In layman's terms, they make pretty graphs from raw data.

Slide 90

Slide 90 text

[diagram: the complete event stream, from sensors through aggregator, state engine, storage engine, scheduler, notifier and visualizer] This is the event stream. But how does this differ from what we're doing today? Well, if you're a company like Etsy, Heroku or GitHub, it really doesn't. If you're not doing this, you're duplicating your metrics collection and storage. You're probably not taking advantage of forecasting or long-term trends for your fault detection. And it's probably a huge pain in the ass for your developers to add new checks and metrics to your system.

Slide 91

Slide 91 text

Tools of the Future (available now) You're thinking "ok, I'm convinced, what can I do now?" Here are some tools that already fit into the event stream model. You're probably already using some of them, at least if you're cool like me. ;-)

Slide 92

Slide 92 text

We've already mentioned collectd, but it's a perfect example of a sensor. It gathers just about every metric type you could think of and has tons of output plugins.

Slide 93

Slide 93 text

Coda Hale's metrics library. Another awesome example of a sensor and the gold standard for metric-emitting libraries everywhere.

Slide 94

Slide 94 text

The awesome statsd aggregator from the team at Etsy.
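Part of statsd's appeal is that the wire protocol is trivial: plain-text "name:value|type" datagrams over UDP (port 8125 by default), so any application can emit counters, timers and gauges with a few lines of code. Metric names below are hypothetical.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD = ("statsd.example.com", 8125)

sock.sendto(b"web.requests:1|c", STATSD)           # counter: one request served
sock.sendto(b"web.response_time:212|ms", STATSD)   # timer: 212 ms response
sock.sendto(b"web.open_connections:42|g", STATSD)  # gauge: current value
```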

Slide 95

Slide 95 text

And logster, another project from Etsy. This one extracts and aggregates metrics from your log stream.

Slide 96

Slide 96 text

Logstash, the swiss-army knife of log stream relaying, filtering and aggregation. If you need something done with your logs, chances are Logstash supports it.

Slide 97

Slide 97 text

Riemann does a lot of things, but it's wonderful about supporting external inputs and outputs. It can act as a sensor, aggregator, state engine and visualizer.

Slide 98

Slide 98 text

Like Riemann, Sensu is modular and capable of external inputs and outputs. Unlike Nagios, it handles distributed systems nicely and has a non-sucky UI.

Slide 99

Slide 99 text

Umpire, a handy little state engine from Heroku. Basically it takes a Graphite query and threshold and returns an HTTP status code. So, for example, your developers could use a Pingdom account to query Graphite results through Umpire and send alerts based on the response. It's the epitome of self-service monitoring.
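The idea is simple enough to sketch without knowing Umpire's exact parameters: fetch a Graphite target, average the recent datapoints, and turn a threshold comparison into an HTTP-friendly status code that any dumb prober (like Pingdom) can alert on. Host and metric names are placeholders.

```python
import json
import urllib.request

def check(graphite, target, maximum, window="-5min"):
    url = f"{graphite}/render?target={target}&from={window}&format=json"
    data = json.loads(urllib.request.urlopen(url).read())
    points = [v for v, _ in data[0]["datapoints"] if v is not None]
    value = sum(points) / len(points)
    return (200 if value <= maximum else 500, value)

status, value = check("http://graphite.example.com",
                      "averageSeries(collectd.web*.load.shortterm)", 4.0)
print(status, value)
```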

Slide 100

Slide 100 text

Comsat, a notifier library for Ruby. It supports backends such as Campfire, PagerDuty and email.

Slide 101

Slide 101 text

Kibana, a nice dashboard for Logstash. If you like Splunk you'll probably love Kibana. Especially since it comes without the Splunk pricetag.

Slide 102

Slide 102 text

Tasseo, a near-realtime dashboard written for Graphite. This is actually a screenshot from a port by Mathias Meyer to support Librato Metrics on the backend.

Slide 103

Slide 103 text

And Descartes, a Graphite dashboard I've been working on for collaboration and postmortem discovery. It looks similar to some of the other Graphite dashboards out there but really emphasizes convention over configuration and a much-improved workflow.

Slide 104

Slide 104 text

Not Open-Source, But… I'd also like to briefly mention some commercial services that, while not open-source, are open-source friendly. In particular they all have decent APIs that make it really easy to send data in, and AT LEAST POSSIBLE to pull data out of.

Slide 105

Slide 105 text

Pingdom, a good service for basic remote monitoring. From my previous example, we had developers at Heroku use it to set up their own Graphite-friendly checks and alerts.

Slide 106

Slide 106 text

Boundary

Slide 107

Slide 107 text

Although I still prefer keeping my data in Graphite, Librato Metrics is a really nice alternative if you don't want to manage your own data storage and retention. They have a really nice API and make it easy to integrate with open-source projects.

Slide 108

Slide 108 text

Although I have some gripes about their API, PagerDuty really is the best-of-breed as far as notifications and on-call scheduling goes. They're inexpensive and pretty darn reliable. I would love to see an open-source alternative in the scheduler space, but I have no problem giving them my money either way.

Slide 109

Slide 109 text

The Future is Composable Long story short, I'd love to see open-source monitoring move towards the composable event stream model. Having a defined set of functions and interfaces will improve the reliability and scalability of our toolset. If you have any questions or ideas, please see me later today. If you think I'm crazy, well, thanks for listening anyways.

Slide 110

Slide 110 text

Thank You One more thing, GitHub is hiring Ops people. Find me if you're interested!