The State of Open-Source Monitoring

A look at the history of open source monitoring tools, an attempt at fixing our vocabulary within monitoring, and a proposal for the "event stream model" as the future of open source tooling.

Jason Dixon

October 05, 2012

Transcript

  1. 1.

    The State of Open Source Monitoring: The Good, The Bad, The Fucking Terrible and a Glimpse Into Our Future. Jason Dixon. Friday, October 5, 12

    John Willis says he’s never seen the asshole element at DevOpsDays. Coincidentally, this is my first visit to DevOpsDays. Who here thinks monitoring sucks? Ok, this talk is not for you. I want to talk to people who love monitoring and want to make it better. After the end of this talk today I hope that we’ve started a new discussion around monitoring and the guys who come after me need to update their slides.
  2. 2.

    Hi, I’m Jason. github.com/obfuscurity, @obfuscurity

    Hi, my name is Jason Dixon. A lot of people know me by my Twitter handle "obfuscurity". That’s an octocat that sort of looks like my Twitter avatar.
  4. 4.

    First, a little bit about me. I live in Westminster, MD in the USA. It’s in the middle of the country. You can drive out of our street and take the wrong turn and end up in a cornfield. But that’s ok because we have a Chipotle in town. Does everyone know what Chipotle is?
  5. 7.

    At Heroku I was an Ops Engineer. I did a lot of work with their metrics collection and Graphite setup.
  6. 8.

    And now I work at GitHub. This is an actual screenshot of my offer letter.
  7. 9.

    Ops-erational Visibility

    I used to be a hard-core UNIX SysAdmin like a lot of you, but I found that I’m really passionate about monitoring and visualization. So that’s what I specialize in these days. I’ve coined the term "VizOps", which is basically what I do to try and improve the state of visualization in Web Operations.
  8. 11.

    Is this monitoring? If you’re a vendor the answer is a resounding YES.
  9. 12.

    [Diagram labels: nagios; httpd, httpd, httpd, database, database, smtp, firewall; nrpe check; alert; primary oncall; backup oncall; shit hits the fan; oncall response]

    But for most of us, this is what a traditional monitoring system looks like: a Nagios instance that runs host and service checks, it sends pager or email notifications when something is down, and it serves as the primary dashboard for interacting with alerts and recoveries.
  10. 13.

    [Same diagram]

    But what we don’t see here is that the responder doesn’t just interact with Nagios. The first place you’ll probably go (after acking the alert) is to view trends for the affected resource. Maybe you have a Cacti installation that graphs SNMP data, or if you're really lucky, collectd on all your servers and a Graphite instance to store and graph all of your metrics.
  11. 14.

    I hear the term “monitoring” used a lot to generalize about the different features and components that make up our monitoring and trending systems. I think when most people use it though they're generally referring to fault detection and notifications, but I've also heard it used to describe metrics collection, trending, capacity planning, and even business analytics.
  18. 21.

    Nomenclature matters: fault detection, notifications, metrics collection, trending, capacity planning, analytics
  19. 22.

    Before we dig in too far I think it’s important that we identify what we mean by monitoring. What are the different functions it provides? Are these already available as distinct services, or can they be in the future? Once we identify all of these we can begin to see what a modern monitoring architecture might look like.
  20. 23.

    Metrics Collection

    This one is straightforward but we rarely think about it when planning our architecture, other than as a by-product of fault detection, or separately as what powers our trending graphs. Especially with tools like Nagios that throw out the data after informing us as to what’s going on RIGHT NOW.
  21. 24.

    minimal context

    Let’s start with a basic Nagios check like check-host-alive that tells us whether a host is up or down. But what is it really telling us? That we received a ping response to our ping request. But what does THAT really mean? That there’s a server out there somewhere, with at least the capacity to listen for ICMP requests and respond in kind.
  22. 25.

    host metrics

    But what does THAT tell us? Still not much. We have no idea whether that network host is operating as intended. We’ve all seen servers that respond on the network even after a kernel panic. So what do we do? We start checking operating system-level metrics that help us ascertain the true state of the system. Metrics like CPU jiffies and load level that tell us how the system is behaving, but they don't really explain how it got there.
  23. 26.

    So we also check things like disk usage, memory and swap usage, the number of users logged in, network activity, etc. We start to get a better image of what was going on leading up to a system event.
  24. 27.

    service metrics

    But we don't run servers for the sake of running the operating system. So we need to gather more information on the services on this host. If it's a database server maybe we'll check the number of connections, the number of long-running queries, replication state, etc. And yet, even after all of this, we still might not have all of the data we'll need to troubleshoot an event or plan for capacity upgrades.
  25. 28.

    connection metrics

    What if the system is in perfect working order, but we suffer a transient event on the network, and our database clients start timing out? So we start checking the database connection from a remote host. We monitor the time to connect, time to first byte, total latency, etc.
  26. 29.

    metrics are king shit

    The point I'm trying to make is that we need to place more emphasis on what metrics collection is, and why it's arguably THE MOST IMPORTANT part of the monitoring workflow. Historically this has been something we've only considered as part of Fault Detection, i.e. host and service checks. We have to start treating our metrics as first-class data, with plans for long-term data storage and recovery.
  27. 30.

    metrics drive change

    This is the data that we'll use going forward to plan for future growth and strategic architectural changes. We should plan to collect as much data as possible, as granular as possible, and store it for as long as possible.
  28. 31.

    If there's one thing I'd ask you to take away from this talk, it's to LOVE YOUR METRICS.
  29. 32.

    Fault Detection

    This is what we're talking about when we describe traditional "monitoring" responsibilities. It's also probably the hardest to get right.
  30. 33.

    We're looking for when a host, a service, or an application "goes bad". When it stops doing the job it was tasked to do. Or worse, when it begins corrupting our data.
  31. 34.

    state change

    This isn't an easy job. We have to be able to, ideally in real-time, track changes within the metrics we've collected about that entity, and determine when that change goes beyond an acceptable threshold.
  32. 35.

    lacking dynamism

    Historically our tests relied on simple boolean checks, i.e. does it respond, or to a greater extent, does the data we've collected (e.g. latency) fall within a predetermined range. Because these are static configurations, they're highly inflexible, and frequently incorrect as your systems evolve and scale. As I'm sure many of you would attest to at 3 in the morning.
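A minimal sketch of the kind of static range check described above. The names (`StaticCheck`, `db.latency_ms`) and the threshold values are hypothetical and not taken from Nagios or any other tool; the point is only that the limits are fixed at configuration time and never adapt as the system changes.

```python
from dataclasses import dataclass


@dataclass
class StaticCheck:
    """A fixed-threshold check (hypothetical names, for illustration only)."""
    name: str
    warn: float
    crit: float

    def evaluate(self, value: float) -> str:
        # The thresholds are hard-coded in configuration; nothing here
        # looks at history, seasonality or growth.
        if value >= self.crit:
            return "CRITICAL"
        if value >= self.warn:
            return "WARNING"
        return "OK"


# Assumed thresholds: fine today, possibly wrong after the next scaling event.
latency_check = StaticCheck(name="db.latency_ms", warn=200.0, crit=500.0)
state = latency_check.evaluate(120.0)
```

Every change in normal behaviour means someone has to go back and edit `warn` and `crit` by hand, which is the inflexibility the slide is complaining about.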
  33. 36.

    finite visibility

    Unfortunately, because our tools are designed to work based on what they know about the target right now, we're (for the most part) unable to take advantage of long-term historical trending or forecasting algorithms.
  34. 37.

    Regardless, we've largely made do with what we have at our disposal. Traditional open-source Fault Detection systems will identify when a target's state has changed and fire off some sort of notification event to let us know that things are broken.
  35. 38.

    Notification

    Notifications are pretty straightforward and generally hard to fuck up. In principle. But there can be a lot of complexity in what is otherwise a very simple premise: delivering an alert from the monitoring system to the responsible party. Let's step back for a moment and review what a notification might look like.
  36. 39.

    gather metadata

    First, your Fault Detection system determines that something has failed. It knows which host and service (or application) triggered the alert. It knows what the previous state was. It knows the current state. It knows when this happened and hopefully has some additional metadata, maybe a link to the online documentation, that it wants to include for your convenience. Ok, now what to do with this information?
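The metadata listed above maps naturally onto a small record. This is only a sketch: the `Alert` class, its field names and the example host and URL are assumptions made for illustration, not the format of any particular monitoring system.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Alert:
    """Hypothetical alert record carrying the metadata described above."""
    host: str
    service: str
    previous_state: str
    current_state: str
    timestamp: int      # Unix epoch seconds when the state changed
    doc_url: str = ""   # optional link to a runbook or documentation


def to_message(alert: Alert) -> str:
    # Serialize to JSON so that any downstream notifier can consume it.
    return json.dumps(asdict(alert), sort_keys=True)


alert = Alert(
    host="db01.example.com",
    service="postgres",
    previous_state="OK",
    current_state="CRITICAL",
    timestamp=1349449200,
    doc_url="https://wiki.example.com/runbooks/postgres",
)
message = to_message(alert)
```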
  37. 40.

    message routing

    If we're a small shop, we whip up an email or pager message and fire it off to someone with the skills to fix the problem. But as we grow, how do we account for pager rotations or escalation policies? Fortunately these days we have services like PagerDuty to deal with this pain.
  38. 41.

    the manual way sucks

    Before they came along there really was no easy way of handling this. If we were lazy we'd just change the email alias to the new on-call person at every shift change. If we were REALLY lazy we'd have a script that hits a database or Google Calendar to see who's on-call and then update the email alias programmatically.
  39. 42.

    PagerDuty

    Nowadays we're pretty lucky that PagerDuty solved this for us. We can fire off a POST to PagerDuty's API and trust that the message is going to get routed or escalated appropriately.
  40. 43.

    good enough

    Now I'd be lying if I said I wasn't a little nervous about placing so much of our collective trust in a single company. I love the team at PagerDuty but I think it would be nice to see some small, sharp, open-source tools developed around notifications and scheduling. Unfortunately they've made this mostly a solved problem and I don't see anyone wanting to deal with this pain until they do something stupid that makes us want to leave.
  41. 44.

    Trending

    Clearly we need Fault Detection and Notifications to enable first response to any outages or unknown states. But long-term trending of our data is what empowers us to make intelligent decisions about how to resolve ongoing problems or plan for future change. It provides empirical evidence so that we don't have to operate purely on short-term spikes, conjecture or "educated guesses".
  42. 45.

    network & server visualization

    Trending has historically been used by SysAdmins for visualizing network traffic and server load. The Multi Router Traffic Grapher (MRTG) was the first popular graphing toolkit. As its name suggests, it was commonly used to graph SNMP metrics from routers and switches.
  43. 46.

    time-series archives

    The time-series database within MRTG was eventually rewritten externally as RRDtool, a new version of the TSDB that was faster, more portable, and used a sparse file that wouldn't grow over time.
  44. 47.

    Although its archive format was relatively awkward to work with, it made it easy enough that the average Systems Administrator could start trending all of their hosts and services with a minimum of pain.
  45. 48.

    trending niche as a service

    These days we've got a huge variety of trending toolkits to work with. Most commercial vendors tend to focus on specific use cases, e.g. Front-End Performance, Transactional Profiling and Business Intelligence Analytics.
  46. 49.

    graphing toolkits

    Conversely, open-source trending projects tend to focus on providing a scalable storage backend and graphing toolkits or an API. More and more we’re seeing an entire ecosystem of dashboard projects built up around these as well.
  47. 50.

    The most popular ones support a variety of transforms and filters, allowing us to do things like: aggregate or average our metrics, calculate the 99th percentile, standard deviation, adjust scales, or simply help us forecast for growth. As the algorithms become more sophisticated and our datasets mature, it makes even more sense to start looking to our trending systems as the "source of truth" for Fault Detection.
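As a rough illustration of the transforms mentioned above, here is a sketch using only the Python standard library. The sample values and the `percentile` helper are made up for the example, and the nearest-rank percentile method is one common choice rather than what any particular trending system implements.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds, with one outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]

summary = {
    "avg": statistics.mean(latencies_ms),
    "p99": percentile(latencies_ms, 99),
    "stdev": statistics.stdev(latencies_ms),
}
```

The average and the 99th percentile tell very different stories about the same series here, which is why having both available as transforms matters.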
  48. 51.

    Legacy of Tools

    Let’s take a quick stroll through some of the more important open-source monitoring tools through modern history. I’m going to make a case as to why I think each of these tools is good, bad or really awful. You might disagree and that’s ok, this is just my list with my own personal criteria for judging them.
  49. 52.
  50. 53.

    The eternal RRDtool. Although it began life as part of MRTG, the rebirth of RRDtool as a separate project meant that anyone could use this format to store and retrieve metrics in their own personal or open-source projects. It has a narrow focus: metrics storage, retrieval and visualization. It’s still highly relevant, 13 years after its initial release, and used in a wide variety of popular monitoring and visualization projects.
  51. 54.

    Collectd is a fantastically flexible metrics collection daemon. If there’s a service out there, chances are someone has written a plugin to monitor it. It has a wide variety of output plugins and, unlike collectors like Munin, it performs quite well.
  52. 55.

    Graphite is another metrics storage, retrieval and visualization project. What makes Graphite so great to work with is that it’s so easy to send metrics to and create graphs with. What makes it better than all other graph rendering projects, in my opinion, is its rendering API. It comes with a huge variety of aggregation and filtering functions that can be chained together for complex transformations.
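To make both points concrete, here is a small sketch that formats a metric for Graphite's plaintext protocol (one `path value timestamp` line per metric) and builds a render API URL with chained functions. The hostname, metric paths and helper names are hypothetical; `sumSeries`, `movingAverage` and `alias` are Graphite functions, and `target`, `from` and `format` are render API parameters. Nothing is sent over the network.

```python
from urllib.parse import urlencode


def plaintext_line(path, value, timestamp):
    """Format one metric as a line of Graphite's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


def render_url(base_url, target, since="-1h", fmt="json"):
    """Build a render API URL; the target may chain several functions."""
    query = urlencode({"target": target, "from": since, "format": fmt})
    return f"{base_url}/render?{query}"


# Hypothetical metric path and Graphite host.
line = plaintext_line("servers.web01.loadavg", 0.42, 1349449200)

# Sum load across all servers, smooth it over 10 points, then label it.
target = 'alias(movingAverage(sumSeries(servers.*.loadavg),10),"cluster load")'
url = render_url("http://graphite.example.com", target)
```

Writing `line` to a socket connected to the Carbon plaintext listener would be the usual way to submit it; that step is left out so the sketch stays side-effect free.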
  53. 56.

    The Adequate

    I added a category here for projects that I really want to like, but for one reason or another, drive me nuts.
  54. 57.

    First off, we have Reconnoiter. This is a really nice metrics collection and trending project that we used at OmniTI as the basis for Circonus. Unfortunately, while they continue to backport enhancements to the open-source project, the web interface has never seen improvements and in fact is going to be deprecated. To make matters worse there is no API; all interactions have to be done via a Cisco IOS-like console. Oh, and it’s a hassle to deploy.
  55. 58.

    Next up we have Munin. This is another metrics collection daemon that’s really easy to deploy. It even comes with a web UI and automatic graphing. Unfortunately it has a reputation for really poor performance.
  56. 59.

    If you were to read the OpenTSDB docs you’d be convinced pretty quickly that its ability to scale horizontally would make it the end-all be-all solution for metrics collection and graphing. And while the latter is true, its API is horrid and its rendering functions pale in comparison to Graphite.
  57. 60.

    Ganglia is actually a pretty complete monitoring suite. Unfortunately, it’s a monolithic application that tries to do everything adequately and ends up failing to do any one thing particularly well. Although to be fair, I think if someone spent a couple months just cleaning up the UI that would go a long way with me.
  58. 61.

    The Bad

    The applications in my bad list have a lot in common. They’re all developed by commercial companies that offer an open-source or free version. They all emphasize automatic discovery over integration with configuration management. They’re all targeted to Enterprise customers. And in my opinion, they always end up looking like a prettier version of Nagios.
  59. 65.

    The Fucking Terrible

    Now I call these “fucking terrible” with a bit of tongue-in-cheek. These projects are clearly successful and in some cases, ubiquitous for the market they were designed for. Unfortunately, they are mostly good enough that nobody has been motivated to really improve upon them, so we’ve been stuck with these for a really long time.
  60. 66.

    It seems like everyone loves to hate on Nagios, but nobody can really explain why. Please, allow me. The user interface is horrible. Acknowledgements are indefinite, meaning that I can ack an alert and completely forget about it, and the system will never remind me. It takes WAY TOO MANY clicks to get anything done. It has no pagination, so a long page will completely choke your browser. I could go on but I only have 30 minutes.
  61. 67.

    If you’re an SNMP shop, you could do much worse than Cacti. It’s really good for adding network metrics, but it favors configuration over convention and doesn’t attempt to hide many of the rarely used options. Graphs are reasonably easy to create, but their data, host and graph templates are really difficult to master.
  62. 68.

    GOD is a process monitor for Ruby apps. It’s hugely popular among Ruby shops for making sure a process is running. If it dies or gets wedged, GOD can just launch another. Why is this a bad thing? Because it de-motivates developers to fix their stuff. It’s so much easier to just let your processes respawn than debug the actual source of the problem. A funny side note... it wasn’t until I went to capture this screenshot that I realized GOD was created by Tom Preston-Werner, one of the co-founders of GitHub. Yeah, he signs my paycheck.
  63. 69.

    Characteristics of Future Tools

    There's a ton of open-source and commercial monitoring tools available, so why does monitoring suck so bad? What makes us love a particular tool that only does one thing, but despise another that would seem to have everything we want?
  64. 70.

    the UNIX way

    It's actually not that hard to understand. We're a finicky bunch. We prefer our systems built from small, sharp tools. We don't want the hassle of commercial software. Put simply, we prefer the UNIX way.
  65. 71.

    interoperable

    In terms of commercial software, there’s a reason why Pingdom and PagerDuty are so popular among technically competent businesses. Cost is only a small part of the picture. We understand implicitly that combining small, sharp tools into a cohesive system is a helluva lot easier than breaking apart an Enterprise monitoring suite and forcing it to meet our specific needs.
  66. 72.

    But why do so many companies choose the “monolithic” Enterprise offering? A lot of times it’s for convenience. The illusion that one product will meet all of our needs. Other times it’s because we don’t make the choice at all. Your decision-makers are completely out of touch with the realities of day-to-day operations and end up shopping from a checklist rather than experience.
  67. 73.

    choice

    Whatever the reason, those products aren’t going away anytime soon. And for Enterprise customers who can afford to make these mistakes and start over, that’s fine. But their model does NOT fit how we need to think about open-source tools. For the majority of us, we don't use open-source software because it's "free". We use it because it fits our needs, or because we can modify it to do so. It offers us choice. Because we understand how it fits together. How it makes our job easier. And how it makes our business run smoother.
  68. 74.

    specifics

    Ok, those are some good general themes to draw on when we’re talking about good-vs-bad software. But what are the specific characteristics of the next generation of open-source monitoring tools?
  69. 75.

    composable (bad-ass robot)

    First off, it's composable. It has well-defined responsibilities, interfaces and protocols.
  70. 76.

    composable, self-service (bad-ass robot)

    It's self-service. It doesn't require root access or an Ops member to deploy. Developers should be able to submit metrics and craft alerts without help or impedance from anyone else.
  71. 77.

    composable, self-service, resilient (bad-ass robot)

    It's distributed. Resilient to outages within the monitoring architecture. It can route metrics collection around failed agents or pathways.
  72. 78.

    composable, self-service, resilient, automated (bad-ass robot)

    Obviously, it's capable of being automated. It fucking LOVES to be automated. Especially by CFEngine, right Mark Burgess?
  73. 79.

    composable, self-service, resilient, automated, correlative (bad-ass robot)

    It's correlative. It's able to implicitly model relationships between services. We can use it to look at seemingly unrelated metrics within the same interface.
  74. 80.

    composable, self-service, resilient, automated, correlative, craftsmanship (bad-ass robot)

    Last but not least, it's beautiful. It's a pleasure to use. It removes impedance from the user experience and rewards us simply by using it. What do we end up with?
  75. 81.

    A BADASS ROBOT CAPABLE OF WORLD DOMINATION. But no, seriously, hopefully we’ll have a flexible and reliable monitoring system suitable for businesses of any size.
  76. 82.

    The Components

    Now that we’ve defined the characteristics of a composable monitoring system we can start to look at what we already have and start classifying each of these units, and by extension, their interfaces. There’s a good chance that both current and future projects will overlap functionality, but that’s ok; the most important thing is that we start to define the formats and interfaces that make these components INTERCHANGEABLE.
  77. 83.

    [Diagram labels: sensor, sensor, sensor (cpu, load, snmp, ...); event stream]

    Sensors gather and emit our metrics. They should be portable across systems and capable of accumulating as much knowledge about that system as possible. For all practical purposes these are dumb agents with no concept of state; they capture the metric key, its value and the timestamp associated with that value. These results are then emitted to a log stream, over HTTP as JSON, or directly to the metrics store.
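A minimal sketch of a stateless sensor in that spirit: it captures a key, a value and a timestamp, and writes one JSON object per line to a stream. The function names, the metric key and the line-delimited JSON format are assumptions for illustration, not a defined standard.

```python
import io
import json
import time


def sample(key, read_value, clock=time.time):
    """Take one reading. The sensor keeps no state between calls."""
    return {"key": key, "value": read_value(), "timestamp": int(clock())}


def emit(event, stream):
    """Write the event as one JSON line; the stream could be a log file, a socket, etc."""
    stream.write(json.dumps(event) + "\n")


# Hypothetical usage: an in-memory stream stands in for the event stream, and
# the reader and clock are stubbed so the example is deterministic.
stream = io.StringIO()
emit(sample("web01.cpu.load", lambda: 0.42, clock=lambda: 1349449200), stream)
```

Because the sensor only knows how to read and emit, swapping the destination (log stream, HTTP, metrics store) does not require touching the collection logic.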
  78. 84.

    [Diagram adds: aggregator (sum, avg, 98pct, ...)]

    Aggregators are responsible for transformation, aggregation, or possibly simply relaying of metrics. They can be used to track counters, gauges or timers. Or they might be used just to proxy data from one format type to another.
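A sketch of an in-memory aggregator that tracks counters, gauges and timers and flushes a sum, an average and a 98th percentile per interval. The class, method and metric names are hypothetical, and the nearest-rank percentile is an assumption; real aggregators differ in naming and in how they compute percentiles.

```python
import math
from collections import defaultdict


class Aggregator:
    """Hypothetical in-memory aggregator for counters, gauges and timers."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.gauges = {}
        self.timers = defaultdict(list)

    def count(self, key, n=1):
        self.counters[key] += n

    def gauge(self, key, value):
        self.gauges[key] = value  # last write wins

    def timing(self, key, ms):
        self.timers[key].append(ms)

    def flush(self):
        """Return aggregated metrics for this interval, then reset counters and timers."""
        out = {f"{key}.sum": total for key, total in self.counters.items()}
        out.update(self.gauges)
        for key, samples in self.timers.items():
            ordered = sorted(samples)
            rank = max(1, math.ceil(0.98 * len(ordered)))  # nearest-rank 98th percentile
            out[f"{key}.avg"] = sum(ordered) / len(ordered)
            out[f"{key}.98pct"] = ordered[rank - 1]
        self.counters.clear()
        self.timers.clear()
        return out


agg = Aggregator()
for _ in range(3):
    agg.count("web.requests")
agg.gauge("web.workers", 8)
for ms in (10, 20, 30, 40):
    agg.timing("web.latency_ms", ms)
flushed = agg.flush()
```

Gauges survive a flush because they describe a current level, while counters and timers describe what happened during the interval and are reset.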
  79. 85.

    [Diagram adds: state engine]

    The state engine tracks changes within the event stream. It contains rules which define its behavior. Ideally it can ascertain faults according to seasonality and forecasting. Generally speaking it operates on a finite set of recent data, although the ability to refer to long-term trends would be ideal. In its most basic sense, it performs Fault Detection.
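One possible shape for such a state engine, as a sketch: it keeps a finite window of recent values per key and reports only state changes. The rule used here (flag a value more than `tolerance` standard deviations from the recent mean) is a simple stand-in chosen for the example; it is not seasonality-aware and is not taken from any specific tool. All names and defaults are assumptions.

```python
import statistics
from collections import defaultdict, deque


class StateEngine:
    """Hypothetical state engine: finite window per key, emits only state changes."""

    def __init__(self, window=30, tolerance=3.0, min_samples=5, min_spread=1.0):
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.states = {}
        self.tolerance = tolerance
        self.min_samples = min_samples
        self.min_spread = min_spread  # floor so a perfectly flat series is not hypersensitive

    def observe(self, key, value):
        """Return (previous_state, new_state) on a transition, otherwise None."""
        recent = self.history[key]
        state = "OK"
        if len(recent) >= self.min_samples:
            mean = statistics.mean(recent)
            spread = max(statistics.pstdev(recent), self.min_spread)
            if abs(value - mean) > self.tolerance * spread:
                state = "FAULT"
        recent.append(value)
        previous = self.states.get(key, "OK")
        self.states[key] = state
        return (previous, state) if state != previous else None


engine = StateEngine()
transitions = [engine.observe("db.latency_ms", v) for v in (10, 11, 10, 12, 11, 10, 95, 11)]
```

Only the transitions would be handed to a notifier; steady-state observations return `None`, which keeps the downstream components quiet until something actually changes.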
  80. 86.

    [Diagram adds: storage engine]

    Storage engines are responsible for long-term storage and retrieval of metrics. They should support transformative functions and aggregations so clients don’t have to. And ideally they should be capable of near-realtime retrieval and output in standard formats such as JSON, XML or SVG.
  81. 87.

    [Diagram adds: scheduler]

    The scheduler provides an interface for managing on-call and escalation calendars. By extension, it provides notifiers with the routing information they need to fulfill their duties.
  82. 88.

    [Diagram adds: notifier]

    Notifiers are responsible for composing the alert message using data provided by the state engine. They refer to the scheduler for routing instructions before attempting message delivery. And they track the state of each message for escalation purposes.
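The division of labour between scheduler and notifier could look roughly like this. It is a sketch under assumptions: the class names, the list-based rotation and the `send` callback are invented for the example, and delivery is stubbed out rather than calling a real paging service.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scheduler:
    """Hypothetical scheduler: an ordered on-call rotation, primary first."""
    rotation: list

    def contact_for(self, level):
        # Clamp so escalating past the end keeps paging the last contact.
        return self.rotation[min(level, len(self.rotation) - 1)]


@dataclass
class Notifier:
    """Hypothetical notifier: asks the scheduler who to page and tracks escalation."""
    scheduler: Scheduler
    send: Callable[[str, str], None]
    escalation_level: dict = field(default_factory=dict)

    def notify(self, alert_id, message):
        level = self.escalation_level.get(alert_id, 0)
        self.send(self.scheduler.contact_for(level), message)

    def escalate(self, alert_id, message):
        # Called when nobody acknowledged the previous page in time.
        self.escalation_level[alert_id] = self.escalation_level.get(alert_id, 0) + 1
        self.notify(alert_id, message)

    def acknowledge(self, alert_id):
        self.escalation_level.pop(alert_id, None)


sent = []
notifier = Notifier(Scheduler(["primary", "backup"]), lambda who, msg: sent.append((who, msg)))
notifier.notify("alert-1", "db01 postgres CRITICAL")
notifier.escalate("alert-1", "db01 postgres CRITICAL")
```

Keeping the rotation in the scheduler means the notifier never needs to know about calendars, only about who to contact at a given escalation level.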
  83. 89.

    [Diagram adds: visualizer]

    Visualizers consist of dashboards and other user interfaces that consume metrics and alerts from the system. In OSI terms, they are the application layer on top of the stack. In layman's terms, they make pretty graphs from raw data.
  84. 90.

    [Complete diagram: sensors, event stream, aggregator, state engine, storage engine, scheduler, notifier, visualizer]

    This is the event stream. But how does this differ from what we’re doing today? Well, if you’re a company like Etsy, Heroku or GitHub, it really doesn’t. If you’re not doing this, you’re duplicating your metrics collection and storage. You’re probably not taking advantage of forecasting or long-term trends for your fault detection. And it’s probably a huge pain in the ass for your developers to add new checks and metrics to your system.
  85. 91.

    Tools of the Future (available now)

    You’re thinking “ok I’m convinced, what can I do now!” Here are some tools that already fit into the event stream model. You’re probably already using some of them, at least if you’re cool like me. ;-)
  86. 92.

    Collectd we’ve already mentioned but it’s a perfect example of a sensor. It gathers just about every metric type you could think of and has tons of output plugins.
  87. 93.

    Coda Hale’s metrics library. Another awesome example of a sensor and the gold standard for metric-emitting libraries everywhere.
  88. 95.

    And logster, another project from Etsy. This one extracts and aggregates metrics from your log stream.
  89. 96.

    Logstash, the swiss-army knife of log stream relaying, filtering and aggregation. If you need something done with your logs, chances are Logstash supports it.
  90. 97.

    Riemann does a lot of things, but it’s wonderful about supporting external inputs and outputs (sensor, aggregator, state engine, visualizer).
  91. 98.

    Like Riemann, Sensu is modular and capable of external inputs and outputs. Unlike Nagios, it handles distributed systems nicely and has a non-sucky UI.
  92. 99.

    Umpire, a handy little state engine from Heroku. Basically it takes a Graphite query and threshold and returns an HTTP status code. So, for example, your developers could use a Pingdom account to query Graphite results through Umpire and send alerts based on the response. It’s the epitome of self-service monitoring.
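The idea of turning a metrics query plus a threshold into an HTTP status code can be sketched as follows. This is not Umpire's code or API: the function name, the choice of averaging the datapoints, and the specific status codes are assumptions. The input shape (a list of series, each with `datapoints` as `[value, timestamp]` pairs where the value may be null) follows Graphite's JSON render output.

```python
def status_for(series, minimum=None, maximum=None):
    """Map Graphite-style JSON series onto an HTTP status code (hypothetical mapping)."""
    values = [
        value
        for entry in series
        for value, _timestamp in entry["datapoints"]
        if value is not None  # Graphite reports gaps as null
    ]
    if not values:
        return 404  # assumption: no data is reported separately from a breach
    average = sum(values) / len(values)
    if minimum is not None and average < minimum:
        return 500
    if maximum is not None and average > maximum:
        return 500
    return 200


# Hypothetical query result for a single target.
result = [{
    "target": "web.latency_ms",
    "datapoints": [[1.0, 1349449200], [None, 1349449260], [3.0, 1349449320]],
}]
healthy = status_for(result, maximum=5.0)
breached = status_for(result, maximum=1.5)
```

Any external HTTP checker that alerts on non-200 responses can then be pointed at an endpoint returning this code, which is what makes the approach self-service.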
  93. 100.

    Comsat, a notifier library for Ruby. It supports backends such as Campfire, PagerDuty and email.
  94. 101.

    Kibana, a nice dashboard for Logstash. If you like Splunk you’ll probably love Kibana. Especially since it comes without the Splunk pricetag.
  95. 102.

    Tasseo, a near-realtime dashboard written for Graphite. This is actually a screenshot from a port by Mathias Meyer to support Librato Metrics on the backend.
  96. 103.

    And Descartes, a Graphite dashboard I’ve been working on for collaboration and postmortem discovery. It looks similar to some of the other Graphite dashboards out there but really emphasizes convention over configuration and a much-improved workflow.
  97. 104.

    Not Open-Source, But…

    I’d also like to briefly mention some commercial services that, while they’re not open-source, are open-source friendly. In particular they all have decent APIs that make it really easy to send data in, and AT LEAST POSSIBLE to pull data out of.
  98. 105.

    Pingdom, a good service for basic remote monitoring. From my previous example, we had developers at Heroku use it to set up their own Graphite-friendly checks and alerts.
  99. 107.

    Although I still prefer keeping my data in Graphite, Librato Metrics is a really nice alternative if you don’t want to manage your own data storage and retention. They have a really nice API and make it easy to integrate with open-source projects.
  100. 108.

    Although I have some gripes about their API, PagerDuty really is the best-of-breed as far as notifications and on-call scheduling goes. They’re inexpensive and pretty darn reliable. I would love to see an open-source alternative in the scheduler space, but I have no problem giving them my money either way.
  101. 109.

    The Future is Composable

    Long story short, I’d love to see open-source monitoring move towards the composable event stream model. Having a defined set of functions and interfaces will improve the reliability and scalability of our toolset. If you have any questions or ideas, please see me later today. If you think I’m crazy, well, thanks for listening anyways.
  102. 110.

    Thank You

    One more thing, GitHub is hiring Ops people. Find me if you’re interested!