Explains why Rubyists should view code instrumentation as being as imperative as unit testing, surveys the current state of Ruby instrumentation, and lays out what we need to do to make this practice far more widespread. Presented at RubyConf 2012: http://www.confreaks.com/videos/1276-rubyconf2012-ruby-monitoring-state-of-the-union
State of the Union
‣ Co-Founder/CTO Librato
‣ I <3 graphs
What’s all this fuss about?
People have been monitoring computers forever, right? Why is everyone talking about
monitoring all of a sudden? What changed? In order to answer that question it’s instructive to
examine what’s changed in building/delivering SaaS over the last decade.
‣ Seed Round: $1.5M
‣ Dedicated Ops
‣ Custom Software
To operate a SaaS business 10 years ago, you had to go get a pile of money and make an up-
front investment in physical hardware. You’d hire a dedicated Ops team to run it, and most (if
not all) of your software stack would be custom-built. Your primary monitoring concern
outside of hardware failure would be capacity planning i.e. when do we need to buy another
rack of servers to step-wise increase capacity.
‣ Seed Round: $20K
‣ <=1 Ops Person
‣ OSS, External
Ten years later, we’re in a dramatically different situation. Seed rounds are orders of
magnitude smaller and infrastructure is deployed on demand, “in the cloud”. Teams are lean
so many small startups don’t even have a single dedicated Ops person. Products are built
primarily with off-the-shelf OSS and external services.
‣ more change, worse tools!
The result of this shift is that we now have “agile infrastructure” that’s paid for as an operating expense and can be rapidly modified/scaled/adjusted to meet shifting business needs. The trade-off we make to achieve this is that it’s ephemeral. This means that we’re now operating under more challenging conditions, with worse tooling.
‣ continuous integration
‣ one-click deploy
The most important change, however, is the growing shift to Continuous Deployment. I’m a CD zealot. I firmly believe that in the future CD will be the default way to ship software. Monitoring and alerting are key tenets of CD because code is constantly changing in production. The only way to protect against regressions and validate hypotheses about the effects of changes is to observe the code in production.
I thought I should perform a minimal amount of research to see if there’s anything to back
that theory up. My altogether unscientific approach (i.e. Google Trends) found this.
What can monitoring get us?
So let’s whet your appetite with a few concrete examples.
You can use monitoring to detect regressions from new code. Anyone who’s ever managed a
queuing system probably recognizes this graph. Something shipped and “stuck” the queue at
~1:00. While an alert went off and the situation was diagnosed, the queue continued to grow.
At ~2:00 a fix was deployed, and the queue drained back to its normal state.
You can use monitoring to detect regressions/failures in hardware. Here’s a graph showing
the per-host average latency of a particular operation across a tier of web servers. Can you
spot the bad host?
You can use monitoring to validate that performance tuning worked. This graph shows the
latency of an operation before/after shipping some performance improvements.
You can use monitoring to correlate shifts in behavior against possible causes. This is a
graph of read operations in a Cassandra ring correlated against a recurring batch job that
pulls data out of the ring every 15 minutes to perform some work on it.
You can use monitoring to detect changes in user behavior. Here’s a graph showing the
number of malformed requests received in our API before and after the leap second. It
apparently broke a lot of our users’ code.
You can also use monitoring to validate that the new features you ship are in fact being used.
When we make a decision to build/ship a feature, it’s a hypothesis that we’re adding business
value. You can also correlate use of new features against performance measurements to find
new bottlenecks it may have introduced.
And if nothing else, you can use monitoring to find chunky bacon in the wild. Which is the
best kind of bacon.
‣ detect regressions
‣ validate new hypotheses
‣ increases resilience to change
‣ sound familiar?
So we can use monitoring to detect regressions, validate our hypotheses about running code,
and increase our overall resilience to change. This should sound vaguely familiar.
Would we ship
code w/o tests?
I hope not.
We shouldn’t ship
code w/o instrumentation.
You wouldn’t ship gems/apps without unit tests. You shouldn’t ship them without instrumentation either.
‣ Continuous Deployment
‣ Service Oriented Architecture
‣ Devs are Domain Experts
If, in the future, the common architecture for SaaS is continuous deployment of small, focused services, then instrumentation is a necessity, not a luxury. Observation is the only way to truly
know what your code does in production and how it’s affected by change. Furthermore, the
devs writing the code are the actual domain experts, and the ones who should be using that
knowledge to instrument their code in the right places. It’s an anti-pattern to throw code
over the fence and have the operators guessing what they should be measuring.
So what is the
state of the union?
So if this is the case, why is instrumentation so uncommon? TL;DR: we have work to do. If you Google “monitoring ruby” and casually start searching for how to do this, you get hit with
something like this.
An overwhelming list of agents, libraries, vendors, etc. IMHO some of these are really good, and some are not so good. It can be very hard for a developer to know where to get started.
And unfortunately, many of them push you towards a particular anti-pattern.
[Diagram: three vertically integrated monitoring projects (Project X, Project Y, Project Z), each collecting its own slice of data: custom stats, MySQL threads, battery charge, an SNMP service]
The problem is that many of these solutions are vertically integrated monoliths. They give
you an agent, some kind of a storage backend, and some interface to visualize the data.
Unfortunately you usually end up requiring several to collect all the different data you need.
So now you have N different UIs to learn and N different data silos that cannot be easily correlated.
‣ New Relic
‣ Monolithic OSS
‣ statsd-ruby + statsd
‣ Librato et al.
In practice what we see today in the Ruby community is a mix of New Relic, monolithic OSS
silos built on in-memory databases, or statsd pushing to something like Librato or Graphite.
Often we’ll see users with several of these. So how do we improve this?
We need to
decompose the problem.
Monitoring can be broken down into a series of steps e.g. collecting the metrics, aggregating
those measurements across processes and hosts, storing the resulting aggregates, and
visualizing/analyzing the data. Today we’re going to focus on just instrumentation, because that’s where we as Ruby developers need to improve the most.
‣ concise primitives
‣ minimal performance impact
‣ minimal dependencies
We need a small set of instrumentation primitives we can use to answer all the questions we
might have about our code. It needs to be performant so we can run the instrumentation in
production. Most importantly, it needs to be completely decoupled from all the other
components of monitoring we discussed previously. And that’s because ...
‣ flexibility at all other layers
‣ simple introspection
‣ simple capture
How we visualize/analyze our metrics is an intensely personal decision. The operators of our code (even if it’s just us wearing our ops hat) need complete flexibility in how they consume/manage/analyze the results of our instrumentation. Any requirement our
instrumentation places on the operators is going to limit its utility to those who agree with
our choices. So we will have none.
‣ implements primitives
‣ captures state
‣ nothing more!
So our ideal solution provides the primitives through a simple interface, captures a cross-request aggregate of each metric in memory, provides another simple abstraction to access the current state of each metric, AND NOTHING ELSE. We’re going to look for inspiration in @coda’s metrics library for the JVM. It defines a powerful set of primitives, and in polyglot shops like Librato it gives us a common vocabulary for discussing our JVM/Ruby services and how they interact.
What are these primitives?
So I keep saying “primitives”. What do I mean by that? Let’s take a look at some of the
questions we might have about our code running in production and how we might use our
instrumentation to answer them. Assume for now we have a gem called “Foo” that
implements these primitives ;-)
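As a rough sketch of what that gem’s surface might look like (the Foo module, its Registry, and the method names here are all hypothetical, loosely modeled on how Metriks delegates to its default registry):

module Foo
  # each accessor returns (lazily creating) a named, process-wide metric
  def self.counter(name)
    Registry.default.counter(name)
  end

  def self.meter(name)
    Registry.default.meter(name)
  end

  # histogram(name) and timer(name) follow the same pattern
end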
How do we count things?
How many jobs are
in the queue?
#enqueue a job
Foo.counter('mygem.jobs.queued').increment
#complete a job
Foo.counter('mygem.jobs.queued').decrement
How do we measure rates?
How many reqs/sec are we serving?
if status >= 500
  Foo.meter('rack.resp.error').mark
elsif status == 200
  Foo.meter('rack.resp.success').mark
elsif status == 404
  Foo.meter('rack.resp.not_found').mark
end
Here’s an example of instrumentation in a Rack middleware that tracks the rates of different response classes.
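For context, a minimal sketch of the middleware that conditional might live in; the class name is illustrative and error handling is elided:

class ResponseMeters
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    # classify the response and mark the matching meter, as above
    if status >= 500
      Foo.meter('rack.resp.error').mark
    elsif status == 200
      Foo.meter('rack.resp.success').mark
    elsif status == 404
      Foo.meter('rack.resp.not_found').mark
    end
    [status, headers, body]
  end
end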
How do we track distributions?
How big are those responses?
len = headers.fetch('Content-Length', 0).to_i
Foo.histogram('rack.resp.len').update(len)
In most web applications the “average” Content-Length is going to be meaningless. So we’ll
use a histogram to track more useful metrics about the Content-Length of our responses.
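Downstream, a reporter can then ask that histogram for quantiles rather than a mean; the snapshot accessors here are an assumption, modeled on how Metriks exposes them:

snapshot = Foo.histogram('rack.resp.len').snapshot
snapshot.median               # typical response size
snapshot.get_95th_percentile  # the tail that actually hurts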
How do we time operations?
How long are requests taking?
# block form
Foo.timer('mygem.req').time { handle_request }  # handle_request is a stand-in
# explicit form
t = Foo.timer('mygem.req').time
handle_request
t.stop
So given tooling with the proper abstractions, that’s truly all developers need to know to comprehensively instrument their code! There’s no reason instrumentation can’t become as widespread as unit testing.
How do we
get the data?
So now that the devs are instrumenting their code, how do we pull out the resulting metrics so
we can aggregate them across processes/hosts, persistently store them, and analyze them
using our favorite tools for those tasks?
‣ in-memory store
‣ simple iteration
We use another simple abstraction called a “registry”. It’s basically just a list of the objects implementing the different primitives; we can iterate over it and query each object for the current state of its metric. It’s thread-safe and double-buffered so that reporting never contends with the instrumented code.
Foo::Registry.default.each do |n, m|
  # n is the metric's name, m is the metric object itself
end
Based on each metric’s type, we can pull different kinds of data out and then do whatever we want with it.
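For example, a toy console reporter might dispatch on each metric’s class; the class and accessor names below mirror the primitives above and are assumptions, not any particular gem’s API:

Foo::Registry.default.each do |n, m|
  case m
  when Foo::Counter
    puts "#{n} count=#{m.count}"
  when Foo::Meter
    puts "#{n} rate_1m=#{m.one_minute_rate}"
  when Foo::Histogram
    puts "#{n} p95=#{m.snapshot.get_95th_percentile}"
  end
end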
‣ separate gems
‣ simple to build
‣ console, jmx, logs, statsd, librato
This registry abstraction makes it trivial to build adaptors to connect our metrics to whatever
other tools we want. Here’s just a partial list of examples.
How do we
interpret the data?
Up until now I’ve glossed over the details of what each primitive tracks to illustrate that the
proper abstractions mean we don’t need much (if any) up-front investment to start
instrumenting our code. But if we’re writing a reporter or interpreting the resulting metrics in
operations, we need to know a little more about what each type provides.
There will be math.
It’s also important to note that these primitives are actually somewhat complex under the hood. Ideally we’ll standardize on a very small number of implementations (hopefully just one) that actually get all the details right. The reason for this complexity is ...
Streaming data is hard.
Our primitives are actually aggregating, in each process, a continuous stream of events across whatever our unit of work is, e.g. web requests, jobs, etc.
‣ large number of samples
‣ recency of results matters
‣ averages suck
So we need some way to capture representative numbers over millions or even billions of
events. So stationary techniques (as taught in basic stats classes) that require us to have
access to all of the samples are not applicable. Furthermore, we’re primarily interested in the
“recent” state of the metrics, since that’s what’s affecting our business “right now”.
‣ absolute count
We’ll start with something easy. A counter is just an absolute count that we increment and decrement with each event. It’s up to whatever tooling samples this counter to take derivatives and detect resets.
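For instance, a poller sampling the counter once a minute might derive rates like this; a hedged sketch, the helper is purely illustrative:

# derive a per-second rate from two successive samples of a counter
def counter_rate(prev, curr, interval_secs)
  return nil if curr < prev      # counter reset, e.g. a process restart
  (curr - prev) / interval_secs.to_f
end

counter_rate(100, 250, 60)       # => 2.5 events/sec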
‣ 1 second rates
‣ mean rate
‣ 1/5/15m EWMA rate
Meters track rates. We track a 1s rate, because that’s really the largest unit of time acceptable for discussing the throughput of a computer (per-minute or per-hour rates just pump up vanity metrics). We also track the mean rate, but it doesn’t help with understanding what the throughput is doing “now”. So we use exponentially weighted moving averages.
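The EWMA update itself is tiny; this sketch follows the standard formulation used in @coda’s metrics library (5-second ticks, 1-minute window):

TICK   = 5.0                           # seconds between rate updates
WINDOW = 60.0                          # the "1-minute" moving average
ALPHA  = 1 - Math.exp(-TICK / WINDOW)  # weight given to recent activity

# fold the events seen since the last tick into the moving rate
def tick(rate, uncounted_events)
  instant = uncounted_events / TICK    # instantaneous events/sec
  rate + ALPHA * (instant - rate)
end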
‣ quantiles (e.g. 75/95/99%)
‣ reservoir sampling
‣ forward-decaying priority
Histograms allow us to track the distribution of what we’re measuring and understand how
our quantiles are performing e.g. what is the 95th and 99th percentile response latency on a
particular resource in our API. We use reservoir sampling to generate these percentiles
without requiring us to store the complete set of samples. We use forward-decaying priority
sampling to ensure that our distribution represents a more recent state of the process.
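For reference, uniform reservoir sampling (Vitter’s Algorithm R) fits in a few lines; the forward-decaying variant replaces the uniform choice with time-decayed priorities so recent events dominate the sample. A sketch of the uniform version:

# keep a uniform random sample of `size` values from an unbounded stream
class Reservoir
  def initialize(size)
    @size, @count, @values = size, 0, []
  end

  # after n updates, each value has probability size/n of being retained
  def update(value)
    @count += 1
    if @values.length < @size
      @values << value
    else
      i = rand(@count)           # uniform index in 0...@count
      @values[i] = value if i < @size
    end
  end

  attr_reader :values
end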
‣ times an operation
‣ histogram of timings
‣ meter of operation rate
Timers are just built on top of the meter and histogram abstractions. We can time an operation and have access to both the rate of the operation and the distribution of its durations.
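Conceptually it’s just the two primitives glued together; a minimal sketch, assuming Meter and Histogram classes like the ones above:

class Timer
  def initialize
    @meter = Meter.new           # rate: how often the operation runs
    @histogram = Histogram.new   # distribution: how long it takes
  end

  def time
    start  = Time.now
    result = yield
    update(Time.now - start)
    result
  end

  def update(duration_secs)
    @meter.mark
    @histogram.update(duration_secs)
  end
end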
How do we get there?
While I’d love to wave a magic wand and have support for this in the standard library, standardization typically only happens after adoption. So we’re going to need a gem of some kind.
‣ instrumentation primitives
Luckily, a significant amount of work has already been done toward such a solution in the Metriks gem. The README (as of this talk) is terrible, however; without looking more deeply you’d think it ties you into certain tools at the other layers (the anti-pattern). This is actually not the case.
‣ clarify purpose
‣ extract reporters
‣ codify naming practices
‣ testing metrics
‣ base threading support
There’s still a lot of work left to do, and these are some of the areas we’ll be pushing on (like improving the README). We’d love to have you join the conversation on GitHub if you’re interested.
‣ instrumentation == tests
‣ decompose the problem
‣ zero coupling
‣ don’t underestimate complexity
Please help spread the word: instrumentation doesn’t have to be hard, and we shouldn’t be shipping code without it! Remember, however, that while the abstractions are extremely simple, the right implementation is relatively sophisticated, so please give the existing ones (e.g. Metriks, metrics.codahale.com) a try.
Would like to thank @coda and @lindvall for all the work they’ve done on Metrics and Metriks
respectively. Also @lmarburger for putting together a neat rack middleware on top of Metriks.
Discussions with @nextmat/@tmm1/@headius all helped shape this talk as well.