Explains why Rubyists should view code instrumentation as being as imperative as unit testing, surveys the current state of Ruby instrumentation, and lays out what we need to do to make this practice far more widespread. Presented at RubyConf 2012: http://www.confreaks.com/videos/1276-rubyconf2012-ruby-monitoring-state-of-the-union
State of the Union
‣ Co-Founder/CTO Librato
‣ I <3 graphs
What’s all this fuss about?
People have been monitoring computers forever, right? Why is everyone talking about
monitoring all of a sudden? What changed? In order to answer that question it’s instructive to
examine what’s changed in building/delivering SaaS over the last decade.
‣ Seed Round: $1.5M
‣ Dedicated Ops
‣ Custom Software
To operate a SaaS business 10 years ago, you had to go get a pile of money and make an up-
front investment in physical hardware. You’d hire a dedicated Ops team to run it, and most (if
not all) of your software stack would be custom-built. Your primary monitoring concern
outside of hardware failure would be capacity planning i.e. when do we need to buy another
rack of servers to step-wise increase capacity.
‣ Seed Round: $20K
‣ <=1 Ops Person
‣ OSS, External
Ten years later, we’re in a dramatically different situation. Seed rounds are orders of
magnitude smaller and infrastructure is deployed on demand, “in the cloud”. Teams are lean
so many small startups don’t even have a single dedicated Ops person. Products are built
primarily with off-the-shelf OSS and external services.
‣ more change, worse tools!
The result of this shift is that we now have “agile infrastructure” that’s paid for as an operating expense and can be rapidly modified/scaled/adjusted to meet shifting business needs. The trade-off we make to achieve this is that it’s ephemeral. This means that we’re now operating under more challenging conditions, with worse tooling.
‣ continuous integration
‣ one-click deploy
The most important change, however, is the growing shift to Continuous Deployment. I’m a CD zealot. I firmly believe that in the future CD will be the default way to ship software. Monitoring and alerting are key tenets of CD because code is constantly changing in production. The only way to protect against regressions and validate hypotheses about the effects of changes is to observe the code in production.
I thought I should perform a minimal amount of research to see if there’s anything to back
that theory up. My altogether unscientific approach (i.e. Google Trends) found this.
What can monitoring get us?
So let’s whet your appetite with a few concrete examples.
You can use monitoring to detect regressions from new code. Anyone who’s ever managed a
queuing system probably recognizes this graph. Something shipped and “stuck” the queue at
~1:00. While an alert went off and the situation was diagnosed, the queue continued to grow.
At ~2:00 a fix was deployed, and the queue drained back to its normal state.
You can use monitoring to detect regressions/failures in hardware. Here’s a graph showing
the per-host average latency of a particular operation across a tier of web servers. Can you
spot the bad host?
You can use monitoring to validate that performance tuning worked. This graph shows the
latency of an operation before/after shipping some performance improvements.
You can use monitoring to correlate shifts in behavior against possible causes. This is a
graph of read operations in a Cassandra ring correlated against a recurring batch job that
pulls data out of the ring every 15 minutes to perform some work on it.
You can use monitoring to detect changes in user behavior. Here’s a graph showing the
number of malformed requests received in our API before and after the leap second. It
apparently broke a lot of our users’ code.
You can also use monitoring to validate that the new features you ship are in fact being used.
When we make a decision to build/ship a feature, it’s a hypothesis that we’re adding business
value. You can also correlate use of new features against performance measurements to find
new bottlenecks it may have introduced.
And if nothing else, you can use monitoring to find chunky bacon in the wild. Which is the
best kind of bacon.
‣ detect regressions
‣ validate new hypotheses
‣ increases resilience to change
‣ sound familiar?
So we can use monitoring to detect regressions, validate our hypotheses about running code,
and increase our overall resilience to change. This should sound vaguely familiar.
Would we ship
code w/o tests?
I hope not.
We shouldn’t ship
code w/o instrumentation.
You wouldn’t ship gems/apps without unit tests. You shouldn’t ship them without instrumentation either.
‣ Continuous Deployment
‣ Service Oriented Architecture
‣ Devs are Domain Experts
If, in the future, the common architecture for SaaS is continuous deployment of small, focused services, then instrumentation is a necessity, not a luxury. Observation is the only way to truly
know what your code does in production and how it’s affected by change. Furthermore, the
devs writing the code are the actual domain experts, and the ones who should be using that
knowledge to instrument their code in the right places. It’s an anti-pattern to throw code
over the fence and have the operators guessing what they should be measuring.
So what is the
state of the union?
So if this is the case, why is instrumentation so uncommon? TL;DR: we have work to do. If you Google “monitoring ruby” and casually start searching for how to do this, you get hit with
something like this.
An overwhelming list of agents, libraries, vendors, etc. IMHO some of these are really good, and some are not so good. It can be very hard for a developer to know where to get started.
And unfortunately, many of them push you towards a particular anti-pattern.
[Diagram: three vertically integrated monitoring projects (Project X, Project Y, Project Z), each collecting its own slice of data: custom stats, MySQL threads, battery charge, an SNMP service]
The problem is that many of these solutions are vertically integrated monoliths. They give
you an agent, some kind of a storage backend, and some interface to visualize the data.
Unfortunately you usually end up requiring several to collect all the different data you need.
So now you have N different UIs to learn and N different data silos that cannot be easily correlated.
‣ New Relic
‣ Monolithic OSS
‣ statsd-ruby + statsd
‣ Librato et al.
In practice what we see today in the Ruby community is a mix of New Relic, monolithic OSS
silos built on in-memory databases, or statsd pushing to something like Librato or Graphite.
Often we’ll see users with several of these. So how do we improve this?
We need to
decompose the problem.
Monitoring can be broken down into a series of steps e.g. collecting the metrics, aggregating
those measurements across processes and hosts, storing the resulting aggregates, and
visualizing/analyzing the data. Today we’re going to focus on just instrumentation, because that’s where we as Ruby developers need to improve the most.
‣ concise primitives
‣ minimal performance impact
‣ minimal dependencies
We need a small set of instrumentation primitives we can use to answer all the questions we
might have about our code. It needs to be performant so we can run the instrumentation in
production. Most importantly, it needs to be completely decoupled from all the other
components of monitoring we discussed previously. And that’s because ...
‣ flexibility at all other layers
‣ simple introspection
‣ simple capture
How we visualize/analyze our metrics is an intensely personal decision. The operators of our code (even if it’s just us wearing our ops hat) need complete flexibility in how they consume/manage/analyze the results of our instrumentation. Any requirement our
instrumentation places on the operators is going to limit its utility to those who agree with
our choices. So we will have none.
‣ implements primitives
‣ captures state
‣ nothing more!
So our ideal solution provides the primitives through a simple interface, captures a cross-request aggregate of each metric in memory, provides another simple abstraction to access the current state of each metric, AND NOTHING ELSE. We’re going to look for inspiration in @coda’s metrics library for the JVM. It defines a powerful set of primitives, and in polyglot shops like Librato it gives us a common vocabulary for discussing our JVM/Ruby services and how they interact.
What are these primitives?
So I keep saying “primitives”. What do I mean by that? Let’s take a look at some of the
questions we might have about our code running in production and how we might use our
instrumentation to answer them. Assume for now we have a gem called “Foo” that
implements these primitives ;-)
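As a rough sketch of what that gem’s surface might look like (the Foo module, its Registry, and the method names here are all hypothetical, loosely modeled on how Metriks delegates to its default registry):

module Foo
  # each accessor returns (lazily creating) a named, process-wide metric
  def self.counter(name)
    Registry.default.counter(name)
  end

  def self.meter(name)
    Registry.default.meter(name)
  end

  # histogram(name) and timer(name) follow the same pattern
end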
How do we count things?
How many jobs are
in the queue?
#enqueue a job
Foo.counter('mygem.jobs.queued').increment
#complete a job
Foo.counter('mygem.jobs.queued').decrement
How do we measure rates?
How many reqs/sec are we serving?
if status >= 500
  Foo.meter('rack.resp.error').mark
elsif status == 200
  Foo.meter('rack.resp.success').mark
elsif status == 404
  Foo.meter('rack.resp.not_found').mark
end
Here’s an example of instrumentation in a Rack middleware that tracks the rates of different response classes.
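For context, a minimal sketch of the middleware that conditional might live in; the class name is illustrative and error handling is elided:

class ResponseMeters
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    # classify the response and mark the matching meter, as above
    if status >= 500
      Foo.meter('rack.resp.error').mark
    elsif status == 200
      Foo.meter('rack.resp.success').mark
    elsif status == 404
      Foo.meter('rack.resp.not_found').mark
    end
    [status, headers, body]
  end
end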
How do we track distributions?
How big are those responses?
len = headers.fetch('Content-Length', 0).to_i
Foo.histogram('rack.resp.len').update(len)
In most web applications the “average” Content-Length is going to be meaningless. So we’ll
use a histogram to track more useful metrics about the Content-Length of our responses.
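Downstream, a reporter can then ask that histogram for quantiles rather than a mean; the snapshot accessors here are an assumption, modeled on how Metriks exposes them:

snapshot = Foo.histogram('rack.resp.len').snapshot
snapshot.median               # typical response size
snapshot.get_95th_percentile  # the tail that actually hurts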
How do we time operations?
How long are requests taking?
# block form
Foo.timer('mygem.req').time { handle_request }  # handle_request is a stand-in
# explicit form
t = Foo.timer('mygem.req').time
handle_request
t.stop
So given tooling with the proper abstractions, that’s truly all developers need to know to comprehensively instrument their code! There’s no reason instrumentation can’t become as widespread as unit testing.
How do we
get the data?
So now that the devs are instrumenting their code, how do we pull out the resulting metrics so
we can aggregate them across processes/hosts, persistently store them, and analyze them
using our favorite tools for those tasks?
‣ in-memory store
‣ simple iteration
We use another simple abstraction called a “registry”. It’s basically just a list of the objects implementing the different primitives; we can iterate over it and query each object for the current state of its metric. It’s thread-safe and double-buffered so that reporting never contends with the instrumented code.
Foo::Registry.default.each do |n, m|
  # n is the metric's name, m is the metric object itself
end
Based on each metric’s type, we can pull different kinds of data out and then do whatever we want with it.
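For example, a toy console reporter might dispatch on each metric’s class; the class and accessor names below mirror the primitives above and are assumptions, not any particular gem’s API:

Foo::Registry.default.each do |n, m|
  case m
  when Foo::Counter
    puts "#{n} count=#{m.count}"
  when Foo::Meter
    puts "#{n} rate_1m=#{m.one_minute_rate}"
  when Foo::Histogram
    puts "#{n} p95=#{m.snapshot.get_95th_percentile}"
  end
end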
‣ separate gems
‣ simple to build
‣ console, jmx, logs, statsd, librato
This registry abstraction makes it trivial to build adaptors to connect our metrics to whatever
other tools we want. Here’s just a partial list of examples.
How do we
interpret the data?
Up until now I’ve glossed over the details of what each primitive tracks to illustrate that the
proper abstractions mean we don’t need much (if any) up-front investment to start
instrumenting our code. But if we’re writing a reporter or interpreting the resulting metrics in
operations, we need to know a little more about what each type provides.
There will be math.
It’s also important to note that these primitives are actually somewhat complex under the hood. Ideally we’ll standardize on a very small number of implementations (hopefully just one) that actually get all the details right. The reason for this complexity is ...
Streaming data is hard.
Our primitives are actually aggregating, in each process, a continuous stream of events across whatever our unit of work is, e.g. web requests, jobs, etc.
‣ large number of samples
‣ recency of results matters
‣ averages suck
So we need some way to capture representative numbers over millions or even billions of
events. So stationary techniques (as taught in basic stats classes) that require us to have
access to all of the samples are not applicable. Furthermore, we’re primarily interested in the
“recent” state of the metrics, since that’s what’s affecting our business “right now”.
‣ absolute count
We’ll start with something easy. A counter is just an absolute count that we increment and decrement with each event. It’s up to whatever tooling samples this counter to take derivatives and detect resets.
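For instance, a poller sampling the counter once a minute might derive rates like this; a hedged sketch, the helper is purely illustrative:

# derive a per-second rate from two successive samples of a counter
def counter_rate(prev, curr, interval_secs)
  return nil if curr < prev      # counter reset, e.g. a process restart
  (curr - prev) / interval_secs.to_f
end

counter_rate(100, 250, 60)       # => 2.5 events/sec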
‣ 1 second rates
‣ mean rate
‣ 1/5/15m EWMA rate
Meters track rates. We track a 1s rate, because that’s really the largest unit of time acceptable for discussing the throughput of a computer (per-minute or per-hour rates just pump up vanity metrics). We also track the mean rate, but it doesn’t help with understanding what the throughput is doing “now”. So we use exponentially weighted moving averages.
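The EWMA update itself is tiny; this sketch follows the standard formulation used in @coda’s metrics library (5-second ticks, 1-minute window):

TICK   = 5.0                           # seconds between rate updates
WINDOW = 60.0                          # the "1-minute" moving average
ALPHA  = 1 - Math.exp(-TICK / WINDOW)  # weight given to recent activity

# fold the events seen since the last tick into the moving rate
def tick(rate, uncounted_events)
  instant = uncounted_events / TICK    # instantaneous events/sec
  rate + ALPHA * (instant - rate)
end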
‣ quantiles (e.g. 75/95/99%)
‣ reservoir sampling
‣ forward-decaying priority
Histograms allow us to track the distribution of what we’re measuring and understand how
our quantiles are performing e.g. what is the 95th and 99th percentile response latency on a
particular resource in our API. We use reservoir sampling to generate these percentiles
without requiring us to store the complete set of samples. We use forward-decaying priority
sampling to ensure that our distribution represents a more recent state of the process.
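For reference, uniform reservoir sampling (Vitter’s Algorithm R) fits in a few lines; the forward-decaying variant replaces the uniform choice with time-decayed priorities so recent events dominate the sample. A sketch of the uniform version:

# keep a uniform random sample of `size` values from an unbounded stream
class Reservoir
  def initialize(size)
    @size, @count, @values = size, 0, []
  end

  # after n updates, each value has probability size/n of being retained
  def update(value)
    @count += 1
    if @values.length < @size
      @values << value
    else
      i = rand(@count)           # uniform index in 0...@count
      @values[i] = value if i < @size
    end
  end

  attr_reader :values
end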
‣ times an operation
‣ histogram of timings
‣ meter of operation rate
Timers are just built on top of the meter and histogram abstractions. We can time an operation and have access to both the rate of the operation and the distribution of its durations.
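Conceptually it’s just the two primitives glued together; a minimal sketch, assuming Meter and Histogram classes like the ones above:

class Timer
  def initialize
    @meter = Meter.new           # rate: how often the operation runs
    @histogram = Histogram.new   # distribution: how long it takes
  end

  def time
    start  = Time.now
    result = yield
    update(Time.now - start)
    result
  end

  def update(duration_secs)
    @meter.mark
    @histogram.update(duration_secs)
  end
end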
How do we get there?
While I’d love to wave a magic wand and have support for this in the standard library, standardization typically only happens after adoption. So we’re going to need a gem of some kind.
‣ instrumentation primitives
Luckily, a significant amount of work has already been done toward such a solution in the Metriks gem. The README (as of this talk) is terrible, however; without looking more deeply you’d think it ties you into certain tools at the other layers (the anti-pattern). This is actually not the case.
‣ clarify purpose
‣ extract reporters
‣ codify naming practices
‣ testing metrics
‣ base threading support
There’s still a lot of work left to do, and these are some of the areas we’ll be pushing on (like improving the README). We’d love to have you join the conversation on GitHub if you’re interested.
‣ instrumentation == tests
‣ decompose the problem
‣ zero coupling
‣ don’t underestimate complexity
Please help spread the word: instrumentation doesn’t have to be hard, and we shouldn’t be shipping code without it! Remember, however, that while the abstractions are extremely simple, the right implementation is relatively sophisticated, so please give the existing ones (e.g. Metriks, metrics.codahale.com) a try.
Would like to thank @coda and @lindvall for all the work they’ve done on Metrics and Metriks
respectively. Also @lmarburger for putting together a neat rack middleware on top of Metriks.
Discussions with @nextmat/@tmm1/@headius all helped shape this talk as well.