Slide 1

Ruby Monitoring State of the Union TM

Slide 2

‣ @josephruscio ‣ Co-Founder/CTO Librato ‣ I <3 graphs

Slide 3

What's all this fuss about monitoring? People have been monitoring computers forever, right? Why is everyone talking about monitoring all of a sudden? What changed? To answer that question, it's instructive to examine what's changed in building and delivering SaaS over the last decade.

Slide 4

SaaS 2002 ‣ Seed Round: $1.5M USD ‣ Infrastructure: CAPEX ‣ Dedicated Ops Team ‣ Custom Software Stack To operate a SaaS business 10 years ago, you had to go get a pile of money and make an up-front investment in physical hardware. You'd hire a dedicated Ops team to run it, and most (if not all) of your software stack would be custom-built. Your primary monitoring concern outside of hardware failure would be capacity planning, i.e. when do we need to buy another rack of servers for the next step-wise increase in capacity.

Slide 5

SaaS 2012 ‣ Seed Round: $20K USD ‣ Infrastructure: OPEX ‣ <=1 Ops Person ‣ OSS, External Services Ten years later, we're in a dramatically different situation. Seed rounds are orders of magnitude smaller and infrastructure is deployed on demand, "in the cloud". Teams are lean, so many small startups don't even have a single dedicated Ops person. Products are built primarily with off-the-shelf OSS and external services.

Slide 6

‣ agile ‣ ephemeral ‣ more change, worse tools! Modern Infrastructure The result of this shift is that we now have "agile infrastructure" that's paid for as OPEX and can be rapidly modified/scaled/adjusted to meet shifting business needs. The trade-off we make to achieve this is that it's ephemeral. This means we're now operating under more challenging conditions, with worse tooling.

Slide 7

‣ continuous integration ‣ one-click deploy ‣ feature-flagging ‣ monitoring ‣ alerting Cont. Deployment The most important change, however, is the growing shift to Continuous Deployment. I'm a CD zealot. I firmly believe that in the future CD will be the default way to ship software. Monitoring and alerting are key tenets of CD because code is constantly changing in production. The only way to protect against regressions and validate hypotheses about the effects of changes is to observe the code in production.

Slide 8

I thought I should perform a minimal amount of research to see if there’s anything to back that theory up. My altogether unscientific approach (i.e. Google Trends) found this. Interesting.

Slide 9

What does monitoring get us? So let’s whet your appetite with a few concrete examples.

Slide 10

You can use monitoring to detect regressions from new code. Anyone who's ever managed a queuing system probably recognizes this graph. Something shipped and "stuck" the queue at ~1:00. While an alert went off and the situation was diagnosed, the queue continued to grow. At ~2:00 a fix was deployed, and the queue drained back to its normal state.

Slide 11

You can use monitoring to detect regressions/failures in hardware. Here's a graph showing the per-host average latency of a particular operation across a tier of web servers. Can you spot the bad host?

Slide 12

You can use monitoring to validate that performance tuning worked. This graph shows the latency of an operation before/after shipping some performance improvements.

Slide 13

You can use monitoring to correlate shifts in behavior against possible causes. This is a graph of read operations in a Cassandra ring correlated against a recurring batch job that pulls data out of the ring every 15 minutes to perform some work on it.

Slide 14

You can use monitoring to detect changes in user behavior. Here's a graph showing the number of malformed requests received by our API before and after the leap second. It apparently broke a lot of our users' code.

Slide 15

You can also use monitoring to validate that the new features you ship are in fact being used. When we make a decision to build/ship a feature, it's a hypothesis that we're adding business value. You can also correlate use of new features against performance measurements to find new bottlenecks they may have introduced.

Slide 16

Chunky Bacon!! And if nothing else, you can use monitoring to find chunky bacon in the wild. Which is the best kind of bacon.

Slide 17

‣ detect regressions ‣ validate new hypotheses ‣ increase resilience to change ‣ sound familiar? So we can use monitoring to detect regressions, validate our hypotheses about running code, and increase our overall resilience to change. This should sound vaguely familiar.

Slide 18

Instrumentation is Unit Testing for Operations

Slide 19

Would we ship code w/o tests? I hope not.

Slide 20

We shouldn't ship code w/o instrumentation. You wouldn't ship gems/apps without unit tests. You shouldn't ship them without instrumentation.

Slide 21

‣ Continuous Deployment ‣ Service Oriented Architecture ‣ Devs are Domain Experts If the future common architecture for SaaS is continuous deployment of small, focused services, then instrumentation is a necessity, not a luxury. Observation is the only way to truly know what your code does in production and how it's affected by change. Furthermore, the devs writing the code are the actual domain experts, and the ones who should be using that knowledge to instrument their code in the right places. It's an anti-pattern to throw code over the fence and have the operators guess what they should be measuring.

Slide 22

So what is the state of the union? So if this is the case, why is instrumentation so uncommon? TL;DR: we have work to do. If you google "monitoring ruby" and casually start searching for how to do this, you get hit with something like this.

Slide 23

Graphite StatsD OpenTSDB Cube d3.js An overwhelming list of agents, libraries, vendors, etc. IMHO some of these are really good, and some are not so good. It can be very hard for a developer to know where to get started. And unfortunately, many of them push you towards a particular anti-pattern.

Slide 24

Anti-Pattern [slide diagram: three vertically integrated stacks labeled Project X, Project Y, and Project Z, each collecting its own data (custom stats, MySQL threads, vmstat, CPU, memory, interfaces, ping, SNMP, battery charge, ...) into its own storage] The problem is that many of these solutions are vertically integrated monoliths. They give you an agent, some kind of storage backend, and some interface to visualize the data. Unfortunately you usually end up needing several of them to collect all the different data you need. So now you have N different UIs to learn and N different data silos that cannot be easily correlated.

Slide 25

Current Status ‣ New Relic ‣ Monolithic OSS ‣ statsd-ruby + statsd ‣ Librato et al. In practice, what we see today in the Ruby community is a mix of New Relic, monolithic OSS silos built on in-memory databases, or statsd pushing to something like Librato or Graphite. Often we'll see users with several of these. So how do we improve this?

Slide 26

We need to decompose the problem

Slide 27

Instrumentation Storage Aggregation Analysis Monitoring can be broken down into a series of steps, e.g. collecting the metrics, aggregating those measurements across processes and hosts, storing the resulting aggregates, and visualizing/analyzing the data. Today we're going to focus on just instrumentation, because that's where we as Ruby developers need to improve the most.

Slide 28

Devs Require ‣ concise primitives ‣ minimal performance impact ‣ minimal dependencies We need a small set of instrumentation primitives we can use to answer all the questions we might have about our code. It needs to be performant so we can run the instrumentation in production. Most importantly, it needs to be completely decoupled from all the other components of monitoring we discussed previously. And that’s because ...

Slide 29

Ops Require ‣ flexibility at all other layers ‣ simple introspection ‣ simple capture How we visualize/analyze our metrics is an intensely personal decision. The operators of our code (even if it's just us wearing our ops hat) need complete flexibility in how they consume/manage/analyze the results of our instrumentation. Any requirement our instrumentation places on the operators is going to limit its utility to those who agree with our choices. So we will have none.

Slide 30

Ideal Instrumentation ‣ implements primitives ‣ captures state ‣ nothing more! ‣ metrics.codahale.com So our ideal solution provides the primitives through a simple interface, captures a cross-request aggregate of each metric in memory, provides another simple abstraction to access the current state of each metric AND NOTHING ELSE. We're going to look for inspiration in @coda's metrics library for the JVM. It defines a powerful set of primitives, and in polyglot shops like Librato we can now use a common vocabulary when discussing our JVM/Ruby services and how they interact.
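To make the shape of that concrete, here's a deliberately naive sketch, in plain Ruby, of what such an interface could look like. The "Foo" name is the hypothetical gem used on the following slides, and everything here is illustrative; a real implementation needs double-buffering and the statistics we'll get to later.

# Naive sketch of the "primitives + registry, nothing more" idea.
require 'thread'

module Foo
  class Registry
    def self.default
      @default ||= new
    end

    def initialize
      @mutex   = Mutex.new
      @metrics = {}
    end

    # Look up (or lazily create) a named primitive.
    def fetch(name, klass)
      @mutex.synchronize { @metrics[name] ||= klass.new }
    end

    # Iterate over a snapshot of the registered metrics.
    def each(&block)
      @mutex.synchronize { @metrics.dup }.each(&block)
    end
  end

  # The same pattern would be repeated for meter/histogram/timer.
  def self.counter(name)
    Registry.default.fetch(name, Counter)
  end

  class Counter
    def initialize
      @count = 0
    end

    def increment(n = 1)
      @count += n
    end

    def decrement(n = 1)
      @count -= n
    end

    attr_reader :count
  end
end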

Slide 31

What are these primitives? So I keep saying “primitives”. What do I mean by that? Let’s take a look at some of the questions we might have about our code running in production and how we might use our instrumentation to answer them. Assume for now we have a gem called “Foo” that implements these primitives ;-)

Slide 32

How do we count things?

Slide 33

How many jobs are in the queue?

Slide 34

Counter

# enqueue a job
Foo.counter('mygem.jobs').increment

# complete a job
Foo.counter('mygem.jobs').decrement
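In context, that might look something like the following hypothetical worker (assuming the Foo interface sketched earlier); the counter tracks queue depth as jobs come and go.

require 'thread'

class JobWorker
  def initialize
    @queue = Queue.new
  end

  def enqueue(job)
    @queue << job
    Foo.counter('mygem.jobs').increment
  end

  def perform_next
    job = @queue.pop
    Foo.counter('mygem.jobs').decrement
    job.call
  end
end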

Slide 35

How do we track rates?

Slide 36

How many reqs/sec are we handling?

Slide 37

Meter

if status >= 500
  Foo.meter("rack.resp.error").mark
elsif status == 200
  Foo.meter("rack.resp.success").mark
elsif status == 404
  Foo.meter("rack.resp.not_found").mark
...
end

Here's an example of instrumentation in a rack middleware that tracks the rates of different response types.
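For context, a sketch of what a complete middleware around that snippet might look like; the class name is illustrative and it assumes the hypothetical Foo interface.

class ResponseMeter
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)

    if status >= 500
      Foo.meter('rack.resp.error').mark
    elsif status == 200
      Foo.meter('rack.resp.success').mark
    elsif status == 404
      Foo.meter('rack.resp.not_found').mark
    end

    [status, headers, body]
  end
end

# in config.ru:
#   use ResponseMeter
#   run MyApp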

Slide 38

How do we track distributions?

Slide 39

How big are those requests?

Slide 40

Histogram

len = headers.fetch('Content-Length', 0).to_i
Foo.histogram('rack.resp.len').update(len)

In most web applications the "average" Content-Length is going to be meaningless. So we'll use a histogram to track more useful metrics about the Content-Length of our responses.

Slide 41

How do we time something?

Slide 42

How long are requests taking?

Slide 43

Timer

# block form
Foo.timer('mygem.req').time do
  process_request
end

# explicit form
t = Foo.timer('mygem.req').time
process_request
t.stop

Slide 44

That's It! So given tooling with the proper abstractions, that's truly all developers need to know to comprehensively instrument their code! There's no reason instrumentation can't become as widespread as unit testing.

Slide 45

How do we get the data? So now that the devs are instrumenting their code, how do we pull out the resulting metrics so we can aggregate them across processes/hosts, persistently store them, and analyze them with our favorite tools for those tasks?

Slide 46

Registry ‣ in-memory store ‣ simple iteration ‣ thread-safe ‣ double-buffered We use another simple abstraction called a "registry". It's basically just a list of the objects implementing the different primitives; we can iterate over it and query each object for the current state of its metric. It's thread-safe and double-buffered to prevent any performance impact.

Slide 47

Registry

Foo::Registry.default.each do |n, m|
  case m
  when Foo::Counter
    ...
  when Foo::Meter
    ...
  when Foo::Histogram
    ...
  when Foo::Timer
    ...
  end
end

Based on each metric's type, we can pull different kinds of data out and then do whatever we want with it.

Slide 48

Reporters ‣ separate gems ‣ simple to build ‣ console, jmx, logs, statsd, librato This registry abstraction makes it trivial to build adaptors to connect our metrics to whatever other tools we want. Here’s just a partial list of examples.
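To show how little a reporter needs to be, here's a sketch of a trivial "console" reporter built on the registry abstraction, still assuming the hypothetical Foo API; the accessor names (count, one_minute_rate, mean, max) are illustrative, and a real reporter would run this loop on a background thread at a fixed interval.

# Dump the current state of every registered metric to an IO object.
def report(out = $stdout)
  Foo::Registry.default.each do |name, metric|
    case metric
    when Foo::Counter
      out.puts "#{name} count=#{metric.count}"
    when Foo::Meter
      out.puts "#{name} one_minute_rate=#{metric.one_minute_rate}"
    when Foo::Histogram
      out.puts "#{name} mean=#{metric.mean} max=#{metric.max}"
    when Foo::Timer
      out.puts "#{name} count=#{metric.count} mean=#{metric.mean}"
    end
  end
end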

Slide 49

How do we interpret the data? Up until now I've glossed over the details of what each primitive tracks, to illustrate that the proper abstractions mean we don't need much (if any) up-front investment to start instrumenting our code. But if we're writing a reporter or interpreting the resulting metrics in operations, we need to know a little more about what each type provides.

Slide 50

There will be math. WARNING! It's also important to note that these primitives are actually somewhat complex under the hood. Ideally we'll standardize on a very small number of implementations (hopefully just one) that actually get all the details correct. The reason for this complexity is ...

Slide 51

Streaming data is hard. Our primitives are aggregating, in each process, a continuous stream of events across whatever our unit of work is, e.g. web requests, jobs, etc.

Slide 52

Challenges ‣ large number of samples ‣ recency of results matters ‣ averages suck So we need some way to capture representative numbers over millions or even billions of events. Stationary techniques (as taught in basic stats classes) that require access to all of the samples are not applicable. Furthermore, we're primarily interested in the "recent" state of the metrics, since that's what's affecting our business "right now".
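To make "averages suck" concrete, here's a tiny worked example with made-up numbers: 95 requests at 10ms plus 5 requests at 2s. The mean looks tolerable, while the 99th percentile shows what the slowest users actually experience.

latencies = Array.new(95, 0.010) + Array.new(5, 2.0)          # seconds
mean = latencies.inject(:+) / latencies.size                  # => ~0.11 (110ms)
p99  = latencies.sort[(0.99 * (latencies.size - 1)).floor]    # => 2.0  (2s)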

Slide 53

Counter ‣ absolute count We'll start with something easy. A counter is just an absolute count that we increment and decrement with each event. It's up to whatever tooling samples this counter to compute derivatives and detect resets.
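As a sketch of what that consumer-side work can look like for a monotonically increasing counter (counters used as gauges, e.g. queue depth, legitimately go down and need different handling):

# Derive a rate from two successive samples, treating a decrease as a restart.
def counter_rate(prev_value, curr_value, interval_seconds)
  delta = curr_value - prev_value
  delta = curr_value if delta < 0   # reset detected: assume restart from zero
  delta.to_f / interval_seconds
end

counter_rate(100, 160, 60)  # => 1.0 events/sec
counter_rate(160, 30, 60)   # => 0.5 events/sec across a restart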

Slide 54

Meter ‣ 1 second rates ‣ mean rate ‣ 1/5/15m EWMA rate ‣ count Meters track rates. We track a 1s rate, because that's really the largest unit of time acceptable for discussing the throughput of a computer (minutes/hours just pump up vanity metrics). We track the mean rate, but it doesn't help with understanding what the throughput is doing "now". So we also use exponentially weighted moving averages.
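Here's a minimal sketch of the exponentially weighted moving average idea behind those 1/5/15-minute rates (the same trick as the UNIX load average); the class and its tick interval are illustrative, and a real meter ticks on a timer.

class EWMA
  # window and interval are in seconds.
  def initialize(window, interval = 5.0)
    @alpha    = 1.0 - Math.exp(-interval / window)
    @interval = interval
    @rate     = nil
    @pending  = 0
  end

  # Record n events.
  def mark(n = 1)
    @pending += n
  end

  # Called once per interval: fold the instantaneous rate into the average.
  def tick
    instant  = @pending / @interval
    @pending = 0
    @rate    = @rate.nil? ? instant : @rate + @alpha * (instant - @rate)
  end

  def rate
    @rate || 0.0
  end
end

one_minute = EWMA.new(60.0)   # 1-minute rate, ticked every 5 seconds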

Slide 55

Histogram ‣ mean/min/max/variance ‣ quantiles (e.g. 75/95/99%) ‣ reservoir sampling ‣ forward-decaying priority sampling Histograms allow us to track the distribution of what we’re measuring and understand how our quantiles are performing e.g. what is the 95th and 99th percentile response latency on a particular resource in our API. We use reservoir sampling to generate these percentiles without requiring us to store the complete set of samples. We use forward-decaying priority sampling to ensure that our distribution represents a more recent state of the process.
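For intuition, here's a sketch of plain reservoir sampling (Vitter's Algorithm R): keep a fixed-size, uniformly random sample of an unbounded stream. The forward-decaying variant mentioned above additionally biases the reservoir toward recent values; that bookkeeping is omitted here, and the quantile estimate is deliberately crude.

class Reservoir
  def initialize(size = 1028)
    @size   = size
    @count  = 0
    @values = []
  end

  def update(value)
    @count += 1
    if @values.size < @size
      @values << value
    else
      idx = rand(@count)             # uniform index in 0...@count
      @values[idx] = value if idx < @size
    end
  end

  # Approximate quantile from the current sample, e.g. quantile(0.95).
  def quantile(q)
    sorted = @values.sort
    return nil if sorted.empty?
    sorted[(q * (sorted.size - 1)).floor]
  end
end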

Slide 56

Timer ‣ times an operation ‣ histogram of timings ‣ meter of operation rate Timers are just built on top of the meter and histogram abstractions. We can time an operation and have access to both the rate of the operation and the distribution of its duration.
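A rough sketch of that composition, again assuming the hypothetical Foo primitives; the class name is illustrative and durations are recorded in milliseconds.

class SimpleTimer
  def initialize(meter, histogram)
    @meter     = meter
    @histogram = histogram
  end

  # Time a block: mark the meter and record the duration in the histogram.
  def time
    start = Time.now
    yield
  ensure
    @meter.mark
    @histogram.update((Time.now - start) * 1000.0)
  end
end

timer = SimpleTimer.new(Foo.meter('mygem.req.rate'),
                        Foo.histogram('mygem.req.time'))
timer.time { sleep 0.01 }   # stands in for real work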

Slide 57

How do we get started? While I’d love to wave a magic wand and have support for this in the standard library, standardization typically only happens after adoption. So we’re going to need a gem of some kind.

Slide 58

Metriks ‣ instrumentation primitives ‣ thread-safe ‣ performant ‣ metriksd ‣ github.com/eric/metriks ‣ github.com/lmarburger/metriks-middleware Luckily, a significant amount of work has already been done towards such a solution in the Metriks gem. The README (as of this talk) is terrible, however; without looking more deeply you'd think it ties you into certain tools at the other layers (the anti-pattern). This is actually not the case.
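For the impatient, usage looks roughly like the following; double-check the project README for the exact reporter classes and options, since they may have changed, and the logger reporter shown is just one of several.

require 'logger'
require 'metriks'
require 'metriks/reporter/logger'

Metriks.counter('mygem.jobs').increment
Metriks.meter('rack.resp.success').mark
Metriks.timer('mygem.req').time { sleep 0.01 }   # stands in for real work

# Periodically dump every registered metric to a log; other reporters
# ship the same registry to statsd, Graphite, Librato, etc.
Metriks::Reporter::Logger.new(:logger => Logger.new($stdout), :interval => 60).start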

Slide 59

Future Work ‣ clarify purpose ‣ extract reporters ‣ codify naming practices ‣ testing metrics ‣ base threading support There’s still a lot of work left to do, and these are some of the areas we’ll be pushing on (like improving the README). We’d love to have you join the conversation at Github if you’re interested!

Slide 60

‣ instrumentation == tests ‣ decompose the problem ‣ zero coupling ‣ don't underestimate complexity Please help spread the word: instrumentation doesn't have to be hard, and we shouldn't be shipping code without it! Remember, however, that while the abstractions are extremely simple, a correct implementation is relatively sophisticated, so please give the existing ones (e.g. Metriks, metrics.codahale.com) a try.

Slide 61

Thanks ‣ @lindvall ‣ @coda ‣ @lmarburger ‣ @nextmat ‣ @headius ‣ @tmm1 I'd like to thank @coda and @lindvall for all the work they've done on Metrics and Metriks respectively. Also @lmarburger for putting together a neat rack middleware on top of Metriks. Discussions with @nextmat, @tmm1, and @headius all helped shape this talk as well.

Slide 62

fin