Ruby Monitoring State of the Union

Explains why Rubyists should view code instrumentation as imperative as unit testing, the current state of Ruby instrumentation, and what we need to do to make this practice much more widespread. Presented at RubyConf 2012: http://www.confreaks.com/videos/1276-rubyconf2012-ruby-monitoring-state-of-the-union

Joseph Ruscio

November 01, 2012

Transcript

  1. Ruby Monitoring
    State of the Union


  2. ‣ @josephruscio
    ‣ Co-Founder/CTO Librato
    ‣ I <3 graphs



  3. What’s all this fuss
    about monitoring?
People have been monitoring computers forever, right? Why is everyone talking about
    monitoring all of a sudden? What changed? To answer that question, it’s instructive to
    examine what’s changed in building/delivering SaaS over the last decade.


  4. SaaS 2002
    ‣ Seed Round: $1.5M USD
    ‣ Infrastructure: CAPEX
    ‣ Dedicated Ops Team
    ‣ Custom Software Stack
    To operate a SaaS business 10 years ago, you had to go get a pile of money and make an
    up-front investment in physical hardware. You’d hire a dedicated Ops team to run it, and most (if
    not all) of your software stack would be custom-built. Your primary monitoring concern
    outside of hardware failure would be capacity planning i.e. when do we need to buy another
    rack of servers to step-wise increase capacity.


  5. SaaS 2012
    ‣ Seed Round: $20K USD
    ‣ Infrastructure: OPEX
    ‣ <=1 Ops Person
    ‣ OSS, External Services
    Ten years later, we’re in a dramatically different situation. Seed rounds are orders of
    magnitude smaller and infrastructure is deployed on demand, “in the cloud”. Teams are lean
    so many small startups don’t even have a single dedicated Ops person. Products are built
    primarily with off-the-shelf OSS and external services.


  6. Modern Infrastructure
    ‣ agile
    ‣ ephemeral
    ‣ more change, worse tools!
    The result of this shift is that we now have “agile infrastructure” that’s paid for as OPEX
    and can be rapidly modified/scaled/adjusted to meet shifting business needs. The
    trade-off we make to achieve this is that it’s ephemeral. This means that we’re now operating
    under more challenging conditions, with worse tooling.


  7. Cont. Deployment
    ‣ continuous integration
    ‣ one-click deploy
    ‣ feature-flagging
    ‣ monitoring
    ‣ alerting
    The most important change however is the growing shift to Continuous Deployment. I’m a
    CD zealot. I firmly believe that in the future CD will be the default way to ship software.
    Monitoring and alerting are key tenets of CD because code is constantly changing in production.
    The only way to protect against regressions and validate hypotheses about the effects of
    changes is to observe the code running in production.


  8. I thought I should perform a minimal amount of research to see if there’s anything to back
    that theory up. My altogether unscientific approach (i.e. Google Trends) found this.
    Interesting.


  9. What does
    monitoring get us?
    So let’s whet your appetite with a few concrete examples.


  10. [graph: queue depth before and after a bad deploy]
    You can use monitoring to detect regressions from new code. Anyone who’s ever managed a
    queuing system probably recognizes this graph. Something shipped and “stuck” the queue at
    ~1:00. While an alert went off and the situation was diagnosed, the queue continued to grow.
    At ~2:00 a fix was deployed, and the queue drained back to its normal state.


  11. [graph: per-host average latency across a tier of web servers]
    You can use monitoring to detect regressions/failures in hardware. Here’s a graph showing
    the per-host average latency of a particular operation across a tier of web servers. Can you
    spot the bad host?


  12. [graph: operation latency before and after a performance fix]
    You can use monitoring to validate that performance tuning worked. This graph shows the
    latency of an operation before/after shipping some performance improvements.


  13. [graph: Cassandra read operations vs. a recurring batch job]
    You can use monitoring to correlate shifts in behavior against possible causes. This is a
    graph of read operations in a Cassandra ring correlated against a recurring batch job that
    pulls data out of the ring every 15 minutes to perform some work on it.


  14. [graph: malformed API requests before and after the leap second]
    You can use monitoring to detect changes in user behavior. Here’s a graph showing the
    number of malformed requests received in our API before and after the leap second. It
    apparently broke a lot of our users’ code.


  15. [graph: adoption of a newly shipped feature]
    You can also use monitoring to validate that the new features you ship are in fact being used.
    When we make a decision to build/ship a feature, it’s a hypothesis that we’re adding business
    value. You can also correlate use of new features against performance measurements to find
    new bottlenecks it may have introduced.


  16. Chunky
    Bacon!!
    And if nothing else, you can use monitoring to find chunky bacon in the wild. Which is the
    best kind of bacon.


  17. ‣ detect regressions
    ‣ validate new hypotheses
    ‣ increase resilience to change
    ‣ sound familiar?
    So we can use monitoring to detect regressions, validate our hypotheses about running code,
    and increase our overall resilience to change. This should sound vaguely familiar.


  18. Instrumentation
    is
    Unit Testing
    for
    Operations


  19. Would we ship
    code w/o tests?
    I hope not.


  20. We shouldn’t ship
    code w/o
    instrumentation.
    You wouldn’t ship gems/apps without unit tests. You shouldn’t ship them without
    instrumentation.


  21. ‣ Continuous Deployment
    ‣ Service Oriented Architecture
    ‣ Devs are Domain Experts
    If the future’s common architecture for SaaS is continuous deployment of small, focused
    services, then instrumentation is a necessity, not a luxury. Observation is the only way to truly
    know what your code does in production and how it’s affected by change. Furthermore, the
    devs writing the code are the actual domain experts, and the ones who should be using that
    knowledge to instrument their code in the right places. It’s an anti-pattern to throw code
    over the fence and have the operators guessing what they should be measuring.


  22. So what is the
    state of the union?
    So if this is the case, why is instrumentation so uncommon? Tl;dr, we have work to do. If you
    google “monitoring ruby” and casually start searching for how to do this, you get hit with
    something like this.


  23. Graphite
    StatsD
    OpenTSDB
    Cube
    d3.js
    An overwhelming list of agents, libraries, vendors, etc. IMHO some of these are really good,
    some are not so good. It can be very hard for a developer to know where to get started.
    And unfortunately, many of them push you towards a particular anti-pattern.


  24. Anti-Pattern
    Project X: • Custom Stats • MySQL threads • VMstat • .... → Storage
    Project Y: • CPU • Interface • Memory • Ping • Battery charge • .... → Storage
    Project Z: • Ping • CPU • Memory • Disks • SNMP Service • .... → Storage
    ...
    The problem is that many of these solutions are vertically integrated monoliths. They give
    you an agent, some kind of a storage backend, and some interface to visualize the data.
    Unfortunately you usually end up requiring several to collect all the different data you need.
    So now you have N different UIs to learn and N different data silos that cannot be easily
    correlated.


  25. Current Status
    ‣ New Relic
    ‣ Monolithic OSS
    ‣ statsd-ruby + statsd
    ‣ Librato et al.
    In practice what we see today in the Ruby community is a mix of New Relic, monolithic OSS
    silos built on in-memory databases, or statsd pushing to something like Librato or Graphite.
    Often we’ll see users with several of these. So how do we improve this?


  26. We need to
    decompose the
    problem


  27. Instrumentation
    Storage
    Aggregation
    Analysis
    Monitoring can be broken down into a series of steps e.g. collecting the metrics, aggregating
    those measurements across processes and hosts, storing the resulting aggregates, and
    visualizing/analyzing the data. Today we’re going to focus on just instrumentation, because
    that’s where we as Ruby developers need to improve the most.


  28. Devs Require
    ‣ concise primitives
    ‣ minimal performance impact
    ‣ minimal dependencies
    We need a small set of instrumentation primitives we can use to answer all the questions we
    might have about our code. It needs to be performant so we can run the instrumentation in
    production. Most importantly, it needs to be completely decoupled from all the other
    components of monitoring we discussed previously. And that’s because ...


  29. Ops Require
    ‣ flexibility at all other layers
    ‣ simple introspection
    ‣ simple capture
    How we visualize/analyze our metrics is an intensely personal decision. The operators of our
    code (even if it’s just us wearing our ops hat) need complete flexibility in how they
    consume/manage/analyze the results of our instrumentation. Any requirement our
    instrumentation places on the operators is going to limit its utility to those who agree with
    our choices. So we will have none.


  30. Ideal Instrumentation
    ‣ implements primitives
    ‣ captures state
    ‣ nothing more!
    ‣ metrics.codahale.com
    So our ideal solution provides the primitives through a simple interface, captures a cross-
    request aggregate of each metric in memory, provides another simple abstraction to access
    the current state of each metric AND NOTHING ELSE. We’re going to look for inspiration in
    @coda’s metrics library for the JVM. It defines a powerful set of primitives and in polyglot
    shops like Librato we can now use a common vocabulary when discussing our JVM/Ruby
    services and how they interact.


  31. What are these
    primitives?
    So I keep saying “primitives”. What do I mean by that? Let’s take a look at some of the
    questions we might have about our code running in production and how we might use our
    instrumentation to answer them. Assume for now we have a gem called “Foo” that
    implements these primitives ;-)


  32. How do we
    count things?


  33. How many jobs are
    in the queue?


  34. Counter
    #enqueue a job
    Foo.counter('mygem.jobs').increment
    #complete a job
    Foo.counter('mygem.jobs').decrement


  35. How do we
    track rates?


  36. How many
    reqs/sec are we
    handling?


  37. Meter
    if status >= 500
      Foo.meter("rack.resp.error").mark
    elsif status == 200
      Foo.meter("rack.resp.success").mark
    elsif status == 404
      Foo.meter("rack.resp.not_found").mark
    ...
    end
    Here’s an example of instrumentation in a rack middleware that tracks the rates of different
    response types.
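
    To make that concrete, here's a rough sketch of a complete middleware in the same spirit,
    still assuming the hypothetical Foo gem from earlier (the real-world equivalent is
    @lmarburger's metriks-middleware):

    # Sketch of a Rack middleware built on the hypothetical Foo primitives.
    class Instrumenter
      def initialize(app)
        @app = app
      end

      def call(env)
        response = nil
        # Time the whole request; both its rate and its latency fall out of the timer.
        Foo.timer('rack.req').time do
          response = @app.call(env)
        end
        status = response[0]
        # Mark a meter per response class (2xx/4xx/5xx).
        Foo.meter("rack.resp.#{status / 100}xx").mark
        response
      end
    end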


  38. How do we
    track distributions?


  39. How big are those
    requests?


  40. Histogram
    len = headers.fetch('Content-Length', 0).to_i
    Foo.histogram('rack.resp.len').update(len)
    In most web applications the “average” Content-Length is going to be meaningless. So we’ll
    use a histogram to track more useful metrics about the Content-Length of our responses.
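
    A tiny illustration (made-up numbers) of why the mean misleads for skewed sizes, and why we
    care about quantiles instead:

    sizes = [512] * 99 + [10_000_000]        # 99 small responses, one huge one
    mean  = sizes.inject(:+) / sizes.size    # ~100,506 bytes -- looks "typical", isn't
    p50   = sizes.sort[sizes.size / 2]       # 512 bytes -- what most clients actually saw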


  41. How do we
    time something?


  42. How long are
    requests taking?


  43. Timer
    # block form
    Foo.timer('mygem.req').time do
      process_request
    end

    # explicit form
    t = Foo.timer('mygem.req').time
    process_request
    t.stop


  44. That’s It!
    So given tooling with the proper abstractions, that’s truly all developers need to know to
    comprehensively instrument their code! So there’s no reason instrumentation can’t become
    as widespread as unit testing.


  45. How do we
    get the data?
    So now that the devs are instrumenting their code, how do we pull out the resulting metrics so
    we can aggregate them across processes/hosts, persistently store them, and analyze them
    using our favorite tools for those tasks?


  46. Registry
    ‣ in-memory store
    ‣ simple iteration
    ‣ thread-safe
    ‣ double-buffered
    We use another simple abstraction called a “registry”. It’s basically just a list of the objects
    implementing the different primitives; we can iterate over it and query each object for the
    current state of its metric. It’s thread-safe and double-buffered to prevent any
    performance impacts.


  47. Registry
    Foo::Registry.default.each do |n, m|
      case m
      when Foo::Counter
        ...
      when Foo::Meter
      when Foo::Histogram
      when Foo::Timer
      end
    end
    Based on each metric’s type, we can pull different kinds of data out and then do whatever
    we want with it.


  48. Reporters
    ‣ separate gems
    ‣ simple to build
    ‣ console, jmx, logs, statsd, librato
    This registry abstraction makes it trivial to build adaptors to connect our metrics to whatever
    other tools we want. Here’s just a partial list of examples.
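
    As a sketch of how little a reporter needs, here's a hypothetical console reporter built on
    nothing but that registry iteration (reader methods like #count and #one_minute_rate are
    illustrative assumptions, not a documented API):

    # Walk the registry on an interval and print one line per metric.
    class ConsoleReporter
      def initialize(interval = 60)
        @interval = interval
      end

      def start
        Thread.new do
          loop do
            sleep @interval
            Foo::Registry.default.each do |name, metric|
              case metric
              when Foo::Counter   then puts "#{name} count=#{metric.count}"
              when Foo::Meter     then puts "#{name} rate_1m=#{metric.one_minute_rate}"
              when Foo::Histogram then puts "#{name} mean=#{metric.mean}"
              when Foo::Timer     then puts "#{name} mean=#{metric.mean} rate_1m=#{metric.one_minute_rate}"
              end
            end
          end
        end
      end
    end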


  49. How do we
    interpret the data?
    Up until now I’ve glossed over the details of what each primitive tracks to illustrate that the
    proper abstractions mean we don’t need much (if any) up-front investment to start
    instrumenting our code. But if we’re writing a reporter or interpreting the resulting metrics in
    operations, we need to know a little more about what each type provides.


  50. WARNING!
    There will be math.
    It’s also important to note that these primitives are actually somewhat complex underneath
    the hood. Ideally we’ll standardize on a very small number of implementations (hopefully just
    1) that actually get all the details correct. The reason for this complexity is ...


  51. Streaming data is
    hard.
    Our primitives are actually aggregating, in each process, a continuous stream of events across
    whatever our unit of work is, e.g. web requests, jobs, etc.


  52. Challenges
    ‣ large number of samples
    ‣ recency of results matters
    ‣ averages suck
    So we need some way to capture representative numbers over millions or even billions of
    events. So stationary techniques (as taught in basic stats classes) that require us to have
    access to all of the samples are not applicable. Furthermore, we’re primarily interested in the
    “recent” state of the metrics, since that’s what’s affecting our business “right now”.


  53. Counter
    ‣ absolute count
    We’ll start with something easy. A counter is just an absolute count that we increment and
    decrement with each event. It’s up to whatever tooling samples this counter to compute
    derivatives and detect resets.
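
    For example, a sampler polling the counter once a minute could turn the absolute count into
    a per-second rate (and spot restarts) roughly like this:

    # Turn two successive absolute counts into a per-second rate.
    # A drop in the value means the process restarted (counter reset).
    def rate(previous, current, interval_seconds)
      delta = current - previous
      delta = current if delta < 0   # reset detected: count from the new value
      delta.to_f / interval_seconds
    end

    rate(1_000, 1_150, 60)   # => 2.5 jobs/sec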


  54. Meter
    ‣ 1 second rates
    ‣ mean rate
    ‣ 1/5/15m EWMA rate
    ‣ count
    Meters track rates. We track a 1s rate, because that’s really the largest unit of time acceptable
    to discuss the throughput of a computer (minutes/hours are just pumping up vanity metrics).
    We track the mean rate, but it doesn’t help with understanding what the throughput is doing
    “now”. So we use exponentially weighted moving averages.
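
    A minimal sketch of how such an EWMA can be maintained; the 5-second tick and 1-minute
    window here follow the convention popularized by metrics.codahale.com:

    # Exponentially weighted moving average over a 1-minute window,
    # ticked every 5 seconds by a background timer.
    class EWMA
      INTERVAL = 5.0                               # seconds between ticks
      ALPHA    = 1 - Math.exp(-INTERVAL / 60.0)    # 1-minute decay factor

      def initialize
        @rate      = 0.0
        @uncounted = 0
      end

      def mark(n = 1)
        @uncounted += n
      end

      def tick
        instant_rate = @uncounted / INTERVAL
        @uncounted   = 0
        @rate       += ALPHA * (instant_rate - @rate)
      end

      def rate_per_second
        @rate
      end
    end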


  55. Histogram
    ‣ mean/min/max/variance
    ‣ quantiles (e.g. 75/95/99%)
    ‣ reservoir sampling
    ‣ forward-decaying priority
    sampling
    Histograms allow us to track the distribution of what we’re measuring and understand how
    our quantiles are performing e.g. what is the 95th and 99th percentile response latency on a
    particular resource in our API. We use reservoir sampling to generate these percentiles
    without requiring us to store the complete set of samples. We use forward-decaying priority
    sampling to ensure that our distribution represents a more recent state of the process.
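
    For intuition, here's a sketch of the simplest (uniform) reservoir, Vitter's Algorithm R; the
    forward-decaying variant replaces the random index with a time-weighted priority so newer
    samples dominate the distribution:

    # Keep a fixed-size, statistically representative sample of an
    # unbounded stream, then read quantiles off the sample.
    class UniformReservoir
      def initialize(size = 1028)   # 1028 is a commonly used default size
        @size   = size
        @values = []
        @count  = 0
      end

      def update(value)
        @count += 1
        if @values.length < @size
          @values << value
        else
          i = rand(@count)              # replace an existing slot with probability size/count
          @values[i] = value if i < @size
        end
      end

      def quantile(q)
        return nil if @values.empty?
        sorted = @values.sort
        sorted[(q * (sorted.length - 1)).round]
      end
    end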


  56. Timer
    ‣ times an operation
    ‣ histogram of timings
    ‣ meter of operation rate
    Timers are just built on top of the meter and histogram abstractions. We can time an
    operation and have access to both the rate of the operation and the distribution of its
    duration.
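
    The composition is almost mechanical; a sketch:

    # A timer is just a meter (how often) plus a histogram (how long).
    class Timer
      def initialize(meter, histogram)
        @meter     = meter
        @histogram = histogram
      end

      def time
        start  = Time.now
        result = yield
        @meter.mark
        @histogram.update(Time.now - start)
        result
      end
    end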


  57. How do we
    get started?
    While I’d love to wave a magic wand and have support for this in the standard library,
    standardization typically only happens after adoption. So we’re going to need a gem of some
    kind.


  58. Metriks
    ‣ instrumentation primitives
    ‣ thread-safe
    ‣ performant
    ‣ metriksd
    ‣ github.com/eric/metriks
    ‣ github.com/lmarburger/metriks-middleware
    Luckily, a significant amount of work has already been done towards such a solution in the
    Metriks gem. The README (as of this talk) is terrible, however; without looking more deeply
    you’d think it ties you to certain tools at the other layers (the anti-pattern). This is actually
    not the case.
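
    On the instrumentation side, usage looks essentially like the Foo examples above; a quick
    sketch (check the Metriks README for the exact helper names):

    require 'metriks'

    Metriks.counter('mygem.jobs').increment              # count things
    Metriks.meter('rack.resp.success').mark              # track rates
    Metriks.histogram('rack.resp.len').update(1024)      # track distributions
    Metriks.timer('mygem.req').time { process_request }  # time things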


  59. Future Work
    ‣ clarify purpose
    ‣ extract reporters
    ‣ codify naming practices
    ‣ testing metrics
    ‣ base threading support
    There’s still a lot of work left to do, and these are some of the areas we’ll be pushing on (like
    improving the README). We’d love to have you join the conversation on GitHub if you’re
    interested!


  60. ‣ instrumentation == tests
    ‣ decompose the problem
    ‣ zero coupling
    ‣ don’t underestimate complexity
    Please help spread the word, instrumentation doesn’t have to be hard and we shouldn’t be
    shipping code without it! Remember, however, that while the abstractions are extremely
    simple, the right implementation is relatively sophisticated, so please give the existing ones
    (e.g. Metriks, metrics.codahale.com) a try.


  61. Thanks
    ‣ @lindvall
    ‣ @coda
    ‣ @lmarburger
    ‣ @nextmat
    ‣ @headius
    ‣ @tmm1
    I’d like to thank @coda and @lindvall for all the work they’ve done on Metrics and Metriks
    respectively. Also @lmarburger for putting together a neat rack middleware on top of Metriks.
    Discussions with @nextmat/@tmm1/@headius all helped shape this talk as well.


  62. fin
