
Alex Petrov - Eventoverse: near-realtime metric collection and processing

MunichDataGeeks

August 20, 2013

Transcript

  1. The situation has provided a cue; this cue has given the expert access to information stored in memory, and the information provides the answer. Intuition is nothing more and nothing less than recognition.

  2. Valid intuitions develop when experts have learned to recognize familiar elements in a new situation and to act in a manner that is appropriate to it. Good intuitive judgments come to mind with the same immediacy as “doggie!”

  3. When faced with a difficult question, we often answer an easier one instead, usually without noticing the substitution. Seeing an easy pattern gives us an easy decision about more complex problems, intuitively.

  4. We easily think associatively, we think metaphorically, we think causally, but statistics requires thinking about many things at once, which is something that intuition is not designed to do.
  5. Expectations
     • Easy to configure (ad-hoc vs. post-hoc)
     • Very fast and easy access to recent events
     • Near real-time analytics and report generation
     • Easy to go back in time
     • Application growth doesn't impact monitoring performance
  6. Expectations
     • Pipelining: the ability to inject additional processing units
     • Arbitrary data format (use what you want)
     • Dumb client, implementable in minutes
     • Wide choice of transports
     • Direct access to the data
     • Data persistence
  7. Expectations
     • Backend is provided, front-end is customizable
     • Data format intended for graph visualization
     • WebSockets: get stuff pushed out
     • Re-broadcast configuration
     • Moving parts, implementable interfaces for everything
  8. Expectations
     • Extensible alerting
     • Alerting channels (Jabber, IRC, email)
     • Alerting rules
     • Persistence to the database of your choice
     • Pluggable processing backends
     • Self-declaring services (IService with start/stop/pause)
  9. [Architecture diagram: reporters send events over UDP, TCP, or AMQP to Collectors; Collectors feed the Processing Unit, which fans out to a Persistent Store (history graphs, table data), a State Machine driving Alerts, a Reduce Engine (ad-hoc queries and analytics), Latest Events (a capped in-memory store backing real-time graphs and real-time table data), and a Console (nREPL, custom consumers, real-time analysis).]
  10. Collector
      • I like JSON, you like JSON
      • That guy likes Protocol Buffers
      • Another guy likes Thrift
      What do we do here?
  11. Collector
      • The collector listens for incoming connections
      • De-serializes payloads and pushes them downstream
      • Write your own collector, implement your own client, and off you go
      • Choose the transport you prefer, tune transport reliability
      • Leave space for backpressure (given your protocol is reliable) to get some flow control (a sketch follows below)
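     A minimal sketch of that collector shape, assuming JSON over UDP and Cheshire for parsing. This is an illustration of the idea, not Eventoverse's actual API; the name `start-collector` is mine, and the parse step is where you would swap in Protocol Buffers or Thrift.

       (ns collector-sketch
         (:require [cheshire.core :as json])
         (:import [java.net DatagramSocket DatagramPacket]))

       (defn start-collector
         "Listens on `port`, de-serializes each datagram and passes the
          resulting event map to `downstream`."
         [port downstream]
         (let [socket (DatagramSocket. port)
               buf    (byte-array 65536)]
           (future
             (while true
               (let [packet (DatagramPacket. buf (alength buf))]
                 (.receive socket packet)
                 (-> (String. (.getData packet) 0 (.getLength packet) "UTF-8")
                     (json/parse-string true)  ; swap in Thrift/protobuf here
                     (downstream)))))
           socket))

       ;; Usage: (start-collector 5555 println)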
  12. Processing Unit
      • Thought experiment: pipelining vs. black box
      • The black box computes everything itself and returns an output
      • Pipelining uses several interchangeable processing units
      • 1 person for the black box
      • 5 people for the pipeline
  13. Processing Unit
      • The black-box person gets the payload and holds two accumulators: the sums of factorials of even and odd numbers, kept separately
      • For the pipeline:
        • 1 person is a splitter
        • 2 people compute factorials
        • 2 people compute sums
  14. Processing Unit
      • Now imagine we're getting a thousand requests per second
      • How do we reuse the black-box code? It's not pluggable
      • How do we scale the black box? (a sketch of both variants follows)
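     To make the thought experiment concrete, here is a sketch in Clojure (function names are mine, not from the talk). The black box owns both accumulators and is one opaque function; the pipeline does the same work with small, reusable stages that can be scaled independently.

       (defn factorial [n] (reduce *' 1 (range 1 (inc n))))

       ;; Black box: one function computes everything and returns the output.
       (defn black-box [numbers]
         (reduce (fn [acc n]
                   (update acc (if (even? n) :even :odd) + (factorial n)))
                 {:even 0 :odd 0}
                 numbers))

       ;; Pipeline: a splitter and interchangeable stages.
       (defn split-stage [numbers]                         ; 1 person: splits the stream
         (group-by #(if (even? %) :even :odd) numbers))

       (defn factorial-stage [nums] (map factorial nums))  ; 2 people: factorials
       (defn sum-stage [nums] (reduce +' 0 nums))          ; 2 people: sums

       (defn pipeline [numbers]
         (let [{:keys [even odd]} (split-stage numbers)]
           {:even (sum-stage (factorial-stage even))
            :odd  (sum-stage (factorial-stage odd))}))

       ;; (= (black-box [1 2 3 4]) (pipeline [1 2 3 4])) ;=> true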
  15. Processing Unit
      • Define an `emitter`
      • An emitter has many selectors; when a selector matches, a custom function is triggered
  16. Processing Unit
      • Filter: re-broadcasts the events for which `filter-fn` returns true
      • Transformer: applies `transform-fn` and re-broadcasts the results
      • Aggregator: initialized with `initial-state`, applies `aggregate-fn` to the current state and the incoming tuple
      • Rebroadcast: distributes events to several types
      • Splitter: splits the stream into parts based on `split-fn`
      • Rollup: a timed window that accumulates entries until it times out, then re-transmits them
      • Buffer: a buffer with a given `capacity`; transmits and resets the buffer on overflow
      (sketches of a few of these follow below)
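     Minimal sketches of three of those unit shapes (my own, not the project's implementation): each unit is just a function wrapping a `downstream` function, which is what makes them interchangeable and composable.

       (defn filter-unit
         "Stateless: re-broadcasts only events for which `filter-fn` is true."
         [filter-fn downstream]
         (fn [event]
           (when (filter-fn event)
             (downstream event))))

       (defn transformer-unit
         "Stateless: applies `transform-fn` and re-broadcasts the result."
         [transform-fn downstream]
         (fn [event]
           (downstream (transform-fn event))))

       (defn buffer-unit
         "Stateful: accumulates events, transmits the batch and resets on
          overflow. A single producer is assumed to keep the sketch short."
         [capacity downstream]
         (let [buf (atom [])]
           (fn [event]
             (when (>= (count (swap! buf conj event)) capacity)
               (let [batch @buf]
                 (reset! buf [])
                 (downstream batch))))))

       ;; Units compose by wrapping:
       ;; (def unit (filter-unit :important? (transformer-unit :value println)))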
  17. Processing Unit
      • Actually, all of them fit into two groups:
        • Stateful
        • Stateless
      • And each has one of these dispatch properties:
        • Get from one, dispatch to many
        • Get from many, dispatch to one
        • Predicate-based dispatch
        • No further dispatch
  18. Processing Unit
      • However smart and complex your data is, it narrows down to just 4 values (sometimes only one of them actually matters): event type, key, value, and timestamp
      • Event type is used for high-level partitioning
      • Key is used for visual and logical grouping
      • Value is used in all possible aggregates, and may be simple or composite
      • Timestamp is used to figure out when the heck that thing actually happened
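     As an illustration (field names assumed, not the project's exact schema), such an event could look like this:

       (def example-event
         {:event-type :http-request       ; high-level partitioning
          :key        "status_code"       ; visual and logical grouping
          :value      {:status 200        ; simple or composite
                       :execution-time 42}
          :timestamp  1377216000000})     ; when it actually happened (epoch ms)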
  19. Processing Unit: a simple idea
      • Everything that comes in is an event
      • Events are split by a triplet (application / environment / event type)
      • Every event can have multiple metrics
      • A metric has a key and a value
      • And a filter
      • And several rollups
      • A rollup has an aggregate function triggered on overflow
      • And a ring buffer storing the last N values
      • And a visualization (area, line, bar chart) attached
  20. Processing Unit
      • key is status_code, value is execution time, aggregate is the median execution time
      • key is user agent, value doesn't matter, aggregate is a count
  21. Processing Unit
      • key is status_code, value doesn't matter, aggregate is a count
      • key is status_code, value is execution_time, aggregate is the max
  22. Processing Unit
      • Once again: ad-hoc vs. post-hoc processing
      • You can yield 10 different metrics from a single payload, so why split them into 10 different metrics on the client?
      • Moving responsibility from the client to the server makes everything far more flexible (see the sketch below)
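     A sketch of that server-side fan-out (the shape and helper names are mine, not the project's configuration format): one raw payload comes in, and a list of metric definitions, here the ones from slides 20 and 21, yields several metrics from it.

       (defn median [xs]
         (let [s (sort xs)]
           (nth s (quot (count s) 2))))

       (def metric-defs
         [{:key :status_code, :value-fn :execution_time, :aggregate median}
          {:key :user_agent,  :value-fn (constantly 1),  :aggregate #(reduce + %)}
          {:key :status_code, :value-fn (constantly 1),  :aggregate #(reduce + %)}
          {:key :status_code, :value-fn :execution_time, :aggregate #(apply max %)}])

       (defn derive-metrics
         "One payload in, one metric tuple per definition out. The
          :aggregate fn is applied later, over a rollup's ring buffer."
         [payload]
         (for [{:keys [key value-fn]} metric-defs]
           {:key key :value (value-fn payload)}))

       ;; (derive-metrics {:status_code 200, :user_agent "curl", :execution_time 42})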
  23. Processing Unit
      • What to do when a single processing unit is not enough?
      • Scale at any preferred point
      • Add more collectors, and make each collector listen for a certain event type
      • If collection is not the bottleneck, use downstreams
      • Having a processing pipeline allows you to scatter at any point
  24. [The architecture diagram from slide 9, shown again.]
  25. Everything you see here can be a single box, or even many boxes.
  26. Processing Unit
      • Move the persistent store to a separate box: configure a downstream to pipe messages through your preferred client/collector to the other machine
      • Really, anything can go to a separate machine
      • If it's the processing part, you can split down even further
  27. Processing Unit
      • Use partition keys to make metrics that belong on the same box actually land on the same box
      • Partition key? Well, anything can be a partition key in the end: part of the identification triplet, a tag, or even the metric value (see the routing sketch below)
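     A sketch of partition-key routing (plain hash-mod over a fixed list of boxes; the names are assumed, and a real deployment would likely prefer consistent hashing so boxes can come and go):

       (defn route
         "Picks the box an event should land on. `partition-key-fn` may
          return anything: part of the identification triplet, a tag, or
          even the metric value."
         [boxes partition-key-fn event]
         (nth boxes (mod (hash (partition-key-fn event)) (count boxes))))

       ;; (route ["box-a" "box-b" "box-c"] :event-type example-event)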
  28. Searching for anomalies & statistics (let me guess, there's no time for that one)
  29. Anomalies
      • Given a categorical entry, a value, and its history, determining an anomaly is nothing complicated
      • Stateful stream processing opens up great ways to use:
        • the Local Outlier Factor algorithm
        • SVMs (support vector machines)
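     Not LOF or an SVM, but the simplest stateful flavor of the same idea, showing why, given a value and its history, flagging an anomaly is nothing complicated (the z-score cut-off of 3 is an arbitrary assumption):

       (defn anomaly?
         "Flags `value` when it sits more than `threshold` standard
          deviations away from the mean of `history`."
         ([history value] (anomaly? history value 3.0))
         ([history value threshold]
          (let [n    (count history)
                mean (/ (reduce + history) n)
                var  (/ (reduce + (map #(let [d (- % mean)] (* d d)) history)) n)
                sd   (Math/sqrt var)]
            (and (pos? sd)
                 (> (Math/abs (double (/ (- value mean) sd))) threshold)))))

       ;; (anomaly? [10 11 9 10 12 10] 50) ;=> true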
  30. Statistics
      • The majority of widely adopted tools don't support even basic statistical functions
      • Most of the time they provide insight into the most recent values, without giving you the power to customize the outputs
      • Go to Coursera, take a couple of machine-learning and statistics courses
      • Discover that all that fancy math is actually easy to understand and provides amazing value
  31. Statistics
      • Use histograms and box plots to make variance and distribution obvious
      • Detect trends in the data: where are you heading right now?
      • Find correlations between different data points at the same point in time (see the sketch below)
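     For the correlation point, a plain-Clojure sketch of the Pearson coefficient between two metric series sampled at the same points in time (no library assumed):

       (defn pearson
         "Pearson correlation coefficient of two equally long series;
          nil when either series has zero variance."
         [xs ys]
         (let [n   (count xs)
               mx  (/ (reduce + xs) n)
               my  (/ (reduce + ys) n)
               dx  (map #(- % mx) xs)
               dy  (map #(- % my) ys)
               cov (reduce + (map * dx dy))
               sx  (Math/sqrt (reduce + (map #(* % %) dx)))
               sy  (Math/sqrt (reduce + (map #(* % %) dy)))]
           (when (and (pos? sx) (pos? sy))
             (/ cov (* sx sy)))))

       ;; (pearson [1 2 3 4] [2 4 6 8]) ;=> 1.0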
  32. Part 3: Alerting (let me guess, there's no time for that one, either)
  33. Alerting
      • The most frustrating part of most monitoring systems
      • Why can't we figure out alerts?
      • Trends: something is changing
      • OK, but we've seen that trend before
      • So we silence the trend and increase the threshold
      • And get false negatives
  34. Alerting
      • Use more data
      • Remember previous alerts (both silenced and true ones)
      • Don't look for an out-of-the-box solution; there are too many variables involved
      • Expect false positives while the system evolves; figure out what causes them
      • Don't increase thresholds, introduce correlations instead:
        • time-based correlations
        • variation and distribution
        • correlation between seemingly independent values
      (a sketch of the "remember previous alerts" idea follows)
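     A loose sketch of the "remember previous alerts" idea (the state shape and the similarity rule are my assumptions, not the talk's): keep silenced and confirmed alerts around, and suppress a new alert only when a very similar one was already silenced, instead of raising the global threshold.

       (def alert-history (atom {:silenced [] :confirmed []}))

       (defn similar?
         "Same key, and a value within 10% of the remembered alert's value
          (an arbitrary rule, just for the sketch)."
         [a b]
         (and (= (:key a) (:key b))
              (< (Math/abs (double (- (:value a) (:value b))))
                 (* 0.1 (Math/abs (double (:value b)))))))

       (defn should-fire? [alert]
         (not-any? #(similar? alert %) (:silenced @alert-history)))

       (defn record!
         "Remembers an alert as either silenced or confirmed."
         [alert silenced?]
         (swap! alert-history update (if silenced? :silenced :confirmed) conj alert))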
  35. Visualization
      • Always show the data points the graph was built from
      • When using interpolation, also use gap detection for sparse discrete data
      • Always put labels on axes
      • Always put horizontal and vertical rules alongside the labels, for recognition
      • Use box charts (also stacked ones) when it's important to compare quantities visually