Avoiding the “Half-Baked Zone” - The Fallacy of Real-Time Analytics in Performance Monitoring

What do I mean by the “Fallacy of Analytics”?

•  Perhaps you have seen this before: Gartner Group’s Hype Cycle.
•  It purports to describe the phases that a new technology traverses on its way to successful adoption.
•  What I mean by “the fallacy of analytics” is this: the DevOps community is here when it should be here.
•  More specifically, analytics is under-utilized.
•  Why is this? And, specifically, what can DevOps do to use analytics more effectively?
•  That is what I want to explore with you today.

We need to understand where we are now. Let’s start with a bit of a selfie. What is YOUR own personal state of the art in applying big data + analytics to DevOps?

•  Is this YOU?
•  When you think about it … is it not true that all you can know about what is happening comes from data and analytics?
•  Let’s take these two one at a time.
•  DATA …
•  The state of the art in IT instrumentation is vastly improved from when I started out with SNMP agents and syslog.
•  You can get great agents for all layers in the stack, from networks to servers to applications, basically FREE from open source projects.
•  Data is NOT a problem.
•  Also, this talk is not about data, so let’s quickly move on to ANALYTICS.
•  So what analytics do you have?

Do you have Dashboards? The most basic analytics simply transform the metrics into graphs. This is a terrific place to start. I love visualization.

I have seen this carried to extremes. I have been guilty of this. How about you? The problem is: as the number of metrics grows, the probability of missing an alarm grows exponentially. So one obviously turns to automation. Strategy: detect problems using software. What sort of alarm detection software did you start with? I started with static alarms. How about you?

Oops. This is less than perfect. The result …

A system that cries wolf. Too much noise. Credibility suffers, to put it mildly. The mathematical term for this is “FALSE POSITIVES”.

Worse than that, even with all the false positives, important events are missed. And you look like you have your head in the sand.

Moreover, simple analytics that focus on only ONE metric at a time don’t give you the big picture. Do you feel like you are driving blind?

You are NOT alone. The good news is that there is HOPE. There is lots of evidence that good things happen from better, more sophisticated USE of analytics. This hope is NOT irrational. I have an inspiring example …

•  In 1857, a ship called the Central America sank somewhere off the southeast coast of the US in a big storm.
•  On board was 30,000 pounds of gold from the California Gold Rush, valued at between $100M and $150M.
•  Needless to say, there was high interest in finding this shipwreck and recovering the gold.
•  Many teams looked.

•  This is what they faced.
•  They did have some data from historical accounts.
•  Finally, in 1988, one team managed to find it.
•  They were data scientists!
•  And what was their secret weapon?

•  Bayesian search theory!
•  There are lots of other examples, but this one is my personal favorite.

•  So, refocusing back on DevOps …
•  Back in 2002, a good friend of mine, call him Bob, connected the dots between the promise of analytics and the ubiquitous suffering in datacenters.
•  Bob pitched the idea.

On the hype cycle, this was where the industry stood in 2002.

Everyone LOVED it: VCs, potential customers, even the competition. The company was born. Its mission: “IT: Living better with analytics”. So, what does Living Better with Analytics mean? Bob defined it in terms of reaching two goals.

•  The state of the art back in 2002 included extensive use of visualization + simple analytics based upon deterministic models.
•  It was obvious that this was not working.
•  The first goal was simply to reduce alarms to a manageable level … say tens or fewer per day.
•  The How: leverage a progression of analytics: visualization, deterministic analytics, and then more sophisticated analytics that use statistics and other more advanced models and algorithms.
•  Also, leverage the experience in other applications such as finding sunken treasure, and in other more fundamental areas such as managing the power grid or high frequency trading.

The second goal: alarm quality or accuracy, measured by the Signal to Noise Ratio (SNR). SNR is calculated as shown … the number of True Positives (information that you do want) divided by the number of False Positives (information that you do not want).
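
The ratio described above can be written as a one-line function. This is a minimal sketch; the function name and the alarm counts are hypothetical and only illustrate the definition (true positives divided by false positives).

```python
def signal_to_noise_ratio(true_positives: int, false_positives: int) -> float:
    """SNR as defined on the slide: true positives divided by false positives."""
    if false_positives == 0:
        # Assumption: with no false alarms at all, report the ratio as unbounded.
        return float("inf")
    return true_positives / false_positives


# Hypothetical day of alarms: 8 real problems flagged, 32 false alarms.
snr = signal_to_noise_ratio(true_positives=8, false_positives=32)
print(snr)  # 0.25
```

A ratio below 1 means operators see more noise than signal; reducing the alarm count only helps if it removes false positives faster than true ones.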

•  So back to Bob and his company.
•  They built a black box, and continued to refine it to the present time.
•  It leveraged algorithms and research that had been used in statistical learning for predicting power as a function of weather, customer usage, and other factors.
•  As it happens, these same algorithms are heavily used by market makers in high frequency trading applications.
•  The black box was initially implemented as on-premise software and now is also packaged as a cloud service.

Customers would use Bob’s Black Box as follows:
1) Drop it into a data center.
2) Connect it up to data sources.
3) Give it some time to “learn” what is “normal” given certain specified levels of confidence.
4) The box continues to learn and make adjustments very frequently as time goes on.
5) In a steady operational state, as new observations arrive, the box determines if they are “normal”.
6) If not, it detects a DEVIATION and raises an alarm.
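
The learn-then-detect loop in steps 3 to 6 can be sketched for a single metric. The slides do not say which algorithms the black box uses, so this stand-in is an assumption: a running mean and standard deviation (Welford's method) with a band of `k_sigma` standard deviations. The class name, parameters, and sample values are all hypothetical.

```python
import math


class UnivariateBand:
    """Learns a 'normal' band for one metric as mean +/- k_sigma standard deviations."""

    def __init__(self, k_sigma: float = 3.0, min_samples: int = 30):
        self.k_sigma = k_sigma
        self.min_samples = min_samples  # observations needed before alarming
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared differences from the mean

    def learn(self, value: float) -> None:
        """Fold one observation into the running statistics (Welford's method)."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def is_deviation(self, value: float) -> bool:
        """True if the value falls outside the learned band."""
        if self.n < self.min_samples:
            return False  # still in the initial learning period
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(value - self.mean) > self.k_sigma * std

    def observe(self, value: float) -> bool:
        """Check a new observation, then keep learning from it."""
        deviation = self.is_deviation(value)
        self.learn(value)
        return deviation


band = UnivariateBand()
for i in range(60):
    band.observe(100.0 + (i % 5))  # hypothetical steady metric between 100 and 104

print(band.observe(103.0))  # False: inside the learned band
print(band.observe(150.0))  # True: a deviation
```

Because `observe` keeps learning from every value, including deviating ones, the band in this sketch gradually adapts to a changed level.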

The next few slides show how the black box worked in more detail. First, this is how the black box visualized the results of its statistical models. The black line represents the changing values of one raw metric over time. The light green band reflects normal values derived via a univariate model. The blue band reflects normal values derived via a multivariate model. As you can see, the multivariate bands are much narrower.

Univariate learning: you can see the green band get narrower around a fairly constant metric.

Here are two correlated metrics that undergo a significant change. Traffic starts being directed to a new ELB coming on-line. Request count increases (bottom graph). HTTP error percent (upper graph) rises from 0-2% to a level that hovers around 4%. At first this is viewed as a deviation and alarms are issued (see red area at the top). After a while, the system settles into pretty consistent behavior, new bands are learned, and the alarms stop.

Deviations are easy to spot and can be detected automatically. Also note the relative widths of the univariate and multivariate bands.

The bands are actually much more helpful than you might think. Consider the following hypothetical scenario.

A tale of two deviations … (hypothetical) Say your primary analytic was a static threshold. See the red line. How many deviations do you see? How many alarms should be issued? Two, right?

When you factor in learning, it becomes clear that the first peak is “normal” while the second peak is not.
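
The tale of two deviations can be sketched in a few lines. All numbers here are hypothetical: a static threshold, a learned band keyed by hour of day, and two peaks of similar height that occur at different hours.

```python
STATIC_THRESHOLD = 80.0  # hypothetical static alarm level (the red line)

# Hypothetical learned "normal" band per hour of day: (low, high).
learned_band = {9: (70.0, 95.0), 15: (30.0, 55.0)}

# Two peaks of similar height at different hours: (hour, value).
peaks = [(9, 90.0), (15, 85.0)]

# Static view: anything above the line is an alarm.
static_alarms = [value > STATIC_THRESHOLD for _, value in peaks]

# Learned view: only values outside the band for that hour are alarms.
learned_alarms = [
    not (learned_band[hour][0] <= value <= learned_band[hour][1])
    for hour, value in peaks
]

print(static_alarms)   # [True, True]
print(learned_alarms)  # [False, True]
```

The static threshold raises two alarms, while the learned band treats the first peak as normal for that hour and flags only the second.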

The narrower the bands, the earlier the catch! The sooner you can marshal the troops, the better. Time is of the essence when trying to prevent a developing outage.

Classic relationship between utilization (metric #1) and response time. Just one specific case where time is of the essence.

Putting statistical models to use across many datacenters, ranging from small to huge, led us to conclude that we are not there yet.

•  Statistical analytics was successful … it reduced noise by 3 orders of magnitude.
•  Good, even great.
•  But given the amount of data that we need to deal with, this is not enough to achieve our target of tens of alarms or fewer per day.
•  So we came up with a new term: “half-baked”.
•  Statistical models are a NECESSARY but NOT SUFFICIENT application of analytics.

Here are a few insights as to why.

Implementing an algorithm is more than just calling a canned math library … the math has to fit into an entire ecosystem of functions that involve data ingestion, cleansing, storage, transformation, visualization … all in real time.
•  Issues of real-time performance at scale – n > 1000
•  Numerical stability
•  Data quality issues
•  Combination of skills required

The second insight. Probability is extremely helpful, but it is not the whole story for reducing noise. Here is a graph of the normal distribution. I could have selected any number of others. What I say applies equally well to any probability density function. The x-axis reflects observations of a metric’s values in units of standard deviations away from the mean. The y-axis reflects the probability density of observing the value. The annotations reflect the percentage of all observations that would normally fall into each shaded area, assuming a normal distribution. For example, only about 2.1% of all observations would fall between 2 and 3 standard deviations above the mean, and another 2.1% between 2 and 3 standard deviations below it.

•  The tails of the distribution are where deviations can occur.
•  Let’s apply this distribution to observations that we might see in various sizes of datacenters.

•  First: we look at a tiny infrastructure with 500 metrics, sampled at a rate of one observation per 5 minutes.
•  Two to four collectd agents can produce 500 metrics, easy.
•  So, with just two to four agents, this is a tiny installation.
•  In this case, during one day, we would under normal conditions see over 228 thousand deviations of at least one standard deviation from the mean.
•  During one day, we would, under normal conditions, see 45 observations greater than or equal to 4 standard deviations from the mean.
•  Let’s call them 4-sigma deviations.
•  Of these 45, some may be real problems, but there is a distinct possibility that they are not. Eyeballing just 45 observations per day will take time.

2 x 2280 deviations is not useful.

Conclusion. The bigger the data (equivalently, the more metrics you have), the more deviations will occur, even under “statistically normal” conditions. One cannot map deviations to alarms. This sets up a false positive nightmare.
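
The scaling argument can be made concrete with a small calculation. This is a sketch under stated assumptions: every observation is an independent draw from a normal distribution, and a deviation means landing at least `k_sigma` standard deviations from the mean on either side. The function name and the installation sizes are hypothetical, and the printed counts come from this idealized model rather than from the figures quoted on the slides.

```python
from statistics import NormalDist


def expected_deviations_per_day(n_metrics, sample_interval_minutes, k_sigma):
    """Expected count of observations at least k_sigma from the mean in one day."""
    observations_per_day = n_metrics * (24 * 60 / sample_interval_minutes)
    # Two-sided tail probability of a standard normal distribution.
    tail_probability = 2 * (1 - NormalDist().cdf(k_sigma))
    return observations_per_day * tail_probability


# Hypothetical installation sizes, all sampled every 5 minutes.
for n_metrics in (500, 5_000, 50_000):
    count = expected_deviations_per_day(n_metrics, 5, k_sigma=3)
    print(f"{n_metrics:>6} metrics: about {count:,.0f} 3-sigma deviations per day")
```

The count grows linearly with the number of metrics, so a threshold that looks quiet on a tiny installation produces a steady stream of "normal" deviations at scale.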

•  Next insight: there is significant tension associated with distinguishing good from bad alarms.
•  Again, statistics is helpful, but it is not the whole story.
•  Looking at a single metric, divide all the observations that you see into two piles: OK and !OK.
•  The distribution on the left shows the probability for the values of a metric if all is OK.
•  The distribution on the right shows the probability of values of a metric if all is NOT OK.
•  Here is the problem: the two graphs overlap.
•  The area where the two curves overlap is where the alarm quality suffers. There are two flavors of “trouble”:
•  False Negatives (FN) or False Positives (FP).
•  FN is the “asleep at the switch” situation in which the monitoring system makes a correct observation but interprets it as OK when it is not. This is represented by the RED area in which Pr(OK) > Pr(NOTOK).
•  FP is the crying wolf situation in which the monitoring system makes a correct observation but interprets it as NOTOK when in fact it is OK. This is represented by the ORANGE area in which Pr(NOTOK) > Pr(OK).
Conclusion … so again, pure statistics and probabilities, while incredibly helpful, are not the whole story.
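
The tension between the two error types can be sketched numerically. The two distributions below are hypothetical: both are assumed normal, with made-up means and standard deviations, and an alarm is raised whenever a value exceeds a single threshold.

```python
from statistics import NormalDist

ok = NormalDist(mu=100.0, sigma=10.0)      # hypothetical: values when all is OK
not_ok = NormalDist(mu=130.0, sigma=10.0)  # hypothetical: values when all is NOT OK


def error_rates(threshold):
    """Return (false positive rate, false negative rate) for one alarm threshold."""
    false_positive_rate = 1 - ok.cdf(threshold)  # OK value above threshold: cries wolf
    false_negative_rate = not_ok.cdf(threshold)  # NOT OK value below threshold: missed
    return false_positive_rate, false_negative_rate


for threshold in (105.0, 115.0, 125.0):
    fp, fn = error_rates(threshold)
    print(f"threshold {threshold}: FP rate {fp:.3f}, FN rate {fn:.3f}")

# Shared area under the two curves: the region where mistakes are unavoidable.
print(f"overlap: {ok.overlap(not_ok):.3f}")
```

Moving the threshold trades one error for the other; no threshold removes both as long as the curves overlap.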

•  Insight #4: a purely statistical approach cannot help distinguish between a long running problem and a new normal.
•  Here we see a graph of observations along with an “alarm skyline” that shows when the analytics were deciding there was something important to report.
•  The first four sets of cycles look appropriate. The last set, where no alarms were issued, is a problem, but the statistical analytic interpreted it as a “new normal”.
•  Given that this metric is measuring the rate of success for some transaction, we do not have a new normal. We have a BIG PERSISTENT problem.
Conclusion: statistics can sense a change but not whether it is good or bad.
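
The “new normal” trap can be reproduced with a toy adaptive baseline. This sketch uses an exponentially weighted moving average as a stand-in for the statistical analytic; that choice, the tolerance, the success-rate series, and the 95% floor are all assumptions made for illustration.

```python
def adaptive_alarms(values, alpha=0.2, tolerance=5.0):
    """Statistical view: alarm when a value strays from an adaptive baseline."""
    baseline = values[0]
    alarms = []
    for value in values:
        alarms.append(abs(value - baseline) > tolerance)
        # The baseline keeps learning, so it drifts toward any persistent level.
        baseline = alpha * value + (1 - alpha) * baseline
    return alarms


# Hypothetical transaction success rate (%): healthy, then a persistent drop.
success_rate = [99.0] * 10 + [60.0] * 40

statistical = adaptive_alarms(success_rate)
# Semantic view: a success rate below 95% is never acceptable (assumed rule).
semantic = [value < 95.0 for value in success_rate]

print(statistical[10], statistical[-1])  # True False
print(semantic[10], semantic[-1])        # True True
```

The adaptive baseline alarms when the drop begins and then goes quiet once it has absorbed the new level, while the rule that encodes what the metric means keeps firing.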

So, given these insights, how can we improve the utility of analytics to operations? I offer FOUR pearls of wisdom.

There are just too many metrics AND there are just too many deviations. Individual metric alarms create too much noise. So an obvious approach is to group them in ways that can be leveraged by the analytics. In the past, companies used to deploy CMDB tools to understand grouping and relationships. These mechanisms have seen limited, if any, success in DevOps. Basically, these tools were never designed to deal with the scale and elasticity of cloud computing. So what grouping strategies are suitable for DevOps?

Tags are one grouping mechanism. Groups can be defined simply by giving a set of metrics the same tag. Tags can be organized into hierarchies of folders. Tagging is easy to understand and incredibly flexible.
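
Grouping by tag can be sketched with a plain dictionary. The metric names and tag paths are hypothetical, and writing a folder hierarchy as a slash-separated path is an assumption of this sketch rather than something the slides specify.

```python
from collections import defaultdict

# Hypothetical metric names and tag paths; a path like "prod/web" places the
# "web" tag inside the "prod" folder.
metric_tags = {
    "web01.cpu": ["prod/web"],
    "web02.cpu": ["prod/web"],
    "db01.queries": ["prod/database"],
    "ci01.cpu": ["build"],
}


def group_by_tag(metric_tags):
    """Map every tag path, and each of its parent folders, to its metrics."""
    groups = defaultdict(set)
    for metric, paths in metric_tags.items():
        for path in paths:
            parts = path.split("/")
            for depth in range(1, len(parts) + 1):
                groups["/".join(parts[:depth])].add(metric)
    return dict(groups)


groups = group_by_tag(metric_tags)
print(sorted(groups["prod/web"]))  # ['web01.cpu', 'web02.cpu']
print(sorted(groups["prod"]))      # ['db01.queries', 'web01.cpu', 'web02.cpu']
```

A metric joins a group simply by carrying the tag, so groups follow the infrastructure as instances come and go.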

Here is another grouping strategy: Elements. This is a server element.

This is an element that encapsulates metrics that deal with the performance of method calls on a Java class.

Load balanced clusters are a common type of group with very specific properties. One can view them as a group of Elements, all of which share the same set of metrics. The group is significant, not the individual member. Key Benefits:
•  Alarms can be detected at metric, element, group, or tag level.
•  Knowledge of these associations can help immensely in 1) deciding whether or not to raise an alarm and 2) providing key troubleshooting information.

By context, I am referring here to SEMANTICS: the MEANING of the analytics results. There are lots of examples. Here is one.

Suppose that the analytics are telling you that you have a 50% error rate in some application’s transactions. Should you press the panic button?

Decision logic provides yet another opportunity to inject human knowledge between detection of deviations and actions that are taken … actions that range from sending an alarm to firing up complex configuration changes.

•  The second area is Policies.
•  Policies are decision logic that encapsulates a priori knowledge.
•  A policy has three components:
•  Scope – the subset of the infrastructure to which it applies
   – Scopes can be defined by metric, element, group, or tag
•  Conditions – the conditions that constitute a policy violation
•  Actions – what to invoke if the policy is violated
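
The three components can be sketched as a small data structure. This is a hypothetical shape, not the format used by any particular product: the `Policy` class, the observation dictionary layout, and the example policy are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    name: str
    scope: Callable[[dict], bool]      # which observations the policy applies to
    condition: Callable[[dict], bool]  # what constitutes a violation
    action: Callable[[dict], None]     # what to invoke on a violation

    def evaluate(self, observation: dict) -> bool:
        """Invoke the action if the observation is in scope and violates the policy."""
        if self.scope(observation) and self.condition(observation):
            self.action(observation)
            return True
        return False


raised = []  # stand-in for an alarm channel

# Hypothetical policy: alarm on checkout error rates above 5%.
checkout_errors = Policy(
    name="checkout error rate",
    scope=lambda obs: "checkout" in obs["tags"],
    condition=lambda obs: obs["metric"] == "error_rate" and obs["value"] > 0.05,
    action=raised.append,
)

in_scope = {"metric": "error_rate", "value": 0.5, "tags": ["checkout", "prod"]}
out_of_scope = {"metric": "error_rate", "value": 0.5, "tags": ["test"]}

print(checkout_errors.evaluate(in_scope))      # True
print(checkout_errors.evaluate(out_of_scope))  # False
print(len(raised))                             # 1
```

Keeping scope separate from condition lets the same rule be applied to a metric, an element, a group, or a tag without rewriting the rule itself.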

•  An example of a policy

This is a great strategy for reducing alarm count.

Leveraging tag hierarchies, you can inject knowledge of services such as database, web, etc. Knowledge of these associations can help immensely in 1) deciding whether or not to raise an alarm and 2) providing key troubleshooting information.

•  Putting this all together, we obtain an analytics cycle.
•  The workflow can be synchronous or asynchronous … triggered at regular intervals or by external events.
•  The workflow starts with available data from one or more sources.
•  Quantitative analytics are applied and produce results.
•  Results are overlaid on a logical model of the datacenter and its applications.
•  Policies are evaluated to see if there are any violations.
•  When violations are detected AND deemed to be actionable, then actions are invoked.
Let’s see this workflow in action … I have time for one real example.
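
One pass through the cycle can be sketched as a single function. Everything here is a hypothetical stand-in: a fixed cutoff plays the role of the quantitative analytics, a dictionary plays the role of the logical model, and one lambda plays the role of a policy.

```python
def analytics_cycle(observations, is_deviation, logical_model, policies):
    """One pass: data -> analytics -> logical model -> policies -> actions."""
    invoked = []
    for obs in observations:                      # available data
        if not is_deviation(obs):                 # quantitative analytics
            continue
        # Overlay the result on the logical model to give it context.
        context = dict(obs, group=logical_model.get(obs["metric"], "ungrouped"))
        for policy in policies:                   # policy evaluation
            if policy(context):
                # Actionable violation: record the action that would be invoked.
                invoked.append((context["group"], obs["metric"]))
    return invoked


observations = [
    {"metric": "web01.errors", "value": 0.5},
    {"metric": "test01.errors", "value": 0.5},
    {"metric": "web02.errors", "value": 0.01},
]
logical_model = {
    "web01.errors": "prod/web",
    "web02.errors": "prod/web",
    "test01.errors": "test",
}


def over_cutoff(obs):
    """Stand-in for the statistical models: a fixed cutoff on the error rate."""
    return obs["value"] > 0.1


policies = [lambda ctx: ctx["group"].startswith("prod")]  # only production is actionable

result = analytics_cycle(observations, over_cutoff, logical_model, policies)
print(result)  # [('prod/web', 'web01.errors')]
```

Two of the three observations deviate, but only the one in a production group survives the policy step, which is how context reduces the alarm count.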

•  This example is from a highly mature organization that processes airline reservations.
•  By highly mature, I mean that they have very few actual outages. But they certainly want to be as proactive as possible in catching those that do occur.
Most of the groups were automatically created but some were manually created. Ideally, #problems = #alarms.

USE of analytics that includes math + a priori knowledge + tagging and grouping is consistently delivering results like this. Pure math delivers an improvement of 3 orders of magnitude. The priors and service models each deliver about 1 more order of magnitude.

•  So the fallacy is in thinking that pure math is enough.
•  Pure math, models, and algorithms are ABSOLUTELY necessary, but they are NOT sufficient.
•  If analytics is in the trough, it is because of INCOMPLETE application.
•  Analytics should not be in the trough.
•  Instead, they should be here.
•  Note the placement … Our work is NOT done.
•  BUT we know that analytics can deliver benefits to DevOps and improve all our lives.
•  For now … the final thought that I would like you to take away is

The best analytics leverage BOTH math and human knowledge. Thank you for your attention.
