Observing and Understanding Behavior in Complex Systems

Slide 1

Slide 1 text

Slide 2

Slide 2 text

The Many Faces of Theo: Fun with Beards and Hair. (Photo captions: Fuck It All, Vendetta, Scary, Determined, Carefree, No-Fly Zone.)

Slide 3

Slide 3 text

Complex Systems: surprisingly ubiquitous; often not complex at first thought; will dominate your entire life going forward.

Slide 4

Slide 4 text

Complex? A damped, driven system whose total energy exceeds the thresholds according to classical mechanics but does not meet the thresholds for chaotic systems. (Chart labels: Classic, Complex, Chaotic; Operator Happiness, 100.)

Slide 5

Slide 5 text

Chaotic Systems: are sensitive to initial conditions (✓); must be topologically mixing (✓, analogously); must have dense periodic orbits (✓, with bugs). To apply this directly … we must have an evolution function … we do not.

Slide 6

Slide 6 text

Where from here? Our systems are complex. Bugs make them chaotic at times. Decoupling makes them very hard to observe … let alone understand, because determining causality by observation is not possible.

Slide 7

Slide 7 text

Data over logs: The transaction is “lost” in an asynchronous system. Determination of causal effects cross-txn is impossible. The best we can do is: observe, analyze, correlate.

Slide 8

Slide 8 text

Some Tenets: Most people suck at monitoring. They monitor all the wrong things (somewhat bad). They don’t monitor the important things (awful).

Slide 9

Slide 9 text

Do not collect rates of things. Rates are like trees making sounds falling in the forest. Direct measurement of rates leads to data loss and ultimately ignorance.

Slide 10

Slide 10 text

Prefer high-level telemetry: 1. Business drivers via KPIs, 2. Team KPIs, 3. Staff KPIs, 4. ... then telemetry from everything else.

Slide 11

Slide 11 text

Implementation Herein it gets tricky.

Slide 12

Slide 12 text

Methodology: I’m going to focus on methodology that can be applied across whatever toolset you have.

Slide 13

Slide 13 text

Pull vs. Push: Anyone who says one is better than the other is... WRONG. They both have their uses.

Slide 14

Slide 14 text

Reasons for pull 1. Synthesized observation is desirable. 2. Observable activity is infrequent. 3. Alterations in observation frequency are useful.

Slide 15

Slide 15 text

Reasons for push Direct observation is desirable. Discrete observed actions are useful. Discrete observed actions are frequent.

Slide 16

Slide 16 text

False reasons. Polling doesn’t scale.

Slide 17

Slide 17 text

Protocol Soup: The great thing about standards is... there are so many to choose from.

Slide 18

Slide 18 text

Protocol Soup: SNMP (v1, v2, v3): both push (trap) and pull (query). collectd (v4, v5): push only. statsd: push only. JMX, JDBC, ICMP, DHCP, NTP, SSH, TCP, UDP, barf.

Slide 19

Slide 19 text

Color me RESTy: Use JSON. HTTP(S) PUT/POST somewhere for push. HTTP(S) GET something for pull.

Slide 20

Slide 20 text

High-volume Data Occasionally, data velocity is beyond what’s reasonable for individual HTTP PUT/POST for each observation. 1. You can fall back to UDP (try statsd) 2. I prefer to batch them and continue to use REST
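The batching option can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: `createBatcher`, the `{ name, value, ts }` record shape, and the batch size are all hypothetical, and `send` stands in for whatever performs the actual HTTP(S) PUT/POST.

```javascript
// Hypothetical batcher: buffers observations and hands them to `send`
// as a single JSON body instead of one request per observation.
function createBatcher(send, maxBatch = 500) {
  let buffer = [];
  function flush() {
    if (buffer.length === 0) return; // nothing to send
    const body = JSON.stringify(buffer);
    buffer = [];
    send(body);
  }
  return {
    observe(name, value, ts = Date.now()) {
      buffer.push({ name, value, ts });
      if (buffer.length >= maxBatch) flush();
    },
    flush,
  };
}

// Usage: `send` would normally POST the body; here it only records payloads.
const sent = [];
const batcher = createBatcher((body) => sent.push(body), 3);
batcher.observe("api.latency_ms", 12, 1000);
batcher.observe("api.latency_ms", 48, 1001);
batcher.observe("api.latency_ms", 7, 1002); // third observation triggers a flush
batcher.observe("api.latency_ms", 30, 1003);
batcher.flush(); // flush the remainder, e.g. on a timer or at shutdown
```

Every individual observation still arrives; only the number of requests drops. A real version would also flush on a timer so a quiet period does not strand data in the buffer.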

Slide 21

Slide 21 text

The real question is: “what?” What should I be monitoring? This is the best question you can ask yourself. Before you start. While you’re implementing. After you’re done.

Slide 22

Slide 22 text

The industry answer: MONITOR ALL THE THINGS! I’ll tell you this too, in fact. But we have put the cart ahead of the horse.

Slide 23

Slide 23 text

Question? If I could monitor one thing, what would it be? Hint: CPU utilization on your web server ain’t it.

Slide 24

Slide 24 text

Answer: It depends on your business. If you don’t know the answer to this, I suggest you stop worrying about monitoring and start worrying about WTF your company does.

Slide 25

Slide 25 text

Here, we can’t continue. Unless I make stuff up... So, here I go makin’ stuff up.

Slide 26

Slide 26 text

Let us assume we run a web site where customers buy products

Slide 27

Slide 27 text

Monitoring purchases. So, we should monitor how many purchases were made and ensure it is within acceptable levels. Not so fast.

Slide 28

Slide 28 text

Actually. We want to make sure customers can purchase from the site and are purchasing from the site. This semantic difference is critically important. And choosing which comes down to velocity.

Slide 29

Slide 29 text

What is this velocity thing? Displacement / time (i.e. purchases/second or $/second). BUT WAIT! You said: “Do not collect rates of things.” Correct... collect the displacement; visualize and alert on the rate.
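One way to read "collect the displacement, visualize and alert on the rate" is to store a running total and derive the rate afterwards. The sketch below assumes samples shaped as `{ ts, total }` with `ts` in seconds; the function name and the sample data are made up for illustration.

```javascript
// Derive per-second rates from samples of a monotonically increasing counter.
// Each sample is { ts: seconds, total: cumulative count } (hypothetical shape).
function ratesFromCounter(samples) {
  const rates = [];
  for (let i = 1; i < samples.length; i++) {
    const dt = samples[i].ts - samples[i - 1].ts;
    const delta = samples[i].total - samples[i - 1].total;
    // A negative delta suggests the counter reset (e.g. a restart); skip it.
    if (dt <= 0 || delta < 0) continue;
    rates.push({ ts: samples[i].ts, perSecond: delta / dt });
  }
  return rates;
}

const purchases = [
  { ts: 0, total: 100 },
  { ts: 60, total: 160 },
  { ts: 180, total: 220 }, // one poll was missed; the purchases were not
];
const purchaseRates = ratesFromCounter(purchases);
```

Because the total is cumulative, a missed poll only widens the interval: the 60 purchases between `ts` 60 and 180 still show up, as an average of 0.5/s. Had the rate itself been sampled, whatever happened during the missed poll would simply be gone.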

Slide 30

Slide 30 text

So which? High velocity w/ predictably smooth trends: velocity is more important. Low velocity or uneven arrival rates: measuring capability is more important.

Slide 31

Slide 31 text

To rephrase: If you have sufficient real data, observing that data works best; otherwise, you must synthesize data and monitor that.

Slide 32

Slide 32 text

As a tenet. Always synthesize. Additionally, observe real data when possible.

Slide 33

Slide 33 text

More demonstrable (in a short session): I’ve got a web site that my customers need to visit. The business understands that we need to serve customers with at least a basic level of QoS: no page loads over 4s.

Slide 34

Slide 34 text

Active checks.

Slide 35

Slide 35 text

A first attempt: curl http://surge.omniti.com/, extract the HTTP response code; if 200, we’re super good! Admittedly not so good.

Slide 36

Slide 36 text

A wealth of data. Synthesizing an HTTPS GET could provide: SSL subject, validity, expiration; HTTP code, headers and content; timings on TCP connection, first byte, full payload.
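Once a synthesized GET yields more than a status code, the check can judge several things at once. The sketch below only evaluates an already-collected result; the field names (`status`, `totalMs`, `certNotAfter`) and the 14-day certificate warning window are assumptions for illustration, while the 4 s limit is the QoS target stated earlier.

```javascript
// Evaluate one synthesized HTTPS GET against the 4 s page-load limit.
// Field names and the certificate warning window are hypothetical.
function evaluateCheck(result, now, maxLoadMs = 4000, certWarnDays = 14) {
  const problems = [];
  if (result.status !== 200) {
    problems.push(`unexpected status ${result.status}`);
  }
  if (result.totalMs > maxLoadMs) {
    problems.push(`load took ${result.totalMs} ms`);
  }
  const daysLeft = (result.certNotAfter - now) / 86400000; // ms per day
  if (daysLeft < certWarnDays) {
    problems.push(`certificate expires in ${Math.floor(daysLeft)} days`);
  }
  return { ok: problems.length === 0, problems };
}

// A 200 response that is still not "super good".
const checkedAt = Date.UTC(2013, 0, 1);
const verdict = evaluateCheck(
  { status: 200, totalMs: 5200, certNotAfter: checkedAt + 10 * 86400000 },
  checkedAt
);
```

Here the response code alone would have passed, yet the page is too slow and the certificate is close to expiry.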

Slide 37

Slide 37 text

Still, this is highly imperfect. Don’t get me wrong, these checks are useful. We use them all over the place... they are cheap. But, ideally, you want to load the page closer to the way a user does (all assets, JavaScript, etc.). Enter PhantomJS.

Slide 38

Slide 38 text

Slide 39

Slide 39 text

var start,
    stats = {
      status: null,
      errors: 0,
      load_started: null,
      load_finished: null,
      resources: 0,
      resource_errors: 0,
      url_redirects: 0
    };

Slide 40

Slide 40 text

Passive checks.

Slide 41

Slide 41 text

Now for the passive stuff. Some examples are Google Analytics, Omniture, etc. Statsd (out-of-the-box) and Metrics are mediocre approaches. If we have a lot of observable data N, the mean N̅ isn’t so useful; σ, |N|, q(0.5), q(0.95), q(0.99), q(0), and q(1) add a lot.

Slide 42

Slide 42 text

Still... we can do better. N̅, σ, |N|, q(0, 0.5, 0.95, 0.99, 1) is 8 statistical aggregates. Let’s look at API latencies... say we do 1000/s, that’s 60k/minute. Over a minute of time, 60k points to 8 represents... a lot of information loss.
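To make the eight aggregates concrete, here is a small sketch that reduces a set of latencies to mean, standard deviation, count, and five quantiles. Treating the second aggregate as the population standard deviation is an assumption, as is the interpolating quantile definition (several are in common use); the sample values are invented.

```javascript
// Quantile by linear interpolation between closest ranks
// (one of several common definitions). `sorted` must be ascending.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Reduce a non-empty array of numbers to eight aggregates.
function summarize(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const count = sorted.length;
  const mean = sorted.reduce((s, v) => s + v, 0) / count;
  // Population variance (divide by count, not count - 1).
  const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / count;
  return {
    mean,
    stddev: Math.sqrt(variance),
    count,
    min: quantile(sorted, 0),
    median: quantile(sorted, 0.5),
    q95: quantile(sorted, 0.95),
    q99: quantile(sorted, 0.99),
    max: quantile(sorted, 1),
  };
}

// Four fast requests and one very slow one (milliseconds).
const summary = summarize([10, 20, 30, 40, 1000]);
```

With one outlier the mean lands at 220 ms, a latency no request actually experienced, while the median stays at 30 ms. Even so, the shape of the distribution between those eight numbers is gone.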

Slide 43

Slide 43 text

First: 60k/minute, how? statsd, HTTP PUTs, logs, etc.

Slide 44

Slide 44 text

Histograms

Slide 45

Slide 45 text

Histograms 101: This. This is a histogram. It shows the frequency of values within a population. Height represents frequency.

Slide 46

Slide 46 text

Histograms 101: This. This is a histogram. It shows the frequency of values within a population. Now, height and color represent frequency.

Slide 47

Slide 47 text

Histograms 101: This. This is a histogram. It shows the frequency of values within a population. Now, only color represents frequency.

Slide 48

Slide 48 text

Histograms ➠ time series: This. This is a histogram at a single time interval. It shows the frequency of values within a population. Now, only color represents frequency.
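A heatmap can be thought of as one histogram per time window, laid side by side with color for the counts. The sketch below builds that structure from raw observations; the bucket boundaries, the one-minute window, and the `{ ts, value }` record shape are all assumptions chosen for illustration.

```javascript
// Upper bounds of each bucket in milliseconds (assumed; pick ones that suit the data).
const BOUNDS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000];

// Index of the first bucket whose upper bound holds the value;
// anything larger lands in a final overflow bucket.
function bucketIndex(value) {
  const i = BOUNDS.findIndex((b) => value <= b);
  return i === -1 ? BOUNDS.length : i;
}

// Returns a Map from window start (ms) to an array of per-bucket counts,
// i.e. one histogram per time window.
function histogramsByWindow(observations, windowMs = 60000) {
  const windows = new Map();
  for (const { ts, value } of observations) {
    const start = Math.floor(ts / windowMs) * windowMs;
    if (!windows.has(start)) {
      windows.set(start, new Array(BOUNDS.length + 1).fill(0));
    }
    windows.get(start)[bucketIndex(value)] += 1;
  }
  return windows;
}

const heat = histogramsByWindow([
  { ts: 1000, value: 3 },
  { ts: 2000, value: 4 },
  { ts: 3000, value: 750 },
  { ts: 61000, value: 12 },
]);
```

Each map entry is one column of the heatmap. Storing bucket counts per minute keeps far more of the distribution than eight aggregates, at a cost bounded by the number of buckets rather than the number of observations.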

Slide 49

Slide 49 text

A line graph of data.

Slide 50

Slide 50 text

A heatmap of data.

Slide 51

Slide 51 text

Zoomed in on a heatmap.

Slide 52

Slide 52 text

Unfolding to a histogram.

Slide 53

Slide 53 text

Observability: I don’t want to launch into a tutorial on DTrace despite the fact that you can simply spin up an OmniOS AMI in Amazon and have it now. Instead let’s talk about what shouldn’t happen.

Slide 54

Slide 54 text

The production questions: I wonder if that queue is backed up... Performance like that should only happen if our binary tree is badly imbalanced (replace with countless other pathologically bad precipitates of failure); I wonder if it is... It’s almost like some requests are super slow; I wonder if they are. STOP WONDERING.

Slide 55

Slide 55 text

Instrument your software. Instrument your software and systems and stop the wonder. Do it for the kids. This is simple with DTrace & a bit more work otherwise. Avoiding work is not an excuse for ignorance.
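Without DTrace, the "bit more work" can be as small as wrapping the functions you would otherwise wonder about. The helper below is a hypothetical sketch for synchronous functions in Node.js: `instrument`, `record`, and the metric name are invented, and `record` stands in for whatever ships the measurement to your telemetry system.

```javascript
// Hypothetical helper: wrap a synchronous function so every call records
// its latency in milliseconds, whether it returns or throws.
function instrument(
  name,
  fn,
  record,
  clock = () => Number(process.hrtime.bigint()) / 1e6
) {
  return function (...args) {
    const start = clock();
    try {
      return fn.apply(this, args);
    } finally {
      record(name, clock() - start);
    }
  };
}

// Usage: time every enqueue instead of wondering whether the queue is slow.
const latencies = [];
const queue = [];
const enqueue = instrument(
  "queue.enqueue",
  (item) => queue.push(item),
  (name, ms) => latencies.push({ name, ms })
);
enqueue("job-1");
```

The clock is injectable so the wrapper can be tested deterministically. An asynchronous function would need the measurement taken when its promise settles rather than when the call returns.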

Slide 56

Slide 56 text

A tour through our Sauna. We have this software that stores data... happens to store all data visualized in Circonus. We have to get data into the system. We have to get data out of the system. I don’t wonder... here’s why.

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

Data tenet. Do not collect data twice. That which you collect for visualization should be the same data on which you alert.

Slide 60

Slide 60 text

Thank you!