Slide 1

Slide 1 text

Why Analytics Fail
Betsy Nichols, Ph.D., Chief Data Scientist
DevOpsDays Austin, May 2016

Slide 2

Slide 2 text

The stories you are about to hear are true. The names have been changed to protect the innocent. (www.netuitive.com, @eanTweet)

Slide 3

Slide 3 text

Where is DevOps?

Slide 4

Slide 4 text

Hype Cycle: Analytics for DevOps (plot: Market Visibility vs. Time; marker: Analytics for DevOps, Hyperbole Index 2016)

Slide 5

Slide 5 text

Applied Analytics = engineering, math, hacking, computer science

Slide 6

Slide 6 text

Engineering, Computer Science, Math, Hacking

Slide 7

Slide 7 text

Math, Hacking, Engineering, Computer Science

Slide 8

Slide 8 text

Math, Hacking, Engineering, Computer Science

Slide 9

Slide 9 text

Math, Hacking, Engineering, Computer Science

Slide 10

Slide 10 text

Types of Analytics
• Off-line analytics (“Reporting”)
  o Trends over hours, weeks, or months
  o Optimization strategies
  o Recommendations
  o Business intelligence
• Hybrid
• Near real-time analytics (“Monitoring”)
  o Detection
  o Troubleshooting
  o Remediation

Slide 11

Slide 11 text

Reporting

Slide 12

Slide 12 text

Report: ASG Capacity vs Utilization (plot over Time: ASG Group # Nodes Provisioned, 15; ASG Group 95th Percentile CPU Utilization)

Slide 13

Slide 13 text

Report: ASG Capacity vs Utilization (plot over Time: ASG Group # Nodes Provisioned, 15; ASG Group 95th Percentile CPU Utilization; chart annotated with $ signs)

Slide 14

Slide 14 text

Monitoring

Slide 15

Slide 15 text

Good, Bad, (matrix of Alarm / No Alarm against Problem / No Problem: Problem with Alarm = TP; No Problem with No Alarm = TN; No Problem with Alarm = FP)

Slide 16

Slide 16 text

Success (model): $\mathrm{SNR} = \dfrac{\text{Wanted Alarms}}{\text{Unwanted Alarms}} = \dfrac{TP}{FP}$
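Editor's sketch (not from the deck): the SNR ratio on this slide as a small Python helper. The function name and the handling of zero unwanted alarms are assumptions made for illustration.

```python
def alarm_snr(true_positives: int, false_positives: int) -> float:
    """Signal-to-noise ratio of an alarm stream: wanted alarms / unwanted alarms."""
    if true_positives < 0 or false_positives < 0:
        raise ValueError("alarm counts must be non-negative")
    if false_positives == 0:
        # Assumed convention: no unwanted alarms gives an unbounded ratio,
        # unless there were no alarms at all, in which case report 0.0.
        return float("inf") if true_positives > 0 else 0.0
    return true_positives / false_positives


# Example: 45 wanted alarms and 15 unwanted alarms in the same period.
snr = alarm_snr(true_positives=45, false_positives=15)
```

A ratio above 1 means operators see more wanted than unwanted alarms; the counts in the example are made up.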

Slide 17

Slide 17 text

Goal:

Slide 18

Slide 18 text

Analytics: Deterministic and Statistical

Slide 19

Slide 19 text

Deterministic Analytics* (*Current DevOps state of the art)

Slide 20

Slide 20 text

Common Deterministic Analytics
+ Static Thresholds
+ Counting & Transformations
+ Eyeballing Dashboards

Slide 21

Slide 21 text

Dashboards: Visualization

Slide 22

Slide 22 text

Dashboards: Do Not Scale (plot: Pr(Missed Alarm) vs. # Metrics)

Slide 23

Slide 23 text

False Negative: EC2 Cost Explosion. Duration: 13 hours. Cost: $thousands. Anomalies unnoticed due to lack of automation. (plot: Cost Per Hour vs. Hour; annotations: Credentials stolen, Credentials revoked)

Slide 24

Slide 24 text

Static Thresholds: Intuitive (plot: Cost Per Hour vs. Hour)

Slide 25

Slide 25 text

Static Thresholds: Hard To Set (plot: Cost Per Hour vs. Hour)

Slide 26

Slide 26 text

Static Thresholds: Monotonic Metrics. Anomalies Hidden in Cumulative Data

Slide 27

Slide 27 text

Static Thresholds: Monotonic Metrics. Anomalies Hidden in Cumulative Data

Slide 28

Slide 28 text

Counting + Transformations (Uni-Variate)
• Delta: raw[n] → (raw[n] - raw[n-1])
• Rate: raw[n] → (raw[n] / time)
• Scale: raw[n] → (raw[n] * constant)
• Min: raw[n] → min(raw[…])
• Max: raw[n] → max(raw[…])
• RHMAX: raw[n] → (raw[n] / max(raw))
• Frequency: range(x) → # observations
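Editor's sketch (not from the deck): several of the uni-variate transformations above written as plain Python functions. The function names are hypothetical, and because the slide elides the window in min(raw[…]) and max(raw[…]), the running min/max over all observations so far is an assumption.

```python
from itertools import accumulate
from typing import List, Sequence


def delta(raw: Sequence[float]) -> List[float]:
    """Delta: raw[n] -> raw[n] - raw[n-1] (output is one element shorter)."""
    return [raw[n] - raw[n - 1] for n in range(1, len(raw))]


def rate(raw: Sequence[float], interval_seconds: float) -> List[float]:
    """Rate: raw[n] -> raw[n] / time, with one fixed sampling interval assumed."""
    return [value / interval_seconds for value in raw]


def scale(raw: Sequence[float], constant: float) -> List[float]:
    """Scale: raw[n] -> raw[n] * constant."""
    return [value * constant for value in raw]


def running_min(raw: Sequence[float]) -> List[float]:
    """Min: smallest value seen so far (assumed window: all prior samples)."""
    return list(accumulate(raw, min))


def running_max(raw: Sequence[float]) -> List[float]:
    """Max: largest value seen so far (assumed window: all prior samples)."""
    return list(accumulate(raw, max))


def relative_to_max(raw: Sequence[float]) -> List[float]:
    """RHMAX as written on the slide: raw[n] -> raw[n] / max(raw)."""
    peak = max(raw)
    if peak == 0:
        raise ValueError("max(raw) is zero; ratio is undefined")
    return [value / peak for value in raw]


# A cumulative (monotonic) counter hides the jump; its delta exposes it.
cumulative_requests = [10, 15, 15, 40]
per_interval = delta(cumulative_requests)
```

The delta of the made-up counter above is `[5, 0, 25]`, which makes the last interval stand out in a way the cumulative series does not.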

Slide 29

Slide 29 text

Transformation: Monotonic Metrics (panels: Sum, Delta). m[n] = x[n] - x[n-1]

Slide 30

Slide 30 text

Transformation: Frequency. Possible alternative to static threshold. F(range) = count
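Editor's sketch (not from the deck): one way to read F(range) = count is to bucket observations into fixed-width ranges and treat rarely populated ranges as suspicious. The bin width, the `min_count` cutoff, and both function names are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Range = Tuple[int, int]


def frequency(values: Iterable[float], bin_width: int) -> Dict[Range, int]:
    """F(range) = count: number of observations in each [lower, upper) range."""
    counts: Counter = Counter()
    for value in values:
        lower = math.floor(value / bin_width) * bin_width
        counts[(lower, lower + bin_width)] += 1
    return dict(counts)


def rare_ranges(freq: Dict[Range, int], min_count: int) -> List[Range]:
    """Ranges observed fewer than min_count times, in ascending order."""
    return sorted(r for r, count in freq.items() if count < min_count)


# Made-up hourly cost observations; one value falls far from the rest.
costs = [12, 18, 25, 27, 31, 38, 95]
freq = frequency(costs, bin_width=10)
unusual = rare_ranges(freq, min_count=2)
```

Unlike a static threshold, the cutoff here is on how often a range has been seen, not on the metric value itself.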

Slide 31

Slide 31 text

Transforms: Seasonal Anomaly

Slide 32

Slide 32 text

Statistical Analytics* (*Where the state of the art is heading)

Slide 33

Slide 33 text

Statistical Analytics Assumption: “What’s past is prologue”* (*The Tempest by William Shakespeare, Act II, Scene I)

Slide 34

Slide 34 text

+ Correlation Models (scale shown: r = 1, 0.8, 0.4, 0, -0.4, -0.8, -1)
+ Machine Learning

Slide 35

Slide 35 text

Correlation (plot: Revenues/sec vs. Requests/sec; points t1 through t7 accumulate across frames; Confidence Interval = x%)

Slide 36

Slide 36 text

Correlation Analytic: Pearson Product Moment Coefficient of Correlation for two metrics X and Y

$$ r = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}} $$
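Editor's sketch (not from the deck): the Pearson coefficient for two metrics computed directly from its definition in plain Python. The function name and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long metric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cross = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the squared-deviation sums.
    ss_x = sum((x - mu_x) ** 2 for x in xs)
    ss_y = sum((y - mu_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cross / math.sqrt(ss_x * ss_y)


# Made-up samples: revenue rises in step with requests, so r is close to 1.
requests_per_sec = [100, 200, 300, 400]
revenue_per_sec = [10, 20, 30, 40]
r = pearson_r(requests_per_sec, revenue_per_sec)
```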

Slide 37

Slide 37 text

2D: Correlated Metrics (r~0.91) (plot: Revenue/sec vs. Requests/sec)

Slide 38

Slide 38 text

2D: Correlation Anomaly (r~0.51) (plot: Revenue/sec vs. Requests/sec)

Slide 39

Slide 39 text

High Correlation (r~0.99) (plot: Revenue/sec vs. Requests/sec)

Slide 40

Slide 40 text

Correlation: Non-Linear Relationships. For each pattern r = 0.00
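Editor's sketch (not from the deck): a concrete non-linear pattern with zero linear correlation. The parabola below is an assumed example, not one of the slide's patterns; y is fully determined by x, yet the Pearson coefficient is 0.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    cross = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mu_x) ** 2 for x in xs)
    ss_y = sum((y - mu_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# y = x^2 sampled symmetrically around zero: a perfect but non-linear relation.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]
r_parabola = pearson_r(xs, ys)
```

The positive and negative halves cancel in the numerator, so a correlation-based check alone would report these two metrics as unrelated.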

Slide 41

Slide 41 text

Statistical Machine Learning

Slide 42

Slide 42 text

Phase I: Learning (diagram elements: Metrics, Learning Engine, Model)

Slide 43

Slide 43 text

Phase II: Detection (diagram elements: Metrics, Model, Learning Engine, Anomalies)

Slide 44

Slide 44 text

Phase III: Adaptive Learning (diagram elements: Metrics, Learning Engine, Anomalies, Adaptations, New Model)

Slide 45

Slide 45 text

Learned Bands of Normalcy (plot: Values = Raw Metric Observations $X_i$, 0 to 200, over Time 05/01 to 05/08, with “Bands of Normalcy”). Multi-variate band: $\sigma_n(X_i \mid X_j)$. Uni-variate band: $\sigma_n(X_i^{W})$.
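Editor's sketch (not from the deck): a simple uni-variate band of normalcy, assuming the band is the mean of the last W observations plus or minus n standard deviations. The deck does not specify its learning method, so the window, the n-sigma rule, and the function names are all assumptions.

```python
import statistics
from typing import Sequence, Tuple


def univariate_band(
    history: Sequence[float], window: int, n_sigma: float
) -> Tuple[float, float]:
    """Return (low, high) from the most recent `window` observations."""
    recent = list(history[-window:])
    if len(recent) < 2:
        raise ValueError("need at least 2 observations to estimate spread")
    mean = statistics.fmean(recent)
    spread = statistics.stdev(recent)  # sample standard deviation
    return mean - n_sigma * spread, mean + n_sigma * spread


def is_anomaly(value: float, band: Tuple[float, float]) -> bool:
    """True when the observation falls outside the learned band."""
    low, high = band
    return not (low <= value <= high)


# Made-up observations hovering around 100.
history = [100, 102, 98, 101, 99]
band = univariate_band(history, window=5, n_sigma=3.0)
```

A multi-variate band would condition the expected value on other metrics; that is not attempted here.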

Slide 46

Slide 46 text

Learning: Web Page Views (plot over Time; legend: observed values, uni-variate, multi-variate)

Slide 47

Slide 47 text

Learning @ Work: Deviation from Norm (plot legend: observed values; uni-variate band of normalcy, narrowing)

Slide 48

Slide 48 text

Credit Card Authorizations: Deviations (plot legend: observed values, uni-variate, multi-variate)

Slide 49

Slide 49 text

Nuance: New Normal (plot: CPU Utilization for t2.small (1 vcpu) and m4.large (2 vcpu); annotation: Save $$$; legend: observed values, uni-variate, multi-variate)

Slide 50

Slide 50 text

Nuance: Bad Change (plot: Response Time, regions labeled Good and Bad; annotation: SLA Not Met; legend: observed values, uni-variate, multi-variate)

Slide 51

Slide 51 text

How do you tell the difference? New Normal vs Long Running Problem

Slide 52

Slide 52 text

Nuance: Memory Leak (plot: % Heap Used over time, sawtooth with repeated Garbage collection drops)
Algorithm (only if no anomalies):
1. Take the sequence of relative minimum values.
2. Fit a linear regression model. Test goodness of fit.
3. Use the model to predict future relative min values.
4. Alarm two hours before the predicted relative min of heap used = 90%.
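Editor's sketch (not from the deck): the memory-leak steps above in plain Python, assuming samples arrive as (hours, % heap used) pairs. The R² cutoff of 0.9, the minimum of three troughs, and all function names are assumptions; the "only if no anomalies" precondition is left to the caller.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time in hours, % heap used)


def relative_minima(samples: Sequence[Sample]) -> List[Sample]:
    """Step 1: troughs, i.e. points lower than the previous sample and not above the next."""
    return [
        samples[i]
        for i in range(1, len(samples) - 1)
        if samples[i][1] < samples[i - 1][1] and samples[i][1] <= samples[i + 1][1]
    ]


def fit_line(points: Sequence[Sample]) -> Tuple[float, float, float]:
    """Step 2: least-squares line; returns (slope, intercept, R^2)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    ss_tt = sum((t - mean_t) ** 2 for t, _ in points)
    if ss_tt == 0:
        raise ValueError("all trough timestamps are identical")
    ss_ty = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = ss_ty / ss_tt
    intercept = mean_y - slope * mean_t
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r_squared


def hours_until_limit(
    samples: Sequence[Sample], limit_pct: float = 90.0, min_r2: float = 0.9
) -> Optional[float]:
    """Step 3: hours from the last sample until troughs are predicted to hit the limit."""
    minima = relative_minima(samples)
    if len(minima) < 3:
        return None  # too few troughs to trust a trend
    slope, intercept, r_squared = fit_line(minima)
    if slope <= 0 or r_squared < min_r2:
        return None  # no upward trend, or the line fits poorly
    return (limit_pct - intercept) / slope - samples[-1][0]


def should_alarm(samples: Sequence[Sample], lead_hours: float = 2.0) -> bool:
    """Step 4: alarm once the predicted crossing is within the lead time."""
    remaining = hours_until_limit(samples)
    return remaining is not None and remaining <= lead_hours


# Made-up sawtooth: post-GC troughs climb 5 points per hour.
heap = [(0, 60), (0.5, 40), (1, 65), (1.5, 45), (2, 70),
        (2.5, 50), (3, 75), (3.5, 55), (4, 80)]
remaining_hours = hours_until_limit(heap)
```

With this made-up data the troughs reach 90% about 6.5 hours after the last sample, so `should_alarm(heap)` stays false until that estimate drops to two hours.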

Slide 53

Slide 53 text

What metrics are relevant? A very specific failure mode

Slide 54

Slide 54 text

Math is the Queen of the Sciences … Impressive, but not enough.

Slide 55

Slide 55 text

Context is Important

Slide 56

Slide 56 text

Adding Context

Slide 57

Slide 57 text

Selection

Slide 58

Slide 58 text

Workflow Bakes in Context (diagram elements: Data, Analytics, Results, Actions, with Context attached at two stages)

Slide 59

Slide 59 text

Take Aways

Slide 60

Slide 60 text

Best Analytics = Math +

Slide 61

Slide 61 text


Slide 62

Slide 62 text

Contact Info
Elizabeth Nichols, Ph.D.
Chief Data Scientist
[email protected]
@eantweet
www.netuitive.com
@netuitive
(703) 464-1500