Slide 1

Slide 1 text

Effectively Adding Analytics to DevOps Monitoring
Betsy Nichols, Ph.D., Chief Data Scientist
June 2016
@eanTweet

Slide 2

Slide 2 text

www.netuitive.com
Abstract
•  This presentation is about leveraging analytics as part of a larger framework of collaborators to continuously improve health and performance monitoring.
•  The talk starts with a survey of analytics that apply in environments ranging from very small to huge. Techniques discussed include deterministic and statistical analytics, machine learning, univariate models, and multivariate models. For each technique, I provide examples from (anonymized) cases that illustrate where it can succeed, where it can fail, and why.
•  The second part of the talk describes a framework in which analytics play a role as one of many collaborators. The framework provides key services such as integration of collaborators, orchestration of tasks, feedback/control loops, scenario replay, sensitivity analysis, packaging, and incremental improvement.
•  The final part of the talk describes a use case. It shows how a framework can drive continuing improvement as business conditions evolve and the collaborators mature.
•  This presentation is about tools and techniques that have been effective in environments ranging from tiny to huge. It is not an academic discourse on math, statistics, or probability.
40 min

Slide 3

Slide 3 text

The stories you are about to hear are true. The names have been changed to protect the innocent.

Slide 4

Slide 4 text

Hype Cycle: Analytics for DevOps Monitoring
[Figure: Hyperbole Index 2016; Market Visibility vs. Time, with numbered points 1 to 5]

Slide 5

Slide 5 text

You CAN Live Better with Math The Queen of the Sciences Hint

Slide 6

Slide 6 text

Analytics Fail without Context

Slide 7

Slide 7 text

What is Failure?

Slide 8

Slide 8 text

Good, Bad,
[Figure: matrix of Alarm / No Alarm against Problem / No Problem, with cells labeled TP, TN, FP]

Slide 9

Slide 9 text

Success (model)
$\mathrm{SNR} = \dfrac{\text{Wanted Alarms}}{\text{Unwanted Alarms}} = \dfrac{TP}{FP}$
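
The SNR definition on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the per-interval boolean encoding of alarms and problems, and the handling of the zero-FP case are not from the talk.

```python
def confusion_counts(alarms, problems):
    """Count TP, FP, FN, TN over equal-length sequences of booleans.

    alarms[i]   -- True if the analytic raised an alarm in interval i
    problems[i] -- True if a real problem existed in interval i
    (This per-interval encoding is an assumption made for the sketch.)
    """
    if len(alarms) != len(problems):
        raise ValueError("alarms and problems must have the same length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for alarm, problem in zip(alarms, problems):
        if alarm and problem:
            counts["TP"] += 1      # wanted alarm
        elif alarm:
            counts["FP"] += 1      # unwanted alarm
        elif problem:
            counts["FN"] += 1      # missed problem
        else:
            counts["TN"] += 1      # correctly silent
    return counts


def snr(counts):
    """SNR = wanted alarms / unwanted alarms = TP / FP."""
    if counts["FP"] == 0:
        # Assumption: no unwanted alarms is treated as infinite SNR
        # when at least one wanted alarm exists, and 0.0 otherwise.
        return float("inf") if counts["TP"] else 0.0
    return counts["TP"] / counts["FP"]
```

Note that SNR alone ignores missed problems (FN), so a model that never alarms has no noise but also no signal.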

Slide 10

Slide 10 text

What is Context?

Slide 11

Slide 11 text

Without Context

Slide 12

Slide 12 text

With Context

Slide 13

Slide 13 text

Adaptive ML Analytic (Without Context)
[Figure: Raw Metric Values, Bands of Normalcy, and Anomalies plotted over Time]
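
To make "bands of normalcy" concrete, here is a minimal sketch that flags values falling outside a rolling band. It assumes a trailing-window mean plus or minus k standard deviations; the talk does not say how its adaptive analytic computes the bands, so the method, the function name, and the `window` and `k` parameters are all illustrative assumptions.

```python
from statistics import mean, stdev


def band_anomalies(values, window=5, k=3.0):
    """Return indices of values outside a rolling band of normalcy.

    The band for point i is mean +/- k * stdev of the previous `window`
    points. This is a stand-in technique, not the analytic from the talk.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a stdev")
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = mean(history)
        spread = stdev(history)
        low, high = center - k * spread, center + k * spread
        if not (low <= values[i] <= high):
            flagged.append(i)
    return flagged
```

A detector like this knows nothing about what the metric means, which is the point of the slide: the band adapts to the data, but it has no context.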

Slide 14

Slide 14 text

[Diagram labels: DevOps Workflow, Context, Data, Analytic]

Slide 15

Slide 15 text

With Context (Case 1): FP
[Figure: EC2 CPU Utilization for t2.small (1 vCPU) and m4.large (2 vCPU)]

Slide 16

Slide 16 text

With Context (Case 2): FN
[Figure labels: Good, Response Time, SLA not met]

Slide 17

Slide 17 text

Math is the Queen of the Sciences... Impressive, but not enough.

Slide 18

Slide 18 text

How to Add Context

Slide 19

Slide 19 text

Analytics Ecosystem (!Context)
[Diagram labels: Context, Data, Analytics]

Slide 20

Slide 20 text

Analytics Ecosystem (Context)
[Diagram labels: Context, Data, Analytics, Results, Policy for Action, Feedback]

Slide 21

Slide 21 text

Ecosystem Components

Slide 22

Slide 22 text

Analytics Ecosystem
[Diagram labels: Context, Data, Analytics, Results, Policy for Action, Feedback]

Slide 23

Slide 23 text

Data
•  Time series
•  Metadata
•  Logical models
•  Text
•  JSON
•  Events

Slide 24

Slide 24 text

Goal: Coverage

Slide 25

Slide 25 text

Analytics Ecosystem
[Diagram labels: Context, Data, Analytics, Results, Policy for Action, Feedback]

Slide 26

Slide 26 text

Goal: Enriched Information
[Diagram labels: Results; clean, profile, connect, aggregate, classify, detect anomalies]

Slide 27

Slide 27 text

Wide Repertoire
http://www.mln.io/resources/periodic-table/

Slide 28

Slide 28 text

Analytics Ecosystem
[Diagram labels: Context, Data, Analytics, Results, Policy for Action, Feedback]

Slide 29

Slide 29 text

Totum major summa partum (The whole is greater than the sum of its parts)

Slide 30

Slide 30 text

Service Model: Quant Models
[Diagram: three EC2 instances; labels: Arrival Rate, Completion Rate, Queueing, Auto-Scaling, Latency; bulleted metrics: Queue Length, Service Time, # EC2s, Mean CPU Utilization]

Slide 31

Slide 31 text

[Diagram: EC2 elements with metrics CPU_Utilization, Memory, RunQ, #Processes, NetIO, Swap, CTX]

Slide 32

Slide 32 text

Grouping via Tags
•  Simulate relationships
•  Facilitate aggregation
•  Insanely flexible
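
As an illustration of how tags facilitate aggregation, the sketch below groups elements by the value of one tag and averages a metric per group. The element layout (a dict with `tags` and `metrics` keys), the function name, and the sample tag and metric names are assumptions for the example, not a format described in the talk.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_tag(elements, tag_key, metric):
    """Average `metric` across elements that share a value for `tag_key`.

    Each element is assumed to look like:
        {"tags": {"env": "PROD"}, "metrics": {"cpu": 40.0}}
    Elements without the tag or without the metric are skipped.
    """
    groups = defaultdict(list)
    for element in elements:
        tag_value = element.get("tags", {}).get(tag_key)
        value = element.get("metrics", {}).get(metric)
        if tag_value is not None and value is not None:
            groups[tag_value].append(value)
    return {tag_value: mean(vals) for tag_value, vals in groups.items()}
```

Because the grouping key is just a tag name, the same elements can be regrouped by environment, region, or service without changing the data model.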

Slide 33

Slide 33 text

Context Increases with Grouping
Raw Observations → Elements → Services → Clusters → Applications → Business → Metrics

Slide 34

Slide 34 text

Analytics Ecosystem
[Diagram labels: Context, Data, Analytics, Results, Policy for Action, Feedback]

Slide 35

Slide 35 text

Policy Components
•  Scope
•  Conditions
•  Action(s)

Slide 36

Slide 36 text

Example: Policy for Action
For all EC2s in US West Region with tag="PROD" in ASGx …
If AvgCpuUtil(EC2) deviating
   && RunQ > 2*(# CPUs)
   && ReqRate(ELBx) !deviating
   && AvgLatency(ELBx) > 2 sec
For duration >= 10 minutes
Send critical event
Invoke autoscaleUp on ASGx
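
The policy on this slide could be evaluated along the following lines. This is a hedged sketch: the `Sample` record, the function names, the action strings, and the rule that N consecutive one-minute samples count as N minutes are assumptions. Scope filtering (region, tag, ASG) is assumed to happen before the samples reach this code.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Sample:
    """One evaluation interval for an in-scope EC2 instance and its ELB."""
    cpu_deviating: bool       # AvgCpuUtil(EC2) deviating
    run_queue: float          # RunQ
    cpu_count: int            # number of CPUs on the instance
    req_rate_deviating: bool  # ReqRate(ELBx) deviating
    avg_latency_s: float      # AvgLatency(ELBx), in seconds


def conditions_met(s: Sample) -> bool:
    """All four conditions from the slide, joined with logical AND."""
    return (s.cpu_deviating
            and s.run_queue > 2 * s.cpu_count
            and not s.req_rate_deviating
            and s.avg_latency_s > 2.0)


def policy_actions(samples, interval_min=1.0, duration_min=10.0):
    """Return the actions to take once conditions hold for the full duration.

    Assumption: a run of ceil(duration / interval) consecutive samples
    that all satisfy the conditions counts as "duration >= 10 minutes".
    """
    needed = ceil(duration_min / interval_min)
    streak = 0
    for s in samples:
        streak = streak + 1 if conditions_met(s) else 0
        if streak >= needed:
            return ["send_critical_event", "autoscale_up"]
    return []
```

The requirement that request rate is not deviating is the contextual part: CPU pressure with normal traffic points at capacity, not at a traffic spike.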

Slide 37

Slide 37 text

Analytics Ecosystem
[Diagram labels: Context, Data, Analytics, Results, Policy for Action, Feedback]

Slide 38

Slide 38 text

Use Cases

Slide 39

Slide 39 text

1.  Human in the Loop: Alarm Tuning via Scenario Replay
2.  Automatic ASG Scaling Feedback Control via Service Models

Slide 40

Slide 40 text

Use Case 1: Scenario Replay

Slide 41

Slide 41 text

Alarm Quality

Slide 42

Slide 42 text

Example: Scenario Replay
Context: The non-holiday day-over-day pattern should not vary by much.
[Figure: Transactions/s and Detected Events over Time]

Slide 43

Slide 43 text

Mean Difference Analytic
Mean difference $= \mu_2 - \mu_1$
[Figure: Transactions/s over time up to now; two compared periods of equal width, with means µ1 and µ2, separated by a gap]

Slide 44

Slide 44 text

Configure the Analytic
•  Level of confidence that the two compared intervals' means are different = 99%
•  Gap = 1 day
•  Period width = 2 hours
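
A mean difference test with a configurable confidence level might look like the sketch below. It assumes a two-sided z-test under a normal approximation; the talk does not name the statistical test it uses, so the test choice, the function name, and the parameter names are illustrative. Here `baseline` would hold the Transactions/s samples from the 2-hour period one day earlier and `current` the samples from the most recent 2-hour period.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def means_differ(baseline, current, confidence=0.99):
    """Report whether two sample means differ at the given confidence level.

    Uses a two-sided z-test on the difference of means (an assumed
    stand-in for the analytic in the talk). Each sample needs at least
    two points so that a sample variance can be computed.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    diff = mean(current) - mean(baseline)           # mu2 - mu1
    std_err = sqrt(variance(baseline) / len(baseline)
                   + variance(current) / len(current))
    if std_err == 0.0:
        return diff != 0.0                          # no noise at all
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return abs(diff) / std_err > z_crit
```

Lowering the confidence level lowers the critical value, so the same shift is reported earlier, at the cost of more false positives.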

Slide 45

Slide 45 text

Turn Back the Hands of Time

Slide 46

Slide 46 text

Experiment #1
Context: Caught, but too late

Slide 47

Slide 47 text

Adjust Model Setting
Level of confidence that the compared intervals' means are different: 99% → 90%

Slide 48

Slide 48 text

Turn Back the Hands of Time

Slide 49

Slide 49 text

Experiment #2
Context: Earlier, but too many FP
[Figure: Transactions/s over Time with detections from experiments 1 and 2 overlaid]

Slide 50

Slide 50 text

Adjust Model Setting
Level of confidence that the compared intervals' means are different: 90% → 95%

Slide 51

Slide 51 text

Turn Back the Hands of Time

Slide 52

Slide 52 text

Experiment #3
[Figure: All Transactions; Transactions/s over Time with detections from experiments 1, 2, and 3 overlaid]

Slide 53

Slide 53 text

Analytics Ecosystem
[Diagram labels: Context, Data, Analytics, Results, Policy for Action, ML, Mean Difference, Normal Day, #FP_n, #FN_n, Config_{n+1}]
SNR goal $= \dfrac{\#TP}{\#FP}$
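
One way to read the loop on this slide: replay the same labeled history under several candidate configurations, count TP, FP, and FN for each, and carry the best one forward as the next configuration. The sketch below follows that reading under stated assumptions: alarms are timestamps, known problems are (start, end) windows, a window counts as one TP if any alarm falls inside it, and the selection rule (no misses, then highest TP/FP) is illustrative, not the talk's.

```python
def score(alarm_times, problem_windows):
    """Return (TP, FP, FN) for one replay.

    TP -- problem windows containing at least one alarm
    FP -- alarms that fall outside every problem window
    FN -- problem windows with no alarm
    """
    def inside(t, window):
        start, end = window
        return start <= t <= end

    tp = sum(1 for w in problem_windows
             if any(inside(t, w) for t in alarm_times))
    fn = len(problem_windows) - tp
    fp = sum(1 for t in alarm_times
             if not any(inside(t, w) for w in problem_windows))
    return tp, fp, fn


def next_config(replays, problem_windows):
    """Pick the configuration to try next from a set of replays.

    replays maps a configuration (for example, a confidence level) to
    the alarm timestamps it produced on the replayed scenario.
    Assumed rule: discard configurations that miss a problem, then
    take the one with the highest SNR = TP / FP.
    """
    best_config, best_snr = None, -1.0
    for config, alarm_times in replays.items():
        tp, fp, fn = score(alarm_times, problem_windows)
        if fn > 0:
            continue
        snr = tp / fp if fp else float("inf")
        if snr > best_snr:
            best_config, best_snr = config, snr
    return best_config
```

This scoring does not capture how early an alarm arrives within a window, which mattered in the experiments above; a fuller version would also penalize late detections.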

Slide 54

Slide 54 text

Use Case 2: Feedback-Control

Slide 55

Slide 55 text

Cost Optimization

Slide 56

Slide 56 text


Slide 57

Slide 57 text

Service Model: Quant Models
[Diagram: three EC2 instances in an ASG; labels: Arrival Rate, Completion Rate, Queueing, Auto-Scaling, Latency; bulleted metrics: Queue Length, Response Time, # EC2s, Mean CPU Utilization]

Slide 58

Slide 58 text

Make It Work
# Nodes = 11, Response Time < 3 sec
[Figure: ASG Nodes and Workload (Requests/sec, Max Requests/sec, Max Workload) vs. Time in Hours]

Slide 59

Slide 59 text

Regular Shapes
# Nodes = 11, Response Time < 3 sec
[Figure: Actual Requests/sec]

Slide 60

Slide 60 text

M/M/c Queuing Model
•  λ = arrival rate (Poisson arrivals)
•  µ = average service time (exponentially distributed)
•  c = # servers
•  Servers serve from the front of the queue (FCFS)
•  If there are fewer than c jobs, some servers will be idle
•  If there are more than c jobs, some will queue in a buffer
•  The buffer is of infinite size

Slide 61

Slide 61 text

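Equations (with µ the average service time, as on the previous slide)
Response time: $T = \mu + \dfrac{C(c, \lambda\mu)\,\mu}{c - \lambda\mu}$
The probability that an arriving job is forced to queue is given by Erlang's C formula: $C(c, \lambda\mu) = \dfrac{\frac{(\lambda\mu)^c}{c!}\,\frac{1}{1-\rho}}{\sum_{k=0}^{c-1}\frac{(\lambda\mu)^k}{k!} + \frac{(\lambda\mu)^c}{c!}\,\frac{1}{1-\rho}}$
Intensity: $\rho = \dfrac{\lambda\mu}{c}$
Ref: https://en.wikipedia.org/wiki/M/M/c_queue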

Slide 62

Slide 62 text

M/M/c Model Results
[Figure: Response Time curves for µ = 1.0 s; axis ticks 2, 4, 6, 8, 10]

Slide 63

Slide 63 text

M/M/c Model Application
INPUT:
•  Service Time (µ)
•  Arrival rate of requests (λ)
•  Response Time goal (3s)
OUTPUT:
•  Response Time
•  Mapping: arrival rate → # nodes with constraint: response time < 3s
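
The mapping from arrival rate to node count can be sketched with the standard M/M/c results. The code follows the deck's convention that µ is the average service time (so the offered load is λµ) and computes Erlang C through the numerically stable Erlang B recursion. The function names and the linear search for the smallest node count are assumptions for illustration, not the implementation behind the talk.

```python
from math import floor


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving job has to queue in an M/M/c system.

    offered_load = arrival_rate * service_time (lambda * mu in the deck's
    notation). Computed via the Erlang B recursion to avoid factorials.
    """
    rho = offered_load / servers
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    blocking = 1.0                                  # Erlang B with 0 servers
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    return blocking / (1.0 - rho * (1.0 - blocking))


def response_time(arrival_rate: float, service_time: float, servers: int) -> float:
    """Mean response time = service time + mean time spent waiting in queue."""
    load = arrival_rate * service_time
    wait = erlang_c(servers, load) * service_time / (servers - load)
    return service_time + wait


def min_servers(arrival_rate: float, service_time: float, goal: float) -> int:
    """Smallest number of servers whose mean response time is below `goal`."""
    if goal <= service_time:
        raise ValueError("goal must exceed the service time itself")
    servers = floor(arrival_rate * service_time) + 1   # smallest stable count
    while response_time(arrival_rate, service_time, servers) >= goal:
        servers += 1
    return servers
```

For example, `min_servers(10.0, 1.0, 3.0)` asks how many nodes keep the mean response time under 3 s at 10 requests/s with a 1 s service time. These inputs are made up for the example and are not data from the talk.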

Slide 64

Slide 64 text

Scaling Model
Response Time < 3 sec
[Figure labels: Reserve, OnDemand, Spot; axis: Arrival Rate]

Slide 65

Slide 65 text

Risk Mitigation Factors
•  Target response time: 3 sec
•  Workload metric: aws.elb.elb-c.arrivalrate
•  Update frequency: hourly
•  Max scale-in rate, Max scale-out rate, Min instance counts, Zero thresholds: 1%, 2, 3, unlimited (values listed in slide order; pairing with factors not preserved)

Slide 66

Slide 66 text

Analytics Ecosystem
[Diagram labels: Context, Data, Analytics (M/M/c), Results, Risk, Auto Scale, Policy for Action; symbols: λ_n, µ_n, T_n, c_n, c_{n+1}, T_goal = 3s, K = c_rec, c_n = c_rec]

Slide 67

Slide 67 text

Takeaways

Slide 68

Slide 68 text

Analytics Ecosystem
[Diagram labels: Context, Data, Analytics, Results, Policy for Action, Feedback]

Slide 69

Slide 69 text

Best Analytics = Math + Context

Slide 70

Slide 70 text

Cracks are how the light gets in.

Slide 71

Slide 71 text

Contact Info
Elizabeth (Betsy) Nichols, Ph.D.
Chief Data Scientist
[email protected]
@eanTweet
www.netuitive.com
@netuitive
(703) 464-1500