Effectively Adding Analytics to DevOps Monitoring

Betsy Nichols, Ph.D., Chief Data Scientist
June 2016 @eanTweet

www.netuitive.com
Abstract
•  This presentation is about leveraging analytics as part of a larger framework of collaborators to continuously improve health and performance monitoring.
•  The talk starts with a survey of analytics that apply in environments ranging from very small to huge. Techniques discussed include deterministic and statistical analytics, machine learning, univariate models, and multivariate models. For each technique, I provide examples from (anonymized) cases that illustrate where each can succeed, where each can fail, and why.
•  The second part of the talk describes a framework in which analytics play a role as one of many collaborators. The framework provides key services such as integration of collaborators, orchestration of tasks, feedback/control loops, scenario replay, sensitivity analysis, packaging, and incremental improvement.
•  The final part of the talk describes a use case. It shows how a framework can drive continuing improvement as business conditions evolve and the collaborators mature.
•  This presentation is about tools and techniques that have been effective in environments ranging from tiny to huge. It is not an academic discourse on math, statistics, or probability.
40 min

The stories you are about to hear are true. The names have been changed to protect the innocent.

Hype Cycle: Analytics for DevOps Monitoring, Hyperbole Index 2016
[Figure: Market Visibility vs. Time, with phases numbered 1 to 5]

You CAN Live Better with Math: The Queen of the Sciences
Hint

Analytics Fail without Context

What is Failure?

              Alarm   No Alarm
Problem       TP      FN
No Problem    FP      TN
Good, Bad,

Success (model)
SNR = Wanted Alarms / Unwanted Alarms = TP / FP
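The success measure on this slide (wanted alarms over unwanted alarms) can be sketched as follows. This is a minimal illustration, not the speaker's code: the `AlarmCounts` type and the `tally` and `snr` helpers are hypothetical names, and the inputs are assumed to be parallel lists of booleans (was an alarm raised, was there a real problem).

```python
from dataclasses import dataclass


@dataclass
class AlarmCounts:
    tp: int = 0  # alarm raised, real problem (wanted alarm)
    fp: int = 0  # alarm raised, no problem (unwanted alarm)
    fn: int = 0  # no alarm, real problem (missed)
    tn: int = 0  # no alarm, no problem


def tally(alarms, problems):
    """Count the four outcomes from parallel lists of booleans."""
    counts = AlarmCounts()
    for alarm, problem in zip(alarms, problems):
        if alarm and problem:
            counts.tp += 1
        elif alarm:
            counts.fp += 1
        elif problem:
            counts.fn += 1
        else:
            counts.tn += 1
    return counts


def snr(counts):
    """SNR = TP / FP; infinite when there are wanted alarms and no unwanted ones."""
    if counts.fp == 0:
        return float("inf") if counts.tp else 0.0
    return counts.tp / counts.fp
```

Note that this ratio ignores missed problems (FN), so a model that rarely alarms can still score well; the later cases in the deck show why FN needs to be tracked too.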

What is Context?

Without Context

With Context

Adaptive ML Analytic (Without Context)
[Figure: raw metric values over time, with bands of normalcy and flagged anomalies]
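The slide does not say which adaptive technique produces the bands of normalcy. As a stand-in, the sketch below uses a simple rolling baseline (mean plus or minus k standard deviations over the preceding window); the function name, `window`, and `k` are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(values, window=20, k=3.0):
    """Return indices whose value falls outside mean +/- k*stdev of the preceding window."""
    history = deque(maxlen=window)  # the band adapts as old samples fall out
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            center, spread = mean(history), stdev(history)
            if abs(value - center) > k * spread:
                anomalies.append(i)
        history.append(value)
    return anomalies
```

A detector like this knows nothing about why a metric moved. A planned change that shifts the metric's normal level is flagged just like a real fault, which is the kind of false positive that added context is meant to remove.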

[Diagram: DevOps Workflow, Context, Data, Analytic]

With Context (Case 1): FP
[Figure: EC2 CPU Utilization for t2.small (1 vCPU) and m4.large (2 vCPU)]

With Context (Case 2): FN
Good Response Time, SLA not met

Math is the Queen of the Sciences … Impressive, but not enough.

How to Add Context

Analytics Ecosystem (!Context)
[Diagram: Context, Data, Analytics]

Analytics Ecosystem (Context)
[Diagram: Context, Data, Analytics, Results, Policy for Action, Feedback]

Ecosystem Components

Analytics Ecosystem
[Diagram: Context, Data, Analytics, Results, Policy for Action, Feedback]

Data
•  Time series
•  Metadata
•  Logical models
•  Text
•  JSON
•  Events

Goal: Coverage

Goal: Enriched Information
Results: clean, profile, connect, aggregate, classify, detect anomalies

Wide Repertoire
http://www.mln.io/resources/periodic-table/

Totum major summa partum (The whole is greater than the sum of its parts)

Quant Models: Service Model
[Diagram: EC2 instances in a service model, with Arrival Rate, Completion Rate, and Latency; models shown: Queueing and Auto-Scaling]
Metrics listed on the slide:
•  Queue Length
•  Service Time
•  # EC2's
•  Mean CPU Utilization

EC2 metrics: CPU_Utilization, Memory, RunQ, #Processes, NetIO, Swap, CTX

Grouping via Tags
•  Simulate relationships
•  Facilitate aggregation
•  Insanely flexible
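To make "facilitate aggregation" concrete, here is a minimal sketch of grouping elements by a tag and averaging one metric per group. The element layout (a dict with `tags` and `metrics`) and the function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_tag(elements, tag, metric):
    """Average `metric` across elements that share the same value for `tag`."""
    groups = defaultdict(list)
    for element in elements:
        tags, metrics = element["tags"], element["metrics"]
        # Elements without the tag or the metric simply do not join a group.
        if tag in tags and metric in metrics:
            groups[tags[tag]].append(metrics[metric])
    return {value: mean(samples) for value, samples in groups.items()}
```

Because a group is just "everything carrying this tag value", the same elements can belong to a service, a cluster, and an application at once without a fixed hierarchy.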

Context Increases with Grouping
Raw Observations → Elements → Services → Clusters → Applications → Business → Metrics

Analytics Ecosystem
[Diagram: Context, Data, Analytics, Results, Policy for Action, Feedback]

Policy Components: Scope, Conditions, Action(s)

Example: Policy for Action
For all EC2's in US West Region with tag="PROD" in ASGx …
If AvgCpuUtil(EC2) deviating
&& RunQ > 2*(# CPUs)
&& ReqRate(ELBx) !deviating
&& AvgLatency(ELBx) > 2 sec
For duration >= 10 minutes
Send critical event
Invoke autoscaleUp on ASGx
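The scope, conditions, and actions of this example policy can be sketched in Python as below. This is an illustration under stated assumptions, not a real policy engine or API: the `Sample` fields, the instance dict layout, the one-sample-per-minute cadence, and the action strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One evaluation interval for the scoped group (field names are hypothetical)."""
    cpu_deviating: bool       # AvgCpuUtil(EC2) flagged as deviating
    run_queue: float          # RunQ
    num_cpus: int             # number of CPUs
    req_rate_deviating: bool  # ReqRate(ELBx) flagged as deviating
    avg_latency_s: float      # AvgLatency(ELBx), in seconds


def in_scope(instance):
    """Scope: PROD-tagged EC2 instances in a US West region that belong to ASGx."""
    return (instance["region"].startswith("us-west")
            and "PROD" in instance["tags"]
            and instance["asg"] == "ASGx")


def conditions_hold(s):
    """Conditions: all four clauses of the policy must be true together."""
    return (s.cpu_deviating
            and s.run_queue > 2 * s.num_cpus
            and not s.req_rate_deviating
            and s.avg_latency_s > 2.0)


def evaluate_policy(samples, minutes_per_sample=1, duration_min=10):
    """Actions: returned only if the conditions held for the whole duration."""
    needed = max(1, duration_min // minutes_per_sample)
    recent = samples[-needed:]
    if len(recent) == needed and all(conditions_hold(s) for s in recent):
        return ["send_critical_event", "autoscaleUp:ASGx"]
    return []
```

The combination of clauses carries the context: CPU deviating while the request rate is not suggests the load did not cause the problem, and the duration requirement filters out short spikes.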

Analytics Ecosystem
[Diagram: Context, Data, Analytics, Results, Policy for Action, Feedback]

Use Cases

1.  Human in the Loop: Alarm Tuning via Scenario Replay
2.  Automatic ASG Scaling Feedback Control via Service Models

Use Case 1: Scenario Replay

Alarm Quality

Example: Scenario Replay
Context: The non-holiday day-over-day pattern should not vary by much.
[Figure: Transactions/s and Detected Events over time]

Mean Difference Analytic
mean difference = µ2 - µ1
[Figure: Transactions/s over time up to now; two compare periods of the configured width, with means µ1 and µ2, separated by the gap between the periods to compare]

Configure the Analytic
•  Level of confidence that the compared intervals' two means are different = 99%
•  Gap = 1 day
•  Period width = 2 hours
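The slides do not state which statistical test backs the mean difference analytic. As one plausible reading, the sketch below applies a two-sided z-test to the difference of two sample means; the function name and the choice of test are assumptions. Here `recent` would hold the transactions/s samples from the latest period (for example, the last 2 hours) and `earlier` the samples from the same period one gap (1 day) before.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def means_differ(earlier, recent, confidence=0.99):
    """Return (alarm, mean_difference) for two samples of a metric.

    alarm is True when the means differ at the given two-sided confidence level.
    """
    diff = mean(recent) - mean(earlier)  # mean difference = mu2 - mu1
    std_err = sqrt(variance(earlier) / len(earlier)
                   + variance(recent) / len(recent))
    if std_err == 0:
        return diff != 0, diff
    # Critical value for a two-sided test, e.g. about 2.576 at 99%.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return abs(diff) / std_err > z_crit, diff
```

Lowering `confidence` makes the analytic fire on smaller differences, which means earlier detection at the cost of more false positives. That trade-off is the setting a scenario replay lets you tune against recorded data.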

Turn Back the Hands of Time

Experiment #1
Context: Caught, but too late

Adjust Model Setting
Level of confidence that the compared intervals' means are different: 99% → 90%

Experiment #2 (Overlaid)
Context: Earlier, but too many FP
[Figure: Transactions/s over time, with detections from experiments 1 and 2]

Adjust Model Setting
Level of confidence that the compared intervals' means are different: 90% → 95%

Experiment #3 (Overlaid)
[Figure: All Transactions; Transactions/s over time, with detections from experiments 1, 2, and 3]

Analytics Ecosystem
[Diagram: Context (Normal Day), Data, Analytics (ML + Mean Difference), Results (# FPn, # FNn), Policy for Action; feedback yields Config n+1]
SNR goal = #TP / #FP

Use Case 2: Feedback-Control

Cost Optimization


Quant Models: ASG Service Model
[Diagram: EC2 instances in an ASG service model, with Arrival Rate, Completion Rate, and Latency; models shown: Queueing and Auto-Scaling]
Metrics listed on the slide:
•  Queue Length
•  Response Time
•  # EC2's
•  Mean CPU Utilization

Make It Work
Max Workload → # Nodes = 11, Response Time < 3 sec
[Figure: Requests/sec and ASG Nodes vs. Time in Hours, with Max Requests/sec marked]

Regular Shapes
# Nodes = 11, Response Time < 3 sec
[Figure: Actual Requests/sec]

M/M/c Queuing Model
•  λ = arrival rate; arrivals follow a Poisson process
•  µ = average service time; service times are exponentially distributed
•  c = # servers
•  Servers serve from the front of the queue (FCFS)
•  If there are fewer than c jobs, some servers will be idle
•  If there are more than c jobs, some will queue in a buffer
•  The buffer is of infinite size

Equations
Response time =
The probability that an arriving job is forced to queue is given by Erlang's C formula:
intensity =
Ref: https://en.wikipedia.org/wiki/M/M/c_queue
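The slide's equation images did not survive in the transcript. For reference, the standard M/M/c results are given below, written in this deck's notation, where µ is the mean service time (so the service rate is 1/µ). The cited Wikipedia article uses µ for the service rate instead. The symbol a for the offered load is introduced here for convenience and is not from the slides.

```latex
\begin{align*}
a &= \lambda\mu, \qquad \rho = \frac{a}{c} = \frac{\lambda\mu}{c} < 1
  && \text{(offered load and intensity)} \\
C(c, a) &= \frac{\dfrac{a^{c}}{c!}\,\dfrac{1}{1-\rho}}
  {\displaystyle\sum_{k=0}^{c-1}\frac{a^{k}}{k!} + \frac{a^{c}}{c!}\,\frac{1}{1-\rho}}
  && \text{(Erlang C: probability of queueing)} \\
T &= \mu + C(c, a)\,\frac{\mu}{c - a}
  && \text{(mean response time)}
\end{align*}
```

As a sanity check, with c = 1 the formula for C reduces to ρ and T reduces to µ / (1 - ρ), the familiar M/M/1 response time.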

M/M/c Model Results
[Figure: Response Time for µ = 1.0 s; axis ticks 2, 4, 6, 8, 10]

M/M/c Model Application
INPUT:
•  Service Time (µ)
•  Arrival rate of requests (λ)
•  Response Time goal (3s)
OUTPUT:
•  Response Time
•  Mapping: arrival rate → # nodes with constraint: response time < 3s
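The input-to-output mapping above can be sketched with the standard M/M/c formulas. This is a minimal illustration, not the speaker's implementation: the function names are hypothetical, `service_time` plays the role of µ on these slides (mean service time in seconds), and the Erlang C probability is computed through the numerically stable Erlang B recursion rather than the factorial form.

```python
def erlang_c(c, offered_load):
    """Probability that an arriving job must queue; requires offered_load < c."""
    # Erlang B recursion: B(0) = 1, B(k) = a*B(k-1) / (k + a*B(k-1)).
    blocking = 1.0
    for k in range(1, c + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    rho = offered_load / c
    # Standard conversion from Erlang B to Erlang C.
    return blocking / (1 - rho * (1 - blocking))


def response_time(arrival_rate, service_time, c):
    """Mean response time (queueing delay plus service) with c servers."""
    offered_load = arrival_rate * service_time
    if offered_load >= c:
        return float("inf")  # unstable: the queue grows without bound
    wait = erlang_c(c, offered_load) * service_time / (c - offered_load)
    return service_time + wait


def nodes_needed(arrival_rate, service_time, goal_s=3.0, max_nodes=1000):
    """Smallest node count whose mean response time is below the goal."""
    for c in range(1, max_nodes + 1):
        if response_time(arrival_rate, service_time, c) < goal_s:
            return c
    raise ValueError("goal not reachable within max_nodes")
```

Evaluating `nodes_needed` over a range of arrival rates yields the arrival rate → # nodes mapping. Note that the goal can never be met if the service time alone is at or above the target response time.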

Scaling Model
[Figure: Reserve, OnDemand, and Spot capacity vs. Arrival Rate; Response Time < 3 sec]

Risk Mitigation Factors
•  Target response time: 3 sec
•  Workload metric: aws.elb.elb-c.arrivalrate
•  Update frequency: hourly
•  Max scale-in rate, max scale-out rate, min instance counts, zero thresholds (values on the slide: 1%, 2, 3, unlimited)

Analytics Ecosystem
[Diagram: Context, Data, Analytics (M/M/c), Results, Risk, and Auto Scale Policy for Action in a feedback loop; labels: λn, µn, Tn, cn, cn+1, Tgoal = 3s, K = crec, cn = crec]
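One step of this loop, where the recommended node count passes through risk limits before becoming the next count, might look like the sketch below. The function name and the treatment of each limit are assumptions: scale-out is taken to be unlimited, scale-in is capped per update, and a minimum instance count is enforced. The numbers in the usage are illustrative only, since the transcript does not preserve which value belongs to which risk factor.

```python
def next_node_count(current, recommended, min_count, max_scale_in):
    """Apply guard rails to a recommended node count (hypothetical rules).

    Scale out as far as recommended, scale in by at most max_scale_in
    nodes per update, and never drop below min_count.
    """
    if recommended >= current:
        target = recommended
    else:
        target = max(recommended, current - max_scale_in)
    return max(target, min_count)


# Illustrative values only: stepping down gradually from 11 toward 4 nodes.
count = 11
for _ in range(3):
    count = next_node_count(count, recommended=4, min_count=2, max_scale_in=3)
```

Limiting scale-in while leaving scale-out unrestricted is a deliberately asymmetric choice: shedding capacity too fast risks missing the response time goal, while adding it too fast only costs money.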

Takeaways

Policy for Action, Feedback

Best Analytics = Math + Context

Cracks are how the light gets in.

Contact Info
Elizabeth (Betsy) Nichols, Ph.D., Chief Data Scientist
[email protected]
@eanTweet
www.netuitive.com
@netuitive
(703) 464-1500
