
Effectively Adding Analytics to DevOps Monitoring

This presentation is about leveraging analytics as part of a larger framework of collaborators to continuously improve health and performance monitoring.

Elizabeth Nichols

June 30, 2016

Transcript

  1. Abstract
     • This presentation is about leveraging analytics as part of a larger framework of collaborators to continuously improve health and performance monitoring.
     • The talk starts with a survey of analytics that apply in environments ranging from very small to huge. Techniques discussed include deterministic and statistical analytics, machine learning, univariate models, and multivariate models. For each technique, I provide examples from (anonymized) cases that illustrate where each can succeed, where each can fail, and why.
     • The second part of the talk describes a framework in which analytics play a role as one of many collaborators. The framework provides key services such as integration of collaborators, orchestration of tasks, feedback/control loops, scenario replay and sensitivity analysis, packaging, and incremental improvement.
     • The final part of the talk describes a use case. It shows how a framework can drive continuing improvement as business conditions evolve and the collaborators mature.
     • This presentation is about tools and techniques that have been effective in environments ranging from tiny to huge. It is not an academic discourse on math, statistics, or probability. (40 min)
  2. The stories you are about to hear are true. The names have been changed to protect the innocent.
  3. Math is the Queen of the Sciences … impressive, but not enough.
  4. Data
     • Time series
     • Metadata
     • Logical models
     • Text
     • JSON
     • Events
  5. [Diagram: quant models for a fleet of EC2s. A queueing service model relates Arrival Rate and Completion Rate to Queue Length, Service Time, and Latency; the Auto-Scaling view tracks # EC2s and Mean CPU Utilization.]
  6. Context Increases with Grouping: Raw Observations (Metrics) → Elements → Services → Clusters → Applications → Business
  7. Example: Policy for Action
     For all EC2s in US West Region with tag="PROD" in ASGx …
       If AvgCpuUtil(EC2) deviating
          && RunQ > 2 * (# CPUs)
          && ReqRate(ELBx) !deviating
          && AvgLatency(ELBx) > 2 sec
       For duration >= 10 minutes:
          Send critical event
          Invoke autoscaleUp on ASGx
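A minimal sketch of how a policy like this might be evaluated in code. The snapshot fields, the one-minute sampling cadence, and the autoscaleUp placeholder are illustrative assumptions, not the deck's implementation:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    avg_cpu_deviating: bool   # anomaly flag from the CPU baseline analytic
    run_q: float              # run-queue length
    n_cpus: int
    req_rate_deviating: bool  # ELB request-rate anomaly flag
    avg_latency_s: float      # ELB average latency in seconds

def policy_breached(s: Snapshot) -> bool:
    """One evaluation of the slide's compound condition."""
    return (s.avg_cpu_deviating
            and s.run_q > 2 * s.n_cpus
            and not s.req_rate_deviating
            and s.avg_latency_s > 2.0)

def evaluate(history: list[Snapshot], min_duration: int = 10) -> None:
    """Act only when the condition holds for min_duration consecutive
    one-minute snapshots (the slide's 'for duration >= 10 minutes')."""
    recent = history[-min_duration:]
    if len(recent) == min_duration and all(map(policy_breached, recent)):
        print("CRITICAL event; invoking autoscaleUp on ASGx")  # placeholder
```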
  8. 1. Human in the Loop: Alarm Tuning via Scenario Replay
     2. Automatic ASG Scaling: Feedback Control via Service Models
  9. Example: Scenario Replay
     [Chart: transactions/s over time for two days, with detected events marked.]
     Context: the non-holiday day-over-day pattern should not vary by much.
  10. Mean Difference Analytic
      [Chart: transactions/s up to now. Two compare periods of equal width, separated by a gap, have means µ1 and µ2; mean difference = µ2 − µ1.]
  11. Configure the Analytic
      Level of confidence that the two compared intervals' means are different = 99%
      Gap = 1 day; period width = 2 hours
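A short sketch of the analytic as configured above. The deck does not say which statistical test backs the confidence setting; Welch's t-test is one plausible choice, and the sampling layout is assumed:

```python
import numpy as np
from scipy import stats

def mean_difference_alarm(series, now, width_s=2 * 3600, gap_s=24 * 3600,
                          confidence=0.99):
    """Compare the last width_s of data against the same-width window one
    gap_s earlier (slide settings: width = 2 h, gap = 1 day). Returns
    (mean difference, fired), firing when the two window means differ at
    the requested confidence level."""
    # series: iterable of (timestamp_seconds, value) samples
    cur = [v for t, v in series if now - width_s <= t <= now]
    ref = [v for t, v in series if now - gap_s - width_s <= t <= now - gap_s]
    mu1, mu2 = np.mean(ref), np.mean(cur)
    _, p_value = stats.ttest_ind(cur, ref, equal_var=False)  # Welch's t-test
    return mu2 - mu1, p_value < (1.0 - confidence)
```

Lowering the confidence setting (slides 12-13) makes the alarm fire on smaller day-over-day differences; raising it suppresses all but the strongest deviations.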
  12. Adjust Model Setting
      Level of confidence that the compared intervals' means are different: 99% → 90%
  13. Adjust Model Setting
      Level of confidence that the compared intervals' means are different: 90% → 95%
  14. Analytics Ecosystem
      [Diagram: feedback loop. Context and Data feed the Mean Difference analytic; replaying a normal day yields Results (#FPn, #FNn); an ML step turns those results into Confign+1; Context also shapes the Policy for Action. SNR goal = #TP / #FP.]
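A toy sketch of the tuning loop that diagram implies: replay a labeled normal day, count false positives, and nudge the confidence setting until the SNR goal is met. The replay and labeling interfaces are hypothetical:

```python
def tune_confidence(replay, labeled_incidents, confidence=0.90,
                    snr_goal=5.0, step=0.01, max_conf=0.999):
    """Raise the mean-difference confidence setting until #TP / #FP
    meets the SNR goal on a replayed, hand-labeled day.
    replay(confidence) -> set of fired event times;
    labeled_incidents  -> set of times a human marked as real."""
    while confidence < max_conf:
        fired = replay(confidence)
        tp = len(fired & labeled_incidents)   # true positives
        fp = len(fired - labeled_incidents)   # false positives
        if fp == 0 or tp / fp >= snr_goal:
            return confidence                 # becomes Config_{n+1}
        confidence += step                    # fewer, stronger alarms
    return max_conf
```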
  15. [Diagram: the slide 5 quant models, now attached to an ASG. The queueing service model relates Arrival Rate and Completion Rate to Queue Length, Response Time, and Latency; the Auto-Scaling side tracks # EC2s and Mean CPU Utilization.]
  16. Make It Work
      [Chart: workload in requests/sec over hours, with the max workload marked; the ASG node count tracks it. At max workload, # nodes = 11 keeps response time < 3 sec.]
  17. M/M/c Queuing Model
      • λ = arrival rate, with Poisson arrivals
      • µ = service rate; service times are exponentially distributed with mean 1/µ
      • c = # servers
      • Servers serve from the front of the queue (FCFS)
      • If there are fewer than c jobs, some servers are idle
      • If there are more than c jobs, the excess jobs queue in a buffer
      • The buffer is of infinite size
  18. Equations
      Intensity (offered load per server): $\rho = \frac{\lambda}{c\mu}$
      The probability that an arriving job is forced to queue is given by Erlang's C formula:
      $C(c, \lambda/\mu) = \dfrac{\frac{(c\rho)^c}{c!}\cdot\frac{1}{1-\rho}}{\sum_{k=0}^{c-1}\frac{(c\rho)^k}{k!} + \frac{(c\rho)^c}{c!}\cdot\frac{1}{1-\rho}}$
      Mean response time: $W = \dfrac{C(c, \lambda/\mu)}{c\mu - \lambda} + \dfrac{1}{\mu}$
      Ref: https://en.wikipedia.org/wiki/M/M/c_queue
  19. M/M/c Model Results
      [Chart: response time vs. arrival rate for c = 2, 4, 6, 8, 10 servers, with mean service time µ = 1.0 s.]
  20. M/M/c Model Application
      INPUT:
      • Service time (µ)
      • Arrival rate of requests (λ)
      • Response time goal (3 s)
      OUTPUT:
      • Response time
      • Mapping: arrival rate → # nodes, with constraint: response time < 3 s
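A compact sketch of that mapping, implementing the slide 18 equations; an illustration of the computation rather than the deck's production code:

```python
import math

def erlang_c(c: int, a: float) -> float:
    """Probability an arriving job queues; a = λ/µ is the offered load."""
    rho = a / c
    if rho >= 1.0:
        return 1.0  # unstable: the queue grows without bound
    top = (a ** c / math.factorial(c)) / (1 - rho)
    return top / (sum(a ** k / math.factorial(k) for k in range(c)) + top)

def response_time(lam: float, mu: float, c: int) -> float:
    """Mean response time W for an M/M/c queue (mu = service rate)."""
    if lam >= c * mu:
        return math.inf
    return erlang_c(c, lam / mu) / (c * mu - lam) + 1.0 / mu

def nodes_needed(lam: float, mu: float, goal_s: float = 3.0) -> int:
    """Smallest c whose response time is under the goal (the mapping above)."""
    c = max(1, math.ceil(lam / mu))  # start at the stability boundary
    while response_time(lam, mu, c) >= goal_s:
        c += 1
    return c
```

For example, with the 1.0 s mean service time from slide 19 (µ = 1 job/s per node) and an assumed peak of λ = 10 requests/s, nodes_needed(10, 1.0) returns 11, consistent with the "# nodes = 11" sizing on slide 16.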
  21. Risk Mitigation Factors
      • Max scale-in rate: 1%
      • Max scale-out rate: unlimited
      • Min instance count: 2
      • Zero thresholds: 3
      • Target response time: 3 sec
      • Workload metric: aws.elb.elb-c.arrivalrate
      • Update frequency: hourly
  22. Analytics Ecosystem
      [Diagram: closed control loop. Context and Data feed the M/M/c analytic with inputs λn, µn, Tn and goal Tgoal = 3 s; the analytic recommends a capacity crec (K = crec). Risk mitigation constrains the recommendation, the Auto Scale policy for action applies cn+1 = crec, and the resulting cn feeds back into the data.]
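Putting the loop together, a hypothetical hourly controller step; the risk limits mirror slide 21, and nodes_needed() is the M/M/c sketch from slide 20:

```python
import math

def control_step(lam: float, mu: float, c_now: int, goal_s: float = 3.0,
                 min_instances: int = 2, max_scale_in_frac: float = 0.01) -> int:
    """One update of the auto-scaling control loop: take the M/M/c
    recommendation, then apply the risk-mitigation clamps."""
    c_rec = nodes_needed(lam, mu, goal_s)  # defined in the slide 20 sketch
    if c_rec >= c_now:
        return c_rec  # scale-out rate is unconstrained
    # Scale in by at most max_scale_in_frac of current capacity per update,
    # and never drop below the minimum instance count.
    floor = max(min_instances, math.floor(c_now * (1 - max_scale_in_frac)))
    return max(c_rec, floor)
```

The asymmetry is deliberate: under-provisioning breaks the response-time goal immediately, while over-provisioning only costs money, so the loop scales out freely and scales in cautiously.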