
Canary Evaluation (for fun and profit)

You're finally deploying canaries, now what?

The talk covers:
- What to compare and how?
- Basics of distribution analysis.
- Tips to reduce noise.
- Best practices of Canary Evaluation.

Artem Yakimenko

February 27, 2020

Transcript

  1. Canarying is a partial deployment of a change in a small fraction of a service and its evaluation. The part of the service that receives the change is "the canary," and the remainder of the service is "the control." Canarying is effectively an A/B testing process.
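     To make the canary/control split above concrete, here is a minimal sketch in Ruby of deterministic traffic bucketing (hashing on a request ID is an illustrative choice, not something prescribed by the talk):

        require 'digest'

        # Deterministically assign a request to the canary or the control
        # population based on the configured canary traffic percentage.
        def assign_population(request_id, canary_percent)
          bucket = Digest::MD5.hexdigest(request_id.to_s).to_i(16) % 100
          bucket < canary_percent ? :canary : :control
        end

        assign_population("req-1234", 1) #=> :canary or :control, stable per request ID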
  2. General process overview (pipeline diagram): the prod artefact is rolled out through a Testing stage (behind a test frontend), then through Canary 0.1%, Canary 1%, and further Canary ...% stages (behind the public frontend), with an Eval step after each stage.
  3. Some important things about deployment...
     • Canary promotion stages should correspond to roughly exponential increases in traffic (see the sketch below).
     • Canaries should start from the least practical percentage of traffic, i.e. the smallest percentage at which you still get a usable signal.
       ◦ Advanced: if you have risk dimensions other than traffic, adjust the stages to take those into account as well.
     • You should avoid discriminating by location or user base.
       ◦ This skews your data and can degrade the user experience for a particular demographic.
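     As a rough illustration of the exponential stage idea, a minimal sketch in plain Ruby (the default start percentage and growth multiplier are examples, not prescriptions from the talk):

        # Build an exponential canary promotion schedule.
        # start_pct: smallest traffic share that still yields a usable signal.
        # factor:    growth multiplier between stages (roughly exponential).
        def promotion_stages(start_pct: 0.1, factor: 10, max_pct: 100)
          stages = []
          pct = start_pct
          while pct < max_pct
            stages << pct
            pct *= factor
          end
          stages << max_pct
          stages
        end

        promotion_stages #=> [0.1, 1.0, 10.0, 100]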
  4. Evaluation requires some elbow grease
     • Canary deployment is more-or-less a widely solved problem:
       ◦ Modern container engines: K8s, Mesos, Nomad
       ◦ Cloud/VM orchestrators: AWS/GCP instance groups, OpenStack
  5. Evaluation requires some elbow grease
     • Canary evaluation is not widely solved, however:
       ◦ Most books/articles on the topic heavily handwave around the evaluation part:
         ▪ "Make sure it looks right, then..."
         ▪ "If it's functioning correctly, proceed with..."
       ◦ Solutions that do exist are either:
         ▪ Proof-of-concept/simplistic (Pulumi canary strategies)
         ▪ Limited to simple static metric thresholds (Flagger, Kanarini)
         ▪ Hard to integrate outside of their specific environment (Kayenta)
  6. A note on data quality...
     • Before working on canarying you need good metrics. Garbage in -> garbage out.
     • Metrics need to be carefully selected. Focus on the customer, on preventing past incidents, and on your dependent services.
       ◦ Prefer application metrics (e.g. not "CPU is X%, memory is Y%" but "Has the user been served a request?")
       ◦ Make sure you're using "canonical" error codes, i.e. known good vs. known valid/bad (see the sketch below)
       ◦ Avoid queries/metrics with grouping
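     As an illustration of canonical error classification, a minimal Ruby sketch (the specific status-code mapping is hypothetical):

        # Map raw response codes onto canonical buckets so the evaluator
        # compares "known bad" errors rather than raw status codes.
        CANONICAL_CODES = {
          200 => :ok,
          404 => :known_valid,   # expected client outcome, not a canary signal
          429 => :known_valid,
          500 => :known_bad,
          503 => :known_bad
        }.freeze

        def canonical_bucket(status_code)
          CANONICAL_CODES.fetch(status_code, :unknown)
        end

        canonical_bucket(503) #=> :known_bad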
  7. Static target - Pros
     • Easier to implement (a simple query on the metric, no need to compare 2 streams; see the sketch below)
     • Some reasonable solutions already exist: Flagger, Kanarini
     • Doesn't require much statistical knowledge
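     For contrast with the comparative approach later in the deck, a minimal Ruby sketch of a static-target check (metric names and thresholds are made up for illustration):

        # Static-target evaluation: each metric is checked against a fixed,
        # hand-maintained threshold, which is where the toil and the
        # false positive/negative issues on the next slide come from.
        STATIC_THRESHOLDS = {
          "error_rate_percent" => 1.0,   # fail above 1% errors
          "p95_latency_ms"     => 250.0  # fail above 250 ms p95 latency
        }.freeze

        def canary_passes?(metrics)
          STATIC_THRESHOLDS.all? { |name, limit| metrics.fetch(name) <= limit }
        end

        canary_passes?("error_rate_percent" => 0.4, "p95_latency_ms" => 180.0) #=> true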
  8. Static target - Cons
     • Higher false positive / false negative rate
     • Maintenance of static comparisons is toilsome
     • Doesn't catch "common-law" SLO violations
     • It's easy to set arbitrary thresholds which alert on the wrong things
  9. What is smart target canarying?
     • A comparative evaluator, used to evaluate the canary metrics against the control population
     • Leveraging statistical analysis to:
       ◦ Reliably compare canary and control metrics
       ◦ De-noise metrics and rule out false positives
       ◦ Exclude random events
  10. Smart target - Pros
      • Much better false positive / false negative rate
      • Much easier to maintain once the math is in (a matter of tuning coefficients, e.g. the significance level)
  11. Smart target - Cons
      • Easily confused and/or requires significant work in some specific cases (notably bimodal distributions)
  12. Smart target - Cons
      • Easily confused and/or requires significant work in some specific cases (notably bimodal distributions)
      • The only existing OSS solution, Kayenta, is very tightly integrated with its parent project (Spinnaker)
  13. Smart target - Cons
      • Easily confused and/or requires significant work in some specific cases (notably bimodal distributions)
      • The only existing OSS solution, Kayenta, is very tightly integrated with its parent project (Spinnaker)
      • Statistics
  14. Outlier detection using the interquartile range (image: crop of Boxplot_vs_PDF.svg by Jhguch at en.wikipedia, licensed under CC BY-SA 2.5)
  15. [Raw samples of the 95th percentile request latency metric (28.2624, 28.3804, 29.8312, ...), followed by a histogram of roughly 160k samples: the bulk falls in the 30-50 buckets, with a sparse tail of outliers out to about 230.]
  16. Median is 39.1
      Q1 = 25th percentile = 33.5
      Q3 = 75th percentile = 43.8
      IQR = Q3 - Q1 = 10.30
      Lower outlier boundary = Q1 - 1.5*IQR = 18.1
      Higher outlier boundary = Q3 + 1.5*IQR = 59.24
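     The same IQR-based outlier filter as a minimal sketch in plain Ruby (helper names are illustrative):

        # Drop samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] before comparing
        # canary and control distributions.
        def percentile(sorted, p)
          rank = (p / 100.0) * (sorted.size - 1)
          lo, hi = sorted[rank.floor], sorted[rank.ceil]
          lo + (hi - lo) * (rank - rank.floor)   # linear interpolation
        end

        def filter_outliers(samples)
          sorted = samples.sort
          q1  = percentile(sorted, 25)
          q3  = percentile(sorted, 75)
          iqr = q3 - q1
          samples.select { |x| x.between?(q1 - 1.5 * iqr, q3 + 1.5 * iqr) }
        end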
  17. [Histogram of 95th percentile request latency after IQR filtering: samples now span roughly 18-60 and form a single bell-shaped distribution peaking around 38-42.]
  18. [Histogram of 95th percentile request latency for task set 1: samples span roughly 22-54 and peak around 36-42.]
  19. [Histogram of 95th percentile request latency for task set 2: samples span roughly 22-54 and peak around 36-44, closely matching task set 1.]
  20. Student's t-test (difference of means)
      Alternatives:
      • Mann-Whitney U test
      Note: you don't have to implement your own math; libraries are readily available in SciPy, GSL, etc. (see the sketch below).
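     To make "difference of means" concrete, a minimal pure-Ruby sketch of the two-sample t-score (Welch's unequal-variance form, a common variant of Student's test); turning the score into a p-value needs the t-distribution CDF, which is exactly what the libraries above provide:

        # Two-sample t-score (Welch's unequal-variance form) in plain Ruby.
        def mean(xs)
          xs.sum.to_f / xs.size
        end

        def sample_variance(xs)
          m = mean(xs)
          xs.sum { |x| (x - m)**2 } / (xs.size - 1).to_f
        end

        def welch_t_score(canary, control)
          (mean(canary) - mean(control)) /
            Math.sqrt(sample_variance(canary) / canary.size +
                      sample_variance(control) / control.size)
        end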
  21. [Histogram repeated from slide 18: 95th percentile request latency, task set 1.]
  22. [Histogram repeated from slide 19: 95th percentile request latency, task set 2.]
  23. StatisticalTest::TTest.perform(
        alpha = 0.05,
        tail = :two_tail,
        comparison_mean = task_subset_1.mean,
        group_to_compare = task_subset_2
      )
      #=> {:t_score=>0.27767921898389974, :probability=>0.6093385568522004,
      #    :p_value=>0.7813228862955992, :alpha=>0.05, :null=>true,
      #    :alternative=>false, :confidence_level=>0.95}
      # p_value > alpha, so the null hypothesis stands: no significant
      # difference between the two task sets.
  24. [Histogram repeated from slide 18: 95th percentile request latency, task set 1.]
  25. [Histogram of 95th percentile request latency for an alternative task set: samples span roughly 36-55 and peak around 46-47, visibly shifted right relative to task set 1.]
  26. StatisticalTest::TTest.perform(
        alpha = 0.05,
        tail = :two_tail,
        comparison_mean = task_subset_1.mean,
        group_to_compare = task_set_2
      )
      #=> {:t_score=>-53.68570746331448, :probability=>0.0, :p_value=>0.0,
      #    :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}
      # p_value < alpha, so the null hypothesis is rejected: the distributions
      # differ significantly.
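     The result hash above is what a canary controller acts on; a minimal sketch of that wiring (promote! and rollback! are hypothetical hooks):

        # Decide the canary's fate from the t-test result.
        # :null => true  means no significant difference was detected at the given alpha.
        # :null => false means the canary metrics differ significantly from control.
        def evaluate_canary(result)
          if result[:null]
            promote!                              # advance to the next promotion stage
          else
            rollback!(p_value: result[:p_value])  # stop the rollout and report
          end
        end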
  27. Best Practices - Compare comparable things
      • Do not compare tasks from different geographical locations or sublocations
      • If you use metrics from dependencies, make sure you filter down to the tasks that are actually talking to your service (using trace annotations, etc.)
      • Do not compare a task's past to its future (do not perform canaries on a 100% deployment)
  28. Best Practices - Compare comparable things
      • Do not compare "warm" tasks to cold starts
      • Prefer more "immediate" effects (e.g. latency, error rates); have different mechanisms for handling very slow errors (like memory inefficiency)
  29. Best Practices - Working with others
      • Educate your team, write good playbooks, extensively document checks:
        ◦ What counts as a true false positive?
        ◦ What do the coefficients mean?
        ◦ Remember: garbage in -> garbage out
      • Canary check maintenance should be a shared responsibility
  30. And I really mean comment things...
      // We're 99% certain that an estimate taken from our mean extends
      // to a similar-looking population. Generally, a higher significance
      // level means a lower amount of false positives, but a higher chance
      // to miss something. For latency we need a pretty strict threshold
      // in order to be confident, whereas for error rates of a highly
      // reliable service we want a higher p-value threshold to catch
      // issues, since we're certain any amount of deviation is significant.
      significance_level = 0.01
  31. Best Practices - Working with others
      • Good UI/UX is important:
        ◦ Staging changes, dry runs
        ◦ Explain what failed
        ◦ Pessimistic mode (fallback) for new rules/algos