
Canary Evaluation (for fun and profit)

You're finally deploying canaries, now what?

The talk covers:
- What to compare and how?
- Basics of distribution analysis.
- Tips to reduce noise.
- Best practices of Canary Evaluation.

Artem Yakimenko

February 27, 2020

Transcript

  1. Canarying is a partial deployment of a change in a small fraction of a service and its evaluation. The part of the service that receives the change is "the canary," and the remainder of the service is "the control." Canarying is effectively an A/B testing process.
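     To make the canary/control split above concrete, here is a minimal sketch in Ruby of deterministic traffic bucketing (hashing on a request ID is an illustrative choice, not something prescribed by the talk):

        require 'digest'

        # Deterministically assign a request to the canary or the control
        # population based on the configured canary traffic percentage.
        def assign_population(request_id, canary_percent)
          bucket = Digest::MD5.hexdigest(request_id.to_s).to_i(16) % 100
          bucket < canary_percent ? :canary : :control
        end

        assign_population("req-1234", 1) #=> :canary or :control, stable per request ID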
  2. General process overview (pipeline diagram): the prod artefact is rolled out through a Testing stage (behind a test frontend), then through Canary 0.1%, Canary 1%, and further Canary ...% stages (behind the public frontend), with an Eval step after each stage.
  3. Some important things about deployment...
     • Canary promotion stages should correspond to roughly exponential increases in traffic (see the sketch below).
     • Canaries should start from the least practical percentage of traffic, i.e. the smallest percentage at which you still get a usable signal.
       ◦ Advanced: if you have risk dimensions other than traffic, adjust the stages to take those into account as well.
     • You should avoid discriminating by location or user base.
       ◦ This skews your data and can degrade the user experience for a particular demographic.
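     As a rough illustration of the exponential stage idea, a minimal sketch in plain Ruby (the default start percentage and growth multiplier are examples, not prescriptions from the talk):

        # Build an exponential canary promotion schedule.
        # start_pct: smallest traffic share that still yields a usable signal.
        # factor:    growth multiplier between stages (roughly exponential).
        def promotion_stages(start_pct: 0.1, factor: 10, max_pct: 100)
          stages = []
          pct = start_pct
          while pct < max_pct
            stages << pct
            pct *= factor
          end
          stages << max_pct
          stages
        end

        promotion_stages #=> [0.1, 1.0, 10.0, 100]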
  4. Evaluation requires some elbow grease
     • Canary deployment is more-or-less a widely solved problem:
       ◦ Modern container engines: K8s, Mesos, Nomad
       ◦ Cloud/VM orchestrators: AWS/GCP instance groups, OpenStack
  5. Evaluation requires some elbow grease
     • Canary evaluation is not widely solved, however:
       ◦ Most books/articles on the topic heavily handwave around the evaluation part:
         ▪ "Make sure it looks right, then..."
         ▪ "If it's functioning correctly, proceed with..."
       ◦ Solutions that do exist are either:
         ▪ Proof-of-concept/simplistic (Pulumi canary strategies)
         ▪ Limited to simple static metric thresholds (Flagger, Kanarini)
         ▪ Hard to integrate outside of their specific environment (Kayenta)
  6. A note on data quality...
     • Before working on canarying you need good metrics. Garbage in -> garbage out.
     • Metrics need to be carefully selected. Focus on the customer, on preventing past incidents, and on your dependent services.
       ◦ Prefer application metrics (e.g. not "CPU is X%, memory is Y%" but "Has the user been served a request?")
       ◦ Make sure you're using "canonical" error codes, i.e. known good vs. known valid/bad (see the sketch below)
       ◦ Avoid queries/metrics with grouping
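     As an illustration of canonical error classification, a minimal Ruby sketch (the specific status-code mapping is hypothetical):

        # Map raw response codes onto canonical buckets so the evaluator
        # compares "known bad" errors rather than raw status codes.
        CANONICAL_CODES = {
          200 => :ok,
          404 => :known_valid,   # expected client outcome, not a canary signal
          429 => :known_valid,
          500 => :known_bad,
          503 => :known_bad
        }.freeze

        def canonical_bucket(status_code)
          CANONICAL_CODES.fetch(status_code, :unknown)
        end

        canonical_bucket(503) #=> :known_bad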
  7. Static target - Pros
     • Easier to implement (a simple query on the metric, no need to compare 2 streams; see the sketch below)
     • Some reasonable solutions already exist: Flagger, Kanarini
     • Doesn't require much statistical knowledge
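     For contrast with the comparative approach later in the deck, a minimal Ruby sketch of a static-target check (metric names and thresholds are made up for illustration):

        # Static-target evaluation: each metric is checked against a fixed,
        # hand-maintained threshold, which is where the toil and the
        # false positive/negative issues on the next slide come from.
        STATIC_THRESHOLDS = {
          "error_rate_percent" => 1.0,   # fail above 1% errors
          "p95_latency_ms"     => 250.0  # fail above 250 ms p95 latency
        }.freeze

        def canary_passes?(metrics)
          STATIC_THRESHOLDS.all? { |name, limit| metrics.fetch(name) <= limit }
        end

        canary_passes?("error_rate_percent" => 0.4, "p95_latency_ms" => 180.0) #=> true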
  8. Static target - Cons
     • Higher false positive / false negative rate
     • Maintenance of static comparisons is toilsome
     • Doesn't catch "common-law" SLO violations
     • It's easy to set arbitrary thresholds which alert on the wrong things
  9. What is smart target canarying?
     • A comparative evaluator, used to evaluate the canary metrics against the control population
     • Leveraging statistical analysis to:
       ◦ Reliably compare canary and control metrics
       ◦ De-noise metrics and rule out false positives
       ◦ Exclude random events
  10. Smart target - Pros
      • Much better false positive / false negative rate
      • Much easier to maintain once the math is in (a matter of tuning coefficients, e.g. the significance level)
  11. Smart target - Cons
      • Easily confused and/or requires significant work in some specific cases (notably bimodal distributions)
  12. Smart target - Cons
      • Easily confused and/or requires significant work in some specific cases (notably bimodal distributions)
      • The only existing OSS solution, Kayenta, is very tightly integrated with its parent project (Spinnaker)
  13. Smart target - Cons
      • Easily confused and/or requires significant work in some specific cases (notably bimodal distributions)
      • The only existing OSS solution, Kayenta, is very tightly integrated with its parent project (Spinnaker)
      • Statistics
  14. Outlier detection using the interquartile range (image: crop of Boxplot_vs_PDF.svg by Jhguch at en.wikipedia, licensed under CC BY-SA 2.5)
  15. [Raw samples of the 95th percentile request latency metric (28.2624, 28.3804, 29.8312, ...), followed by a histogram of roughly 160k samples: the bulk falls in the 30-50 buckets, with a sparse tail of outliers out to about 230.]
  16. Median is 39.1
      Q1 = 25th percentile = 33.5
      Q3 = 75th percentile = 43.8
      IQR = Q3 - Q1 = 10.30
      Lower outlier boundary = Q1 - 1.5*IQR = 18.1
      Higher outlier boundary = Q3 + 1.5*IQR = 59.24
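     The same IQR-based outlier filter as a minimal sketch in plain Ruby (helper names are illustrative):

        # Drop samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] before comparing
        # canary and control distributions.
        def percentile(sorted, p)
          rank = (p / 100.0) * (sorted.size - 1)
          lo, hi = sorted[rank.floor], sorted[rank.ceil]
          lo + (hi - lo) * (rank - rank.floor)   # linear interpolation
        end

        def filter_outliers(samples)
          sorted = samples.sort
          q1  = percentile(sorted, 25)
          q3  = percentile(sorted, 75)
          iqr = q3 - q1
          samples.select { |x| x.between?(q1 - 1.5 * iqr, q3 + 1.5 * iqr) }
        end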
  17. [Histogram of 95th percentile request latency after IQR filtering: samples now span roughly 18-60 and form a single bell-shaped distribution peaking around 38-42.]
  18. [Histogram of 95th percentile request latency for task set 1: samples span roughly 22-54 and peak around 36-42.]
  19. [Histogram of 95th percentile request latency for task set 2: samples span roughly 22-54 and peak around 36-44, closely matching task set 1.]
  20. Student's t-test (difference of means)
      Alternatives:
      • Mann-Whitney U test
      Note: you don't have to implement your own math; libraries are readily available in SciPy, GSL, etc. (see the sketch below).
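     To make "difference of means" concrete, a minimal pure-Ruby sketch of the two-sample t-score (Welch's unequal-variance form, a common variant of Student's test); turning the score into a p-value needs the t-distribution CDF, which is exactly what the libraries above provide:

        # Two-sample t-score (Welch's unequal-variance form) in plain Ruby.
        def mean(xs)
          xs.sum.to_f / xs.size
        end

        def sample_variance(xs)
          m = mean(xs)
          xs.sum { |x| (x - m)**2 } / (xs.size - 1).to_f
        end

        def welch_t_score(canary, control)
          (mean(canary) - mean(control)) /
            Math.sqrt(sample_variance(canary) / canary.size +
                      sample_variance(control) / control.size)
        end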
  21. [Histogram repeated from slide 18: 95th percentile request latency, task set 1.]
  22. [Histogram repeated from slide 19: 95th percentile request latency, task set 2.]
  23. StatisticalTest::TTest.perform(
        alpha = 0.05,
        tail = :two_tail,
        comparison_mean = task_subset_1.mean,
        group_to_compare = task_subset_2
      )
      #=> {:t_score=>0.27767921898389974, :probability=>0.6093385568522004,
      #    :p_value=>0.7813228862955992, :alpha=>0.05, :null=>true,
      #    :alternative=>false, :confidence_level=>0.95}
      # p_value > alpha, so the null hypothesis stands: no significant
      # difference between the two task sets.
  24. [Histogram repeated from slide 18: 95th percentile request latency, task set 1.]
  25. [Histogram of 95th percentile request latency for an alternative task set: samples span roughly 36-55 and peak around 46-47, visibly shifted right relative to task set 1.]
  26. StatisticalTest::TTest.perform(
        alpha = 0.05,
        tail = :two_tail,
        comparison_mean = task_subset_1.mean,
        group_to_compare = task_set_2
      )
      #=> {:t_score=>-53.68570746331448, :probability=>0.0, :p_value=>0.0,
      #    :alpha=>0.05, :null=>false, :alternative=>true, :confidence_level=>0.95}
      # p_value < alpha, so the null hypothesis is rejected: the distributions
      # differ significantly.
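     The result hash above is what a canary controller acts on; a minimal sketch of that wiring (promote! and rollback! are hypothetical hooks):

        # Decide the canary's fate from the t-test result.
        # :null => true  means no significant difference was detected at the given alpha.
        # :null => false means the canary metrics differ significantly from control.
        def evaluate_canary(result)
          if result[:null]
            promote!                              # advance to the next promotion stage
          else
            rollback!(p_value: result[:p_value])  # stop the rollout and report
          end
        end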
  27. Best Practices - Compare comparable things
      • Do not compare tasks from different geographical locations or sublocations
      • If you use metrics from dependencies, make sure you filter down to the tasks that are actually talking to your service (using trace annotations, etc.)
      • Do not compare a task's past to its future (do not perform canaries on a 100% deployment)
  28. Best Practices - Compare comparable things
      • Do not compare "warm" tasks to cold starts
      • Prefer more "immediate" effects (e.g. latency, error rates); have different mechanisms for handling very slow errors (like memory inefficiency)
  29. Best Practices - Working with others
      • Educate your team, write good playbooks, extensively document checks:
        ◦ What counts as a true false positive?
        ◦ What do the coefficients mean?
        ◦ Remember: garbage in -> garbage out
      • Canary check maintenance should be a shared responsibility
  30. And I really mean comment things...
      // We're 99% certain that an estimate taken from our mean extends
      // to a similar-looking population. Generally, a higher significance
      // level means a lower amount of false positives, but a higher chance
      // to miss something. For latency we need a pretty strict threshold
      // in order to be confident, whereas for error rates of a highly
      // reliable service we want a higher p-value threshold to catch
      // issues, since we're certain any amount of deviation is significant.
      significance_level = 0.01
  31. Best Practices - Working with others
      • Good UI/UX is important:
        ◦ Staging changes, dry runs
        ◦ Explain what failed
        ◦ Pessimistic mode (fallback) for new rules/algos