Production uncertainty circa 2018
What's really going to happen when I deploy a change to service C? Did it work?
[diagram: dependency graph of services A through Z]
Many services, many relationships: major incidents are almost always a surprise ("didn't know it worked that way!")
Production uncertainty circa 2018
What's really going to happen when I make a change to {service}? Did it work?
... 6 hours later: I looked at 5 TB of logs and built a pivot table in Microsoft Excel—maybe?
(KINDA) CONTROL THEORY* FEEDBACK LOOP APPLIED TO DEPLOYS
[diagram: deployer → running application → output, measured by monitoring and compared (+/−) against a baseline]
This is where people talk a lot about observability: how well can you measure the work the system is doing from its output?
*apologies to anyone with an academic background in control theory
MONITORING OR OBSERVABILITY?
(credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html, and @copyconstruct, https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38)
Black box: "Is it working?" (e.g. a ping or Nagios check—traditional "monitoring")
White box: instrumentation of code (e.g. application performance monitoring using agents—increases observability)
AUTOMATED CANARIES: We have the technology.
[diagram: the same feedback loop: deploy → output → instrumentation, compared (+/−) against a baseline]
Building blocks: instrumentation, deploy solutions, containers + immutable infra
// TO DO: insert automation here to detect good/bad deploys above
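A minimal sketch of what that TO DO could look like, in TypeScript. The helpers (deployCanary, promote, rollback, fetchP90) are stand-ins for your deploy tooling and metrics backend; none of these names are a real API:

    // Hypothetical canary judge: deploy, wait for metrics, compare to baseline.
    async function deployCanary(version: string): Promise<void> { /* platform-specific */ }
    async function promote(version: string): Promise<void> { /* shift all traffic */ }
    async function rollback(version: string): Promise<void> { /* remove the canary */ }
    async function fetchP90(version: string): Promise<number> {
      // Assumption: your monitoring can return p90 duration sliced by version
      // (the dimensionality the later slides set up).
      throw new Error("wire this to your metrics backend");
    }

    async function automatedCanary(baselineVersion: string, canaryVersion: string) {
      await deployCanary(canaryVersion);
      await new Promise((r) => setTimeout(r, 7 * 60 * 1000)); // let metrics accumulate

      const [baseline, canary] = await Promise.all([
        fetchP90(baselineVersion),
        fetchP90(canaryVersion),
      ]);

      // Naive judge: fail the deploy if the canary is more than 10% slower.
      if (canary > baseline * 1.1) await rollback(canaryVersion);
      else await promote(canaryVersion);
    }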
HOW DO YOU GET THE VERSION IN CODE?
(This is ideally exposed through platform tooling: it exists in Mesos, ECS, K8s, Fargate, and others.)
Orchestrator metadata → injected into the container/app as an env var
"The Kubernetes Downward API": https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#the-downward-api
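A sketch of consuming that injected metadata from a Node/TypeScript service. The APP_VERSION variable and the 'version' label are illustrative names, not platform defaults:

    // Assumes the pod spec maps a label to an env var via the Downward API,
    // roughly:
    //   env:
    //     - name: APP_VERSION          # illustrative name
    //       valueFrom:
    //         fieldRef:
    //           fieldPath: metadata.labels['version']
    // Read the injected value once at startup (Node runtime assumed).
    export const deployedVersion: string = process.env.APP_VERSION ?? "unknown";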
APP INSTRUMENTATION FOR CANARY DEPLOYS
Annotate transaction traces with the current running version. In service/app code:
newrelic.addCustomParameter('VERSION', deployedV);
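Expanded into a runnable sketch for a Node service. Express, the middleware placement, and the ./version module from the previous sketch are assumptions; note that newer Node agents name this method addCustomAttribute rather than the slide's addCustomParameter:

    import newrelic from "newrelic"; // the agent must load before other modules
    import express from "express";
    import { deployedVersion } from "./version"; // hypothetical module from the sketch above

    const app = express();

    // Tag every transaction with the running version so queries and
    // dashboards can slice by it.
    app.use((req, res, next) => {
      newrelic.addCustomAttribute("VERSION", deployedVersion);
      next();
    });

    app.listen(3000);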
CLOSING THE LOOP
Is the canary better or worse? (NRQL style)
SELECT percentile(duration, 90) FROM Transaction WHERE appName='RPM API Production' AND appGitRevision='a2e441...' SINCE 7 minutes ago
(Measurement: percentile(duration, 90) · event name: Transaction · selectors: app name, version, time window)
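To automate the judgment, the same query can be run from code. A sketch against the account-scoped Insights query API; the account ID, query key handling, and exact response shape are assumptions to verify against your own account (Node 18+ global fetch assumed):

    const ACCOUNT_ID = "1234567"; // placeholder
    const QUERY_KEY = process.env.NR_QUERY_KEY ?? "";

    async function p90For(revision: string): Promise<number> {
      const nrql =
        `SELECT percentile(duration, 90) FROM Transaction ` +
        `WHERE appName='RPM API Production' ` +
        `AND appGitRevision='${revision}' SINCE 7 minutes ago`;
      const res = await fetch(
        `https://insights-api.newrelic.com/v1/accounts/${ACCOUNT_ID}/query` +
          `?nrql=${encodeURIComponent(nrql)}`,
        { headers: { "X-Query-Key": QUERY_KEY } }
      );
      const body = await res.json();
      // Response-shape assumption: results[0].percentiles["90"].
      return body.results[0].percentiles["90"];
    }

    // Judge: is the canary no more than 10% slower than the baseline?
    async function canaryIsHealthy(baselineRev: string, canaryRev: string) {
      const [b, c] = await Promise.all([p90For(baselineRev), p90For(canaryRev)]);
      return c <= b * 1.1;
    }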
CASE STUDY 1: mobile
Phased Releases (iOS) or Google Play Staged Rollouts (Android): a new version of the app ships, more users (%) get it at each phase, and you watch: Is it working? Are there crashes?
Automation score:
CASE STUDY 2: NO SERVERS (AWS)
Event-driven deploys: hook into them as needed.
[diagram: Deploy Start → Before Allow Traffic (first check) → After Allow Traffic (final check) → Deploy End, with "run some code in response to event" at each hook]
NEW FUNCTION DEPLOY
[diagram: the same deploy timeline; between the Before Allow Traffic and After Allow Traffic checks, (some) customers interact with the new version]
• Is it slower for people? (see the hook sketch below)
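A sketch of one of those hooks as a Lambda function, using the AWS SDK v3 for JavaScript; runChecks is a stand-in for your own validation (e.g. the NRQL checks on the next slide):

    import {
      CodeDeployClient,
      PutLifecycleEventHookExecutionStatusCommand,
    } from "@aws-sdk/client-codedeploy";

    const codedeploy = new CodeDeployClient({});

    async function runChecks(): Promise<boolean> {
      return true; // stand-in: query your metrics and decide
    }

    // CodeDeploy invokes the hook with these two fields and waits for
    // a Succeeded/Failed status before continuing the deployment.
    export const handler = async (event: {
      DeploymentId: string;
      LifecycleEventHookExecutionId: string;
    }) => {
      const healthy = await runChecks();
      await codedeploy.send(
        new PutLifecycleEventHookExecutionStatusCommand({
          deploymentId: event.DeploymentId,
          lifecycleEventHookExecutionId: event.LifecycleEventHookExecutionId,
          status: healthy ? "Succeeded" : "Failed",
        })
      );
    };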
WHAT DOES 'GOOD' MEAN?
Is the page loading slower than expected?* (New Relic NRQL)
SELECT percentile(duration, 90) FROM PageView WHERE functionVersion='14' SINCE 10 minutes ago
How many errors are there?* (New Relic NRQL)
SELECT count(*) FROM JavaScriptError WHERE releaseIds='{"functionVersion":"14"}' SINCE 10 minutes ago
* These also make great alert conditions
CASE STUDY 3: Kayenta
https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69
[diagram: a load balancer sends the majority of traffic to the existing cluster and some traffic to canary and baseline clusters; metrics are collected from each]
WHAT'S THIS?
[diagram: the same topology; the baseline is a freshly deployed copy of the existing version, not the long-running cluster]
"Creating a brand new baseline cluster ensures that the metrics produced are free of any effects caused by long-running processes."
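Kayenta's real judge runs nonparametric statistical tests across many metrics; as a toy stand-in, here is the shape of the comparison, judging canary samples against the fresh baseline cluster rather than the long-running one:

    // Toy judge: compare a latency metric's samples from the canary
    // against the freshly deployed baseline cluster.
    function median(xs: number[]): number {
      const s = [...xs].sort((a, b) => a - b);
      const mid = Math.floor(s.length / 2);
      return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
    }

    // "Fail" if the canary's median is more than 10% above the baseline's.
    function judge(baselineSamples: number[], canarySamples: number[]): "pass" | "fail" {
      return median(canarySamples) > median(baselineSamples) * 1.1 ? "fail" : "pass";
    }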
YOU MUST BE THIS TALL TO DO AUTOMATED CANARY ANALYSIS
• Real-time metrics from multiple data sources
• Dimensionality in metrics: especially version
• Solid deployment tooling + infrastructure
Next steps
Create dashboards of application metrics (error rate, response time, throughput) with version dimensions... you'll never believe what happens next!