Slide 1

Slide 1 text

Automated Canary Analysis (New Relic-flavored introduction) @smithclay 5/14/2018

Slide 2

Slide 2 text

It looks like you're doing a canary deploy.

Slide 3

Slide 3 text

It looks like you're doing a canary deploy. I want to make sure this really works in production.

Slide 4

Slide 4 text

CANARY DEPLOY (diagram): a load balancer sends the majority of traffic to the existing version in production and some traffic to the canary; metrics are collected from both. If the canary traffic "looks good", it's safe!

Slide 5

Slide 5 text

Production uncertainty circa 2018: What's really going to happen when I deploy a change to service C? Did it work? (Diagram: services A through Z with many interconnections.) Many services, many relationships: major incidents are almost always a surprise ("didn't know it worked that way!")

Slide 6

Slide 6 text

Production uncertainty circa 2018: What's really going to happen when I make a change to {service}? Did it work? ... Six hours later: I looked at 5 TB of logs and created a pivot table in Microsoft Excel. Maybe?

Slide 7

Slide 7 text

Is there a better way? I work at [redacted large SV-based company]. Let me send you this paper!* *https://queue.acm.org/detail.cfm?id=3194655

Slide 8

Slide 8 text

Is there a better (an easier) way? Application/service instrumentation and metrics can help (a lot).

Slide 9

Slide 9 text

Things *not* discussed in this talk (AGENDA)
• Jenkins (how to build using a pipeline)
• Rollout tools (how do I script a canary deploy, etc.)
• Orchestration layer (e.g. Kubernetes, Marathon, etc.)
• Non-code changes (configuration, firmware, recipes, etc.)

Slide 10

Slide 10 text

Things discussed in this talk (AGENDA)
• Assumption: You have something (image, container, etc.) that you want to put in production.
• Introduction: automated canary analysis
• How to: instrumentation of services and applications for canary deploys
• Code: public-cloud proof of concept
• Resources: miscellaneous other tools or approaches

Slide 11

Slide 11 text

(KINDA) CONTROL THEORY* FEEDBACK LOOP APPLIED TO DEPLOYS (diagram: deployer -> running application -> monitoring; the monitored output is compared against a baseline, and the +/- difference feeds back into the deployer). *apologies to anyone with an academic background in control theory. This is where people talk a lot about observability: how well can you measure the work the system is doing from its output?

Slide 12

Slide 12 text

"OBSERVABILITY" credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html, @copyconstruct, https:// medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38)

Slide 13

Slide 13 text

MONITORING OR OBSERVABILITY? (credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html; @copyconstruct, https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38) Black box: is it working? (e.g. a ping or Nagios check, traditional "monitoring"). White box: instrumentation of code (e.g. application performance monitoring using agents, which increases observability).

Slide 14

Slide 14 text

AUTOMATED CANARIES: We have the technology. (Diagram: the same feedback loop with baseline and output, now built from instrumentation, deploy solutions, and containers + immutable infra.) // TODO: insert automation here to detect good/bad deploys above

Slide 15

Slide 15 text

What is 'bad'? And how do you measure that? It depends on several things.

Slide 16

Slide 16 text

CANARY METRIC(S)
• Should be stable over time (a service level objective, perhaps?)
• If customer-facing: does it clearly indicate something's broken for your users when it goes sideways?
• Scoped to the individual app/service, with version as a dimension.

Slide 17

Slide 17 text

APP INSTRUMENTATION FOR CANARY DEPLOYS (part 1): Version the build artifact (container, image, binary, etc.). This is done almost universally.

Slide 18

Slide 18 text

HOW DO YOU GET THE VERSION IN CODE? Orchestrator metadata is injected into the container/app as an environment variable. (This is ideally exposed through platform tooling; it exists in Mesos, ECS, Kubernetes, Fargate, and others.) See 'The Kubernetes Downward API': https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#the-downward-api
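In application code this usually just means reading an environment variable. A minimal Node.js sketch, assuming the orchestrator maps a version label to an env var (APP_VERSION is an illustrative name, not something any platform sets by default):

// Sketch: the orchestrator injects deploy metadata as environment variables
// (e.g. the Kubernetes Downward API can map a pod label to an env var).
// APP_VERSION is an illustrative variable name.
const deployedV = process.env.APP_VERSION || 'unknown';

console.log(`service starting, running version ${deployedV}`);

module.exports = { deployedV };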

Slide 19

Slide 19 text

APP INSTRUMENTATION FOR CANARY DEPLOYS: Version. Annotate transaction traces with the current running version. In service/app code: newrelic.addCustomParameter('VERSION', deployedV);
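In context, a minimal Node.js sketch of that annotation, assuming the New Relic Node agent is installed and the version was read from the environment as above (newer agent versions rename addCustomParameter to addCustomAttribute):

// Sketch: tag every transaction with the running version so canary traffic
// can be separated from existing traffic in later queries.
const newrelic = require('newrelic'); // New Relic Node agent (loaded at app startup)
const deployedV = process.env.APP_VERSION || 'unknown'; // illustrative env var name

// Express-style middleware: annotate the current transaction.
function tagVersion(req, res, next) {
  newrelic.addCustomParameter('VERSION', deployedV); // addCustomAttribute on newer agents
  next();
}

module.exports = tagVersion;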

Slide 20

Slide 20 text

CLOSING THE LOOP: Is the canary better or worse? (NRQL style) SELECT percentile(duration, 90) FROM Transaction WHERE appName='RPM API Production' AND appGitRevision='a2e441...' SINCE 7 minutes ago (Annotated on the slide: Transaction is the event name, percentile(duration, 90) the measurement, appName the app-name selector, SINCE the time-window selector, and appGitRevision the version selector.)
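A minimal sketch of running that query programmatically against the New Relic Insights query API (as documented around 2018); NR_ACCOUNT_ID and NR_QUERY_KEY are placeholders for an account ID and query key:

// Sketch: fetch the canary's p90 duration for a specific version via the
// Insights query API. Endpoint and X-Query-Key header per the 2018-era API.
const https = require('https');

const accountId = process.env.NR_ACCOUNT_ID; // placeholder
const queryKey = process.env.NR_QUERY_KEY;   // placeholder
const nrql =
  "SELECT percentile(duration, 90) FROM Transaction " +
  "WHERE appName='RPM API Production' AND appGitRevision='a2e441...' " +
  "SINCE 7 minutes ago";

const url =
  `https://insights-api.newrelic.com/v1/accounts/${accountId}/query` +
  `?nrql=${encodeURIComponent(nrql)}`;

https.get(url, { headers: { 'X-Query-Key': queryKey } }, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    // results[0] holds the measurement for the queried version and window.
    console.log(JSON.parse(body).results[0]);
  });
});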

Slide 21

Slide 21 text

SOLUTION RECAP Version artifacts. Add custom parameter (version) to monitoring. Query metrics for particular version. Automate?

Slide 22

Slide 22 text

Case studies: not automated, quasi-automated, fully automated.

Slide 23

Slide 23 text

CASE STUDY 1: MOBILE. Phased Releases (iOS) or Google Play Staged Rollouts (Android): a growing percentage of users gets the new version of the app. Is it working? Are there crashes? Automation score:

Slide 24

Slide 24 text

CASE STUDY 2: NO SERVERS (AWS). Event-driven deploys: hook into them as needed. (Diagram: Deploy Start -> Before Allow Traffic (first check) -> After Allow Traffic (final check) -> Deploy End; at each hook you run some code in response to the event.)

Slide 25

Slide 25 text

NEW FUNCTION DEPLOY. (Diagram: Deploy Start -> Before Allow Traffic (first check) -> After Allow Traffic (final check) -> Deploy End; some customers interact with the new version.) Is it slower for people? Are there errors? Metrics: error count, perceived performance. Facets: function version; browser, geo, device. Supported types: Canary10Percent10Minutes, Linear10PercentEvery10Minutes, AllAtOnce, ... Automation score:

Slide 26

Slide 26 text

WHAT DOES 'GOOD' MEAN?
Is the page loading slower than expected?* (New Relic NRQL): SELECT percentile(duration, 90) FROM PageView WHERE functionVersion='14' SINCE 10 minutes ago
How many errors are there?* (New Relic NRQL): SELECT count(*) FROM JavaScriptError WHERE releaseIds='{"functionVersion":"14"}' SINCE 10 minutes ago
* These also make great alert conditions

Slide 27

Slide 27 text

CANARY CHECK: static baseline. Rejecting any deploys with frontend errors in AWS CodeDeploy:
const response = JSON.parse(nrqlResponse);
const errorCount = response.results[0].count;
if (errorCount > 0) { status = 'Failed'; }
https://github.com/smithclay/aws-lambda-nr-well-instrumented/blob/master/post-traffic-hook/index.js
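For context, a hedged sketch of how a full post-traffic hook might report that verdict back to CodeDeploy so it can roll back or proceed. It assumes the standard Lambda hook event fields and the AWS SDK's putLifecycleEventHookExecutionStatus call; queryErrorCount is a stubbed stand-in for the NRQL lookup shown earlier (see the linked repo for the real implementation):

// Hypothetical AfterAllowTraffic hook for AWS CodeDeploy (Node.js, AWS SDK v2).
const AWS = require('aws-sdk');
const codedeploy = new AWS.CodeDeploy();

// Stubbed helper: in practice this would run the JavaScriptError NRQL query
// against New Relic (as in the earlier slides) and return the error count.
async function queryErrorCount() {
  return 0;
}

exports.handler = async (event) => {
  const errorCount = await queryErrorCount();
  const status = errorCount > 0 ? 'Failed' : 'Succeeded';

  // Tell CodeDeploy whether this lifecycle event passed; a 'Failed' status
  // causes the deployment to roll back.
  await codedeploy.putLifecycleEventHookExecutionStatus({
    deploymentId: event.DeploymentId,
    lifecycleEventHookExecutionId: event.LifecycleEventHookExecutionId,
    status,
  }).promise();

  return status;
};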

Slide 28

Slide 28 text

CASE STUDY 3: Kayenta (https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69). (Diagram: a load balancer sends the majority of traffic to the existing cluster and some traffic to the canary and to a baseline cluster; metrics are collected from each.)

Slide 29

Slide 29 text

WHAT'S THIS? (Same diagram, highlighting the baseline cluster.) "Creating a brand new baseline cluster ensures that the metrics produced are free of any effects caused by long-running processes."

Slide 30

Slide 30 text

METRIC COMPARISON SCOPE: CLUSTER TYPE AND TIME WINDOW. (Diagram: metrics from the data source(s) flow into "the judge" for comparison.)

Slide 31

Slide 31 text

YOU MUST BE THIS TALL TO DO AUTOMATED CANARY ANALYSIS
• Real-time metrics from multiple data sources
• Dimensionality in metrics: especially version
• Solid deployment tooling + infrastructure

Slide 32

Slide 32 text

Next steps: Create dashboards of application metrics (error rate, response time, throughput) with version as a dimension... you'll never believe what happens next!

Slide 33

Slide 33 text

Thank you! @smithclay 5/14/2018