I deploy a change to service C? Did it work? A D E B C F ... Z Many services, many relationships: major incidents are almost always a surprise ("didn't know it worked that way!")
(image, container, etc) that you want to put in production. • Introduction: automated canary analysis • How to: instrumentation of services and applications for canary deploys • Code: Public-cloud proof of concept • Resources: Miscellaneous other tools or approaches AGENDA
application monitoring baseline + - *apologies to anyone with an academic background in control theory output This is where people talk a lot about observability: how well can you measure the work the system is doing from output?
Box Black Box Instrumentation of code Is it working? (i.e. ping or nagios check—traditional "monitoring") (i.e. application performance monitoring using agents—increases observability)
• If customer-facing: does it clearly indicate something's broken for your users when it goes sideways? • Scoped to the individual app/service with version as a dimension. CANARY METRIC(S)
is ideally exposed through platform tooling: exists in Mesos, ECS, K8S, Fargate, and others) https://kubernetes.io/docs/tasks/inject-data- application/environment-variable-expose-pod- information/#the-downward-api 'The Kubernetes Downward API' Orchestrator metadata Injected into Container/ App as Env Var
style) SELECT percentile(duration, 90) FROM Transaction WHERE appName='RPM API Production' SINCE 7 minutes ago WHERE appGitRevision='a2e441...' Event Name Measurement Selector: app name Selector: time window Selector: version
Before Allow Traffic (first check) After Allow Traffic (final check) Event-driven deploys: hook into them as needed Run some code in response to event Run some code in response to event
(first check) After Allow Traffic (final check) (Some) customers interact with new version • Is it slower for people? • Are there no errors? Metrics • Error count • Perceived Performance Facets • Function Version • Browser, Geo, Device Canary10Percent10Minutes ... Linear10PercentEvery10Minutes .. AllAtOnce ... Supported Types: Automation score:
minutes ago WHERE functionVersion='14' SELECT count(*) from JavaScriptError SINCE 10 minutes ago WHERE releaseIds='{"functionVersion":"14"}' Is the page loading slower than expected?* (New Relic NRQL) How many errors are there?* (New Relic NRQL) * These also make great alert conditions
= response.results[0].count; if (errorCount > 0) { status = 'Failed'; } https://github.com/smithclay/aws-lambda-nr-well-instrumented/blob/master/post- traffic-hook/index.js Rejecting any deploys with frontend- errors in AWS CodeDeploy
traffic Metrics Baseline "Creating a brand new baseline cluster ensures that the metrics produced are free of any effects caused by long-running processes."