Production uncertainty circa 2018
What's really going to happen when I deploy a change to service C? Did it work?
[diagram: dependency graph of services A through Z]
Many services, many relationships: major incidents are almost always a surprise ("didn't know it worked that way!")
Production uncertainty circa 2018
What's really going to happen when I make a change to {service}? Did it work?
... 6 hours later: I looked at 5 TB of logs and built a pivot table in Microsoft Excel—maybe?
(KINDA) CONTROL THEORY* FEEDBACK LOOP APPLIED TO DEPLOYS
[diagram: deployer → running application → output, measured by monitoring and compared (+/−) against a baseline]
This is where people talk a lot about observability: how well can you measure the work the system is doing from its output?
*apologies to anyone with an academic background in control theory
MONITORING OR OBSERVABILITY?
(credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html, and @copyconstruct, https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38)
Black box: "Is it working?" (e.g. a ping or Nagios check—traditional "monitoring")
White box: instrumentation of code (e.g. application performance monitoring using agents—increases observability)
AUTOMATED CANARIES: We have the technology.
[diagram: the same feedback loop: deploy → output → instrumentation, compared (+/−) against a baseline]
Building blocks: instrumentation, deploy solutions, containers + immutable infra
// TO DO: insert automation here to detect good/bad deploys above
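A minimal sketch of what that TO DO could look like, in TypeScript. The helpers (deployCanary, promote, rollback, fetchP90) are stand-ins for your deploy tooling and metrics backend; none of these names are a real API:

    // Hypothetical canary judge: deploy, wait for metrics, compare to baseline.
    async function deployCanary(version: string): Promise<void> { /* platform-specific */ }
    async function promote(version: string): Promise<void> { /* shift all traffic */ }
    async function rollback(version: string): Promise<void> { /* remove the canary */ }
    async function fetchP90(version: string): Promise<number> {
      // Assumption: your monitoring can return p90 duration sliced by version
      // (the dimensionality the later slides set up).
      throw new Error("wire this to your metrics backend");
    }

    async function automatedCanary(baselineVersion: string, canaryVersion: string) {
      await deployCanary(canaryVersion);
      await new Promise((r) => setTimeout(r, 7 * 60 * 1000)); // let metrics accumulate

      const [baseline, canary] = await Promise.all([
        fetchP90(baselineVersion),
        fetchP90(canaryVersion),
      ]);

      // Naive judge: fail the deploy if the canary is more than 10% slower.
      if (canary > baseline * 1.1) await rollback(canaryVersion);
      else await promote(canaryVersion);
    }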
HOW DO YOU GET THE VERSION IN CODE?
(This is ideally exposed through platform tooling: it exists in Mesos, ECS, K8s, Fargate, and others.)
Orchestrator metadata → injected into the container/app as an env var
"The Kubernetes Downward API": https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#the-downward-api
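A sketch of consuming that injected metadata from a Node/TypeScript service. The APP_VERSION variable and the 'version' label are illustrative names, not platform defaults:

    // Assumes the pod spec maps a label to an env var via the Downward API,
    // roughly:
    //   env:
    //     - name: APP_VERSION          # illustrative name
    //       valueFrom:
    //         fieldRef:
    //           fieldPath: metadata.labels['version']
    // Read the injected value once at startup (Node runtime assumed).
    export const deployedVersion: string = process.env.APP_VERSION ?? "unknown";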
APP INSTRUMENTATION FOR CANARY DEPLOYS
Annotate transaction traces with the current running version. In service/app code:
newrelic.addCustomParameter('VERSION', deployedV);
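Expanded into a runnable sketch for a Node service. Express, the middleware placement, and the ./version module from the previous sketch are assumptions; note that newer Node agents name this method addCustomAttribute rather than the slide's addCustomParameter:

    import newrelic from "newrelic"; // the agent must load before other modules
    import express from "express";
    import { deployedVersion } from "./version"; // hypothetical module from the sketch above

    const app = express();

    // Tag every transaction with the running version so queries and
    // dashboards can slice by it.
    app.use((req, res, next) => {
      newrelic.addCustomAttribute("VERSION", deployedVersion);
      next();
    });

    app.listen(3000);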
CLOSING THE LOOP
Is the canary better or worse? (NRQL style)
SELECT percentile(duration, 90) FROM Transaction WHERE appName='RPM API Production' AND appGitRevision='a2e441...' SINCE 7 minutes ago
(Measurement: percentile(duration, 90) · event name: Transaction · selectors: app name, version, time window)
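To automate the judgment, the same query can be run from code. A sketch against the account-scoped Insights query API; the account ID, query key handling, and exact response shape are assumptions to verify against your own account (Node 18+ global fetch assumed):

    const ACCOUNT_ID = "1234567"; // placeholder
    const QUERY_KEY = process.env.NR_QUERY_KEY ?? "";

    async function p90For(revision: string): Promise<number> {
      const nrql =
        `SELECT percentile(duration, 90) FROM Transaction ` +
        `WHERE appName='RPM API Production' ` +
        `AND appGitRevision='${revision}' SINCE 7 minutes ago`;
      const res = await fetch(
        `https://insights-api.newrelic.com/v1/accounts/${ACCOUNT_ID}/query` +
          `?nrql=${encodeURIComponent(nrql)}`,
        { headers: { "X-Query-Key": QUERY_KEY } }
      );
      const body = await res.json();
      // Response-shape assumption: results[0].percentiles["90"].
      return body.results[0].percentiles["90"];
    }

    // Judge: is the canary no more than 10% slower than the baseline?
    async function canaryIsHealthy(baselineRev: string, canaryRev: string) {
      const [b, c] = await Promise.all([p90For(baselineRev), p90For(canaryRev)]);
      return c <= b * 1.1;
    }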
CASE STUDY 1: mobile
Phased Releases (iOS) or Google Play Staged Rollouts (Android): a new version of the app ships, more users (%) get it at each phase, and you watch: Is it working? Are there crashes?
Automation score:
CASE STUDY 2: NO SERVERS (AWS)
Event-driven deploys: hook into them as needed.
[diagram: Deploy Start → Before Allow Traffic (first check) → After Allow Traffic (final check) → Deploy End, with "run some code in response to event" at each hook]
NEW FUNCTION DEPLOY
[diagram: the same deploy timeline; between the Before Allow Traffic and After Allow Traffic checks, (some) customers interact with the new version]
• Is it slower for people? (see the hook sketch below)
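A sketch of one of those hooks as a Lambda function, using the AWS SDK v3 for JavaScript; runChecks is a stand-in for your own validation (e.g. the NRQL checks on the next slide):

    import {
      CodeDeployClient,
      PutLifecycleEventHookExecutionStatusCommand,
    } from "@aws-sdk/client-codedeploy";

    const codedeploy = new CodeDeployClient({});

    async function runChecks(): Promise<boolean> {
      return true; // stand-in: query your metrics and decide
    }

    // CodeDeploy invokes the hook with these two fields and waits for
    // a Succeeded/Failed status before continuing the deployment.
    export const handler = async (event: {
      DeploymentId: string;
      LifecycleEventHookExecutionId: string;
    }) => {
      const healthy = await runChecks();
      await codedeploy.send(
        new PutLifecycleEventHookExecutionStatusCommand({
          deploymentId: event.DeploymentId,
          lifecycleEventHookExecutionId: event.LifecycleEventHookExecutionId,
          status: healthy ? "Succeeded" : "Failed",
        })
      );
    };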
WHAT DOES 'GOOD' MEAN?
Is the page loading slower than expected?* (New Relic NRQL)
SELECT percentile(duration, 90) FROM PageView WHERE functionVersion='14' SINCE 10 minutes ago
How many errors are there?* (New Relic NRQL)
SELECT count(*) FROM JavaScriptError WHERE releaseIds='{"functionVersion":"14"}' SINCE 10 minutes ago
* These also make great alert conditions
CASE STUDY 3: Kayenta
https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69
[diagram: a load balancer sends the majority of traffic to the existing cluster and some traffic to canary and baseline clusters; metrics are collected from each]
WHAT'S THIS?
[diagram: the same topology; the baseline is a freshly deployed copy of the existing version, not the long-running cluster]
"Creating a brand new baseline cluster ensures that the metrics produced are free of any effects caused by long-running processes."
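Kayenta's real judge runs nonparametric statistical tests across many metrics; as a toy stand-in, here is the shape of the comparison, judging canary samples against the fresh baseline cluster rather than the long-running one:

    // Toy judge: compare a latency metric's samples from the canary
    // against the freshly deployed baseline cluster.
    function median(xs: number[]): number {
      const s = [...xs].sort((a, b) => a - b);
      const mid = Math.floor(s.length / 2);
      return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
    }

    // "Fail" if the canary's median is more than 10% above the baseline's.
    function judge(baselineSamples: number[], canarySamples: number[]): "pass" | "fail" {
      return median(canarySamples) > median(baselineSamples) * 1.1 ? "fail" : "pass";
    }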
YOU MUST BE THIS TALL TO DO AUTOMATED CANARY ANALYSIS
• Real-time metrics from multiple data sources
• Dimensionality in metrics: especially version
• Solid deployment tooling + infrastructure
Next steps
Create dashboards of application metrics (error rate, response time, throughput) with version dimensions... you'll never believe what happens next!