Automated Canary Analysis with New Relic

Clay Smith

May 15, 2018

Transcript

  1. It looks like you're doing a canary deploy. I want to make sure this really works in production.
  2. CANARY DEPLOY
    Diagram labels: Load Balancer; Existing in Production (majority of traffic); Canary (some traffic); Metrics.
    If the canary traffic "looks good", it's safe!
  3. Production uncertainty circa 2018
    What's really going to happen when I deploy a change to service C? Did it work?
    Diagram labels: services A, B, C, D, E, F ... Z.
    Many services, many relationships: major incidents are almost always a surprise ("didn't know it worked that way!").
  4. Production uncertainty circa 2018
    What's really going to happen when I make a change to {service}? Did it work?
    ... 6 hours later: I looked at 5TB of logs and created a pivot table in Microsoft Excel. Maybe?
  5. Is there a better way?
    I work at [redacted large SV-based company]. Let me send you this paper!*
    *https://queue.acm.org/detail.cfm?id=3194655
  6. AGENDA: Things *not* discussed in this talk
    • Jenkins (how to build using a pipeline)
    • Rollout tools (how do I script a canary deploy, etc.)
    • Orchestration layer (e.g. Kubernetes, Marathon, etc.)
    • Non-code changes (configuration, firmware, recipes, etc.)
  7. AGENDA: Things discussed in this talk
    • Assumption: You have something (image, container, etc.) that you want to put in production.
    • Introduction: automated canary analysis
    • How to: instrumentation of services and applications for canary deploys
    • Code: Public-cloud proof of concept
    • Resources: Miscellaneous other tools or approaches
  8. (KINDA) CONTROL THEORY* FEEDBACK LOOP APPLIED TO DEPLOYS
    Diagram labels: baseline; +/-; deployer; running application; output; monitoring.
    This is where people talk a lot about observability: how well can you measure the work the system is doing from its output?
    *Apologies to anyone with an academic background in control theory.
  9. MONITORING OR OBSERVABILITY?
    • Black box: Is it working? (e.g. ping or Nagios check: traditional "monitoring")
    • White box: Instrumentation of code (e.g. application performance monitoring using agents: increases observability)
    Credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html; @copyconstruct, https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
  10. AUTOMATED CANARIES: We have the technology.
    Diagram labels: baseline; +/-; output; instrumentation; deploy solutions; containers + immutable infra.
    // TO DO: insert automation here to detect good/bad deploys above
  11. What is 'bad'? And how do you measure that? It depends on several things.
  12. CANARY METRIC(S)
    • Should be stable over time (service level objective perhaps?)
    • If customer-facing: does it clearly indicate something's broken for your users when it goes sideways?
    • Scoped to the individual app/service with version as a dimension.
  13. APP INSTRUMENTATION FOR CANARY DEPLOYS (part 1)
    Diagram labels: Build; Artifact (container, image, binary, etc.); Version (this is done almost universally).
  14. HOW DO YOU GET THE VERSION IN CODE?
    (This stuff is ideally exposed through platform tooling: exists in Mesos, ECS, K8S, Fargate, and others)
    Orchestrator metadata, injected into the container/app as an env var.
    'The Kubernetes Downward API': https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#the-downward-api
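A minimal sketch of the application side of this idea, assuming the orchestrator (or the build) has already injected the version into the process environment. The variable name `APP_GIT_REVISION` and both function names are hypothetical; the attribute key `appGitRevision` is borrowed from the query on the following slide. How the attribute is attached to telemetry depends on the instrumentation agent in use and is not shown.

```javascript
// Hypothetical env var name; the orchestrator or build pipeline must set it.
const VERSION_ENV_VAR = 'APP_GIT_REVISION';

// Read the deployed version from the environment, falling back to 'unknown'
// so that unversioned processes are still visible as their own group.
function getAppVersion(env = process.env) {
  const value = env[VERSION_ENV_VAR];
  if (typeof value !== 'string' || value.trim() === '') {
    return 'unknown';
  }
  return value.trim();
}

// Build the custom attributes to attach to every transaction or event,
// so version becomes a dimension that canary queries can select on.
function versionAttributes(env = process.env) {
  return { appGitRevision: getAppVersion(env) };
}
```

Passing `env` as a parameter keeps the functions easy to exercise without touching the real process environment.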
  15. CLOSING THE LOOP
    Is the canary better or worse? (NRQL style)
    SELECT percentile(duration, 90) FROM Transaction WHERE appName='RPM API Production' SINCE 7 minutes ago WHERE appGitRevision='a2e441...'
    Annotations: measurement (percentile(duration, 90)); event name (Transaction); selector: app name; selector: time window; selector: version.
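The query on this slide has three selectors (app name, time window, version), so it can be generated per deploy. The sketch below is one way to do that; the function name and parameter names are hypothetical, and it merges the two selectors into a single `WHERE ... AND ...` clause rather than reproducing the slide's annotated layout. Values containing a single quote are rejected instead of escaped, to keep the sketch simple.

```javascript
// Build a latency query for one version of one app over a recent window.
function buildCanaryLatencyQuery({ appName, appGitRevision, sinceMinutes = 7 }) {
  for (const value of [appName, appGitRevision]) {
    // Refuse anything that could break out of the quoted string literal.
    if (typeof value !== 'string' || value === '' || value.includes("'")) {
      throw new Error('appName and appGitRevision must be non-empty strings without single quotes');
    }
  }
  if (!Number.isInteger(sinceMinutes) || sinceMinutes <= 0) {
    throw new Error('sinceMinutes must be a positive integer');
  }
  return (
    'SELECT percentile(duration, 90) FROM Transaction ' +
    `WHERE appName = '${appName}' AND appGitRevision = '${appGitRevision}' ` +
    `SINCE ${sinceMinutes} minutes ago`
  );
}
```

Running the same builder with the baseline's revision yields the comparison query for the existing version.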
  16. CASE STUDY 1: mobile
    Phased Releases (iOS) or Google Play Staged Rollouts (Android)
    New version of app: more users (%) get the new version. Is it working? Are there crashes?
    Automation score:
  17. CASE STUDY 2: NO SERVERS (AWS)
    Event-driven deploys: hook into them as needed.
    Timeline: Deploy Start, then Before Allow Traffic (first check), then After Allow Traffic (final check), then Deploy End.
    At each check: run some code in response to the event.
  18. NEW FUNCTION DEPLOY
    Timeline: Deploy Start, then Before Allow Traffic (first check), then After Allow Traffic (final check), then Deploy End. (Some) customers interact with the new version.
    • Is it slower for people?
    • Are there no errors?
    Metrics: error count; perceived performance.
    Facets: function version; browser, geo, device.
    Supported types: Canary10Percent10Minutes ... Linear10PercentEvery10Minutes ... AllAtOnce ...
    Automation score:
  19. WHAT MEANS 'GOOD'?
    Is the page loading slower than expected?* (New Relic NRQL)
    SELECT percentile(duration, 90) from PageView SINCE 10 minutes ago WHERE functionVersion='14'
    How many errors are there?* (New Relic NRQL)
    SELECT count(*) from JavaScriptError SINCE 10 minutes ago WHERE releaseIds='{"functionVersion":"14"}'
    * These also make great alert conditions
  20. CANARY CHECK: static baseline
    Rejecting any deploys with frontend errors in AWS CodeDeploy:
    const response = JSON.parse(nrqlResponse);
    const errorCount = response.results[0].count;
    if (errorCount > 0) { status = 'Failed'; }
    https://github.com/smithclay/aws-lambda-nr-well-instrumented/blob/master/post-traffic-hook/index.js
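The snippet on this slide can be wrapped into a small, testable function. This is a sketch under assumptions, not the code from the linked repository: the function name and the `maxErrors` parameter are hypothetical, the response shape (`results[0].count`) is taken from the slide, and the choice to return 'Failed' on unparseable or missing data is a design decision made here so that a broken query never approves a deploy.

```javascript
// Decide the deploy hook status from a raw NRQL JSON response string.
// Returns 'Succeeded' or 'Failed'; fails closed when data is missing.
function canaryStatusFromErrorCount(nrqlResponse, maxErrors = 0) {
  let response;
  try {
    response = JSON.parse(nrqlResponse);
  } catch (err) {
    return 'Failed'; // unparseable response: do not approve the deploy
  }
  const results = response && Array.isArray(response.results) ? response.results : [];
  const errorCount = results.length > 0 ? results[0].count : undefined;
  if (typeof errorCount !== 'number' || Number.isNaN(errorCount)) {
    return 'Failed'; // no usable count: do not approve the deploy
  }
  // Static baseline: any count above the fixed threshold rejects the deploy.
  return errorCount > maxErrors ? 'Failed' : 'Succeeded';
}
```

The returned string would then be reported back to the deployment service from the post-traffic hook.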
  21. WHAT'S THIS?
    Diagram labels: Load Balancer; Existing (majority of traffic); Canary (some traffic); Baseline; Metrics.
    "Creating a brand new baseline cluster ensures that the metrics produced are free of any effects caused by long-running processes."
  22. METRIC COMPARISON
    SCOPE: CLUSTER TYPE AND TIME WINDOW
    Diagram labels: FROM; Metrics; Metrics; Metrics; Data Source(s); "The Judge".
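To make the role of "The Judge" concrete, here is a deliberately simple sketch that compares canary metrics against baseline metrics collected over the same time window. It is not the method of any particular tool: it swaps in a plain relative-tolerance check where a real judge would more likely use a statistical comparison. The function name, the metric names in the usage note, and the assumption that lower values are better for every metric are all hypothetical.

```javascript
// Compare canary metrics to baseline metrics, metric by metric.
// `tolerances` maps metric name -> allowed relative increase (0.1 = 10%).
// Assumes lower is better for every metric listed in `tolerances`.
function judgeCanary(baseline, canary, tolerances) {
  const failures = [];
  for (const [metric, tolerance] of Object.entries(tolerances)) {
    const base = baseline[metric];
    const observed = canary[metric];
    if (typeof base !== 'number' || typeof observed !== 'number') {
      failures.push(`${metric}: missing data`); // fail closed on gaps
      continue;
    }
    const allowed = base * (1 + tolerance);
    if (observed > allowed) {
      failures.push(`${metric}: canary is worse than baseline beyond tolerance`);
    }
  }
  return { pass: failures.length === 0, failures };
}
```

For example, `judgeCanary({ errorRate: 0.01, p90: 200 }, { errorRate: 0.01, p90: 300 }, { errorRate: 0.1, p90: 0.1 })` fails on `p90` only. Metrics where higher is better, such as throughput, would need the comparison inverted.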
  23. YOU MUST BE THIS TALL TO DO AUTOMATED CANARY ANALYSIS
    • Real-time metrics from multiple data sources
    • Dimensionality in metrics: especially version
    • Solid deployment tooling + infrastructure
  24. Next steps
    Create dashboards with application metrics with version dimensions (error rate, response time, throughput)... you'll never believe what happens next!