Automated Canary Analysis with New Relic

Clay Smith

May 15, 2018

  1. Automated Canary Analysis (New Relic-flavored introduction) @smithclay 5/14/2018

  2. It looks like you're doing a canary deploy.

  3. It looks like you're doing a canary deploy. I want

    to make sure this really works in production. DO
  4. Load Balancer Canary Existing in Production Majority of traffic Some

    traffic If the canary traffic "looks good"—it's safe! CANARY DEPLOY Metrics
  5. Production uncertainty circa 2018 What's really going to happen when

    I deploy a change to service C? Did it work? A D E B C F ... Z Many services, many relationships: major incidents are almost always a surprise ("didn't know it worked that way!")
  6. Production uncertainty circa 2018 What's really going to happen when

    I make a change to {service}? Did it work? ... 6 hours later I looked at 5TB of logs and created a pivot table in Microsoft Excel—maybe? DO
  7. Is there a better way? I work at [redacted large SV-based company]. Let me send you this paper!* *https://queue.acm.org/detail.cfm?id=3194655 DO
  8. Is there a better, or rather an easier, way? Application/service instrumentation and metrics can help (a lot). DO
  9. AGENDA: Things *not* discussed in this talk
    • Jenkins (how to build using a pipeline)
    • Rollout tools (how do I script a canary deploy, etc.)
    • Orchestration layer (e.g. Kubernetes, Marathon, etc.)
    • Non-code changes (configuration, firmware, recipes, etc.)
  10. AGENDA: Things discussed in this talk
    • Assumption: You have something (image, container, etc.) that you want to put in production.
    • Introduction: automated canary analysis
    • How to: instrumentation of services and applications for canary deploys
    • Code: Public-cloud proof of concept
    • Resources: Miscellaneous other tools or approaches

    application monitoring baseline + - *apologies to anyone with an academic background in control theory output This is where people talk a lot about observability: how well can you measure the work the system is doing from output?
  12. "OBSERVABILITY" (credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html, @copyconstruct, https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38)

  13. MONITORING OR OBSERVABILITY? (credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html, @copyconstruct, https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38)
    • Black Box: Is it working? (e.g. ping or Nagios check: traditional "monitoring")
    • White Box: Instrumentation of code (e.g. application performance monitoring using agents: increases observability)
  14. AUTOMATED CANARIES: We have the technology. baseline + - output

    ... instrumentation deploy solutions containers + immutable infra // TO DO: insert automation here to detect good/bad deploys above
  15. What is 'bad'? And how do you measure that? It

    depends on several things. DO
  16. CANARY METRIC(S)
    • Should be stable over time (service level objective perhaps?)
    • If customer-facing: does it clearly indicate something's broken for your users when it goes sideways?
    • Scoped to the individual app/service with version as a dimension.
  17. APP INSTRUMENTATION FOR CANARY DEPLOYS (part 1) Build Artifact (container,

    image, binary, etc) Version (this is done almost universally)

    is ideally exposed through platform tooling: exists in Mesos, ECS, K8S, Fargate, and others) https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#the-downward-api 'The Kubernetes Downward API' Orchestrator metadata Injected into Container/App as Env Var
  19. APP INSTRUMENTATION FOR CANARY DEPLOYS
    Version: annotate transaction traces with the current running version.
    In service/app code: newrelic.addCustomParameter('VERSION', deployedV);
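
The two instrumentation steps above (version injected as an environment variable, then attached to transactions) could be wired together roughly as follows. This is a minimal sketch, not the talk's code: `DEPLOYED_VERSION` is a hypothetical variable name standing in for whatever your orchestrator injects, `resolveVersion` and `tagVersion` are illustrative helpers, and the agent is passed in as an argument so the snippet runs without the `newrelic` package installed. The only agent call used is `addCustomParameter`, as named on the slide.

```javascript
// Read the deployed version from the environment.
// DEPLOYED_VERSION is a hypothetical name; use whatever your platform injects.
function resolveVersion(env) {
  const raw = env.DEPLOYED_VERSION;
  // Fall back to 'unknown' so the attribute is always present on traces.
  return typeof raw === 'string' && raw.trim() !== '' ? raw.trim() : 'unknown';
}

// Attach the version to the current transaction via the supplied agent.
// `agent` is expected to expose addCustomParameter(key, value).
function tagVersion(agent, env) {
  const deployedV = resolveVersion(env);
  agent.addCustomParameter('VERSION', deployedV);
  return deployedV;
}

// Usage with a stand-in agent that just records what it was given.
const recorded = [];
const fakeAgent = { addCustomParameter: (key, value) => recorded.push([key, value]) };
tagVersion(fakeAgent, { DEPLOYED_VERSION: '1.4.2' });
```

In a real service the agent argument would be the object returned by `require('newrelic')` and the environment would be `process.env`.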
  20. CLOSING THE LOOP Is the canary better or worse? (NRQL style)
    SELECT percentile(duration, 90) FROM Transaction WHERE appName = 'RPM API Production' AND appGitRevision = 'a2e441...' SINCE 7 minutes ago
    • Event name: Transaction
    • Measurement: percentile(duration, 90)
    • Selector (app name): appName = 'RPM API Production'
    • Selector (time window): SINCE 7 minutes ago
    • Selector (version): appGitRevision = 'a2e441...'
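
To run the same question for any version, the query can be assembled from its selectors. The sketch below is illustrative only: `buildCanaryQuery` is a hypothetical helper, the attribute name `appGitRevision` is taken from the slide, and inputs are restricted to a conservative character set rather than escaped, since this sketch makes no assumptions about NRQL's quoting rules.

```javascript
// Build the slide's NRQL query for a given app, version, and time window.
// buildCanaryQuery is a hypothetical helper, not part of any New Relic library.
function buildCanaryQuery({ appName, version, minutes }) {
  // Allow only letters, digits, underscore, space, dot, and hyphen.
  const safe = /^[\w .-]+$/;
  if (!safe.test(appName) || !safe.test(version)) {
    throw new Error('selector contains unsupported characters');
  }
  if (!Number.isInteger(minutes) || minutes <= 0) {
    throw new Error('minutes must be a positive integer');
  }
  return (
    'SELECT percentile(duration, 90) FROM Transaction ' +
    `WHERE appName = '${appName}' AND appGitRevision = '${version}' ` +
    `SINCE ${minutes} minutes ago`
  );
}

// Usage: the same query shape for a canary revision.
const canaryQuery = buildCanaryQuery({
  appName: 'RPM API Production',
  version: 'abc123',
  minutes: 7,
});
```

Running the same builder with the baseline's version gives a directly comparable number for the existing deployment.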
  21. SOLUTION RECAP Version artifacts. Add custom parameter (version) to monitoring.

    Query metrics for particular version. Automate?
  22. case studies Not automated, quasi-automated, fully-automated.

  23. CASE STUDY 1: mobile Phased Releases (iOS) or Google Play

    Staged Rollouts (Android) More users (%) get new version Is it working? Are there crashes? New version of app Automation score:
  24. CASE STUDY 2: NO SERVERS (AWS) Deploy Start Deploy End

    Before Allow Traffic (first check) After Allow Traffic (final check) Event-driven deploys: hook into them as needed Run some code in response to event Run some code in response to event
  25. NEW FUNCTION DEPLOY Deploy Start Deploy End Before Allow Traffic

    (first check) After Allow Traffic (final check) (Some) customers interact with new version • Is it slower for people? • Are there no errors? Metrics • Error count • Perceived Performance Facets • Function Version • Browser, Geo, Device Canary10Percent10Minutes ... Linear10PercentEvery10Minutes .. AllAtOnce ... Supported Types: Automation score:
  26. WHAT DOES 'GOOD' MEAN?
    Is the page loading slower than expected?* (New Relic NRQL)
    SELECT percentile(duration, 90) FROM PageView SINCE 10 minutes ago WHERE functionVersion='14'
    How many errors are there?* (New Relic NRQL)
    SELECT count(*) FROM JavaScriptError SINCE 10 minutes ago WHERE releaseIds='{"functionVersion":"14"}'
    * These also make great alert conditions
  27. CANARY CHECK: static baseline
    Rejecting any deploys with frontend errors in AWS CodeDeploy:
    const response = JSON.parse(nrqlResponse);
    const errorCount = response.results[0].count;
    if (errorCount > 0) { status = 'Failed'; }
    https://github.com/smithclay/aws-lambda-nr-well-instrumented/blob/master/post-traffic-hook/index.js
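
The static-baseline check above can be wrapped in a function that also fails closed when the query response is missing or malformed. This is a sketch under assumptions, not the code from the linked repository: `checkErrorCount` and `maxErrors` are hypothetical names, the response shape (`results[0].count`) is the one shown on the slide, and `'Succeeded'` is assumed to be the passing counterpart of the slide's `'Failed'` status.

```javascript
// Decide a deploy status from a raw NRQL JSON response (static baseline).
// checkErrorCount and maxErrors are hypothetical names for illustration.
function checkErrorCount(nrqlResponse, maxErrors = 0) {
  let errorCount;
  try {
    const response = JSON.parse(nrqlResponse);
    errorCount = response.results[0].count;
  } catch (err) {
    // Invalid JSON or a missing results array: reject rather than guess.
    return { status: 'Failed', reason: 'unreadable response' };
  }
  if (typeof errorCount !== 'number') {
    return { status: 'Failed', reason: 'missing count' };
  }
  if (errorCount > maxErrors) {
    return { status: 'Failed', reason: `${errorCount} errors` };
  }
  return { status: 'Succeeded', reason: `${errorCount} errors` };
}

// Usage: zero errors passes, any error fails with the default threshold.
const clean = checkErrorCount('{"results":[{"count":0}]}');
const noisy = checkErrorCount('{"results":[{"count":3}]}');
```

Failing closed means a broken metrics query blocks the rollout instead of silently approving it, which is usually the safer default for a canary gate.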
  28. CASE STUDY 3: Kayenta https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69 Load Balancer Canary Existing Majority

    of traffic Some traffic Metrics Baseline
  29. WHAT'S THIS? Load Balancer Canary Existing Majority of traffic Some

    traffic Metrics Baseline "Creating a brand new baseline cluster ensures that the metrics produced are free of any effects caused by long-running processes."

    Metrics Metrics Data Source(s) "The Judge"
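
To make the role of "The Judge" concrete, here is a deliberately simple stand-in that compares canary metrics against a fresh baseline. It is not Kayenta's algorithm: a fixed relative tolerance is swapped in for real statistical judgment, every metric is assumed to be "lower is better" (error rate, response time), and `judge`, `tolerance`, and the metric names are hypothetical.

```javascript
// Compare canary metrics to baseline metrics using a relative tolerance.
// Simplified stand-in for a canary judge; assumes lower values are better.
function judge(baseline, canary, tolerance = 0.1) {
  const verdicts = Object.keys(baseline).map((metric) => {
    const b = baseline[metric];
    const c = canary[metric];
    // The canary may exceed the baseline by at most `tolerance` (10% by default).
    const limit = b * (1 + tolerance);
    // A metric missing from the canary counts as a failure.
    const pass = typeof c === 'number' && c <= limit;
    return { metric, baseline: b, canary: c, pass };
  });
  return { pass: verdicts.every((v) => v.pass), verdicts };
}

// Usage: metric names and values are made up for illustration.
const good = judge({ p90Seconds: 0.2, errorRate: 0.01 }, { p90Seconds: 0.21, errorRate: 0.01 });
const bad = judge({ p90Seconds: 0.2, errorRate: 0.01 }, { p90Seconds: 0.3, errorRate: 0.01 });
```

A single snapshot comparison like this is sensitive to noise; comparing distributions over the canary window is more robust, which is the kind of work a real judge does.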

    Real-time metrics from multiple data sources Dimensionality in metrics: especially version Solid deployment tooling + infrastructure
  32. Next steps Create dashboards with application metrics with version dimensions

    (error rate, response time, throughput)... you'll never believe what happens next!
  33. Thank you! @smithclay 5/14/2018