Automated Canary Analysis with New Relic

Clay Smith

May 15, 2018

Transcript

  1. Automated Canary
    Analysis
    (New Relic-flavored introduction)
    @smithclay
    5/14/2018

  2. It looks like you're doing a
    canary deploy.

  3. It looks like you're doing a
    canary deploy.
    I want to make sure this really
    works in production.

  4. CANARY DEPLOY
    [Diagram: a load balancer sends the majority of traffic to the
    version existing in production and some traffic to the canary;
    metrics are collected.]
    If the canary traffic "looks good", it's safe!

  5. Production uncertainty
    circa 2018
    What's really going to happen
    when I deploy a change to
    service C? Did it work?
    [Diagram: graph of interconnected services A, B, C, D, E, F, ... Z]
    Many services, many relationships:
    major incidents are almost always a surprise
    ("didn't know it worked that way!")

  6. Production uncertainty
    circa 2018
    What's really going to happen
    when I make a change to
    {service}? Did it work?
    ... 6 hours later
    I looked at 5TB of logs and
    created a pivot table in
    Microsoft Excel—maybe?

  7. Is there a better
    way?
    I work at [redacted large SV-based
    company]. Let me send
    you this paper!*
    *https://queue.acm.org/detail.cfm?id=3194655

  8. Is there a better (or rather, an easier) way?
    Application/service
    instrumentation and metrics
    can help (a lot).

  9. Things *not* discussed in this talk
    • Jenkins (how to build using a pipeline)

    • Rollout tools (how do I script a canary deploy, etc)

    • Orchestration layer (e.g. Kubernetes, Marathon, etc.)

    • Non-code changes (configuration, firmware, recipes, etc)
    AGENDA

  10. Things discussed in this talk
    • Assumption: You have something (image, container, etc)
    that you want to put in production.

    • Introduction: automated canary analysis

    • How to: instrumentation of services and applications
    for canary deploys

    • Code: Public-cloud proof of concept

    • Resources: Miscellaneous other tools or approaches
    AGENDA

  11. (KINDA) CONTROL THEORY*
    FEEDBACK LOOP APPLIED TO DEPLOYS
    [Diagram: feedback loop. A baseline is compared (+/-) with
    monitoring of the running application's output, and the result
    feeds the deployer.]
    This is where people talk a lot about observability: how well can
    you measure the work the system is doing from its output?
    *apologies to anyone with an academic background in control theory

  12. "OBSERVABILITY"
    credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html;
    @copyconstruct, https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38

  13. MONITORING OR
    OBSERVABILITY?
    credit: @peterbourgon, https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html;
    @copyconstruct, https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38

    White Box: instrumentation of code
    (e.g. application performance monitoring using agents;
    increases observability)
    Black Box: is it working?
    (e.g. a ping or Nagios check; traditional "monitoring")

  14. AUTOMATED CANARIES: We
    have the technology.
    [Diagram: the feedback loop again (baseline, +/-, output), labeled
    with instrumentation, deploy solutions, and containers + immutable
    infra.]
    // TO DO: insert automation here to detect good/bad deploys above

  15. What is 'bad'? And how do you
    measure that?
    It depends on several things.

  16. CANARY METRIC(S)
    • Should be stable over time (service level objective
    perhaps?)

    • If customer-facing: does it clearly indicate something's
    broken for your users when it goes sideways?

    • Scoped to the individual app/service with version as a
    dimension.

  17. APP INSTRUMENTATION FOR
    CANARY DEPLOYS (part 1)
    Build Artifact
    (container, image,
    binary, etc)
    Version
    (this is done almost universally)

  18. HOW DO YOU GET THE VERSION
    IN CODE?
    (This stuff is ideally exposed through platform tooling:
    exists in Mesos, ECS, K8S, Fargate, and others)
    'The Kubernetes Downward API':
    https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#the-downward-api
    Orchestrator metadata, injected into container/app as env var
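
A minimal Node.js sketch of the application side of this slide: reading a version that the platform has injected as an environment variable. The variable name `APP_VERSION` and the `'unknown'` fallback are assumptions for illustration; the deck does not name them, and the orchestrator has to be configured to set whatever name you choose.

```javascript
// Hypothetical variable name; the orchestrator must be configured to inject it.
const VERSION_ENV_VAR = 'APP_VERSION';

// Returns the injected version, or 'unknown' when it is missing or blank.
// `env` defaults to process.env and is a parameter only to ease testing.
function getDeployedVersion(env = process.env) {
  const value = env[VERSION_ENV_VAR];
  return value && value.trim() !== '' ? value.trim() : 'unknown';
}

console.log(`running version: ${getDeployedVersion()}`);
```

Falling back to a sentinel value instead of throwing keeps the service running when the variable is absent, at the cost of untagged metrics for that instance.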

  19. APP INSTRUMENTATION FOR
    CANARY DEPLOYS
    Version: annotate transaction traces with
    current running version
    In service/app code:
    newrelic.addCustomParameter('VERSION', deployedV);
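
A slightly fuller sketch of the call on this slide, written so it runs without the agent installed. The agent object is passed in rather than loaded with `require('newrelic')`, and a stand-in object plays its role here. The helper name `annotateWithVersion`, the `APP_VERSION` variable, and the fallback to `addCustomAttribute` (the name newer Node.js agent versions use for this call) are assumptions, not something the deck shows.

```javascript
// `agent` is expected to be the object returned by require('newrelic').
// It is a parameter so this sketch can run with a stand-in.
function annotateWithVersion(agent, version) {
  // Prefer the newer method name when present, else the one on the slide.
  const add = agent.addCustomAttribute || agent.addCustomParameter;
  if (typeof add !== 'function') return false;
  add.call(agent, 'VERSION', version);
  return true;
}

// Stand-in agent that records what it was given (illustration only).
const recorded = {};
const fakeAgent = {
  addCustomParameter: (key, value) => { recorded[key] = value; },
};

// APP_VERSION is a hypothetical environment variable name.
annotateWithVersion(fakeAgent, process.env.APP_VERSION || 'unknown');
```

In a real service this would typically be called once per transaction, for example from request middleware.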

  20. CLOSING THE LOOP
    Is the canary better or worse? (NRQL style)
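    SELECT percentile(duration, 90) FROM Transaction
    WHERE appName='RPM API Production'
    AND appGitRevision='a2e441...'
    SINCE 7 minutes ago
    Parts of the query:
    • Measurement: percentile(duration, 90)
    • Event name: Transaction
    • Selector (app name): appName
    • Selector (version): appGitRevision
    • Selector (time window): SINCE 7 minutes ago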
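
To automate this check, the query has to be generated per deploy. The sketch below builds the slide's query from an app name, a version, and a time window. The function names are hypothetical, and the backslash escaping of single quotes is an assumption about NRQL string literals that should be checked against the NRQL documentation.

```javascript
// Escape backslashes and single quotes before placing a value inside a
// single-quoted NRQL string literal (assumed escaping rule).
function escapeNrqlString(value) {
  return String(value).replace(/\\/g, '\\\\').replace(/'/g, "\\'");
}

// Builds the slide's query: measurement, event name, and three selectors.
function buildCanaryQuery({ appName, version, minutes = 7, percentile = 90 }) {
  return [
    `SELECT percentile(duration, ${Number(percentile)}) FROM Transaction`,
    `WHERE appName = '${escapeNrqlString(appName)}'`,
    `AND appGitRevision = '${escapeNrqlString(version)}'`,
    `SINCE ${Number(minutes)} minutes ago`,
  ].join(' ');
}

console.log(buildCanaryQuery({ appName: 'RPM API Production', version: 'a2e441' }));
```

Running the same builder once with the canary's version and once with the current production version yields the two numbers to compare.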

  21. SOLUTION RECAP
    Version artifacts.
    Add custom parameter (version) to
    monitoring.
    Query metrics for particular version.
    Automate?

  22. case studies
    Not automated, quasi-automated, fully-automated.

  23. CASE STUDY 1: mobile
    Phased Releases (iOS) or Google Play Staged Rollouts (Android)
    More users (%) get new version
    Is it working? Are there crashes?
    New version
    of app
    Automation score:

  24. CASE STUDY 2: NO
    SERVERS (AWS)
    Event-driven deploys: hook into them as needed
    [Diagram: timeline from Deploy Start to Deploy End with two hooks,
    Before Allow Traffic (first check) and After Allow Traffic
    (final check). Each hook runs some code in response to the event.]

  25. NEW FUNCTION DEPLOY
    [Diagram: timeline from Deploy Start to Deploy End. Before Allow
    Traffic (first check), then (some) customers interact with the
    new version, then After Allow Traffic (final check).]
    • Is it slower for people?

    • Are there no errors?
    Metrics
    • Error count

    • Perceived Performance
    Facets
    • Function Version

    • Browser, Geo, Device
    Supported Types:
    Canary10Percent10Minutes
    ...
    Linear10PercentEvery10Minutes
    ...
    AllAtOnce
    ...
    Automation score:
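
The three deployment types named above differ only in how quickly traffic moves to the new function version. The sketch below encodes one reading of those names: canary shifts 10% at the start and the rest after 10 minutes, and linear adds 10% at the start and every 10 minutes after. That reading, and the function name `trafficToNewVersion`, are assumptions; confirm the exact behavior in the AWS CodeDeploy documentation.

```javascript
// Percentage of traffic on the new version after `minutesElapsed` minutes,
// assuming the deploy has not been rolled back by a failed check.
function trafficToNewVersion(type, minutesElapsed) {
  switch (type) {
    case 'AllAtOnce':
      return 100;
    case 'Canary10Percent10Minutes':
      // 10% during the first 10 minutes, then everything.
      return minutesElapsed < 10 ? 10 : 100;
    case 'Linear10PercentEvery10Minutes': {
      // 10% at the start, plus 10% for each full 10 minutes elapsed.
      const steps = Math.floor(minutesElapsed / 10) + 1;
      return Math.min(100, steps * 10);
    }
    default:
      throw new Error(`unsupported deployment type: ${type}`);
  }
}

console.log(trafficToNewVersion('Linear10PercentEvery10Minutes', 25));
```

The slower the shift, the more data each check sees before most customers reach the new version.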

  26. WHAT DOES 'GOOD' MEAN?
    Is the page loading slower than expected?* (New Relic NRQL)
    SELECT percentile(duration, 90) FROM PageView SINCE 10 minutes ago WHERE functionVersion='14'
    How many errors are there?* (New Relic NRQL)
    SELECT count(*) FROM JavaScriptError SINCE 10 minutes ago WHERE releaseIds='{"functionVersion":"14"}'
    * These also make great alert conditions

  27. CANARY CHECK: static
    baseline
    Rejecting any deploys with frontend
    errors in AWS CodeDeploy
    // Parse the NRQL query result (a JSON string).
    const response = JSON.parse(nrqlResponse);
    // Number of JavaScript errors recorded for the new version.
    const errorCount = response.results[0].count;
    // Static baseline: any error at all fails the deploy.
    if (errorCount > 0) {
      status = 'Failed';
    }
    https://github.com/smithclay/aws-lambda-nr-well-instrumented/blob/master/post-traffic-hook/index.js

  28. CASE STUDY 3: Kayenta
    https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69
    [Diagram: a load balancer sends the majority of traffic to the
    existing cluster and some traffic to the canary and a baseline
    cluster; metrics are collected.]

  29. WHAT's THIS?
    [Diagram: load balancer, existing cluster (majority of traffic),
    canary and baseline (some traffic), metrics.]
    "Creating a brand new baseline cluster ensures that the
    metrics produced are free of any effects caused by
    long-running processes."
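
With a fresh baseline cluster running next to the canary, the check becomes a comparison between two measurements instead of a fixed threshold. The sketch below is a deliberately simple relative-tolerance comparison for a lower-is-better metric such as p90 latency. It is not Kayenta's judging algorithm; the function name `judge` and the 10% default tolerance are assumptions chosen for illustration.

```javascript
// Compares a canary measurement against a baseline measurement for a
// metric where lower is better (for example, p90 latency).
function judge(baseline, canary, tolerance = 0.1) {
  // Without a positive baseline or a finite canary value there is nothing
  // meaningful to compare.
  if (!(baseline > 0) || !Number.isFinite(canary)) {
    return { verdict: 'inconclusive', ratio: null };
  }
  const ratio = canary / baseline;
  // Pass when the canary is at most `tolerance` worse than the baseline.
  return { verdict: ratio <= 1 + tolerance ? 'pass' : 'fail', ratio };
}

console.log(judge(200, 210)); // canary is 5% slower than the baseline
console.log(judge(200, 260)); // canary is 30% slower than the baseline
```

A single-number comparison like this is noisy on small samples, which is one reason dedicated tools compare distributions across many metrics.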

  30. METRIC COMPARISON
    SCOPE: CLUSTER TYPE
    AND TIME WINDOW
    [Diagram: metrics from the data source(s) flow to "The Judge".]

  31. YOU MUST BE THIS TALL TO DO
    AUTOMATED CANARY ANALYSIS
    Real-time metrics from multiple data sources
    Dimensionality in metrics: especially version
    Solid deployment tooling + infrastructure

  32. Next steps
    Create dashboards with
    application metrics with version
    dimensions (error rate, response
    time, throughput)... you'll never
    believe what happens next!

  33. Thank you!
    @smithclay
    5/14/2018
