Release Engineering from the Ground Up

Tom Santero
November 10, 2014

Slides from my talk at the USENIX Release Engineering Summit West '14: https://www.usenix.org/conference/ures14west/summit-program/presentation/santero

Transcript

  1. Release Engineering
    from the Ground Up
    Tom Santero
    @tsantero
    The New York Times Company

  7. Search
    http://nytimes.com

  8. Listener Rules Engine
    Idx
    Mgr
    Asset Data Directory

  15. An Experience Report
    releng

  17. Continuous
    Automated Deploys

  18. Single Target Environment

  19. Repository Migrations

  20. SVN
    GitHub

  21. Master
    Feature
    - release by commit
    - commit

  22. Listens for commits
    - builds on every push to any branch
    Runs unit tests, reports build/test statistics
    If branch == master:
    - cut release as RPM
    - increment version number
    - push RPM to yum repo
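
The branch rule on this slide can be sketched as a small planning function. This is only an illustration, not the actual Jenkins job: the `Version` type, the choice to bump the patch component, and the decision to release only when tests pass are all assumptions that the slide does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Version:
    major: int
    minor: int
    patch: int

    def bump(self) -> "Version":
        # Assumption: each release increments the patch component.
        return Version(self.major, self.minor, self.patch + 1)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def plan_build(branch: str, tests_passed: bool, current: Version) -> List[str]:
    """Return the ordered steps a CI job would run for one push."""
    # Every push to any branch is built and tested.
    steps = ["build", "run unit tests", "report build/test statistics"]
    # Assumption: only a passing build on master produces a release.
    if branch == "master" and tests_passed:
        new_version = current.bump()
        steps += [
            f"cut release RPM {new_version}",
            f"record version {new_version}",
            "push RPM to yum repo",
        ]
    return steps
```

Calling `plan_build("feature/x", True, Version(1, 2, 3))` yields only the three build and test steps, while the same call with `"master"` appends the release steps for version 1.2.4.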

  24. provisioning / termination
    release ver upgrades
    host system configuration
    - registration and discovery

  25. single repo: roles, tasks, files
    abstract out common tasks
    e.g. Elasticsearch, Riak, Jenkins
    parameterized per env + svc

  26. Jenkins: update release tag in Ansible repo
    Source of Truth?
    - correlate builds, releases and environments *

  28. Load Balancer

  29. nyt_lb*
    * naming is hard (also, too bad there’s no logo)
    service registration + discovery
    allow for load balancing internal + external traffic
    lightweight, robust, redundant
    scalable, highly-available

  30. RESTful API
    svc plugins: nginx, haproxy…
    in-memory db
    persistence & failure recovery
    distributed systems magic
    gossip + CRDTs

  31. nyt_lb
    nyt_lb nyt_lb
    all cluster state is represented as CRDTs
    - node membership
    - registered services
    - service attributes

  32. nyt_lb
    nyt_lb nyt_lb
    quorum operations + gossip
    all state is monotonic & confluent
    new state converges
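
To make "monotonic & confluent" concrete, here is a minimal sketch of one state-based CRDT, a two-phase set, used for node membership. The slides do not say which CRDT types nyt_lb uses, so the `TwoPhaseSet` class and the node names below are purely illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoPhaseSet:
    """State-based two-phase set: both internal sets only ever grow."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()

    def add(self, item):
        return TwoPhaseSet(self.added | {item}, self.removed)

    def remove(self, item):
        # A removal is recorded as a tombstone, so state stays monotonic.
        # Limitation of this CRDT: a removed item cannot be re-added.
        return TwoPhaseSet(self.added, self.removed | {item})

    def merge(self, other):
        # Set union is commutative, associative, and idempotent, so
        # replicas converge regardless of gossip order or duplication.
        return TwoPhaseSet(self.added | other.added,
                           self.removed | other.removed)

    def members(self):
        return self.added - self.removed


# Three replicas with diverging views of cluster membership.
a = TwoPhaseSet().add("search-1").add("search-2")
b = a.remove("search-1")
c = TwoPhaseSet().add("search-3")

# Merging in any order yields the same state.
assert a.merge(b).merge(c) == c.merge(b).merge(a)
```

Because `merge` is a plain union of growing sets, gossiping the same state twice or out of order changes nothing, which is the property the slide relies on for convergence.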

  33. nyt_lb
    nyt_lb nyt_lb
    upon provisioning and configuration,
    services register themselves
    services take themselves out of LBs during
    upgrades, maintenance, and destruction

  34. what’s up?

  35. unique identifiers
    env-level tagging

  36. event = {
    'host' : 'ip-10-45-136-116',
    'service' : 'load-average',
    'metric' : 0.7,
    'state' : 'ok',
    'time' : 1413551091.341055,
    'tags' : ['dev', 'suggest-api', 'load'],
    'ttl' : 10,
    'description' : description
    }
    event = {
    'host' : 'ip-10-45-136-116',
    'service' : 'load-average',
    'metric' : 3.2,
    'state' : 'warning',
    'time' : 1413551176.852009,
    'tags' : ['dev', 'suggest-api', 'load'],
    'ttl' : 10,
    'description' : description
    }

  37. operational challenges and failures are a given
    isolate and identify root causes
    check logic belongs close to the thing monitored
    push events; compute per group/env + expectation
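
A rough sketch of that split follows: the check runs next to the monitored host and decides the state, and the receiving side only aggregates per environment tag. The thresholds, the `critical` state, and both function names are assumptions; the deck shows only that a load of 0.7 was reported as `ok` and 3.2 as `warning`.

```python
import time

# Hypothetical cut-offs; the slides do not state the real ones.
WARNING_AT = 2.0
CRITICAL_AT = 5.0


def load_event(host, load, tags, now=None):
    """Build an event shaped like the ones shown on the event slide."""
    if load >= CRITICAL_AT:
        state = "critical"  # assumed state, not shown in the deck
    elif load >= WARNING_AT:
        state = "warning"
    else:
        state = "ok"
    return {
        "host": host,
        "service": "load-average",
        "metric": load,
        "state": state,
        "time": time.time() if now is None else now,
        "tags": list(tags),
        "ttl": 10,
        "description": f"load average is {load}",
    }


def worst_state_per_env(events, envs=("dev", "stg", "prd")):
    """Reduce pushed events to the worst state seen per environment tag."""
    rank = {"ok": 0, "warning": 1, "critical": 2}
    worst = {}
    for event in events:
        for env in envs:
            if env in event["tags"]:
                if env not in worst or rank[event["state"]] > rank[worst[env]]:
                    worst[env] = event["state"]
    return worst
```

Feeding the two example events from the deck (loads 0.7 and 3.2, both tagged `dev`) through `worst_state_per_env` gives `{"dev": "warning"}`.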

  38. Graphite

  39. build dev stg prd

  40. Test Metrics
    System Metrics
    Event Metrics

  41. what does a green test
    really mean, anyway?

  42. maybe the build is red
    because we fixed all the bugs?

  43. test coverage as actionable
    becomes a problem of categorization

  44. which machines are working harder?
    do failures have a pattern?

  45. how often does X happen?
    logging, alerts: indicators

  46. Lessons Learned and Future(?) Work
    A lot of work; difficult tradeoff between a low barrier to entry and a robust system
    Containers are nice, but the ecosystem is still too immature
    Correlating application, system, and build metrics is still manual
    - maybe emit events from Jenkins -> Riemann -> Datomic
    - Push-button re-deploys of point-in-time environments
    Historical performance metrics as automated regression testing
    Automated security auditing, static code analysis, etc.
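
One way the "historical performance metrics as automated regression testing" item could look is sketched below. The deck lists this as future work and gives no method, so the mean-plus-standard-deviations rule, the `is_regression` name, and the default tolerance are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_regression(history, current, tolerance=3.0):
    """Flag `current` when it exceeds the historical mean by more than
    `tolerance` sample standard deviations (higher is assumed worse)."""
    if len(history) < 2:
        # Not enough data to estimate spread; do not fail the build.
        return False
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + tolerance * spread
```

For example, with past latencies of `[100, 102, 98, 101, 99]` milliseconds, a new reading of 103 passes while 110 is flagged.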

  47. Questions?
    Tom Santero
    @tsantero
    The New York Times Company
    8D
