Release Engineering from the Ground Up

Tom Santero
November 10, 2014

Slides from my talk at the USENIX Release Engineering Summit West '14: https://www.usenix.org/conference/ures14west/summit-program/presentation/santero



Transcript

  1. Listens for commits
     - builds on every push to any branch
     - runs unit tests, reports build/test statistics
     - if branch == master:
       - cut release as RPM
       - increment version number
       - push RPM to yum repo
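
The build flow on this slide can be sketched as a small decision function. This is a minimal illustration, not the pipeline from the talk: the function name `plan_build`, the step strings, and the choice to gate the release on passing tests are assumptions. The slide only states that every push is built and tested, and that pushes to master additionally cut an RPM, increment the version number, and push the RPM to a yum repo.

```python
from typing import List


def plan_build(branch: str, tests_passed: bool) -> List[str]:
    """Return the ordered steps a CI job would run for one push.

    Hypothetical helper: the step names mirror the slide, the
    tests_passed gate is an assumption.
    """
    # Every push to any branch is built and tested.
    steps = ["build", "run unit tests", "report build/test statistics"]

    # Only master produces a release artifact.
    if branch == "master" and tests_passed:
        steps += [
            "cut release as RPM",
            "increment version number",
            "push RPM to yum repo",
        ]
    return steps


feature_steps = plan_build("feature/search", tests_passed=True)
master_steps = plan_build("master", tests_passed=True)
```

A feature branch stops after reporting statistics, while a green build on master continues through the three release steps.
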
  2. provisioning / termination
     - release version upgrades
     - host system configuration
       - registration and discovery
  3. single repo: roles, tasks, files
     - abstract out common tasks, e.g. ElasticSearch, Riak, Jenkins
     - parameterized per env + svc
  4. Jenkins: update release tag in Ansible repo
     - Source of Truth?
       - correlate builds, releases and environments *
  5. nyt_lb* (* naming is hard; also, too bad there’s no logo)
     - service registration + discovery
     - allow for load balancing internal + external traffic
     - lightweight, robust, redundant
     - scalable, highly-available
  6. RESTful API
     - svc plugins: nginx, haproxy…
     - in-memory db
     - persistence & failure recovery
     - distributed systems magic
       - gossip + CRDTs
  7. (diagram: three nyt_lb nodes) all cluster state is modeled as CRDTs
     - node membership
     - registered services
     - service attributes
  8. (diagram: three nyt_lb nodes) quorum operations + gossip
     - all state is monotonic & confluent
     - new state converges
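
To illustrate why monotonic, confluent state converges under gossip, here is a minimal sketch using a grow-only set (G-Set), the simplest CRDT. The talk does not say which CRDT types nyt_lb uses, so the `GSet` class and the node names are assumptions chosen for illustration. The point is that a merge that is commutative, associative, and idempotent yields the same result no matter how gossip messages are ordered or duplicated.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class GSet:
    """Grow-only set: elements can be added but never removed (monotonic)."""

    items: FrozenSet[str] = frozenset()

    def add(self, item: str) -> "GSet":
        return GSet(self.items | {item})

    def merge(self, other: "GSet") -> "GSet":
        # Set union is commutative, associative, and idempotent, so
        # replicas can merge in any order, any number of times.
        return GSet(self.items | other.items)


# Three replicas learn about cluster members independently.
a = GSet().add("node-1")
b = GSet().add("node-2")
c = GSet().add("node-3").add("node-1")

# Gossip delivered in two different orders, one with a duplicate delivery.
order_1 = a.merge(b).merge(c)
order_2 = c.merge(a).merge(b).merge(a)

converged = order_1 == order_2
```

A grow-only set cannot express removals. Tracking services that deregister would need a richer type, such as a two-phase set or an observed-remove set.
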
  9. (diagram: three nyt_lb nodes) upon provision and configuration, services register themselves
     - take themselves out of LBs during upgrades, maintenance, destroy
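
The register and deregister lifecycle described here can be sketched with an in-memory registry and a context manager. Everything in this example is hypothetical: the `Registry` class, the `out_of_rotation` helper, and the host addresses are stand-ins, since the slides do not show the nyt_lb API. The choice to re-register only after a successful upgrade is likewise an assumption.

```python
from contextlib import contextmanager
from typing import Dict, List, Set


class Registry:
    """Toy stand-in for a service registry (not the real nyt_lb API)."""

    def __init__(self) -> None:
        self._services: Dict[str, Set[str]] = {}

    def register(self, service: str, host: str) -> None:
        self._services.setdefault(service, set()).add(host)

    def deregister(self, service: str, host: str) -> None:
        self._services.get(service, set()).discard(host)

    def backends(self, service: str) -> List[str]:
        # Hosts a load balancer would currently route to.
        return sorted(self._services.get(service, set()))


@contextmanager
def out_of_rotation(registry: Registry, service: str, host: str):
    """Take a host out of the load balancer for the duration of a block."""
    registry.deregister(service, host)
    yield
    # Reached only if the block raised no exception, so a failed
    # upgrade leaves the host out of rotation.
    registry.register(service, host)


registry = Registry()
registry.register("suggest-api", "10.0.0.1")
registry.register("suggest-api", "10.0.0.2")

with out_of_rotation(registry, "suggest-api", "10.0.0.1"):
    during_upgrade = registry.backends("suggest-api")

after_upgrade = registry.backends("suggest-api")
```
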
  10. event = {
        'host' : ip-10-45-136-116,
        'service' : load-average,
        'metric' : 0.7,
        'state' : ok,
        'time' : 1413551091.341055,
        'tags' : [dev, suggest-api, load],
        'ttl' : 10,
        'description' : description
      }
      event = {
        'host' : ip-10-45-136-116,
        'service' : load-average,
        'metric' : 3.2,
        'state' : warning,
        'time' : 1413551176.852009,
        'tags' : [dev, suggest-api, load],
        'ttl' : 10,
        'description' : description
      }
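
The two events above differ only in metric, state, and time. A minimal sketch of how a check running on the host might build such an event and derive its state follows. The thresholds `WARNING_AT` and `CRITICAL_AT`, the helper names, and the description string are assumptions: the slide shows only that a load of 0.7 was reported as ok and 3.2 as warning. The expiry check reflects the usual meaning of `ttl` in this event model, where an event that is not refreshed within `ttl` seconds is treated as stale.

```python
import time
from typing import Any, Dict, List, Optional

WARNING_AT = 2.0   # assumed threshold, not from the talk
CRITICAL_AT = 5.0  # assumed threshold, not from the talk


def load_state(load: float) -> str:
    """Map a load average to a state string."""
    if load >= CRITICAL_AT:
        return "critical"
    if load >= WARNING_AT:
        return "warning"
    return "ok"


def make_event(host: str, load: float, tags: List[str],
               now: Optional[float] = None, ttl: int = 10) -> Dict[str, Any]:
    """Build an event dict with the same fields as the slide."""
    return {
        "host": host,
        "service": "load-average",
        "metric": load,
        "state": load_state(load),
        "time": time.time() if now is None else now,
        "tags": tags,
        "ttl": ttl,
        "description": f"1-minute load average is {load}",
    }


def is_expired(event: Dict[str, Any], now: float) -> bool:
    """True if the event has not been refreshed within its ttl."""
    return now - event["time"] > event["ttl"]


tags = ["dev", "suggest-api", "load"]
first = make_event("ip-10-45-136-116", 0.7, tags, now=1413551091.341055)
second = make_event("ip-10-45-136-116", 3.2, tags, now=1413551176.852009)
```

Computing the state next to the thing being measured keeps the check logic on the host, and the receiving side only has to aggregate and expire events.
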
  11. operational challenges and failures are a given
      - isolate and identify root causes
      - check logic belongs close to the thing monitored
      - push events; compute per group/env + expectation
  12. Lessons Learned and Future(?) Work
      - Lot of work; difficult tradeoff between a low barrier to entry and a robust system
      - Containers are nice, but the ecosystem is still too immature
      - Correlating application, system, and build metrics is still manual
        - maybe emit events from Jenkins -> Riemann -> Datomic
        - Push-button re-deploys of point-in-time environments
      - Historical performance metrics as automated regression testing
      - Automated security auditing, static code analysis, etc.