Release Engineering from the Ground Up

Tom Santero
November 10, 2014

Slides from my talk at the USENIX Release Engineering Summit West '14: https://www.usenix.org/conference/ures14west/summit-program/presentation/santero


Transcript

  1. 2.
  2. 3.
  3. 4.
  4. 5.
  5. 6.
  6. 9.
  7. 10.
  8. 11.
  9. 12.
  10. 13.
  11. 14.
  12. 16.
  13. 22.

    - Listens for commits
      - builds on every push to any branch
    - Runs unit tests, reports build/test statistics
    - If branch == master:
      - cut release as RPM
      - increment version number
      - push RPM to yum repo
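The branch rule on this slide can be sketched as a small planning function. This is only an illustration, not the actual Jenkins job: the names `bump_patch` and `plan_build`, the `major.minor.patch` version scheme, and the choice to bump the version before cutting the RPM are all assumptions.

```python
def bump_patch(version):
    """Increment the last component of a 'major.minor.patch' version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def plan_build(branch, version):
    """Return (steps, version_after_build) for a push to `branch`.

    Every branch is built and tested; only master produces a release.
    """
    steps = ["build", "run unit tests", "report build/test statistics"]
    if branch != "master":
        return steps, version
    release = bump_patch(version)  # assumption: patch-level bump per release
    steps += [
        f"cut release {release} as RPM",
        f"record version {release}",
        f"push RPM {release} to yum repo",
    ]
    return steps, release


if __name__ == "__main__":
    for branch in ("feature/search", "master"):
        steps, version = plan_build(branch, "1.4.1")
        print(branch, version, steps)
```

Returning the plan as data rather than running it keeps the rule easy to test without a build server.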
  14. 23.
  15. 24.

    - provisioning / termination
    - release ver upgrades
    - host system configuration
      - registration and discovery
  16. 25.

    - single repo: roles, tasks, files
    - abstract out common tasks, e.g. ElasticSearch, Riak, Jenkins
    - parameterized per env + svc
  17. 26.

    - Jenkins: update release tag in Ansible repo
    - Source of Truth?
      - correlate builds, releases and environments *
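One way to picture the "update release tag" step is a function that rewrites a single key in a vars file. The slides do not show the repo layout, so the `<service>_release` key, the flat `key: value` format, and the function name `set_release_tag` are hypothetical.

```python
import re


def set_release_tag(vars_text, service, tag):
    """Return vars_text with '<service>_release: <tag>' set.

    The key name is a made-up convention for this sketch. The line is
    replaced if present and appended otherwise.
    """
    key = f"{service}_release"
    line = f"{key}: {tag}"
    pattern = re.compile(rf"^{re.escape(key)}:.*$", re.MULTILINE)
    if pattern.search(vars_text):
        # A callable replacement avoids backslash handling in the tag string.
        return pattern.sub(lambda _match: line, vars_text)
    separator = "" if vars_text.endswith("\n") or not vars_text else "\n"
    return f"{vars_text}{separator}{line}\n"


if __name__ == "__main__":
    current = "env: dev\nsuggest_api_release: 1.4.1\n"
    print(set_release_tag(current, "suggest_api", "1.4.2"))
```

Committing the result ties each deployed release to a repo revision, which is one reading of the "Source of Truth?" question on the slide.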
  18. 27.
  19. 29.

    nyt_lb*
    - service registration + discovery
    - allow for load balancing internal + external traffic
    - lightweight, robust, redundant
    - scalable, highly-available
    * naming is hard (also, too bad there’s no logo)
  20. 30.

    - RESTful API
    - svc plugins: nginx, haproxy…
    - in-memory db
    - persistence & failure recovery
    - distributed systems magic
      - gossip + CRDTs
  21. 31.

    [diagram: three nyt_lb nodes]
    - all cluster state is held in CRDTs
      - node membership
      - registered services
      - service attributes
  22. 32.

    [diagram: three nyt_lb nodes]
    - quorum operations + gossip
    - all state is monotonic & confluent
    - new state converges
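To make "monotonic & confluent" concrete, here is a minimal state-based CRDT. The slides do not say which CRDT types nyt_lb uses, so the two-phase set below is an assumption chosen for brevity, and the class name `TwoPhaseSet` is illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TwoPhaseSet:
    """State-based 2P-Set: one grow-only set of adds, one of removes."""

    added: frozenset = field(default_factory=frozenset)
    removed: frozenset = field(default_factory=frozenset)

    def add(self, item):
        return TwoPhaseSet(self.added | {item}, self.removed)

    def remove(self, item):
        # A removal is stored as a tombstone, so state only ever grows.
        return TwoPhaseSet(self.added, self.removed | {item})

    def merge(self, other):
        # Set union is commutative, associative and idempotent, so replicas
        # converge whatever the gossip order or number of repeated messages.
        return TwoPhaseSet(self.added | other.added, self.removed | other.removed)

    def members(self):
        return self.added - self.removed


if __name__ == "__main__":
    node_a = TwoPhaseSet().add("web-1").add("web-2")
    node_b = TwoPhaseSet().add("web-2").add("web-3").remove("web-2")
    print(sorted(node_a.merge(node_b).members()))
```

A known limit of this particular type is that a removed element cannot be added again; a system that re-registers the same service after an upgrade would need something richer, such as an observed-remove set.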
  23. 33.

    [diagram: three nyt_lb nodes]
    - upon provision and configuration, services register themselves
    - take themselves out of LBs during upgrades, maintenance, destroy
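The register / take-out-of-rotation lifecycle can be sketched with an in-memory stand-in. The real nyt_lb API is not shown in the slides, so `Registry` and `out_of_rotation` are hypothetical names, and re-registering only after a successful upgrade is a design assumption.

```python
from contextlib import contextmanager


class Registry:
    """In-memory stand-in for a service registry (not the nyt_lb API)."""

    def __init__(self):
        self._services = {}  # service name -> set of hosts in rotation

    def register(self, service, host):
        self._services.setdefault(service, set()).add(host)

    def deregister(self, service, host):
        self._services.get(service, set()).discard(host)

    def hosts(self, service):
        return sorted(self._services.get(service, set()))


@contextmanager
def out_of_rotation(registry, service, host):
    """Remove a host from rotation while the body runs.

    The host is registered again only if the body finishes without raising,
    so a failed upgrade leaves it out of the load balancer.
    """
    registry.deregister(service, host)
    yield
    registry.register(service, host)


if __name__ == "__main__":
    registry = Registry()
    registry.register("suggest-api", "10.0.0.1")
    registry.register("suggest-api", "10.0.0.2")
    with out_of_rotation(registry, "suggest-api", "10.0.0.1"):
        print("during upgrade:", registry.hosts("suggest-api"))
    print("after upgrade:", registry.hosts("suggest-api"))
```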
  24. 36.

    event = {
        'host': 'ip-10-45-136-116',
        'service': 'load-average',
        'metric': 0.7,
        'state': 'ok',
        'time': 1413551091.341055,
        'tags': ['dev', 'suggest-api', 'load'],
        'ttl': 10,
        'description': 'description',
    }

    event = {
        'host': 'ip-10-45-136-116',
        'service': 'load-average',
        'metric': 3.2,
        'state': 'warning',
        'time': 1413551176.852009,
        'tags': ['dev', 'suggest-api', 'load'],
        'ttl': 10,
        'description': 'description',
    }
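Events of this shape could be produced by a small check running on the host itself. The sketch below is illustrative only: the thresholds are invented (the slide shows only that 0.7 is ok and 3.2 is a warning), and `load_state`, `make_event` and `is_expired` are hypothetical helpers, not part of any Riemann client.

```python
import time

# Hypothetical thresholds, chosen only to agree with the two events above.
WARNING_AT = 2.0
CRITICAL_AT = 5.0


def load_state(load):
    """Map a load average to a state string."""
    if load >= CRITICAL_AT:
        return "critical"
    if load >= WARNING_AT:
        return "warning"
    return "ok"


def make_event(host, load, tags, now=None, ttl=10):
    """Build an event dict with the same fields as the slide's examples."""
    return {
        "host": host,
        "service": "load-average",
        "metric": load,
        "state": load_state(load),
        "time": time.time() if now is None else now,
        "tags": list(tags),
        "ttl": ttl,
    }


def is_expired(event, now):
    """An event older than its ttl (in seconds) is treated as stale."""
    return now - event["time"] > event["ttl"]


if __name__ == "__main__":
    event = make_event("ip-10-45-136-116", 3.2, ["dev", "suggest-api", "load"])
    print(event["state"], is_expired(event, time.time()))
```

The `ttl` field lets a consumer notice a host that has stopped reporting: if no fresh event arrives within that window, the last one goes stale.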
  25. 37.

    - operational challenges and failures are a given
    - isolate and identify root causes
    - check logic belongs close to the thing monitored
    - push events; compute per grp/env + expectation
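"Compute per grp/env + expectation" could be read as aggregating pushed events per environment and service, then comparing each aggregate against an expected ceiling. That reading, the use of tags to carry the environment, and the names `summarize` and `violations` are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import mean


def summarize(events, envs):
    """Return the mean metric per (env, service) group.

    The environment is taken from the first tag that names a known env.
    """
    groups = defaultdict(list)
    for event in events:
        env = next((tag for tag in event["tags"] if tag in envs), "unknown")
        groups[(env, event["service"])].append(event["metric"])
    return {key: mean(values) for key, values in groups.items()}


def violations(summary, expected_max):
    """List the groups whose mean exceeds the expected ceiling for that group."""
    return sorted(
        key
        for key, value in summary.items()
        if key in expected_max and value > expected_max[key]
    )


if __name__ == "__main__":
    events = [
        {"service": "load-average", "metric": 0.7, "tags": ["dev", "load"]},
        {"service": "load-average", "metric": 3.2, "tags": ["dev", "load"]},
        {"service": "load-average", "metric": 0.4, "tags": ["prod", "load"]},
    ]
    summary = summarize(events, envs={"dev", "prod"})
    print(violations(summary, {("dev", "load-average"): 1.5}))
```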
  26. 38.
  27. 46.

    Lessons Learned and Future(?) Work
    - Lot of work; difficult tradeoff for low barrier to entry + robust system
    - Containers are nice, but ecosystem is still too immature
    - Correlating application, system, build metrics still manual
      - maybe emit events from Jenkins -> Riemann -> Datomic
      - Push-button re-deploys of point-in-time environments
    - Historical performance metrics as automated regression testing
    - Automated security auditing, static code analysis, etc.