Release Engineering from the Ground Up

Slide 1

Slide 1 text

Release Engineering from the Ground Up
Tom Santero
@tsantero
The New York Times Company

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Search http://nytimes.com

Slide 8

Slide 8 text

Listener Rules Engine Idx Mgr Asset Data Directory

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

An Experience Report releng

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Continuous Automated Deploys

Slide 18

Slide 18 text

Single Target Environment

Slide 19

Slide 19 text

Repository Migrations

Slide 20

Slide 20 text

SVN GitHub

Slide 21

Slide 21 text

Master
- release by commit

Feature
- commit

Slide 22

Slide 22 text

Listens for commits
- builds on every push to any branch

Runs unit tests, reports build/test statistics

If branch == master:
- cut release as RPM
- increment version number
- push RPM to yum repo
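
The branch rule above can be sketched as a small planning function. This is only an illustration, not the actual Jenkins job: `plan_build`, `bump_patch`, and the three-part `major.minor.patch` version scheme are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildPlan:
    steps: tuple
    next_version: str


def bump_patch(version: str) -> str:
    """Increment the last component, e.g. 1.4.2 -> 1.4.3 (assumed scheme)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def plan_build(branch: str, current_version: str) -> BuildPlan:
    # Every push to any branch is built, tested, and reported on.
    steps = ["build", "run unit tests", "report build/test statistics"]
    if branch != "master":
        return BuildPlan(tuple(steps), current_version)
    # Only master produces a release artifact.
    new_version = bump_patch(current_version)
    steps += [
        "cut release as RPM",
        f"increment version number to {new_version}",
        "push RPM to yum repo",
    ]
    return BuildPlan(tuple(steps), new_version)
```

A feature branch yields only the three build/test steps and leaves the version unchanged; a push to master adds the three release steps.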

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

provisioning / termination

release version upgrades

host system configuration
- registration and discovery

Slide 25

Slide 25 text

single repo: roles, tasks, files

abstract out common tasks
e.g. ElasticSearch, Riak, Jenkins

parameterized per env + svc
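
One way to picture "parameterized per env + svc" is as layered lookups, where service settings override environment settings, which override common defaults. The sketch below is plain Python, not Ansible; the variable names, values, and the precedence order are all assumptions made for illustration.

```python
from collections import ChainMap

# Hypothetical values, for illustration only.
COMMON = {"java_heap": "1g", "log_level": "info"}
PER_ENV = {"dev": {"log_level": "debug"}, "prd": {"java_heap": "8g"}}
PER_SVC = {"suggest-api": {"port": 8080}}


def resolve(env: str, svc: str) -> dict:
    """Flatten the layers; earlier maps in the ChainMap take priority."""
    layers = ChainMap(PER_SVC.get(svc, {}), PER_ENV.get(env, {}), COMMON)
    return dict(layers)
```

Here `resolve("dev", "suggest-api")` picks up the dev log level and the service port while inheriting the common heap size.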

Slide 26

Slide 26 text

Jenkins: update release tag in Ansible repo

Source of Truth?
- correlate builds, releases and environments *

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Load Balancer

Slide 29

Slide 29 text

nyt_lb*

* naming is hard (also, too bad there’s no logo)

service registration + discovery

allow for load balancing internal + external traffic

lightweight, robust, redundant

scalable, highly available

Slide 30

Slide 30 text

RESTful API

svc plugins: nginx, haproxy…

in-memory db

persistence & failure recovery

distributed systems magic
- gossip + CRDTs

Slide 31

Slide 31 text

nyt_lb nyt_lb nyt_lb

all cluster state is stored as CRDTs
- node membership
- registered services
- service attributes
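
The slides do not say which CRDTs nyt_lb uses. As one possibility for a set such as registered services, here is a minimal observed-remove set (OR-Set) sketch; the class and its method names are assumptions for illustration, not nyt_lb's implementation.

```python
import uuid


class ORSet:
    """Observed-remove set: an add wins over a concurrent remove."""

    def __init__(self):
        self.adds = {}        # element -> set of unique add tags
        self.removed = set()  # tombstoned add tags

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Tombstone only the add tags this replica has observed so far.
        self.removed |= self.adds.get(element, set())

    def contains(self, element):
        return bool(self.adds.get(element, set()) - self.removed)

    def elements(self):
        return {e for e in self.adds if self.contains(e)}

    def merge(self, other):
        # Union of both components; never discards information.
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        self.removed |= other.removed
```

Because `merge` is a union on both components, replicas that exchange state in any order end up with the same set of elements.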

Slide 32

Slide 32 text

nyt_lb nyt_lb nyt_lb

quorum operations + gossip

all state is monotonic & confluent

new state converges
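
"Monotonic and confluent" means that merging replica states gives the same result regardless of delivery order. A minimal sketch of that property, assuming state is a map from key to a version number that only grows (the keys and numbers below are made up):

```python
from functools import reduce
from itertools import permutations


def merge(left, right):
    """Pointwise max of two {key: version} maps (a semilattice join)."""
    keys = left.keys() | right.keys()
    return {k: max(left.get(k, 0), right.get(k, 0)) for k in keys}


# Three replicas with divergent views (hypothetical data).
replicas = [
    {"search": 3, "suggest-api": 1},
    {"search": 2, "suggest-api": 4},
    {"cms": 7},
]

# Merge the replica states in every possible order.
results = [reduce(merge, order, {}) for order in permutations(replicas)]
converged = results[0]
```

Every ordering produces the same `converged` map, and merging it with itself changes nothing, which is why gossip can deliver updates in any order or more than once.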

Slide 33

Slide 33 text

nyt_lb nyt_lb nyt_lb

upon provision and configuration, services register themselves

take themselves out of LBs during upgrades, maintenance, destroy
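
The register / take-out-of-rotation pattern can be sketched with a context manager. The `Registry` class below is an in-memory stand-in invented for this example; it is not the nyt_lb REST API, whose endpoints the slides do not show.

```python
from contextlib import contextmanager


class Registry:
    """In-memory stand-in for a service registry (hypothetical)."""

    def __init__(self):
        self._services = {}

    def register(self, name, address):
        self._services.setdefault(name, set()).add(address)

    def deregister(self, name, address):
        self._services.get(name, set()).discard(address)

    def backends(self, name):
        return set(self._services.get(name, set()))


@contextmanager
def out_of_rotation(registry, name, address):
    """Remove a backend for the duration of an upgrade or maintenance."""
    registry.deregister(name, address)
    yield
    # Reached only if the body finished without raising.
    registry.register(name, address)
```

In this sketch a host whose upgrade raises an exception stays out of rotation, since re-registration runs only after the body completes; that is a design choice for the example, not something the slides specify.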

Slide 34

Slide 34 text

what’s up?

Slide 35

Slide 35 text

unique identifiers

env-level tagging

Slide 36

Slide 36 text

event = {
    'host': 'ip-10-45-136-116',
    'service': 'load-average',
    'metric': 0.7,
    'state': 'ok',
    'time': 1413551091.341055,
    'tags': ['dev', 'suggest-api', 'load'],
    'ttl': 10,
    'description': description,  # placeholder, as on the slide
}

event = {
    'host': 'ip-10-45-136-116',
    'service': 'load-average',
    'metric': 3.2,
    'state': 'warning',
    'time': 1413551176.852009,
    'tags': ['dev', 'suggest-api', 'load'],
    'ttl': 10,
    'description': description,  # placeholder, as on the slide
}
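
Events of this shape could be produced by a small helper that derives `state` from the metric. The thresholds below are assumptions chosen only so that 0.7 maps to ok and 3.2 to warning, as on the slide; `load_event` and `classify` are hypothetical names.

```python
import time


def classify(metric, warning_at, critical_at):
    """Map a numeric metric to a state string."""
    if metric >= critical_at:
        return "critical"
    if metric >= warning_at:
        return "warning"
    return "ok"


def load_event(host, metric, tags, warning_at=3.0, critical_at=8.0,
               ttl=10, now=None):
    """Build a load-average event dict; thresholds are assumed values."""
    return {
        "host": host,
        "service": "load-average",
        "metric": metric,
        "state": classify(metric, warning_at, critical_at),
        "time": time.time() if now is None else now,
        "tags": list(tags),
        "ttl": ttl,
        "description": f"load average is {metric}",
    }
```

Passing `now` explicitly makes the helper easy to test; leaving it out stamps the event with the current time.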

Slide 37

Slide 37 text

operational challenges and failures are a given

isolate and identify root causes

check logic belongs close to the thing monitored

push events; compute per grp/env + expectation
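
"Compute per grp/env + expectation" might look like the sketch below: group pushed events by environment tag, aggregate, and compare against an expected ceiling. The function names, the choice of the mean as the aggregate, and the idea of a per-environment maximum are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_by_env(events, envs=("build", "dev", "stg", "prd")):
    """Average the metric of all events carrying each environment tag."""
    grouped = defaultdict(list)
    for event in events:
        for tag in event["tags"]:
            if tag in envs:
                grouped[tag].append(event["metric"])
    return {env: mean(values) for env, values in grouped.items()}


def violations(events, expected_max):
    """Return environments whose mean exceeds the expected ceiling."""
    no_limit = float("inf")
    return {
        env: value
        for env, value in mean_by_env(events).items()
        if value > expected_max.get(env, no_limit)
    }
```

Environments without a configured expectation are never flagged in this sketch.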

Slide 38

Slide 38 text

Graphite

Slide 39

Slide 39 text

build dev stg prd

Slide 40

Slide 40 text

Test Metrics

System Metrics

Event Metrics

Slide 41

Slide 41 text

what does a green test really mean, anyway?

Slide 42

Slide 42 text

maybe the build is red because we ﬁxed all the bugs?

Slide 43

Slide 43 text

test coverage as actionable

becomes a problem of categorization

Slide 44

Slide 44 text

which machines are working harder?

do failures have a pattern?

Slide 45

Slide 45 text

how often does X happen?

logging, alerts: indicators

Slide 46

Slide 46 text

Lessons Learned and Future(?) Work

Lot of work; difficult tradeoff for low barrier to entry + robust system

Containers are nice, but ecosystem is still too immature

Correlating application, system, build metrics still manual
- maybe emit events from Jenkins -> Riemann -> Datomic
- Push button re-deploys of point-in-time environments

Historical performance metrics as automated regression testing

Automated security auditing, static code analysis, etc.
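
The slide lists historical performance metrics as regression testing only as future work. One simple way such a check might be framed is to compare a new measurement against the mean and spread of past runs; the function name and the three-sigma rule below are assumptions, not something from the talk.

```python
from statistics import mean, stdev


def is_regression(history, current, sigmas=3.0):
    """Flag `current` if it exceeds the historical mean by `sigmas` stdevs."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    threshold = mean(history) + sigmas * stdev(history)
    return current > threshold
```

With latencies of 100, 102, 98, 101, and 99 ms as history, a new run at 103 ms passes, while 120 ms is flagged.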

Slide 47

Slide 47 text

Questions? Tom Santero @tsantero The New York Times Company 8D