Slide 1

Continuous Deployment 2007
How we did deploys at Right Media, NYC
Nick Galbreath
[email protected]
2014-10-21

Slide 2

The Origins of Madness
• For the last few years, I was a poster boy for continuous deployment, in particular for security.
• Here's some context on how this came to be.
• It's unlikely you'll have anything like the environment at Right Media, and remember: there is no right way to do deployments or "devops".

Slide 3

Right Media
• Early online advertising pioneer in New York City
• Started doing 10k transactions a day in 2005
• Left when acquired by Yahoo in 2007
• Billion+ transactions a day
• 1200+ machines
• 3 datacenters

Slide 4

Online Advertising
• Micropayments at massive scale
• Extremely sensitive code. Slight changes can have big impact… days later.
• Publishers and advertisers get really cranky when their ads don't run, are displayed incorrectly, or blow out their budgets too quickly.
• Mistakes cost real money.

Slide 5

Team Structure
• The real-time adserver group (input: an HTTP request; output: links to an advertisement or something)
• A classical sysops group (network ops, building machines, etc.)
• An ops group that handled the higher-level functionality of moving data on and off the box
• (lots of other groups, but outside of this story)

Slide 6

Data Driven
• Online advertising is almost entirely data-driven.
• Meaning it's very difficult to determine whether a server is exploding due to poor code, poor algorithms/product, or bad data.
• This makes it very hard to write a 'run book' for problems, as it's very likely not directly an operations problem (maybe a customer lit up a new ad campaign that is causing CPU problems).

Slide 7

Responsibility
• This made the chain of responsibility very difficult.
• Something is exploding, but is it a…
• Bad machine? (sysops team)
• Bad data? (data team)
• Bad server? (server team)

Slide 8

For better or for worse
• It ended up being the server team's responsibility to rule out other issues and escalate.
• Made sense, as really no one else understood how the server worked.
• And the server was constantly undergoing changes.
• Unrealistic to ask anyone else to debug it.
• (also, one of those teams was remote)

Slide 9

Developers had to
• Test their own code
• (We did have a QA guy at one time, but that turned out not to be a useful abstraction step.)
• Push their own code
• Monitor their own code

Slide 10

Now it's called DevOps
We called it Survival
So dramatic!

Slide 11

Monitoring

Slide 12

Why is…
• The server team was often asked why something wasn't working. The existing operational tools didn't really cut it, so we had to find other ways to answer the questions.

Slide 13

Ganglia
• http://ganglia.sourceforge.net
• We used Ganglia for both operational metrics (CPU, etc.) and application metrics.
• And metrics specific to certain large customers (adserving is sort of a Platform-as-a-Service: an adserver might be handling requests for 1000s of publishers).

Slide 14

Application Metrics
• The C/C++ application would keep running tabs on various metrics internally (I recall a lot of atomic counters and whatnot).
• Every minute it would emit stats via UDP to an upstream server (sketched below).
• You could also ask for stats directly from the server.
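Here's a minimal sketch of the idea in Python (the real adserver was C/C++ with atomic counters, and the real payload was Ganglia's gmetric format, not the plain text used here); the collector address is hypothetical.

    # Conceptual sketch: in-process counters flushed to a collector via UDP
    # once a minute. Not the actual gmetric wire format.
    import socket
    import threading
    import time
    from collections import Counter

    COLLECTOR = ("metrics-host.example.com", 8649)  # hypothetical collector

    counters = Counter()
    lock = threading.Lock()

    def incr(name, value=1):
        """Bump a named counter from anywhere in the request path."""
        with lock:
            counters[name] += value

    def flush_loop():
        """Every minute, emit the current counters via UDP and reset them."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            time.sleep(60)
            with lock:
                snapshot = dict(counters)
                counters.clear()
            for name, value in snapshot.items():
                sock.sendto(f"{name}:{value}".encode(), COLLECTOR)

    threading.Thread(target=flush_loop, daemon=True).start()

    # e.g. in the request handler: incr("ad_requests")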

Slide 15

Embedded GMetric
• Small C library (with pure Python/Perl versions) that emits a single statistic via UDP to a server.
• https://code.google.com/p/embeddedgmetric/
• First commit in 2007!
• It's still around and I think it's used by many.

Slide 16

gmond
• Part of Ganglia
• Centrally collects and stores metrics for each server.
• You can query gmond directly for the latest stats of a single machine or of all machines (example below).
• XML output
• Single-threaded (like node.js). Very fast.
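A small sketch of what querying gmond for current cluster stats looks like, assuming the default setup where gmond serves its XML state dump to anyone who connects on TCP port 8649; the hostname below is hypothetical.

    import socket
    import xml.etree.ElementTree as ET

    def fetch_gmond_xml(host="gmond-host.example.com", port=8649):
        """Read the full XML dump that gmond sends on connect."""
        chunks = []
        with socket.create_connection((host, port)) as sock:
            while True:
                data = sock.recv(65536)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    def latest_stats(xml_bytes, metric_name):
        """Return {hostname: value} for one metric across all machines."""
        root = ET.fromstring(xml_bytes)
        stats = {}
        for host in root.iter("HOST"):
            for metric in host.iter("METRIC"):
                if metric.get("NAME") == metric_name:
                    stats[host.get("NAME")] = float(metric.get("VAL"))
        return stats

    # Example: user CPU for every machine in one call.
    # print(latest_stats(fetch_gmond_xml(), "cpu_user"))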

Slide 17

Before StatsD
• Does this sound similar to StatsD?
• You'd be correct! Very similar.
• Both used UDP. Both used a single-threaded server to collect statistics and dump to storage.
• Except our servers did some pre-aggregation on a per-minute basis.
• StatsD uses sampling.

Slide 18

Unlike StatsD
• One cool thing is that it was very easy to ask gmond for all current stats for one box or for all machines.
• No need to root around in Graphite files to get all stats.
• Really easy to do real-time displays.

Slide 19

Large Scale Ganglia UI
• Ganglia isn't very good at displaying data for many machines at the same time.
• We wrote a number of tools that would display a quasi-heat-map of statistics.
• Each machine was a box, coloured according to the value of a stat.
• Super easy to find outliers and boxes that were on fire.
• Sorry, no pictures :-( (but a rough sketch of the idea is below)
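A toy reconstruction of the idea, not the actual tool: render each machine as a coloured cell in a terminal grid, with the colour driven by a single stat, so the box that's on fire jumps out visually. The ANSI colours and thresholds are assumptions.

    # Toy heat-map: one ANSI-coloured cell per machine, colour keyed to a stat.
    def heat_map(stats, cols=8, warn=60.0, crit=80.0):
        """stats: {hostname: value}. Print a grid of coloured boxes."""
        GREEN, YELLOW, RED, RESET = "\033[42m", "\033[43m", "\033[41m", "\033[0m"
        cells = []
        for host, value in sorted(stats.items()):
            colour = GREEN if value < warn else YELLOW if value < crit else RED
            cells.append(f"{colour} {host[-3:]} {RESET}")
        for i in range(0, len(cells), cols):
            print(" ".join(cells[i:i + cols]))

    # Made-up numbers: ad007 shows up as a red box.
    demo = {f"ad{i:03d}": 35.0 + i for i in range(1, 7)}
    demo["ad007"] = 95.0
    heat_map(demo)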

Slide 20

Oddly
• We didn't use historical graphs that often.
• Anything more than a few days old was obsolete and not useful.
• That's how fast things were changing.

Slide 21

Monitor Everything
• Even with a 1000+ machine cluster, a single machine could still cause enormous problems (obsolete adserving rules, etc.).
• One machine in the cluster (our own hardware) "lost" 2G of memory one day (the machine just reported less memory) and chaos followed.
• So we had alerts on the fixed amount of memory. This is the level of detail we had to deal with (example below).
• Infrastructure assertions
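A sketch of what such an "infrastructure assertion" might look like: check that the kernel reports the amount of physical memory the box was built with, and alert if not. The expected size and the alert action are placeholders, not the actual Right Media check.

    # Infrastructure assertion: alert if a box reports less memory than expected.
    EXPECTED_MEM_KB = 16 * 1024 * 1024  # hypothetical: a box built with 16 GB

    def mem_total_kb(path="/proc/meminfo"):
        """Read MemTotal as reported by the kernel."""
        with open(path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1])
        raise RuntimeError("MemTotal not found")

    def check_memory():
        actual = mem_total_kb()
        if actual < EXPECTED_MEM_KB:
            # In practice this would page someone or turn the box red on a dashboard.
            print(f"ALERT: box reports {actual} kB, expected {EXPECTED_MEM_KB} kB")

    check_memory()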

Slide 22

Deployment

Slide 23

Little Changes, Big Impacts
• Sometimes a mistake would show up right away.
• But other times it was subtle and might take a full day (or more!) to be noticed (long data feedback loop, reporting, etc.).

Slide 24

Only Small Deploys
• Given this, we learned, the hard way, to make only little changes, or isolated changes, to production.
• There was no other way to debug.
• With big changes it was completely impossible to understand what happened.
• Rollback would cause features to go away, making customers insane, and cause other problems.

Slide 25

Deployment Rate
• We deployed approximately once a day, sometimes more, sometimes less, depending on feature and customer demands.
• Normally during "low tide" (low amount of traffic).
• Normally only one feature or change.
• Pushing more often wasn't helpful, as it took up to a day to see the effects of a bad deploy.

Slide 26

Coordination
• Free to deploy as often as needed
• Again, normally during low tide to minimize operational impact
• Little to no coordination with the ops and data teams

Slide 27

Large Unit Test Framework
• This was partially because it was C/C++, where every line could cause a big disaster. Actually, every character.
• Code coverage was crucial.
• But also as a way of self-documenting the code.
• A large number of "features" were undocumented, and which customers used them was unknown. We had to unit test just to make sure we didn't change anything.

Slide 28

Source Control
• I think it was SVN.
• There were no branches.
• Everyone committed to mainline.
• WHY?

Slide 29

No Branches?
• If your code wasn't in production within 3 days, it was very likely it would never go live.
• The best projects were ones that could be completed and pushed in the same day.
• Customer priorities were so fast and furious that you would be re-tasked onto another project and your code would be orphaned, resulting in a 100% waste of time.
• No doubt later there would be another fire and that project would be needed again, but that might not be for a while.
• Maintaining branches and constantly merging was a complete waste of time. Get. It. To. Prod. Now.

Slide 30

Push it.
• If re-tasked, it's much better to get your old code to prod using poor man's feature flags (sketched below):
• #ifdef or if (0) out anything critical
• Let it run through all code quality checks
• Push it to prod, where it won't actually cause problems.
• Anyone can now review it easily; no need to merge or branch, since it's in mainline.
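The same "park it behind a dead flag" idea, sketched in Python for brevity; in the original C/C++ codebase the guard was literally an #ifdef or an if (0) around the unfinished code path. The names here are made up.

    # Poor man's feature flag: unfinished code ships to prod in mainline,
    # gets built, tested, and reviewed, but never executes.
    ENABLE_NEW_BIDDER = False  # flip when the feature is actually ready

    def current_bidder(request):
        return "ad-from-current-path"

    def new_bidder(request):
        return "ad-from-new-path"  # the parked, unfinished work

    def handle_request(request):
        if ENABLE_NEW_BIDDER:
            return new_bidder(request)
        return current_bidder(request)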

Slide 31

Continuous Integration
• We had some automatic CI process; it was perhaps CruiseControl http://cruisecontrol.sourceforge.net
• It built a debug and a production version
• Code coverage reports using lcov http://ltp.sourceforge.net/coverage/lcov.php
• And automatic documentation via doxygen http://www.stack.nl/~dimitri/doxygen/

Slide 32

Security
• Did we have any security problems?
• Yes we did.
• But with our system it was easy to push a fix in the same day.

Slide 33

Nuts and Bolts
• The deployment team did all builds, packaging, and deployment.
• It was a single binary, so not that complicated.
• Not sure how the bits got moved to all machines.
• But our upgrade process was command-line based.
• Here's how it worked:

Slide 34

Ramp up 1%
• I seem to recall something that turned off the health check so the load balancer would remove the machine.
• One machine was upgraded and put back in rotation.
• And then gmond was queried for stats on the cluster.
• A script then looked for outliers, to see if the box acted differently than any of the others (CPU, various other stats).
• This had some hot text graphics with bar graphs.

Slide 35

Ramp up 10-100%
• Once one machine didn't explode, the upgrade continued to 10% of the cluster.
• Same statistical check as before (sketched below).
• Nothing too fancy: it just looked to see if some machines were above or below some number of standard deviations.
• If we got scared, we could undo.
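A rough sketch of that statistical check, using the per-metric cluster stats you can pull from gmond (as in the earlier example): flag any machine more than some number of standard deviations from the cluster mean. The threshold and the numbers are illustrative, not the actual values used.

    # Flag machines whose metric is far from the cluster mean.
    from statistics import mean, stdev

    def outliers(stats, k=2.0):
        """stats: {hostname: value}. Return hosts beyond k standard deviations."""
        values = list(stats.values())
        if len(values) < 2:
            return []
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            return []
        return [host for host, v in stats.items() if abs(v - mu) > k * sigma]

    # Made-up numbers: one box running hot after the upgrade.
    cluster_cpu = {"ad001": 40.0, "ad002": 41.0, "ad003": 42.0, "ad004": 43.0,
                   "ad005": 44.0, "ad006": 45.0, "ad007": 95.0}
    print(outliers(cluster_cpu))  # flags 'ad007'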

Slide 36

And then we watched graphs.

Slide 37

Big thanks to all at Right Media.
Vince Tse https://github.com/thelazyenginerd
Andy Chung https://twitter.com/whereandy
And special thanks to my partners in crime:
Kai Ju - MIA :-(
