Continuous Deployment 2007

Continuous Deployment 2007: How we did deploys at Right Media,
NYC. Nick Galbreath, 2014-10-21

The Origins of Madness • The last few years, I
was a poster boy for continuous deployment, in particular, for security. • Here's some context on how this came to be • It's unlikely you'll have anything like the environment at Right Media, and remember there is no right way to do deployments or "devops".

Right Media • Early online advertising pioneer in New York
City • Started doing 10k transactions a day in 2005 • Left when acquired by Yahoo in 2007 • Billion+ transactions a day • 1200+ machines • 3 datacenters

Online Advertising • Micropayments at massive scale • Extremely sensitive
code. Slight changes can have a big impact… days later. • Publishers and advertisers get really cranky when their ads don't run, are displayed incorrectly, or blow out their budgets too quickly • Mistakes cost real money

Team Structure • The real-time adserver group (input http request,
output links to an advertisement or something) • A classical sysops group (network ops, building machines, etc.) • An ops group that handled the higher-level work of moving data on and off the boxes. • (lots of other groups, but outside of this story)

Data Driven • Online advertising is almost entirely data-driven. •
Meaning it's very difficult to determine whether a server is exploding due to poor code, poor algorithms/product, or bad data. • This makes it very hard to write a 'run book' for problems, as it's very likely not directly an operations problem (maybe a customer lit up a new ad campaign that is causing CPU problems).

Responsibility • This made the chain of responsibility very difficult. •
Something is exploding, but is it a… • Bad machine? (sysops team) • Bad data? (data team) • Bad server? (server team)

For better or for worse • It ended up being
the server team's responsibility to rule out other issues, and escalate. • Made sense as really no one else understood how the server worked. • And the server was constantly undergoing changes. • Unrealistic to ask anyone else to debug it. • (also one of those teams was remote)

Developers had to • Test their own code • We
did have a QA guy at one time, but that turned out not to be a useful abstraction step. • Push their own code • Monitor their own code

Now it's called DevOps. We called it Survival. So dramatic!

Monitoring

Why is… • The server team was often asked why
something wasn't working. The existing operational tools didn't really cut it, so we had to find other ways to answer the questions.

Ganglia • http://ganglia.sourceforge.net • We used Ganglia for both operational
metrics (CPU, etc.) and application metrics. • And metrics specific to certain large customers (adserving is sort of a Platform-as-a-Service: an adserver might be handling requests for 1000s of publishers).

Application Metrics • The C/C++ application would keep running tabs
on various metrics internally (I recall a lot of atomic counters and whatnot) • Every minute it would emit stats via UDP to an upstream server. • You could also ask for stats directly from the server
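A minimal Python sketch of the pattern on this slide: keep counters in process, then flush one snapshot per interval over UDP. The original was C/C++ with atomic counters; the class name `MinuteStats`, the `name=value` text payload, and the collector address are all assumptions for illustration, not Right Media's actual wire format.

```python
import socket
import threading
from collections import Counter


class MinuteStats:
    """In-process counters that are flushed as one pre-aggregated snapshot."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for C/C++ atomic counters
        self._counts = Counter()

    def incr(self, name, amount=1):
        with self._lock:
            self._counts[name] += amount

    def snapshot_and_reset(self):
        """Return the counts gathered so far and start a fresh interval."""
        with self._lock:
            snap = dict(self._counts)
            self._counts.clear()
        return snap


def encode(snapshot):
    # Hypothetical text format: one "name=value" per line, sorted for stability.
    return "\n".join(f"{k}={v}" for k, v in sorted(snapshot.items())).encode()


def flush(stats, addr=("127.0.0.1", 9999)):
    """Send one datagram holding the whole interval (fire-and-forget UDP)."""
    payload = encode(stats.snapshot_and_reset())
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, addr)
    return payload
```

A timer or background thread would call `flush(stats)` every 60 seconds; the request path only ever calls `incr`.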

Embedded GMetric • Small C (and pure python/perl versions) library
that emits a single statistic via UDP to a server. • https://code.google.com/p/embeddedgmetric/ • First commit in 2007! • It's still around and I think is used by many.

gmond • part of ganglia • Centrally collects and stores
metrics for each server. • You could query gmond directly for the latest stats of a single machine or all machines. • XML output • Single-threaded (like node.js). Very fast.
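As a rough illustration of querying gmond directly, here is a Python sketch that reads the XML dump and flattens it into a dictionary. It assumes gmond's default TCP accept port (8649) and the usual `HOST` / `METRIC` elements with `NAME` and `VAL` attributes; a real deployment may be configured differently, and the function names are made up for this example.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump that gmond writes when a TCP client connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def parse_metrics(xml_data):
    """Return {host_name: {metric_name: value_string}} from a gmond dump."""
    root = ET.fromstring(xml_data)
    result = {}
    for host in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        result[host.get("NAME")] = metrics
    return result
```

Something like `parse_metrics(fetch_gmond_xml("gmond.example.internal"))` (hostname hypothetical) would then give the current stats for every machine gmond knows about in one call.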

Before StatsD • Does this sound similar to StatsD? •
You'd be correct! Very similar. • Both used UDP. Both used a single-threaded server to collect statistics and dump to storage. • Except our servers did some pre-aggregation on a per-minute basis. • StatsD uses sampling.
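To make the pre-aggregation versus sampling difference concrete, here is a small Python sketch. The `name:1|c|@rate` lines follow the StatsD counter format; the `name=value` pre-aggregated line is a made-up stand-in for whatever the adserver actually sent.

```python
import random


def statsd_sampled_lines(name, events, rate, rng):
    """Emit one StatsD-style counter line per sampled event."""
    return [f"{name}:1|c|@{rate}" for _ in range(events) if rng.random() < rate]


def estimate_from_samples(lines):
    """Scale each sampled increment by 1/rate to estimate the true count."""
    total = 0.0
    for line in lines:
        value_part, _type, rate_part = line.split("|")
        value = float(value_part.split(":")[1])
        total += value / float(rate_part.lstrip("@"))
    return total


def preaggregated_line(name, events):
    """One exact datagram per interval (hypothetical format)."""
    return f"{name}={events}"
```

Sampling sends fewer packets per event but yields an estimate; per-minute pre-aggregation sends one packet per interval and the count is exact.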

Unlike StatsD • One cool thing is that it was
very easy to ask gmond for all current stats for a box or all machines. • Don't need to root around in graphite files to get all stats. • Really easy to do real-time displays.

Large Scale Ganglia UI • Ganglia isn't very good for
displaying data for many machines at the same time • We wrote a number of tools that would display a quasi-heat-map of statistics. • Each machine was a box, coloured according to the value of a stat • Super easy to find outliers and boxes that were on fire. • Sorry, no pictures :-(
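Since there are no pictures, here is a text-only Python sketch of the idea: one cell per machine, shaded by where its value falls between the cluster minimum and maximum. The character ramp and function names are assumptions; the original tools are not described in any more detail than the bullets above.

```python
def heat_cell(value, low, high):
    """Map a value onto a 5-step text 'colour' ramp from cool to hot."""
    ramp = ".:+*#"
    if high <= low:
        return ramp[0]
    frac = (value - low) / (high - low)
    frac = min(1.0, max(0.0, frac))
    return ramp[min(len(ramp) - 1, int(frac * len(ramp)))]


def heat_map(stats, width=8):
    """Render {host: value} as rows of cells, one cell per machine."""
    hosts = sorted(stats)
    low, high = min(stats.values()), max(stats.values())
    cells = [heat_cell(stats[h], low, high) for h in hosts]
    rows = ["".join(cells[i:i + width]) for i in range(0, len(cells), width)]
    return "\n".join(rows)
```

With 15 quiet machines and one hot one, the hot box shows up as a single `#` in a field of dots, which is the "super easy to find outliers" effect.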

Oddly • We didn't use historical graphs that often •
Something more than a few days old was obsolete and not useful. • That's how fast things were changing.

Monitor Everything • Even with a 1000+ machine cluster, a single
machine could still cause enormous problems (obsolete adserving rules, etc.) • One machine in the cluster (our own hardware) "lost" 2G of memory one day (the machine just reported less memory) and chaos followed. • So we had alerts on fixed memory. This is the level of detail we had to deal with. • Infrastructure assertions
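An "infrastructure assertion" of this kind can be sketched in a few lines of Python: compare what each machine reports against what it is supposed to have. The function name, the kB units, and the alert text are assumptions for illustration.

```python
def check_fixed_memory(reported_kb, expected_kb, tolerance_kb=0):
    """Return alert strings for hosts whose total memory is not as expected."""
    alerts = []
    for host in sorted(reported_kb):
        missing = expected_kb - reported_kb[host]
        if abs(missing) > tolerance_kb:
            alerts.append(f"{host}: mem_total off by {missing} kB")
    return alerts
```

Fed with per-host total-memory readings from the metrics system, this would have flagged the box that quietly dropped 2G.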

Deployment

Little Changes, Big Impacts • Sometimes a mistake would show
up right away • But other times it was subtle and might take a full day (or more!) to be noticed (long data feedback loop, reporting, etc).

Only Small Deploys • Given this, we learned, the hard
way, to make only small or isolated changes to production. • There was no other way to debug. • With big changes it was completely impossible to understand what happened. • Rollback would cause features to go away, making customers insane, and cause other problems.

Deployment Rate • We deployed approximately once a day, sometimes
more, sometimes less, depending on feature and customer demands. • Normally during "low tide" (low amount of traffic) • Normally only one feature or change. • Pushing more often wasn't helpful, as it took up to a day to see the effects of a bad deploy.

Coordination • Free to deploy as often as needed •
Again, normally during low tide to minimize operational impact • Little to no coordination with ops and data teams

Large Unit Test Framework • This was partly because it was
C/C++, where every line could cause a big disaster. Actually every character. • Code coverage was crucial. • But also as a way of self-documenting the code. • A large number of "features" were undocumented and the customers using them were unknown. We had to unit test just to make sure we didn't change anything.

Source Control • I think it was SVN • There
were no branches. • Everyone committed to mainline. • WHY?

No Branches? • If your code wasn't in production within
3 days, it was very likely it would never go live. • The best projects were ones that could be completed and pushed in the same day. • Customer priorities were so fast and furious that you would be re-tasked onto another project and your code would be orphaned, resulting in a 100% waste of time. • No doubt there would later be another fire and that project would be needed again, but that might not be for a while. • Maintaining branches and constantly merging was a complete waste of time. Get. It. To. Prod. Now.

Push it. • If re-tasked, it's much better to get
your old code to prod using poor-man's Feature Flags: • #ifdef or if (0) out anything critical • Let it run through all code quality checks • Push it to prod, where it won't actually cause problems. • Anyone can now review it easily, no need to merge or branch, since it's in mainline.
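The original trick was a C/C++ `#ifdef` or `if (0)`; the same shape in Python is a constant guard. Everything named here (`ENABLE_NEW_PACING`, the pacing functions) is hypothetical and only illustrates code that ships dark in mainline.

```python
# Hypothetical compile-time style switch; the original used #ifdef or if (0).
ENABLE_NEW_PACING = False


def legacy_pacing(budget, hours_left):
    """Existing behaviour: spread the remaining budget evenly."""
    return budget / max(hours_left, 1)


def new_pacing(budget, hours_left):
    """Unfinished work that ships dark: in mainline, tested, never called."""
    return 2 * budget / max(hours_left, 1)


def hourly_spend(budget, hours_left):
    if ENABLE_NEW_PACING:  # the "if (0)" guard
        return new_pacing(budget, hours_left)
    return legacy_pacing(budget, hours_left)
```

The unfinished function still gets compiled, linted, covered by tests, and reviewed, but production traffic never reaches it until the flag flips.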

Continuous Integration • We had some automatic CI process, it
was perhaps CruiseControl http://cruisecontrol.sourceforge.net • It built a debug and production version • code coverage reports using lcov  http://ltp.sourceforge.net/coverage/lcov.php • and automatic documentation via doxygen   http://www.stack.nl/~dimitri/doxygen/

Security • Did we have any security problems? • Yes
we did • But with our system it was easy to push a fix in the same day.

Nuts and Bolts • The deployment team did all builds,
packaging and deployment. • It was a single binary so not that complicated • Not sure how the bits got moved to all machines. • But our upgrade process was command line based. • Here's how it worked:

Ramp up 1% • I seem to recall something that
turned off the health check and the load balancer would remove the machine. • One machine was upgraded and put back in rotation. • Then gmond was queried for stats on the cluster • A script then looked for outliers to see if the box acted differently from the others (CPU, various other stats) • This had hot text graphics with bar graphs
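A sketch of what such a canary check might look like in Python: compare the upgraded box against the mean and standard deviation of the rest of the cluster, and print a text bar per machine. The 3-sigma threshold, the function names, and the bar layout are assumptions, not a reconstruction of the real script.

```python
from statistics import mean, pstdev


def canary_is_outlier(canary_value, cluster_values, max_sigmas=3.0):
    """True if the upgraded box sits too far from the rest of the cluster."""
    mu = mean(cluster_values)
    sigma = pstdev(cluster_values)
    if sigma == 0:
        return canary_value != mu
    return abs(canary_value - mu) / sigma > max_sigmas


def bar(label, value, scale=100.0, width=40):
    """One line of a text bar graph."""
    filled = int(round(width * min(value, scale) / scale))
    return f"{label:<8}|{'#' * filled:<{width}}| {value:.1f}"
```

Here `cluster_values` would be the same stat (CPU, say) from the machines that have not been upgraded, so the canary is judged against its untouched peers.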

Ramp up 10-100% • Once one machine didn't explode the
upgrade continued to 10% of the cluster. • Same statistical check as before • Nothing too fancy, it just looked to see if some machines were above or below some standard deviation threshold. • If we got scared we could undo.
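The ramp-up described on these two slides can be sketched as a loop over widening stages with a statistical check after each one. This is an assumption-laden outline: the stage fractions come from the slide titles, while the 2-sigma threshold and the `upgrade`, `rollback`, and `read_stats` callables are hypothetical stand-ins for the real command-line tooling.

```python
from statistics import mean, pstdev


def outliers(stats, max_sigmas=2.0):
    """Hosts whose stat is more than max_sigmas deviations from the mean."""
    mu, sigma = mean(stats.values()), pstdev(stats.values())
    if sigma == 0:
        return []
    return sorted(h for h, v in stats.items() if abs(v - mu) / sigma > max_sigmas)


def staged_rollout(hosts, upgrade, rollback, read_stats, stages=(0.01, 0.10, 1.0)):
    """Upgrade in widening stages; undo everything if any stage looks odd."""
    done = []
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(done):target]:
            upgrade(host)
            done.append(host)
        bad = outliers(read_stats())  # e.g. per-host CPU pulled from gmond
        if bad:
            for host in reversed(done):
                rollback(host)
            return False, bad
    return True, []
```

A check like this only notices machines that differ from their peers, so it is most useful in the early stages; once the whole cluster is upgraded, watching the graphs is what remains.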

And then we watch graphs.

Big thanks to all at Right Media. Vince Tse https://github.com/thelazyenginerd
Andy Chung https://twitter.com/whereandy And Special Thanks to my partners in crime: Kai Ju - MIA :-(
