
Continuous Deployment 2007

How we did Continuous Deployment and Monitoring at Right Media in 2007

Nick Galbreath

October 21, 2014

Transcript

  1. The Origins of Madness • The last few years, I was a poster boy for continuous deployment, in particular, for security. • Here's some context on how this came to be. • It's unlikely you'll have anything like the environment at Right Media, and remember there is no right way to do deployments or "devops".
  2. Right Media • Early online advertising pioneer in New York City • Started doing 10k transactions a day in 2005 • Left when acquired by Yahoo in 2007 • Billion+ transactions a day • 1200+ machines • 3 datacenters
  3. Online Advertising • Micropayments at massive scale • Extremely sensitive code. Slight changes can have a big impact… days later. • Publishers and advertisers get really cranky when their ads don't run, are displayed incorrectly, or blow out their budgets too quickly • Mistakes cost real money
  4. Team Structure • The real-time adserver group (input: an HTTP request, output: links to an advertisement or something) • A classical sysops group (network ops, building machines, etc.) • An ops group that did higher-level functionality of moving data on and off the box. • (lots of other groups, but outside of this story)
  5. Data Driven • Online advertising is almost entirely data-driven. • Meaning it's very difficult to determine whether a server is exploding due to poor code, a poor algorithm/product, or bad data. • This makes it very hard to write a 'run book' for problems, as it's very likely not directly an operations problem (maybe a customer lit up a new ad campaign that is causing CPU problems).
  6. Responsibility • This made the chain of responsibility very difficult. • Something is exploding, but is it a… • Bad machine? (sysops team) • Bad data? (data team) • Bad server? (server team)
  7. For better or for worse • It ended up being the server team's responsibility to rule out other issues, and escalate. • Made sense, as really no one else understood how the server worked. • And the server was constantly undergoing changes. • Unrealistic to ask anyone else to debug it. • (also, one of those teams was remote)
  8. Developers had to • Test their own code • We did have a QA guy at one time, but it turned out not to be a useful abstraction step. • Push their own code • Monitor their own code
  9. Why is… • The server team was often asked why something wasn't working. The existing operational tools didn't really cut it, so we had to find other ways to answer the questions.
  10. Ganglia • http://ganglia.sourceforge.net • We used Ganglia for both operational metrics (CPU, etc.) and application metrics. • And metrics specific to certain large customers (adserving is sort of a Platform-as-a-Service: an adserver might be handling requests for 1000s of publishers).
  11. Application Metrics • The C/C++ application would keep running tabs on various metrics internally (I recall a lot of atomic counters and whatnot; see the sketch below) • Every minute it would emit stats via UDP to an upstream server. • You could also ask for stats directly from the server
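A minimal sketch of that pattern, assuming C11 atomics (the 2007 code predates C11 and would have used compiler builtins); the metric names and functions are illustrative, not the original Right Media code:

```c
/* Minimal sketch of in-process application metrics: a few atomic counters
 * bumped on the hot path, snapshotted and reset once a minute.
 * Illustrative only; uses C11 <stdatomic.h>. */
#include <stdatomic.h>

static atomic_ulong requests_served;
static atomic_ulong ads_returned;
static atomic_ulong errors;

/* Called on the request hot path. */
void count_request(int served_ad, int had_error) {
    atomic_fetch_add(&requests_served, 1);
    if (served_ad) atomic_fetch_add(&ads_returned, 1);
    if (had_error) atomic_fetch_add(&errors, 1);
}

/* Called once a minute by a background thread: snapshot and reset, then
 * hand the values to whatever emits them upstream over UDP. */
void flush_metrics(void (*emit)(const char *name, unsigned long value)) {
    emit("requests_per_min", atomic_exchange(&requests_served, 0));
    emit("ads_per_min",      atomic_exchange(&ads_returned, 0));
    emit("errors_per_min",   atomic_exchange(&errors, 0));
}
```

Resetting on each flush is what gives the per-minute pre-aggregation mentioned a few slides later.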
  12. Embedded GMetric • Small C (and pure Python/Perl versions) library that emits a single statistic via UDP to a server. • https://code.google.com/p/embeddedgmetric/ • First commit in 2007! • It's still around and I think it's used by many. • (the idea is sketched below)
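To show only the shape of the idea, not the library's actual API or wire format (the real embeddedgmetric encodes Ganglia's XDR protocol), here is a fire-and-forget UDP send with a made-up plain-text payload:

```c
/* Sketch of the core idea only: fire one metric at a collector over UDP
 * and forget about it. The plain-text payload is purely illustrative. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_stat_udp(const char *host, int port,
                  const char *name, unsigned long value) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1) { close(fd); return -1; }

    char buf[128];
    int n = snprintf(buf, sizeof(buf), "%s:%lu", name, value);

    /* UDP: no connection, no retry, nothing blocking the request path. */
    ssize_t sent = sendto(fd, buf, (size_t)n, 0,
                          (struct sockaddr *)&addr, sizeof(addr));
    close(fd);
    return sent < 0 ? -1 : 0;
}
```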
  13. gmond • Part of Ganglia • Centrally collects and stores metrics for each server. • You could query gmond directly for the latest stats of a single machine or all machines (a query sketch follows below). • XML output • Single-threaded (like node.js), very fast
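A sketch of such a query, assuming gmond's usual TCP accept channel (commonly port 8649), which dumps its current metric state as XML to anyone who connects; parsing is left out:

```c
/* Sketch: connect to gmond's TCP accept channel, read the XML dump of all
 * current metrics, and print it. A real tool would parse the XML. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int dump_gmond_xml(const char *host, int port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    /* gmond writes the full <GANGLIA_XML> document, then closes. */
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}
```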
  14. Before StatsD • Does this sound similar to StatsD? • You'd be correct! Very similar. • Both used UDP. Both used a single-threaded server to collect statistics and dump them to storage. • Except our servers did some pre-aggregation on a per-minute basis. • StatsD uses sampling.
  15. Unlike StatsD • One cool thing is that it was very easy to ask gmond for all current stats for a box or all machines. • No need to root around in Graphite files to get all stats. • Really easy to do real-time displays.
  16. Large Scale Ganglia UI • Ganglia isn't very good at displaying data for many machines at the same time • We wrote a number of tools that would display a quasi-heat-map of statistics (see the coloring sketch below). • Each machine was a box, coloured according to the value of a stat • Super easy to find outliers and boxes that were on fire. • Sorry, no pictures :-(
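A toy version of the coloring idea, with a made-up green-to-red ramp; the real tools and thresholds are not described beyond what the slide says:

```c
/* Sketch of the heat-map idea: normalize a stat against a range and map it
 * to a green-to-red color, one colored box per machine. */
#include <stdio.h>

/* Returns a 0xRRGGBB color: low values green, high values red. */
unsigned int heat_color(double value, double lo, double hi) {
    double t = (hi > lo) ? (value - lo) / (hi - lo) : 0.0;
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    unsigned int r = (unsigned int)(255.0 * t);
    unsigned int g = (unsigned int)(255.0 * (1.0 - t));
    return (r << 16) | (g << 8);
}

int main(void) {
    /* e.g. per-machine CPU percentages pulled from gmond */
    double cpu[] = {22.0, 25.0, 24.0, 91.0, 23.0};
    for (int i = 0; i < 5; i++)
        printf("machine %d -> #%06x\n", i, heat_color(cpu[i], 0.0, 100.0));
    return 0;
}
```

The outlier (91% CPU) immediately shows up as the one red box in a wall of green.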
  17. Oddly • We didn't use historical graphs that often • Anything more than a few days old was obsolete and not useful. • That's how fast things were changing.
  18. Monitor Everything • Even with a 1000+ machine cluster, a single machine could still cause enormous problems (obsolete adserving rules, etc.) • One machine in the cluster (our own hardware) "lost" 2G of memory one day (the machine just reported less memory) and chaos followed. • So we had alerts on fixed memory. This is the level of detail we had to deal with. • Infrastructure assertions (a sketch follows below)
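A minimal sketch of an "infrastructure assertion" in that spirit: check that the machine still reports the memory it is supposed to have. The expected size, the Linux /proc/meminfo source, and the alert action are assumptions for illustration:

```c
/* Sketch of an "infrastructure assertion": alert if the machine reports
 * less physical memory than we expect it to have. */
#include <stdio.h>
#include <string.h>

#define EXPECTED_MEM_KB (16UL * 1024 * 1024)  /* e.g. a 16 GB box */

int check_mem_total(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;

    unsigned long mem_kb = 0;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "MemTotal: %lu kB", &mem_kb) == 1)
            break;
    }
    fclose(f);

    if (mem_kb < EXPECTED_MEM_KB) {
        /* In practice this would feed an alerting system, not stderr. */
        fprintf(stderr, "ALERT: MemTotal is %lu kB, expected %lu kB\n",
                mem_kb, EXPECTED_MEM_KB);
        return 1;
    }
    return 0;
}
```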
  19. Little Changes, Big Impacts • Sometimes a mistake would show up right away • But other times it was subtle and might take a full day (or more!) to be noticed (long data feedback loop, reporting, etc.).
  20. Only Small Deploys • Given this, we learned, the hard way, to make only little, isolated changes to production. • There was no other way to debug. • With big changes it was completely impossible to understand what happened. • Rollback would cause features to go away, making customers insane, and caused other problems.
  21. Deployment Rate • We deployed approximately once a day, sometimes more, sometimes less, depending on feature and customer demands. • Normally during "low tide" (a low amount of traffic) • Normally only one feature or change. • Pushing more often wasn't helpful, as it took up to a day to see the effects of a bad deploy.
  22. Coordination • Free to deploy as often as needed • Again, normally during low tide to minimize operational impact • Little to no coordination with the ops and data teams
  23. Large Unit Test Framework • This was partially because it was C/C++, where every line could cause a big disaster. Actually, every character. • Code coverage was crucial. • But also as a way of self-documenting the code. • A large number of "features" were undocumented and the customers using them were unknown. We had to unit test just to make sure we didn't change anything (see the sketch below).
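A sketch of what such a "don't change anything" (characterization) test might look like in plain C; the helper function and the ad sizes are hypothetical, not from the original code base:

```c
/* Characterization test: pin down the observed behavior of an undocumented
 * feature so that any change breaks the build. Names are hypothetical. */
#include <assert.h>
#include <string.h>

/* Imagine this is an old, undocumented helper some customer depends on. */
static const char *legacy_size_bucket(int width, int height) {
    if (width == 300 && height == 250) return "medrect";
    if (width == 728 && height == 90)  return "leaderboard";
    return "other";
}

int main(void) {
    /* Lock in current behavior, including the odd fallback string. */
    assert(strcmp(legacy_size_bucket(300, 250), "medrect") == 0);
    assert(strcmp(legacy_size_bucket(728, 90), "leaderboard") == 0);
    assert(strcmp(legacy_size_bucket(1, 1), "other") == 0);
    return 0;
}
```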
  24. Source Control • I think it was SVN • There were no branches. • Everyone committed to mainline. • WHY?
  25. No Branches? • If your code wasn't in production within 3 days, it was very likely it would never go live. • The best projects were ones that could be completed and pushed in the same day. • Customer priorities were so fast and furious, you would be re-tasked onto another project and your code would be orphaned, resulting in a 100% waste of time. • No doubt later there would be another fire and that project would be needed again, but that might not be for a while. • Maintaining branches and constantly merging was a complete waste of time. Get. It. To. Prod. Now.
  26. Push it. • If re-tasked, it's much better to get your old code to prod using poor-man's Feature Flags (see the sketch below): • #ifdef or if (0) out anything critical • Let it run through all code quality checks • Push it to prod, where it won't actually cause problems. • Anyone can now review it easily; no need to merge or branch, since it's in mainline.
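A minimal sketch of both styles mentioned on the slide; the flag name and the bidding function are hypothetical:

```c
/* "Poor-man's feature flag": the new code ships to production compiled out
 * (or compiled in but unreachable), so it still goes through review, builds,
 * and coverage without changing behavior. NEW_BIDDING_RULE is hypothetical. */
#include <stdio.h>

/* #define NEW_BIDDING_RULE 1   <- flip on later, when the project resumes */

static double score_ad(double base_bid) {
#ifdef NEW_BIDDING_RULE
    /* Unfinished work, compiled out of production builds. */
    return base_bid * 1.10;
#else
    if (0) {
        /* Alternative style: keeps compiling (and stays visible to
         * reviewers in mainline) but is unreachable at runtime. */
        printf("new rule would have run here\n");
    }
    return base_bid;
#endif
}

int main(void) {
    printf("score = %.2f\n", score_ad(1.00));
    return 0;
}
```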
  27. Continuous Integration • We had some automatic CI process; it was perhaps CruiseControl http://cruisecontrol.sourceforge.net • It built a debug and a production version • Code coverage reports using lcov http://ltp.sourceforge.net/coverage/lcov.php • And automatic documentation via doxygen http://www.stack.nl/~dimitri/doxygen/
  28. Security • Did we have any security problems? • Yes, we did • But with our system it was easy to push a fix the same day.
  29. Nuts and Bolts • The deployment team did all builds, packaging, and deployment. • It was a single binary, so not that complicated • Not sure how the bits got moved to all machines. • But our upgrade process was command-line based. • Here's how it worked:
  30. Ramp up 1% • I seem to recall something that turned off the health check so the load balancer would remove the machine (one common approach is sketched below). • One machine was upgraded and put back in rotation. • And then gmond was queried for stats on the cluster • A script then looked for outliers to see if the box acted differently than any others (CPU, various other stats) • This had live text graphics with bar graphs
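The slide only says "something" turned off the health check; one common way to get that effect is a sentinel file that the health-check handler looks for, so the load balancer drains the box and it can be upgraded, then re-enabled. The file path and handler below are hypothetical:

```c
/* Sketch of a maintenance sentinel: touch the file and the LB's health probe
 * sees the box as down and drains it; remove the file after the upgrade to
 * go back into rotation. */
#include <stdio.h>
#include <unistd.h>

#define MAINT_FILE "/var/run/adserver.maintenance"

/* Called by the LB's health probe: 1 = healthy, 0 = take me out of rotation. */
int health_check(void) {
    if (access(MAINT_FILE, F_OK) == 0)
        return 0;   /* sentinel present: report unhealthy on purpose */
    return 1;       /* normal operation */
}

int main(void) {
    puts(health_check() ? "OK" : "DRAINING");
    return 0;
}
```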
  31. Ramp up 10-100% • Once one machine didn't explode, the upgrade continued to 10% of the cluster. • Same statistical check as before (sketched below) • Nothing too fancy, just looked to see if some machines were more than some number of standard deviations above or below the others. • If we got scared we could undo.
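A sketch of that check: compute the cluster mean and standard deviation for one stat and flag machines that fall outside a fixed number of standard deviations. The threshold and sample values are made up:

```c
/* Outlier check: for one stat (say CPU), compute the cluster mean and
 * standard deviation and flag any machine more than K_SIGMA away. */
#include <math.h>
#include <stdio.h>

#define K_SIGMA 2.0

void flag_outliers(const double *vals, int n) {
    double sum = 0.0, sumsq = 0.0;
    for (int i = 0; i < n; i++) {
        sum += vals[i];
        sumsq += vals[i] * vals[i];
    }
    double mean = sum / n;
    double var = sumsq / n - mean * mean;
    double sd = var > 0.0 ? sqrt(var) : 0.0;

    for (int i = 0; i < n; i++) {
        if (sd > 0.0 && fabs(vals[i] - mean) > K_SIGMA * sd)
            printf("machine %d looks different: %.1f (mean %.1f, sd %.1f)\n",
                   i, vals[i], mean, sd);
    }
}

int main(void) {
    /* e.g. CPU% per machine, with the freshly upgraded box at index 3 */
    double cpu[] = {21.0, 23.0, 22.0, 55.0, 20.0, 24.0, 22.0, 23.0};
    flag_outliers(cpu, 8);
    return 0;
}
```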
  32. Big thanks to all at Right Media. Vince Tse https://github.com/thelazyenginerd Andy Chung https://twitter.com/whereandy And special thanks to my partners in crime: Kai Ju - MIA :-(