Release Often, Release Safely

Sergejus
October 07, 2011

Transcript

  1. Successful software workflow
      - You released software
      - You got customers
      - You got even more customers
      - Your software cannot go down
  2. Dilemma: Innovative or Stable?
      - Innovative
          - Frequent (bi-weekly) releases of new features
          - Higher risk of bugs and downtime
      - Stable
          - Higher uptime and better customer perception
          - Seasonal releases of new features
  3. We wanted both … to be innovative and agile while staying as stable as possible
  4. Stability in our terms
      - 99.999% uptime for serving ads
      - 2 datacenters + clouds
      - 500 M requests / day
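
To put the figures on this slide in perspective, here is a rough back-of-the-envelope sketch. It assumes a 365-day year and evenly spread traffic, neither of which the slides state, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(uptime_percent: float, days: float = 365.0) -> float:
    """Allowed downtime, in seconds, over `days` days at the given uptime percentage."""
    return (1 - uptime_percent / 100) * days * SECONDS_PER_DAY


def average_requests_per_second(requests_per_day: float) -> float:
    """Average request rate, assuming traffic is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


budget = downtime_budget_seconds(99.999)          # "five nines" from the slide
rate = average_requests_per_second(500_000_000)   # 500 M requests / day

print(f"Downtime budget: about {budget / 60:.2f} minutes per year")
print(f"Average load: about {rate:.0f} requests per second")
```

Under these assumptions the budget is a little over five minutes of downtime per year, at an average of several thousand requests per second. Real traffic is rarely even, so peak load would be higher.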
  5. Challenges we ha(d/ve)
      - Detect issues in production as soon as possible
      - Test new features in production while reducing the impact on customers
      - Roll out new features in a controlled manner
  6. Detect issues in production ASAP: monitoring
      - Choose the monitoring system carefully
          - It took us about 1 year (Zabbix)
          - First list all your possible monitoring use cases
      - Prepare your software for monitoring
          - Logging is a must-have!
          - Performance / SLA counters help to measure and understand the software better
      - Create a clear baseline to compare with after releases
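
The baseline idea above could be sketched roughly as follows. This is an illustration only, not the monitoring setup from the talk: the counter names, the sample values, and the 10% tolerance are all assumptions, and it treats every counter as "higher is worse".

```python
from statistics import mean


def regressions(baseline, current, tolerance=0.10):
    """Return counters whose post-release mean exceeds the baseline mean
    by more than `tolerance` (a fraction, e.g. 0.10 for 10%)."""
    flagged = {}
    for name, before in baseline.items():
        after = current.get(name)
        if not before or not after:
            continue  # nothing to compare for this counter
        base_avg, new_avg = mean(before), mean(after)
        if base_avg > 0 and (new_avg - base_avg) / base_avg > tolerance:
            flagged[name] = (base_avg, new_avg)
    return flagged


# Hypothetical samples collected before and after a release.
baseline = {"response_ms": [40, 42, 41], "errors_per_min": [1, 1, 2]}
after_release = {"response_ms": [41, 43, 42], "errors_per_min": [3, 4, 2]}

for counter, (before, after) in regressions(baseline, after_release).items():
    print(f"{counter}: baseline {before:.2f} -> after release {after:.2f}")
```

With these sample values only `errors_per_min` is flagged, since the response time moved by less than the tolerance.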
  7. Detect issues in production ASAP: automated functional tests
      - Designed to detect end-user issues, unlike unit and integration tests
      - Cover UI / business logic
      - Still not as many as we want (Selenium UI / C#)
      - Ongoing process of unifying automated QA tests
      - Run after each release and on a periodic basis
      - Very important if you have > 1 server
      - Huge time saver if tests are repetitive
  8. Though unit tests help in finding bugs during coding, they are more vital when software evolves!
  9. Test new features in production
      - Even an ideal staging environment is not equal to the production environment
      - Before starting to roll out a new feature, it is important to check its:
          - Resource consumption: CPU / RAM / HDD / IO / Network
          - Performance impact on existing functionality: response times / SLA
          - Stability: errors / memory leaks
  10. Test new features in production. Use Case #1: Safely roll out a new feature that integrates into the core data collection pipeline
  11. Test new features in production: dark releases
      - Work best with brand new features
      - Release the new feature to one or several servers
      - The new feature gets real load, but is not available to customers
      - Have an automated rollback package in case something goes wrong
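
A minimal sketch of the dark-release idea: the new code path receives real requests, but its result is discarded and its failures never reach the customer. The function names (`handle_request`, `serve_ad`, `topic_model`) are hypothetical and not taken from the talk.

```python
import logging

logger = logging.getLogger("dark_release")


def handle_request(request, current_logic, dark_logic=None):
    """Serve the request with the current logic; if a dark feature is
    deployed on this server, feed it the same request but ignore its
    result and swallow its errors."""
    response = current_logic(request)
    if dark_logic is not None:
        try:
            dark_logic(request)  # result intentionally discarded
        except Exception:
            # A failing dark feature must not affect the customer.
            logger.exception("dark feature failed for request %r", request)
    return response


def serve_ad(request):
    return f"ad for {request}"


def topic_model(request):
    # Stand-in for a new feature that is not yet stable.
    raise RuntimeError("not ready yet")


print(handle_request("user-1", serve_ad))               # server without the dark feature
print(handle_request("user-1", serve_ad, topic_model))  # same response, error only logged
```

In this sketch the dark path runs inline, so it still adds latency to the request. A production setup would more likely run it asynchronously or off a copy of the traffic.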
  12. Test new features in production: dark release notes from our release plan

      | Release Date | Release Type | Team | Project/Product | Release Notes |
      |---|---|---|---|---|
      | 2011.08.03 | Dark | RnD | Topic Modelling | Final part of the Topic Model Storage dark release. Changes to pullTransactions procedure on all Collect servers. Enabled for Danish, Swedish and English languages. |
      | 2011.08.02 | Dark | RnD | Topic Modelling | Part 2 of the Topic Model Storage dark release. Changes to pullTransactions procedure on Collect2 server. Enabled for Danish language only. |
      | 2011.08.01 | Dark | RnD | Topic Modelling | Part 1 of the Topic Model Storage dark release. SQL part of Administration and Collect servers (apart from pullTransactions procedure, this will be in part 2). Windows service part of Proc03 including integration with Amazon. |
  13. Test new features in production. Use Case #2: Safely migrate to the new SQL connection pooling mechanism
  14. Test new features in production: feature flags and switches
      - Work both for brand new features and updates
      - A feature can be switched on / off at any time
          - if (FeatureEnabled) then …
          - if (UseNewLogic) then … else …
      - Can affect existing customers
      - Possible to test each server one by one by switching the feature on / off
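
The `if (UseNewLogic) then … else …` pattern from this slide might look roughly like the sketch below. The `FeatureSwitches` class and `choose_pooling` function are hypothetical names used only for illustration. A real system would also need to persist the switches and secure access to them.

```python
import threading


class FeatureSwitches:
    """In-memory on/off switches that can be flipped at runtime."""

    def __init__(self, **initial):
        self._lock = threading.Lock()
        self._flags = {name: bool(value) for name, value in initial.items()}

    def is_enabled(self, name):
        with self._lock:
            return self._flags.get(name, False)  # unknown switches default to off

    def set(self, name, enabled):
        with self._lock:
            self._flags[name] = bool(enabled)

    def status(self):
        """Snapshot of all switches, e.g. for a simple status page."""
        with self._lock:
            return dict(self._flags)


def choose_pooling(switches):
    # Mirrors the slide's "if (UseNewLogic) then … else …" pattern.
    if switches.is_enabled("UseNewLogic"):
        return "new connection pooling"
    return "old connection pooling"


switches = FeatureSwitches(UseNewLogic=False)
print(choose_pooling(switches))    # old path while the switch is off
switches.set("UseNewLogic", True)
print(choose_pooling(switches))    # new path, no redeploy needed
print(switches.status())
```

Keeping the old path in the `else` branch is what makes switching back cheap: turning the switch off restores the previous behaviour without a rollback.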
  15. Test new features in production. Use Case #3: Safely migrate to the brand-new intelligent targeting subsystem
  16. Test new features in production: valves
      - Very similar to switches
      - A feature can get from 0% to 100% of real load
      - Very handy for gradually rolling out new features on each server one by one
      - So far they have helped us a lot, though they require extra development effort
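
One possible way to build a valve is sketched below. It hashes a request key into one of 100 buckets so that the same key always gets the same decision. The hashing scheme, the `Valve` class, and the key format are assumptions for illustration. The talk does not say how its valves split the load.

```python
import zlib


class Valve:
    """Routes a configurable percentage (0-100) of traffic to a new feature."""

    def __init__(self, percent=0):
        self.set_percent(percent)

    def set_percent(self, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent = percent

    def admits(self, key):
        """True if the request identified by `key` should use the new feature."""
        bucket = zlib.crc32(key.encode("utf-8")) % 100  # stable bucket in 0..99
        return bucket < self._percent


valve = Valve(percent=0)           # new feature receives no load
for step in (10, 50, 100):         # open the valve gradually
    valve.set_percent(step)
    admitted = sum(valve.admits(f"user-{i}") for i in range(1000))
    print(f"{step}% valve: {admitted} of 1000 sample keys use the new feature")
```

Because the bucket depends only on the key, opening the valve further only adds users to the new path and never moves an already admitted user back to the old one.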
  17. Test new features in production: caveats we had so far
      - Make sure you can turn features on / off without affecting connected users
      - Create a simple interface to display the current status of all switches and valves on each affected server
      - Secure access to switches and valves
  18. Controlling roll-out of a new feature
      - Switches and valves enable a very smooth and controlled roll-out
      - Partial roll-out to different datacenters / clouds
          - Different datacenters / clouds have different versions of the feature released
      - Redirect all traffic to the new or old version of the feature
  19. Controlling roll-out of a new feature: future research on application-level load balancing
      - A load balancer can act as a switch / valve without actually programming load distribution logic
      - Ability to automatically redirect users to the new version of the application while preserving the old one
  20. Summary
      - A monitoring system is very important, but your software should be prepared for it
      - Automated functional tests are functional monitoring of your software
      - Switches and valves are a very powerful concept for testing in production and for roll-outs, but require extra development and maintenance time
      - Dark releases and partial roll-outs are the most cost-effective safety mechanism