system carefully
- It took us about 1 year (Zabbix)
- First list all your possible monitoring use cases
- Prepare your software for monitoring
- Logging is a must-have!
- Performance / SLA counters help to measure and understand software better
- Create a clear baseline to compare with after releases
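One way to use a baseline after a release is to compare response times collected before and after it. The sketch below is only an illustration: the function name `regressed_against_baseline`, the 10% tolerance, and the sample timings are all assumptions, not values from our monitoring setup.

```python
from statistics import median


def regressed_against_baseline(baseline_ms, current_ms, tolerance=0.10):
    """Return True if the median response time grew by more than `tolerance`.

    `baseline_ms` and `current_ms` are lists of response times in milliseconds.
    The 10% default tolerance is an arbitrary, hypothetical threshold.
    """
    baseline_median = median(baseline_ms)
    current_median = median(current_ms)
    return current_median > baseline_median * (1 + tolerance)


# Hypothetical samples: response times before and after a release.
before_release = [120, 130, 125, 128, 122]
after_release = [150, 160, 155, 149, 158]

regressed = regressed_against_baseline(before_release, after_release)
```

The median is used here instead of the mean so that a single slow request does not trigger a false alarm; a real SLA check would likely look at percentiles as well.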
- Designed to detect end-user issues
- Different from unit and integration tests: UI / business logic
- Still not as many as we want (Selenium UI / C#)
- Ongoing process of unifying automated QA tests
- Run after each release and on a periodic basis
- Very important if you have > 1 server
- Huge time saver if tests are repetitive
is not equal to production environment
Before starting to roll out a new feature, it is important to check its:
- Resource consumption: CPU / RAM / HDD / IO / Network
- Performance impact on existing functionality: response times / SLA
- Stability: errors / memory leaks
best with brand new features
- Release the new feature to one or several servers
- The new feature gets real load, but is not available to customers
- Have an automated rollback package in case something goes wrong
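A dark release can be sketched as running the new code path next to the existing one and throwing its result away. This is a minimal illustration under assumptions: `handle_request`, `current_logic`, `dark_feature`, and `shadow_log` are hypothetical names, and this is not how our servers implement it.

```python
import logging

logger = logging.getLogger("dark_release")


def current_logic(request):
    """Stand-in for the existing, customer-visible behaviour."""
    return request.upper()


def handle_request(request, dark_feature=None, shadow_log=None):
    """Serve the request with the current logic.

    If a dark feature is supplied, run it on the same real input, record
    its output for later comparison, and never return it to the customer.
    """
    result = current_logic(request)
    if dark_feature is not None:
        try:
            shadow_result = dark_feature(request)
            if shadow_log is not None:
                shadow_log.append((request, result, shadow_result))
        except Exception:
            # A failing dark feature must not break the customer-facing path.
            logger.exception("dark feature failed")
    return result


def new_logic(request):
    """Hypothetical new feature under dark release."""
    return request.strip().upper()


def broken_logic(request):
    """Hypothetical faulty feature, used to show that failures are isolated."""
    raise RuntimeError("not ready yet")


log = []
visible = handle_request(" hi ", dark_feature=new_logic, shadow_log=log)
still_visible = handle_request(" hi ", dark_feature=broken_logic, shadow_log=log)
```

The recorded pairs of old and new results can be compared offline, while errors and resource usage of the new path show up in monitoring under real load.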
our release plan

Release Date | Release Type | Team | Project/Product | Release Notes
2011.08.03 | Dark | RnD | Topic Modelling | Final part of the Topic Model Storage dark release. Changes to pullTransactions procedure on all Collect servers. Enabled for Danish, Swedish and English languages.
2011.08.02 | Dark | RnD | Topic Modelling | Part 2 of the Topic Model Storage dark release. Changes to pullTransactions procedure on Collect2 server. Enabled for Danish language only.
2011.08.01 | Dark | RnD | Topic Modelling | Part 1 of the Topic Model Storage dark release. SQL part of Administration and Collect servers (apart from pullTransactions procedure, this will be in part 2). Windows service part of Proc03, including integration with Amazon.
- Works both for brand new features and updates
- Feature can be switched on / off at any time
    if (FeatureEnabled) then …
    if (UseNewLogic) then … else …
- Can affect existing customers
- Possible to test each server one by one by switching the feature on / off
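The `UseNewLogic` pattern above might look like this in runnable form. It is a sketch under assumptions: the `FeatureSwitches` class, its methods, and the `greet` function are hypothetical names invented for illustration, and the switches live in memory only.

```python
import threading


class FeatureSwitches:
    """In-memory on/off switches that can be flipped while the server runs."""

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._flags = dict(initial or {})

    def set(self, name, enabled):
        with self._lock:
            self._flags[name] = bool(enabled)

    def is_enabled(self, name):
        # Unknown switches default to off, so new code stays inactive
        # until someone turns it on explicitly.
        with self._lock:
            return self._flags.get(name, False)

    def status(self):
        """Snapshot of all switches, e.g. for a status page."""
        with self._lock:
            return dict(self._flags)


switches = FeatureSwitches({"UseNewLogic": False})


def greet(name):
    if switches.is_enabled("UseNewLogic"):
        return f"Hello, {name}!"   # new logic
    return f"Hi {name}"            # old logic, kept as the fallback


before = greet("Ann")
switches.set("UseNewLogic", True)
after = greet("Ann")
```

Because the old branch stays in the code, switching the flag back off acts as an immediate rollback for that server.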
to switches
- Feature can get from 0% to 100% of real load
- Very handy for gradually rolling out new features on each server one by one
- So far they have helped us a lot, though they require extra development effort
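A valve that lets through between 0% and 100% of real load can be sketched by hashing a stable key into one of 100 buckets. The hashing approach, the function names `bucket` and `valve_open`, and the `user-N` keys are assumptions for illustration; the slides do not say how the valves are actually implemented.

```python
import hashlib


def bucket(key):
    """Map a key (e.g. a user or request id) to a stable bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def valve_open(key, percent):
    """Return True if this key should receive the new feature.

    percent=0 lets nothing through, percent=100 lets everything through.
    """
    return bucket(key) < percent


users = [f"user-{i}" for i in range(1000)]
served_at_30 = [u for u in users if valve_open(u, 30)]
served_at_60 = [u for u in users if valve_open(u, 60)]
```

Hashing the key, rather than drawing a random number per request, keeps each user on the same side of the valve between requests, and anyone included at 30% remains included when the valve is opened to 60%.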
far
- Make sure you can turn features on / off without affecting connected users
- Create a simple interface to display the current status of all switches and valves on each affected server
- Secure access to switches and valves
very smooth and controlled roll-out
- Partial roll-out to different datacenters / clouds
- Different datacenters / clouds have different versions of the feature released
- Redirect all traffic to the new or old version of the feature
load balancing
- Load balancer can act as a switch / valve without actually programming load distribution logic
- Ability to automatically redirect users to the new version of the application while preserving the old one
should be prepared for this
- Automated functional tests are functional monitoring of your software
- Switches and valves are a very powerful concept for testing in production and roll-outs, but require extra development and maintenance time
- Dark releases and partial roll-outs are the most cost-effective safety mechanism