OpenStack HA
• The bible (Introduction to OpenStack High Availability)
• The manual process is not simple
• Automation tools alleviate this (partially)
• HA is not the same for all components
  • Active/Active
  • Active/Passive
• There is no single best way to do it

• An OpenStack service that provides HA for underlying components
• New service graduates from incubation
• Adding HA is a breeze
• Eventually… Hopefully… One day…
• Incubated project → Install component → OpenStack HA Service → Component is HA

Ready for an upgrade?
• Not a smooth process
• It is getting better (Nova improvements in Icehouse)
• Not always backward compatible
• Upgrade paths between older versions don’t always work
• It is not uncommon to see people running Cactus, Diablo, Essex, Folsom, Grizzly and Icehouse, all in one datacenter

Release Cycles and Why We Are Chasing Our Tails?
• Patches are provided for 2 previous releases
• Perhaps an LTS version in the future? (Red Hat is already going in that direction)
• Introduction of a new release → Testing → Deployment plan → Implementation → Stabilize
• And there is a new version every 6 months

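To make the patch window concrete, here is a minimal sketch that works out which releases still receive patches and how far behind a given deployment is. It assumes that "2 previous releases" means the two releases immediately before the newest one, which is only one reading of the slide. The release list is the alphabetical sequence from Austin through Icehouse; the function names are hypothetical.

```python
# Release names in chronological order (Austin through Icehouse).
RELEASES = [
    "Austin", "Bexar", "Cactus", "Diablo", "Essex",
    "Folsom", "Grizzly", "Havana", "Icehouse",
]


def patched_releases(current, window=2):
    """Return the `window` releases immediately before `current`.

    Assumption: these are the releases that still get patches.
    """
    idx = RELEASES.index(current)
    return RELEASES[max(0, idx - window):idx]


def releases_behind(running, current):
    """Number of releases between what is running and the newest one."""
    return RELEASES.index(current) - RELEASES.index(running)


if __name__ == "__main__":
    current = "Icehouse"
    print("Still patched:", patched_releases(current))
    for running in ("Cactus", "Essex", "Grizzly"):
        print(running, "is", releases_behind(running, current), "releases behind")
```

With a new version every 6 months, each release behind is roughly half a year of accumulated upgrade work.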
Services provided by you today.
• Backup
  • The management cluster should be relatively simple to rebuild – with automation
  • Tenants and their workloads: is this an issue?
• Replication
  • Not something that can be easily provided today (there are things in the works)
• DR
  • Nothing today.

Logging as an example
General rules for loglevels:
• Critical: Shit's on fire, yo. An expected, known issue where things will break, and badly.
• Error: Standard unexpected error trap. The final, top-level error trap should dump the message to ERROR. Also covers known error cases that someone should handle but that aren't necessarily "the world is exploding".
• Warn: Expected error conditions that might be an issue, but not huge problems. Example at session: Glance's error at startup that it can't find a storage device ID (currently logged as an error, should be a warning).
• Info: Standard operational logging, e.g. "VM request received, scheduled to launch on hypervisor X".
• Debug: What's going on under the hood, so you can trace down the origins of errors. Shouldn't have to be on by default.
• Trace: Super debug. Method-level logging, or otherwise extra-detailed info such as slightly sanitized API conversations.

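The level rules above map fairly directly onto Python's standard `logging` module, sketched below. Python has no built-in TRACE level, so the numeric value 5 used here is an assumption, as are the logger name, the helper names and the example messages (only the Info message comes from the slide). This is an illustration of the guideline, not how any OpenStack service actually configures logging.

```python
import logging

# Hypothetical custom level below DEBUG (10); the stdlib does not define TRACE.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")


def build_logger(name="demo", level=logging.INFO):
    """Create a logger that emits records at `level` and above."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


def emit_examples(logger):
    """One message per level, following the rules on the slide."""
    logger.critical("Message queue unreachable; no requests can be processed")
    logger.error("Unhandled exception while scheduling instance")
    logger.warning("Storage device ID not found at startup; continuing")
    logger.info("VM request received, scheduled to launch on hypervisor X")
    logger.debug("Scheduler filtered candidate hosts")
    logger.log(TRACE, "API request body (sanitized): {...}")


if __name__ == "__main__":
    # At INFO, the debug and trace lines are suppressed by default.
    emit_examples(build_logger("demo", logging.INFO))
```

Passing `level=TRACE` to `build_logger` turns everything on, which matches the rule that Debug and Trace should not have to be on by default.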
Services provided by you today.
• Auditing & Compliance
  • Who did what, and when
  • “detecting the tenants who added "allow all" rules to essentially turn off security groups”
• Can this workload run in this cluster?
• If not, then what?
  • Shut it down?
  • Move to correct location?
  • Notify the president?????

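Since this kind of audit is something you build yourself today, here is a rough sketch of the "allow all" check quoted above. It works on plain dictionaries rather than a live API. The field names are modeled on Neutron security group rule attributes and should be treated as assumptions, as should the function names and the definition of "allow all" used here (ingress, any protocol, any port, from any address).

```python
ANY_ADDRESS = ("0.0.0.0/0", "::/0")


def is_allow_all(rule):
    """True for an ingress rule open to every address, protocol and port.

    Assumed rule shape: a dict with direction, remote_ip_prefix, protocol,
    port_range_min and port_range_max keys; None means "any".
    """
    return (
        rule.get("direction") == "ingress"
        and rule.get("remote_ip_prefix") in ANY_ADDRESS
        and rule.get("protocol") is None
        and rule.get("port_range_min") is None
        and rule.get("port_range_max") is None
    )


def tenants_with_allow_all(rules):
    """Return the sorted tenant IDs that own at least one allow-all rule."""
    return sorted({r["tenant_id"] for r in rules if is_allow_all(r)})


if __name__ == "__main__":
    sample_rules = [
        {"tenant_id": "tenant-a", "direction": "ingress",
         "remote_ip_prefix": "0.0.0.0/0", "protocol": None,
         "port_range_min": None, "port_range_max": None},
        {"tenant_id": "tenant-b", "direction": "ingress",
         "remote_ip_prefix": "0.0.0.0/0", "protocol": "tcp",
         "port_range_min": 22, "port_range_max": 22},
    ]
    print(tenants_with_allow_all(sample_rules))
```

A rule that opens all TCP ports (1 to 65535) to the world is nearly as permissive and is not caught by this check; a real audit would need a broader definition.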
• There are several gaps that need to be addressed
• Great work is being done, but there is still more to accomplish
• It is all a question of how flexible you are willing to be, and how much responsibility you are willing to take upon yourself
• Not everything should (or can) run in OpenStack