How we operate an OpenStack public cloud

Venue: OpenStack Summit 2018, Berlin

One of OpenStack’s claims is that it can be used as the foundation for public clouds. Technically, this is true: you can manage users, users can start VMs, etc. – but what challenges do you face when you actually run a public cloud? In this session, we talk about the technical and human challenges we faced: how to continuously ship security updates to a cloud, how to continuously evolve a production platform, and, above all, how to avoid breaking our cloud services. We decided to radically automate the maintenance of our platform. Updates and reboots happen without human intervention, so engineers have time for more important things. Veteran operations engineers have learned that machines can monitor many more states at once and are therefore often more accurate than humans. Developers have learned that great things can happen when they take responsibility for a production platform.

We published the Unattended Reboot Manager on GitHub right before the talk: https://github.com/syseleven/rebootmgr


Steffen Neubauer

November 14, 2018

Transcript

  1. How we operate an OpenStack Public Cloud · Jan Peschke, Teamlead Cloud @ SysEleven GmbH · Steffen Neubauer, Cloud Architect @ SysEleven GmbH
  2. Meltdown Spectre L1 Terminal Fault

  3. Easy fix: Update. 1. Install update. 2. Reboot your computer. 3. Repeat. Right?
  4. Customer expectation: 1. Always available. 2. Secure.

  5. (Spoiler alert) 2500 reboots per year

  6. Never change a running system? - Because of security threats, we have to change it constantly! - Regular small changes are better than rare, huge changes.
  7. Humans are smarter than machines? - Humans make mistakes when doing repetitive tasks. - Example: when two servers are accidentally rebooted at once, storage goes down.
  8. How to reboot 2500 times?

  9. Always be aware of the cluster state! - Should-be state: Configuration management. - Actual state: Consul.
  10. Actual state vs. should-be state - Should-be state: configuration management
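The comparison on these slides (should-be state from configuration management vs. actual state reported by Consul) can be sketched with a small decision function. This is a minimal illustration under assumed names (`reboot_required`, plain dicts for both states), not rebootmgr's actual API:

```python
def reboot_required(desired: dict, actual: dict) -> bool:
    """A node needs a reboot when any reboot-relevant fact of the actual
    state (e.g. the running kernel) differs from what configuration
    management says it should be."""
    return any(actual.get(key) != value for key, value in desired.items())

# Example: config management rolled out a new kernel package,
# but the node is still running the old one.
desired = {"kernel": "4.15.0-42"}
actual = {"kernel": "4.15.0-39"}
print(reboot_required(desired, actual))  # True: this node should be rebooted
```

The point of the slide is that both sides of this comparison must come from systems of record (configuration management and Consul), not from a human's memory of what the cluster looks like.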

  11. 1. Every node regularly checks whether a reboot is necessary.
    2. If yes, it acquires a lock (so no other node can reboot at the same time).
    3. If successful, it asks Consul whether all cluster services are OK.
    4. Local tasks are executed (e.g. live-migrate VMs).
    5. A monitoring downtime is created (Consul maintenance).
    6. The reboot command is run.
    7. Once the node is back:
    8. The downtime is removed.
    9. Local tasks are run (e.g. re-enable nova-compute).
    10. Check that all services are OK again.
    11. Release the lock.
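The steps above can be sketched as a single pass of the reboot loop. The real implementation is rebootmgr (linked below); here, `FakeConsul` is an in-memory stand-in for Consul's lock and health APIs, and all names (`FakeConsul`, `maybe_reboot`, the callback parameters) are illustrative assumptions:

```python
class FakeConsul:
    """Stand-in for Consul: one cluster-wide reboot lock, a health view,
    and a set of nodes currently in maintenance (monitoring downtime)."""

    def __init__(self, healthy: bool = True):
        self.lock_holder = None
        self.healthy = healthy
        self.maintenance = set()

    def acquire_lock(self, node: str) -> bool:
        if self.lock_holder is None:
            self.lock_holder = node
            return True
        return False

    def release_lock(self, node: str) -> None:
        if self.lock_holder == node:
            self.lock_holder = None

    def cluster_healthy(self) -> bool:
        return self.healthy


def maybe_reboot(node, consul, needs_reboot, run_pre_tasks, reboot, run_post_tasks):
    """One pass of the workflow: check, lock, verify, reboot, verify, unlock.
    Returns True only if the node rebooted and the cluster is healthy again."""
    if not needs_reboot():                    # 1. is a reboot necessary?
        return False
    if not consul.acquire_lock(node):         # 2. take the cluster-wide lock
        return False                          #    (another node is rebooting)
    try:
        if not consul.cluster_healthy():      # 3. all cluster services OK?
            return False
        run_pre_tasks()                       # 4. e.g. live-migrate VMs away
        consul.maintenance.add(node)          # 5. create monitoring downtime
        reboot()                              # 6. run the reboot command
        consul.maintenance.discard(node)      # 7./8. node is back, clear downtime
        run_post_tasks()                      # 9. e.g. re-enable nova-compute
        return consul.cluster_healthy()       # 10. final health check
    finally:
        consul.release_lock(node)             # 11. always release the lock
```

The cluster-wide lock in step 2 is what prevents the failure mode from slide 7: two nodes can never reboot at the same time, so a second node simply skips its turn until the lock is free.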
  12.

  13. Open source since today: GitHub.com/syseleven/rebootmgr