How we operate an OpenStack public cloud

Venue: OpenStack Summit 2018, Berlin

One of OpenStack’s claims is that it can be used as the foundation for public clouds. Technically, this is true: you can manage users, users can start VMs, and so on – but what challenges do you face when you actually run a public cloud? In this session, we talk about the technical and human challenges we faced: how to continuously ship updates (including security updates) to a cloud, how to continuously evolve a production platform, and ultimately one question: how can we avoid breaking our cloud services?

We decided to radically automate the maintenance of our platform. Updates and reboots happen without human intervention, so engineers have time for more important things. Veteran operations engineers have learned that machines can monitor far more state at the same time and are therefore often more accurate than humans. Developers have learned that great things can happen when they take responsibility for a production platform.

We published the Unattended Reboot Manager on GitHub right before the talk: https://github.com/syseleven/rebootmgr

Steffen Neubauer

November 14, 2018

Transcript

  1. How we operate an OpenStack Public Cloud
     Jan Peschke · Teamlead Cloud @ SysEleven GmbH
     Steffen Neubauer · Cloud Architect @ SysEleven GmbH
  2. Never change a running system?
     - Because of security threats, we have to change it constantly!
     - Regular small changes are better than rare, huge changes
  3. Humans are smarter than machines?
     - Humans make mistakes when doing repetitive tasks
     - Example: when two servers are accidentally rebooted at the same time, storage goes down
  4. Always be aware of the cluster state!
     - Should-be state: configuration management
     - Actual state: Consul
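The "actual state" side of this can be read straight out of Consul's health checks. A minimal sketch of that decision, assuming check entries shaped like the response of Consul's `/v1/health/state/<state>` HTTP API (the helper itself is hypothetical, not part of rebootmgr):

```python
# Hypothetical helper: decide whether the cluster's actual state is
# healthy enough to proceed. The dict shape mirrors Consul's
# /v1/health/state/<state> HTTP API entries.

def cluster_is_healthy(checks):
    """Return True only if every health check reports 'passing'."""
    return all(check.get("Status") == "passing" for check in checks)

checks = [
    {"Node": "compute-01", "CheckID": "service:nova-compute", "Status": "passing"},
    {"Node": "storage-02", "CheckID": "service:ceph-osd", "Status": "critical"},
]
print(cluster_is_healthy(checks))  # False: storage-02 is critical
```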
  5. The reboot workflow:
     1. Every node regularly checks if a reboot is necessary
     2. If yes, it acquires a lock (so no other node can reboot)
     3. If successful, it asks Consul whether all cluster services are OK
     4. Local tasks are executed (e.g. live-migrate VMs)
     5. Create a monitoring downtime (Consul maintenance mode)
     6. Run the reboot command
     7. Once the node is back:
     8. Remove the downtime
     9. Run local tasks (e.g. re-enable nova-compute)
     10. Check if all services are OK again
     11. Release the lock
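The workflow above can be sketched as a single pass of a coordinator loop. This is an illustrative sketch only: `ClusterLock` is an in-memory stand-in for a Consul session lock, and the callables are hypothetical hooks, not the real rebootmgr API:

```python
# Sketch of the unattended-reboot workflow. ClusterLock stands in for
# a Consul session lock: at most one node holds it, so at most one
# node reboots at a time.

class ClusterLock:
    def __init__(self):
        self.holder = None

    def acquire(self, node):
        """Take the cluster-wide lock if it is free."""
        if self.holder is None:
            self.holder = node
            return True
        return False

    def release(self, node):
        if self.holder == node:
            self.holder = None

def reboot_node(node, lock, services_ok, pre_tasks, post_tasks, do_reboot):
    """Run one reboot cycle; return True on success."""
    if not lock.acquire(node):      # step 2: take the lock, or back off
        return False
    try:
        if not services_ok():       # step 3: all cluster services OK?
            return False
        pre_tasks()                 # steps 4-5: live-migrate VMs, set downtime
        do_reboot()                 # step 6: run the reboot command
        post_tasks()                # steps 8-9: remove downtime, re-enable nova-compute
        return services_ok()        # step 10: verify health again
    finally:
        lock.release(node)          # step 11: always release the lock
```

The `try/finally` matters: even if a health check or local task fails, the lock is released so the rest of the cluster is not blocked indefinitely.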