How we operate an OpenStack public cloud

Venue: OpenStack Summit 2018, Berlin

One of OpenStack’s claims is that it can be used as the foundation for public clouds. Technically, this is true: you can manage users, users can start VMs, etc. – but what challenges do you face when you actually run a public cloud? In this session, we talk about the technical and human challenges we faced: how to continuously ship security updates to a cloud, how to continuously evolve a production platform, and, above all, how to avoid breaking our cloud services. We decided to radically automate the maintenance of our platform. Updates and reboots happen without human intervention, so engineers have time for more important things. Veteran operations engineers have learned that machines can monitor many more states at once and are therefore often more accurate than humans. Developers have learned that great things can happen when they take responsibility for a production platform.

We published the Unattended Reboot Manager on GitHub right before the talk: https://github.com/syseleven/rebootmgr


Steffen Neubauer

November 14, 2018

Transcript

  1. How we operate an OpenStack Public Cloud · Jan Peschke, Teamlead Cloud @ SysEleven GmbH · Steffen Neubauer, Cloud Architect @ SysEleven GmbH
  2. Meltdown Spectre L1 Terminal Fault

  3. Easy fix: Update. 1. Install update. 2. Reboot your computer. 3. Repeat. Right?
  4. Customer expectation: 1. Always available. 2. Secure.

  5. (Spoiler alert) 2500 reboots per year

  6. Never change a running system? - Because of security threats, we have to change it constantly! - Regular small changes are better than rare, huge changes.
  7. Humans are smarter than machines? - Humans make mistakes when doing repetitive tasks. - Example: when two servers are accidentally rebooted at once, storage goes down.
  8. How to reboot 2500 times?

  9. Always be aware of the cluster state! - Should-be state: Configuration management. - Actual state: Consul.
  10. Actual state vs. should-be state - Should-be state: configuration management
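The comparison on these slides (should-be state from configuration management vs. actual state reported by Consul) can be sketched with a small decision function. This is a minimal illustration under assumed names (`reboot_required`, plain dicts for both states), not rebootmgr's actual API:

```python
def reboot_required(desired: dict, actual: dict) -> bool:
    """A node needs a reboot when any reboot-relevant fact of the actual
    state (e.g. the running kernel) differs from what configuration
    management says it should be."""
    return any(actual.get(key) != value for key, value in desired.items())

# Example: config management rolled out a new kernel package,
# but the node is still running the old one.
desired = {"kernel": "4.15.0-42"}
actual = {"kernel": "4.15.0-39"}
print(reboot_required(desired, actual))  # True: this node should be rebooted
```

The point of the slide is that both sides of this comparison must come from systems of record (configuration management and Consul), not from a human's memory of what the cluster looks like.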

  11. 1. Every node regularly checks whether a reboot is necessary.
    2. If yes, it acquires a lock (so no other node can reboot at the same time).
    3. If successful, it asks Consul whether all cluster services are OK.
    4. Local tasks are executed (e.g. live-migrate VMs).
    5. A monitoring downtime is created (Consul maintenance).
    6. The reboot command is run.
    7. Once the node is back:
    8. The downtime is removed.
    9. Local tasks are run (e.g. re-enable nova-compute).
    10. Check that all services are OK again.
    11. Release the lock.
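The steps above can be sketched as a single pass of the reboot loop. The real implementation is rebootmgr (linked below); here, `FakeConsul` is an in-memory stand-in for Consul's lock and health APIs, and all names (`FakeConsul`, `maybe_reboot`, the callback parameters) are illustrative assumptions:

```python
class FakeConsul:
    """Stand-in for Consul: one cluster-wide reboot lock, a health view,
    and a set of nodes currently in maintenance (monitoring downtime)."""

    def __init__(self, healthy: bool = True):
        self.lock_holder = None
        self.healthy = healthy
        self.maintenance = set()

    def acquire_lock(self, node: str) -> bool:
        if self.lock_holder is None:
            self.lock_holder = node
            return True
        return False

    def release_lock(self, node: str) -> None:
        if self.lock_holder == node:
            self.lock_holder = None

    def cluster_healthy(self) -> bool:
        return self.healthy


def maybe_reboot(node, consul, needs_reboot, run_pre_tasks, reboot, run_post_tasks):
    """One pass of the workflow: check, lock, verify, reboot, verify, unlock.
    Returns True only if the node rebooted and the cluster is healthy again."""
    if not needs_reboot():                    # 1. is a reboot necessary?
        return False
    if not consul.acquire_lock(node):         # 2. take the cluster-wide lock
        return False                          #    (another node is rebooting)
    try:
        if not consul.cluster_healthy():      # 3. all cluster services OK?
            return False
        run_pre_tasks()                       # 4. e.g. live-migrate VMs away
        consul.maintenance.add(node)          # 5. create monitoring downtime
        reboot()                              # 6. run the reboot command
        consul.maintenance.discard(node)      # 7./8. node is back, clear downtime
        run_post_tasks()                      # 9. e.g. re-enable nova-compute
        return consul.cluster_healthy()       # 10. final health check
    finally:
        consul.release_lock(node)             # 11. always release the lock
```

The cluster-wide lock in step 2 is what prevents the failure mode from slide 7: two nodes can never reboot at the same time, so a second node simply skips its turn until the lock is free.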
  12.

  13. Open source since today: GitHub.com/syseleven/rebootmgr